Update README.md
README.md
---
----

NLP/ASR multimodal model with f0-modulated relative positional embeddings.

For research/testing.

Moving beyond literal transcription toward something more artistic and creative.
On the road to building a "creative" speech-to-text model that can:

- Generate stories from audio
- Make poetic associations
- Fill in gaps with imagination
- Create richer, more expressive text
- Separate the sad Morrissey songs from the two that aren't

----
Questions:

- How can we make attention mechanisms aware of speech-specific properties?
- Does pitch conditioning improve speech recognition?

Standard RoPE was designed for text: text doesn't have pitch, timing, or acoustic properties.
xPos imposes an artificial, generic decay; here its place is taken by something more meaningful.

---
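For comparison, the standard text-RoPE frequency ladder that the snippet below replaces can be sketched in plain Python (a minimal sketch: `dim = 8` and the helper name are mine, chosen only for illustration):

```python
def rope_inv_freqs(dim: int, theta: float = 10000.0) -> list[float]:
    # Standard RoPE: inverse frequencies 1 / theta**(2i / dim), i = 0 .. dim/2 - 1.
    # theta = 10000 is the conventional text default.
    return [1.0 / theta ** (2.0 * i / dim) for i in range(dim // 2)]

freqs = rope_inv_freqs(8)
# freqs ≈ [1.0, 0.1, 0.01, 0.001]: a fixed geometric ladder,
# independent of pitch, timing, or any other acoustic property.
```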
```python
if f0 is not None:
    ...
else:
    theta = self.theta

# In text, theta = 10,000 sets the base frequency for positional encoding,
# ensuring a wide range of periodicities for long sequences. I'm not sure
# whether the specific number 10k was experimentally derived.
# For audio, especially speech, the relevant periodicities are determined by
# pitch, so conditioning on f0 (a neighborhood, or f0 per frame) might be
# more meaningful.

freqs = (theta.unsqueeze(-1) / 220.0) * 700 * (
    torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)),
                                 self.dim // 2, device=theta.device, dtype=theta.dtype) / 2595) - 1) / 1000

# This seems to give better results than the standard
# freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
# I thought a mel-scale version might be more perceptually meaningful for
# audio, i.e. using the mel scale as a perceptually relevant distance metric
# instead of a Euclidean one.

t = torch.arange(ctx, device=device, dtype=dtype)
freqs = t[:, None] * freqs  # don't repeat or use some other method here

# ...

radius = torch.ones_like(freqs)
freqs = torch.polar(radius, freqs)
```
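A rough plain-Python mirror of the mel-spaced schedule above may make the constants clearer (the helper names `mel_to_hz` and `mel_freqs` are mine, not from the repo): points spaced linearly in mel space are mapped back to Hz via the 700/2595 constants, scaled by `(theta / 220) / 1000`, and then turned into a unit-circle rotation, which is what `torch.polar` produces when the radius is one.

```python
import cmath
import math

def mel_to_hz(m: float) -> float:
    # Inverse of mel(f) = 2595 * log10(1 + f / 700) — the snippet's constants.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_freqs(dim: int, theta: float = 220.0, fmax: float = 8000.0) -> list[float]:
    # dim // 2 points spaced linearly in mel space from 0 to mel(fmax),
    # mapped back to Hz, then scaled by (theta / 220) / 1000 as in the snippet.
    top_mel = 2595.0 * math.log10(1.0 + fmax / 700.0)
    half = dim // 2
    mels = [top_mel * i / (half - 1) for i in range(half)]
    return [(theta / 220.0) * mel_to_hz(m) / 1000.0 for m in mels]

freqs = mel_freqs(8)         # spacing in Hz widens toward the top of the range
angle = 2 * freqs[1]         # position t = 2, second frequency band
rot = cmath.exp(1j * angle)  # unit-radius rotation, like torch.polar(1.0, angle)
```

Because the spacing is linear in mels rather than in Hz, neighboring bands are packed more densely at low frequencies, where hearing is more discriminating.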
A closer look at what's going on. Here is a slice of the actual radius values for one step: