Sin2pi/asr-model

@@ -20,19 +20,28 @@ tags:
 - new
 ---
-NLP/ASR multimodal modal with f0 modulated relative positional embeddings.
-For research/testing.
-Why?
-Because a significant portion of current AI research is focused on optimizing existing methods instead of exploring new approaches.
-<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb"  />
-(librispeech - clean).
 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
@@ -67,7 +76,7 @@ theta = f0_mean + self.theta
 freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2) / 2595) - 1) / 1000
 ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
 ## I thought a mel-scale version might be more perceptually meaningful for audio..
-## Using mel-scale to create a perceptually-relevant distance metric.
 freqs = t[:, None] * freqs[None, :] # don't repeat or use some other method here
@@ -138,17 +147,12 @@ Narrow bands: More focus on nearby positions (good for local patterns)
 <img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1"  />
 ----
 This model sometimes uses:
 https://github.com/sine2pi/Maxfactor
-`MaxFactor` is a custom PyTorch optimizer with adaptive learning rates and specialized handling for matrix parameters. I wrote it for the model in the asr_model repository.
-I needed something that performs well and has a light memory foot print since I do everything from my laptop.
-----

 - new
 ---
+NLP/ASR multimodal model with f0-modulated relative positional embeddings.
+For research/testing.
+----
+Questions:
+   - How can we make attention mechanisms aware of speech-specific properties?
+   - Can we incorporate acoustic information directly into positional encodings?
+   - Does pitch conditioning improve speech recognition?
+   Standard RoPE was designed for text, which has no pitch, timing, or acoustic properties.
+----
+<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/106ebe75-f1db-4f85-bdae-818b114fedd2"  />
+This plot illustrates the pattern similarity between the pitch waveform and the spectrogram (LibriSpeech, clean).
 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
 freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2) / 2595) - 1) / 1000
 ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
 ## I thought a mel-scale version might be more perceptually meaningful for audio..
+## Using mel-scale to create a perceptually-relevant distance metric instead of Euclidean distance.
 freqs = t[:, None] * freqs[None, :] # don't repeat or use some other method here
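
The mel-scale frequency line above can be unpacked into a small dependency-free sketch. This is only an illustration of the arithmetic, not the model's code: it uses plain Python instead of `torch` so it runs anywhere, and the function names (`mel_scaled_freqs`, `standard_rope_freqs`, `rotation_angles`) and the `f_max` parameter are hypothetical names introduced here. The constants 220, 700, 2595, 8000 and 1000 are taken from the line above.

```python
import math


def mel_scaled_freqs(theta: float, dim: int, f_max: float = 8000.0) -> list[float]:
    """Rotary frequencies spaced evenly on the mel scale, scaled by theta / 220."""
    n = dim // 2
    # Mel value of f_max (same expression as 2595 * log10(1 + 8000/700)).
    mel_max = 2595.0 * math.log10(1.0 + f_max / 700.0)
    freqs = []
    for i in range(n):
        # Evenly spaced mel points from 0 to mel_max, like torch.linspace(0, mel_max, n).
        mel = mel_max * i / (n - 1) if n > 1 else 0.0
        # Convert mel back to Hz: 700 * (10^(mel / 2595) - 1).
        hz = 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
        freqs.append((theta / 220.0) * hz / 1000.0)
    return freqs


def standard_rope_freqs(theta: float, dim: int) -> list[float]:
    """Standard RoPE inverse frequencies, 1 / theta^(i / dim) for even i."""
    return [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]


def rotation_angles(positions: list[float], freqs: list[float]) -> list[list[float]]:
    """Outer product of positions and frequencies (t[:, None] * freqs[None, :])."""
    return [[t * f for f in freqs] for t in positions]


if __name__ == "__main__":
    # theta = 220 makes the (theta / 220) scale factor equal to 1.
    mel = mel_scaled_freqs(theta=220.0, dim=8)
    std = standard_rope_freqs(theta=10000.0, dim=8)
    print("mel-scaled:", [round(f, 4) for f in mel])
    print("standard:  ", [round(f, 4) for f in std])
    print("angles:    ", rotation_angles([0.0, 1.0, 2.0], mel))
```

Two things stand out in this sketch. The mel-scaled frequencies increase with the dimension index, while the standard RoPE frequencies decrease. And because the mel grid starts at 0, the first frequency is 0, so the first dimension pair is not rotated at any position.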
 <img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1"  />
 ----
+https://huggingface.co/Sin2pi/Echo17/tensorboard?params=scalars
+----
 This model sometimes uses:
 https://github.com/sine2pi/Maxfactor
+MaxFactor is a custom PyTorch optimizer with adaptive learning rates and specialized handling for matrix parameters. I wrote it for the model in the asr_model repository because I needed something that performs well and has a light memory footprint, since I do everything on my laptop.