Sin2pi/asr-model

@@ -21,7 +21,9 @@ tags:
 ---
-ASR model that uses audio frequencies instead of spectrograms. + pitch aware relative positional embeddings.
 Questions:
@@ -35,8 +37,6 @@ Questions:
-<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/106ebe75-f1db-4f85-bdae-818b114fedd2" />
 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch based enhancements:
@@ -44,6 +44,16 @@ To explore the relationship between pitch and rotary embeddings, the model imple
 2. Direct similarity bias: A pitch based similarity bias is added directly to the attention mechanism.
 3. Variable radii in torch.polar: The unit circle radius 1.0 in the torch.polar calculation is replaced with variable radii derived from f0. This creates acoustically-weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase and amplitude information without significant computational overhead.
 The function `torch.polar` constructs a complex tensor from polar coordinates:
 ````python
@@ -296,3 +306,5 @@ MaxFactor is a custom PyTorch optimizer with adaptive learning rates and special
 ** this model deviates in a lot of ways from standard transformer models.

 ---
+ASR model that uses audio frequencies instead of spectrograms, plus pitch-aware relative positional embeddings.
 Questions:
 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch based enhancements:
 2. Direct similarity bias: A pitch based similarity bias is added directly to the attention mechanism.
 3. Variable radii in torch.polar: The unit circle radius 1.0 in the torch.polar calculation is replaced with variable radii derived from f0. This creates acoustically-weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase and amplitude information without significant computational overhead.
+4. Initial findings suggest that f0 is a superior input to spectrograms for ASR models, cutting training time almost in half without negatively affecting other metrics such as WER or loss.
+<img width="1816" height="707" alt="pitchee2" src="https://github.com/user-attachments/assets/17c89ebf-3373-4dd5-b510-95fa96774ec1" />
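The variable-radius idea from item 3 can be illustrated with a small standard-library sketch. It is not the model's code: `polar` mirrors what `torch.polar(abs, angle)` computes elementwise (`abs * cos(angle) + abs * sin(angle) * 1j`), while `f0_to_radius`, `rotary_factors`, and `inv_freq` are hypothetical names, and the peak-normalized mapping from f0 to radius is an assumption made only for illustration.

```python
import cmath


def polar(radius, angle):
    """Elementwise radius * (cos(angle) + 1j * sin(angle)).

    This is the same quantity torch.polar(abs, angle) returns,
    written with the standard library so the sketch has no dependencies.
    """
    return [cmath.rect(r, a) for r, a in zip(radius, angle)]


def f0_to_radius(f0_hz):
    """Map per-frame f0 (Hz) to a radius.

    ASSUMPTION: voiced frames are scaled by the peak f0 of the utterance;
    unvoiced frames (f0 == 0) fall back to the unit radius 1.0.
    The actual mapping used by the model may differ.
    """
    peak = max(f0_hz)
    if peak <= 0:
        return [1.0] * len(f0_hz)
    return [f / peak if f > 0 else 1.0 for f in f0_hz]


def rotary_factors(f0_hz, inv_freq=1.0):
    """Complex rotation factors for one frequency band.

    The angle carries the position (phase); the radius carries the
    pitch-derived weight (amplitude). inv_freq is a hypothetical
    per-band rotation rate.
    """
    radius = f0_to_radius(f0_hz)
    angle = [pos * inv_freq for pos in range(len(f0_hz))]
    return polar(radius, angle)


f0 = [0.0, 110.0, 220.0, 165.0]  # made-up pitch track, one value per frame
factors = rotary_factors(f0, inv_freq=0.5)
```

With a fixed radius of 1.0 every factor lies on the unit circle and encodes position only. Letting the radius follow f0 keeps the same phase per position but scales the magnitude, which is how the phase and amplitude information described above enters the embedding.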
 The function `torch.polar` constructs a complex tensor from polar coordinates:
 ````python
 ** this model deviates in a lot of ways from standard transformer models.