Update README.md
README.md
CHANGED
@@ -21,7 +21,9 @@ tags:
 
 ---
 
-ASR model that uses audio frequencies instead of spectrograms
+ASR model that uses audio frequencies instead of spectrograms, plus pitch-aware relative positional embeddings.
+
+
 
 Questions:
 
@@ -35,8 +37,6 @@ Questions:
 
 
 
-<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/106ebe75-f1db-4f85-bdae-818b114fedd2" >
-
 
 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
 
@@ -44,6 +44,16 @@ To explore the relationship between pitch and rotary embeddings, the model imple
 2. Direct similarity bias: A pitch-based similarity bias is added directly to the attention mechanism.
 3. Variable radii in torch.polar: The unit circle radius 1.0 in the torch.polar calculation is replaced with variable radii derived from f0. This creates acoustically weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase and amplitude information without significant computational overhead.
 
+4. Initial findings suggest that f0 is a superior input to spectrograms for ASR models, cutting training time almost in half without negatively affecting other metrics such as WER or loss.
+
+
+<img width="1816" height="707" alt="pitchee2" src="https://github.com/user-attachments/assets/17c89ebf-3373-4dd5-b510-95fa96774ec1" />
+
+
+
+
+
+
 The function `torch.polar` constructs a complex tensor from polar coordinates:
 
 ````python
@@ -296,3 +306,5 @@ MaxFactor is a custom PyTorch optimizer with adaptive learning rates and special
 
 ** this model deviates in a lot of ways from standard transformer models.
 
+
+
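
The variable-radius idea in item 3 of the README text can be illustrated with a minimal sketch. This is not the repository's implementation. It uses the standard library's `cmath.rect(r, phi)`, which computes the same `r * (cos(phi) + i*sin(phi))` that `torch.polar(abs, angle)` computes elementwise on tensors. The function names, the rotary base of 10000, and the mapping from f0 to radius (normalise by the peak, keep 1.0 for unvoiced frames) are all assumptions made for illustration.

```python
import cmath


def rotary_angles(num_positions, dim, base=10000.0):
    """Rotary angles: position times an inverse frequency, one per channel pair."""
    half = dim // 2
    inv_freq = [base ** (-2.0 * i / dim) for i in range(half)]
    return [[pos * freq for freq in inv_freq] for pos in range(num_positions)]


def pitch_weighted_rotations(f0, dim, base=10000.0):
    """Complex rotations whose magnitude follows f0 instead of staying at 1.0.

    Hypothetical mapping: radius = f0 / max(f0) for voiced frames, and 1.0
    (the usual unit circle) for unvoiced frames where f0 is 0.
    """
    angles = rotary_angles(len(f0), dim, base)
    peak = max(f0)
    radii = [f / peak if f > 0 else 1.0 for f in f0]
    # cmath.rect(r, phi) = r * (cos(phi) + 1j * sin(phi)), the scalar
    # counterpart of torch.polar(abs, angle).
    return [[cmath.rect(r, a) for a in row] for r, row in zip(radii, angles)]


f0 = [0.0, 110.0, 220.0, 165.0]  # made-up pitch track in Hz, one value per frame
rot = pitch_weighted_rotations(f0, dim=4)

# Print (magnitude, phase) per position: the phase is the usual rotary angle,
# the magnitude is the pitch-derived radius.
for pos, row in enumerate(rot):
    print(pos, [(round(abs(z), 3), round(cmath.phase(z), 3)) for z in row])
```

Scaling the radius leaves every angle unchanged, so the positional phase is preserved and only the magnitude carries the pitch information.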