Sin2pi committed on
Commit 0fd1326 · verified · 1 Parent(s): 5c4028e

Update README.md

Files changed (1)
  1. README.md +15 -3
README.md CHANGED
@@ -21,7 +21,9 @@ tags:
 
  ---
 
- ASR model that uses audio frequencies instead of spectrograms. + pitch aware relative positional embeddings.
+ ASR model that uses audio frequencies instead of spectrograms + pitch aware relative positional embeddings.
+
+
 
  Questions:
 
@@ -35,8 +37,6 @@ Questions:
 
 
 
- <img width="780" alt="cc5" src="https://github.com/user-attachments/assets/106ebe75-f1db-4f85-bdae-818b114fedd2" >
-
 
  To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch based enhancements:
 
@@ -44,6 +44,16 @@ To explore the relationship between pitch and rotary embeddings, the model imple
  2. Direct similarity bias: A pitch based similarity bias is added directly to the attention mechanism.
  3. Variable radii in torch.polar: The unit circle radius 1.0 in the torch.polar calculation is replaced with variable radii derived from f0. This creates acoustically-weighted positional encodings, so each position in the embedding space reflects the acoustic prominence in the original speech. This approach effectively adds phase and amplitude information without significant computational overhead.
 
+ 4. Initial findings suggest that f0 is a superior input to spectrograms for ASR models, cutting training time almost in half without negatively affecting other metrics such as WER or loss.
+
+
+ <img width="1816" height="707" alt="pitchee2" src="https://github.com/user-attachments/assets/17c89ebf-3373-4dd5-b510-95fa96774ec1" />
+
+
+
+
+
+
  The function `torch.polar` constructs a complex tensor from polar coordinates:
 
  ````python
@@ -296,3 +306,5 @@ MaxFactor is a custom PyTorch optimizer with adaptive learning rates and special
 
  ** this model deviates in a lot of ways from standard transformer models.
 
+
+
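
Item 3 in the README text above describes replacing the unit radius in the `torch.polar` call with radii derived from f0. The following is a minimal sketch of that idea, not the model's actual code. It uses `cmath.rect` from the standard library, which computes `r * (cos(phi) + i*sin(phi))` for a single value, the same formula `torch.polar(abs, angle)` applies elementwise, so the example runs without PyTorch. The helper names `rotary_factors` and `radii_from_f0`, the f0 values, and the mapping from f0 to radius (f0 divided by the mean voiced f0, with unvoiced frames kept at 1.0) are all assumptions made for illustration; the README does not state how the radii are computed.

```python
import cmath
import math


def rotary_factors(num_positions, inv_freq, radii=None):
    """Return one complex factor per position: radius * exp(i * pos * inv_freq).

    With radii=None every factor lies on the unit circle (standard rotary case).
    """
    if radii is None:
        radii = [1.0] * num_positions
    # cmath.rect(r, phi) builds a complex number from polar coordinates.
    return [cmath.rect(radii[pos], pos * inv_freq) for pos in range(num_positions)]


def radii_from_f0(f0_hz):
    """Hypothetical f0-to-radius mapping (an assumption, not taken from the README).

    Voiced frames get f0 / mean voiced f0; unvoiced frames (f0 == 0) keep radius 1.0.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return [1.0] * len(f0_hz)
    mean_f0 = sum(voiced) / len(voiced)
    return [f / mean_f0 if f > 0 else 1.0 for f in f0_hz]


# Illustrative per-frame pitch track in Hz; 0.0 marks an unvoiced frame.
f0_hz = [0.0, 110.0, 220.0, 330.0]
radii = radii_from_f0(f0_hz)

unit = rotary_factors(len(f0_hz), inv_freq=0.5)
scaled = rotary_factors(len(f0_hz), inv_freq=0.5, radii=radii)

for pos, (u, s) in enumerate(zip(unit, scaled)):
    print(pos, round(abs(u), 3), round(abs(s), 3), round(cmath.phase(s), 3))
```

In this sketch the rotation angle at each position is unchanged; only the magnitude varies with pitch, which is one way to read the README's claim that the variable radius adds amplitude information on top of the phase.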