Update README.md
---
## Echo - NLP/ASR model with an acoustic variable-radii relative position embedding (vRoPE) that maps pitch to tokens. And some other stuff...
https://github.com/sine2pi/asr_model_echo

To highlight the relationship between pitch and rotary embeddings, echo implements:

2. The second adds a direct similarity bias to attention.
3. Variable radii replace the unit-circle radius (1.0) used by torch.polar. The fundamental frequencies (f0) are time-aligned with tokens, creating acoustically weighted positional encodings in which the "loudness" of each position in the embedding space reflects its acoustic prominence in the original speech.
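Item 3 can be sketched in a few lines. This is an illustrative toy, not Echo's actual code: the helper name `vrope_freqs`, the max-based f0 normalization, and all tensor shapes are assumptions. It only shows the core move of passing a per-position radius, instead of all-ones, as the magnitude argument of `torch.polar`.

```python
import torch

def vrope_freqs(angles: torch.Tensor, f0: torch.Tensor) -> torch.Tensor:
    """angles: (seq_len, dim//2) rotation angles; f0: (seq_len,) pitch per position.

    Standard RoPE would call torch.polar(torch.ones_like(angles), angles), placing
    every position on the unit circle. Here the radius comes from a normalized f0
    contour, so acoustically prominent positions get larger complex magnitude.
    """
    radius = f0 / (f0.max() + 1e-8)              # hypothetical normalization to [0, 1]
    radius = radius.unsqueeze(-1).expand_as(angles)
    return torch.polar(radius, angles)            # complex tensor: radius * exp(i*angle)

seq_len, half_dim = 6, 4
pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(-1)
inv_freq = 1.0 / (10000 ** (torch.arange(half_dim) / half_dim))
angles = pos * inv_freq                           # standard RoPE angle table
f0 = torch.tensor([120.0, 180.0, 0.0, 220.0, 150.0, 90.0])  # toy pitch track in Hz

freqs = vrope_freqs(angles, f0)
print(freqs.abs()[:, 0])  # magnitudes follow the pitch contour instead of all 1.0
```

An unvoiced frame (f0 = 0) simply collapses to zero magnitude in this toy normalization; a real implementation would need a floor or interpolation there.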
1000 steps, no f0:

<img width="470" alt="123" src="https://github.com/user-attachments/assets/1b3ca1e8-0b7d-47dd-802b-5eda9537ae13" />

1000 steps with f0 / theta substitutions:
<img width="470" alt="321" src="https://github.com/user-attachments/assets/24a68910-b316-4cfc-8927-5c6fd846b919" />

By modulating the RoPE frequencies based on pitch (F0), we are essentially telling the model how acoustic features relate to sequence position, in proportion to the characteristics of the voice. This creates a more speech-aware positional representation that helps the model better understand the relationship between acoustic features and text.
The patterns below show how positions "see" each other in relation to theta and f0.

The theoretical foundation:

- Varying the rotation frequency based on pitch creates a more speech-aware positional encoding
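That bullet can be checked numerically with plain toy math (not Echo's code; `theta` here is the RoPE base, and using `220.0` as a pitch-derived base is an assumption for illustration). The angle difference between two positions depends only on their offset, which is what makes the encoding relative; substituting a pitch-derived value for theta changes how fast neighbouring positions rotate apart without breaking that property.

```python
import math

# Toy RoPE angle for one feature pair: pos / theta ** (pair_index / half_dim).
def rope_angle(pos: float, pair: int, theta: float, half_dim: int = 4) -> float:
    return pos / theta ** (pair / half_dim)

# Relative property: angle(7) - angle(5) equals angle(2) - angle(0) for any theta.
for theta in (10000.0, 220.0):  # 220.0 stands in for a pitch-derived base
    d1 = rope_angle(7, 1, theta) - rope_angle(5, 1, theta)
    d2 = rope_angle(2, 1, theta) - rope_angle(0, 1, theta)
    assert math.isclose(d1, d2)

# A smaller (pitch-derived) theta rotates adjacent positions apart faster:
assert rope_angle(1, 1, 220.0) > rope_angle(1, 1, 10000.0)
```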
---
### Diagnostic test run with google/fleurs - Spectrogram + f0_rotary:
<img width="689" alt="graph" src="https://github.com/user-attachments/assets/c161a89d-539c-4983-8d24-12ec41ebc859" />
<img width="277" alt="321" src="https://github.com/user-attachments/assets/4cc71b43-3e48-4241-b381-5bda17ed9d0d" />
## The F0-Conditioned Rotation Mechanism