NLP/ASR multimodal model with f0-modulated relative positional embeddings. For research/testing.

<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />

This plot illustrates the pattern similarity between pitch and spectrogram (LibriSpeech, clean).

To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:

1. Pitch-modulated theta: Pitch (f0) is used to modify the theta parameter, dynamically adjusting the rotary frequency.

2. Direct similarity bias: A pitch-based similarity bias is added directly to the attention mechanism.

3. Variable radii in torch.polar: The unit-circle radius (1.0) in the torch.polar calculation is replaced with variable radii derived from f0. This creates acoustically weighted positional encodings, so each position in the embedding space reflects the acoustic prominence of the original speech. This approach effectively adds phase and amplitude information without significant computational overhead.
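A minimal sketch of how the first two enhancements could look. All function names, normalizations, and scaling factors here are hypothetical illustrations, not the repo's actual API:

```python
import torch

def f0_modulated_theta(f0, base_theta=10000.0, alpha=0.5):
    # Hypothetical: scale the rotary base theta per position by normalized pitch,
    # so higher-pitched frames rotate at a different rate than average frames.
    f0_norm = f0 / (f0.mean() + 1e-8)              # ~1.0 for average pitch
    return base_theta * (1.0 + alpha * (f0_norm - 1.0))

def pitch_similarity_bias(f0, scale=1.0):
    # Hypothetical: bias attention scores toward positions with similar pitch.
    diff = (f0.unsqueeze(0) - f0.unsqueeze(1)).abs()  # pairwise |f0_i - f0_j|
    return -scale * diff                              # added pre-softmax
```

A constant pitch track leaves the base theta unchanged and produces a zero bias, which makes the modulation easy to sanity-check.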

The function `torch.polar` constructs a complex tensor from polar coordinates:

```python
result = magnitude * (torch.cos(angle) + 1j * torch.sin(angle))
```

So, for each element:

- magnitude is the modulus (radius, r)
- angle is the phase (theta, in radians)
- The result is: `r * exp(i * theta) = r * (cos(theta) + i * sin(theta))`

Reference: [PyTorch Documentation - torch.polar](https://pytorch.org/docs/stable/generated/torch.polar.html)
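A quick numeric check of this identity, with values chosen purely for illustration:

```python
import torch

# torch.polar(abs, angle) returns abs * (cos(angle) + 1j * sin(angle)).
r = torch.tensor([2.0])               # modulus (radius)
theta = torch.tensor([torch.pi / 2])  # phase in radians
z = torch.polar(r, theta)             # 2 * e^{i*pi/2} = 2i (up to float error)
```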

In the model, the variable radius enters through this same call:

```python
freqs = torch.polar(radius, freqs)
```

Approximation methods such as cos/sin projections or fixed rotation matrices typically assume a unit circle (radius = 1.0): they only rotate, never scale. When a variable radius (amplitude modulation) is introduced, those approximations break down, since they can represent the rotation but not the scaling. With a variable radius, true complex multiplication is required for correct results; approximations that ignore the radius, or apply the scaling after the rotation, do not capture the intended effect and lead to degraded or incorrect representations.
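This claim can be checked on toy values (an illustration, not the repo's code): with a radius other than 1.0, a rotation-only approximation diverges from true complex multiplication exactly where the radius departs from the unit circle.

```python
import torch

x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
xc = torch.view_as_complex(x.contiguous())         # positions as complex pairs
angle = torch.linspace(0.0, 1.0, 4)                # rotation angles
radius = torch.tensor([0.5, 1.0, 1.5, 2.0])        # variable (f0-style) radii

exact = xc * torch.polar(radius, angle)            # rotates AND scales
rot_only = xc * torch.polar(torch.ones(4), angle)  # unit-circle approximation

# The two agree only at the position where radius == 1.0.
```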

```python
### Do not approximate:
```

Each figure shows 4 subplots (one for each of the first 4 dimensions of your embedding).

In each subplot:

- Thick solid lines: Standard RoPE rotations for even dimensions (no F0 adaptation)
- Thick dashed lines: Standard RoPE rotations for odd dimensions (no F0 adaptation)
- Thin solid lines: F0 RoPE rotations for even dimensions
- Thin dashed lines: F0 RoPE rotations for odd dimensions

1. Differences between thick and thin lines: This shows how much the F0 information modifies the standard position encodings. Larger differences indicate stronger F0 adaptation.

2. Pattern changes: The standard RoPE (thick lines) shows regular sinusoidal patterns, while the F0 RoPE (thin lines) shows variations that correspond to the audio's pitch contour.

3. Dimension-specific effects: Compare across the four subplots to see whether F0 affects different dimensions differently.

4. Position-specific variations: In standard RoPE, frequency decreases with dimension index, but F0 adaptation modifies this pattern.

The patterns below show how positions "see" each other in relation to theta and f0.
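These pairwise patterns can be sketched numerically; the pitch track and prominence weighting below are toy stand-ins for the real pipeline:

```python
import torch

seq_len, theta = 8, 10000.0
pos = torch.arange(seq_len, dtype=torch.float32)
f0 = 100.0 + 50.0 * torch.rand(seq_len)   # fake pitch track (Hz)
radius = f0 / f0.mean()                   # toy acoustic prominence weight
codes = torch.polar(radius, pos / theta)  # complex per-position codes

# How position i "sees" position j: relative rotation, scaled by both radii.
sim = (codes.unsqueeze(0) * codes.conj().unsqueeze(1)).real
```

Plotting `sim` as a heatmap reproduces the kind of position-similarity pattern shown in the figures below.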

Narrow bands: More focus on nearby positions (good for local patterns)

<img width="84" alt="4" src="https://github.com/user-attachments/assets/6d2c640a-3e01-4632-9cc2-7ced3249f8c5" />

------