Sin2pi committed · Commit 36e60da · verified · Parent: 781042c

Update README.md

Files changed (1): README.md (+16 -18)
NLP/ASR multimodal model with f0-modulated relative positional embeddings. For research/testing.

<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />

This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech, clean).

To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:

1. Pitch-modulated theta: pitch (f0) is used to modify the theta parameter, dynamically adjusting the rotary frequency.
2. Direct similarity bias: a pitch-based similarity bias is added directly to the attention mechanism.
3. Variable radii in torch.polar: the unit-circle radius (1.0) in the `torch.polar` calculation is replaced with variable radii derived from f0. This creates acoustically weighted positional encodings, so each position in the embedding space reflects the acoustic prominence of the original speech. This effectively adds phase and amplitude information without significant computational overhead.
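As a rough sketch of the first enhancement (illustrative only; the function names and the normalization by a reference pitch are assumptions, not the model's actual code), pitch-modulated theta can be thought of as scaling the rotary base by normalized f0:

```python
import math

def rotary_inv_freqs(theta: float, dim: int) -> list[float]:
    # Standard RoPE inverse frequencies: theta^(-2i/dim) for each dimension pair i.
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

def pitch_modulated_theta(base_theta: float, f0: float, f0_ref: float = 200.0) -> float:
    # Hypothetical modulation: scale base theta by f0 relative to a reference pitch.
    # f0 == f0_ref leaves theta unchanged; higher pitch raises theta.
    return base_theta * (f0 / f0_ref)

standard = rotary_inv_freqs(10000.0, 64)
adapted = rotary_inv_freqs(pitch_modulated_theta(10000.0, 400.0), 64)
```

Raising theta with pitch slows the per-position rotation in the higher dimension pairs, so the encoding of a voiced frame differs from an unvoiced one at the same position.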

The function `torch.polar` constructs a complex tensor from polar coordinates:

```python
result = magnitude * (torch.cos(angle) + 1j * torch.sin(angle))
```

So, for each element:

- **magnitude** is the modulus (radius, r)
- **angle** is the phase (theta, in radians)
- The result is: `r * exp(i * theta) = r * (cos(theta) + i * sin(theta))`

Reference: [PyTorch Documentation - torch.polar](https://pytorch.org/docs/stable/generated/torch.polar.html)
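The same identity can be checked without torch, using Python's built-in complex numbers (a minimal sketch of what `torch.polar` computes per element):

```python
import math

def polar(magnitude: float, angle: float) -> complex:
    # Elementwise semantics of torch.polar: r * exp(i * theta).
    return magnitude * complex(math.cos(angle), math.sin(angle))

z = polar(2.0, math.pi / 2)  # radius 2 at 90 degrees: approximately 2j
```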

```python
freqs = torch.polar(radius, freqs)
```

Approximation methods such as cos/sin projections or fixed rotation matrices typically assume a unit circle (radius = 1.0): they only rotate, they never scale. When we introduce a variable radius (amplitude modulation), those approximations break down: they can represent the rotation but not the scaling. With a variable radius we must use true complex multiplication to get correct results. Approximations that ignore the radius, or that scale after the rotation, do not capture the intended effect and lead to degraded or incorrect representations.
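The distinction can be shown in a few lines of plain Python (illustrative only): a rotation-only approximation preserves vector length regardless of radius, while true complex multiplication applies both the rotation and the scaling:

```python
import math

def rotate_only(x: complex, theta: float) -> complex:
    # Unit-circle approximation: rotates but never scales.
    return x * complex(math.cos(theta), math.sin(theta))

def rotate_and_scale(x: complex, theta: float, r: float) -> complex:
    # True complex multiplication by r * exp(i * theta): rotation plus scaling.
    return x * (r * complex(math.cos(theta), math.sin(theta)))

x = 1 + 0j
a = rotate_only(x, math.pi / 4)            # |a| == 1.0: radius information lost
b = rotate_and_scale(x, math.pi / 4, 0.5)  # |b| == 0.5: radius preserved
```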
 

```python
### Do not approximate:
```

Each figure shows 4 subplots, one for each of the first 4 dimensions of the embeddings.

In each subplot:

- **Thick solid lines**: standard RoPE rotations for even dimensions (no F0 adaptation)
- **Thick dashed lines**: standard RoPE rotations for odd dimensions (no F0 adaptation)
- **Thin solid lines**: F0-adapted RoPE rotations for even dimensions
- **Thin dashed lines**: F0-adapted RoPE rotations for odd dimensions

1. **Differences between thick and thin lines**: show how much the F0 information modifies the standard position encodings; larger differences indicate stronger F0 adaptation.

2. **Pattern changes**: the standard RoPE (thick lines) shows regular sinusoidal patterns, while the F0-adapted RoPE (thin lines) shows variations that correspond to the audio's pitch contour.

3. **Dimension-specific effects**: compare across the four subplots to see whether F0 affects different dimensions differently.

4. **Position-specific variations**: in standard RoPE, frequency decreases with dimension index; F0 adaptation modifies this pattern.
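A toy numerical version of the comparison (the per-position adaptation rule here is a hypothetical stand-in for the model's actual f0 mapping):

```python
import math

def rope_angles(n_pos: int, dim: int, theta: float = 10000.0) -> list[list[float]]:
    # Standard RoPE angle at position p, dimension pair i: p * theta^(-2i/dim).
    return [[p * theta ** (-2 * i / dim) for i in range(dim // 2)] for p in range(n_pos)]

def f0_adapted_angles(f0_track: list[float], dim: int,
                      base_theta: float = 10000.0, f0_ref: float = 200.0) -> list[list[float]]:
    # Hypothetical adaptation: theta follows the pitch contour, position by position.
    return [[p * (base_theta * f0 / f0_ref) ** (-2 * i / dim) for i in range(dim // 2)]
            for p, f0 in enumerate(f0_track)]

standard = rope_angles(4, 8)
adapted = f0_adapted_angles([200.0, 250.0, 300.0, 180.0], 8)
```

Plotting `standard` against `adapted` per dimension pair reproduces the thick-vs-thin comparison: the curves coincide where f0 matches the reference and diverge where the pitch contour moves.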

The patterns below show how positions "see" each other in relation to theta and f0.
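One way to see this numerically (an illustrative sketch, not the model's code): rotate a unit vector with RoPE at two positions and take the dot product. The score depends only on the relative distance, and a faster per-step rotation makes similarity fall off sooner, i.e. a narrower band:

```python
import math

def rel_score(p: int, q: int, inv_freq: float) -> float:
    # Dot product of two RoPE-rotated 2D unit vectors depends only on p - q.
    return math.cos((p - q) * inv_freq)

near_slow = rel_score(10, 9, 0.1)  # small angle per step: neighbors stay similar
near_fast = rel_score(10, 9, 1.0)  # large angle per step: similarity decays quickly
```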

Narrow bands: more focus on nearby positions (good for local patterns).

<img width="84" alt="4" src="https://github.com/user-attachments/assets/6d2c640a-3e01-4632-9cc2-7ced3249f8c5" />

------