Update README.md

NLP/ASR multimodal pitch-aware model.

<img width="670" alt="cc" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />

This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech).

Pitch-aware processing: F0/pitch information is integrated throughout the processing pipeline, making the model sensitive to the prosodic features of speech.

To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:

1. The first uses pitch to modify theta (the rotary frequency).
2. The second adds a direct pitch-similarity bias to attention.
3. The third replaces the unit-circle radius (1.0) of `torch.polar` with variable radii. The F0 values are time-aligned with tokens, creating acoustically weighted positional encodings in which the magnitude of each position in the embedding space reflects the acoustic prominence of the original speech.

By modulating the RoPE frequencies based on pitch (F0), we are essentially telling the model to attend to how acoustic features relate to sequence position in a way that is proportional to the voice characteristics. This approach creates a more speech-aware positional representation that helps the model better understand the relationship between acoustic features and text.
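
As a rough sketch of how the first and third enhancements could fit together (the helper name `pitch_aware_rope`, the 220 Hz reference, and the radius scaling below are illustrative assumptions, not the repository's actual API), pitch can both scale the rotary base and supply per-token radii for `torch.polar`:

```python
import torch

def pitch_aware_rope(seq_len: int, dim: int, f0: torch.Tensor, base_theta: float = 10000.0):
    """Sketch: complex rotary embeddings whose frequency and radius both depend on F0.

    f0: per-token pitch in Hz, shape (seq_len,), 0 where unvoiced. All names assumed.
    """
    # Enhancement 1: scale the rotary base by the utterance's mean pitch relative to 220 Hz.
    voiced = f0[f0 > 0]
    pitch_scale = voiced.mean() / 220.0 if voiced.numel() > 0 else torch.tensor(1.0)
    theta = base_theta * pitch_scale

    # Standard RoPE angle schedule, but built from the pitch-scaled base.
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    angles = torch.outer(torch.arange(seq_len).float(), inv_freq)        # (seq_len, dim/2)

    # Enhancement 3: replace the unit-circle radius of torch.polar with per-token radii
    # derived from F0, so acoustically prominent frames get a larger magnitude.
    radius = 1.0 + 0.5 * (f0 / (f0.max() + 1e-6))                        # (seq_len,)
    return torch.polar(radius.unsqueeze(-1).expand_as(angles), angles)   # complex, (seq_len, dim/2)

# Example: 50 tokens, a 64-dim head, and a synthetic rising pitch contour.
rope = pitch_aware_rope(seq_len=50, dim=64, f0=torch.linspace(120.0, 240.0, 50))
```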
These visualizations show how F0 (fundamental frequency/pitch) information affects the model's rotary position embeddings (RoPE).

Each figure shows four subplots (one for each of the first four embedding dimensions in the test run), illustrating how pitch information modifies the position-encoding patterns in the model.

In each subplot:

4. **Position-specific variations**: In standard RoPE, frequency decreases with dimension index, but the F0 adaptation modifies this pattern.

These visualizations help confirm that:

- F0 information is being properly integrated into the model
- The adaptation creates meaningful variations in the position encodings
- The signal is strong enough to potentially help the model understand pitch-sensitive aspects of speech

----
#### Domain-Specific ASR/NLP

#### `freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), dim // 2, device=device, dtype=dtype) / 2595) - 1) / 1000`

Static frequencies are perfectly fine for text models, but not for speech.
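
For illustration only (the `dim` and `theta` values below are made up), the snippet evaluates the schedule above next to a vanilla RoPE schedule, showing how the static text-style layout ignores the speaker's pitch while the mel-scaled version shifts with it:

```python
import math
import torch

dim, device, dtype = 64, "cpu", torch.float32
theta = 220.0  # pitch reference in Hz; try 110.0 or 440.0 to see the schedule shift

# Vanilla RoPE: fixed, log-spaced inverse frequencies, identical for every input.
static = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=dtype) / dim))

# Pitch-aware schedule from above: mel-spaced up to 8 kHz, scaled by theta / 220.
mel_max = 2595 * math.log10(1 + 8000 / 700)
freqs = (theta / 220.0) * 700 * (
    torch.pow(10, torch.linspace(0, mel_max, dim // 2, device=device, dtype=dtype) / 2595) - 1
) / 1000

print(static[:3])  # the same for every utterance
print(freqs[:3])   # stretches or compresses with the speaker's pitch
```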
----

The patterns below show how positions "see" each other in relation to theta and f0.

Narrow bands: More focus on nearby positions (good for local patterns)

<img width="470" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />
#### Pitch bias

The pitch bias implementation creates an attention bias matrix; a minimal sketch is shown after the list below. This makes tokens with similar pitch attend to each other more, which helps:

- Track speaker consistency
- Maintain coherent pitch patterns
- Group harmonically related segments
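
For the bias matrix itself, a minimal sketch (the Gaussian-style similarity, the 100 Hz scale, and the function name are assumptions; the repository's actual construction may differ):

```python
import torch

def pitch_similarity_bias(f0: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Sketch of a pitch-similarity attention bias. f0: per-token pitch in Hz, shape (seq_len,)."""
    diff = (f0.unsqueeze(0) - f0.unsqueeze(1)).abs()    # pairwise |F0_i - F0_j|, (seq_len, seq_len)
    return scale * torch.exp(-diff / 100.0)             # similar pitch -> bias close to `scale`

# Added to the attention logits before softmax.
seq_len, d_head = 50, 64
q, k = torch.randn(seq_len, d_head), torch.randn(seq_len, d_head)
f0 = torch.linspace(120.0, 240.0, seq_len)
logits = q @ k.T / d_head ** 0.5 + pitch_similarity_bias(f0)
attn = logits.softmax(dim=-1)
```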
The theoretical foundation:

- Both position and pitch can be represented as frequencies
- Speech has inherent rhythmic and tonal patterns that correlate with semantic content
- Varying the rotation frequency based on pitch creates a more speech-aware positional encoding

---
The model's rotary implementation maps the perceptual properties of audio onto the mathematical properties of the rotary embeddings, creating a more adaptive and context-aware representation system. Pitch is optionally extracted from audio in the data-processing pipeline and can be used as an additional feature alongside spectrograms, and/or to inform the rotary embeddings and the pitch bias.
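
A rough sketch of that optional pitch branch (the file name is hypothetical and the actual preprocessing may differ; torchaudio's built-in pitch detector stands in for whatever F0 extractor the pipeline really uses):

```python
import torch
import torchaudio

# Load audio and compute mel-spectrogram features.
waveform, sr = torchaudio.load("sample.flac")  # hypothetical file
mel = torchaudio.transforms.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)   # (1, 80, frames)

# Frame-level F0 estimate, resampled to the spectrogram's time axis so the pitch
# track is aligned with the frames/tokens it will condition.
f0 = torchaudio.functional.detect_pitch_frequency(waveform, sr)                   # (1, f0_frames)
f0 = torch.nn.functional.interpolate(
    f0.unsqueeze(1), size=mel.shape[-1], mode="linear", align_corners=False
).squeeze(1)                                                                       # (1, frames)

# The aligned F0 can be stacked with the spectrogram as an extra feature, or passed
# to the rotary / pitch-bias modules described above.
features = torch.cat([mel, f0.unsqueeze(1)], dim=1)                               # (1, 81, frames)
```

Whether the aligned F0 is concatenated as an input feature, routed into the rotary embeddings, used for the attention bias, or some combination of these is a configuration choice rather than a fixed part of the pipeline.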