Sin2pi committed on
Commit 4d76b89 · verified · 1 Parent(s): 7751a8a

Update README.md

Files changed (1)
  1. README.md +62 -17
README.md CHANGED
@@ -20,41 +20,70 @@ tags:
  - new
  ---

- ## Echo - NLP/ASR model with acoustic variable radii relative position embedding (vRoPE) that maps pitch to token. And some other stuff...

- https://github.com/sine2pi/asr_model_echo

- freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), dim // 2, device=device, dtype=dtype) / 2595) - 1) / 1000

- The Magic of Domain-Specific Knowledge in ML..

- Experimental - research model. Some of the modules and functions in the code are not part of the active model just yet.

- Pitch-Aware Processing: Integrates F0/pitch information throughout the processing pipeline, making the model sensitive to prosodic features of speech.

- To highlight the relationship between pitch and rotary embeddings echo implements two complementary pitch-based enhancements:

- 1. The first uses pitch to modify theta (rotary frequency)*
- 2. The second adds direct similarity bias to attention
- 3. Variable radii added in place of unit circle radius(1.0) associated with torch.polar. The frequencies (f0) are time aligned with tokens creating acoustically-weighted positional encodings where the "loudness" of each position in the embedding space reflects the acoustic prominence in the original speech.

- 1000 steps no f0:

- <img width="470" alt="123" src="https://github.com/user-attachments/assets/1b3ca1e8-0b7d-47dd-802b-5eda9537ae13" />

- 1000 steps with f0 / theta substitutions:

- <img width="470" alt="rhg" src="https://github.com/user-attachments/assets/ddfad0c5-21b5-4f1d-879f-ae41411444a8" />

- By modulating the RoPE frequencies based on pitch (F0), we are essentially telling the model to pay attention to the acoustic features relate to sequence position in a way that's proportional to the voice characteristics. This approach creates a more speech-aware positional representation that helps the model better understand the relationship between acoustic features and text.

  The patterns below show how positions "see" each other in relation to theta and f0.

@@ -65,10 +94,13 @@ Narrow bands: More focus on nearby positions (good for local patterns)
  <img width="470" alt="cc" src="https://github.com/user-attachments/assets/28d00fc5-2676-41ed-a971-e4d857af43f8" />
  <img width="470" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />

  Pitch bias

- The pitch bias implementation creates an attention bias matrix.
- This makes tokens with similar pitch attend to each other more.

  - Track speaker consistency
  - Maintain coherent pitch patterns

@@ -79,4 +111,17 @@ The theoretical foundation:
  - Speech has inherent rhythmic and tonal patterns that correlate with semantic content
  - Varying the rotation frequency based on pitch creates a more speech-aware positional encoding

  ---
  - new
  ---

+ ### NLP/ASR model with acoustic variable radii relative position embedding (vRoPE) that maps pitch to token.
+ ----

+ Pitch-Aware Processing: Integrates F0/pitch information throughout the processing pipeline, making the model sensitive to prosodic features of speech.

+ To highlight the relationship between pitch and rotary embeddings, Echo implements three complementary pitch-based enhancements:

+ 1. The first uses pitch to modify theta (rotary frequency)*
+ 2. The second adds a direct pitch-similarity bias to attention
+ 3. The third replaces the unit-circle radius (1.0) used by torch.polar with variable radii. The pitch frequencies (f0) are time-aligned with tokens, creating acoustically weighted positional encodings in which the "loudness" of each position in the embedding space reflects the acoustic prominence in the original speech.
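A minimal sketch of the variable-radii idea in point 3, assuming a token-aligned f0 track normalized against a 220 Hz reference; the function name and normalization are illustrative, not Echo's exact code:

```python
import torch

def variable_radius_rope(f0, dim=64, theta=10000.0):
    """Sketch of a variable-radii rotary embedding (vRoPE).

    Standard RoPE uses torch.polar(torch.ones_like(angles), angles),
    i.e. a fixed radius of 1.0. Here the per-position radius is scaled
    by a normalized, token-aligned f0 track, so acoustically prominent
    positions get larger magnitudes.
    """
    seq_len = f0.shape[0]
    inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)          # (seq_len, dim/2)
    # Map f0 (Hz) to radii around 1.0; unvoiced frames (f0 == 0) stay at 1.0.
    radius = torch.where(f0 > 0, f0 / 220.0, torch.ones_like(f0))
    radius = radius.unsqueeze(-1).expand_as(angles)
    return torch.polar(radius, angles)                 # complex, (seq_len, dim/2)

f0 = torch.tensor([0.0, 110.0, 220.0, 440.0])          # token-aligned pitch in Hz
rot = variable_radius_rope(f0, dim=8)
```

The magnitude of each complex rotation now carries the pitch information, while its angle still encodes position as in standard RoPE.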

+ By modulating the RoPE frequencies based on pitch (F0), we are essentially telling the model to attend to how acoustic features relate to sequence position, in a way that is proportional to the voice characteristics. This approach creates a more speech-aware positional representation that helps the model better understand the relationship between acoustic features and text.

+ ![rotpatterns](https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae)

+ These visualizations show how F0 (fundamental frequency/pitch) information affects the model's rotary position embeddings (RoPE).

+ Each figure shows 4 subplots (one for each of the first 4 embedding dimensions in the test run). These visualizations show how pitch information modifies the position encoding patterns in the model.

+ In each subplot:

+ - **Thick solid lines**: Standard RoPE rotations for even dimensions (no F0 adaptation)
+ - **Thick dashed lines**: Standard RoPE rotations for odd dimensions (no F0 adaptation)
+ - **Thin solid lines**: F0-adapted RoPE rotations for even dimensions
+ - **Thin dashed lines**: F0-adapted RoPE rotations for odd dimensions

+ 1. **Differences between thick and thin lines**: Show how much the F0 information modifies the standard position encodings. Larger differences indicate stronger F0 adaptation.
+ 2. **Pattern changes**: The standard RoPE (thick lines) shows regular sinusoidal patterns, while the F0-adapted RoPE (thin lines) shows variations that correspond to the audio's pitch contour.
+ 3. **Dimension-specific effects**: Compare across the four subplots to see whether F0 affects different dimensions differently.
+ 4. **Position-specific variations**: In standard RoPE, frequency decreases with dimension index, but F0 adaptation modifies this pattern.
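A rough sketch of what the two families of curves compute, with a hypothetical sinusoidal pitch contour standing in for a real f0 track (names and the 220 Hz normalization are assumptions):

```python
import torch

# Standard RoPE angles (the thick lines): fixed per-dimension frequencies.
dim, seq_len, theta = 8, 50, 10000.0
inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
pos = torch.arange(seq_len).float()
standard = torch.outer(pos, inv_freq)                  # (seq_len, dim/2)

# F0-adapted angles (the thin lines): the same rotations, rescaled per
# position by a pitch contour time-aligned to the sequence.
f0 = 150.0 + 50.0 * torch.sin(pos / 5.0)               # hypothetical contour, Hz
adapted = standard * (f0 / 220.0).unsqueeze(-1)

# The per-dimension gap between the two curves measures adaptation strength;
# it shrinks for higher dimension indices, where inv_freq is smaller.
gap = (adapted - standard).abs().mean(dim=0)
```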

+ These visualizations help confirm that:
+ - F0 information is being properly integrated into the model
+ - The adaptation creates meaningful variations in the position encodings
+ - The signal is strong enough to potentially help the model understand pitch-sensitive aspects of speech
+
+ ----
+ #### Domain-Specific ASR model.
+ #### Domain-Specific ASR model.

+ #### `freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), dim // 2, device=device, dtype=dtype) / 2595) - 1) / 1000`
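The one-liner above is easier to read when expanded. A sketch assuming `dim`, `device`, and `dtype` come from the surrounding module (the helper name is illustrative):

```python
import torch

def mel_scaled_freqs(theta, dim, device=None, dtype=torch.float32):
    """Readable expansion of the one-line formula above (same math).

    Frequencies are spaced linearly on the mel scale from 0 to mel(8000 Hz),
    converted back to Hz, scaled to kHz, then scaled by theta relative to a
    220 Hz reference pitch -- so a pitch-derived theta shifts the whole bank.
    """
    mel_max = 2595 * torch.log10(torch.tensor(1 + 8000 / 700))       # mel(8000 Hz)
    mels = torch.linspace(0, float(mel_max), dim // 2, device=device, dtype=dtype)
    hz = 700 * (torch.pow(10, mels / 2595) - 1)                      # inverse mel
    return (theta / 220.0) * hz / 1000                               # kHz, pitch-scaled

freqs = mel_scaled_freqs(theta=220.0, dim=8)   # spans 0 to 8 (kHz) at theta = 220
```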

+ #### Static frequencies are perfectly fine for text models, but not for speech.
+ -----
+
+ 1000 steps no f0:
+
+ <img width="470" alt="123" src="https://github.com/user-attachments/assets/1b3ca1e8-0b7d-47dd-802b-5eda9537ae13" />
+
+ 1000 steps with f0 / theta substitutions:
+
+ <img width="470" alt="rhg" src="https://github.com/user-attachments/assets/ddfad0c5-21b5-4f1d-879f-ae41411444a8" />
+
+ https://huggingface.co/Sin2pi/Echo17/tensorboard?params=scalars#frame
+
+ <img width="470" alt="4345" src="https://github.com/user-attachments/assets/b918fa73-fde0-40ca-9f09-bfc4e97e1ddf" />
+
+ ----

  The patterns below show how positions "see" each other in relation to theta and f0.

  <img width="470" alt="cc" src="https://github.com/user-attachments/assets/28d00fc5-2676-41ed-a971-e4d857af43f8" />
  <img width="470" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />

+
+ Echo's rotary implementation maps the perceptual properties of audio to the mathematical properties of the rotary embeddings, creating a more adaptive and context-aware representation system. Pitch is optionally extracted from audio in the data processing pipeline and can be used as an additional feature alongside spectrograms, and/or to inform the rotary embedding and/or the pitch bias.
+
  Pitch bias

+ The pitch bias implementation creates an attention bias matrix.
+ This makes tokens with similar pitch attend to each other more, which helps:

  - Track speaker consistency
  - Maintain coherent pitch patterns
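A minimal sketch of such a bias matrix, assuming an absolute-difference penalty (the function name and `scale` parameter are illustrative, not Echo's exact code):

```python
import torch

def pitch_similarity_bias(f0, scale=1.0):
    """Sketch of a pitch-similarity attention bias.

    Builds a (seq_len, seq_len) matrix that is 0 where two positions have
    identical pitch and increasingly negative as their pitch differs, so
    adding it to pre-softmax attention scores makes tokens with similar
    pitch attend to each other more.
    """
    diff = (f0.unsqueeze(0) - f0.unsqueeze(1)).abs()   # pairwise |f0_i - f0_j|
    return -scale * diff

f0 = torch.tensor([100.0, 105.0, 220.0])               # token-aligned pitch in Hz
bias = pitch_similarity_bias(f0, scale=0.01)
# scores = scores + bias  # added to attention logits before softmax
```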
 
  - Speech has inherent rhythmic and tonal patterns that correlate with semantic content
  - Varying the rotation frequency based on pitch creates a more speech-aware positional encoding

+ ---
+
+ <img width="470" alt="cc2" src="https://github.com/user-attachments/assets/d52a48b1-8717-4d29-9452-cfdf43c92fe8" />
+
+ ## The F0-Conditioned Rotation Mechanism
+
+ The high gate usage validates the fundamental-frequency conditioning approach:
+
+ - Pitch-adaptive rotary embeddings are providing meaningful signal that the gates are actively utilizing
+ - The decoder is learning to selectively attend to pitch-relevant patterns
+ - The gates are functioning as a kind of "pitch-aware filter" that determines which information should flow through the network
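One way such a gate could look, as a hypothetical sketch; the module name and wiring are assumptions, not Echo's actual architecture:

```python
import torch
import torch.nn as nn

class PitchAwareGate(nn.Module):
    """Hypothetical sketch of the gating described above.

    A learned sigmoid gate decides, per position and channel, how much of
    the pitch-conditioned features flows through; "gate usage" can be read
    off as the mean gate activation across a batch.
    """
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, pitch_features):
        gate = torch.sigmoid(self.proj(pitch_features))  # values in (0, 1)
        return gate * x, gate.mean()                      # gated output, usage stat

gate = PitchAwareGate(8)
x = torch.randn(2, 5, 8)                 # decoder hidden states
pitch = torch.randn(2, 5, 8)             # pitch-conditioned features
out, usage = gate(x, pitch)              # usage near 1.0 = gates wide open
```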

  ---