Update README.md
NLP/ASR multimodal pitch-aware model.
<img width="670" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />
**This plot illustrates the pattern similarity of pitch and spectrogram (LibriSpeech).**
To highlight the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
By modulating the RoPE frequencies based on pitch (F0), we are essentially telling the model to relate acoustic features to sequence position in a way that is proportional to the voice characteristics. This approach creates a more speech-aware positional representation that helps the model better understand the relationship between acoustic features and text.
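A minimal sketch of what pitch-modulated rotary frequencies can look like. The function names, the F0 normalization, and the `alpha` blend factor here are illustrative assumptions, not this repository's actual API:

```python
import numpy as np

def rotary_angles(seq_len, dim, f0, base=10000.0, alpha=0.1):
    """RoPE angles with per-frame frequency scaling from pitch (F0).
    alpha controls how strongly pitch bends the rotation rate (assumed knob)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))   # (dim/2,)
    f0_norm = f0 / (np.max(f0) + 1e-8)                        # roughly [0, 1]
    scale = 1.0 + alpha * f0_norm                             # (seq_len,)
    pos = np.arange(seq_len)
    # angle[t, k] = t * inv_freq[k] * scale[t]
    return pos[:, None] * inv_freq[None, :] * scale[:, None]

def apply_rotary(x, angles):
    """Rotate consecutive (even, odd) channel pairs of x by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# A voiced contour rising from 120 Hz to 220 Hz over 50 frames:
f0 = np.linspace(120.0, 220.0, 50)
x = np.random.randn(50, 64)
y = apply_rotary(x, rotary_angles(50, 64, f0))
```

Because the rotation is a pure phase shift, each (even, odd) channel pair keeps its magnitude; pitch only changes how fast the phase advances with position.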
<img width="670" alt="cc4" src="https://github.com/user-attachments/assets/165a3f18-659a-4e2e-a154-a3456b667bae" />
Each figure shows 4 subplots (one for each of the first 4 dimensions of the embeddings in the test run). These visualizations show how pitch information modifies position-encoding patterns in the model.
Bright diagonal line: Each position matches itself perfectly.
Wider bright bands: Positions can "see" farther (good for long dependencies) but can be noisy.
Narrow bands: More focus on nearby positions (good for local patterns).
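The banding described above can be reproduced by taking dot products of unit-norm RoPE phase vectors across positions. This is a standalone sketch, independent of the repository's code:

```python
import numpy as np

def rope_embed(seq_len, dim, base=10000.0):
    """Unit-norm phase vectors: position t -> [cos(t*w_k)..., sin(t*w_k)...]."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    ang = np.arange(seq_len)[:, None] * inv_freq[None, :]
    # Each row has norm 1, so self-similarity on the diagonal is exactly 1.
    return np.concatenate([np.cos(ang), np.sin(ang)], axis=-1) / np.sqrt(dim / 2)

E = rope_embed(128, 64)
sim = E @ E.T   # (128, 128): bright diagonal, bands fading with distance
```

Raising `base` lowers the rotation frequencies, which widens the bright band (positions "see" farther); lowering it concentrates similarity near the diagonal.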
<img width="670" alt="cc" src="https://github.com/user-attachments/assets/28d00fc5-2676-41ed-a971-e4d857af43f8" />

<img width="670" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />
The model's rotary implementation maps the perceptual properties of audio onto the mathematical properties of the rotary embeddings, creating a more adaptive and context-aware representation system. Pitch is optionally extracted from audio in the data-processing pipeline and can be used as an additional feature alongside spectrograms, and/or to inform the rotary embeddings and/or the pitch bias.
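As one illustration of how an additive pitch bias could enter attention, the sketch below biases attention logits toward frames with similar F0. The names and the exact formulation are hypothetical, not this repository's implementation; in practice F0 would come from the pipeline's pitch extractor (e.g. `librosa.pyin` or torchaudio's pitch detection):

```python
import numpy as np

def pitch_bias(f0, scale=1.0):
    """Additive attention bias: frames with similar F0 attend to each other
    more strongly. Hypothetical formulation for illustration only."""
    f0 = np.asarray(f0, dtype=float)
    diff = np.abs(f0[:, None] - f0[None, :])     # pairwise |F0_i - F0_j| in Hz
    return -scale * diff / (np.max(f0) + 1e-8)   # 0 on the diagonal, <= 0 elsewhere

def attention(q, k, v, bias):
    """Scaled dot-product attention with an additive bias on the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over keys
    return w @ v

f0 = np.array([120.0, 125.0, 210.0, 0.0])        # last frame unvoiced (F0 = 0)
b = pitch_bias(f0)
q = k = v = np.random.randn(4, 8)
out = attention(q, k, v, b)
```

Because the bias is added before the softmax, it reshapes the attention distribution without breaking normalization: rows still sum to 1, and pitch-similar frame pairs get relatively more weight.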