Update README.md
README.md
CHANGED
@@ -20,19 +20,28 @@ tags:
 - new
 ---




-NLP/ASR multimodal model with f0 modulated relative positional embeddings.
-For research/testing.

-Why?
-Because a significant portion of current AI research is focused on optimizing existing methods instead of exploring new approaches.


-<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/

-(librispeech - clean).

 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch based enhancements:

@@ -67,7 +76,7 @@ theta = f0_mean + self.theta
 freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2) / 2595) - 1) / 1000
 ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
 ## I thought a mel-scale version might be more perceptually meaningful for audio.
-## Using mel-scale to create a perceptually-relevant distance metric.

 freqs = t[:, None] * freqs[None, :] # don't repeat or use some other method here

@@ -138,17 +147,12 @@ Narrow bands: More focus on nearby positions (good for local patterns)
 <img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />

 ----


 This model sometimes uses:

 https://github.com/sine2pi/Maxfactor

-
-`MaxFactor` is a custom PyTorch optimizer with adaptive learning rates and specialized handling for matrix parameters. I wrote it for the model in the asr_model repository.
-I needed something that performs well and has a light memory footprint since I do everything from my laptop.
-
-
-----
-
-
@@ -20,19 +20,28 @@ tags:
 - new
 ---

+NLP/ASR multimodal model with f0 modulated relative positional embeddings.
+For research/testing.
+----

+Questions:

+-How can we make attention mechanisms aware of speech-specific properties?
+
+-Can we incorporate acoustic information directly into positional encodings?
+
+-Does pitch-conditioning improve speech recognition?
+
+Standard RoPE was designed for text: Text doesn't have pitch, timing, or acoustic properties.
+
+----




+<img width="780" alt="cc5" src="https://github.com/user-attachments/assets/106ebe75-f1db-4f85-bdae-818b114fedd2" />

+This plot illustrates the pattern similarity of pitch waveform and spectrogram. (librispeech - clean).

 To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch based enhancements:

@@ -67,7 +76,7 @@ theta = f0_mean + self.theta
 freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2) / 2595) - 1) / 1000
 ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
 ## I thought a mel-scale version might be more perceptually meaningful for audio.
+## Using mel-scale to create a perceptually-relevant distance metric instead of Euclidean distance.

 freqs = t[:, None] * freqs[None, :] # don't repeat or use some other method here
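
The mel-scale expression in this hunk can be unpacked in plain Python. This is a minimal sketch, not the repository's code: it mirrors the README formula using only the standard library, and the function names, the `f_max` parameter, and the example values of `theta` and `dim` are assumptions made for illustration. In the README, `theta` is shifted by the mean pitch (`theta = f0_mean + self.theta`); here it is passed in directly.

```python
import math
from typing import List


def mel_rotary_freqs(theta: float, dim: int, f_max: float = 8000.0) -> List[float]:
    """Pure-Python mirror of the README's mel-scale frequency expression.

    Takes dim // 2 points evenly spaced on the mel scale between 0 and
    mel(f_max), maps them back to Hz, then scales by theta / 220 and 1 / 1000.
    """
    n = dim // 2
    # Hz -> mel for the upper band edge: 2595 * log10(1 + f / 700).
    mel_max = 2595.0 * math.log10(1.0 + f_max / 700.0)
    # Same endpoints as torch.linspace(0, mel_max, n).
    mels = [mel_max * i / (n - 1) for i in range(n)] if n > 1 else [0.0]
    # mel -> Hz: 700 * (10 ** (mel / 2595) - 1).
    hz = [700.0 * (10.0 ** (m / 2595.0) - 1.0) for m in mels]
    return [(theta / 220.0) * h / 1000.0 for h in hz]


def standard_rope_freqs(theta: float, dim: int) -> List[float]:
    """Standard RoPE inverse frequencies, 1 / theta ** (2i / dim)."""
    return [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]


# Hypothetical values, chosen only for illustration.
mel = mel_rotary_freqs(theta=220.0, dim=8)
std = standard_rope_freqs(theta=10000.0, dim=8)
```

With `theta = 220.0` the scale factor is 1, so the mel-spaced frequencies start at 0 and increase up to roughly 8 (8000 Hz divided by 1000), while the standard RoPE frequencies start at 1 and decay geometrically.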
@@ -138,17 +147,12 @@ Narrow bands: More focus on nearby positions (good for local patterns)
 <img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />

 ----
+https://huggingface.co/Sin2pi/Echo17/tensorboard?params=scalars

+----

 This model sometimes uses:

 https://github.com/sine2pi/Maxfactor

+MaxFactor is a custom PyTorch optimizer with adaptive learning rates and specialized handling for matrix parameters. I wrote it for the model in the asr_model repository. I needed something that performs well and has a light memory footprint since I do everything from my laptop.