Sin2pi committed · Commit e38891a · verified · 1 Parent(s): 3ec2e71

Update README.md

Files changed (1)
  1. README.md +19 -15
README.md CHANGED
@@ -20,19 +20,28 @@ tags:
20
  - new
21
  ---
22
 
23
 
24
 
25
 
26
- NLP/ASR multimodal model with f0-modulated relative positional embeddings.
27
- For research/testing.
28
 
29
- Why?
30
- Because a significant portion of current AI research is focused on optimizing existing methods instead of exploring new approaches.
31
 
32
 
33
- <img width="780" alt="cc5" src="https://github.com/user-attachments/assets/ce9417de-a892-4811-b151-da612f31c0fb" />
34
 
35
- (librispeech - clean).
36
 
37
  To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
38
 
@@ -67,7 +76,7 @@ theta = f0_mean + self.theta
67
  freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2) / 2595) - 1) / 1000
68
  ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
69
  ## I thought a mel-scale version might be more perceptually meaningful for audio.
70
- ## Using mel-scale to create a perceptually relevant distance metric.
71
 
72
  freqs = t[:, None] * freqs[None, :] # don't repeat or use some other method here
73
 
@@ -138,17 +147,12 @@ Narrow bands: More focus on nearby positions (good for local patterns)
138
  <img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />
139
 
140
  ----
141
 
142
 
143
  This model sometimes uses:
144
 
145
  https://github.com/sine2pi/Maxfactor
146
 
147
-
148
- `MaxFactor` is a custom PyTorch optimizer with adaptive learning rates and specialized handling for matrix parameters. I wrote it for the model in the asr_model repository.
149
- I needed something that performs well and has a light memory footprint, since I do everything from my laptop.
150
-
151
-
152
- ----
153
-
154
-
 
20
  - new
21
  ---
22
 
23
+ NLP/ASR multimodal model with f0-modulated relative positional embeddings.
24
+ For research/testing.
25
+ ----
26
 
27
+ Questions:
28
 
29
+ - How can we make attention mechanisms aware of speech-specific properties?
30
+
31
+ - Can we incorporate acoustic information directly into positional encodings?
32
+
33
+ - Does pitch conditioning improve speech recognition?
34
+
35
+ Standard RoPE was designed for text, which has no pitch, timing, or acoustic properties.
36
+
37
+ ----
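The point about standard RoPE can be made concrete with a small, dependency-free sketch. It is not taken from this repository: `rope_rotate` and `dot` are hypothetical helper names, and the base of 10000 is the conventional default rather than a value confirmed by the README. The sketch shows that a standard rotary embedding takes only the position index as input, so the score between two rotated vectors depends on their relative offset and on nothing acoustic.

```python
import math


def rope_rotate(vec, pos, theta=10000.0):
    """Rotate consecutive feature pairs of `vec` by position-dependent angles.

    Hypothetical helper for illustration; `theta` is the conventional RoPE base.
    """
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Standard inverse-frequency schedule: 1 / theta^(i / dim)
        angle = pos / (theta ** (i / dim))
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_b = dot(rope_rotate(q, 13), rope_rotate(k, 10))

# Different relative offset (1).
score_c = dot(rope_rotate(q, 5), rope_rotate(k, 4))

print(score_a, score_b, score_c)
```

Because the position index is the only input here, acoustic information such as f0 would have to be injected explicitly, for example through the base `theta`, which is the direction the README explores with `theta = f0_mean + self.theta`.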
38
 
39
 
40
 
41
 
42
+ <img width="780" alt="cc5" src="https://github.com/user-attachments/assets/106ebe75-f1db-4f85-bdae-818b114fedd2" />
43
 
44
+ This plot illustrates the pattern similarity between the pitch waveform and the spectrogram (LibriSpeech, clean).
45
 
46
  To explore the relationship between pitch and rotary embeddings, the model implements three complementary pitch-based enhancements:
47
 
 
76
  freqs = (theta / 220.0) * 700 * (torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)), self.dim // 2) / 2595) - 1) / 1000
77
  ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
78
  ## I thought a mel-scale version might be more perceptually meaningful for audio.
79
+ ## Using mel-scale to create a perceptually relevant distance metric instead of Euclidean distance.
80
 
81
  freqs = t[:, None] * freqs[None, :] # don't repeat or use some other method here
82
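To see what the mel-spaced line in this hunk computes, here is a pure-Python mirror of the two frequency formulas quoted above, plus the outer product `t[:, None] * freqs[None, :]`. This is a sketch for inspection only, not the model's code: the function names and the `f_max` parameter are hypothetical, and the constants (220, 700, 2595, 8000, 1000) are copied from the README line.

```python
import math


def mel_spaced_freqs(theta, dim, f_max=8000.0):
    """Mirror of the README's mel-spaced `freqs` line, without torch."""
    n = dim // 2
    mel_max = 2595.0 * math.log10(1.0 + f_max / 700.0)
    # n evenly spaced points on the mel axis from 0 to mel_max (like torch.linspace)
    mels = [mel_max * i / (n - 1) for i in range(n)] if n > 1 else [0.0] * n
    # mel -> Hz, scaled by theta / 220 and divided by 1000 as in the README
    return [(theta / 220.0) * 700.0 * (10.0 ** (m / 2595.0) - 1.0) / 1000.0
            for m in mels]


def standard_rope_freqs(theta, dim):
    """The standard schedule the README compares against: 1 / theta^(i / dim)."""
    return [1.0 / (theta ** (i / dim)) for i in range(0, dim, 2)]


def rotation_angles(positions, freqs):
    """Outer product of positions and frequencies: one angle per (position, band)."""
    return [[t * f for f in freqs] for t in positions]


mel = mel_spaced_freqs(theta=220.0, dim=8)
std = standard_rope_freqs(theta=10000.0, dim=8)
angles = rotation_angles([0, 1, 2], mel)

print("mel-spaced:", mel)
print("standard:  ", std)
print("angles:    ", angles)
```

With these formulas the mel-spaced values rise from 0 up to `8 * theta / 220`, while the standard schedule decays from 1. Since `theta` enters the mel-spaced version as a linear factor, a higher `f0_mean` in `theta = f0_mean + self.theta` scales every band by the same ratio.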
 
 
147
  <img width="680" alt="cc2" src="https://github.com/user-attachments/assets/9089e806-966b-41aa-8793-bee03a6e6be1" />
148
 
149
  ----
150
+ https://huggingface.co/Sin2pi/Echo17/tensorboard?params=scalars
151
 
152
+ ----
153
 
154
  This model sometimes uses:
155
 
156
  https://github.com/sine2pi/Maxfactor
157
 
158
+ MaxFactor is a custom PyTorch optimizer with adaptive learning rates and specialized handling for matrix parameters. I wrote it for the model in the asr_model repository. I needed something that performs well and has a light memory footprint, since I do everything from my laptop.