Update README.md
README.md
---
----

NLP/ASR multimodal model with f0-modulated relative positional embeddings.

For research/testing.

Moving beyond literal transcription toward something more artistic and creative.
On the road to building a "creative" speech-to-text model that can:

- Generate stories from audio
- Make poetic associations
- Fill in gaps with imagination
- Create richer, more expressive text
- Separate the sad Morrissey songs from the two that aren't

----
Questions:

- How can we make attention mechanisms aware of speech-specific properties?
- Does pitch conditioning improve speech recognition?

Standard RoPE was designed for text: text doesn't have pitch, timing, or acoustic properties.
xPos imposes an artificial, generic decay; here its place is taken by something more meaningful.

---
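For comparison, the standard text-RoPE frequency ladder that the snippet below replaces can be sketched in plain Python (a minimal sketch: `dim = 8` and the helper name are mine, chosen only for illustration):

```python
def rope_inv_freqs(dim: int, theta: float = 10000.0) -> list[float]:
    # Standard RoPE: inverse frequencies 1 / theta**(2i / dim), i = 0 .. dim/2 - 1.
    # theta = 10000 is the conventional text default.
    return [1.0 / theta ** (2.0 * i / dim) for i in range(dim // 2)]

freqs = rope_inv_freqs(8)
# freqs ≈ [1.0, 0.1, 0.01, 0.001]: a fixed geometric ladder,
# independent of pitch, timing, or any other acoustic property.
```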
```python
if f0 is not None:
    ...
else:
    theta = self.theta

# In text, theta = 10,000 sets the base frequency for positional encoding,
# ensuring a wide range of periodicities for long sequences. I'm not sure
# whether the specific number 10k was experimentally derived.
# For audio, especially speech, the relevant periodicities are determined by
# pitch, so conditioning on f0 (a neighborhood, or f0 per frame) might be
# more meaningful.

freqs = (theta.unsqueeze(-1) / 220.0) * 700 * (
    torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)),
                                 self.dim // 2, device=theta.device, dtype=theta.dtype) / 2595) - 1) / 1000

# This seems to give better results than the standard
# freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
# I thought a mel-scale version might be more perceptually meaningful for
# audio, i.e. using the mel scale as a perceptually relevant distance metric
# instead of a Euclidean one.

t = torch.arange(ctx, device=device, dtype=dtype)
freqs = t[:, None] * freqs  # don't repeat or use some other method here

# ...

radius = torch.ones_like(freqs)
freqs = torch.polar(radius, freqs)
```
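A rough plain-Python mirror of the mel-spaced schedule above may make the constants clearer (the helper names `mel_to_hz` and `mel_freqs` are mine, not from the repo): points spaced linearly in mel space are mapped back to Hz via the 700/2595 constants, scaled by `(theta / 220) / 1000`, and then turned into a unit-circle rotation, which is what `torch.polar` produces when the radius is one.

```python
import cmath
import math

def mel_to_hz(m: float) -> float:
    # Inverse of mel(f) = 2595 * log10(1 + f / 700) — the snippet's constants.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_freqs(dim: int, theta: float = 220.0, fmax: float = 8000.0) -> list[float]:
    # dim // 2 points spaced linearly in mel space from 0 to mel(fmax),
    # mapped back to Hz, then scaled by (theta / 220) / 1000 as in the snippet.
    top_mel = 2595.0 * math.log10(1.0 + fmax / 700.0)
    half = dim // 2
    mels = [top_mel * i / (half - 1) for i in range(half)]
    return [(theta / 220.0) * mel_to_hz(m) / 1000.0 for m in mels]

freqs = mel_freqs(8)         # spacing in Hz widens toward the top of the range
angle = 2 * freqs[1]         # position t = 2, second frequency band
rot = cmath.exp(1j * angle)  # unit-radius rotation, like torch.polar(1.0, angle)
```

Because the spacing is linear in mels rather than in Hz, neighboring bands are packed more densely at low frequencies, where hearing is more discriminating.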
A closer look at what's going on. Here is a slice of the actual radius values for one step: