Sin2pi committed
Commit 25a309a · verified · 1 parent: fc4dff3

Update README.md

Files changed (1): README.md (+21 −10)
README.md CHANGED
@@ -21,13 +21,23 @@ tags:
 ---
 ----
-
- ### NLP/ASR multimodal modal with f0 modulated relative positional embeddings.
 For research/testing.

 ----

- ### Questions:

 - How can we make attention mechanisms aware of speech-specific properties?
@@ -36,9 +46,10 @@ For research/testing.
 - Does pitch-conditioning improve speech recognition?

 Standard RoPE was designed for text: text doesn't have pitch, timing, or acoustic properties.

- ----
-
@@ -78,14 +89,15 @@ if f0 is not None:
 else:
     theta = self.theta

 freqs = (theta.unsqueeze(-1) / 220.0) * 700 * (
     torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)),
     self.dim // 2, device=theta.device, dtype=theta.dtype) / 2595) - 1) / 1000

- ## This seems to give superior results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
- ## I thought a mel-scale version might be more perceptually meaningful for audio..
- ## Using mel-scale to create a perceptually-relevant distance metric instead of Euclidean distance.
-
 t = torch.arange(ctx, device=device, dtype=dtype)
 freqs = t[:, None] * freqs  # don't repeat or use some other method here

@@ -97,7 +109,6 @@ else:
 radius = torch.ones_like(freqs)
 freqs = torch.polar(radius, freqs)

-
 ```

 A closer look at what's going on. Here is a slice of the actual radius values for one step
 
 ---
 ----
+ NLP/ASR multimodal model with f0-modulated relative positional embeddings.

 For research/testing.

+ Moving beyond literal transcription toward something more artistic and creative.
+
+ On a road to build a "creative" speech-to-text model that can:
+
+ - Generate stories from audio
+ - Make poetic associations
+ - Fill in gaps with imagination
+ - Create richer, more expressive text
+ - Separate the sad Morrissey songs from the two that aren't

 ----

+ Questions:

 - How can we make attention mechanisms aware of speech-specific properties?

 - Does pitch-conditioning improve speech recognition?

 Standard RoPE was designed for text: text doesn't have pitch, timing, or acoustic properties.
+ Xpos applies an artificial, generic decay; this replaces it with something more meaningful.
+
+ ---
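For comparison, the standard text-RoPE schedule referenced later in the code comments is a geometric ladder of inverse frequencies built from the base theta = 10,000 — purely a coverage trick for long sequences, with no acoustic meaning. A minimal sketch (the function name `rope_inv_freqs` is illustrative, not from this repo):

```python
import torch

def rope_inv_freqs(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE: one inverse frequency per channel pair,
    # geometrically spaced from 1.0 down to roughly 1/theta.
    return 1.0 / (theta ** (torch.arange(0, dim, 2)[: dim // 2].float() / dim))

inv = rope_inv_freqs(64)
# inv[0] is 1.0 (fastest rotation); later channels rotate ever more slowly,
# so low channels barely move over long spans.
```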
 else:
     theta = self.theta

+ ## In text, theta = 10,000 sets the base frequency for positional encoding, ensuring a wide range of periodicities for long sequences. I'm not sure if the specific number 10k was experimentally derived.
+ ## For audio, especially speech, the relevant periodicities are set by pitch, so a theta conditioned on f0 (a neighborhood estimate, or f0 per frame) might be more meaningful.

 freqs = (theta.unsqueeze(-1) / 220.0) * 700 * (
     torch.pow(10, torch.linspace(0, 2595 * torch.log10(torch.tensor(1 + 8000/700)),
     self.dim // 2, device=theta.device, dtype=theta.dtype) / 2595) - 1) / 1000

+ ## This seems to give better results compared to the standard freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim)).
+ ## I thought a mel-scale version might be more perceptually meaningful for audio, i.e. using the mel scale as a perceptually relevant distance metric instead of a Euclidean one.

 t = torch.arange(ctx, device=device, dtype=dtype)
 freqs = t[:, None] * freqs  # don't repeat or use some other method here

 radius = torch.ones_like(freqs)
 freqs = torch.polar(radius, freqs)
 ```
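The snippet above can be run standalone. Here is a self-contained sketch assuming a scalar per-frame f0 estimate `theta`; the 220 Hz reference pitch and 8 kHz ceiling come from the code above, while the helper name `mel_freqs` and the example shapes are mine:

```python
import math
import torch

def mel_freqs(theta: torch.Tensor, dim: int, f0_ref: float = 220.0) -> torch.Tensor:
    mel_max = 2595 * math.log10(1 + 8000 / 700)    # mel value of the 8 kHz ceiling
    mels = torch.linspace(0.0, mel_max, dim // 2)  # evenly spaced in mel, not Hz
    hz = 700 * (torch.pow(10, mels / 2595) - 1)    # back to Hz: 0 .. 8000
    return (theta / f0_ref) * hz / 1000            # pitch-scaled rotation rates

theta = torch.tensor(220.0)               # e.g. one frame's f0 estimate
freqs = mel_freqs(theta, dim=64)          # (32,) values from 0.0 up to 8.0
t = torch.arange(100, dtype=freqs.dtype)
angles = t[:, None] * freqs               # (100, 32) phase per position / pair
rot = torch.polar(torch.ones_like(angles), angles)  # unit-radius complex rotations
```

Because the radius passed to `torch.polar` is all ones, the result encodes pure rotations; the mel spacing only changes how fast each channel pair spins.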

 A closer look at what's going on. Here is a slice of the actual radius values for one step