Audio-Text-to-Text · glap_model

richermans committed (verified)
Commit e4066f7 · 1 Parent(s): 579e1a3

Update README.md

Files changed (1): README.md (+207 −207)

README.md (updated):
---
license: apache-2.0
---

<div align="center">
<h1>
GLAP (Generalized Language Audio Pretraining)
</h1>
<p>
Official PyTorch code for <b>GLAP</b> <br>
<b><em>Generalized Language Audio Pretraining</em></b>
</p>
<a href="https://arxiv.org/abs/"><img src="https://img.shields.io/badge/" alt="paper"></a>
<a href="https://github.com/xiaomi-research/GLAP"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="platform"></a>
<a href="https://www.python.org"><img src="https://img.shields.io/badge/Python-3.10+-orange" alt="python"></a>
<a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-2.0+-brightgreen" alt="pytorch"></a>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="apache-2.0"></a>
<img src="https://img.shields.io/pypi/dm/glap_model" alt="PyPI Downloads">
</div>

# GLAP (Generalized Language Audio Pretraining)

<img src="capabilities.png" alt="GLAP capabilities" style="height: 600px;">

## Features

* *First* all-in-one solution for general audio-text retrieval.
* Multilingual (8+ languages) speech, music, and sound retrieval.
* Music and sound retrieval performance in English matches previous baselines, while also supporting languages such as Japanese, German, Spanish, Chinese, and Dutch.

## Usage

```bash
pip install glap_model
```

### Scoring audio-text pairs

We provide a simple command-line tool:

```bash
score_glap audio_input_file "text1;text2;text3"
```

Or in Python:

```python
import torch
from glap_model import glap_inference

audio = torch.randn(1, 160000).tanh()  # 10 s of heavy noise at 16 kHz

glap_model = glap_inference()

score = glap_model.score_forward(audio, text=["the sound of noise", "a car is driving", "a person is speaking"])
print(score)
```
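
The snippet above scores random noise; to score a real recording, load it with `soundfile` instead. A minimal sketch, assuming a 16 kHz mono file (the 160000 samples above correspond to 10 s at 16 kHz); `sample.wav` and the candidate texts are placeholders:

```python
import soundfile as sf
import torch
from glap_model import glap_inference

glap_model = glap_inference()

# Load a real file; resample to 16 kHz beforehand if your data differs.
waveform, sample_rate = sf.read("sample.wav", dtype="float32")
assert sample_rate == 16000, "expected 16 kHz input"
audio = torch.from_numpy(waveform).unsqueeze(0)  # shape: (1, num_samples)

score = glap_model.score_forward(audio, text=["a dog barks", "rain is falling"])
print(score)
```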

### Recommended Prompts

| Task   | Prompt                             |
|--------|------------------------------------|
| Speech | {label}                            |
| Music  | The music in the style of {label}. |
| Sound  | The sound of {label} can be heard. |
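
For instance, filling the templates before encoding (a sketch; the labels are made up):

```python
# Recommended templates from the table above.
templates = {
    "Speech": "{label}",
    "Music": "The music in the style of {label}.",
    "Sound": "The sound of {label} can be heard.",
}

# Made-up labels for illustration.
print(templates["Sound"].format(label="rain"))  # The sound of rain can be heard.
print(templates["Music"].format(label="jazz"))  # The music in the style of jazz.
```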

### Batched scoring

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
audio = torch.randn(1, 64000).tanh()  # 4 s of noise at 16 kHz
prefix = "The sound of"
labels = [f"{prefix} {label}" for label in ("Cat", "Dog", "Water", "Noise")]
text_embeds = glap_model.encode_text(labels)
audio_embeds = glap_model.encode_audio(audio)
scores = glap_model.score(audio_embeds, text_embeds)
for label_name, score in zip(labels, scores):
    print(label_name, score)
```
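
Because `encode_text` and `encode_audio` are separate calls, text embeddings can be computed once and reused across many clips. A short sketch of that pattern:

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()

labels = ["The sound of Cat", "The sound of Dog"]
text_embeds = glap_model.encode_text(labels)  # computed once

# Reuse the cached text embeddings for several clips.
for clip in (torch.randn(1, 64000).tanh(), torch.randn(1, 160000).tanh()):
    audio_embeds = glap_model.encode_audio(clip)
    print(glap_model.score(audio_embeds, text_embeds))
```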

## Development

### UV (Recommended)

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
uv venv --python 3.10
source .venv/bin/activate
uv sync

# Alternatively: python3 -m pip install .
# Additionally, sndfile is needed:
# conda install -c conda-forge libsndfile==1.0.31
```

### Pip

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
python3 -m pip install .
# Additionally, sndfile is needed:
# conda install -c conda-forge libsndfile==1.0.31
# Or if you have root, use your package manager
```

### Prepare data

Data needs to be in `tar`/`tar.gz` format:

```
# tar -tf a.tar
908-31957-0013.flac
908-31957-0013.json
2961-960-0013.flac
2961-960-0013.json
```

Each `.json` should have one of the three fields `caption`, `captions`, or `text`.
Data preparation can be done using the `wavlist_to_tar` script provided by the `dasheng` dependency.
Further information on how to process data can be found [here](https://github.com/XiaoMi/dasheng?tab=readme-ov-file#3-training).
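
As a minimal illustration of the expected shard layout (a sketch using Python's standard library, not a replacement for `wavlist_to_tar`; it assumes the `.flac` file already exists on disk):

```python
import json
import tarfile
from pathlib import Path

# Write the caption sidecar; one of `caption`, `captions`, or `text` is required.
Path("908-31957-0013.json").write_text(json.dumps({"caption": "a person is speaking"}))

# Pack the pair into a shard; audio and caption must share a basename.
with tarfile.open("a.tar", "w") as tar:
    tar.add("908-31957-0013.flac")  # assumed to exist on disk
    tar.add("908-31957-0013.json")
```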

### Training

For reference, we provide our original training config for GLAP at `configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml`.

```bash
accelerate launch --mixed-precision='fp16' run.py train configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml
```

### Zero-shot eval (one sample)

```bash
# The ; serves as a separator between text candidates
python3 run.py zeroshot pretrained_checkpoint/glap_checkpoint.pt PATH_TO_WAV_FLAC_MP3_SAMPLE.wav "The sound of a horse;Car;Mama;The sound of music;somebody is speaking;The sound of ein Pferd;一只马;Music is played;音乐的声音;Musik ist zu hoeren;Zero;One;Two;Three"
```

### Retrieval scoring

```bash
# Should be run on a single GPU
accelerate launch --mixed-precision='fp16' run.py evaluate PATH_TO_CHECKPOINT
```

### Notes on DDP

Using unevenly sized training datasets without `resample=True` is not recommended, since workers may otherwise receive different numbers of batches.

## Translating data into a target language

For our experiments, we used SONAR to translate audio captions into seven target languages. This can be reproduced using our code:

```bash
python3 run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

DDP is also supported:

```bash
accelerate launch run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

## Citation

TODO

```bibtex
@inproceedings{dinkel2025glap,
  title={GLAP: General contrastive audio-text pretraining across domains and languages},
  year={2025}
}
```