Update README.md
README.md
CHANGED
---
license: apache-2.0
---

<div align="center">
<h1>
GLAP (Generalized Language Audio Pretraining)
</h1>
<p>
Official PyTorch code for <b>GLAP</b> <br>
<b><em>Generalized Language Audio Pretraining</em></b>
</p>
<a href="https://arxiv.org/abs/"><img src="https://img.shields.io/badge/" alt="version"></a>
<a href="https://github.com/xiaomi/glap"><img src="https://img.shields.io/badge/Platform-linux-lightgrey" alt="platform"></a>
<a href="https://www.python.org"><img src="https://img.shields.io/badge/Python-3.10+-orange" alt="python"></a>
<a href="https://pytorch.org"><img src="https://img.shields.io/badge/PyTorch-2.0+-brightgreen" alt="pytorch"></a>
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/License-Apache%202.0-blue.svg" alt="apache-2.0"></a>
<img src="https://img.shields.io/pypi/dm/glap_model" alt="PyPI Downloads">

</div>

# GLAP (Generalized Language Audio Pretraining)

<img src="capabilities.png" alt="GLAP capabilities" style="height: 600px;">

## Features

* *First* all-in-one solution for general audio-text retrieval.
* Multilingual (8+ languages) Speech, Music and Sound retrieval.
* Music and Sound retrieval performance in English matches previous baselines, while also **supporting** languages such as Japanese, German, Spanish, Chinese, Dutch and more.

## Usage

```bash
pip install glap_model
```

### Scoring audio-text pairs

We provide a simple command-line tool:

```bash
score_glap audio_input_file "text1;text2;text3"
```

Or in Python:

```python
import torch
from glap_model import glap_inference

audio = torch.randn(1, 160000).tanh() # 10s of heavy noise

glap_model = glap_inference()

score = glap_model.score_forward(audio, text=["the sound of noise", "a car is driving", "a person is speaking"])
print(score)
```
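
The snippet above scores random noise; real recordings can be loaded with `soundfile` (this is why `libsndfile` appears as a dependency in the Development section below). A minimal sketch under the assumption of a mono file already at 16 kHz (the rate implied by the 160000-samples-for-10-s example above); `my_audio.wav` is a placeholder path:

```python
import soundfile as sf
import torch
from glap_model import glap_inference

# Placeholder path; any wav/flac file readable by libsndfile works.
wav, sr = sf.read("my_audio.wav", dtype="float32")
assert sr == 16000, "resample to 16 kHz first"  # assumption: GLAP expects 16 kHz mono input
audio = torch.from_numpy(wav).unsqueeze(0)      # shape (1, num_samples)

glap_model = glap_inference()
score = glap_model.score_forward(audio, text=["a dog is barking", "rain is falling", "a person is speaking"])
print(score)
```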

### Recommended Prompts

| Task   | Prompt                             |
|--------|------------------------------------|
| Speech | {label}                            |
| Music  | The music in the style of {label}. |
| Sound  | The sound of {label} can be heard. |
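
For example, a short sketch that fills the sound template before scoring; the audio is again placeholder noise and the labels are arbitrary:

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
audio = torch.randn(1, 160000).tanh()  # placeholder input, as in the usage example

# Wrap raw labels in the recommended sound prompt before scoring.
labels = ["rain", "a dog barking", "an acoustic guitar"]
prompts = [f"The sound of {label} can be heard." for label in labels]

scores = glap_model.score_forward(audio, text=prompts)
print(scores)
```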

### Batched scoring

```python
import torch
from glap_model import glap_inference

glap_model = glap_inference()
audio = torch.randn(1, 64000).tanh()
prefix = "The sound of"
labels = [f"{prefix} {label}" for label in ("Cat", "Dog", "Water", "Noise")]
text_embeds = glap_model.encode_text(labels)
audio_embeds = glap_model.encode_audio(audio)
scores = glap_model.score(audio_embeds, text_embeds)
for label_name, score in zip(labels, scores):
    print(label_name, score)
```
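
Because the text embeddings do not depend on the audio, they can be computed once and reused when scoring many files. A hedged sketch building on the calls above, assuming 16 kHz mono inputs loaded with `soundfile`; the file names are placeholders:

```python
import soundfile as sf
import torch
from glap_model import glap_inference

glap_model = glap_inference()

labels = ["The sound of a cat", "The sound of a dog", "The sound of water"]
text_embeds = glap_model.encode_text(labels)  # computed once, reused for every file

for path in ["clip1.wav", "clip2.wav"]:  # placeholder file names
    wav, sr = sf.read(path, dtype="float32")
    audio = torch.from_numpy(wav).unsqueeze(0)
    audio_embeds = glap_model.encode_audio(audio)
    scores = glap_model.score(audio_embeds, text_embeds)
    for label, score in zip(labels, scores):
        print(path, label, score)
```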

## Development

### UV (Recommended)

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
uv venv --python 3.10
source .venv/bin/activate
uv sync

#python3 -m pip install .
# Additionally, sndfile is needed
# conda install -c conda-forge libsndfile==1.0.31
```

### Pip

```bash
git clone https://github.com/xiaomi-research/GLAP
cd GLAP
python3 -m pip install .
# Additionally, sndfile is needed
# conda install -c conda-forge libsndfile==1.0.31
# Or if you have root, use your package manager
```

### Prepare data

Data needs to be in `tar/tar.gz` format:

```
# tar -tf a.tar
908-31957-0013.flac
908-31957-0013.json
2961-960-0013.flac
2961-960-0013.json
```

Each `.json` should have one of three fields: `caption`, `captions` or `text`.
Data preparation can be done using the `wavlist_to_tar` script, which is provided in the `dasheng` dependency.
Further information on how to process data can be found [here](https://github.com/XiaoMi/dasheng?tab=readme-ov-file#3-training).
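
As an illustration of the expected shard layout, the sketch below packs one audio/caption pair into a tar using the standard library plus `soundfile`; in practice the `wavlist_to_tar` script above handles whole file lists. The file names and caption are made up for the example, and files sharing a stem form one sample (as in the listing above):

```python
import json
import tarfile

import numpy as np
import soundfile as sf

# Write a dummy one-second 16 kHz clip and its caption (made-up content).
sf.write("908-31957-0013.flac", np.zeros(16000, dtype="float32"), 16000)
with open("908-31957-0013.json", "w") as f:
    json.dump({"caption": "a person is speaking"}, f)

# Pack both files into a shard that `tar -tf a.tar` would list as shown above.
with tarfile.open("a.tar", "w") as tar:
    tar.add("908-31957-0013.flac")
    tar.add("908-31957-0013.json")
```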

### Training

For reference, we provide our original training config for GLAP: `configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml`.

```bash
accelerate launch --mixed-precision='fp16' run.py train configs/train/multilingual_dasheng_asr_sound2_sigmoidloss_balanced.yaml
```

### Zeroshot eval (one sample)

```bash
# The ; is a separator for different text keys
python3 run.py zeroshot pretrained_checkpoint/glap_checkpoint.pt PATH_TO_WAV_FLAC_MP3_SAMPLE.wav "The sound of a horse;Car;Mama;The sound of music;somebody is speaking;The sound of ein Pferd;一只马;Music is played;音乐的声音;Musik ist zu hoeren;Zero;One;Two;Three"
```

### Retrieval scoring

```bash
# Should be run on a single GPU
accelerate launch --mixed-precision='fp16' run.py evaluate PATH_TO_CHECKPOINT
```

### Notes on DDP

Using uneven training datasets without `resample=True` is not recommended.

## Translating data into a target language

For our experiments we used SONAR to translate audio captions into seven target languages. This can be reproduced using our code:

```bash
python3 run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

DDP is also supported:

```bash
accelerate launch run.py translate_sonar data/WavCaps/freesound/freesound_train_sample_0000* --output_path data/translations/WavCaps/freesound/
```

## Citation

TODO
```bibtex
@inproceedings{dinkel2025glap,
  title={GLAP: General contrastive audio-text pretraining across domains and languages},
  year={2025}
}
```