|
--- |
|
license: ms-pl |
|
tags: |
|
- contrastive audio language pretraining |
|
- audio |
|
- music |
|
- emotion |
|
- sound events |
|
- bioacoustics |
|
- retrieval |
|
- captioning |
|
- zero-shot |
|
- audio-text |
|
- CLAP |
|
--- |
|
|
|
###### [Overview](#CLAP) | [Setup](#Setup) | [CLAP weights](#CLAP-weights) | [Usage](#Usage) | [Citation](#Citation)
|
|
|
# CLAP |
|
|
|
CLAP (Contrastive Language-Audio Pretraining) is a model that learns acoustic concepts from natural language supervision and enables zero-shot inference. The model has been extensively evaluated on 26 downstream audio tasks, achieving state-of-the-art (SoTA) results on several of them, including classification, retrieval, and captioning.
|
|
|
<img width="832" alt="clap_diagrams" src="docs/clap2_diagram.png"> |
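At a high level, zero-shot inference compares a clip's audio embedding against the text embeddings of candidate labels in the shared space. The sketch below illustrates that scoring step; the function name and the temperature constant of 100.0 are illustrative stand-ins, not the model's learned parameters:

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Score audio clips against class prompts via scaled cosine similarity.

    audio_emb: (num_clips, dim); text_emb: (num_classes, dim).
    Returns a (num_clips, num_classes) matrix of class probabilities.
    """
    # Unit-normalize so the dot product equals cosine similarity
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Temperature-scaled similarities; 100.0 is an illustrative constant
    logits = 100.0 * audio_emb @ text_emb.T
    # Softmax over classes yields per-clip label probabilities
    return logits.softmax(dim=-1)
```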
|
|
|
## Setup |
|
|
|
First, install Python 3.8 or higher (3.11 recommended). Then install CLAP using either of the following:
|
|
|
```shell |
|
# Install the PyPI package
|
pip install msclap |
|
|
|
# Or Install latest (unstable) git source |
|
pip install git+https://github.com/microsoft/CLAP.git |
|
``` |
|
|
|
## CLAP weights

CLAP weights are available in three versions: _2022_, _2023_, and _clapcap_.
|
|
|
_clapcap_ is the audio captioning model that uses the 2023 encoders. |
|
|
|
## Usage |
|
|
|
The CLAP source code is available at https://github.com/microsoft/CLAP.
|
|
|
- Zero-Shot Classification and Retrieval |
|
```python |
|
from msclap import CLAP

# Load model (choose between versions '2022' or '2023')
clap_model = CLAP("<PATH TO WEIGHTS>", version='2023', use_cuda=False)

# Example inputs (replace with your own labels and audio files)
class_labels = ['dog barking', 'siren', 'acoustic guitar']
file_paths = ['audio/clip1.wav', 'audio/clip2.wav']

# Extract text embeddings from the class labels (List[str])
text_embeddings = clap_model.get_text_embeddings(class_labels)

# Extract audio embeddings from the audio files (List[str])
audio_embeddings = clap_model.get_audio_embeddings(file_paths)

# Compute similarity between audio and text embeddings
similarities = clap_model.compute_similarity(audio_embeddings, text_embeddings)
|
``` |
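Continuing the snippet above, the similarity matrix can be turned into per-clip predictions. A minimal sketch, assuming `similarities` is a `(num_audio_files, num_class_labels)` tensor and reusing `class_labels` and `file_paths` from the example:

```python
import torch.nn.functional as F

# Turn similarities into per-clip probabilities over the class labels
probs = F.softmax(similarities, dim=1)

# Report the best-matching label for each audio file
for path, p in zip(file_paths, probs):
    top_prob, top_idx = p.max(dim=0)
    print(f"{path}: {class_labels[top_idx.item()]} ({top_prob.item():.2%})")
```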
|
|
|
- Audio Captioning |
|
```python |
|
from msclap import CLAP

# Load model (choose version 'clapcap')
clap_model = CLAP("<PATH TO WEIGHTS>", version='clapcap', use_cuda=False)

# Example inputs (replace with your own audio files)
file_paths = ['audio/clip1.wav', 'audio/clip2.wav']

# Generate a caption for each audio file (List[str] in, List[str] out)
captions = clap_model.generate_caption(file_paths)
|
``` |
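Assuming `generate_caption` returns one string per input path, as the example suggests, the captions line up one-to-one with the files and can be printed directly:

```python
# Print each file alongside its generated caption
for path, caption in zip(file_paths, captions):
    print(f"{path}: {caption}")
```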
|
|
|
|
|
## Citation |
|
|
|
Kindly cite our work if you find it useful. |
|
|
|
[CLAP: Learning Audio Concepts from Natural Language Supervision](https://ieeexplore.ieee.org/abstract/document/10095889) |
|
``` |
|
@inproceedings{CLAP2022, |
|
title={CLAP: Learning Audio Concepts from Natural Language Supervision},
|
author={Elizalde, Benjamin and Deshmukh, Soham and Al Ismail, Mahmoud and Wang, Huaming}, |
|
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, |
|
pages={1--5}, |
|
year={2023}, |
|
organization={IEEE} |
|
} |
|
``` |
|
|
|
[Natural Language Supervision for General-Purpose Audio Representations](https://arxiv.org/abs/2309.05767) |
|
``` |
|
@misc{CLAP2023, |
|
title={Natural Language Supervision for General-Purpose Audio Representations}, |
|
author={Benjamin Elizalde and Soham Deshmukh and Huaming Wang}, |
|
year={2023}, |
|
eprint={2309.05767}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.SD}, |
|
url={https://arxiv.org/abs/2309.05767} |
|
} |
|
``` |
|
|
|
## Trademarks |
|
|
|
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft |
|
trademarks or logos is subject to and must follow |
|
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general). |
|
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. |
|
Any use of third-party trademarks or logos is subject to those third parties' policies.
|
|