short_description: Audio Flamingo 3 demo for multi-turn multi-audio chat
---

<div align="center" style="display: flex; justify-content: center; align-items: center; text-align: center;">
  <a href="https://github.com/NVIDIA/audio-flamingo" style="margin-right: 20px; text-decoration: none; display: flex; align-items: center;">
    <img src="static/logo-no-bg.png" alt="Audio Flamingo 3" width="120">
  </a>
</div>
<div align="center" style="display: flex; justify-content: center; align-items: center; text-align: center;">
  <h2>
    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models
  </h2>
</div>

<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
  <a href=""><img src="https://img.shields.io/badge/arXiv-2503.03983-AD1C18" style="margin-right: 5px;"></a>
  <a href="https://research.nvidia.com/labs/adlr/AF3/"><img src="https://img.shields.io/badge/Demo page-228B22" style="margin-right: 5px;"></a>
  <a href="https://github.com/NVIDIA/audio-flamingo"><img src='https://img.shields.io/badge/Github-Audio Flamingo 3-9C276A' style="margin-right: 5px;"></a>
  <a href="https://github.com/NVIDIA/audio-flamingo/stargazers"><img src="https://img.shields.io/github/stars/NVIDIA/audio-flamingo.svg?style=social"></a>
</div>

<div align="center" style="display: flex; justify-content: center; margin-top: 10px; flex-wrap: wrap; gap: 5px;">
  <a href="https://huggingface.co/nvidia/audio-flamingo-3">
    <img src="https://img.shields.io/badge/🤗-Checkpoints-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/nvidia/audio-flamingo-3-chat">
    <img src="https://img.shields.io/badge/🤗-Checkpoints (Chat)-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/AudioSkills">
    <img src="https://img.shields.io/badge/🤗-Dataset: AudioSkills--XL-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/LongAudio">
    <img src="https://img.shields.io/badge/🤗-Dataset: LongAudio--XL-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/AF-Chat">
    <img src="https://img.shields.io/badge/🤗-Dataset: AF--Chat-ED5A22.svg">
  </a>
  <a href="https://huggingface.co/datasets/nvidia/AF-Think">
    <img src="https://img.shields.io/badge/🤗-Dataset: AF--Think-ED5A22.svg">
  </a>
</div>

<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
  <a href="https://huggingface.co/spaces/nvidia/audio_flamingo_3"><img src="https://img.shields.io/badge/🤗-Gradio Demo (7B)-5F9EA0.svg" style="margin-right: 5px;"></a>
</div>

## Overview

This repo contains the PyTorch implementation of [Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models](). Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:

- Unified audio representation learning across speech, sound, and music
- Flexible, on-demand chain-of-thought reasoning (Thinking in Audio)
- Long-context audio comprehension (including speech) for inputs up to 10 minutes
- Multi-turn, multi-audio conversational dialogue (AF3-Chat)
- Voice-to-voice interaction (AF3-Chat)

Extensive evaluations confirm AF3's effectiveness, setting new state-of-the-art results on over 20 public audio understanding and reasoning benchmarks.

## Main Results

Audio Flamingo 3 outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.

<div align="center">
  <img class="img-full" src="static/af3_radial-1.png" width="300">
</div>

<div align="center">
  <img class="img-full" src="static/af3_sota.png" width="400">
</div>
+
|
| 83 |
+
## Audio Flamingo 3 Architecture
|
| 84 |
+
|
| 85 |
+
Audio Flamingo 3 uses AF-Whisper unified audio encoder, MLP-based audio adaptor, Decoder-only LLM backbone (Qwen2.5-7B), and Streaming TTS module (AF3-Chat).
|
| 86 |
+
Audio Flamingo 3 can take up to 10 minutes of audio inputs.
|
| 87 |
+
|
| 88 |
+
<div align="center">
|
| 89 |
+
<img class="img-full" src="static/af3_main_diagram-1.png" width="800">
|
| 90 |
+
</div>
|
| 91 |
+
|
| 92 |
+
## Installation
|
| 93 |
+
|
| 94 |
+
```bash
|
| 95 |
+
./environment_setup.sh af3
|
| 96 |
+
```
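
Since the model accepts at most 10 minutes of audio, it can be useful to screen input files before launching inference. The helper below is a hypothetical convenience sketch (not part of this repo) that checks a PCM WAV file's duration with only the Python standard library:

```python
import wave

MAX_SECONDS = 10 * 60  # AF3's stated 10-minute input limit


def audio_duration_ok(path, max_seconds=MAX_SECONDS):
    """Return (duration_in_seconds, fits) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        duration = w.getnframes() / w.getframerate()
    return duration, duration <= max_seconds
```

For compressed formats (MP3, FLAC, etc.) a library such as soundfile or ffprobe would be needed instead; the sketch above covers only uncompressed WAV.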

## Code Structure

- The folder `audio_flamingo_3/` contains the main training and inference code of Audio Flamingo 3.
- The folder `audio_flamingo_3/scripts` contains the inference scripts of Audio Flamingo 3, in case you would like to use our pretrained checkpoints on HuggingFace.

Each folder is self-contained, and we expect no cross-dependencies between these folders. This repo does not contain the code for the Streaming-TTS pipeline, which will be released in the near future.

## Single Line Inference

To run inference with the stage 3 model directly, run the command below:

```bash
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav
```

To run inference with the stage 3.5 model, run the command below:

```bash
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --model-path /path/to/checkpoint/af3-7b/stage35 --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav --peft-mode
```
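
For scripted or batch use, the two invocations above can be assembled programmatically. The helper below is a hypothetical convenience wrapper (not part of the repo); it only composes the CLI strings shown above, with stage 3.5 differing from stage 3 by the extra `--model-path` and `--peft-mode` flags:

```python
import shlex


def build_infer_cmd(model_base, media,
                    text="Please describe the audio in detail",
                    stage35_path=None):
    """Compose an AF3 inference command line as a shell-safe string."""
    cmd = ["python", "llava/cli/infer_audio.py",
           "--model-base", model_base,
           "--conv-mode", "auto",
           "--text", text,
           "--media", media]
    if stage35_path is not None:
        # Stage 3.5 loads a PEFT adapter on top of the base checkpoint.
        cmd += ["--model-path", stage35_path, "--peft-mode"]
    return shlex.join(cmd)


print(build_infer_cmd("/path/to/checkpoint/af3-7b", "static/audio1.wav"))
```

`shlex.join` (Python 3.8+) quotes each argument, so prompts containing spaces remain a single `--text` value.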

## References

The main training and inference code within each folder is adapted from [NVILA](https://github.com/NVlabs/VILA/tree/main) ([Apache license](incl_licenses/License_1.md)).

## License

- The code in this repo is under the [MIT license](incl_licenses/MIT_license.md).
- The checkpoints are for non-commercial use only ([NVIDIA OneWay Noncommercial License](incl_licenses/NVIDIA_OneWay_Noncommercial_License.docx)). They are also subject to the [Qwen Research license](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE), the [Terms of Use](https://openai.com/policies/terms-of-use) for data generated by OpenAI, and the original licenses accompanying each training dataset.
- Notice: Audio Flamingo 3 is built with Qwen2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.

## Citation

- Audio Flamingo 2
```
@article{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2503.03983},
  year={2025}
}
```

- Audio Flamingo
```
@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}
```