---
license: other
title: Audio Flamingo 3 Demo
sdk: gradio
emoji: 🚀
colorFrom: green
colorTo: green
pinned: true
short_description: Audio Flamingo 3 Demo
---
<div align="center" style="display: flex; justify-content: center; align-items: center; text-align: center;">
<a href="https://github.com/NVIDIA/audio-flamingo" style="margin-right: 20px; text-decoration: none; display: flex; align-items: center;">
<img src="static/logo-no-bg.png" alt="Audio Flamingo 3 π₯ππ₯" width="120">
</a>
</div>
<div align="center" style="display: flex; justify-content: center; align-items: center; text-align: center;">
<h2>
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models
</h2>
</div>
<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
<a href=""><img src="https://img.shields.io/badge/arXiv-2503.03983-AD1C18" style="margin-right: 5px;"></a>
<a href="https://research.nvidia.com/labs/adlr/AF3/"><img src="https://img.shields.io/badge/Demo page-228B22" style="margin-right: 5px;"></a>
<a href="https://github.com/NVIDIA/audio-flamingo"><img src='https://img.shields.io/badge/Github-Audio Flamingo 3-9C276A' style="margin-right: 5px;"></a>
<a href="https://github.com/NVIDIA/audio-flamingo/stargazers"><img src="https://img.shields.io/github/stars/NVIDIA/audio-flamingo.svg?style=social"></a>
</div>
<div align="center" style="display: flex; justify-content: center; margin-top: 10px; flex-wrap: wrap; gap: 5px;">
<a href="https://huggingface.co/nvidia/audio-flamingo-3">
<img src="https://img.shields.io/badge/π€-Checkpoints-ED5A22.svg">
</a>
<a href="https://huggingface.co/nvidia/audio-flamingo-3-chat">
<img src="https://img.shields.io/badge/π€-Checkpoints (Chat)-ED5A22.svg">
</a>
<a href="https://huggingface.co/datasets/nvidia/AudioSkills">
<img src="https://img.shields.io/badge/π€-Dataset: AudioSkills--XL-ED5A22.svg">
</a>
<a href="https://huggingface.co/datasets/nvidia/LongAudio">
<img src="https://img.shields.io/badge/π€-Dataset: LongAudio--XL-ED5A22.svg">
</a>
<a href="https://huggingface.co/datasets/nvidia/AF-Chat">
<img src="https://img.shields.io/badge/π€-Dataset: AF--Chat-ED5A22.svg">
</a>
<a href="https://huggingface.co/datasets/nvidia/AF-Think">
<img src="https://img.shields.io/badge/π€-Dataset: AF--Think-ED5A22.svg">
</a>
</div>
<div align="center" style="display: flex; justify-content: center; margin-top: 10px;">
<a href="https://huggingface.co/spaces/nvidia/audio_flamingo_3"><img src="https://img.shields.io/badge/π€-Gradio Demo (7B)-5F9EA0.svg" style="margin-right: 5px;"></a>
</div>
## Overview
This repo contains the PyTorch implementation of [Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models](). Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:
- Unified audio representation learning (speech, sound, music)
- Flexible, on-demand chain-of-thought reasoning (Thinking in Audio)
- Long-context audio comprehension (including speech, with audio inputs up to 10 minutes)
- Multi-turn, multi-audio conversational dialogue (AF3-Chat)
- Voice-to-voice interaction (AF3-Chat)
Extensive evaluations confirm AF3's effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.
## Main Results
Audio Flamingo 3 outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.
<div align="center">
<img class="img-full" src="static/af3_radial-1.png" width="300">
</div>
<div align="center">
<img class="img-full" src="static/af3_sota.png" width="400">
</div>
## Audio Flamingo 3 Architecture
Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat).
Audio Flamingo 3 can take up to 10 minutes of audio as input.
<div align="center">
<img class="img-full" src="static/af3_main_diagram-1.png" width="800">
</div>
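To make the data flow concrete, here is a minimal, self-contained sketch of how an MLP-based adaptor bridges audio-encoder features into the LLM embedding space. The class, dimensions, and tensors below are illustrative stand-ins (Whisper-large-style 1280-d features, a Qwen2.5-7B-style 3584-d hidden size), not the repo's actual implementation:

```python
# Illustrative sketch only: AF-Whisper, the adaptor, and Qwen2.5-7B are
# stood in for by a toy module and random tensors with plausible shapes.
import torch
import torch.nn as nn

class AudioAdaptor(nn.Module):
    """MLP that projects audio-encoder features into the LLM embedding space."""
    def __init__(self, audio_dim: int = 1280, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # (batch, frames, audio_dim) -> (batch, frames, llm_dim)
        return self.proj(audio_feats)

# Toy forward pass: encoded audio frames are projected, then concatenated
# with the embedded text prompt before being fed to the LLM backbone.
audio_feats = torch.randn(1, 750, 1280)   # stand-in for audio-encoder output
audio_tokens = AudioAdaptor()(audio_feats)
text_tokens = torch.randn(1, 32, 3584)    # stand-in for embedded text prompt
llm_input = torch.cat([audio_tokens, text_tokens], dim=1)
print(llm_input.shape)                    # torch.Size([1, 782, 3584])
```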
## Installation
```bash
./environment_setup.sh af3
```
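Assuming the script above creates a conda environment named `af3` (its argument), a quick sanity check after activating the environment might look like this; it is a generic check, not part of the repo:

```python
# Generic post-install sanity check (run inside the activated `af3` env):
# confirms PyTorch imports cleanly and a CUDA GPU is visible for the 7B model.
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```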
## Code Structure
- The folder `audio_flamingo_3/` contains the main training and inference code of Audio Flamingo 3.
- The folder `audio_flamingo_3/scripts` contains the inference scripts of Audio Flamingo 3, in case you would like to use our pretrained checkpoints on HuggingFace.

Each folder is self-contained, and we expect no cross-dependencies between these folders. This repo does not contain the code for the Streaming-TTS pipeline, which will be released in the near future.
## Single Line Inference
To run inference with the stage-3 model directly, run the command below:
```bash
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav
```
To run inference with the stage-3.5 model, run the command below:
```bash
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --model-path /path/to/checkpoint/af3-7b/stage35 --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav --peft-mode
```
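To caption several files in one go, a small (hypothetical) wrapper can loop the same CLI over a directory; `CKPT` and `AUDIO_DIR` below are placeholders for your own checkpoint and audio paths:

```python
# Hypothetical convenience wrapper: loops the documented CLI over a folder
# of WAV files. Only the flags shown in the commands above are used.
import subprocess
from pathlib import Path

CKPT = "/path/to/checkpoint/af3-7b"   # placeholder: your checkpoint path
AUDIO_DIR = Path("static")            # placeholder: your audio directory

for wav in sorted(AUDIO_DIR.glob("*.wav")):
    result = subprocess.run(
        [
            "python", "llava/cli/infer_audio.py",
            "--model-base", CKPT,
            "--conv-mode", "auto",
            "--text", "Please describe the audio in detail",
            "--media", str(wav),
        ],
        capture_output=True, text=True, check=True,
    )
    print(wav.name, "->", result.stdout.strip())
```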
## References
The main training and inference code within each folder is modified from [NVILA](https://github.com/NVlabs/VILA/tree/main) ([Apache license](incl_licenses/License_1.md)).
## License
- The code in this repo is under [MIT license](incl_licenses/MIT_license.md).
- The checkpoints are for non-commercial use only, under the [NVIDIA OneWay Noncommercial License](incl_licenses/NVIDIA_OneWay_Noncommercial_License.docx). They are also subject to the [Qwen Research license](https://huggingface.co/Qwen/Qwen2.5-7B/blob/main/LICENSE), the [Terms of Use](https://openai.com/policies/terms-of-use) for data generated by OpenAI, and the original licenses accompanying each training dataset.
- Notice: Audio Flamingo 3 is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
## Citation
- Audio Flamingo 2
```
@article{ghosh2025audio,
title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
journal={arXiv preprint arXiv:2503.03983},
year={2025}
}
```
- Audio Flamingo
```
@inproceedings{kong2024audio,
title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
booktitle={International Conference on Machine Learning},
pages={25125--25148},
year={2024},
organization={PMLR}
}
``` |