---
license: other
title: Audio Flamingo 3 Demo
sdk: gradio
emoji: π
colorFrom: green
colorTo: green
pinned: true
short_description: Audio Flamingo 3 Demo
---
# Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models

## Overview
This repo contains the PyTorch implementation of Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models. Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:
- Unified audio representation learning (speech, sound, music)
- Flexible, on-demand chain-of-thought reasoning (Thinking in Audio)
- Long-context audio comprehension, including speech, for inputs up to 10 minutes
- Multi-turn, multi-audio conversational dialogue (AF3-Chat)
- Voice-to-voice interaction (AF3-Chat)
Extensive evaluations confirm AF3's effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.
## Main Results
Audio Flamingo 3 outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.


## Audio Flamingo 3 Architecture
Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). Audio Flamingo 3 can take audio inputs up to 10 minutes long.
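For intuition, the sketch below mirrors that pipeline in PyTorch. It is a minimal schematic, not the actual AF3 implementation: the class name, the stand-in encoder layer, and the feature dimensions are illustrative assumptions (3584 matches Qwen2.5-7B's hidden size).

```python
import torch
import torch.nn as nn

# Schematic sketch only: module names and dimensions are illustrative,
# not the actual AF3 implementation.
class AudioFlamingo3Sketch(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=3584):
        super().__init__()
        # AF-Whisper-style encoder producing frame-level audio features
        # (stand-in: a single transformer encoder layer)
        self.audio_encoder = nn.TransformerEncoderLayer(
            d_model=audio_dim, nhead=8, batch_first=True
        )
        # MLP adaptor projecting audio features into the LLM embedding space
        self.adaptor = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features, text_embeddings):
        # Encode audio, project to LLM space, and prepend to the text tokens;
        # the decoder-only LLM (Qwen2.5-7B in AF3) would consume this sequence.
        audio_tokens = self.adaptor(self.audio_encoder(audio_features))
        return torch.cat([audio_tokens, text_embeddings], dim=1)

# Toy shapes: batch of 1, 100 audio frames, 16 text tokens
model = AudioFlamingo3Sketch()
fused = model(torch.randn(1, 100, 1280), torch.randn(1, 16, 3584))
print(fused.shape)  # torch.Size([1, 116, 3584])
```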

## Installation
./environment_setup.sh af3
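After setup, a quick sanity check can confirm the environment activated correctly; this assumes PyTorch is among the dependencies that environment_setup.sh installs:

```python
# Sanity check for the af3 environment (assumes PyTorch is installed by setup)
import torch

print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```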
## Code Structure
- The folder audio_flamingo_3/ contains the main training and inference code of Audio Flamingo 3.
- The folder audio_flamingo_3/scripts contains the inference scripts of Audio Flamingo 3, in case you would like to use our pretrained checkpoints on HuggingFace.
Each folder is self-contained, and we expect no cross-dependencies between these folders. This repo does not contain the code for the Streaming-TTS pipeline, which will be released in the near future.
## Single Line Inference

To run inference with the stage 3 model directly, run the command below:
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav
To run inference with the stage 3.5 model, run the command below:
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --model-path /path/to/checkpoint/af3-7b/stage35 --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav --peft-mode
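To run the same prompt over many files, one option is a small wrapper around the CLI above. This is a hypothetical convenience script, not part of the repo: the flags mirror the stage 3 command, and the checkpoint path and audio directory are placeholders.

```python
import subprocess
from pathlib import Path

# Hypothetical batch wrapper around the documented CLI; the flags match the
# single-line inference commands above, and the paths are placeholders.
MODEL_BASE = "/path/to/checkpoint/af3-7b"

for wav in sorted(Path("static").glob("*.wav")):
    subprocess.run(
        [
            "python", "llava/cli/infer_audio.py",
            "--model-base", MODEL_BASE,
            "--conv-mode", "auto",
            "--text", "Please describe the audio in detail",
            "--media", str(wav),
        ],
        check=True,
    )
```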
## References

The main training and inference code within each folder is modified from NVILA, which is released under the Apache license.
## License

- The code in this repo is released under the MIT license.
- The checkpoints are for non-commercial use only, under the NVIDIA OneWay Noncommercial License. They are also subject to the Qwen Research license, the Terms of Use for data generated by OpenAI, and the original licenses accompanying each training dataset.
- Notice: Audio Flamingo 3 is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
## Citation
- Audio Flamingo 2

@article{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2503.03983},
  year={2025}
}
- Audio Flamingo

@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}