---
license: other
title: Audio Flamingo 3 Demo
sdk: gradio
emoji: π
colorFrom: green
colorTo: green
pinned: true
short_description: Audio Flamingo 3 Demo
---
# Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models

## Overview
This repo contains the PyTorch implementation of Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio-Language Models. Audio Flamingo 3 (AF3) is a fully open, state-of-the-art Large Audio-Language Model (LALM) that advances reasoning and understanding across speech, sounds, and music. AF3 builds on previous work with innovations in:
- Unified audio representation learning (speech, sound, music)
- Flexible, on-demand chain-of-thought reasoning (Thinking in Audio)
- Long-context audio comprehension, including speech, for inputs up to 10 minutes
- Multi-turn, multi-audio conversational dialogue (AF3-Chat)
- Voice-to-voice interaction (AF3-Chat)
Extensive evaluations confirm AF3's effectiveness, setting new benchmarks on over 20 public audio understanding and reasoning tasks.
## Main Results
Audio Flamingo 3 outperforms prior SOTA models, including GAMA, Audio Flamingo, Audio Flamingo 2, Qwen-Audio, Qwen2-Audio, Qwen2.5-Omni, LTU, LTU-AS, SALMONN, AudioGPT, Gemini Flash v2, and Gemini Pro v1.5, on a number of understanding and reasoning benchmarks.


## Audio Flamingo 3 Architecture
Audio Flamingo 3 uses the AF-Whisper unified audio encoder, an MLP-based audio adaptor, a decoder-only LLM backbone (Qwen2.5-7B), and a streaming TTS module (AF3-Chat). Audio Flamingo 3 can take audio inputs up to 10 minutes long.
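For intuition, the sketch below mirrors that pipeline in PyTorch. It is a minimal schematic, not the actual AF3 implementation: the class name, the stand-in encoder layer, and the feature dimensions are illustrative assumptions (3584 matches Qwen2.5-7B's hidden size).

```python
import torch
import torch.nn as nn

# Schematic sketch only: module names and dimensions are illustrative,
# not the actual AF3 implementation.
class AudioFlamingo3Sketch(nn.Module):
    def __init__(self, audio_dim=1280, llm_dim=3584):
        super().__init__()
        # AF-Whisper-style encoder producing frame-level audio features
        # (stand-in: a single transformer encoder layer)
        self.audio_encoder = nn.TransformerEncoderLayer(
            d_model=audio_dim, nhead=8, batch_first=True
        )
        # MLP adaptor projecting audio features into the LLM embedding space
        self.adaptor = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features, text_embeddings):
        # Encode audio, project to LLM space, and prepend to the text tokens;
        # the decoder-only LLM (Qwen2.5-7B in AF3) would consume this sequence.
        audio_tokens = self.adaptor(self.audio_encoder(audio_features))
        return torch.cat([audio_tokens, text_embeddings], dim=1)

# Toy shapes: batch of 1, 100 audio frames, 16 text tokens
model = AudioFlamingo3Sketch()
fused = model(torch.randn(1, 100, 1280), torch.randn(1, 16, 3584))
print(fused.shape)  # torch.Size([1, 116, 3584])
```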

## Installation
./environment_setup.sh af3
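After setup, a quick sanity check can confirm the environment activated correctly; this assumes PyTorch is among the dependencies that environment_setup.sh installs:

```python
# Sanity check for the af3 environment (assumes PyTorch is installed by setup)
import torch

print("torch", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```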
## Code Structure
- The folder audio_flamingo_3/ contains the main training and inference code of Audio Flamingo 3.
- The folder audio_flamingo_3/scripts contains the inference scripts of Audio Flamingo 3, in case you would like to use our pretrained checkpoints on HuggingFace.
Each folder is self-contained, and we expect no cross-dependencies between these folders. This repo does not contain the code for the Streaming-TTS pipeline, which will be released in the near future.
## Single Line Inference

To run inference with the stage 3 model directly, run the command below:
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav
To run inference with the stage 3.5 model, run the command below:
python llava/cli/infer_audio.py --model-base /path/to/checkpoint/af3-7b --model-path /path/to/checkpoint/af3-7b/stage35 --conv-mode auto --text "Please describe the audio in detail" --media static/audio1.wav --peft-mode
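To run the same prompt over many files, one option is a small wrapper around the CLI above. This is a hypothetical convenience script, not part of the repo: the flags mirror the stage 3 command, and the checkpoint path and audio directory are placeholders.

```python
import subprocess
from pathlib import Path

# Hypothetical batch wrapper around the documented CLI; the flags match the
# single-line inference commands above, and the paths are placeholders.
MODEL_BASE = "/path/to/checkpoint/af3-7b"

for wav in sorted(Path("static").glob("*.wav")):
    subprocess.run(
        [
            "python", "llava/cli/infer_audio.py",
            "--model-base", MODEL_BASE,
            "--conv-mode", "auto",
            "--text", "Please describe the audio in detail",
            "--media", str(wav),
        ],
        check=True,
    )
```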
## References

The main training and inference code within each folder is modified from NVILA, which is released under the Apache license.
## License

- The code in this repo is released under the MIT license.
- The checkpoints are for non-commercial use only, under the NVIDIA OneWay Noncommercial License. They are also subject to the Qwen Research license, the Terms of Use for data generated by OpenAI, and the original licenses accompanying each training dataset.
- Notice: Audio Flamingo 3 is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
## Citation
- Audio Flamingo 2

@article{ghosh2025audio,
  title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
  author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:2503.03983},
  year={2025}
}
- Audio Flamingo

@inproceedings{kong2024audio,
  title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
  author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
  booktitle={International Conference on Machine Learning},
  pages={25125--25148},
  year={2024},
  organization={PMLR}
}