audio

zzfive's Collections

updated Apr 29

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

Paper • 2405.18503 • Published May 28, 2024 • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Paper • 2405.20289 • Published May 30, 2024 • 11
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Paper • 2406.02897 • Published Jun 5, 2024 • 16
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

Paper • 2406.03344 • Published Jun 5, 2024 • 22
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities

Paper • 2406.11768 • Published Jun 17, 2024 • 24
Towards Robust Speech Representation Learning for Thousands of Languages

Paper • 2407.00837 • Published Jun 30, 2024 • 11
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

Paper • 2407.01494 • Published Jul 1, 2024 • 16
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Paper • 2407.02869 • Published Jul 3, 2024 • 21
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Paper • 2407.04051 • Published Jul 4, 2024 • 40
Video-to-Audio Generation with Hidden Alignment

Paper • 2407.07464 • Published Jul 10, 2024 • 17
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity

Paper • 2407.10387 • Published Jul 15, 2024 • 8
Qwen2-Audio Technical Report

Paper • 2407.10759 • Published Jul 15, 2024 • 64
Audio Conditioning for Music Generation via Discrete Bottleneck Features

Paper • 2407.12563 • Published Jul 17, 2024 • 7
Stable Audio Open

Paper • 2407.14358 • Published Jul 19, 2024 • 27
Efficient Audio Captioning with Encoder-Level Knowledge Distillation

Paper • 2407.14329 • Published Jul 19, 2024 • 5
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation

Paper • 2407.15060 • Published Jul 21, 2024 • 9
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent

Paper • 2407.21646 • Published Jul 31, 2024 • 18
Open-Vocabulary Audio-Visual Semantic Segmentation

Paper • 2407.21721 • Published Jul 31, 2024 • 9
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

Paper • 2408.01337 • Published Aug 2, 2024 • 11
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation

Paper • 2408.01708 • Published Aug 3, 2024 • 4
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation

Paper • 2408.03588 • Published Aug 7, 2024 • 8
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency

Paper • 2408.04708 • Published Aug 8, 2024 • 8
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation

Paper • 2408.07547 • Published Aug 14, 2024 • 9
Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Paper • 2408.08019 • Published Aug 15, 2024 • 11
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29, 2024 • 50
The VoxCeleb Speaker Recognition Challenge: A Retrospective

Paper • 2408.14886 • Published Aug 27, 2024 • 11
FLUX that Plays Music

Paper • 2409.00587 • Published Sep 1, 2024 • 33
Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders

Paper • 2409.00391 • Published Aug 31, 2024 • 5
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Paper • 2409.02245 • Published Sep 3, 2024 • 10
LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10, 2024 • 60
SongCreator: Lyrics-based Universal Song Generation

Paper • 2409.06029 • Published Sep 9, 2024 • 22
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Paper • 2409.06135 • Published Sep 10, 2024 • 16
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Paper • 2409.09214 • Published Sep 13, 2024 • 53
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Paper • 2409.10819 • Published Sep 17, 2024 • 18
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Paper • 2409.10831 • Published Sep 17, 2024 • 6
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Paper • 2409.12139 • Published Sep 18, 2024 • 12
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Paper • 2409.08425 • Published Sep 12, 2024 • 10
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

Paper • 2409.12962 • Published Sep 19, 2024 • 2
MuCodec: Ultra Low-Bitrate Music Codec

Paper • 2409.13216 • Published Sep 20, 2024 • 22
Temporally Aligned Audio for Video with Autoregression

Paper • 2409.13689 • Published Sep 20, 2024 • 9
Distilling an End-to-End Voice Assistant Without Instruction Training Data

Paper • 2410.02678 • Published Oct 3, 2024 • 24
Roadmap towards Superhuman Speech Understanding using Large Language Models

Paper • 2410.13268 • Published Oct 17, 2024 • 33
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization

Paper • 2410.12957 • Published Oct 16, 2024 • 8
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

Paper • 2410.15316 • Published Oct 20, 2024 • 12
Continuous Speech Synthesis using per-token Latent Diffusion

Paper • 2410.16048 • Published Oct 21, 2024 • 30
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Paper • 2409.00750 • Published Sep 1, 2024 • 6
Acoustic Volume Rendering for Neural Impulse Response Fields

Paper • 2411.06307 • Published Nov 9, 2024 • 6
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation

Paper • 2411.08307 • Published Nov 13, 2024 • 7
Video-Guided Foley Sound Generation with Multimodal Controls

Paper • 2411.17698 • Published Nov 26, 2024 • 10
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation

Paper • 2412.09428 • Published Dec 12, 2024 • 7
Whisper-GPT: A Hybrid Representation Audio Large Language Model

Paper • 2412.11449 • Published Dec 16, 2024 • 4
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning

Paper • 2412.09858 • Published Dec 13, 2024 • 2
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis

Paper • 2412.15322 • Published Dec 19, 2024 • 20
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

Paper • 2412.18495 • Published Dec 24, 2024 • 9
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization

Paper • 2412.21037 • Published Dec 30, 2024 • 24
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

Paper • 2501.10045 • Published Jan 17, 2025 • 10
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Paper • 2502.04128 • Published Feb 6, 2025 • 27
Soundwave: Less is More for Speech-Text Alignment in LLMs

Paper • 2502.12900 • Published Feb 18, 2025 • 86
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Paper • 2502.13128 • Published Feb 18, 2025 • 41
Mind the Gap! Static and Interactive Evaluations of Large Audio Models

Paper • 2502.15919 • Published Feb 21, 2025 • 4
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM

Paper • 2503.04724 • Published Mar 6, 2025 • 72
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

Paper • 2503.03983 • Published Mar 6, 2025 • 29
YuE: Scaling Open Foundation Models for Long-Form Music Generation

Paper • 2503.08638 • Published Mar 11, 2025 • 73
Quantization for OpenAI's Whisper Models: A Comparative Analysis

Paper • 2503.09905 • Published Mar 12, 2025 • 7
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM

Paper • 2503.10620 • Published Mar 13, 2025 • 7
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Paper • 2502.18924 • Published Feb 26, 2025 • 16
Kimi-Audio Technical Report

Paper • 2504.18425 • Published Apr 25, 2025 • 21
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Paper • 2505.02707 • Published May 5, 2025 • 85
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Paper • 2505.02625 • Published May 5, 2025 • 23
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model

Paper • 2505.03739 • Published May 6, 2025 • 10
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Paper • 2505.07916 • Published May 12, 2025 • 139
Fast Text-to-Audio Generation with Adversarial Post-Training

Paper • 2505.08175 • Published May 13, 2025 • 26
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Paper • 2505.13181 • Published May 19, 2025 • 9
Learning to Highlight Audio by Watching Movies

Paper • 2505.12154 • Published May 17, 2025 • 3
From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition

Paper • 2505.16972 • Published May 22, 2025 • 9
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning

Paper • 2506.00338 • Published May 31, 2025 • 10
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion

Paper • 2506.01111 • Published Jun 1, 2025 • 32
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation

Paper • 2506.08570 • Published Jun 10, 2025 • 33
Discrete Audio Tokens: More Than a Survey!

Paper • 2506.10274 • Published Jun 12, 2025 • 32
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection

Paper • 2506.09827 • Published Jun 11, 2025 • 24
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning

Paper • 2506.15154 • Published Jun 18, 2025 • 9
CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning

Paper • 2506.17818 • Published Jun 21, 2025 • 3
USAD: Universal Speech and Audio Representation via Distillation

Paper • 2506.18843 • Published Jun 23, 2025 • 13
Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders

Paper • 2507.07867 • Published Jul 10, 2025 • 2
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

Paper • 2507.08128 • Published Jul 10, 2025 • 15
Voxtral

Paper • 2507.13264 • Published Jul 17, 2025 • 35
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System

Paper • 2502.05512 • Published Feb 8, 2025 • 7
OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

Paper • 2507.14129 • Published Jul 18, 2025 • 12
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models

Paper • 2507.15375 • Published Jul 21, 2025 • 30
Step-Audio 2 Technical Report

Paper • 2507.16632 • Published Jul 22, 2025 • 76
DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

Paper • 2507.14988 • Published Jul 20, 2025 • 8
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

Paper • 2508.03448 • Published Aug 5, 2025 • 7
Marco-Voice Technical Report

Paper • 2508.02038 • Published Aug 4, 2025 • 16
NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations

Paper • 2508.04195 • Published Aug 6, 2025 • 2
Representing Speech Through Autoregressive Prediction of Cochlear Tokens

Paper • 2508.11598 • Published Aug 15, 2025 • 18
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge

Paper • 2508.08777 • Published Aug 12, 2025 • 15
Advances in Speech Separation: Techniques, Challenges, and Future Trends

Paper • 2508.10830 • Published Aug 14, 2025 • 16
LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model

Paper • 2508.15418 • Published Aug 21, 2025 • 8
TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

Paper • 2508.16790 • Published Aug 22, 2025 • 10
VibeVoice Technical Report

Paper • 2508.19205 • Published Aug 26, 2025 • 172
AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Paper • 2508.20088 • Published Aug 27, 2025 • 21
AHELM: A Holistic Evaluation of Audio-Language Models

Paper • 2508.21376 • Published Aug 29, 2025 • 9
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Paper • 2509.09174 • Published Sep 11, 2025 • 62
VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions

Paper • 2509.09716 • Published Sep 9, 2025 • 12
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers

Paper • 2509.10452 • Published Sep 12, 2025 • 2
Cross-Attention is Half Explanation in Speech-to-Text Models

Paper • 2509.18010 • Published Sep 22, 2025 • 6
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Paper • 2509.23610 • Published Sep 28, 2025 • 15
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE

Paper • 2510.13344 • Published Oct 15, 2025 • 65
Step-Audio-EditX Technical Report

Paper • 2511.03601 • Published Nov 5, 2025 • 30
Step-Audio-R1 Technical Report

Paper • 2511.15848 • Published Nov 19, 2025 • 60
SAM Audio: Segment Anything in Audio

Paper • 2512.18099 • Published Dec 19, 2025 • 25
Qwen3-TTS Technical Report

Paper • 2601.15621 • Published Jan 22 • 76
Qwen3-ASR Technical Report

Paper • 2601.21337 • Published Jan 29 • 38
VIBEVOICE-ASR Technical Report

Paper • 2601.18184 • Published Mar 14 • 24
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Paper • 2602.10934 • Published Feb 11 • 50
GLM-5: from Vibe Coding to Agentic Engineering

Paper • 2602.15763 • Published Feb 17 • 152
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing

Paper • 2604.10708 • Published Apr 12 • 43
