SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation Paper • 2405.18503 • Published May 28, 2024 • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation Paper • 2405.20289 • Published May 30, 2024 • 11
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes Paper • 2406.02897 • Published Jun 5, 2024 • 15
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning Paper • 2406.03344 • Published Jun 5, 2024 • 22
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities Paper • 2406.11768 • Published Jun 17, 2024 • 24
Towards Robust Speech Representation Learning for Thousands of Languages Paper • 2407.00837 • Published Jun 30, 2024 • 11
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds Paper • 2407.01494 • Published Jul 1, 2024 • 15
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation Paper • 2407.02869 • Published Jul 3, 2024 • 21
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs Paper • 2407.04051 • Published Jul 4, 2024 • 40
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity Paper • 2407.10387 • Published Jul 15, 2024 • 8
Audio Conditioning for Music Generation via Discrete Bottleneck Features Paper • 2407.12563 • Published Jul 17, 2024 • 7
Efficient Audio Captioning with Encoder-Level Knowledge Distillation Paper • 2407.14329 • Published Jul 19, 2024 • 5
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music Generation Paper • 2407.15060 • Published Jul 21, 2024 • 9
Towards Achieving Human Parity on End-to-end Simultaneous Speech Translation via LLM Agent Paper • 2407.21646 • Published Jul 31, 2024 • 18
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models Paper • 2408.01337 • Published Aug 2, 2024 • 11
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual Segmentation Paper • 2408.01708 • Published Aug 3, 2024 • 4
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation Paper • 2408.03588 • Published Aug 7, 2024 • 8
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency Paper • 2408.04708 • Published Aug 8, 2024 • 9
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation Paper • 2408.07547 • Published Aug 14, 2024 • 8
Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization Paper • 2408.08019 • Published Aug 15, 2024 • 11
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling Paper • 2408.16532 • Published Aug 29, 2024 • 50
The VoxCeleb Speaker Recognition Challenge: A Retrospective Paper • 2408.14886 • Published Aug 27, 2024 • 11
Density Adaptive Attention-based Speech Network: Enhancing Feature Understanding for Mental Health Disorders Paper • 2409.00391 • Published Aug 31, 2024 • 5
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation Paper • 2409.02245 • Published Sep 3, 2024 • 10
LLaMA-Omni: Seamless Speech Interaction with Large Language Models Paper • 2409.06666 • Published Sep 10, 2024 • 60
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis Paper • 2409.06135 • Published Sep 10, 2024 • 16
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation Paper • 2409.09214 • Published Sep 13, 2024 • 53
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer Paper • 2409.10819 • Published Sep 17, 2024 • 18
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing Paper • 2409.10831 • Published Sep 17, 2024 • 6
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models Paper • 2409.12139 • Published Sep 18, 2024 • 12
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer Paper • 2409.08425 • Published Sep 12, 2024 • 10
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions Paper • 2409.12962 • Published Sep 19, 2024 • 2
Temporally Aligned Audio for Video with Autoregression Paper • 2409.13689 • Published Sep 20, 2024 • 9
Distilling an End-to-End Voice Assistant Without Instruction Training Data Paper • 2410.02678 • Published Oct 3, 2024 • 23
Roadmap towards Superhuman Speech Understanding using Large Language Models Paper • 2410.13268 • Published Oct 17, 2024 • 33
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic Synchronization Paper • 2410.12957 • Published Oct 16, 2024 • 8
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant Paper • 2410.15316 • Published Oct 20, 2024 • 12
Continuous Speech Synthesis using per-token Latent Diffusion Paper • 2410.16048 • Published Oct 21, 2024 • 29
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer Paper • 2409.00750 • Published Sep 1, 2024 • 5
Acoustic Volume Rendering for Neural Impulse Response Fields Paper • 2411.06307 • Published Nov 9, 2024 • 5
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation Paper • 2411.08307 • Published Nov 13, 2024 • 7
Video-Guided Foley Sound Generation with Multimodal Controls Paper • 2411.17698 • Published Nov 26, 2024 • 10
Multimodal Music Generation with Explicit Bridges and Retrieval Augmentation Paper • 2412.09428 • Published Dec 12, 2024 • 7
Whisper-GPT: A Hybrid Representation Audio Large Language Model Paper • 2412.11449 • Published Dec 16, 2024 • 4
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning Paper • 2412.09858 • Published Dec 13, 2024 • 2
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis Paper • 2412.15322 • Published Dec 19, 2024 • 20
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System? Paper • 2412.18495 • Published Dec 24, 2024 • 9
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization Paper • 2412.21037 • Published Dec 30, 2024 • 24
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution Paper • 2501.10045 • Published Jan 17, 2025 • 9
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis Paper • 2502.04128 • Published Feb 6, 2025 • 27
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation Paper • 2502.13128 • Published Feb 18, 2025 • 41
Mind the Gap! Static and Interactive Evaluations of Large Audio Models Paper • 2502.15919 • Published Feb 21, 2025 • 4
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Paper • 2503.04724 • Published Mar 6, 2025 • 72
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities Paper • 2503.03983 • Published Mar 6, 2025 • 26
YuE: Scaling Open Foundation Models for Long-Form Music Generation Paper • 2503.08638 • Published Mar 11, 2025 • 71
Quantization for OpenAI's Whisper Models: A Comparative Analysis Paper • 2503.09905 • Published Mar 12, 2025 • 7
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM Paper • 2503.10620 • Published Mar 13, 2025 • 7
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis Paper • 2502.18924 • Published Feb 26, 2025 • 16
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play Paper • 2505.02707 • Published May 5, 2025 • 85
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis Paper • 2505.02625 • Published May 5, 2025 • 22
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model Paper • 2505.03739 • Published May 6, 2025 • 9
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder Paper • 2505.07916 • Published May 12, 2025 • 134
Fast Text-to-Audio Generation with Adversarial Post-Training Paper • 2505.08175 • Published May 13, 2025 • 25
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space Paper • 2505.13181 • Published May 19, 2025 • 9
From Tens of Hours to Tens of Thousands: Scaling Back-Translation for Speech Recognition Paper • 2505.16972 • Published May 22, 2025 • 9
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning Paper • 2506.00338 • Published May 31, 2025 • 10
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal Contextual Fusion Paper • 2506.01111 • Published Jun 1, 2025 • 31
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling Paradigms for Text-to-Music Generation Paper • 2506.08570 • Published Jun 10, 2025 • 33
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech Emotion Detection Paper • 2506.09827 • Published Jun 11, 2025 • 20
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning Paper • 2506.15154 • Published Jun 18, 2025 • 9
CultureMERT: Continual Pre-Training for Cross-Cultural Music Representation Learning Paper • 2506.17818 • Published Jun 21, 2025 • 3
USAD: Universal Speech and Audio Representation via Distillation Paper • 2506.18843 • Published Jun 23, 2025 • 12
Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders Paper • 2507.07867 • Published Jul 10, 2025 • 2
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models Paper • 2507.08128 • Published Jul 10, 2025 • 10
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System Paper • 2502.05512 • Published Feb 8, 2025 • 6
OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder Paper • 2507.14129 • Published Jul 18, 2025 • 9
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models Paper • 2507.15375 • Published Jul 21, 2025 • 30
DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis Paper • 2507.14988 • Published Jul 20, 2025 • 7
SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering Paper • 2508.03448 • Published Aug 5, 2025 • 4
NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech Modeling with Paralinguistic Vocalizations Paper • 2508.04195 • Published Aug 6, 2025 • 1
Representing Speech Through Autoregressive Prediction of Cochlear Tokens Paper • 2508.11598 • Published Aug 15, 2025 • 17
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge Paper • 2508.08777 • Published Aug 12, 2025 • 15
Advances in Speech Separation: Techniques, Challenges, and Future Trends Paper • 2508.10830 • Published Aug 14, 2025 • 15
LLaSO: A Foundational Framework for Reproducible Research in Large Language and Speech Model Paper • 2508.15418 • Published Aug 21, 2025 • 8
TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling Paper • 2508.16790 • Published Aug 22, 2025 • 10
AudioStory: Generating Long-Form Narrative Audio with Large Language Models Paper • 2508.20088 • Published Aug 27, 2025 • 21
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs Paper • 2509.09174 • Published Sep 11, 2025 • 61
VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions Paper • 2509.09716 • Published Sep 9, 2025 • 11
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained Speech Recognition Transformers Paper • 2509.10452 • Published Sep 12, 2025 • 2
Cross-Attention is Half Explanation in Speech-to-Text Models Paper • 2509.18010 • Published Sep 22, 2025 • 6
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention Paper • 2509.23610 • Published Sep 28, 2025 • 13
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity MoE Paper • 2510.13344 • Published Oct 15, 2025 • 62