---
library_name: hunyuanvideo-foley
license: other
license_name: tencent-hunyuan-community
license_link: https://huggingface.co/tencent/HunyuanVideo-Foley/blob/main/LICENSE
language:
  - en
  - zh
tags:
  - text-to-audio
  - video-to-audio
  - text-video-to-audio
pipeline_tag: text-to-audio
extra_gated_eu_disallowed: true
---

<div align="center">
  
<img src="assets/logo.png" alt="HunyuanVideo-Foley Logo" width="400">

<h4>Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation</h4>

<p align="center">
  <strong>Professional-grade AI sound effect generation for video content creators</strong>
</p>

<div align="center" style="margin: 20px 0;">
  <a href=https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley target="_blank"><img src=https://img.shields.io/badge/Code-black.svg?logo=github height=22px></a>
  <a href=https://szczesnys.github.io/hunyuanvideo-foley target="_blank"><img src=https://img.shields.io/badge/Page-bb8a2e.svg?logo=github height=22px></a>
  <a href=https://huggingface.co/tencent/HunyuanVideo-Foley target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Models-d96902.svg height=22px></a>
  <a href=https://huggingface.co/spaces/tencent/HunyuanVideo-Foley  target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Demo-276cb4.svg height=22px></a>
  <a href=https://arxiv.org/abs/2508.16930 target="_blank"><img src=https://img.shields.io/badge/Report-b5212f.svg?logo=arxiv height=22px></a>
  <a href=https://x.com/TencentHunyuan target="_blank"><img src=https://img.shields.io/badge/Hunyuan-black.svg?logo=x height=22px></a>
</div>

</div>

---

<div align="center">
  
### 👥 **Authors**

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); padding: 20px; border-radius: 15px; margin: 20px 0;">

**Sizhe Shan**<sup>1,2*</sup> • **Qiulin Li**<sup>1,3*</sup> • **Yutao Cui**<sup>1</sup> • **Miles Yang**<sup>1</sup> • **Yuehai Wang**<sup>2</sup> • **Qun Yang**<sup>3</sup> • **Jin Zhou**<sup>1†</sup> • **Zhao Zhong**<sup>1</sup>

</div>

<div style="margin-top: 15px; font-size: 14px; color: #666;">
  
🏒 <sup>1</sup>**Tencent Hunyuan** β€’ πŸŽ“ <sup>2</sup>**Zhejiang University** β€’ ✈️ <sup>3</sup>**Nanjing University of Aeronautics and Astronautics**

*Equal contribution β€’ †Project lead

</div>

</div>


---

<!-- ## 🎥 **Demo & Showcase** -->
<!--  -->
<!-- <div align="center"> -->
<!--    -->
<!-- > **Experience the magic of AI-generated Foley audio in perfect sync with video content!** -->
<!--  -->
<!-- <div style="border: 3px solid #4A90E2; border-radius: 15px; padding: 10px; margin: 20px 0; background: linear-gradient(135deg, #f5f7fa 0%, #c3cfe2 100%);"> -->
<!--    -->
<!--   <video src="https://github.com/user-attachments/assets/087a4b59-8b22-4b7a-bac3-8f5d5a9972fe" width="80%" controls style="border-radius: 10px; box-shadow: 0 8px 32px rgba(0,0,0,0.1);"> </video> -->
<!--    -->
<!--   <p><em>🎬 Watch how HunyuanVideo-Foley generates immersive sound effects synchronized with video content</em></p> -->
<!--    -->
<!-- </div> -->

### ✨ **Key Highlights**

<table align="center" style="border: none; margin: 20px 0;">
<tr>
<td align="center" width="33%">
  
🎭 **Multi-scenario Sync**  
High-quality audio synchronized with complex video scenes

</td>
<td align="center" width="33%">
  
🧠 **Multi-modal Balance**  
Perfect harmony between visual and textual information

</td>
<td align="center" width="33%">
  
🎡 **48kHz Hi-Fi Output**  
Professional-grade audio generation with crystal clarity

</td>
</tr>
</table>

</div>

---

## 📄 **Abstract**

<div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d; color: #333;">

**🚀 Tencent Hunyuan** open-sources **HunyuanVideo-Foley**, an end-to-end video sound effect generation model!

*A professional-grade AI tool specifically designed for video content creators, widely applicable to diverse scenarios including short video creation, film production, advertising creativity, and game development.*

</div>

### 🎯 **Core Highlights**

<div style="display: grid; grid-template-columns: 1fr; gap: 15px; margin: 20px 0;">

<div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
  
**🎬 Multi-scenario Audio-Visual Synchronization**  
Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.

</div>

<div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
  
**⚖️ Multi-modal Semantic Balance**  
Balances visual and textual cues when composing sound effects, avoiding generation that follows only one modality, and supports personalized dubbing requirements.

</div>

<div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
  
**🎡 High-fidelity Audio Output**  
Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.

</div>

</div>

<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0; color: #333;">
  
**🏆 SOTA Performance Achieved**

*HunyuanVideo-Foley achieves state-of-the-art results among the compared open-source methods on multiple evaluation benchmarks, with the strongest gains in audio fidelity, visual-semantic alignment, temporal alignment, and distribution matching.*

</div>

<div align="center">
  
![Performance Overview](assets/pan_chart.png)
*📊 Performance comparison across different evaluation metrics*

</div>

---

## 🔧 **Technical Architecture**

### 📊 **Data Pipeline Design**

<div align="center" style="margin: 20px 0;">
  
![Data Pipeline](assets/data_pipeline.png)
*🔄 Comprehensive data processing pipeline for high-quality text-video-audio datasets*

</div>

<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0; color: #333;">

The **TV2A (Text-Video-to-Audio)** task is a complex multimodal generation challenge that requires large-scale, high-quality datasets. Our data pipeline systematically identifies and excludes unsuitable content, yielding data that supports robust and generalizable audio generation.

</div>

### 🏗️ **Model Architecture**

<div align="center" style="margin: 20px 0;">
  
![Model Architecture](assets/model_arch.png)
*🧠 HunyuanVideo-Foley hybrid architecture with multimodal and unimodal transformer blocks*

</div>

<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0; color: #333;">

**HunyuanVideo-Foley** employs a sophisticated hybrid architecture:

- **🔄 Multimodal Transformer Blocks**: Process visual-audio streams simultaneously
- **🎵 Unimodal Transformer Blocks**: Focus on audio stream refinement
- **👁️ Visual Encoding**: Pre-trained encoder extracts visual features from video frames
- **📝 Text Processing**: Semantic features extracted via pre-trained text encoder
- **🎧 Audio Encoding**: Latent representations with Gaussian noise perturbation
- **⏰ Temporal Alignment**: Synchformer-based frame-level synchronization with gated modulation

</div>

---

## 📈 **Performance Benchmarks**

### 🎬 **MovieGen-Audio-Bench Results**

<div align="center">
  
> *Objective and subjective evaluation results demonstrating strong performance across most metrics*

</div>

<div style="overflow-x: auto; margin: 20px 0;">

| 🏆 **Method** | **PQ** ↑ | **PC** ↓ | **CE** ↑ | **CU** ↑ | **IB** ↑ | **DeSync** ↓ | **CLAP** ↑ | **MOS-Q** ↑ | **MOS-S** ↑ | **MOS-T** ↑ |
|:-------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|:------------:|:------------:|:------------:|
| FoleyGrafter | 6.27 | 2.72 | 3.34 | 5.68 | 0.17 | 1.29 | 0.14 | 3.36±0.78 | 3.54±0.88 | 3.46±0.95 |
| V-AURA | 5.82 | 4.30 | 3.63 | 5.11 | 0.23 | 1.38 | 0.14 | 2.55±0.97 | 2.60±1.20 | 2.70±1.37 |
| Frieren | 5.71 | 2.81 | 3.47 | 5.31 | 0.18 | 1.39 | 0.16 | 2.92±0.95 | 2.76±1.20 | 2.94±1.26 |
| MMAudio | 6.17 | 2.84 | 3.59 | 5.62 | 0.27 | 0.80 | 0.35 | 3.58±0.84 | 3.63±1.00 | 3.47±1.03 |
| ThinkSound | 6.04 | 3.73 | 3.81 | 5.59 | 0.18 | 0.91 | 0.20 | 3.20±0.97 | 3.01±1.04 | 3.02±1.08 |
| **HunyuanVideo-Foley (ours)** | **6.59** | **2.74** | **3.88** | **6.13** | **0.35** | **0.74** | **0.33** | **4.14±0.68** | **4.12±0.77** | **4.15±0.75** |

</div>


### 🎯 **Kling-Audio-Eval Results**

<div align="center">
  
> *Comprehensive objective evaluation showcasing state-of-the-art performance*

</div>

<div style="overflow-x: auto; margin: 20px 0;">

| 🏆 **Method** | **FD_PANNs** ↓ | **FD_PASST** ↓ | **KL** ↓ | **IS** ↑ | **PQ** ↑ | **PC** ↓ | **CE** ↑ | **CU** ↑ | **IB** ↑ | **DeSync** ↓ | **CLAP** ↑ |
|:-------------:|:--------------:|:--------------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:-------------:|:-----------:|
| FoleyGrafter | 22.30 | 322.63 | 2.47 | 7.08 | 6.05 | 2.91 | 3.28 | 5.44 | 0.22 | 1.23 | 0.22 |
| V-AURA | 33.15 | 474.56 | 3.24 | 5.80 | 5.69 | 3.98 | 3.13 | 4.83 | 0.25 | 0.86 | 0.13 |
| Frieren | 16.86 | 293.57 | 2.95 | 7.32 | 5.72 | 2.55 | 2.88 | 5.10 | 0.21 | 0.86 | 0.16 |
| MMAudio | 9.01 | 205.85 | 2.17 | 9.59 | 5.94 | 2.91 | 3.30 | 5.39 | 0.30 | 0.56 | 0.27 |
| ThinkSound | 9.92 | 228.68 | 2.39 | 6.86 | 5.78 | 3.23 | 3.12 | 5.11 | 0.22 | 0.67 | 0.22 |
| **HunyuanVideo-Foley (ours)** | **6.07** | **202.12** | **1.89** | **8.30** | **6.12** | **2.76** | **3.22** | **5.53** | **0.38** | **0.54** | **0.24** |

</div>

<div align="center" style="background: linear-gradient(135deg, #4CAF50 0%, #45a049 100%); color: white; padding: 15px; border-radius: 10px; margin: 20px 0;">
  
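**🎉 Outstanding Results!** HunyuanVideo-Foley achieves the best scores on most evaluation metrics across both benchmarks, including audio quality (PQ), visual-semantic alignment (IB), temporal synchronization (DeSync), and all three subjective MOS ratings, while baselines remain ahead on a few metrics such as PC and CLAP.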

</div>
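
The arrows in the table headers mark whether a higher or a lower value is better. As a reading aid, the following minimal Python sketch recomputes the per-metric leader from the objective MovieGen-Audio-Bench scores listed above. The dictionary layout and the `best_per_metric` helper are illustrative assumptions and are not part of the repository.

```python
# Objective MovieGen-Audio-Bench scores copied from the table above.
# True means higher is better, False means lower is better.
HIGHER_IS_BETTER = {
    "PQ": True, "PC": False, "CE": True, "CU": True,
    "IB": True, "DeSync": False, "CLAP": True,
}

SCORES = {
    "FoleyGrafter": {"PQ": 6.27, "PC": 2.72, "CE": 3.34, "CU": 5.68,
                     "IB": 0.17, "DeSync": 1.29, "CLAP": 0.14},
    "V-AURA": {"PQ": 5.82, "PC": 4.30, "CE": 3.63, "CU": 5.11,
               "IB": 0.23, "DeSync": 1.38, "CLAP": 0.14},
    "Frieren": {"PQ": 5.71, "PC": 2.81, "CE": 3.47, "CU": 5.31,
                "IB": 0.18, "DeSync": 1.39, "CLAP": 0.16},
    "MMAudio": {"PQ": 6.17, "PC": 2.84, "CE": 3.59, "CU": 5.62,
                "IB": 0.27, "DeSync": 0.80, "CLAP": 0.35},
    "ThinkSound": {"PQ": 6.04, "PC": 3.73, "CE": 3.81, "CU": 5.59,
                   "IB": 0.18, "DeSync": 0.91, "CLAP": 0.20},
    "HunyuanVideo-Foley": {"PQ": 6.59, "PC": 2.74, "CE": 3.88, "CU": 6.13,
                           "IB": 0.35, "DeSync": 0.74, "CLAP": 0.33},
}


def best_per_metric(scores, higher_is_better):
    """Return {metric: method} with the leading method for each metric."""
    best = {}
    for metric, higher in higher_is_better.items():
        pick = max if higher else min  # respect the arrow direction
        best[metric] = pick(scores, key=lambda name: scores[name][metric])
    return best


if __name__ == "__main__":
    for metric, method in best_per_metric(SCORES, HIGHER_IS_BETTER).items():
        print(f"{metric}: {method}")
```

On these numbers the sketch reports HunyuanVideo-Foley as the leader on PQ, CE, CU, IB, and DeSync, with FoleyGrafter ahead on PC and MMAudio ahead on CLAP.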



---

## 🚀 **Quick Start**

### 📦 **Installation**

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0;">

**🔧 System Requirements**
- **CUDA**: 12.4 or 11.8 recommended
- **Python**: 3.8+ 
- **OS**: Linux (primary support)

</div>
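
Before installing, it can help to confirm the basics programmatically. The sketch below checks only the Python version and operating system from the list above; the `check_requirements` helper is a hypothetical name introduced for illustration, and CUDA detection is deliberately left out because it depends on how the driver and toolkit are installed.

```python
import platform
import sys

MIN_PYTHON = (3, 8)  # minimum version from the requirements list above


def check_requirements(version_info=None, system=None):
    """Return a list of human-readable problems; an empty list means OK."""
    version_info = sys.version_info if version_info is None else version_info
    system = platform.system() if system is None else system
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version_info[0]}.{version_info[1]} found; 3.8+ is required"
        )
    if system != "Linux":
        problems.append(f"{system} detected; Linux is the primary supported OS")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("warning:", problem)
```

The optional arguments exist only so the check can be exercised with fixed values; calling `check_requirements()` with no arguments inspects the running interpreter.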

#### **Step 1: Clone Repository**

```bash
# 📥 Clone the repository
git clone https://github.com/Tencent-Hunyuan/HunyuanVideo-Foley
cd HunyuanVideo-Foley
```

#### **Step 2: Environment Setup**

<div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0; color: #333;">

💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.

</div>

```bash
# 🔧 Install dependencies
pip install -r requirements.txt
```

#### **Step 3: Download Pretrained Models**

<div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0; color: #333;">

🔗 **Download model weights from Hugging Face**  
```bash
# using git-lfs
git clone https://huggingface.co/tencent/HunyuanVideo-Foley

# using huggingface-cli
huggingface-cli download tencent/HunyuanVideo-Foley
```

</div>


---

## 💻 **Usage**

### 🎬 **Single Video Generation**

<div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0; color: #333;">

Generate Foley audio for a single video file with a text description:

</div>

```bash
python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
    --single_video video_path \
    --single_prompt "audio description" \
    --output_dir OUTPUT_DIR
```
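
When calling the script from another Python program, the same invocation can be assembled as an argument list. This is a minimal sketch that uses only the flags shown above; the `build_infer_command` helper, the example paths, and the prompt are hypothetical placeholders.

```python
import shlex
import sys


def build_infer_command(model_path, video_path, prompt, output_dir,
                        config_path="./configs/hunyuanvideo-foley-xxl.yaml"):
    """Assemble the infer.py call using only the flags documented above."""
    return [
        sys.executable, "infer.py",
        "--model_path", model_path,
        "--config_path", config_path,
        "--single_video", video_path,
        "--single_prompt", prompt,
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    # Hypothetical paths and prompt; replace them with your own.
    cmd = build_infer_command(
        model_path="./HunyuanVideo-Foley",
        video_path="./examples/clip.mp4",
        prompt="footsteps on gravel",
        output_dir="./outputs",
    )
    print(shlex.join(cmd))  # shows the shell-quoted command without running it
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would execute it. Keeping the arguments as a list avoids shell-quoting problems when a prompt contains spaces or quotes.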

### 📂 **Batch Processing**

<div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0; color: #333;">

Process multiple videos using a CSV file with video paths and descriptions:

</div>

```bash
python3 infer.py \
    --model_path PRETRAINED_MODEL_PATH_DIR \
    --config_path ./configs/hunyuanvideo-foley-xxl.yaml \
    --csv_path assets/test.csv \
    --output_dir OUTPUT_DIR
```
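
A batch file can be generated with Python's standard `csv` module, which handles quoting of descriptions that contain commas. The sketch below is illustrative only: the header names `video` and `prompt` are an assumption, so compare them with `assets/test.csv` in the repository before use, and the example paths are hypothetical.

```python
import csv
import io


def batch_csv_text(items, header=("video", "prompt")):
    """Render (video_path, description) pairs as CSV text.

    NOTE: the default header names are an assumption; match them to the
    columns used in assets/test.csv.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(items)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical clips and descriptions.
    text = batch_csv_text([
        ("clips/a.mp4", "rain on a tin roof"),
        ("clips/b.mp4", "dog barking, distant traffic"),
    ])
    print(text, end="")
```

Write the returned text to a file and pass that file's path to `--csv_path`.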

### 🌐 **Interactive Web Interface**

<div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0; color: #333;">

Launch a user-friendly Gradio web interface for easy interaction:

</div>

```bash
export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
python3 gradio_app.py
```

<div align="center" style="margin: 20px 0; color: #333;">
  
*🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!*

</div>
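
If the app is launched from a script rather than an interactive shell, the environment variable can be set per process instead of exported globally. This is a minimal sketch; the `gradio_env` helper is a hypothetical name, and only the `HIFI_FOLEY_MODEL_PATH` variable comes from the command above.

```python
import os

ENV_VAR = "HIFI_FOLEY_MODEL_PATH"  # variable name taken from the command above


def gradio_env(model_path, base_env=None):
    """Return a copy of the environment with the model path set.

    Fails early if the directory is missing, which is easier to diagnose
    than an error raised later inside the app.
    """
    if not os.path.isdir(model_path):
        raise FileNotFoundError(f"model directory not found: {model_path}")
    env = dict(os.environ if base_env is None else base_env)
    env[ENV_VAR] = model_path
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run([sys.executable, "gradio_app.py"], env=..., check=True)`, leaving the parent shell's environment untouched.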

---

## 📚 **Citation**

<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0; color: #333;">

If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:

</div>

```bibtex
@misc{shan2025hunyuanvideofoleymultimodaldiffusionrepresentation,
      title={HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation}, 
      author={Sizhe Shan and Qiulin Li and Yutao Cui and Miles Yang and Yuehai Wang and Qun Yang and Jin Zhou and Zhao Zhong},
      year={2025},
      eprint={2508.16930},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2508.16930}, 
}
```

---

## 🙏 **Acknowledgements**

<div align="center">
  
**We extend our heartfelt gratitude to the open-source community!**

</div>

<table align="center" style="width: 100%; border: none; margin: 20px 0;">
<tr>
<td align="center" style="width: 33%; padding: 10px; vertical-align: top;">

🎨 **[Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium)**  
*Foundation diffusion models*

</td>
<td align="center" style="width: 33%; padding: 10px; vertical-align: top;">

⚑ **[FLUX](https://github.com/black-forest-labs/flux)**  
*Advanced generation techniques*

</td>
<td align="center" style="width: 33%; padding: 10px; vertical-align: top;">

🎡 **[MMAudio](https://github.com/hkchengrex/MMAudio)**  
*Multimodal audio generation*

</td>
</tr>
<tr>
<td align="center" style="width: 33%; padding: 10px; vertical-align: top;">

🤗 **[HuggingFace](https://huggingface.co)**  
*Platform & diffusers library*

</td>
<td align="center" style="width: 33%; padding: 10px; vertical-align: top;">

🗜️ **[DAC](https://github.com/descriptinc/descript-audio-codec)**  
*High-Fidelity Audio Compression*

</td>
<td align="center" style="width: 33%; padding: 10px; vertical-align: top;">

🔗 **[Synchformer](https://github.com/v-iashin/Synchformer)**  
*Audio-Visual Synchronization*

</td>
</tr>
</table>

<div align="center" style="background: linear-gradient(135deg, #74b9ff 0%, #0984e3 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0;">

**🌟 Special thanks to all researchers and developers who contribute to the advancement of AI-generated audio and multimodal learning!**

</div>


---

<div align="center" style="margin: 30px 0;">
  
### 🔗 **Connect with Us**

[![GitHub](https://img.shields.io/badge/GitHub-Follow-black?style=for-the-badge&logo=github)](https://github.com/Tencent-Hunyuan)
[![Twitter](https://img.shields.io/badge/Twitter-Follow-blue?style=for-the-badge&logo=twitter)](https://twitter.com/TencentHunyuan)
[![Hunyuan](https://img.shields.io/badge/Website-HunyuanAI-green?style=for-the-badge&logo=hunyuan)](https://hunyuan.tencent.com/)

<p style="color: #666; margin-top: 15px; font-size: 14px;">
  
© 2025 Tencent Hunyuan. All rights reserved. | Made with ❤️ for the AI community

</p>

</div>