James Zhou committed
Commit · 2e936cf
1 Parent(s): 9c301e6
[update] readme
README.md CHANGED
@@ -105,7 +105,7 @@ Professional-grade audio generation with crystal clarity
 
 ## **Abstract**
 
-<div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d;">
+<div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d; color: #333;">
 
 **Tencent Hunyuan** proudly open-sources **HunyuanVideo-Foley** - an end-to-end video sound effect generation model!
 
@@ -117,21 +117,21 @@ Professional-grade audio generation with crystal clarity
 
 <div style="display: grid; grid-template-columns: 1fr; gap: 15px; margin: 20px 0;">
 
-<div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px;">
+<div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
 
 **🎬 Multi-scenario Audio-Visual Synchronization**
 Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
 
 </div>
 
-<div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px;">
+<div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
 
 **⚖️ Multi-modal Semantic Balance**
 Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
 
 </div>
 
-<div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px;">
+<div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
 
 **🎵 High-fidelity Audio Output**
 Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.
@@ -140,7 +140,7 @@ Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and
 
 </div>
 
-<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0;">
+<div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0; color: #333;">
 
 **SOTA Performance Achieved**
 
@@ -168,7 +168,7 @@ Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and
 
 </div>
 
-<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0;">
+<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0; color: #333;">
 
 The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
 
@@ -183,7 +183,7 @@ The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation
 
 </div>
 
-<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0;">
+<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0; color: #333;">
 
 **HunyuanVideo-Foley** employs a sophisticated hybrid architecture:
 
@@ -276,7 +276,7 @@ cd HunyuanVideo-Foley
 
 #### **Step 2: Environment Setup**
 
-<div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0;">
+<div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0; color: #333;">
 
 💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.
 
@@ -289,7 +289,7 @@ pip install -r requirements.txt
 
 #### **Step 3: Download Pretrained Models**
 
-<div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0;">
+<div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0; color: #333;">
 
 **Download Model weights from Huggingface**
 ```bash
@@ -309,7 +309,7 @@ huggingface-cli download tencent/HunyuanVideo-Foley
 
 ### 🎬 **Single Video Generation**
 
-<div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0;">
+<div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0; color: #333;">
 
 Generate Foley audio for a single video file with text description:
 
@@ -326,7 +326,7 @@ python3 infer.py \
 
 ### **Batch Processing**
 
-<div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0;">
+<div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0; color: #333;">
 
 Process multiple videos using a CSV file with video paths and descriptions:
 
@@ -342,7 +342,7 @@ python3 infer.py \
 
 ### **Interactive Web Interface**
 
-<div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0;">
+<div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0; color: #333;">
 
 Launch a user-friendly Gradio web interface for easy interaction:
 
@@ -353,7 +353,7 @@ export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
 python3 gradio_app.py
 ```
 
-<div align="center" style="margin: 20px 0;">
+<div align="center" style="margin: 20px 0; color: #333;">
 
 *Then open your browser and navigate to the provided local URL to start generating Foley audio!*
 
@@ -363,7 +363,7 @@ python3 gradio_app.py
 
 ## **Citation**
 
-<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0;">
+<div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0; color: #333;">
 
 If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:
 