James Zhou committed on
Commit 2e936cf · 1 Parent(s): 9c301e6

[update] readme

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -105,7 +105,7 @@ Professional-grade audio generation with crystal clarity
105
 
106
  ## 📄 **Abstract**
107
 
108
- <div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d;">
109
 
110
  **🚀 Tencent Hunyuan** proudly open-sources **HunyuanVideo-Foley** - an end-to-end video sound effect generation model!
111
 
@@ -117,21 +117,21 @@ Professional-grade audio generation with crystal clarity
117
 
118
  <div style="display: grid; grid-template-columns: 1fr; gap: 15px; margin: 20px 0;">
119
 
120
- <div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px;">
121
 
122
  **🎬 Multi-scenario Audio-Visual Synchronization**
123
  Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
124
 
125
  </div>
126
 
127
- <div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px;">
128
 
129
  **⚖️ Multi-modal Semantic Balance**
130
  Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
131
 
132
  </div>
133
 
134
- <div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px;">
135
 
136
  **🎡 High-fidelity Audio Output**
137
  Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.
@@ -140,7 +140,7 @@ Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and
140
 
141
  </div>
142
 
143
- <div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0;">
144
 
145
  **🏆 SOTA Performance Achieved**
146
 
@@ -168,7 +168,7 @@ Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and
168
 
169
  </div>
170
 
171
- <div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0;">
172
 
173
  The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
174
 
@@ -183,7 +183,7 @@ The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation
183
 
184
  </div>
185
 
186
- <div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0;">
187
 
188
  **HunyuanVideo-Foley** employs a sophisticated hybrid architecture:
189
 
@@ -276,7 +276,7 @@ cd HunyuanVideo-Foley
276
 
277
  #### **Step 2: Environment Setup**
278
 
279
- <div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0;">
280
 
281
  💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.
282
 
@@ -289,7 +289,7 @@ pip install -r requirements.txt
289
 
290
  #### **Step 3: Download Pretrained Models**
291
 
292
- <div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0;">
293
 
294
  🔗 **Download Model weights from Huggingface**
295
  ```bash
@@ -309,7 +309,7 @@ huggingface-cli download tencent/HunyuanVideo-Foley
309
 
310
  ### 🎬 **Single Video Generation**
311
 
312
- <div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0;">
313
 
314
  Generate Foley audio for a single video file with text description:
315
 
@@ -326,7 +326,7 @@ python3 infer.py \
326
 
327
  ### 📂 **Batch Processing**
328
 
329
- <div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0;">
330
 
331
  Process multiple videos using a CSV file with video paths and descriptions:
332
 
@@ -342,7 +342,7 @@ python3 infer.py \
342
 
343
  ### 🌐 **Interactive Web Interface**
344
 
345
- <div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0;">
346
 
347
  Launch a user-friendly Gradio web interface for easy interaction:
348
 
@@ -353,7 +353,7 @@ export HIFI_FOLEY_MODEL_PATH=PRETRAINED_MODEL_PATH_DIR
353
  python3 gradio_app.py
354
  ```
355
 
356
- <div align="center" style="margin: 20px 0;">
357
 
358
  *🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!*
359
 
@@ -363,7 +363,7 @@ python3 gradio_app.py
363
 
364
  ## 📚 **Citation**
365
 
366
- <div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0;">
367
 
368
  If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:
369
 
 
105
 
106
  ## 📄 **Abstract**
107
 
108
+ <div align="center" style="background: linear-gradient(135deg, #ffeef8 0%, #f0f8ff 100%); padding: 30px; border-radius: 20px; margin: 20px 0; border-left: 5px solid #ff6b9d; color: #333;">
109
 
110
  **🚀 Tencent Hunyuan** proudly open-sources **HunyuanVideo-Foley** - an end-to-end video sound effect generation model!
111
 
 
117
 
118
  <div style="display: grid; grid-template-columns: 1fr; gap: 15px; margin: 20px 0;">
119
 
120
+ <div style="border-left: 4px solid #4CAF50; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
121
 
122
  **🎬 Multi-scenario Audio-Visual Synchronization**
123
  Supports generating high-quality audio that is synchronized and semantically aligned with complex video scenes, enhancing realism and immersive experience for film/TV and gaming applications.
124
 
125
  </div>
126
 
127
+ <div style="border-left: 4px solid #2196F3; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
128
 
129
  **⚖️ Multi-modal Semantic Balance**
130
  Intelligently balances visual and textual information analysis, comprehensively orchestrates sound effect elements, avoids one-sided generation, and meets personalized dubbing requirements.
131
 
132
  </div>
133
 
134
+ <div style="border-left: 4px solid #FF9800; padding: 15px; background: #f8f9fa; border-radius: 8px; color: #333;">
135
 
136
  **🎡 High-fidelity Audio Output**
137
  Self-developed 48kHz audio VAE perfectly reconstructs sound effects, music, and vocals, achieving professional-grade audio generation quality.
 
140
 
141
  </div>
142
 
143
+ <div align="center" style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 20px; border-radius: 15px; margin: 20px 0; color: #333;">
144
 
145
  **🏆 SOTA Performance Achieved**
146
 
 
168
 
169
  </div>
170
 
171
+ <div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #17a2b8; margin: 20px 0; color: #333;">
172
 
173
  The **TV2A (Text-Video-to-Audio)** task presents a complex multimodal generation challenge requiring large-scale, high-quality datasets. Our comprehensive data pipeline systematically identifies and excludes unsuitable content to produce robust and generalizable audio generation capabilities.
174
 
 
183
 
184
  </div>
185
 
186
+ <div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #28a745; margin: 20px 0; color: #333;">
187
 
188
  **HunyuanVideo-Foley** employs a sophisticated hybrid architecture:
189
 
 
276
 
277
  #### **Step 2: Environment Setup**
278
 
279
+ <div style="background: #fff3cd; padding: 15px; border-radius: 8px; border-left: 4px solid #ffc107; margin: 10px 0; color: #333;">
280
 
281
  💡 **Tip**: We recommend using [Conda](https://docs.anaconda.com/free/miniconda/index.html) for Python environment management.
282
 
 
289
 
290
  #### **Step 3: Download Pretrained Models**
291
 
292
+ <div style="background: #d1ecf1; padding: 15px; border-radius: 8px; border-left: 4px solid #17a2b8; margin: 10px 0; color: #333;">
293
 
294
  🔗 **Download Model weights from Huggingface**
295
  ```bash
 
309
 
310
  ### 🎬 **Single Video Generation**
311
 
312
+ <div style="background: #e8f5e8; padding: 15px; border-radius: 8px; border-left: 4px solid #28a745; margin: 10px 0; color: #333;">
313
 
314
  Generate Foley audio for a single video file with text description:
315
 
 
326
 
327
  ### 📂 **Batch Processing**
328
 
329
+ <div style="background: #fff3e0; padding: 15px; border-radius: 8px; border-left: 4px solid #ff9800; margin: 10px 0; color: #333;">
330
 
331
  Process multiple videos using a CSV file with video paths and descriptions:
332
 
 
342
 
343
  ### 🌐 **Interactive Web Interface**
344
 
345
+ <div style="background: #f3e5f5; padding: 15px; border-radius: 8px; border-left: 4px solid #9c27b0; margin: 10px 0; color: #333;">
346
 
347
  Launch a user-friendly Gradio web interface for easy interaction:
348
 
 
353
  python3 gradio_app.py
354
  ```
355
 
356
+ <div align="center" style="margin: 20px 0; color: #333;">
357
 
358
  *🚀 Then open your browser and navigate to the provided local URL to start generating Foley audio!*
359
 
 
363
 
364
  ## 📚 **Citation**
365
 
366
+ <div style="background: #f8f9fa; padding: 20px; border-radius: 10px; border-left: 4px solid #6c757d; margin: 20px 0; color: #333;">
367
 
368
  If you find **HunyuanVideo-Foley** useful for your research, please consider citing our paper:
369
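
Every changed line in this diff appends `color: #333;` to the inline `style` attribute of a `<div>`. The same kind of edit can be sketched in Python as below. This is an illustration only, not how the commit was produced: the helper name `add_text_color` and the regular expression are assumptions, and it patches every styled `<div>`, whereas the commit leaves the grid container at README line 118 untouched.

```python
import re

# An opening <div> tag with a double-quoted inline style attribute.
DIV_STYLE = re.compile(r'(<div\b[^>]*\bstyle=")([^"]*)(")')


def add_text_color(markup: str, color: str = "#333") -> str:
    """Append `color: <color>;` to the inline style of each matching <div>."""

    def patch(match: re.Match) -> str:
        prefix, style, quote = match.groups()
        style = style.strip()
        if style and not style.endswith(";"):
            style += ";"  # terminate the last declaration before appending
        new_style = f"{style} color: {color};".strip()
        return f"{prefix}{new_style}{quote}"

    return DIV_STYLE.sub(patch, markup)


before = '<div align="center" style="margin: 20px 0;">'
print(add_text_color(before))
# prints: <div align="center" style="margin: 20px 0; color: #333;">
```

Appending blindly has a side effect where a style already sets `color`: in CSS the later declaration of the same property wins, so on the gradient banner at README line 143 the new `color: #333` takes precedence over the existing `color: white`.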