Update README.md
README.md
CHANGED
@@ -4,10 +4,125 @@ emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
-sdk_version: 5.
+sdk_version: 5.35.0
app_file: app.py
pinned: true
short_description: mcp_server
---
-
Looking at this code, it's a Text-to-Speech (TTS) application using the Zonos model. Let me provide explanations in both English and Korean.

## English Explanation

### Overview
This is a Gradio-based web application for the **Zonos Text-to-Speech (TTS) Generator**. Zonos is an advanced TTS model from Zyphra that can generate natural-sounding speech with customizable voice characteristics.

### Key Features

1. **Model Selection**
   - Two model variants: Transformer and Hybrid
   - Different models have different conditioning capabilities

2. **Text Input & Language Support**
   - Supports multiple languages through eSpeak phoneme conversion
   - Text length limit of 500 characters
   - Language selection from supported language codes

3. **Voice Customization**
   - **Speaker Cloning**: Upload audio to clone a specific voice
   - **Voice Quality Settings**:
     - DNS-MOS (Voice Quality): 1.0-5.0 scale
     - Frequency Max: Control the highest frequency in Hz
     - Voice Clarity: Adjust voice intelligibility
     - Pitch Variation: Control how much the pitch varies
     - Speaking Rate: Adjust speech speed

4. **Emotion Control**
   - 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
   - Fine-tune emotional expression in the generated speech

5. **Advanced Generation Parameters**
   - **Guidance Scale**: Controls how closely the model follows the conditioning
   - **Min P**: Controls randomness/creativity in generation
   - **Seed**: For reproducible results
   - **Prefix Audio**: Continue generation from existing audio

6. **Unconditional Generation**
   - Toggle specific conditions to let the model generate them automatically
   - Useful for more creative/varied outputs
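
The **Min P** and **Seed** parameters listed above can be illustrated with a small sketch. This is a generic min-p sampling routine written in plain Python, not the code used by Zonos or this Space; the function names `softmax`, `min_p_filter`, and `sample_token` are made up for illustration. The idea is that tokens whose probability falls below `min_p` times the probability of the most likely token are discarded before sampling, so a higher Min P gives more conservative output, and a fixed seed makes the random draw repeatable.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 for min_p <= 1: the top token survives
    return [p / total for p in kept]


def sample_token(logits, min_p, seed):
    """Sample one token index; the same seed gives the same draw."""
    rng = random.Random(seed)
    probs = min_p_filter(softmax(logits), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.5, -3.0]  # illustrative values only
    print(min_p_filter(softmax(example_logits), min_p=0.1))
    print(sample_token(example_logits, min_p=0.1, seed=42))
```

With `min_p=0.0` nothing is filtered and sampling uses the full distribution; values closer to 1.0 keep only tokens nearly as likely as the top one.
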

### Technical Details
- Uses GPU acceleration via CUDA
- Implements classifier-free guidance for better control
- Supports audio continuation from prefix
- Real-time progress tracking during generation
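
Classifier-free guidance, mentioned in the list above, is commonly described as running the model twice per step, once with the conditioning and once without, and then extrapolating from the unconditional prediction toward the conditional one. The sketch below shows that standard formulation on plain Python lists. It is an assumption that this Space combines logits in exactly this way, and `apply_cfg` is a hypothetical helper name, not a function from Zonos or Gradio.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional logits.

    guidance_scale = 0 returns the unconditional logits,
    guidance_scale = 1 returns the conditional logits,
    and values above 1 push further toward the conditioning.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [
        u + guidance_scale * (c - u)
        for c, u in zip(cond_logits, uncond_logits)
    ]


if __name__ == "__main__":
    cond = [2.0, 0.0]    # illustrative values only
    uncond = [1.0, 1.0]
    print(apply_cfg(cond, uncond, guidance_scale=2.0))
```

This is why the Guidance Scale slider trades adherence to the conditioning against variety: a larger scale amplifies whatever the conditioning changed in the prediction.
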

### How to Use
1. Select a model variant
2. Enter your text and choose language
3. (Optional) Upload speaker audio for voice cloning
4. Adjust voice characteristics and emotions
5. Click "Generate Audio" to create speech
6. Download or play the generated audio
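
The choices made in steps 1 to 4 can be thought of as one settings object that is checked before generation starts. The sketch below is only an illustration of that idea: `GenerationRequest`, `validate_request`, the field names, and the default values are all hypothetical and are not taken from `app.py`. The 500-character limit, the two model variants, the 1.0-5.0 DNS-MOS range, and the eight emotion names come from this README. Whether the real app rejects or truncates over-long text is not stated, so rejection here is an assumption.

```python
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 500  # limit stated in the README
MODEL_VARIANTS = ("transformer", "hybrid")
EMOTIONS = (
    "happiness", "sadness", "disgust", "fear",
    "surprise", "anger", "other", "neutral",
)


@dataclass
class GenerationRequest:
    """Hypothetical container for the settings chosen in steps 1-4."""
    text: str
    language: str
    model_variant: str = "transformer"
    dnsmos: float = 4.0  # assumed default; README gives only the 1.0-5.0 range
    emotions: dict = field(default_factory=dict)  # slider name -> weight


def validate_request(req):
    """Return a list of problems; an empty list means the request looks usable."""
    problems = []
    if not req.text.strip():
        problems.append("text is empty")
    if len(req.text) > MAX_TEXT_CHARS:
        problems.append(
            f"text has {len(req.text)} characters; limit is {MAX_TEXT_CHARS}"
        )
    if req.model_variant not in MODEL_VARIANTS:
        problems.append(f"unknown model variant: {req.model_variant}")
    if not 1.0 <= req.dnsmos <= 5.0:
        problems.append("dnsmos must be between 1.0 and 5.0")
    unknown = sorted(set(req.emotions) - set(EMOTIONS))
    if unknown:
        problems.append(f"unknown emotion sliders: {unknown}")
    return problems


if __name__ == "__main__":
    request = GenerationRequest(
        text="Hello from Zonos.",
        language="en-us",
        emotions={"happiness": 0.8, "neutral": 0.2},
    )
    print(validate_request(request))
```
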

---

## Korean Explanation

### Overview
This is a Gradio-based web application for the **Zonos Text-to-Speech (TTS) Generator**. Zonos is an advanced TTS model developed by Zyphra that lets users customize voice characteristics to generate natural-sounding speech.

### Key Features

1. **Model Selection**
   - Two model variants: Transformer and Hybrid
   - Each model offers different conditioning capabilities

2. **Text Input & Language Support**
   - Multilingual support through eSpeak phoneme conversion
   - Text length limit: 500 characters
   - Language can be selected from the supported language codes

3. **Voice Customization**
   - **Speaker Cloning**: Upload audio to clone a specific voice
   - **Voice Quality Settings**:
     - DNS-MOS (Voice Quality): 1.0-5.0 scale
     - Maximum Frequency: Control the highest frequency in Hz
     - Voice Clarity: Adjust voice intelligibility
     - Pitch Variation: Control how much the pitch varies
     - Speaking Rate: Adjust speech speed

4. **Emotion Control**
   - 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
   - Fine-tune the emotional expression of the generated speech

5. **Advanced Generation Parameters**
   - **Guidance Scale**: Controls how faithfully the model follows the conditioning
   - **Min P**: Controls randomness/creativity in generation
   - **Seed**: Setting for reproducible results
   - **Prefix Audio**: Continue generation from existing audio

6. **Unconditional Generation**
   - Toggle specific conditions so that the model generates them automatically
   - Useful for more creative and varied outputs

### Technical Details
- Uses GPU acceleration via CUDA
- Implements classifier-free guidance for better control
- Supports continuing audio generation from a prefix
- Real-time progress tracking during generation

### How to Use
1. Select a model variant
2. Enter text and choose a language
3. (Optional) Upload speaker audio for voice cloning
4. Adjust voice characteristics and emotions
5. Click the "Generate Audio" button to generate speech
6. Download or play the generated audio

### Special Features
- **Emotion Settings**: Fine-grained control over the emotional tone of the generated speech
- **Voice Quality**: Adjust voice quality through the DNS-MOS score
- **Speaker Noise Removal**: Option to remove noise from the uploaded speaker audio
- **Unconditional Keys**: Set specific features to be generated automatically

This application is a powerful and flexible tool for high-quality TTS generation and can be used to produce voice content for a variety of purposes.