Paulmzr committed on
Commit 2d5297d · verified · 1 Parent(s): 4e85ebb

Update README.md

Files changed (1): README.md (+203, −2)

README.md CHANGED
@@ -6,6 +6,207 @@ language:
  pipeline_tag: text-to-speech
  ---

- Usage Instructions: https://github.com/ictnlp/SLED-TTS

- #This repo is under construction
# SLED-TTS: Efficient Speech Language Modeling via Energy Distance in Continuous Space
[![HuggingFace](https://img.shields.io/badge/HuggingFace-FEC200?style=flat&logo=Hugging%20Face)](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac)
[![WeChat AI](https://img.shields.io/badge/WeChat%20AI-4CAF50?style=flat&logo=wechat)](https://www.wechat.com)
[![ICT/CAS](https://img.shields.io/badge/ICT%2FCAS-0066cc?style=flat&logo=school)](https://ict.cas.cn)

## Code: https://github.com/ictnlp/SLED-TTS

## Key features
- **Autoregressive Continuous Modeling**: SLED models speech in a continuous latent space, using a special type of maximum mean discrepancy (the energy distance) as the training objective; a minimal sample-based sketch follows this list.
- **Streaming Synthesis**: SLED supports streaming synthesis, so speech generation can start as soon as the text stream begins.
- **Voice Cloning**: SLED can generate speech conditioned on a 3-second prefix or a reference utterance used as a prompt.
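The energy-distance objective can be estimated directly from samples. Below is a minimal, illustrative sketch of a per-position loss of this kind (an intuition aid, not the repository's implementation); it assumes the model can draw two independent latent samples for the same context:

``` python
import torch

def energy_distance_loss(x1: torch.Tensor, x2: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sample-based energy-distance loss for one sequence position.

    x1, x2: two independent model samples of the continuous latent, shape (dim,).
    target: the ground-truth continuous latent, shape (dim,).
    Estimates E||X - y|| (attraction) minus 0.5 * E||X - X'|| (repulsion);
    the repulsion term keeps the model from collapsing to a single mode.
    """
    attraction = 0.5 * ((x1 - target).norm() + (x2 - target).norm())
    repulsion = 0.5 * (x1 - x2).norm()
    return attraction - repulsion
```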

## Demo
You can check SLED in action by exploring the [demo page](https://sled-demo.github.io/).
<div style="display: flex;">
  <img src="https://github.com/user-attachments/assets/0f6ee8a0-4258-48a2-a670-5556672dbc18" width="200" style="margin-right: 20px;"/>
  <img src="https://github.com/user-attachments/assets/f48848b0-58d9-403a-86d1-80683565a4d7" width="500"/>
</div>
## Available Models on Hugging Face

We have made SLED available on [Hugging Face](https://huggingface.co/collections/ICTNLP/sled-tts-680253e19c889010a1a376ac), currently offering two English models for different use cases:

1. **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)**: Trained on the Libriheavy dataset; provides high-quality offline text-to-speech synthesis.

2. **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)**: Supports **streaming decoding**, generating a 0.6-second speech chunk for every 5 text tokens received (see the schedule sketch below). It is ideal for applications requiring low-latency audio generation.
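The 5-tokens-per-0.6-second schedule can be pictured as a simple read/write loop. The sketch below is an illustration of that interleaving, not the repository's code; the numbers match the `--stream_n 5 --stream_m 45` training flags shown later, assuming 45 latent frames at Encodec's 75 Hz frame rate (0.6 s):

``` python
def stream_schedule(text_tokens, n=5, m=45):
    """Yield (text chunk, number of latent frames to generate) pairs."""
    buffer = []
    for token in text_tokens:
        buffer.append(token)
        if len(buffer) == n:          # enough text has arrived:
            yield list(buffer), m     # emit one chunk of m latent frames
            buffer.clear()
    if buffer:                        # flush any trailing partial chunk
        yield list(buffer), m

for chunk, frames in stream_schedule("the quick brown fox jumps over the lazy dog".split()):
    print(chunk, frames)              # two chunks (5 and 4 tokens), 45 frames each
```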

The Mandarin models are on the way! Alternatively, you can train your own SLED-TTS models by following the guidelines below.
## Usage
**We provide the training and inference code for SLED-TTS.**

### Installation
``` sh
git clone https://github.com/ictnlp/SLED-TTS.git
cd SLED-TTS
pip install -e ./
```

We currently use the sum of the first 8 codebook embedding vectors from [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) as the continuous latent vector. Before running, make sure [Encodec_24khz](https://huggingface.co/facebook/encodec_24khz) is downloaded and cached in your Hugging Face cache directory.
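For reference, here is a minimal sketch of how such a latent can be computed with the `transformers` Encodec implementation. This is illustrative, not the repository's data pipeline, and the direct call into the quantizer is an assumption about the current `transformers` internals:

``` python
import torch
from transformers import AutoProcessor, EncodecModel

# First use downloads facebook/encodec_24khz into the Hugging Face cache.
model = EncodecModel.from_pretrained("facebook/encodec_24khz")
processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")

wav = torch.zeros(24000)  # 1 s of silence at 24 kHz, a stand-in for real speech
inputs = processor(raw_audio=wav.numpy(), sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    # bandwidth=6.0 kbps corresponds to 8 codebooks at the 75 Hz frame rate.
    encoded = model.encode(inputs["input_values"], inputs["padding_mask"], bandwidth=6.0)
    codes = encoded.audio_codes[0]                            # (batch, 8, frames)
    # Sum of the 8 codebooks' embeddings: one continuous vector per frame.
    latents = model.quantizer.decode(codes.transpose(0, 1))   # (batch, 128, frames)
```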

### Inference
- Set the `CHECKPOINT` variable to the path of the cached **[SLED-TTS-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Libriheavy)** or **[SLED-TTS-Streaming-Libriheavy](https://huggingface.co/ICTNLP/SLED-TTS-Streaming-Libriheavy)** model.
- `CFG` sets the classifier-free guidance scale; diverse generation results can be obtained by varying the `SEED` variable.
``` sh
CHECKPOINT=/path/to/checkpoint
CFG=2.0
SEED=0
```
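For intuition, classifier-free guidance typically blends a text-conditioned prediction with an unconditional one; larger `CFG` values follow the text more strictly. A generic sketch (the exact formulation used by SLED's sampler lives in the repository code, and the names here are placeholders):

``` python
def apply_cfg(cond_out, uncond_out, cfg: float = 2.0):
    """Generic classifier-free guidance combination.

    cfg = 1.0 reduces to the conditional prediction; cfg > 1.0 extrapolates
    away from the unconditional prediction, strengthening text adherence.
    """
    return uncond_out + cfg * (cond_out - uncond_out)
```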

***Offline Inference***
``` sh
python scripts/run_offline.py \
    --model_name_or_path ${CHECKPOINT} \
    --cfg ${CFG} \
    --input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \
    --seed ${SEED}
```
***Streaming Inference***
``` sh
python scripts/run_stream.py \
    --model_name_or_path ${CHECKPOINT} \
    --cfg ${CFG} \
    --input "My remark pleases him, but I soon prove to him that it is not the right way to speak. However perfect may have been the language of that ancient writer." \
    --seed ${SEED}
# Note: run_stream.py simulates generation in a streaming environment so that
# streaming output quality can be evaluated; the current code does not expose
# an actual streaming API.
```
***Voice Cloning***

You can adjust the prompt speech by setting `--prompt_text` and `--prompt_audio`.
``` sh
python scripts/run_voice_clone.py \
    --prompt_text "Were I in the warm room with all the splendor and magnificence!" \
    --prompt_audio "example_prompt.flac" \
    --model_name_or_path ${CHECKPOINT} \
    --cfg ${CFG} \
    --input "Perhaps the other trees from the forest will come to look at me!" \
    --seed ${SEED}
```
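Since the key features above mention a roughly 3-second prompt, you may want to trim a longer reference utterance before passing it as `--prompt_audio`. A small torchaudio-based helper (illustrative; not part of the repository):

``` python
import torchaudio

def trim_prompt(path: str, out_path: str = "example_prompt.flac", seconds: float = 3.0) -> str:
    """Keep only the first `seconds` of a reference utterance for prompting."""
    wav, sr = torchaudio.load(path)      # (channels, samples), sample rate
    wav = wav[:, : int(seconds * sr)]    # truncate to the prompt length
    torchaudio.save(out_path, wav, sr)
    return out_path
```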

### Training

***Data Processing***
#TODO
***Training Offline Model***
``` sh
OUTPUT_DIR=./runs/libriheavy
mkdir -p $OUTPUT_DIR
LOG_FILE=${OUTPUT_DIR}/log

BATCH_SIZE=8
UPDATE_FREQ=8
# Assuming 8 processes per node, the effective batch size is
# WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 (e.g. 1 * 8 * 8 * 8 with one node).

torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    ./scripts/train_libriheavy.py \
    --training_cfg 0.1 \
    --num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \
    --dataloader_num_workers 8 \
    --dataloader_pin_memory True \
    --remove_unused_columns False \
    --label_names audio_inputs \
    --group_by_speech_length \
    --do_train \
    --do_eval \
    --eval_strategy steps \
    --eval_steps 10000 \
    --prediction_loss_only \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --per_device_eval_batch_size 24 \
    --gradient_accumulation_steps ${UPDATE_FREQ} \
    --bf16 \
    --learning_rate 5e-4 \
    --weight_decay 0.01 \
    --adam_beta1 0.9 \
    --adam_beta2 0.999 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --max_steps 300000 \
    --lr_scheduler_type "linear" \
    --warmup_steps 32000 \
    --logging_first_step \
    --logging_steps 100 \
    --save_steps 10000 \
    --save_total_limit 10 \
    --output_dir ${OUTPUT_DIR} \
    --report_to tensorboard \
    --disable_tqdm True \
    --ddp_timeout 3600 --overwrite_output_dir
```

***Training Streaming Model***
``` sh
OUTPUT_DIR=./runs/libriheavy_stream
mkdir -p $OUTPUT_DIR
LOG_FILE=${OUTPUT_DIR}/log

BATCH_SIZE=8
UPDATE_FREQ=8
# Assuming 8 processes per node, the effective batch size is
# WORLD_SIZE * 8 * BATCH_SIZE * UPDATE_FREQ == 512 (e.g. 1 * 8 * 8 * 8 with one node).

# --stream_n 5 --stream_m 45 presumably configures the interleaving: one chunk
# of 45 latent frames (0.6 s at 75 Hz) for every 5 text tokens.
torchrun --nnodes ${WORLD_SIZE} --node_rank ${RANK} --nproc_per_node 8 --master_addr ${MASTER_ADDR} --master_port ${MASTER_PORT} \
    ./scripts/train_libriheavy_stream.py \
    --finetune_path ./runs/libriheavy/checkpoint-300000/model.safetensors \
    --stream_n 5 --stream_m 45 \
    --training_cfg 0.1 \
    --num_hidden_layers 12 --diffloss_d 6 --noise_channels 128 \
    --dataloader_num_workers 8 \
    --dataloader_pin_memory True \
    --remove_unused_columns False \
    --label_names audio_inputs \
    --group_by_speech_length \
    --do_train \
    --do_eval \
    --eval_strategy steps \
    --eval_steps 10000 \
    --prediction_loss_only \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --per_device_eval_batch_size 24 \
    --gradient_accumulation_steps ${UPDATE_FREQ} \
    --bf16 \
    --learning_rate 3e-4 \
    --weight_decay 0.01 \
    --adam_beta1 0.9 \
    --adam_beta2 0.999 \
    --adam_epsilon 1e-8 \
    --max_grad_norm 1.0 \
    --max_steps 100000 \
    --lr_scheduler_type "linear" \
    --warmup_steps 10000 \
    --logging_first_step \
    --logging_steps 100 \
    --save_steps 10000 \
    --save_total_limit 10 \
    --output_dir ${OUTPUT_DIR} \
    --report_to tensorboard \
    --disable_tqdm True \
    --ddp_timeout 3600 --overwrite_output_dir
```

## Code Contributors

- [Zhengrui Ma](https://scholar.google.com/citations?user=dUgq6tEAAAAJ)
- [Chenze Shao](https://scholar.google.com/citations?user=LH_rZf8AAAAJ)

## Acknowledgement
This work is inspired by the following great works:
- A Proper Loss Is All You Need: Autoregressive Image Generation in Continuous Space via Score Maximization
- Autoregressive Image Generation without Vector Quantization
- A Spectral Energy Distance for Parallel Speech Synthesis

## Citation
#TODO