Update README.md

README.md (changed)
# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

[Paper (arXiv)](https://arxiv.org/abs/2410.06885) · [Demo](https://swivid.github.io/F5-TTS/) · [Hugging Face Space](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)

**F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster training and inference. \
**E2 TTS**: Flat-UNet Transformer, the closest reproduction. \
**Sway Sampling**: Inference-time flow step sampling strategy that greatly improves performance.
## Installation
Clone this repository, then install the dependencies:
```bash
pip install -r requirements.txt
```
## Prepare Dataset
Example data processing scripts are provided for Emilia and WenetSpeech4TTS; you may tailor your own, along with a Dataset class in `model/dataset.py`.
```bash
# prepare custom dataset up to your need
# download corresponding dataset first, and fill in the path in scripts
python scripts/prepare_wenetspeech4tts.py
```
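For a custom dataset, the Dataset class mentioned above typically wraps a list of (audio path, transcript) pairs. A minimal self-contained sketch follows; the metadata format and class name are illustrative assumptions, not the repo's actual `model/dataset.py`:

```python
# Minimal sketch of a custom dataset wrapper. In the actual repo this would
# subclass torch.utils.data.Dataset, and __getitem__ would load the audio
# and return a mel spectrogram tensor rather than a path.

class CustomTTSDataset:
    def __init__(self, metadata):
        # metadata: list of {"audio_path": str, "text": str, "duration": float};
        # filter to a 4-10 s range, as in the LibriSpeech-PC subset below
        self.items = [m for m in metadata if 4.0 <= m["duration"] <= 10.0]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        return item["audio_path"], item["text"]
```

A DataLoader can then batch these pairs for training; the duration filter keeps utterances in a range the model handles well.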
## Training
Once your datasets are prepared, you can start the training process.
```bash
# set up the accelerate config, e.g. multi-gpu ddp, fp16
# it will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch test_train.py
```
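The accelerate config referenced above is a YAML file; for a multi-GPU DDP + fp16 setup it might look roughly like the following (all values are illustrative assumptions, not taken from this repo):

```yaml
# ~/.cache/huggingface/accelerate/default_config.yaml (illustrative)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 2   # number of GPUs
gpu_ids: all
```

Running `accelerate config` interactively generates this file for you, so you rarely need to write it by hand.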
## Inference
To run inference with pretrained models, download the checkpoints from [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS).

### Single Inference
You can test single inference using the following command. Before running it, modify the config to your needs.
```bash
python test_infer_single_edit.py
```
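At inference time these models integrate a learned flow ODE over a number of steps, and the Sway Sampling mentioned at the top of this README warps the uniform step grid toward early timesteps. A minimal sketch of Euler integration over such a grid; the warp follows the paper's general shape as we understand it, and all names and coefficients are illustrative:

```python
import math

def sway_schedule(n_steps, s=-1.0):
    """Illustrative non-uniform time grid on [0, 1]: uniform points u are
    warped by t = u + s * (cos(pi/2 * u) - 1 + u); negative s concentrates
    solver steps near t = 0, early in generation."""
    ts = []
    for i in range(n_steps + 1):
        u = i / n_steps
        ts.append(u + s * (math.cos(math.pi / 2 * u) - 1 + u))
    return ts

def euler_flow(x0, vector_field, ts):
    """Euler integration of dx/dt = v(x, t) along the time grid ts."""
    x = x0
    for t0, t1 in zip(ts, ts[1:]):
        x = x + (t1 - t0) * vector_field(x, t0)
    return x
```

The grid still starts at 0 and ends at 1; only the spacing of the intermediate steps changes, which is why the strategy can be applied at inference time without retraining.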
## Evaluation

### Prepare Test Datasets
1. Seed-TTS test set: download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
2. LibriSpeech test-clean: download from [OpenSLR](http://www.openslr.org/12/).
3. Unzip the downloaded datasets and place them in the `data/` directory.
4. Update the path for the test-clean data in `test_infer_batch.py`.
5. Our filtered LibriSpeech-PC 4-10 s subset is already under `data/` in this repo.
### Batch Inference for Test Set
To run batch inference for evaluations, execute the following commands:
```bash
# batch inference for evaluations
accelerate config  # if not set before
bash test_infer_batch.sh
```
### Download Evaluation Model Checkpoints
1. Chinese ASR model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
2. English ASR model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
3. WavLM model: download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
### Objective Evaluation
**Some notes** \
For faster-whisper with CUDA 11: \
`pip install --force-reinstall ctranslate2==3.24.0` \
(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output: \
`pip install faster-whisper==0.10.1`

Update the paths with your batch-inferenced results, then run the WER / SIM evaluations:
```bash
# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py

# Evaluation for LibriSpeech-PC test-clean
python scripts/eval_librispeech_test_clean.py
```
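Here WER (word error rate) compares ASR transcripts of the generated speech against the reference text, and SIM is usually the cosine similarity between speaker embeddings of generated and reference audio. A minimal sketch of both metrics (the eval scripts above compute them with the ASR and WavLM checkpoints; function names here are illustrative):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(r)][len(h)] / max(len(r), 1)

def cosine_sim(a, b) -> float:
    """Speaker similarity (SIM) as cosine similarity of two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)
```

Lower WER and higher SIM are better; the scripts aggregate both over the whole test set.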
## Acknowledgements
- <a href="https://arxiv.org/abs/2407.05361">Emilia</a> and <a href="https://arxiv.org/abs/2406.05763">WenetSpeech4TTS</a> for the valuable datasets
- <a href="https://github.com/lucidrains/e2-tts-pytorch">lucidrains</a> for the initial CFM structure, with <a href="https://github.com/bfs18">bfs18</a> for discussion
- <a href="https://arxiv.org/abs/2403.03206">SD3</a> & <a href="https://github.com/huggingface/diffusers">Hugging Face diffusers</a> for the DiT and MMDiT code structure
- <a href="https://github.com/rtqichen/torchdiffeq">torchdiffeq</a> as the ODE solver and <a href="https://huggingface.co/charactr/vocos-mel-24khz">Vocos</a> as the vocoder
- <a href="https://x.com/realmrfakename">mrfakename</a> for the Hugging Face Space demo
- <a href="https://github.com/modelscope/FunASR">FunASR</a>, <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a> & <a href="https://github.com/microsoft/UniSpeech">UniSpeech</a> for evaluation tools
- <a href="https://github.com/MahmoudAshraf97/ctc-forced-aligner">ctc-forced-aligner</a> for the speech edit test
## Citation
```bibtex
@article{chen-etal-2024-f5tts,
  title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
  author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
  journal={arXiv preprint arXiv:2410.06885},
  year={2024},
}
```
## LICENSE