Zhikang Niu committed · Commit 263e3c5 · 1 Parent(s): cdf6969

Update README.md

Files changed (1): README.md (+68 -29)
## F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

### <a href="https://swivid.github.io/F5-TTS/">Demo</a>; <a href="https://arxiv.org/abs/2410.06885">Paper</a>; <a href="https://huggingface.co/SWivid/F5-TTS">Checkpoints</a>

F5-TTS is a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT). Without requiring complex designs such as a duration model, text encoder, or phoneme alignment, the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed for speech generation, an approach originally proven feasible by E2 TTS.

![image](https://github.com/user-attachments/assets/6194b82e-fe90-4b86-9d45-82ade478fb49)

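The filler-token padding described above can be sketched in a few lines. This is an illustrative toy, not the repo's actual tokenizer or API; the function name and the filler ID of 0 are assumptions:

```python
# Minimal sketch of the filler-token padding idea (illustrative only): the text
# token sequence is padded to the mel-frame length of the input speech, so the
# model can treat text and audio as equal-length sequences during denoising.

def pad_text_to_speech_len(text_ids: list[int], num_mel_frames: int, filler_id: int = 0) -> list[int]:
    """Pad (or truncate) a token-ID sequence to the speech length with filler tokens."""
    if len(text_ids) >= num_mel_frames:
        return text_ids[:num_mel_frames]
    return text_ids + [filler_id] * (num_mel_frames - len(text_ids))

# e.g. 5 text tokens aligned against 8 mel frames
padded = pad_text_to_speech_len([7, 3, 9, 2, 4], 8)
print(padded)  # [7, 3, 9, 2, 4, 0, 0, 0]
```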
## Installation
Clone this repository.
```bash
git clone git@github.com:SWivid/F5-TTS.git
cd F5-TTS
```
Install the required packages.
```bash
pip install -r requirements.txt
```

## Prepare Dataset
We provide data processing scripts for Emilia and WenetSpeech4TTS; you only need to fill in your data paths in the scripts.
```bash
# Download the corresponding dataset first, then fill in its path in the script.
# Adapt the scripts to prepare a custom dataset as needed.

# Prepare the Emilia dataset
python scripts/prepare_emilia.py

# Prepare the WenetSpeech4TTS dataset
python scripts/prepare_wenetspeech4tts.py
```
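Conceptually, these preparation scripts pair audio files with their transcripts and collect the results into a manifest. The sketch below is a hypothetical stand-in: the directory layout, file naming, and `build_manifest` are illustrative, and the real scripts define their own formats:

```python
# Hypothetical sketch of what a dataset-preparation step does: pair *.wav files
# with sibling *.txt transcripts and collect them into a manifest list.
from pathlib import Path
import tempfile

def build_manifest(data_root: str) -> list[dict]:
    entries = []
    for wav in sorted(Path(data_root).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():  # keep only audio files that have a transcript
            entries.append({"audio": str(wav), "text": txt.read_text().strip()})
    return entries

# Tiny demonstration on a throwaway directory:
root = Path(tempfile.mkdtemp())
(root / "utt1.wav").write_bytes(b"")           # placeholder audio
(root / "utt1.txt").write_text("hello world")  # its transcript
(root / "utt2.wav").write_bytes(b"")           # no transcript -> skipped
manifest = build_manifest(str(root))
print(manifest)  # one entry, for utt1
```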
 
## Training
Once your datasets are prepared, you can start the training process:
```bash
# set up the accelerate config, e.g. multi-GPU DDP, fp16
# (saved to: ~/.cache/huggingface/accelerate/default_config.yaml)
accelerate config
accelerate launch test_train.py
```
## Inference
To perform inference with the pretrained model, download the checkpoints from [F5-TTS Pretrained Model](https://huggingface.co/SWivid/F5-TTS).

### Single Inference
You can test single inference with the following command. Before running it, modify the config to fit your needs.
```bash
# Modify the config as needed, e.g.:
# fix_duration (total length of prompt + generated speech; currently supports up to 30s)
# nfe_step (larger values take more time but give a more precise inference ODE)
# ode_method (switch to 'midpoint' for better compatibility with small nfe_step;
#   'midpoint' is a 2nd-order ODE solver, hence slower than the 1st-order 'Euler')
python test_infer_single.py
```
### Speech Edit
To test the speech editing capability, use the following command.
```bash
python test_infer_single_edit.py
```
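To illustrate the `nfe_step`/`ode_method` trade-off mentioned in the comments: the 1st-order Euler solver uses one derivative evaluation per step, while the 2nd-order midpoint solver uses two but stays accurate at smaller step counts. A toy comparison on dy/dt = y (not the repo's actual sampler):

```python
# Toy comparison of the two ODE solvers mentioned above on dy/dt = y
# (exact solution e^t). 'euler' uses 1 function evaluation per step;
# 'midpoint' uses 2 but is 2nd-order, so it is more accurate per step.
import math

def integrate(f, y0: float, t1: float, steps: int, method: str = "euler") -> float:
    y, t = y0, 0.0
    h = t1 / steps
    for _ in range(steps):
        if method == "euler":
            y += h * f(t, y)
        else:  # midpoint: evaluate the slope at the center of the interval
            y_mid = y + 0.5 * h * f(t, y)
            y += h * f(t + 0.5 * h, y_mid)
        t += h
    return y

f = lambda t, y: y
exact = math.e  # y(1) for y' = y, y(0) = 1
print(abs(integrate(f, 1.0, 1.0, 8, "euler") - exact))     # larger error
print(abs(integrate(f, 1.0, 1.0, 8, "midpoint") - exact))  # smaller error
```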

## Evaluation
### Prepare Test Datasets
1. Seed-TTS test set: download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
2. LibriSpeech test-clean: download from [OpenSLR](http://www.openslr.org/12/).
3. Unzip the downloaded datasets and place them under the data/ directory.
4. Fill in the path to the test-clean data in `test_infer_batch.py`.
5. Our LibriSpeech-PC 4-10s subset is already under data/ in this repo.

### Download Evaluation Model Checkpoints
1. Chinese ASR model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
2. English ASR model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
3. WavLM model: download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).

Make sure you fill in the paths to these checkpoints in `test_infer_batch.py`.

### Batch Inference
To run batch inference for the evaluations, execute the following commands:
```bash
accelerate config  # if not set before
bash test_infer_batch.sh
```
**Installation Notes**
For Faster-Whisper with CUDA 11:
```bash
pip install --force-reinstall ctranslate2==3.24.0
pip install faster-whisper==0.10.1  # recommended
```
This helps avoid ASR failures such as abnormal repetition in the output.
 
### Run Evaluation
Run the following commands to evaluate the model's performance:
```bash
# Evaluation on the Seed-TTS test set
python scripts/eval_seedtts_testset.py

# Evaluation on LibriSpeech-PC test-clean (cross-sentence)
python scripts/eval_librispeech_test_clean.py
```
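As background for the numbers these scripts report: word error rate (WER) is the word-level edit distance between the ASR transcript of the generated speech and the reference text, normalized by reference length. A minimal sketch of the metric itself (not the repo's implementation, which wraps the ASR models above):

```python
# Minimal word error rate (WER) sketch: Levenshtein distance between the ASR
# hypothesis and the reference, normalized by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("the cat sat", "the bat sat"))  # 1/3 (one substitution out of three words)
```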
 
## Acknowledgements

- <a href="https://arxiv.org/abs/2406.18009">E2-TTS</a> for the brilliant work, simple and effective
- <a href="https://arxiv.org/abs/2407.05361">Emilia</a> and <a href="https://arxiv.org/abs/2406.05763">WenetSpeech4TTS</a> for the valuable datasets
- <a href="https://github.com/modelscope/FunASR">FunASR</a>, <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a> & <a href="https://github.com/microsoft/UniSpeech">UniSpeech</a> for the evaluation tools
- <a href="https://github.com/rtqichen/torchdiffeq">torchdiffeq</a> as the ODE solver, <a href="https://huggingface.co/charactr/vocos-mel-24khz">Vocos</a> as the vocoder
- <a href="https://github.com/MahmoudAshraf97/ctc-forced-aligner">ctc-forced-aligner</a> for the speech edit test

## Citation
```bibtex
@misc{chen2024f5ttsfairytalerfakesfluent,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      year={2024},
      eprint={2410.06885},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.06885},
}
```

## License
Our code is released under the MIT License.