SWivid committed on
Commit ed0b71a · 1 Parent(s): 09e398f

Update README.md

Files changed (1)
  1. README.md +30 -31
README.md CHANGED
@@ -1,10 +1,12 @@
 
 # F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
 
- ### <a href="https://swivid.github.io/F5-TTS/">Demo</a>; <a href="https://arxiv.org/abs/2410.06885">Paper</a>; <a href="https://huggingface.co/SWivid/F5-TTS">Checkpoints</a>.
- F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS.
-
- ![image](https://github.com/user-attachments/assets/6194b82e-fe90-4b86-9d45-82ade478fb49)
 
 ## Installation
 Clone this repository.
@@ -18,7 +20,7 @@ pip install -r requirements.txt
 ```
 
 ## Prepare Dataset
- We provide data processing scripts for Wenetspeech4TTS and Emilia and you just need to update your data paths in the scripts.
 ```bash
 # prepare custom dataset up to your need
 # download corresponding dataset first, and fill in the path in scripts
@@ -31,7 +33,7 @@ python scripts/prepare_wenetspeech4tts.py
 ```
 
 ## Training
- Once your datasets are prepared, you can start the training process. Here’s how to set it up:
 ```bash
 # setup accelerate config, e.g. use multi-gpu ddp, fp16
 # will be to: ~/.cache/huggingface/accelerate/default_config.yaml
@@ -40,7 +42,7 @@ accelerate launch test_train.py
 ```
 
 ## Inference
- To perform inference with the pretrained model, you can download the model checkpoints from [F5-TTS Pretrained Model](https://huggingface.co/SWivid/F5-TTS)
 
 ### Single Inference
 You can test single inference using the following command. Before running it, modify the config as needed.
@@ -61,33 +63,32 @@ python test_infer_single_edit.py
 ## Evaluation
 ### Prepare Test Datasets
 1. Seed-TTS test set: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
- 2. LibriSpeech test clean: Download from [OpenSLR](http://www.openslr.org/12/).
 3. Unzip the downloaded datasets and place them in the data/ directory.
- 4. Update the path for the test clean data in `test_infer_batch.py`
- 5. our librispeech-pc 4-10s subset is already under data/ in this repo
- ### Download Evaluation Model Checkpoints
- 1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
- 2. English ASR Model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
- 3. WavLM Model: Download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
 
- Ensure you update the path for the checkpoints in test_infer_batch.py.
- ### Batch inference
 To run batch inference for evaluations, execute the following commands:
 ```bash
 # batch inference for evaluations
 accelerate config # if not set before
 bash test_infer_batch.sh
 ```
- **Installation Notes**
- For Faster-Whisper with CUDA 11:
- ```bash
- pip install --force-reinstall ctranslate2==3.24.0
- pip install faster-whisper==0.10.1 # recommended
- ```
- This will help avoid ASR failures, such as abnormal repetitions in output.
 
- ### Evaluation
- Run the following commands to evaluate the model's performance:
 ```bash
 # Evaluation for Seed-TTS test set
 python scripts/eval_seedtts_testset.py
@@ -102,20 +103,18 @@ python scripts/eval_librispeech_test_clean.py
 - <a href="https://arxiv.org/abs/2407.05361">Emilia</a>, <a href="https://arxiv.org/abs/2406.05763">WenetSpeech4TTS</a> valuable datasets
 - <a href="https://github.com/lucidrains/e2-tts-pytorch">lucidrains</a> initial CFM structure with also <a href="https://github.com/bfs18">bfs18</a> for discussion
 - <a href="https://arxiv.org/abs/2403.03206">SD3</a> & <a href="https://github.com/huggingface/diffusers">Huggingface diffusers</a> DiT and MMDiT code structure
- - <a href="https://github.com/modelscope/FunASR">FunASR</a>, <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a> & <a href="https://github.com/microsoft/UniSpeech">UniSpeech</a> for evaluation tools
 - <a href="https://github.com/rtqichen/torchdiffeq">torchdiffeq</a> as ODE solver, <a href="https://huggingface.co/charactr/vocos-mel-24khz">Vocos</a> as vocoder
 - <a href="https://github.com/MahmoudAshraf97/ctc-forced-aligner">ctc-forced-aligner</a> for speech edit test
 
 ## Citation
 ```
- @misc{chen2024f5ttsfairytalerfakesfluent,
 title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
 author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
 year={2024},
- eprint={2410.06885},
- archivePrefix={arXiv},
- primaryClass={eess.AS},
- url={https://arxiv.org/abs/2410.06885},
 }
 ```
 ## LICENSE
 
 
 # F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
 
+ [![arXiv](https://img.shields.io/badge/arXiv-2410.06885-b31b1b.svg?logo=arXiv)](https://arxiv.org/abs/2410.06885)
+ [![demo](https://img.shields.io/badge/GitHub-Demo%20page-blue.svg)](https://swivid.github.io/F5-TTS/)
+ [![space](https://img.shields.io/badge/🤗-Space%20demo-yellow)](https://huggingface.co/spaces/mrfakename/E2-F5-TTS) \
+ **F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster training and inference. \
+ **E2 TTS**: Flat-UNet Transformer, closest reproduction. \
+ **Sway Sampling**: Inference-time flow step sampling strategy that greatly improves performance.
 
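The sway sampling added here warps the uniformly drawn flow step at inference time. A minimal sketch (the function name is illustrative; the mapping follows the formula given in the F5-TTS paper, t = u + s·(cos(πu/2) − 1 + u)):

```python
import math

def sway_sample(u: float, s: float = -1.0) -> float:
    """Warp a uniform flow step u in [0, 1] into a swayed step t.

    Mapping from the F5-TTS paper: t = u + s * (cos(pi/2 * u) - 1 + u).
    s < 0 concentrates steps near t = 0; s = 0 recovers uniform sampling.
    Endpoints are preserved: t(0) = 0 and t(1) = 1.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)

# with s = -1, midpoints shift earlier, e.g. u = 0.5 maps below 0.5
warped_mid = sway_sample(0.5, -1.0)
```

With a negative coefficient, more of the ODE solver's flow steps land early in the trajectory, while the endpoints of the schedule stay fixed.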
 ## Installation
 Clone this repository.
 
 ```
 
 ## Prepare Dataset
+ Example data processing scripts are provided for Emilia and Wenetspeech4TTS; you may also tailor your own, along with a Dataset class in `model/dataset.py`.
 ```bash
 # prepare custom dataset up to your need
 # download corresponding dataset first, and fill in the path in scripts
 
 ```
 
 ## Training
+ Once your datasets are prepared, you can start the training process.
 ```bash
 # setup accelerate config, e.g. use multi-gpu ddp, fp16
 # will be to: ~/.cache/huggingface/accelerate/default_config.yaml
 
 ```
 
 ## Inference
+ To run inference with pretrained models, download the checkpoints from [🤗](https://huggingface.co/SWivid/F5-TTS).
 
 ### Single Inference
 You can test single inference using the following command. Before running it, modify the config as needed.
 
 ## Evaluation
 ### Prepare Test Datasets
 1. Seed-TTS test set: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
+ 2. LibriSpeech test-clean: Download from [OpenSLR](http://www.openslr.org/12/).
 3. Unzip the downloaded datasets and place them in the data/ directory.
+ 4. Update the path for the test-clean data in `test_infer_batch.py`.
+ 5. Our filtered LibriSpeech-PC 4-10s subset is already under data/ in this repo.
 
+ ### Batch Inference for Test Set
 To run batch inference for evaluations, execute the following commands:
 ```bash
 # batch inference for evaluations
 accelerate config # if not set before
 bash test_infer_batch.sh
 ```
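For reference, the WER metric in the objective evaluation that follows can be understood as word-level Levenshtein distance normalized by reference length. A small self-contained sketch (an illustration only, not the repo's implementation, which relies on the ASR toolkits listed below):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

wer_same = word_error_rate("the cat sat", "the cat sat")  # 0.0
wer_sub = word_error_rate("the cat sat", "the dog sat")   # one substitution, 1/3
```

In the actual pipeline, the hypothesis side comes from the ASR models' transcripts of the synthesized audio.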
 
+ ### Download Evaluation Model Checkpoints
+ 1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
+ 2. English ASR Model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
+ 3. WavLM Model: Download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).
+
+ ### Objective Evaluation
+ **Some Notes**\
+ For faster-whisper with CUDA 11: \
+ `pip install --force-reinstall ctranslate2==3.24.0`\
+ (Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:\
+ `pip install faster-whisper==0.10.1`
+
+ Update the path with your batch-inference results, and carry out WER / SIM evaluations:
 ```bash
 # Evaluation for Seed-TTS test set
 python scripts/eval_seedtts_testset.py
 
 - <a href="https://arxiv.org/abs/2407.05361">Emilia</a>, <a href="https://arxiv.org/abs/2406.05763">WenetSpeech4TTS</a> valuable datasets
 - <a href="https://github.com/lucidrains/e2-tts-pytorch">lucidrains</a> initial CFM structure with also <a href="https://github.com/bfs18">bfs18</a> for discussion
 - <a href="https://arxiv.org/abs/2403.03206">SD3</a> & <a href="https://github.com/huggingface/diffusers">Huggingface diffusers</a> DiT and MMDiT code structure
 - <a href="https://github.com/rtqichen/torchdiffeq">torchdiffeq</a> as ODE solver, <a href="https://huggingface.co/charactr/vocos-mel-24khz">Vocos</a> as vocoder
+ - <a href="https://x.com/realmrfakename">mrfakename</a> huggingface space demo ~
+ - <a href="https://github.com/modelscope/FunASR">FunASR</a>, <a href="https://github.com/SYSTRAN/faster-whisper">faster-whisper</a> & <a href="https://github.com/microsoft/UniSpeech">UniSpeech</a> for evaluation tools
 - <a href="https://github.com/MahmoudAshraf97/ctc-forced-aligner">ctc-forced-aligner</a> for speech edit test
 
 ## Citation
 ```
+ @article{chen-etal-2024-f5tts,
 title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
 author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
+ journal={arXiv preprint arXiv:2410.06885},
 year={2024},
 }
 ```
 ## LICENSE