# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
[Code](https://github.com/SWivid/F5-TTS) | [Paper](https://arxiv.org/abs/2410.06885) | [Demo Page](https://swivid.github.io/F5-TTS/) | [🤗 Hugging Face Space](https://huggingface.co/spaces/mrfakename/E2-F5-TTS) | [🤖 ModelScope Studio](https://modelscope.cn/studios/modelscope/E2-F5-TTS) | [X-LANCE](https://x-lance.sjtu.edu.cn/) | [PCL](https://www.pcl.ac.cn)
<!-- <img src="https://github.com/user-attachments/assets/12d7749c-071a-427c-81bf-b87b91def670" alt="Watermark" style="width: 40px; height: auto"> -->
**F5-TTS**: Diffusion Transformer with ConvNeXt V2; trains faster and runs faster at inference.
**E2 TTS**: Flat-UNet Transformer, the closest reproduction of the [paper](https://arxiv.org/abs/2406.18009).
**Sway Sampling**: Inference-time flow-step sampling strategy that greatly improves performance.
### Thanks to all the contributors!
## News
- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), [🟣 Wisemodel](https://wisemodel.cn/models/SJTU_X-LANCE/F5-TTS_Emilia-ZH-EN).
## Installation
```bash
# Create a python 3.10 conda env (you could also use virtualenv)
conda create -n f5-tts python=3.10
conda activate f5-tts
# NVIDIA GPU: install pytorch with your CUDA version, e.g.
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
# AMD GPU: install pytorch with your ROCm version, e.g. (Linux only)
pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
# Intel GPU: install pytorch with your XPU version, e.g.
# Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
pip install --pre torch torchaudio --index-url https://download.pytorch.org/whl/nightly/xpu
```
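Before picking an install option, it can help to confirm that PyTorch actually sees your GPU. A minimal sanity check (the XPU line only applies if you installed the Intel build):
```bash
# NVIDIA (CUDA) and AMD (ROCm) builds both report through torch.cuda
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
# Intel GPU (XPU) builds report through torch.xpu
python -c "import torch; print(torch.xpu.is_available())"
```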
Then you can choose from a few options below:
### 1. As a pip package (if just for inference)
```bash
pip install git+https://github.com/SWivid/F5-TTS.git
```
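Installing the package should place the `f5-tts_*` console scripts used in the sections below on your PATH. Assuming they expose the usual `--help` flag, a quick check looks like:
```bash
# Quick sanity check that the CLI entry points are installed
f5-tts_infer-cli --help
f5-tts_infer-gradio --help
```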
### 2. Local editable (if you also want to do training or finetuning)
```bash
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
# git submodule update --init --recursive  # (optional, only if you need BigVGAN)
pip install -e .
```
### 3. Docker usage
```bash
# Build from Dockerfile
docker build -t f5tts:v1 .
# Or pull from GitHub Container Registry
docker pull ghcr.io/swivid/f5-tts:main
```
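As a rough sketch (not the project's documented invocation), the pulled image can be run like this, assuming the `f5-tts_*` entry points are on the container's PATH and the NVIDIA Container Toolkit is set up for `--gpus`:
```bash
# Run the container and expose the Gradio web UI on port 7860
docker run --rm -it --gpus all -p 7860:7860 \
    ghcr.io/swivid/f5-tts:main \
    f5-tts_infer-gradio --host 0.0.0.0 --port 7860
```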
## Inference
### 1. Gradio App
Currently supported features:
- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct
- [Custom inference with more language support](src/f5_tts/infer/SHARED.md)
```bash
# Launch a Gradio app (web interface)
f5-tts_infer-gradio
# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0
# Launch a share link
f5-tts_infer-gradio --share
```
### 2. CLI Inference
```bash
# Run with flags
# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
--model "F5-TTS" \
--ref_audio "ref_audio.wav" \
--ref_text "The content, subtitle or transcription of reference audio." \
--gen_text "Some text you want TTS model generate for you."
# Run with the default settings in src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml
# Multi-voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
```
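As noted in the comments above, leaving `--ref_text` empty hands transcription of the reference audio to an ASR model at the cost of extra GPU memory; a minimal example of that path:
```bash
# Let the ASR model transcribe the reference audio automatically
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "" \
    --gen_text "Some text you want the TTS model to generate for you."
```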
### 3. More instructions
- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer).
- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; please search for keywords related to the problem you encountered before opening a new one. If no answer is found, feel free to open an issue.
## Training
### 1. Gradio App
Read [training & finetuning guidance](src/f5_tts/train) for more instructions.
```bash
# Quick start with Gradio web interface
f5-tts_finetune-gradio
```
## [Evaluation](src/f5_tts/eval)
## Development
Use pre-commit to ensure code quality (it will run linters and formatters automatically).
```bash
pip install pre-commit
pre-commit install
```
When making a pull request, run the following before each commit:
```bash
pre-commit run --all-files
```
Note: Some model components have linting exceptions for E722 to accommodate tensor notation.
## Acknowledgements
- [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective
- [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets
- [lucidrains](https://github.com/lucidrains) initial CFM structure with also [bfs18](https://github.com/bfs18) for discussion
- [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoder
- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
- [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for speech edit test
- [mrfakename](https://x.com/realmrfakename) huggingface space demo ~
- [f5-tts-mlx](https://github.com/lucasnewman/f5-tts-mlx/tree/main) Implementation with MLX framework by [Lucas Newman](https://github.com/lucasnewman)
- [F5-TTS-ONNX](https://github.com/DakeQQ/F5-TTS-ONNX) ONNX Runtime version by [DakeQQ](https://github.com/DakeQQ)
## Citation
If our work and codebase are useful for you, please cite as:
```
@article{chen-etal-2024-f5tts,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
journal={arXiv preprint arXiv:2410.06885},
year={2024},
}
```
## License
Our code is released under the MIT License. The pre-trained models are licensed under the CC-BY-NC license due to the training data Emilia, which is an in-the-wild dataset. Sorry for any inconvenience this may cause.