finish infer dependencies; update readmes
- README.md +34 -5
- src/f5_tts/eval/README.md +5 -1
- src/f5_tts/infer/README.md +74 -55
- src/f5_tts/infer/infer_cli.py +12 -2
- src/f5_tts/infer/speech_edit.py +7 -5
- src/f5_tts/train/README.md +1 -0
README.md
CHANGED
@@ -54,17 +54,46 @@ docker build -t f5tts:v1 .

## Inference

### 1. Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

```bash
# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share
```

### 2. CLI Inference

```bash
# Run with flags
# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "The content, subtitle or transcription of reference audio." \
    --gen_text "Some text you want the TTS model to generate for you."

# Run with the default settings in src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
```

### 3. More instructions

- To get better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer).
- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; please search for the keywords of the problem you encountered before opening a new issue. If no answer is found, feel free to open one.
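The "Chunk Inference" feature above means long input text is split so each generation stays within the model's single-pass budget. A minimal sketch of that idea, using a hypothetical `chunk_text` helper (the app's actual splitting logic and budget differ):

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into chunks of at most roughly max_chars.

    Hypothetical sketch of chunked generation, not the app's real code.
    """
    # Split on sentence-ending punctuation (keeping the delimiter),
    # covering both English and CJK full-width marks.
    sentences = [s for s in re.split(r"(?<=[.!?。!?])\s*", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += (" " if current else "") + sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized separately and the audio segments are concatenated.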
src/f5_tts/eval/README.md
CHANGED
@@ -1,5 +1,5 @@

# Evaluation

Install packages for evaluation:

@@ -7,6 +7,8 @@ Install packages for evaluation:
pip install -e .[eval]
```

## Generating Samples for Evaluation

### Prepare Test Datasets

1. *Seed-TTS testset*: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
@@ -25,6 +27,8 @@ accelerate config  # if not set before
bash src/f5_tts/eval/eval_infer_batch.sh
```

## Objective Evaluation on Generated Results

### Download Evaluation Model Checkpoints

1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
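The ASR checkpoints above are used to transcribe generated samples and score them against the target text, typically with word error rate (WER). A minimal WER sketch (word-level edit distance over reference length; the repo's actual metric code may normalize text differently):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance: substitutions,
    # insertions, and deletions each cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```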
src/f5_tts/infer/README.md
CHANGED
@@ -1,8 +1,8 @@

# Inference

The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.

Currently supports **30s for a single** generation, which is the **total length** including both prompt and output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text, which will automatically be chunked for generation. Long reference audio will be **clipped to ~15s**.

To avoid possible inference failures, make sure you have read through the following instructions.

@@ -10,83 +10,102 @@ To avoid possible inference failures, make sure you have seen through the follow

- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses.
- Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English.

## Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS models are loaded at first; an ASR model is loaded for transcription if `ref_text` is not provided, and an LLM is loaded when Voice Chat is used.

It can also be used as a component of a larger application:
```python
import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()
```

## CLI Inference

The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command-line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use `--ckpt_file` to specify the model you want to load, or directly update the path in `infer_cli.py`.

To use a custom vocabulary, pass your `vocab.txt` file with `--vocab_file`.

Basically you can run inference with flags:
```bash
# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "The content, subtitle or transcription of reference audio." \
    --gen_text "Some text you want the TTS model to generate for you."
```

And a `.toml` file allows more flexible usage:

```bash
f5-tts_infer-cli -c custom.toml
```

For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"
```

You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
```

You should mark the text with `[main]` `[town]` `[country]` whenever you want to change voice; refer to `src/f5_tts/infer/examples/multi/story.txt`.

## Speech Editing

To test speech editing capabilities, use the following command:

```bash
python src/f5_tts/infer/speech_edit.py
```
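The multi-voice text in `story.txt` switches voices with bracketed markers like `[main]` and `[town]`. A minimal sketch of how such markers could be split into (voice, text) segments (hypothetical parser for illustration, not necessarily how `infer_cli.py` implements it):

```python
import re


def split_voices(text: str, default: str = "main") -> list[tuple[str, str]]:
    """Split "[town] Hello! [main] Hi." into (voice, text) pairs.

    Hypothetical sketch; segments before the first marker fall back to
    the default voice.
    """
    segments = []
    voice = default
    # re.split with a capturing group alternates between the text and
    # the captured voice names: [lead, voice1, text1, voice2, text2, ...]
    parts = re.split(r"\[(\w+)\]", text)
    if parts[0].strip():
        segments.append((voice, parts[0].strip()))
    for i in range(1, len(parts), 2):
        voice, chunk = parts[i], parts[i + 1].strip()
        if chunk:
            segments.append((voice, chunk))
    return segments
```

Each segment would then be synthesized with the reference audio/text configured for that voice in the `.toml` file.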
src/f5_tts/infer/infer_cli.py
CHANGED
@@ -80,11 +80,21 @@ args = parser.parse_args()

config = tomli.load(open(args.config, "rb"))

ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
gen_text = args.gen_text if args.gen_text else config["gen_text"]
gen_file = args.gen_file if args.gen_file else config["gen_file"]

# patches for pip pkg user
if "infer/examples/" in ref_audio:
    ref_audio = str(files("f5_tts").joinpath(f"{ref_audio}"))
if "infer/examples/" in gen_file:
    gen_file = str(files("f5_tts").joinpath(f"{gen_file}"))
if "voices" in config:
    for voice in config["voices"]:
        voice_ref_audio = config["voices"][voice]["ref_audio"]
        if "infer/examples/" in voice_ref_audio:
            config["voices"][voice]["ref_audio"] = str(files("f5_tts").joinpath(f"{voice_ref_audio}"))

if gen_file:
    gen_text = codecs.open(gen_file, "r", "utf-8").read()
output_dir = args.output_dir if args.output_dir else config["output_dir"]
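The pip-package patch above remaps bundled example paths onto the installed package location with `importlib.resources.files`, so the default `.toml` files keep working outside the source tree. The same idea in isolation, as a generic helper written for illustration:

```python
from importlib.resources import files


def resolve_packaged_path(path: str, package: str) -> str:
    """Remap a relative example path into the installed package tree.

    Mirrors the patch above: only paths under infer/examples/ are
    rewritten; anything else is returned untouched.
    """
    if "infer/examples/" in path:
        # joinpath does not require the target to exist, so this is safe
        # to call before checking the file.
        return str(files(package).joinpath(path))
    return path
```

For a pip install, `files("f5_tts")` resolves to the site-packages copy of the project, which is why the relative `infer/examples/...` paths in the shipped configs still point at real files.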
src/f5_tts/infer/speech_edit.py
CHANGED
@@ -7,11 +7,13 @@ from vocos import Vocos

from f5_tts.model import CFM, UNetT, DiT
from f5_tts.model.utils import (
    get_tokenizer,
    convert_char_to_pinyin,
)
from f5_tts.infer.utils_infer import (
    load_checkpoint,
    save_spectrogram,
)

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

@@ -54,12 +56,12 @@ output_dir = "tests"

# [leverage https://github.com/MahmoudAshraf97/ctc-forced-aligner to get char level alignment]
# pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
# [write the origin_text into a file, e.g. tests/test_edit.txt]
# ctc-forced-aligner --audio_path "src/f5_tts/infer/examples/basic/basic_ref_en.wav" --text_path "tests/test_edit.txt" --language "zho" --romanize --split_size "char"
# [result will be saved at same path of audio file]
# [--language "zho" for Chinese, "eng" for English]
# [if local ckpt, set --alignment_model "../checkpoints/mms-300m-1130-forced-aligner"]

audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
origin_text = "Some call me nature, others call me mother nature."
target_text = "Some call me optimist, others call me realist."
parts_to_edit = [

@@ -71,7 +73,7 @@ fix_duration = [

    1,
]  # fix duration for "optimist" & "realist", in seconds

# audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_zh.wav"
# origin_text = "对,这就是我,万人敬仰的太乙真人。"
# target_text = "对,那就是你,万人敬仰的太白金星。"
# parts_to_edit = [[0.84, 1.4], [1.92, 2.4], [4.26, 6.26], ]
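`parts_to_edit` and `fix_duration` are given in seconds, while the model operates on mel-spectrogram frames. A sketch of that conversion, assuming 24 kHz audio and a hop length of 256 (the defaults used elsewhere in F5-TTS; verify against the constants in `speech_edit.py` itself):

```python
def seconds_to_frames(parts, sample_rate=24000, hop_length=256):
    """Convert [start, end] spans in seconds to mel-frame indices.

    sample_rate and hop_length are assumed defaults, not read from the
    script; check speech_edit.py for the authoritative values.
    """
    frames_per_second = sample_rate / hop_length  # 93.75 with these defaults
    return [
        [round(start * frames_per_second), round(end * frames_per_second)]
        for start, end in parts
    ]
```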
src/f5_tts/train/README.md
CHANGED
@@ -1,3 +1,4 @@

# Training

## Prepare Dataset