initial updates for infer stuffs
- README.md +21 -82
- pyproject.toml +2 -1
- src/f5_tts/api.py +2 -2
- src/f5_tts/infer/README.md +92 -0
- src/f5_tts/infer/infer_cli.py +12 -6
- src/f5_tts/infer/utils_infer.py +3 -4
README.md
CHANGED
@@ -16,6 +16,9 @@
 
 ### Thanks to all the contributors !
 
+## News
+- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN).
+
 ## Installation
 
 ```bash
@@ -48,112 +51,48 @@ pip install -e .
 docker build -t f5tts:v1 .
 ```
 
-## Development
-
-Use pre-commit to ensure code quality (will run linters and formatters automatically)
-
-```bash
-pip install pre-commit
-pre-commit install
-```
-
-When making a pull request, before each commit, run:
-
-```bash
-pre-commit run --all-files
-```
-
-Note: Some model components have linting exceptions for E722 to accommodate tensor notation
 
 ## Inference
 
-```python
-import gradio as gr
-from f5_tts.gradio_app import app
-
-with gr.Blocks() as main_app:
-    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
-
-    # ... other Gradio components
+### 1. Basic usage
 
-    app.render()
-
-main_app.launch()
+```bash
+# cli inference
+f5-tts_infer-cli
 
+# gradio interface
+f5-tts_infer-gradio
 ```
 
-- A longer prompt audio allows shorter generated output. The part longer than 30s cannot be generated properly. Consider using a prompt audio <15s.
-- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
-- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses. If first few words skipped in code-switched generation (cuz different speed with different languages), this might help.
+### 2. More instructions
 
-### CLI Inference
-
-for change model use `--ckpt_file` to specify the model you want to load,
-for change vocab.txt use `--vocab_file` to provide your vocab.txt file.
-
-```bash
-# switch to the main directory
-cd f5_tts
-
-python inference-cli.py \
---model "F5-TTS" \
---ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
---ref_text "Some call me nature, others call me mother nature." \
---gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
-
-python inference-cli.py \
---model "E2-TTS" \
---ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
---ref_text "对,这就是我,万人敬仰的太乙真人。" \
---gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
-
-# Multi voice
-# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
-python inference-cli.py -c samples/story.toml
-```
-
-### Gradio App
-Currently supported features:
-- Chunk inference
-- Podcast Generation
-- Multiple Speech-Type Generation
-- Voice Chat powered by Qwen2.5-3B-Instruct
+- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer/README.md).
+- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; try searching for keywords of the problem you encounter first, and feel free to open an issue if no answer is found.
 
-You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`.
 
-```bash
-python f5_tts/gradio_app.py
-```
+## [Training](src/f5_tts/train/README.md)
 
-You can specify the port/host:
 
-```bash
-python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
-```
+## [Evaluation](src/f5_tts/eval/README.md)
 
-Or launch a share link:
 
+## Development
+
+Use pre-commit to ensure code quality (it will run linters and formatters automatically):
+
 ```bash
-python f5_tts/gradio_app.py --share
+pip install pre-commit
+pre-commit install
 ```
 
-To test speech editing capabilities, use the following command.
+When making a pull request, before each commit, run:
 
 ```bash
-python f5_tts/speech_edit.py
+pre-commit run --all-files
 ```
 
+Note: Some model components have linting exceptions for E722 to accommodate tensor notation.
 
-## [Evaluation](src/f5_tts/eval/README.md)
 
 ## Acknowledgements
pyproject.toml
CHANGED
@@ -55,4 +55,5 @@ eval = [
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"
+"f5-tts_infer-cli" = "f5_tts.infer.infer_cli:main"
+"f5-tts_infer-gradio" = "f5_tts.infer.infer_gradio:main"
src/f5_tts/api.py
CHANGED
@@ -130,8 +130,8 @@ if __name__ == "__main__":
     ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
     ref_text="some call me nature, others call me mother nature.",
     gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
-    file_wave=str(files("f5_tts").joinpath("../../
-    file_spect=str(files("f5_tts").joinpath("../../
+    file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+    file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
     seed=-1, # random seed = -1
 )
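The `__main__` block above exercises the Python API end to end, now writing its outputs into `tests/`. For reference, a minimal usage sketch of the same call from user code; the `F5TTS` wrapper class and the `(wav, sr, spect)` return convention are assumptions not shown in this diff:

```python
from importlib.resources import files

from f5_tts.api import F5TTS  # assumed export of f5_tts.api

f5tts = F5TTS()
wav, sr, spect = f5tts.infer(
    ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
    ref_text="some call me nature, others call me mother nature.",
    gen_text="A quick smoke test of the Python API.",
    file_wave="api_out.wav",   # hypothetical local output paths
    file_spect="api_out.png",
    seed=-1,  # -1 = random seed, as in the snippet above
)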
src/f5_tts/infer/README.md
ADDED
@@ -0,0 +1,92 @@
## Inference

The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.

A **single** generation currently supports up to **30s**, which is the **total length** including both prompt and output audio. For longer text, leverage `infer_cli` and `infer_gradio`, which automatically do chunked generation. A long reference audio will be clipped short to ~15s.

To avoid possible inference failures, make sure you have read through the following instructions.

- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce some pauses.
- Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they will be read in English.

# TODO 👇 ...

### CLI Inference

The new entry point `f5-tts_infer-cli` can be used in place of `python inference-cli.py` in the following commands.

You can either specify everything in `inference-cli.toml` or override it with flags. Leaving `--ref_text ""` will have the ASR model transcribe the reference audio automatically (using extra GPU memory). If you encounter a network error, consider using a local checkpoint: just set `ckpt_file` in `inference-cli.py`.

To change the model, use `--ckpt_file` to specify the checkpoint you want to load;
to change the vocabulary, use `--vocab_file` to provide your own vocab.txt file.

```bash
# switch to the main directory
cd f5_tts

python inference-cli.py \
--model "F5-TTS" \
--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
--ref_text "Some call me nature, others call me mother nature." \
--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."

python inference-cli.py \
--model "E2-TTS" \
--ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
--ref_text "对,这就是我,万人敬仰的太乙真人。" \
--gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"

# Multi voice
# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
python inference-cli.py -c samples/story.toml
```

### Gradio App

Currently supported features:

- Chunk inference
- Podcast Generation
- Multiple Speech-Type Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

The new entry point `f5-tts_infer-gradio` can be used in place of the following commands.

You can launch a Gradio app (web interface) as a GUI for inference (it will load checkpoints from Hugging Face; you may also use a local file, see `gradio_app.py`). It currently loads the ASR model, F5-TTS, and E2 TTS all at once, thus using more GPU memory than `inference-cli`.

```bash
python f5_tts/gradio_app.py
```

You can specify the port/host:

```bash
python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
```

Or launch a share link:

```bash
python f5_tts/gradio_app.py --share
```

The app can also be embedded within a larger Gradio app:

```python
import gradio as gr
from f5_tts.gradio_app import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()
```

### Speech Editing

To test speech editing capabilities, use the following command.

```bash
python f5_tts/speech_edit.py
```
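One of the tips in the new README above is to preprocess numbers into Chinese characters when they should be read in Chinese. A naive illustrative helper; this is not a function shipped with F5-TTS, and real text normalization would also handle place values like 十/百/千:

```python
# Map ASCII digits to Chinese characters so TTS reads them in Chinese,
# digit by digit: "404" -> "四零四".
CN_DIGITS = str.maketrans("0123456789", "零一二三四五六七八九")

def digits_to_chinese(text: str) -> str:
    return text.translate(CN_DIGITS)

print(digits_to_chinese("房间号是 404"))  # -> 房间号是 四零四
```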
src/f5_tts/infer/infer_cli.py
CHANGED
@@ -21,15 +21,15 @@ from f5_tts.infer.utils_infer import (
 
 
 parser = argparse.ArgumentParser(
-    prog="python3
+    prog="python3 infer-cli.py",
     description="Commandline interface for E2/F5 TTS with Advanced Batch Processing.",
-    epilog="Specify
+    epilog="Specify options above to override one or more settings from config.",
 )
 parser.add_argument(
     "-c",
     "--config",
-    help="Configuration file. Default=
-    default=os.path.join(files("f5_tts").joinpath("
+    help="Configuration file. Default=infer/examples/basic/basic.toml",
+    default=os.path.join(files("f5_tts").joinpath("infer/examples/basic"), "basic.toml"),
 )
 parser.add_argument(
     "-m",
@@ -80,6 +80,8 @@ args = parser.parse_args()
 config = tomli.load(open(args.config, "rb"))
 
 ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
+if "src/f5_tts/infer/examples/basic" in ref_audio:  # for pip pkg user
+    ref_audio = str(files("f5_tts").joinpath(f"../../{ref_audio}"))
 ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
 gen_text = args.gen_text if args.gen_text else config["gen_text"]
 gen_file = args.gen_file if args.gen_file else config["gen_file"]
@@ -90,8 +92,8 @@ model = args.model if args.model else config["model"]
 ckpt_file = args.ckpt_file if args.ckpt_file else ""
 vocab_file = args.vocab_file if args.vocab_file else ""
 remove_silence = args.remove_silence if args.remove_silence else config["remove_silence"]
-wave_path = Path(output_dir) / "
-spectrogram_path = Path(output_dir) / "
+wave_path = Path(output_dir) / "infer_cli_out.wav"
+# spectrogram_path = Path(output_dir) / "infer_cli_out.png"
 vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
 
 vocos = load_vocoder(is_local=args.load_vocoder_from_local, local_path=vocos_local_path)
@@ -161,6 +163,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
 
     if generated_audio_segments:
         final_wave = np.concatenate(generated_audio_segments)
+
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+
         with open(wave_path, "wb") as f:
             sf.write(f.name, final_wave, final_sample_rate)
         # Remove silence
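The resolution logic above follows one rule: a CLI flag wins over the TOML config, and `--ref_text` uses the sentinel default `"666"` so that an explicitly empty string (meaning "auto-transcribe the reference") still counts as provided. A self-contained sketch of that pattern, with a stand-in config path and keys:

```python
import argparse

import tomli

parser = argparse.ArgumentParser()
parser.add_argument("-c", "--config", default="basic.toml")  # stand-in path
parser.add_argument("--ref_audio")
parser.add_argument("--ref_text", default="666")  # sentinel: "not provided"
args = parser.parse_args()

with open(args.config, "rb") as f:
    config = tomli.load(f)

# Flag overrides config; an explicit --ref_text "" survives as a real value.
ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
print(ref_audio, repr(ref_text))
```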
src/f5_tts/infer/utils_infer.py
CHANGED
@@ -186,13 +186,12 @@ def preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=print, device=
     non_silent_segs = silence.split_on_silence(aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000)
     non_silent_wave = AudioSegment.silent(duration=0)
     for non_silent_seg in non_silent_segs:
+        if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
+            show_info("Audio is over 18s, clipping short.")
+            break
         non_silent_wave += non_silent_seg
     aseg = non_silent_wave
 
-    audio_duration = len(aseg)
-    if audio_duration > 15000:
-        show_info("Audio is over 15s, clipping to only first 15s.")
-        aseg = aseg[:15000]
     aseg.export(f.name, format="wav")
     ref_audio = f.name
 
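The new clipping rule replaces the old hard cut at 15s: non-silent chunks keep accumulating until at least 10s has been collected and the next chunk would push the total past 18s, so the reference ends on a silence boundary instead of mid-word. A pure-logic sketch of the same loop, operating on chunk durations only (milliseconds, as pydub's `len()` reports):

```python
def clip_reference(durations_ms: list[int], soft_min=10_000, hard_max=18_000) -> list[int]:
    """Mirror of the loop above: keep chunks until adding one more would
    exceed hard_max while at least soft_min is already collected."""
    kept, total = [], 0
    for d in durations_ms:
        if total > soft_min and total + d > hard_max:
            break  # "Audio is over 18s, clipping short."
        kept.append(d)
        total += d
    return kept

# Third chunk would bring the total from 11s to 20s, so it is dropped:
print(clip_reference([6_000, 5_000, 9_000]))  # -> [6000, 5000]
```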