SWivid committed
Commit 8640433 · Parent(s): d280126

initial updates for infer stuffs
README.md CHANGED
@@ -16,6 +16,9 @@
 
 ### Thanks to all the contributors !
 
+## News
+- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN).
+
 ## Installation
 
 ```bash
@@ -48,112 +51,48 @@ pip install -e .
 docker build -t f5tts:v1 .
 ```
 
-## Development
-
-Use pre-commit to ensure code quality (will run linters and formatters automatically)
-
-```bash
-pip install pre-commit
-pre-commit install
-```
-
-When making a pull request, before each commit, run:
-
-```bash
-pre-commit run --all-files
-```
-
-Note: Some model components have linting exceptions for E722 to accommodate tensor notation
-
 ## Inference
 
-```python
-import gradio as gr
-from f5_tts.gradio_app import app
-
-with gr.Blocks() as main_app:
-    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
-
-    # ... other Gradio components
-
-    app.render()
-
-main_app.launch()
-```
-
-The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
-
-Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
-- To avoid possible inference failures, make sure you have seen through the following instructions.
-- A longer prompt audio allows shorter generated output. The part longer than 30s cannot be generated properly. Consider using a prompt audio <15s.
-- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
-- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses. If first few words skipped in code-switched generation (cuz different speed with different languages), this might help.
-
-### CLI Inference
-
-Either you can specify everything in `inference-cli.toml` or override with flags. Leave `--ref_text ""` will have ASR model transcribe the reference audio automatically (use extra GPU memory). If encounter network error, consider use local ckpt, just set `ckpt_file` in `inference-cli.py`
-
-for change model use `--ckpt_file` to specify the model you want to load,
-for change vocab.txt use `--vocab_file` to provide your vocab.txt file.
-
-```bash
-# switch to the main directory
-cd f5_tts
-
-python inference-cli.py \
---model "F5-TTS" \
---ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
---ref_text "Some call me nature, others call me mother nature." \
---gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
-
-python inference-cli.py \
---model "E2-TTS" \
---ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
---ref_text "对,这就是我,万人敬仰的太乙真人。" \
---gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
-
-# Multi voice
-# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
-python inference-cli.py -c samples/story.toml
-```
-
-### Gradio App
-Currently supported features:
-- Chunk inference
-- Podcast Generation
-- Multiple Speech-Type Generation
-- Voice Chat powered by Qwen2.5-3B-Instruct
-
-You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`.
-
-```bash
-python f5_tts/gradio_app.py
-```
-
-You can specify the port/host:
-
-```bash
-python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
-```
-
-Or launch a share link:
-
-```bash
-python f5_tts/gradio_app.py --share
-```
-
-### Speech Editing
-
-To test speech editing capabilities, use the following command.
-
-```bash
-python f5_tts/speech_edit.py
-```
-
-## [Training](src/f5_tts/train/README.md)
-
-## [Evaluation](src/f5_tts/eval/README.md)
+### 1. Basic usage
+
+```bash
+# CLI inference
+f5-tts_infer-cli
+
+# Gradio interface
+f5-tts_infer-gradio
+```
+
+### 2. More instructions
+
+- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer/README.md).
+- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; search it with keywords from the problem you encounter, and feel free to open an issue if no answer turns up.
+
+## [Training](src/f5_tts/train/README.md)
+
+## [Evaluation](src/f5_tts/eval/README.md)
+
+## Development
+
+Use pre-commit to ensure code quality (it will run linters and formatters automatically):
+
+```bash
+pip install pre-commit
+pre-commit install
+```
+
+When making a pull request, before each commit, run:
+
+```bash
+pre-commit run --all-files
+```
+
+Note: Some model components have linting exceptions for E722 to accommodate tensor notation.
 
 ## Acknowledgements
pyproject.toml CHANGED
@@ -55,4 +55,5 @@ eval = [
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"inference-cli" = "f5_tts.inference_cli:main"
+"f5-tts_infer-cli" = "f5_tts.infer.infer_cli:main"
+"f5-tts_infer-gradio" = "f5_tts.infer.infer_gradio:main"
src/f5_tts/api.py CHANGED
@@ -130,8 +130,8 @@ if __name__ == "__main__":
     ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
     ref_text="some call me nature, others call me mother nature.",
     gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
-    file_wave=str(files("f5_tts").joinpath("../../api_test_out.wav")),
-    file_spect=str(files("f5_tts").joinpath("../../api_test_out.png")),
+    file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+    file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
     seed=-1,  # random seed = -1
 )
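
For context, these lines sit inside the module's `__main__` test harness; a self-contained sketch of the same call, assuming an `F5TTS` class whose `infer` method accepts these keyword arguments (the class and method names are assumptions, since the hunk shows only the arguments):

```python
from importlib.resources import files

from f5_tts.api import F5TTS  # assumed class name

f5tts = F5TTS()
f5tts.infer(
    ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
    ref_text="some call me nature, others call me mother nature.",
    gen_text="I don't really care what you call me.",  # shortened for the sketch
    file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),  # new output location
    file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
    seed=-1,  # random seed = -1
)
```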
 
src/f5_tts/infer/README.md ADDED
@@ -0,0 +1,92 @@
+## Inference
+
+The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be downloaded automatically when running the inference scripts.
+
+A single generation currently supports **up to 30s**, which is the **total length** of the prompt audio plus the generated output. For longer text, use `infer_cli` or `infer_gradio`, which automatically generate in chunks. Long reference audio is clipped to roughly 15s.
+
+To avoid possible inference failures, make sure you have read through the following instructions.
+
+- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
+- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce some pauses.
+- Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they are read in English.
+
+# TODO 👇 ...
+
+### CLI Inference
+
+The commands below can also be run through the `f5-tts_infer-cli` entry point.
+
+You can either specify everything in `inference-cli.toml` or override settings with flags. Leaving `--ref_text ""` will have an ASR model transcribe the reference audio automatically (this uses extra GPU memory). If you encounter a network error, consider using a local checkpoint: just set `ckpt_file` in `inference-cli.py`.
+
+To change the model, use `--ckpt_file` to specify the checkpoint you want to load;
+to change vocab.txt, use `--vocab_file` to provide your own vocab.txt file.
+
+```bash
+# switch to the main directory
+cd f5_tts
+
+python inference-cli.py \
+--model "F5-TTS" \
+--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
+--ref_text "Some call me nature, others call me mother nature." \
+--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
+
+python inference-cli.py \
+--model "E2-TTS" \
+--ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
+--ref_text "对,这就是我,万人敬仰的太乙真人。" \
+--gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
+
+# Multi voice
+# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
+python inference-cli.py -c samples/story.toml
+```
+
+### Gradio App
+
+Currently supported features:
+
+- Chunk inference
+- Podcast Generation
+- Multiple Speech-Type Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+
+The app below can also be launched through the `f5-tts_infer-gradio` entry point.
+
+You can launch a Gradio app (web interface) as a GUI for inference (it loads the checkpoint from Hugging Face; you may also point to a local file in `gradio_app.py`). It currently loads the ASR model, F5-TTS, and E2 TTS all at once, and therefore uses more GPU memory than `inference-cli`.
+
+```bash
+python f5_tts/gradio_app.py
+```
+
+You can specify the port/host:
+
+```bash
+python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
+```
+
+Or launch a share link:
+
+```bash
+python f5_tts/gradio_app.py --share
+```
+
+The app can also be embedded inside a larger Gradio app:
+
+```python
+import gradio as gr
+from f5_tts.gradio_app import app
+
+with gr.Blocks() as main_app:
+    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
+
+    # ... other Gradio components
+
+    app.render()
+
+main_app.launch()
+```
+
+### Speech Editing
+
+To test speech editing capabilities, use the following command:
+
+```bash
+python f5_tts/speech_edit.py
+```
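
Since the TODO section above still describes the old `inference-cli.py` flow, it may help to see how the script merges its TOML config with command-line flags. A minimal sketch of the resolution logic in `infer_cli.py` (the config keys are the ones the script reads; the values are illustrative, not the shipped defaults):

```python
# Every setting comes from the TOML config unless a flag overrides it.
import tomli

config = tomli.loads(
    """
    model = "F5-TTS"
    ref_audio = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
    ref_text = "some call me nature, others call me mother nature."
    gen_text = "I don't really care what you call me."
    gen_file = ""
    remove_silence = false
    """
)

# Flag-over-config resolution, mirroring infer_cli.py; note the "666" sentinel
# the script uses to detect that --ref_text was left at its default.
args_ref_text = "666"  # stand-in for args.ref_text
ref_text = args_ref_text if args_ref_text != "666" else config["ref_text"]
print(ref_text)  # prints the value from the config
```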
src/f5_tts/infer/infer_cli.py CHANGED
@@ -21,15 +21,15 @@ from f5_tts.infer.utils_infer import (
 
 
 parser = argparse.ArgumentParser(
-    prog="python3 inference-cli.py",
+    prog="python3 infer-cli.py",
     description="Commandline interface for E2/F5 TTS with Advanced Batch Processing.",
     epilog="Specify options above to override one or more settings from config.",
 )
 parser.add_argument(
     "-c",
     "--config",
-    help="Configuration file. Default=inference-cli.toml",
-    default=os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml"),
+    help="Configuration file. Default=infer/examples/basic/basic.toml",
+    default=os.path.join(files("f5_tts").joinpath("infer/examples/basic"), "basic.toml"),
 )
 parser.add_argument(
     "-m",
@@ -80,6 +80,8 @@ args = parser.parse_args()
 config = tomli.load(open(args.config, "rb"))
 
 ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
+if "src/f5_tts/infer/examples/basic" in ref_audio:  # for pip pkg user
+    ref_audio = str(files("f5_tts").joinpath(f"../../{ref_audio}"))
 ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
 gen_text = args.gen_text if args.gen_text else config["gen_text"]
 gen_file = args.gen_file if args.gen_file else config["gen_file"]
@@ -90,8 +92,8 @@ model = args.model if args.model else config["model"]
 ckpt_file = args.ckpt_file if args.ckpt_file else ""
 vocab_file = args.vocab_file if args.vocab_file else ""
 remove_silence = args.remove_silence if args.remove_silence else config["remove_silence"]
-wave_path = Path(output_dir) / "out.wav"
-spectrogram_path = Path(output_dir) / "out.png"
+wave_path = Path(output_dir) / "infer_cli_out.wav"
+# spectrogram_path = Path(output_dir) / "infer_cli_out.png"
 vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
 
 vocos = load_vocoder(is_local=args.load_vocoder_from_local, local_path=vocos_local_path)
@@ -161,6 +163,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
 
     if generated_audio_segments:
         final_wave = np.concatenate(generated_audio_segments)
+
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+
         with open(wave_path, "wb") as f:
             sf.write(f.name, final_wave, final_sample_rate)
         # Remove silence
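
The new `# for pip pkg user` branch deserves a note: paths in the shipped example config are written relative to the repo root, so when f5_tts runs as an installed package the script remaps them against the package's own location. A standalone sketch of that remap (the input value is illustrative):

```python
from importlib.resources import files

# basic.toml ships repo-relative paths; rewrite them so they resolve from
# wherever the f5_tts package is actually installed.
ref_audio = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"  # illustrative config value
if "src/f5_tts/infer/examples/basic" in ref_audio:  # for pip pkg user
    ref_audio = str(files("f5_tts").joinpath(f"../../{ref_audio}"))
print(ref_audio)
```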
src/f5_tts/infer/utils_infer.py CHANGED
@@ -186,13 +186,12 @@ def preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=print, device=
         non_silent_segs = silence.split_on_silence(aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000)
         non_silent_wave = AudioSegment.silent(duration=0)
         for non_silent_seg in non_silent_segs:
+            if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
+                show_info("Audio is over 18s, clipping short.")
+                break
             non_silent_wave += non_silent_seg
         aseg = non_silent_wave
 
-        audio_duration = len(aseg)
-        if audio_duration > 15000:
-            show_info("Audio is over 15s, clipping to only first 15s.")
-            aseg = aseg[:15000]
         aseg.export(f.name, format="wav")
         ref_audio = f.name
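
The behavior change above is easiest to see in isolation. A standalone sketch of the new clipping rule (the function name and file I/O are illustrative; the thresholds and silence-splitting parameters match the hunk):

```python
from pydub import AudioSegment, silence


def clip_ref_audio(in_path: str, out_path: str) -> None:
    """Keep appending non-silent segments, but stop once the clip already
    exceeds 10s and appending the next one would push it past 18s."""
    aseg = AudioSegment.from_file(in_path)
    non_silent_segs = silence.split_on_silence(
        aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000
    )
    non_silent_wave = AudioSegment.silent(duration=0)
    for non_silent_seg in non_silent_segs:  # pydub lengths are in milliseconds
        if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
            print("Audio is over 18s, clipping short.")  # show_info in the original
            break
        non_silent_wave += non_silent_seg
    non_silent_wave.export(out_path, format="wav")
```

Unlike the old hard cut at exactly 15s, this never slices through the middle of a speech segment, which is presumably why the READMEs now promise only a clip to "roughly 15s".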