SWivid committed
Commit f69a602 · 1 Parent(s): adca73b

finish infer dependencies; update readmes

README.md CHANGED
@@ -54,17 +54,46 @@ docker build -t f5tts:v1 .
 
 ## Inference
 
-### 1. Basic usage
+### 1. Gradio App
+
+Currently supported features:
+
+- Basic TTS with Chunk Inference
+- Multi-Style / Multi-Speaker Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+
+```bash
+# Launch a Gradio app (web interface)
+f5-tts_infer-gradio
+
+# Specify the port/host
+f5-tts_infer-gradio --port 7860 --host 0.0.0.0
+
+# Launch a share link
+f5-tts_infer-gradio --share
+```
+
+### 2. CLI Inference
 
 ```bash
-# cli inference
+# Run with flags
+# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
+f5-tts_infer-cli \
+    --model "F5-TTS" \
+    --ref_audio "ref_audio.wav" \
+    --ref_text "The content, subtitle or transcription of reference audio." \
+    --gen_text "Some text you want the TTS model to generate for you."
+
+# Run with the default setting: src/f5_tts/infer/examples/basic/basic.toml
 f5-tts_infer-cli
+# Or with your own .toml file
+f5-tts_infer-cli -c custom.toml
 
-# gradio interface
-f5-tts_infer-gradio
+# Multi voice. See src/f5_tts/infer/README.md
+f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
 ```
 
-### 2. More instructions
+### 3. More instructions
 
 - In order to have better generation results, take a moment to read [detailed guidance](src/f5_tts/infer).
 - The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) are very useful; please try to find a solution by searching for keywords of the problem encountered. If no answer is found, feel free to open an issue.
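
As a usage note for the commands above: the Gradio app can also be launched from Python instead of the console script. A minimal sketch, assuming `app` in `infer_gradio.py` is a `gr.Blocks` instance (as the component example in `src/f5_tts/infer/README.md` suggests):

```python
# Minimal sketch: launch the bundled Gradio app from Python rather than via
# the f5-tts_infer-gradio console script. Assumes `app` is a gr.Blocks object.
from f5_tts.infer.infer_gradio import app

app.launch(server_port=7860, server_name="0.0.0.0")  # mirrors --port / --host
```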
src/f5_tts/eval/README.md CHANGED
@@ -1,5 +1,5 @@
 
-## Evaluation
+# Evaluation
 
 Install packages for evaluation:
 
@@ -7,6 +7,8 @@ Install packages for evaluation:
 pip install -e .[eval]
 ```
 
+## Generating Samples for Evaluation
+
 ### Prepare Test Datasets
 
 1. *Seed-TTS testset*: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
@@ -25,6 +27,8 @@ accelerate config # if not set before
 bash src/f5_tts/eval/eval_infer_batch.sh
 ```
 
+## Objective Evaluation on Generated Results
+
 ### Download Evaluation Model Checkpoints
 
 1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
src/f5_tts/infer/README.md CHANGED
@@ -1,8 +1,8 @@
-## Inference
+# Inference
 
 The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.
 
-Currently support **30s for a single** generation, which is the **total length** including both prompt and output audio. However, you can leverage `infer_cli` and `infer_gradio` for longer text, will automatically do chunk generation. Long reference audio will be clip short to ~15s.
+A single generation currently supports a maximum of **30s**, which is the **total length** including both prompt and output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text; they will automatically do chunked generation. Long reference audio will be **clipped to ~15s**.
 
 To avoid possible inference failures, make sure you have read through the following instructions.
 
@@ -10,83 +10,102 @@ To avoid possible inference failures, make sure you have read through the follow
 - Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses.
 - Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English.
 
-# TODO 👇 ...
-
-### CLI Inference
-
-It is possible to use cli `f5-tts_infer-cli` for following commands.
-
-Either you can specify everything in `inference-cli.toml` or override with flags. Leave `--ref_text ""` will have ASR model transcribe the reference audio automatically (use extra GPU memory). If encounter network error, consider use local ckpt, just set `ckpt_file` in `inference-cli.py`
-
-for change model use `--ckpt_file` to specify the model you want to load,
-for change vocab.txt use `--vocab_file` to provide your vocab.txt file.
-
-```bash
-# switch to the main directory
-cd f5_tts
-
-python inference-cli.py \
---model "F5-TTS" \
---ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
---ref_text "Some call me nature, others call me mother nature." \
---gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
-
-python inference-cli.py \
---model "E2-TTS" \
---ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
---ref_text "对,这就是我,万人敬仰的太乙真人。" \
---gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
-
-# Multi voice
-# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
-python inference-cli.py -c samples/story.toml
-```
-
-### Gradio App
-Currently supported features:
-- Chunk inference
-- Podcast Generation
-- Multiple Speech-Type Generation
-- Voice Chat powered by Qwen2.5-3B-Instruct
-
-It is possible to use cli `f5-tts_infer-gradio` for following commands.
-
-You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`.
-
-```bash
-python f5_tts/gradio_app.py
-```
-
-You can specify the port/host:
-
-```bash
-python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
-```
-
-Or launch a share link:
-
-```bash
-python f5_tts/gradio_app.py --share
-```
+## Gradio App
+
+Currently supported features:
+
+- Basic TTS with Chunk Inference
+- Multi-Style / Multi-Speaker Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+
+The cli command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.
+
+The script will load model checkpoints from Huggingface. You can also manually download files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS models are loaded at first; the ASR model is loaded to do transcription if `ref_text` is not provided, and the LLM model is loaded if Voice Chat is used.
+
+It can also be used as a component of a larger application:
 
 ```python
 import gradio as gr
-from f5_tts.gradio_app import app
+from f5_tts.infer.infer_gradio import app
 
 with gr.Blocks() as main_app:
     gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
 
     # ... other Gradio components
 
     app.render()
 
 main_app.launch()
 ```
 
-### Speech Editing
-
-To test speech editing capabilities, use the following command.
+## CLI Inference
+
+The cli command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command line tool for inference.
+
+The script will load model checkpoints from Huggingface. You can also manually download files and use `--ckpt_file` to specify the model you want to load, or directly update the path in `infer_cli.py`.
+
+To change the vocab, use `--vocab_file` to provide your own `vocab.txt` file.
+
+Basically you can run inference with flags:
+
+```bash
+# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
+f5-tts_infer-cli \
+    --model "F5-TTS" \
+    --ref_audio "ref_audio.wav" \
+    --ref_text "The content, subtitle or transcription of reference audio." \
+    --gen_text "Some text you want the TTS model to generate for you."
+```
+
+A `.toml` file allows more flexible usage:
+
+```bash
+f5-tts_infer-cli -c custom.toml
+```
+
+For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:
+
+```toml
+# F5-TTS | E2-TTS
+model = "F5-TTS"
+ref_audio = "infer/examples/basic/basic_ref_en.wav"
+# If an empty "", transcribes the reference audio automatically.
+ref_text = "Some call me nature, others call me mother nature."
+gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
+# File with text to generate. Ignores the text above.
+gen_file = ""
+remove_silence = false
+output_dir = "tests"
+```
+
+You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`:
+
+```toml
+# F5-TTS | E2-TTS
+model = "F5-TTS"
+ref_audio = "infer/examples/multi/main.flac"
+# If an empty "", transcribes the reference audio automatically.
+ref_text = ""
+gen_text = ""
+# File with text to generate. Ignores the text above.
+gen_file = "infer/examples/multi/story.txt"
+remove_silence = true
+output_dir = "tests"
+
+[voices.town]
+ref_audio = "infer/examples/multi/town.flac"
+ref_text = ""
+
+[voices.country]
+ref_audio = "infer/examples/multi/country.flac"
+ref_text = ""
+```
+
+Mark the voice with `[main]` `[town]` `[country]` wherever you want to change voice; refer to `src/f5_tts/infer/examples/multi/story.txt`.
+
+## Speech Editing
+
+To test speech editing capabilities, use the following command:
 
 ```bash
-python f5_tts/speech_edit.py
+python src/f5_tts/infer/speech_edit.py
 ```
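
For context on the multi-voice format referenced above: a minimal sketch of how `[voice]` tags could delimit segments in a gen_file. The story text and the regex parser here are illustrative assumptions, not the project's actual `story.txt` or parsing code:

```python
# Illustrative only: split a multi-voice story on [voice] tags, so each chunk
# could be synthesized with the matching voice's ref_audio. Hypothetical text.
import re

story = (
    "[main] The town mouse and the country mouse met one day. "
    "[town] Come visit me, life in town is grand! "
    "[country] No, the quiet country suits me best."
)

# Each match is (voice_name, text_up_to_the_next_tag)
for voice, text in re.findall(r"\[(\w+)\]\s*([^\[]+)", story):
    print(f"{voice}: {text.strip()}")
```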
src/f5_tts/infer/infer_cli.py CHANGED
@@ -80,11 +80,21 @@ args = parser.parse_args()
 config = tomli.load(open(args.config, "rb"))
 
 ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
-if "infer/examples/" in ref_audio:  # for pip pkg user
-    ref_audio = str(files("f5_tts").joinpath(f"{ref_audio}"))
 ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
 gen_text = args.gen_text if args.gen_text else config["gen_text"]
 gen_file = args.gen_file if args.gen_file else config["gen_file"]
+
+# patches for pip pkg user
+if "infer/examples/" in ref_audio:
+    ref_audio = str(files("f5_tts").joinpath(f"{ref_audio}"))
+if "infer/examples/" in gen_file:
+    gen_file = str(files("f5_tts").joinpath(f"{gen_file}"))
+if "voices" in config:
+    for voice in config["voices"]:
+        voice_ref_audio = config["voices"][voice]["ref_audio"]
+        if "infer/examples/" in voice_ref_audio:
+            config["voices"][voice]["ref_audio"] = str(files("f5_tts").joinpath(f"{voice_ref_audio}"))
+
 if gen_file:
     gen_text = codecs.open(gen_file, "r", "utf-8").read()
 output_dir = args.output_dir if args.output_dir else config["output_dir"]
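
For context on the "pip pkg user" patch above, a standalone sketch of what the path rewrite does; the only assumption is that `f5_tts` is installed as a package shipping its `infer/examples/` data. `files()` is the standard `importlib.resources` API used in the diff:

```python
# Sketch of the patch's effect: rewrite a repo-relative example path to its
# location inside the installed f5_tts package, so bundled examples resolve
# without a source checkout.
from importlib.resources import files

ref_audio = "infer/examples/basic/basic_ref_en.wav"  # value from basic.toml
if "infer/examples/" in ref_audio:
    # e.g. -> .../site-packages/f5_tts/infer/examples/basic/basic_ref_en.wav
    ref_audio = str(files("f5_tts").joinpath(ref_audio))
print(ref_audio)
```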
src/f5_tts/infer/speech_edit.py CHANGED
@@ -7,11 +7,13 @@ from vocos import Vocos
 
 from f5_tts.model import CFM, UNetT, DiT
 from f5_tts.model.utils import (
-    load_checkpoint,
     get_tokenizer,
     convert_char_to_pinyin,
 )
-from f5_tts.infer.utils_infer import save_spectrogram
+from f5_tts.infer.utils_infer import (
+    load_checkpoint,
+    save_spectrogram,
+)
 
 device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
 
@@ -54,12 +56,12 @@ output_dir = "tests"
 # [leverage https://github.com/MahmoudAshraf97/ctc-forced-aligner to get char level alignment]
 # pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
 # [write the origin_text into a file, e.g. tests/test_edit.txt]
-# ctc-forced-aligner --audio_path "tests/ref_audio/test_en_1_ref_short.wav" --text_path "tests/test_edit.txt" --language "zho" --romanize --split_size "char"
+# ctc-forced-aligner --audio_path "src/f5_tts/infer/examples/basic/basic_ref_en.wav" --text_path "tests/test_edit.txt" --language "zho" --romanize --split_size "char"
 # [result will be saved at same path of audio file]
 # [--language "zho" for Chinese, "eng" for English]
 # [if local ckpt, set --alignment_model "../checkpoints/mms-300m-1130-forced-aligner"]
 
-audio_to_edit = "tests/ref_audio/test_en_1_ref_short.wav"
+audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
 origin_text = "Some call me nature, others call me mother nature."
 target_text = "Some call me optimist, others call me realist."
 parts_to_edit = [
@@ -71,7 +73,7 @@ fix_duration = [
     1,
 ]  # fix duration for "optimist" & "realist", in seconds
 
-# audio_to_edit = "tests/ref_audio/test_zh_1_ref_short.wav"
+# audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_zh.wav"
 # origin_text = "对,这就是我,万人敬仰的太乙真人。"
 # target_text = "对,那就是你,万人敬仰的太白金星。"
 # parts_to_edit = [[0.84, 1.4], [1.92, 2.4], [4.26, 6.26], ]
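
For reference, the alignment step described in the comments above could be scripted as below; a sketch only, with the ctc-forced-aligner flags copied from the diff and `--language "eng"` substituted for the English example, per the comment in the file:

```python
# Sketch: write origin_text to a file and run the forced aligner on the new
# example audio path, as the updated comments describe.
import subprocess
from pathlib import Path

Path("tests").mkdir(exist_ok=True)
Path("tests/test_edit.txt").write_text(
    "Some call me nature, others call me mother nature.", encoding="utf-8"
)
subprocess.run(
    [
        "ctc-forced-aligner",
        "--audio_path", "src/f5_tts/infer/examples/basic/basic_ref_en.wav",
        "--text_path", "tests/test_edit.txt",
        "--language", "eng",  # "zho" for Chinese, per the comment
        "--romanize",
        "--split_size", "char",
    ],
    check=True,
)  # alignment result is saved next to the audio file
```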
src/f5_tts/train/README.md CHANGED
@@ -1,3 +1,4 @@
+# Training
 
 ## Prepare Dataset
 