finish infer dependencies; update readmes
- README.md +34 -5
- src/f5_tts/eval/README.md +5 -1
- src/f5_tts/infer/README.md +74 -55
- src/f5_tts/infer/infer_cli.py +12 -2
- src/f5_tts/infer/speech_edit.py +7 -5
- src/f5_tts/train/README.md +1 -0
README.md
CHANGED
@@ -54,17 +54,46 @@ docker build -t f5tts:v1 .

## Inference

### 1. Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

```bash
# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share
```

### 2. CLI Inference

```bash
# Run with flags
# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "The content, subtitle or transcription of reference audio." \
    --gen_text "Some text you want the TTS model to generate for you."

# Run with the default settings in src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
```

### 3. More instructions

- To get better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer).
- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; please search for the keywords of the problem you encountered before opening a new issue. If no answer is found, feel free to open one.
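The "Chunk Inference" feature above means long input text is split so each generation stays within the model's single-pass budget. A minimal sketch of that idea, using a hypothetical `chunk_text` helper (the app's actual splitting logic and budget differ):

```python
import re


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack sentences into chunks of at most roughly max_chars.

    Hypothetical sketch of chunked generation, not the app's real code.
    """
    # Split on sentence-ending punctuation (keeping the delimiter),
    # covering both English and CJK full-width marks.
    sentences = [s for s in re.split(r"(?<=[.!?。!?])\s*", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += (" " if current else "") + sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then synthesized separately and the audio segments are concatenated.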
src/f5_tts/eval/README.md
CHANGED
@@ -1,5 +1,5 @@

# Evaluation

Install packages for evaluation:

@@ -7,6 +7,8 @@ Install packages for evaluation:
pip install -e .[eval]
```

## Generating Samples for Evaluation

### Prepare Test Datasets

1. *Seed-TTS testset*: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
@@ -25,6 +27,8 @@ accelerate config  # if not set before
bash src/f5_tts/eval/eval_infer_batch.sh
```

## Objective Evaluation on Generated Results

### Download Evaluation Model Checkpoints

1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
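The ASR checkpoints above are used to transcribe generated samples and score them against the target text, typically with word error rate (WER). A minimal WER sketch (word-level edit distance over reference length; the repo's actual metric code may normalize text differently):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance: substitutions,
    # insertions, and deletions each cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```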
src/f5_tts/infer/README.md
CHANGED
@@ -1,8 +1,8 @@

# Inference

The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be automatically downloaded when running inference scripts.

Currently supports **30s for a single** generation, which is the **total length** including both prompt and output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text, which will automatically be chunked for generation. Long reference audio will be **clipped to ~15s**.

To avoid possible inference failures, make sure you have read through the following instructions.

@@ -10,83 +10,102 @@ To avoid possible inference failures, make sure you have seen through the follow

- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses.
- Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English.

## Gradio App

Currently supported features:

- Basic TTS with Chunk Inference
- Multi-Style / Multi-Speaker Generation
- Voice Chat powered by Qwen2.5-3B-Instruct

The CLI command `f5-tts_infer-gradio` is equivalent to `python src/f5_tts/infer/infer_gradio.py`, which launches a Gradio app (web interface) for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and update the path passed to `load_model()` in `infer_gradio.py`. Only the TTS models are loaded at first; an ASR model is loaded for transcription if `ref_text` is not provided, and an LLM is loaded when Voice Chat is used.

It can also be used as a component of a larger application:
```python
import gradio as gr
from f5_tts.infer.infer_gradio import app

with gr.Blocks() as main_app:
    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")

    # ... other Gradio components

    app.render()

main_app.launch()
```

## CLI Inference

The CLI command `f5-tts_infer-cli` is equivalent to `python src/f5_tts/infer/infer_cli.py`, a command-line tool for inference.

The script will load model checkpoints from Hugging Face. You can also manually download files and use `--ckpt_file` to specify the model you want to load, or directly update the path in `infer_cli.py`.

To use a custom vocabulary, pass your `vocab.txt` file with `--vocab_file`.

Basically you can run inference with flags:
```bash
# Leaving --ref_text "" will have an ASR model transcribe the reference audio (extra GPU memory usage)
f5-tts_infer-cli \
    --model "F5-TTS" \
    --ref_audio "ref_audio.wav" \
    --ref_text "The content, subtitle or transcription of reference audio." \
    --gen_text "Some text you want the TTS model to generate for you."
```

And a `.toml` file allows more flexible usage:

```bash
f5-tts_infer-cli -c custom.toml
```

For example, you can use a `.toml` file to pass in variables; refer to `src/f5_tts/infer/examples/basic/basic.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/basic/basic_ref_en.wav"
# If an empty "", transcribes the reference audio automatically.
ref_text = "Some call me nature, others call me mother nature."
gen_text = "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring."
# File with text to generate. Ignores the text above.
gen_file = ""
remove_silence = false
output_dir = "tests"
```

You can also leverage a `.toml` file to do multi-style generation; refer to `src/f5_tts/infer/examples/multi/story.toml`:

```toml
# F5-TTS | E2-TTS
model = "F5-TTS"
ref_audio = "infer/examples/multi/main.flac"
# If an empty "", transcribes the reference audio automatically.
ref_text = ""
gen_text = ""
# File with text to generate. Ignores the text above.
gen_file = "infer/examples/multi/story.txt"
remove_silence = true
output_dir = "tests"

[voices.town]
ref_audio = "infer/examples/multi/town.flac"
ref_text = ""

[voices.country]
ref_audio = "infer/examples/multi/country.flac"
ref_text = ""
```

You should mark the text with `[main]` `[town]` `[country]` whenever you want to change voice; refer to `src/f5_tts/infer/examples/multi/story.txt`.

## Speech Editing

To test speech editing capabilities, use the following command:

```bash
python src/f5_tts/infer/speech_edit.py
```
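The multi-voice text in `story.txt` switches voices with bracketed markers like `[main]` and `[town]`. A minimal sketch of how such markers could be split into (voice, text) segments (hypothetical parser for illustration, not necessarily how `infer_cli.py` implements it):

```python
import re


def split_voices(text: str, default: str = "main") -> list[tuple[str, str]]:
    """Split "[town] Hello! [main] Hi." into (voice, text) pairs.

    Hypothetical sketch; segments before the first marker fall back to
    the default voice.
    """
    segments = []
    voice = default
    # re.split with a capturing group alternates between the text and
    # the captured voice names: [lead, voice1, text1, voice2, text2, ...]
    parts = re.split(r"\[(\w+)\]", text)
    if parts[0].strip():
        segments.append((voice, parts[0].strip()))
    for i in range(1, len(parts), 2):
        voice, chunk = parts[i], parts[i + 1].strip()
        if chunk:
            segments.append((voice, chunk))
    return segments
```

Each segment would then be synthesized with the reference audio/text configured for that voice in the `.toml` file.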
src/f5_tts/infer/infer_cli.py
CHANGED
@@ -80,11 +80,21 @@ args = parser.parse_args()

config = tomli.load(open(args.config, "rb"))

ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
gen_text = args.gen_text if args.gen_text else config["gen_text"]
gen_file = args.gen_file if args.gen_file else config["gen_file"]

# patches for pip pkg user
if "infer/examples/" in ref_audio:
    ref_audio = str(files("f5_tts").joinpath(f"{ref_audio}"))
if "infer/examples/" in gen_file:
    gen_file = str(files("f5_tts").joinpath(f"{gen_file}"))
if "voices" in config:
    for voice in config["voices"]:
        voice_ref_audio = config["voices"][voice]["ref_audio"]
        if "infer/examples/" in voice_ref_audio:
            config["voices"][voice]["ref_audio"] = str(files("f5_tts").joinpath(f"{voice_ref_audio}"))

if gen_file:
    gen_text = codecs.open(gen_file, "r", "utf-8").read()
output_dir = args.output_dir if args.output_dir else config["output_dir"]
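The pip-package patch above remaps bundled example paths onto the installed package location with `importlib.resources.files`, so the default `.toml` files keep working outside the source tree. The same idea in isolation, as a generic helper written for illustration:

```python
from importlib.resources import files


def resolve_packaged_path(path: str, package: str) -> str:
    """Remap a relative example path into the installed package tree.

    Mirrors the patch above: only paths under infer/examples/ are
    rewritten; anything else is returned untouched.
    """
    if "infer/examples/" in path:
        # joinpath does not require the target to exist, so this is safe
        # to call before checking the file.
        return str(files(package).joinpath(path))
    return path
```

For a pip install, `files("f5_tts")` resolves to the site-packages copy of the project, which is why the relative `infer/examples/...` paths in the shipped configs still point at real files.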
src/f5_tts/infer/speech_edit.py
CHANGED
@@ -7,11 +7,13 @@ from vocos import Vocos

from f5_tts.model import CFM, UNetT, DiT
from f5_tts.model.utils import (
    get_tokenizer,
    convert_char_to_pinyin,
)
from f5_tts.infer.utils_infer import (
    load_checkpoint,
    save_spectrogram,
)

device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"

@@ -54,12 +56,12 @@ output_dir = "tests"

# [leverage https://github.com/MahmoudAshraf97/ctc-forced-aligner to get char level alignment]
# pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git
# [write the origin_text into a file, e.g. tests/test_edit.txt]
# ctc-forced-aligner --audio_path "src/f5_tts/infer/examples/basic/basic_ref_en.wav" --text_path "tests/test_edit.txt" --language "zho" --romanize --split_size "char"
# [result will be saved at same path of audio file]
# [--language "zho" for Chinese, "eng" for English]
# [if local ckpt, set --alignment_model "../checkpoints/mms-300m-1130-forced-aligner"]

audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
origin_text = "Some call me nature, others call me mother nature."
target_text = "Some call me optimist, others call me realist."
parts_to_edit = [

@@ -71,7 +73,7 @@ fix_duration = [

    1,
]  # fix duration for "optimist" & "realist", in seconds

# audio_to_edit = "src/f5_tts/infer/examples/basic/basic_ref_zh.wav"
# origin_text = "对,这就是我,万人敬仰的太乙真人。"
# target_text = "对,那就是你,万人敬仰的太白金星。"
# parts_to_edit = [[0.84, 1.4], [1.92, 2.4], [4.26, 6.26], ]
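`parts_to_edit` and `fix_duration` are given in seconds, while the model operates on mel-spectrogram frames. A sketch of that conversion, assuming 24 kHz audio and a hop length of 256 (the defaults used elsewhere in F5-TTS; verify against the constants in `speech_edit.py` itself):

```python
def seconds_to_frames(parts, sample_rate=24000, hop_length=256):
    """Convert [start, end] spans in seconds to mel-frame indices.

    sample_rate and hop_length are assumed defaults, not read from the
    script; check speech_edit.py for the authoritative values.
    """
    frames_per_second = sample_rate / hop_length  # 93.75 with these defaults
    return [
        [round(start * frames_per_second), round(end * frames_per_second)]
        for start, end in parts
    ]
```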
src/f5_tts/train/README.md
CHANGED
@@ -1,3 +1,4 @@

# Training

## Prepare Dataset