SWivid committed
Commit 8640433 · Parent(s): d280126

initial updates for infer stuffs
README.md CHANGED
@@ -16,6 +16,9 @@
 
 ### Thanks to all the contributors !
 
+## News
+- **2024/10/08**: F5-TTS & E2 TTS base models on [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS), [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN).
+
 ## Installation
 
 ```bash
@@ -48,112 +51,48 @@ pip install -e .
 docker build -t f5tts:v1 .
 ```
 
-## Development
-
-Use pre-commit to ensure code quality (will run linters and formatters automatically)
-
-```bash
-pip install pre-commit
-pre-commit install
-```
-
-When making a pull request, before each commit, run:
-
-```bash
-pre-commit run --all-files
-```
-
-Note: Some model components have linting exceptions for E722 to accommodate tensor notation
-
 ## Inference
 
-```python
-import gradio as gr
-from f5_tts.gradio_app import app
-
-with gr.Blocks() as main_app:
-    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
-
-    # ... other Gradio components
-
-    app.render()
-
-main_app.launch()
-```
-
-The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
-
-Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
-- To avoid possible inference failures, make sure you have seen through the following instructions.
-- A longer prompt audio allows shorter generated output. The part longer than 30s cannot be generated properly. Consider using a prompt audio <15s.
-- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
-- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses. If first few words skipped in code-switched generation (cuz different speed with different languages), this might help.
-
-### CLI Inference
-
-Either you can specify everything in `inference-cli.toml` or override with flags. Leave `--ref_text ""` will have ASR model transcribe the reference audio automatically (use extra GPU memory). If encounter network error, consider use local ckpt, just set `ckpt_file` in `inference-cli.py`
-
-for change model use `--ckpt_file` to specify the model you want to load,
-for change vocab.txt use `--vocab_file` to provide your vocab.txt file.
-
-```bash
-# switch to the main directory
-cd f5_tts
-
-python inference-cli.py \
---model "F5-TTS" \
---ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
---ref_text "Some call me nature, others call me mother nature." \
---gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
-
-python inference-cli.py \
---model "E2-TTS" \
---ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
---ref_text "对,这就是我,万人敬仰的太乙真人。" \
---gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
-
-# Multi voice
-# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
-python inference-cli.py -c samples/story.toml
-```
-
-### Gradio App
-Currently supported features:
-- Chunk inference
-- Podcast Generation
-- Multiple Speech-Type Generation
-- Voice Chat powered by Qwen2.5-3B-Instruct
-
-You can launch a Gradio app (web interface) to launch a GUI for inference (will load ckpt from Huggingface, you may also use local file in `gradio_app.py`). Currently load ASR model, F5-TTS and E2 TTS all in once, thus use more GPU memory than `inference-cli`.
-
-```bash
-python f5_tts/gradio_app.py
-```
-
-You can specify the port/host:
-
-```bash
-python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
-```
-
-Or launch a share link:
-
-```bash
-python f5_tts/gradio_app.py --share
-```
-
-### Speech Editing
-
-To test speech editing capabilities, use the following command.
-
-```bash
-python f5_tts/speech_edit.py
-```
-
-## [Training](src/f5_tts/train/README.md)
-
-## [Evaluation](src/f5_tts/eval/README.md)
+### 1. Basic usage
+
+```bash
+# CLI inference
+f5-tts_infer-cli
+
+# Gradio interface
+f5-tts_infer-gradio
+```
+
+### 2. More instructions
+
+- For better generation results, take a moment to read the [detailed guidance](src/f5_tts/infer/README.md).
+- The [Issues](https://github.com/SWivid/F5-TTS/issues?q=is%3Aissue) page is very useful; search it with keywords from the problem you encounter, and feel free to open an issue if no answer turns up.
+
+## [Training](src/f5_tts/train/README.md)
+
+## [Evaluation](src/f5_tts/eval/README.md)
+
+## Development
+
+Use pre-commit to ensure code quality (it will run linters and formatters automatically):
+
+```bash
+pip install pre-commit
+pre-commit install
+```
+
+When making a pull request, before each commit, run:
+
+```bash
+pre-commit run --all-files
+```
+
+Note: Some model components have linting exceptions for E722 to accommodate tensor notation.
 
 ## Acknowledgements
pyproject.toml CHANGED
@@ -55,4 +55,5 @@ eval = [
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"inference-cli" = "f5_tts.inference_cli:main"
+"f5-tts_infer-cli" = "f5_tts.infer.infer_cli:main"
+"f5-tts_infer-gradio" = "f5_tts.infer.infer_gradio:main"
src/f5_tts/api.py CHANGED
@@ -130,8 +130,8 @@ if __name__ == "__main__":
     ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
     ref_text="some call me nature, others call me mother nature.",
     gen_text="""I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.""",
-    file_wave=str(files("f5_tts").joinpath("../../api_test_out.wav")),
-    file_spect=str(files("f5_tts").joinpath("../../api_test_out.png")),
+    file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),
+    file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
     seed=-1,  # random seed = -1
 )
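
For context, these lines sit inside the module's `__main__` test harness; a self-contained sketch of the same call, assuming an `F5TTS` class whose `infer` method accepts these keyword arguments (the class and method names are assumptions, since the hunk shows only the arguments):

```python
from importlib.resources import files

from f5_tts.api import F5TTS  # assumed class name

f5tts = F5TTS()
f5tts.infer(
    ref_file=str(files("f5_tts").joinpath("infer/examples/basic/basic_ref_en.wav")),
    ref_text="some call me nature, others call me mother nature.",
    gen_text="I don't really care what you call me.",  # shortened for the sketch
    file_wave=str(files("f5_tts").joinpath("../../tests/api_out.wav")),  # new output location
    file_spect=str(files("f5_tts").joinpath("../../tests/api_out.png")),
    seed=-1,  # random seed = -1
)
```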
 
src/f5_tts/infer/README.md ADDED
@@ -0,0 +1,92 @@
+## Inference
+
+The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or will be downloaded automatically when running the inference scripts.
+
+A single generation currently supports **up to 30s**, which is the **total length** of the prompt audio plus the generated output. For longer text, use `infer_cli` or `infer_gradio`, which automatically generate in chunks. Long reference audio is clipped to roughly 15s.
+
+To avoid possible inference failures, make sure you have read through the following instructions.
+
+- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words.
+- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce some pauses.
+- Preprocess numbers into Chinese characters if you want them read in Chinese; otherwise they are read in English.
+
+# TODO 👇 ...
+
+### CLI Inference
+
+The commands below can also be run through the `f5-tts_infer-cli` entry point.
+
+You can either specify everything in `inference-cli.toml` or override settings with flags. Leaving `--ref_text ""` will have an ASR model transcribe the reference audio automatically (this uses extra GPU memory). If you encounter a network error, consider using a local checkpoint: just set `ckpt_file` in `inference-cli.py`.
+
+To change the model, use `--ckpt_file` to specify the checkpoint you want to load;
+to change vocab.txt, use `--vocab_file` to provide your own vocab.txt file.
+
+```bash
+# switch to the main directory
+cd f5_tts
+
+python inference-cli.py \
+--model "F5-TTS" \
+--ref_audio "tests/ref_audio/test_en_1_ref_short.wav" \
+--ref_text "Some call me nature, others call me mother nature." \
+--gen_text "I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences."
+
+python inference-cli.py \
+--model "E2-TTS" \
+--ref_audio "tests/ref_audio/test_zh_1_ref_short.wav" \
+--ref_text "对,这就是我,万人敬仰的太乙真人。" \
+--gen_text "突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道,我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"
+
+# Multi voice
+# https://github.com/SWivid/F5-TTS/pull/146#issue-2595207852
+python inference-cli.py -c samples/story.toml
+```
+
+### Gradio App
+
+Currently supported features:
+
+- Chunk inference
+- Podcast Generation
+- Multiple Speech-Type Generation
+- Voice Chat powered by Qwen2.5-3B-Instruct
+
+The app below can also be launched through the `f5-tts_infer-gradio` entry point.
+
+You can launch a Gradio app (web interface) as a GUI for inference (it loads the checkpoint from Hugging Face; you may also point to a local file in `gradio_app.py`). It currently loads the ASR model, F5-TTS, and E2 TTS all at once, and therefore uses more GPU memory than `inference-cli`.
+
+```bash
+python f5_tts/gradio_app.py
+```
+
+You can specify the port/host:
+
+```bash
+python f5_tts/gradio_app.py --port 7860 --host 0.0.0.0
+```
+
+Or launch a share link:
+
+```bash
+python f5_tts/gradio_app.py --share
+```
+
+The app can also be embedded inside a larger Gradio app:
+
+```python
+import gradio as gr
+from f5_tts.gradio_app import app
+
+with gr.Blocks() as main_app:
+    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
+
+    # ... other Gradio components
+
+    app.render()
+
+main_app.launch()
+```
+
+### Speech Editing
+
+To test speech editing capabilities, use the following command:
+
+```bash
+python f5_tts/speech_edit.py
+```
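
Since the TODO section above still describes the old `inference-cli.py` flow, it may help to see how the script merges its TOML config with command-line flags. A minimal sketch of the resolution logic in `infer_cli.py` (the config keys are the ones the script reads; the values are illustrative, not the shipped defaults):

```python
# Every setting comes from the TOML config unless a flag overrides it.
import tomli

config = tomli.loads(
    """
    model = "F5-TTS"
    ref_audio = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"
    ref_text = "some call me nature, others call me mother nature."
    gen_text = "I don't really care what you call me."
    gen_file = ""
    remove_silence = false
    """
)

# Flag-over-config resolution, mirroring infer_cli.py; note the "666" sentinel
# the script uses to detect that --ref_text was left at its default.
args_ref_text = "666"  # stand-in for args.ref_text
ref_text = args_ref_text if args_ref_text != "666" else config["ref_text"]
print(ref_text)  # prints the value from the config
```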
src/f5_tts/infer/infer_cli.py CHANGED
@@ -21,15 +21,15 @@ from f5_tts.infer.utils_infer import (
 
 
 parser = argparse.ArgumentParser(
-    prog="python3 inference-cli.py",
+    prog="python3 infer-cli.py",
     description="Commandline interface for E2/F5 TTS with Advanced Batch Processing.",
     epilog="Specify options above to override one or more settings from config.",
 )
 parser.add_argument(
     "-c",
     "--config",
-    help="Configuration file. Default=inference-cli.toml",
-    default=os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml"),
+    help="Configuration file. Default=infer/examples/basic/basic.toml",
+    default=os.path.join(files("f5_tts").joinpath("infer/examples/basic"), "basic.toml"),
 )
 parser.add_argument(
     "-m",
@@ -80,6 +80,8 @@ args = parser.parse_args()
 config = tomli.load(open(args.config, "rb"))
 
 ref_audio = args.ref_audio if args.ref_audio else config["ref_audio"]
+if "src/f5_tts/infer/examples/basic" in ref_audio:  # for pip pkg user
+    ref_audio = str(files("f5_tts").joinpath(f"../../{ref_audio}"))
 ref_text = args.ref_text if args.ref_text != "666" else config["ref_text"]
 gen_text = args.gen_text if args.gen_text else config["gen_text"]
 gen_file = args.gen_file if args.gen_file else config["gen_file"]
@@ -90,8 +92,8 @@ model = args.model if args.model else config["model"]
 ckpt_file = args.ckpt_file if args.ckpt_file else ""
 vocab_file = args.vocab_file if args.vocab_file else ""
 remove_silence = args.remove_silence if args.remove_silence else config["remove_silence"]
-wave_path = Path(output_dir) / "out.wav"
-spectrogram_path = Path(output_dir) / "out.png"
+wave_path = Path(output_dir) / "infer_cli_out.wav"
+# spectrogram_path = Path(output_dir) / "infer_cli_out.png"
 vocos_local_path = "../checkpoints/charactr/vocos-mel-24khz"
 
 vocos = load_vocoder(is_local=args.load_vocoder_from_local, local_path=vocos_local_path)
@@ -161,6 +163,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
 
     if generated_audio_segments:
         final_wave = np.concatenate(generated_audio_segments)
+
+        if not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+
         with open(wave_path, "wb") as f:
             sf.write(f.name, final_wave, final_sample_rate)
         # Remove silence
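
The new `# for pip pkg user` branch deserves a note: paths in the shipped example config are written relative to the repo root, so when f5_tts runs as an installed package the script remaps them against the package's own location. A standalone sketch of that remap (the input value is illustrative):

```python
from importlib.resources import files

# basic.toml ships repo-relative paths; rewrite them so they resolve from
# wherever the f5_tts package is actually installed.
ref_audio = "src/f5_tts/infer/examples/basic/basic_ref_en.wav"  # illustrative config value
if "src/f5_tts/infer/examples/basic" in ref_audio:  # for pip pkg user
    ref_audio = str(files("f5_tts").joinpath(f"../../{ref_audio}"))
print(ref_audio)
```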
src/f5_tts/infer/utils_infer.py CHANGED
@@ -186,13 +186,12 @@ def preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=print, device=
         non_silent_segs = silence.split_on_silence(aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000)
         non_silent_wave = AudioSegment.silent(duration=0)
         for non_silent_seg in non_silent_segs:
+            if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
+                show_info("Audio is over 18s, clipping short.")
+                break
             non_silent_wave += non_silent_seg
         aseg = non_silent_wave
 
-        audio_duration = len(aseg)
-        if audio_duration > 15000:
-            show_info("Audio is over 15s, clipping to only first 15s.")
-            aseg = aseg[:15000]
         aseg.export(f.name, format="wav")
         ref_audio = f.name
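
The behavior change above is easiest to see in isolation. A standalone sketch of the new clipping rule (the function name and file I/O are illustrative; the thresholds and silence-splitting parameters match the hunk):

```python
from pydub import AudioSegment, silence


def clip_ref_audio(in_path: str, out_path: str) -> None:
    """Keep appending non-silent segments, but stop once the clip already
    exceeds 10s and appending the next one would push it past 18s."""
    aseg = AudioSegment.from_file(in_path)
    non_silent_segs = silence.split_on_silence(
        aseg, min_silence_len=1000, silence_thresh=-50, keep_silence=1000
    )
    non_silent_wave = AudioSegment.silent(duration=0)
    for non_silent_seg in non_silent_segs:  # pydub lengths are in milliseconds
        if len(non_silent_wave) > 10000 and len(non_silent_wave + non_silent_seg) > 18000:
            print("Audio is over 18s, clipping short.")  # show_info in the original
            break
        non_silent_wave += non_silent_seg
    non_silent_wave.export(out_path, format="wav")
```

Unlike the old hard cut at exactly 15s, this never slices through the middle of a speech segment, which is presumably why the READMEs now promise only a clip to "roughly 15s".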