feature. allow custom model config for gradio infer
Files changed:

- README.md +1 -1
- src/f5_tts/configs/F5TTS_Base_train.yaml +1 -1
- src/f5_tts/configs/F5TTS_Small_train.yaml +1 -1
- src/f5_tts/infer/SHARED.md +39 -34
- src/f5_tts/infer/infer_cli.py +14 -5
- src/f5_tts/infer/infer_gradio.py +69 -26
- src/f5_tts/model/backbones/dit.py +1 -2
README.md (CHANGED)

```diff
@@ -150,7 +150,7 @@ Note: Some model components have linting exceptions for E722 to accommodate tens
 - [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763), [LibriTTS](https://arxiv.org/abs/1904.02882), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) valuable datasets
 - [lucidrains](https://github.com/lucidrains) initial CFM structure with also [bfs18](https://github.com/bfs18) for discussion
 - [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
-- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN
+- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) and [BigVGAN](https://github.com/NVIDIA/BigVGAN) as vocoder
 - [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech), [SpeechMOS](https://github.com/tarepan/SpeechMOS) for evaluation tools
 - [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for speech edit test
 - [mrfakename](https://x.com/realmrfakename) huggingface space demo ~
```
src/f5_tts/configs/F5TTS_Base_train.yaml (CHANGED)

```diff
@@ -28,7 +28,7 @@ model:
     ff_mult: 2
     text_dim: 512
     conv_layers: 4
-    checkpoint_activations: False
+    checkpoint_activations: False  # recompute activations and save memory for extra compute
   mel_spec:
     target_sample_rate: 24000
     n_mel_channels: 100
```
src/f5_tts/configs/F5TTS_Small_train.yaml (CHANGED)

```diff
@@ -28,7 +28,7 @@ model:
     ff_mult: 2
     text_dim: 512
     conv_layers: 4
-    checkpoint_activations: False
+    checkpoint_activations: False  # recompute activations and save memory for extra compute
   mel_spec:
     target_sample_rate: 24000
     n_mel_channels: 100
```
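Both configs expose the DiT architecture under `model.arch`, which is exactly what the updated inference code reads. A minimal sketch of that consumption path, assuming `omegaconf` is installed and the packaged config layout above:

```python
# Minimal sketch: resolve the packaged default config and pull out the
# architecture block, mirroring the new code path in infer_cli.py below.
from importlib.resources import files

from omegaconf import OmegaConf

cfg_path = str(files("f5_tts").joinpath("configs/F5TTS_Base_train.yaml"))
arch = OmegaConf.load(cfg_path).model.arch  # dim, depth, heads, ff_mult, text_dim, conv_layers, ...

print(OmegaConf.to_container(arch))
```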
src/f5_tts/infer/SHARED.md (CHANGED)

````diff
@@ -16,33 +16,34 @@
 <!-- omit in toc -->
 ### Supported Languages
 - [Multilingual](#multilingual)
-  - [F5-TTS Base @
+  - [F5-TTS Base @ zh \& en @ F5-TTS](#f5-tts-base--zh--en--f5-tts)
 - [English](#english)
 - [Finnish](#finnish)
-  - [
+  - [F5-TTS Base @ fi @ AsmoKoskinen](#f5-tts-base--fi--asmokoskinen)
 - [French](#french)
-  - [
+  - [F5-TTS Base @ fr @ RASPIAUDIO](#f5-tts-base--fr--raspiaudio)
 - [Hindi](#hindi)
-  - [F5-TTS Small @
+  - [F5-TTS Small @ hi @ SPRINGLab](#f5-tts-small--hi--springlab)
 - [Italian](#italian)
-  - [F5-TTS
+  - [F5-TTS Base @ it @ alien79](#f5-tts-base--it--alien79)
 - [Japanese](#japanese)
-  - [F5-TTS
+  - [F5-TTS Base @ ja @ Jmica](#f5-tts-base--ja--jmica)
 - [Mandarin](#mandarin)
 - [Spanish](#spanish)
-  - [F5-TTS
+  - [F5-TTS Base @ es @ jpgallegoar](#f5-tts-base--es--jpgallegoar)
 
 
 ## Multilingual
 
-#### F5-TTS Base @
+#### F5-TTS Base @ zh & en @ F5-TTS
 |Model|🤗Hugging Face|Data (Hours)|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
 |F5-TTS Base|[ckpt & vocab](https://huggingface.co/SWivid/F5-TTS/tree/main/F5TTS_Base)|[Emilia 95K zh&en](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07)|cc-by-nc-4.0|
 
 ```bash
-
-
+Model: hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors
+Vocab: hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
 ```
 
 *Other infos, e.g. Author info, Github repo, Link to some sampled results, Usage instruction, Tutorial (Blog, Video, etc.) ...*
@@ -53,27 +54,29 @@ VOCAB_FILE: hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt
 
 ## Finnish
 
-####
+#### F5-TTS Base @ fi @ AsmoKoskinen
 |Model|🤗Hugging Face|Data|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
-|F5-TTS
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/AsmoKoskinen/F5-TTS_Finnish_Model)|[Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0), [Vox Populi](https://huggingface.co/datasets/facebook/voxpopuli)|cc-by-nc-4.0|
 
 ```bash
-
-
+Model: hf://AsmoKoskinen/F5-TTS_Finnish_Model/model_common_voice_fi_vox_populi_fi_20241206.safetensors
+Vocab: hf://AsmoKoskinen/F5-TTS_Finnish_Model/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
 ```
 
 
 ## French
 
-####
+#### F5-TTS Base @ fr @ RASPIAUDIO
 |Model|🤗Hugging Face|Data (Hours)|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
-|F5-TTS
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/RASPIAUDIO/F5-French-MixedSpeakers-reduced)|[LibriVox](https://librivox.org/)|cc-by-nc-4.0|
 
 ```bash
-
-
+Model: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/model_last_reduced.pt
+Vocab: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
 ```
 
 - [Online Inference with Hugging Face Space](https://huggingface.co/spaces/RASPIAUDIO/f5-tts_french).
@@ -83,31 +86,32 @@ VOCAB_FILE: hf://RASPIAUDIO/F5-French-MixedSpeakers-reduced/vocab.txt
 
 ## Hindi
 
-#### F5-TTS Small @
+#### F5-TTS Small @ hi @ SPRINGLab
 |Model|🤗Hugging Face|Data (Hours)|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
 |F5-TTS Small|[ckpt & vocab](https://huggingface.co/SPRINGLab/F5-Hindi-24KHz)|[IndicTTS Hi](https://huggingface.co/datasets/SPRINGLab/IndicTTS-Hindi) & [IndicVoices-R Hi](https://huggingface.co/datasets/SPRINGLab/IndicVoices-R_Hindi) |cc-by-4.0|
 
 ```bash
-
-
+Model: hf://SPRINGLab/F5-Hindi-24KHz/model_2500000.safetensors
+Vocab: hf://SPRINGLab/F5-Hindi-24KHz/vocab.txt
+Config: {"dim": 768, "depth": 18, "heads": 12, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
 ```
 
-Authors: SPRING Lab, Indian Institute of Technology, Madras
-
-Website: https://asr.iitm.ac.in/
+- Authors: SPRING Lab, Indian Institute of Technology, Madras
+- Website: https://asr.iitm.ac.in/
 
 
 ## Italian
 
-#### F5-TTS
+#### F5-TTS Base @ it @ alien79
 |Model|🤗Hugging Face|Data|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
-|F5-TTS
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/alien79/F5-TTS-italian)|[ylacombe/cml-tts](https://huggingface.co/datasets/ylacombe/cml-tts) |cc-by-nc-4.0|
 
 ```bash
-
-
+Model: hf://alien79/F5-TTS-italian/model_159600.safetensors
+Vocab: hf://alien79/F5-TTS-italian/vocab.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
 ```
 
 - Trained by [Mithril Man](https://github.com/MithrilMan)
@@ -117,14 +121,15 @@ VOCAB_FILE: hf://alien79/F5-TTS-italian/vocab.txt
 
 ## Japanese
 
-#### F5-TTS
+#### F5-TTS Base @ ja @ Jmica
 |Model|🤗Hugging Face|Data (Hours)|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
-|F5-TTS
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/Jmica/F5TTS/tree/main/JA_8500000)|[Emilia 1.7k JA](https://huggingface.co/datasets/amphion/Emilia-Dataset/tree/fc71e07) & [Galgame Dataset 5.4k](https://huggingface.co/datasets/OOPPEENN/Galgame_Dataset)|cc-by-nc-4.0|
 
 ```bash
-
-
+Model: hf://Jmica/F5TTS/JA_8500000/model_8499660.pt
+Vocab: hf://Jmica/F5TTS/JA_8500000/vocab_updated.txt
+Config: {"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}
 ```
 
 
@@ -133,9 +138,9 @@ VOCAB_FILE: hf://Jmica/F5TTS/JA_8500000/vocab_updated.txt
 
 ## Spanish
 
-#### F5-TTS
+#### F5-TTS Base @ es @ jpgallegoar
 |Model|🤗Hugging Face|Data (Hours)|Model License|
 |:---:|:------------:|:-----------:|:-------------:|
-|F5-TTS
+|F5-TTS Base|[ckpt & vocab](https://huggingface.co/jpgallegoar/F5-Spanish)|[Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) & Crowdsourced & TEDx, 218 hours|cc0-1.0|
 
 - @jpgallegoar [GitHub repo](https://github.com/jpgallegoar/Spanish-F5), Jupyter Notebook and Gradio usage for Spanish model.
````
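Each model card now publishes the same three fields that the updated CLI and Gradio app consume: a checkpoint, a vocab file, and the architecture dict. A hedged sketch of loading one card programmatically, using values copied from the Italian card above; it assumes `load_model(model_cls, model_cfg, ckpt_path, ...)` from `f5_tts.infer.utils_infer`, whose exact signature should be checked against the repo:

```python
# Sketch: turn a SHARED.md card (Model / Vocab / Config) into a loaded model.
import json

from cached_path import cached_path

from f5_tts.infer.utils_infer import load_model  # signature assumed, verify in repo
from f5_tts.model import DiT

ckpt_file = str(cached_path("hf://alien79/F5-TTS-italian/model_159600.safetensors"))
vocab_file = str(cached_path("hf://alien79/F5-TTS-italian/vocab.txt"))
model_cfg = json.loads('{"dim": 1024, "depth": 22, "heads": 16, "ff_mult": 2, "text_dim": 512, "conv_layers": 4}')

ema_model = load_model(DiT, model_cfg, ckpt_file, vocab_file=vocab_file)
```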
src/f5_tts/infer/infer_cli.py (CHANGED)

```diff
@@ -10,6 +10,7 @@ import numpy as np
 import soundfile as sf
 import tomli
 from cached_path import cached_path
+from omegaconf import OmegaConf
 
 from f5_tts.infer.utils_infer import (
     mel_spec_type,
@@ -51,6 +52,12 @@ parser.add_argument(
     type=str,
     help="The model name: F5-TTS | E2-TTS",
 )
+parser.add_argument(
+    "-mc",
+    "--model_cfg",
+    type=str,
+    help="The path to F5-TTS model config file .yaml",
+)
 parser.add_argument(
     "-p",
     "--ckpt_file",
@@ -166,6 +173,7 @@ config = tomli.load(open(args.config, "rb"))
 # command-line interface parameters
 
 model = args.model or config.get("model", "F5-TTS")
+model_cfg = args.model_cfg or config.get("model_cfg", str(files("f5_tts").joinpath("configs/F5TTS_Base_train.yaml")))
 ckpt_file = args.ckpt_file or config.get("ckpt_file", "")
 vocab_file = args.vocab_file or config.get("vocab_file", "")
 
@@ -179,9 +187,9 @@ output_file = args.output_file or config.get(
     "output_file", f"infer_cli_{datetime.now().strftime(r'%Y%m%d_%H%M%S')}.wav"
 )
 
-save_chunk = args.save_chunk
-remove_silence = args.remove_silence
-load_vocoder_from_local = args.load_vocoder_from_local
+save_chunk = args.save_chunk or config.get("save_chunk", False)
+remove_silence = args.remove_silence or config.get("remove_silence", False)
+load_vocoder_from_local = args.load_vocoder_from_local or config.get("load_vocoder_from_local", False)
 
 vocoder_name = args.vocoder_name or config.get("vocoder_name", mel_spec_type)
 target_rms = args.target_rms or config.get("target_rms", target_rms)
@@ -235,7 +243,7 @@ vocoder = load_vocoder(vocoder_name=vocoder_name, is_local=load_vocoder_from_loc
 
 if model == "F5-TTS":
     model_cls = DiT
-    model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
+    model_cfg = OmegaConf.load(model_cfg).model.arch
     if not ckpt_file:  # path not specified, download from repo
         if vocoder_name == "vocos":
             repo_name = "F5-TTS"
@@ -250,7 +258,8 @@ if model == "F5-TTS":
         ckpt_file = str(cached_path(f"hf://SWivid/{repo_name}/{exp_name}/model_{ckpt_step}.pt"))
 
 elif model == "E2-TTS":
-    assert
+    assert args.model_cfg is None, "E2-TTS does not support custom model_cfg yet"
+    assert vocoder_name == "vocos", "E2-TTS only supports vocoder vocos yet"
     model_cls = UNetT
     model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
     if not ckpt_file:  # path not specified, download from repo
```
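The new `-mc/--model_cfg` flag slots into the same precedence chain as the other CLI options: explicit flag first, then the key in the toml config file, then a hard default (here the packaged Base training yaml). A standalone sketch of that pattern; `resolve_model_cfg` is illustrative, not a function in the repo:

```python
# Illustrative helper showing the flag -> toml -> default fallback chain.
from importlib.resources import files

import tomli


def resolve_model_cfg(cli_value=None, config_path="basic.toml"):
    with open(config_path, "rb") as f:
        config = tomli.load(f)
    default_yaml = str(files("f5_tts").joinpath("configs/F5TTS_Base_train.yaml"))
    # Falsy CLI values (None, "", False) fall through to the config file, then the default.
    return cli_value or config.get("model_cfg", default_yaml)
```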
src/f5_tts/infer/infer_gradio.py (CHANGED)

```diff
@@ -1,6 +1,7 @@
 # ruff: noqa: E402
 # Above allows ruff to ignore E402: module level import not at top of file
 
+import json
 import re
 import tempfile
 from collections import OrderedDict
@@ -43,6 +44,12 @@ from f5_tts.infer.utils_infer import (
 DEFAULT_TTS_MODEL = "F5-TTS"
 tts_model_choice = DEFAULT_TTS_MODEL
 
+DEFAULT_TTS_MODEL_CFG = [
+    "hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors",
+    "hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt",
+    json.dumps(dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)),
+]
+
 
 # load models
 
@@ -103,7 +110,15 @@ def generate_response(messages, model, tokenizer):
 
 @gpu_decorator
 def infer(
-    ref_audio_orig,
+    ref_audio_orig,
+    ref_text,
+    gen_text,
+    model,
+    remove_silence,
+    cross_fade_duration=0.15,
+    nfe_step=32,
+    speed=1,
+    show_info=gr.Info,
 ):
     ref_audio, ref_text = preprocess_ref_audio_text(ref_audio_orig, ref_text, show_info=show_info)
 
@@ -120,7 +135,7 @@ def infer(
     global custom_ema_model, pre_custom_path
     if pre_custom_path != model[1]:
         show_info("Loading Custom TTS model...")
-        custom_ema_model = load_custom(model[1], vocab_path=model[2])
+        custom_ema_model = load_custom(model[1], vocab_path=model[2], model_cfg=model[3])
         pre_custom_path = model[1]
     ema_model = custom_ema_model
 
@@ -131,6 +146,7 @@ def infer(
         ema_model,
         vocoder,
         cross_fade_duration=cross_fade_duration,
+        nfe_step=nfe_step,
         speed=speed,
         show_info=show_info,
         progress=gr.Progress(),
@@ -184,6 +200,14 @@ with gr.Blocks() as app_tts:
             step=0.1,
             info="Adjust the speed of the audio.",
         )
+        nfe_slider = gr.Slider(
+            label="NFE Steps",
+            minimum=4,
+            maximum=64,
+            value=32,
+            step=2,
+            info="Set the number of denoising steps.",
+        )
         cross_fade_duration_slider = gr.Slider(
             label="Cross-Fade Duration (s)",
             minimum=0.0,
@@ -203,6 +227,7 @@ with gr.Blocks() as app_tts:
         gen_text_input,
         remove_silence,
         cross_fade_duration_slider,
+        nfe_slider,
         speed_slider,
     ):
         audio_out, spectrogram_path, ref_text_out = infer(
@@ -211,8 +236,9 @@ with gr.Blocks() as app_tts:
             gen_text_input,
             tts_model_choice,
             remove_silence,
-            cross_fade_duration_slider,
-            speed_slider,
+            cross_fade_duration=cross_fade_duration_slider,
+            nfe_step=nfe_slider,
+            speed=speed_slider,
         )
         return audio_out, spectrogram_path, gr.update(value=ref_text_out)
 
@@ -224,6 +250,7 @@ with gr.Blocks() as app_tts:
             gen_text_input,
             remove_silence,
             cross_fade_duration_slider,
+            nfe_slider,
             speed_slider,
         ],
         outputs=[audio_output, spectrogram_output, ref_text_input],
@@ -744,34 +771,38 @@ If you're having issues, try converting your reference audio to WAV or MP3, clip
     """
     )
 
-    last_used_custom = files("f5_tts").joinpath("infer/.cache/
+    last_used_custom = files("f5_tts").joinpath("infer/.cache/last_used_custom_model_info.txt")
 
     def load_last_used_custom():
        try:
-
-
+            custom = []
+            with open(last_used_custom, "r", encoding='utf-8') as f:
+                for line in f:
+                    custom.append(line.strip())
+            return custom
         except FileNotFoundError:
             last_used_custom.parent.mkdir(parents=True, exist_ok=True)
-            return [
-                "hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors",
-                "hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt",
-            ]
+            return DEFAULT_TTS_MODEL_CFG
 
     def switch_tts_model(new_choice):
         global tts_model_choice
         if new_choice == "Custom":  # override in case webpage is refreshed
-            custom_ckpt_path, custom_vocab_path = load_last_used_custom()
-            tts_model_choice = ["Custom", custom_ckpt_path, custom_vocab_path]
-            return
+            custom_ckpt_path, custom_vocab_path, custom_model_cfg = load_last_used_custom()
+            tts_model_choice = ["Custom", custom_ckpt_path, custom_vocab_path, json.loads(custom_model_cfg)]
+            return (
+                gr.update(visible=True, value=custom_ckpt_path),
+                gr.update(visible=True, value=custom_vocab_path),
+                gr.update(visible=True, value=custom_model_cfg),
+            )
         else:
             tts_model_choice = new_choice
-            return gr.update(visible=False), gr.update(visible=False)
+            return gr.update(visible=False), gr.update(visible=False), gr.update(visible=False)
 
-    def set_custom_model(custom_ckpt_path, custom_vocab_path):
+    def set_custom_model(custom_ckpt_path, custom_vocab_path, custom_model_cfg):
         global tts_model_choice
-        tts_model_choice = ["Custom", custom_ckpt_path, custom_vocab_path]
-        with open(last_used_custom, "w") as f:
-            f.write(
+        tts_model_choice = ["Custom", custom_ckpt_path, custom_vocab_path, json.loads(custom_model_cfg)]
+        with open(last_used_custom, "w", encoding='utf-8') as f:
+            f.write(custom_ckpt_path + "\n" + custom_vocab_path + "\n" + custom_model_cfg + "\n")
 
     with gr.Row():
         if not USING_SPACES:
@@ -783,34 +814,46 @@ If you're having issues, try converting your reference audio to WAV or MP3, clip
             choices=[DEFAULT_TTS_MODEL, "E2-TTS"], label="Choose TTS Model", value=DEFAULT_TTS_MODEL
         )
         custom_ckpt_path = gr.Dropdown(
-            choices=[
+            choices=[DEFAULT_TTS_MODEL_CFG[0]],
             value=load_last_used_custom()[0],
             allow_custom_value=True,
-            label="
+            label="Model: local_path | hf://user_id/repo_id/model_ckpt",
             visible=False,
         )
         custom_vocab_path = gr.Dropdown(
-            choices=[
+            choices=[DEFAULT_TTS_MODEL_CFG[1]],
             value=load_last_used_custom()[1],
             allow_custom_value=True,
-            label="
+            label="Vocab: local_path | hf://user_id/repo_id/vocab_file",
+            visible=False,
+        )
+        custom_model_cfg = gr.Dropdown(
+            choices=[DEFAULT_TTS_MODEL_CFG[2]],
+            value=load_last_used_custom()[2],
+            allow_custom_value=True,
+            label="Config: in a dictionary form",
             visible=False,
         )
 
     choose_tts_model.change(
         switch_tts_model,
         inputs=[choose_tts_model],
-        outputs=[custom_ckpt_path, custom_vocab_path],
+        outputs=[custom_ckpt_path, custom_vocab_path, custom_model_cfg],
         show_progress="hidden",
     )
     custom_ckpt_path.change(
         set_custom_model,
-        inputs=[custom_ckpt_path, custom_vocab_path],
+        inputs=[custom_ckpt_path, custom_vocab_path, custom_model_cfg],
         show_progress="hidden",
     )
     custom_vocab_path.change(
         set_custom_model,
-        inputs=[custom_ckpt_path, custom_vocab_path],
+        inputs=[custom_ckpt_path, custom_vocab_path, custom_model_cfg],
+        show_progress="hidden",
+    )
+    custom_model_cfg.change(
+        set_custom_model,
+        inputs=[custom_ckpt_path, custom_vocab_path, custom_model_cfg],
         show_progress="hidden",
     )
 
```
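The app persists the last used custom model as three plain lines in a cache file: checkpoint path, vocab path, and the config serialized with `json.dumps`; on reload the config string goes back through `json.loads`. A self-contained sketch of that round-trip (the file path and helper names here are illustrative, not the app's own):

```python
# Illustrative round-trip for the three-line custom-model cache file.
import json
from pathlib import Path

cache = Path(".cache/last_used_custom_model_info.txt")


def save_info(ckpt: str, vocab: str, cfg_json: str) -> None:
    # One field per line, with the architecture dict kept as a JSON string.
    cache.parent.mkdir(parents=True, exist_ok=True)
    cache.write_text(ckpt + "\n" + vocab + "\n" + cfg_json + "\n", encoding="utf-8")


def load_info():
    ckpt, vocab, cfg_json = cache.read_text(encoding="utf-8").splitlines()[:3]
    return ckpt, vocab, json.loads(cfg_json)


save_info(
    "hf://SWivid/F5-TTS/F5TTS_Base/model_1200000.safetensors",
    "hf://SWivid/F5-TTS/F5TTS_Base/vocab.txt",
    json.dumps(dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)),
)
print(load_info()[2]["dim"])  # 1024
```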
src/f5_tts/model/backbones/dit.py (CHANGED)

```diff
@@ -131,8 +131,7 @@ class DiT(nn.Module):
         self.checkpoint_activations = checkpoint_activations
 
     def ckpt_wrapper(self, module):
-
-
+        # https://github.com/chuanyangjin/fast-DiT/blob/main/models.py
         def ckpt_forward(*inputs):
             outputs = module(*inputs)
             return outputs
```
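`ckpt_wrapper` follows the fast-DiT pattern linked in the comment: each transformer block is wrapped so `torch.utils.checkpoint` can recompute its activations during backward instead of storing them, which is what the `checkpoint_activations` flag in the training configs toggles. A hedged sketch of the pattern on a toy module (the actual call site in `dit.py` may differ):

```python
# Toy example of the fast-DiT activation-checkpointing pattern, assuming torch is installed.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyDiT(nn.Module):
    def __init__(self, checkpoint_activations=False):
        super().__init__()
        self.blocks = nn.ModuleList(nn.Linear(16, 16) for _ in range(4))
        self.checkpoint_activations = checkpoint_activations

    def ckpt_wrapper(self, module):
        # https://github.com/chuanyangjin/fast-DiT/blob/main/models.py
        def ckpt_forward(*inputs):
            outputs = module(*inputs)
            return outputs

        return ckpt_forward

    def forward(self, x):
        for block in self.blocks:
            if self.checkpoint_activations:
                # Recompute this block's activations in backward: extra compute, less memory.
                x = checkpoint(self.ckpt_wrapper(block), x, use_reentrant=False)
            else:
                x = block(x)
        return x


x = torch.randn(2, 16, requires_grad=True)
TinyDiT(checkpoint_activations=True)(x).sum().backward()
```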