Commit 2afcda9
Parent(s): ceb1c72

Add Gradio app, MPS support

Files changed:
- README.md +83 -25
- gradio_app.py +265 -0
- requirements_gradio.txt +3 -0
- test_infer_single.py +1 -1
README.md
CHANGED
# F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

[Paper (arXiv)](https://arxiv.org/abs/2410.06885)
[Demo page](https://swivid.github.io/F5-TTS/)
[Hugging Face Space](https://huggingface.co/spaces/mrfakename/E2-F5-TTS)

**F5-TTS**: Diffusion Transformer with ConvNeXt V2, faster training and inference.

**E2 TTS**: Flat-UNet Transformer, closest reproduction.

**Sway Sampling**: Inference-time flow-step sampling strategy that greatly improves performance.

## Installation

Clone the repository:

```bash
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
```

Install packages:

```bash
pip install -r requirements.txt
```

## Prepare Dataset

Example data processing scripts are provided for Emilia and WenetSpeech4TTS; you may tailor your own along with a Dataset class in `model/dataset.py`.

```bash
# prepare custom dataset up to your need
# download corresponding dataset first, and fill in the path in scripts
...
```

## Training

Once your datasets are prepared, you can start the training process.

```bash
# setup accelerate config, e.g. use multi-gpu ddp, fp16
# the config will be saved to: ~/.cache/huggingface/accelerate/default_config.yaml
...
```

## Inference

To run inference with pretrained models, download the checkpoints from [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS).
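
A minimal download sketch using `huggingface_hub`; the `F5TTS_Base/model_1200000.pt` path mirrors the one resolved in `gradio_app.py`, and the local cache location is whatever `huggingface_hub` chooses, so adjust to your setup:

```python
# assumes huggingface_hub is available (it is already pulled in for the hf:// paths used in this repo)
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="SWivid/F5-TTS",
    filename="F5TTS_Base/model_1200000.pt",  # use E2TTS_Base/model_1200000.pt for E2 TTS
)
print(ckpt_path)  # local path to point your inference config at
```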

### Single Inference

You can test single inference using the following command. Before running the command, modify the config up to your need.

```bash
# modify the config up to your need,
# e.g. fix_duration (the total length of prompt + to_generate, currently supports up to 30s)
...
# (though 'midpoint' is a 2nd-order ODE solver, it is slower than the 1st-order 'Euler')
python test_infer_single.py
```

### Speech Editing

To test speech editing capabilities, use the following command.

```bash
python test_infer_single_edit.py
```

### Gradio App

You can launch a Gradio app (web interface) to run inference through a GUI.

First, make sure you have the dependencies installed (`pip install -r requirements.txt`). Then, install the Gradio app dependencies:

```bash
pip install -r requirements_gradio.txt
```

After installing the dependencies, launch the app:

```bash
python gradio_app.py
```

You can specify the port/host:

```bash
python gradio_app.py --port 7860 --host 0.0.0.0
```

Or launch a share link:

```bash
python gradio_app.py --share
```

## Evaluation

### Prepare Test Datasets

1. Seed-TTS test set: Download from [seed-tts-eval](https://github.com/BytedanceSpeech/seed-tts-eval).
2. LibriSpeech test-clean: Download from [OpenSLR](http://www.openslr.org/12/).
3. Unzip the downloaded datasets and place them in the `data/` directory (see the sketch after this list).
4. ...
5. Our filtered LibriSpeech-PC 4-10s subset is already under `data/` in this repo.
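
A hedged shell sketch of steps 1-3; the URLs come from the links above, but the exact directory layout expected by the evaluation scripts is an assumption, so adjust the target paths to match them:

```bash
mkdir -p data

# LibriSpeech test-clean from OpenSLR (resource 12)
wget https://www.openslr.org/resources/12/test-clean.tar.gz
tar -xzf test-clean.tar.gz -C data/   # unpacks to data/LibriSpeech/test-clean (assumed target layout)

# Seed-TTS test set: fetch via the seed-tts-eval repository and place it under data/
git clone https://github.com/BytedanceSpeech/seed-tts-eval.git data/seed-tts-eval
```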

### Batch Inference for Test Set

To run batch inference for evaluations, execute the following commands:

```bash
# batch inference for evaluations
accelerate config  # if not set before
...
```

### Download Evaluation Model Checkpoints

1. Chinese ASR Model: [Paraformer-zh](https://huggingface.co/funasr/paraformer-zh)
2. English ASR Model: [Faster-Whisper](https://huggingface.co/Systran/faster-whisper-large-v3)
3. WavLM Model: Download from [Google Drive](https://drive.google.com/file/d/1-aE1NfzpRCLxA4GUxX9ITI3F9LlbtEGP/view).

### Objective Evaluation

**Some Notes**

For faster-whisper with CUDA 11:

```bash
pip install --force-reinstall ctranslate2==3.24.0
```

(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:

```bash
pip install faster-whisper==0.10.1
```

Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:

```bash
...
```

## Acknowledgements

- [E2-TTS](https://arxiv.org/abs/2406.18009) brilliant work, simple and effective
- [Emilia](https://arxiv.org/abs/2407.05361), [WenetSpeech4TTS](https://arxiv.org/abs/2406.05763) valuable datasets
- [lucidrains](https://github.com/lucidrains) initial CFM structure, with [bfs18](https://github.com/bfs18) also for discussion
- [SD3](https://arxiv.org/abs/2403.03206) & [Hugging Face diffusers](https://github.com/huggingface/diffusers) DiT and MMDiT code structure
- [torchdiffeq](https://github.com/rtqichen/torchdiffeq) as ODE solver, [Vocos](https://huggingface.co/charactr/vocos-mel-24khz) as vocoder
- [mrfakename](https://x.com/realmrfakename) huggingface space demo ~
- [FunASR](https://github.com/modelscope/FunASR), [faster-whisper](https://github.com/SYSTRAN/faster-whisper), [UniSpeech](https://github.com/microsoft/UniSpeech) for evaluation tools
- [ctc-forced-aligner](https://github.com/MahmoudAshraf97/ctc-forced-aligner) for speech edit test

## Citation

```
...
  year={2024},
}
```

## License

Our code is released under the MIT License.
gradio_app.py
ADDED

```python
import os
import re
import torch
import torchaudio
import gradio as gr
import numpy as np
import tempfile
from einops import rearrange
from ema_pytorch import EMA
from vocos import Vocos
from pydub import AudioSegment
from model import CFM, UNetT, DiT, MMDiT
from cached_path import cached_path
from model.utils import (
    get_tokenizer,
    convert_char_to_pinyin,
    save_spectrogram,
)
from transformers import pipeline
import librosa
import click

device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

print(f"Using {device} device")

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device=device,
)

# --------------------- Settings -------------------- #

target_sample_rate = 24000
n_mel_channels = 100
hop_length = 256
target_rms = 0.1
nfe_step = 32  # 16, 32
cfg_strength = 2.0
ode_method = "euler"
sway_sampling_coef = -1.0
speed = 1.0
# fix_duration = 27  # None or float (duration in seconds)
fix_duration = None


def load_model(exp_name, model_cls, model_cfg, ckpt_step):
    checkpoint = torch.load(
        str(cached_path(f"hf://SWivid/F5-TTS/{exp_name}/model_{ckpt_step}.pt")),
        map_location=device,
    )
    vocab_char_map, vocab_size = get_tokenizer("Emilia_ZH_EN", "pinyin")
    model = CFM(
        transformer=model_cls(
            **model_cfg, text_num_embeds=vocab_size, mel_dim=n_mel_channels
        ),
        mel_spec_kwargs=dict(
            target_sample_rate=target_sample_rate,
            n_mel_channels=n_mel_channels,
            hop_length=hop_length,
        ),
        odeint_kwargs=dict(
            method=ode_method,
        ),
        vocab_char_map=vocab_char_map,
    ).to(device)

    ema_model = EMA(model, include_online_model=False).to(device)
    ema_model.load_state_dict(checkpoint["ema_model_state_dict"])
    ema_model.copy_params_from_ema_to_model()

    return ema_model, model


# load models
F5TTS_model_cfg = dict(
    dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4
)
E2TTS_model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)

F5TTS_ema_model, F5TTS_base_model = load_model(
    "F5TTS_Base", DiT, F5TTS_model_cfg, 1200000
)
E2TTS_ema_model, E2TTS_base_model = load_model(
    "E2TTS_Base", UNetT, E2TTS_model_cfg, 1200000
)


def infer(ref_audio_orig, ref_text, gen_text, exp_name, remove_silence):
    print(gen_text)
    if len(gen_text) > 200:
        raise gr.Error("Please keep your text under 200 chars.")
    gr.Info("Converting audio...")
    with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as f:
        aseg = AudioSegment.from_file(ref_audio_orig)
        audio_duration = len(aseg)
        if audio_duration > 15000:
            gr.Warning("Audio is over 15s, clipping to only first 15s.")
            aseg = aseg[:15000]
        aseg.export(f.name, format="wav")
        ref_audio = f.name
    if exp_name == "F5-TTS":
        ema_model = F5TTS_ema_model
        base_model = F5TTS_base_model
    elif exp_name == "E2-TTS":
        ema_model = E2TTS_ema_model
        base_model = E2TTS_base_model

    if not ref_text.strip():
        gr.Info("No reference text provided, transcribing reference audio...")
        ref_text = outputs = pipe(
            ref_audio,
            chunk_length_s=30,
            batch_size=128,
            generate_kwargs={"task": "transcribe"},
            return_timestamps=False,
        )["text"].strip()
        gr.Info("Finished transcription")
    else:
        gr.Info("Using custom reference text...")
    audio, sr = torchaudio.load(ref_audio)

    rms = torch.sqrt(torch.mean(torch.square(audio)))
    if rms < target_rms:
        audio = audio * target_rms / rms
    if sr != target_sample_rate:
        resampler = torchaudio.transforms.Resample(sr, target_sample_rate)
        audio = resampler(audio)
    audio = audio.to(device)

    # Prepare the text
    text_list = [ref_text + gen_text]
    final_text_list = convert_char_to_pinyin(text_list)

    # Calculate duration
    ref_audio_len = audio.shape[-1] // hop_length
    # if fix_duration is not None:
    #     duration = int(fix_duration * target_sample_rate / hop_length)
    # else:
    zh_pause_punc = r"。,、;:?!"
    ref_text_len = len(ref_text) + len(re.findall(zh_pause_punc, ref_text))
    gen_text_len = len(gen_text) + len(re.findall(zh_pause_punc, gen_text))
    duration = ref_audio_len + int(ref_audio_len / ref_text_len * gen_text_len / speed)

    # inference
    gr.Info(f"Generating audio using {exp_name}")
    with torch.inference_mode():
        generated, _ = base_model.sample(
            cond=audio,
            text=final_text_list,
            duration=duration,
            steps=nfe_step,
            cfg_strength=cfg_strength,
            sway_sampling_coef=sway_sampling_coef,
        )

    generated = generated[:, ref_audio_len:, :]
    generated_mel_spec = rearrange(generated, "1 n d -> 1 d n")
    gr.Info("Running vocoder")
    vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")
    generated_wave = vocos.decode(generated_mel_spec.cpu())
    if rms < target_rms:
        generated_wave = generated_wave * rms / target_rms

    # wav -> numpy
    generated_wave = generated_wave.squeeze().cpu().numpy()

    if remove_silence:
        gr.Info("Removing audio silences... This may take a moment")
        non_silent_intervals = librosa.effects.split(generated_wave, top_db=30)
        non_silent_wave = np.array([])
        for interval in non_silent_intervals:
            start, end = interval
            non_silent_wave = np.concatenate(
                [non_silent_wave, generated_wave[start:end]]
            )
        generated_wave = non_silent_wave

    # spectogram
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp_spectrogram:
        spectrogram_path = tmp_spectrogram.name
        save_spectrogram(generated_mel_spec[0].cpu().numpy(), spectrogram_path)

    return (target_sample_rate, generated_wave), spectrogram_path


with gr.Blocks() as app:
    gr.Markdown(
        """
# E2/F5 TTS

This is a local web UI for F5 TTS, based on the unofficial [online demo](https://huggingface.co/spaces/mrfakename/E2-F5-TTS). This app supports the following TTS models:

* [F5-TTS](https://arxiv.org/abs/2410.06885) (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching)
* [E2-TTS](https://arxiv.org/abs/2406.18009) (Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS)

The checkpoints support English and Chinese.

If you're having issues, try converting your reference audio to WAV or MP3, clipping it to 15s, and shortening your prompt.

**NOTE: Reference text will be automatically transcribed with Whisper if not provided. For best results, keep your reference clips short (<15s). Ensure the audio is fully uploaded before generating.**
"""
    )

    ref_audio_input = gr.Audio(label="Reference Audio", type="filepath")
    gen_text_input = gr.Textbox(label="Text to Generate (max 200 chars.)", lines=4)
    model_choice = gr.Radio(
        choices=["F5-TTS", "E2-TTS"], label="Choose TTS Model", value="F5-TTS"
    )
    generate_btn = gr.Button("Synthesize", variant="primary")
    with gr.Accordion("Advanced Settings", open=False):
        ref_text_input = gr.Textbox(
            label="Reference Text",
            info="Leave blank to automatically transcribe the reference audio. If you enter text it will override automatic transcription.",
            lines=2,
        )
        remove_silence = gr.Checkbox(
            label="Remove Silences",
            info="The model tends to produce silences, especially on longer audio. We can manually remove silences if needed. Note that this is an experimental feature and may produce strange results. This will also increase generation time.",
            value=True,
        )

    audio_output = gr.Audio(label="Synthesized Audio")
    spectrogram_output = gr.Image(label="Spectrogram")

    generate_btn.click(
        infer,
        inputs=[
            ref_audio_input,
            ref_text_input,
            gen_text_input,
            model_choice,
            remove_silence,
        ],
        outputs=[audio_output, spectrogram_output],
    )


@click.command()
@click.option("--port", "-p", default=None, help="Port to run the app on")
@click.option("--host", "-H", default=None, help="Host to run the app on")
@click.option(
    "--share",
    "-s",
    default=False,
    is_flag=True,
    help="Share the app via Gradio share link",
)
@click.option("--api", "-a", default=True, is_flag=True, help="Allow API access")
def main(port, host, share, api):
    global app
    print(f"Starting app...")
    app.queue(api_open=api).launch(
        server_name=host, server_port=port, share=share, show_api=api
    )


if __name__ == "__main__":
    main()
```
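
Once the app is running, it can also be driven programmatically. A hedged sketch using `gradio_client`: the local URL, the use of `fn_index=0`, and passing the reference clip as a plain file path are assumptions that depend on your Gradio version; the argument order mirrors the `inputs=` list wired to `generate_btn.click` above.

```python
# pip install gradio_client   (assumed extra dependency, not in requirements_gradio.txt)
from gradio_client import Client

client = Client("http://127.0.0.1:7860")  # default local address used by gradio_app.py
audio_result, spectrogram_result = client.predict(
    "ref.wav",               # Reference Audio: path to a short (<15 s) reference clip
    "",                      # Reference Text: blank -> automatic Whisper transcription
    "Hello from F5-TTS!",    # Text to Generate (keep under 200 chars)
    "F5-TTS",                # Choose TTS Model: "F5-TTS" or "E2-TTS"
    True,                    # Remove Silences
    fn_index=0,              # the app exposes a single unnamed endpoint
)
# both results come back as file paths on the client side (assumption)
print(audio_result, spectrogram_result)
```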
requirements_gradio.txt
ADDED

```
cached_path
pydub
click
```
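
One note on these dependencies: `pydub` (used by `gradio_app.py` to convert the uploaded reference clip) shells out to an ffmpeg/libav binary for non-WAV formats such as MP3, so that binary must be on `PATH`. The package-manager commands below are assumptions for common platforms:

```bash
pip install -r requirements_gradio.txt

# pydub needs ffmpeg for MP3/M4A reference audio (package names assumed)
sudo apt-get install ffmpeg   # Debian/Ubuntu
# brew install ffmpeg         # macOS
```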
test_infer_single.py
CHANGED

```diff
@@ -14,7 +14,7 @@ from model.utils import (
     save_spectrogram,
 )
 
-device = "cuda" if torch.cuda.is_available() else "cpu"
+device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
 
 
 # --------------------- Dataset Settings -------------------- #
```