SWivid committed
Commit d8638a6 · 1 Parent(s): c4eee0f
.github/workflows/publish-docker-image.yaml DELETED
@@ -1,61 +0,0 @@
-name: Create and publish a Docker image
-
-# Configures this workflow to run every time a change is pushed to the branch called `release`.
-on:
-  push:
-    branches: ['main']
-
-# Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
-env:
-  REGISTRY: ghcr.io
-  IMAGE_NAME: ${{ github.repository }}
-
-# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
-jobs:
-  build-and-push-image:
-    runs-on: ubuntu-latest
-    # Sets the permissions granted to the `GITHUB_TOKEN` for the actions in this job.
-    permissions:
-      contents: read
-      packages: write
-    #
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-      - name: Free Up GitHub Actions Ubuntu Runner Disk Space 🔧
-        uses: jlumbroso/free-disk-space@main
-        with:
-          # This might remove tools that are actually needed, if set to "true" but frees about 6 GB
-          tool-cache: false
-
-          # All of these default to true, but feel free to set to "false" if necessary for your workflow
-          android: true
-          dotnet: true
-          haskell: true
-          large-packages: false
-          swap-storage: false
-          docker-images: false
-      # Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
-      - name: Log in to the Container registry
-        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
-        with:
-          registry: ${{ env.REGISTRY }}
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      # This step uses [docker/metadata-action](https://github.com/docker/metadata-action#about) to extract tags and labels that will be applied to the specified image. The `id` "meta" allows the output of this step to be referenced in a subsequent step. The `images` value provides the base name for the tags and labels.
-      - name: Extract metadata (tags, labels) for Docker
-        id: meta
-        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
-        with:
-          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
-      # This step uses the `docker/build-push-action` action to build the image, based on your repository's `Dockerfile`. If the build succeeds, it pushes the image to GitHub Packages.
-      # It uses the `context` parameter to define the build's context as the set of files located in the specified path. For more information, see "[Usage](https://github.com/docker/build-push-action#usage)" in the README of the `docker/build-push-action` repository.
-      # It uses the `tags` and `labels` parameters to tag and label the image with the output from the "meta" step.
-      - name: Build and push Docker image
-        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
-        with:
-          context: .
-          file: ./gradio.Dockerfile
-          push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
Dockerfile CHANGED
@@ -17,8 +17,7 @@ WORKDIR /workspace
 
 RUN git clone https://github.com/SWivid/F5-TTS.git \
     && cd F5-TTS \
-    && pip install --no-cache-dir -r requirements.txt \
-    && pip install --no-cache-dir -r requirements_eval.txt
+    && pip install -e .[eval]
 
 ENV SHELL=/bin/bash
 
README.md CHANGED
@@ -18,43 +18,46 @@
 
 ## Installation
 
-Clone the repository:
-
 ```bash
-git clone https://github.com/SWivid/F5-TTS.git
-cd F5-TTS
+# Create a python 3.10 conda env (you could also use virtualenv)
+conda create -n f5-tts python=3.10
+conda activate f5-tts
+
+# Install pytorch with your CUDA version, e.g.
+pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
 ```
 
-Install torch with your CUDA version, e.g. :
-
-```bash
-pip install torch==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
-pip install torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
-```
-
-Install other packages:
+Then you can choose from a few options below:
+
+### 1. Local editable
 
 ```bash
-pip install -r requirements.txt
+git clone https://github.com/SWivid/F5-TTS.git
+cd F5-TTS
+pip install -e .
 ```
 
-**[Optional]**: We provide [Dockerfile](https://github.com/SWivid/F5-TTS/blob/main/Dockerfile) and you can use the following command to build it.
+### 2. As a pip package
+
+```bash
+pip install git+https://github.com/SWivid/F5-TTS.git
+```
+
+### 3. Build from dockerfile
 ```bash
 docker build -t f5tts:v1 .
 ```
 
-### Development
+## Development
 
-When making a pull request, please use pre-commit to ensure code quality:
+Use pre-commit to ensure code quality (will run linters and formatters automatically)
 
 ```bash
 pip install pre-commit
 pre-commit install
 ```
 
-This will run linters and formatters automatically before each commit.
-
-Manually run using:
+When making a pull request, before each commit, run:
 
 ```bash
 pre-commit run --all-files
@@ -62,28 +65,6 @@ pre-commit run --all-files
 
 Note: Some model components have linting exceptions for E722 to accommodate tensor notation
 
-
-### As a pip package
-
-```bash
-pip install git+https://github.com/SWivid/F5-TTS.git
-```
-
-```python
-import gradio as gr
-from f5_tts.gradio_app import app
-
-with gr.Blocks() as main_app:
-    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
-
-    # ... other Gradio components
-
-    app.render()
-
-main_app.launch()
-
-```
-
 ## Prepare Dataset
 
 Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `f5_tts/model/dataset.py`.
@@ -147,6 +128,21 @@ export WANDB_MODE=offline
 
 ## Inference
 
+```python
+import gradio as gr
+from f5_tts.gradio_app import app
+
+with gr.Blocks() as main_app:
+    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
+
+    # ... other Gradio components
+
+    app.render()
+
+main_app.launch()
+
+```
+
 The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
 
 Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
@@ -248,21 +244,7 @@ bash scripts/eval_infer_batch.sh
 Install packages for evaluation:
 
 ```bash
-pip install -r requirements_eval.txt
-```
-
-**Some Notes**
-
-For faster-whisper with CUDA 11:
-
-```bash
-pip install --force-reinstall ctranslate2==3.24.0
-```
-
-(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:
-
-```bash
-pip install faster-whisper==0.10.1
+pip install -e .[eval]
 ```
 
 Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
app.py DELETED
@@ -1,3 +0,0 @@
-from f5_tts.gradio_app import app
-
-app.queue().launch()
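The removed `app.py` was only a three-line launcher, and its behaviour remains reproducible from the installed package. A minimal standalone sketch, assuming `f5_tts.gradio_app` still exposes the Gradio app as `app` (as the README's new embedding example does):

```python
# Standalone launcher equivalent to the deleted app.py.
# Assumes f5_tts.gradio_app still exposes the Gradio Blocks as `app`.
from f5_tts.gradio_app import app

# queue() enables request queuing so long TTS generations are not dropped
app.queue().launch()
```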
gradio.Dockerfile DELETED
@@ -1,27 +0,0 @@
-FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel
-
-USER root
-
-ARG DEBIAN_FRONTEND=noninteractive
-
-LABEL github_repo="https://github.com/rsxdalv/F5-TTS"
-
-RUN set -x \
-    && apt-get update \
-    && apt-get -y install wget curl man git less openssl libssl-dev unzip unar build-essential aria2 tmux vim \
-    && apt-get install -y openssh-server sox libsox-fmt-all libsox-fmt-mp3 libsndfile1-dev ffmpeg \
-    && rm -rf /var/lib/apt/lists/* \
-    && apt-get clean
-
-WORKDIR /workspace
-
-RUN git clone https://github.com/rsxdalv/F5-TTS.git \
-    && cd F5-TTS \
-    && pip install --no-cache-dir -r requirements.txt
-
-ENV SHELL=/bin/bash
-
-WORKDIR /workspace/F5-TTS/f5_tts
-
-EXPOSE 7860
-CMD python gradio_app.py
pyproject.toml CHANGED
@@ -7,6 +7,7 @@ name = "f5-tts"
 dynamic = ["version"]
 description = "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
 readme = "README.md"
+license = {text = "MIT License"}
 classifiers = [
     "License :: OSI Approved :: MIT License",
     "Operating System :: OS Independent",
@@ -14,11 +15,10 @@ classifiers = [
 ]
 dependencies = [
     "accelerate>=0.33.0",
-    "cached_path @ git+https://github.com/rsxdalv/cached_path@main",
+    "bitsandbytes>0.37.0",
+    "cached_path",
     "click",
     "datasets",
-    "einops>=0.8.0",
-    "einx>=0.3.0",
     "ema_pytorch>=0.5.2",
     "gradio",
     "jieba",
@@ -40,13 +40,17 @@ dependencies = [
     "x_transformers>=1.31.14",
 ]
 
-[[project.authors]]
-name = "Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen"
+[project.optional-dependencies]
+eval = [
+    "faster_whisper==0.10.1",
+    "funasr",
+    "jiwer",
+    "zhconv",
+    "zhon",
+]
 
 [project.urls]
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"finetune-cli" = "f5_tts.finetune_cli:main"
 "inference-cli" = "f5_tts.inference_cli:main"
-"eval_infer_batch" = "f5_tts.scripts.eval_infer_batch:main"
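With both requirements files folded into `pyproject.toml`, evaluation dependencies now install through the new extra (`pip install -e .[eval]`, as the Dockerfile and README now do), and `finetune-cli` / `eval_infer_batch` are no longer declared as console scripts. A quick sanity check after installing is to list the entry points the package registers; a minimal sketch, assuming Python 3.10+ (the `group=` keyword of `importlib.metadata.entry_points` requires it):

```python
# List console scripts registered by the f5_tts package after `pip install -e .`
# Expected output: inference-cli -> f5_tts.inference_cli:main
from importlib.metadata import entry_points

for ep in entry_points(group="console_scripts"):
    if ep.module.split(".")[0] == "f5_tts":
        print(ep.name, "->", ep.value)
```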
requirements.txt DELETED
@@ -1,22 +0,0 @@
-accelerate>=0.33.0
-bitsandbytes>0.37.0
-cached_path
-click
-datasets
-ema_pytorch>=0.5.2
-gradio
-jieba
-librosa
-matplotlib
-numpy<=1.26.4
-pydub
-pypinyin
-safetensors
-soundfile
-tomli
-torchdiffeq
-tqdm>=4.65.0
-transformers
-vocos
-wandb
-x_transformers>=1.31.14
requirements_eval.txt DELETED
@@ -1,5 +0,0 @@
-faster_whisper
-funasr
-jiwer
-zhconv
-zhon
src/f5_tts/finetune_cli.py CHANGED
@@ -1,10 +1,11 @@
 import argparse
+import os
+import shutil
+
+from cached_path import cached_path
 from f5_tts.model import CFM, UNetT, DiT, Trainer
 from f5_tts.model.utils import get_tokenizer
 from f5_tts.model.dataset import load_dataset
-from cached_path import cached_path
-import shutil
-import os
 
 # -------------------------- Dataset Settings --------------------------- #
 target_sample_rate = 24000
src/f5_tts/inference_cli.py CHANGED
@@ -29,7 +29,7 @@ parser.add_argument(
     "-c",
     "--config",
     help="Configuration file. Default=inference-cli.toml",
-    default=os.path.join(files('f5_tts').joinpath('data'), 'inference-cli.toml')
+    default=os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml"),
 )
 parser.add_argument(
     "-m",
@@ -168,8 +168,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
         remove_silence_for_generated_wav(f.name)
     print(f.name)
 
+
 def main():
     main_process(ref_audio, ref_text, gen_text, ema_model, remove_silence)
 
+
 if __name__ == "__main__":
     main()
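Besides normalizing quote style, the new `--config` default illustrates the pattern this commit uses throughout: resolving packaged data via `importlib.resources.files` rather than repo-relative paths, so the CLI finds its default config from any working directory once the package is installed. A minimal sketch of the same lookup, assuming the package ships `data/inference-cli.toml`:

```python
# Resolve a data file bundled inside the installed f5_tts package,
# mirroring the CLI's new default for --config.
import os
from importlib.resources import files

config_path = os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml")
print(config_path)  # absolute path, independent of the current working directory
```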
src/f5_tts/model/utils.py CHANGED
@@ -122,7 +122,7 @@ def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
         - if use "byte", set to 256 (unicode byte range)
     """
     if tokenizer in ["pinyin", "char"]:
-        tokenizer_path = os.path.join(files('f5_tts').joinpath('data'), f"{dataset_name}_{tokenizer}/vocab.txt")
+        tokenizer_path = os.path.join(files("f5_tts").joinpath("data"), f"{dataset_name}_{tokenizer}/vocab.txt")
         with open(tokenizer_path, "r", encoding="utf-8") as f:
             vocab_char_map = {}
             for i, char in enumerate(f):
src/f5_tts/scripts/eval_infer_batch.py CHANGED
@@ -36,6 +36,7 @@ target_rms = 0.1
 
 tokenizer = "pinyin"
 
+
 def main():
     # ---------------------- infer setting ---------------------- #
 
@@ -54,7 +55,6 @@ def main():
 
     args = parser.parse_args()
 
-
     seed = args.seed
     dataset_name = args.dataset
     exp_name = args.expname
@@ -67,14 +67,12 @@ def main():
 
     testset = args.testset
 
-
     infer_batch_size = 1  # max frames. 1 for ddp single inference (recommended)
     cfg_strength = 2.0
     speed = 1.0
     use_truth_duration = False
    no_ref_audio = False
 
-
     if exp_name == "F5TTS_Base":
         model_cls = DiT
         model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
@@ -83,23 +81,21 @@ def main():
         model_cls = UNetT
         model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
 
-
-    datapath = files('f5_tts').joinpath('data')
+    datapath = files("f5_tts").joinpath("data")
 
     if testset == "ls_pc_test_clean":
-        metalst = os.path.join(datapath,"librispeech_pc_test_clean_cross_sentence.lst")
+        metalst = os.path.join(datapath, "librispeech_pc_test_clean_cross_sentence.lst")
         librispeech_test_clean_path = "<SOME_PATH>/LibriSpeech/test-clean"  # test-clean path
         metainfo = get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path)
 
     elif testset == "seedtts_test_zh":
-        metalst = os.path.join(datapath,"seedtts_testset/zh/meta.lst")
+        metalst = os.path.join(datapath, "seedtts_testset/zh/meta.lst")
         metainfo = get_seedtts_testset_metainfo(metalst)
 
     elif testset == "seedtts_test_en":
-        metalst = os.path.join(datapath,"seedtts_testset/en/meta.lst")
+        metalst = os.path.join(datapath, "seedtts_testset/en/meta.lst")
         metainfo = get_seedtts_testset_metainfo(metalst)
 
-
     # path to save genereted wavs
     if seed is None:
         seed = random.randint(-10000, 10000)
@@ -112,7 +108,6 @@ def main():
         f"{'_no-ref-audio' if no_ref_audio else ''}"
     )
 
-
     # -------------------------------------------------#
 
     use_ema = True
@@ -200,5 +195,6 @@ def main():
     timediff = time.time() - start
     print(f"Done batch inference in {timediff / 60 :.2f} minutes.")
 
+
 if __name__ == "__main__":
     main()