SWivid committed
Commit d8638a6 · 1 Parent(s): c4eee0f
.github/workflows/publish-docker-image.yaml DELETED
@@ -1,61 +0,0 @@
-name: Create and publish a Docker image
-
-# Configures this workflow to run every time a change is pushed to the branch called `release`.
-on:
-  push:
-    branches: ['main']
-
-# Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
-env:
-  REGISTRY: ghcr.io
-  IMAGE_NAME: ${{ github.repository }}
-
-# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
-jobs:
-  build-and-push-image:
-    runs-on: ubuntu-latest
-    # Sets the permissions granted to the `GITHUB_TOKEN` for the actions in this job.
-    permissions:
-      contents: read
-      packages: write
-    #
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-      - name: Free Up GitHub Actions Ubuntu Runner Disk Space 🔧
-        uses: jlumbroso/free-disk-space@main
-        with:
-          # This might remove tools that are actually needed, if set to "true" but frees about 6 GB
-          tool-cache: false
-
-          # All of these default to true, but feel free to set to "false" if necessary for your workflow
-          android: true
-          dotnet: true
-          haskell: true
-          large-packages: false
-          swap-storage: false
-          docker-images: false
-      # Uses the `docker/login-action` action to log in to the Container registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
-      - name: Log in to the Container registry
-        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
-        with:
-          registry: ${{ env.REGISTRY }}
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      # This step uses [docker/metadata-action](https://github.com/docker/metadata-action#about) to extract tags and labels that will be applied to the specified image. The `id` "meta" allows the output of this step to be referenced in a subsequent step. The `images` value provides the base name for the tags and labels.
-      - name: Extract metadata (tags, labels) for Docker
-        id: meta
-        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
-        with:
-          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
-      # This step uses the `docker/build-push-action` action to build the image, based on your repository's `Dockerfile`. If the build succeeds, it pushes the image to GitHub Packages.
-      # It uses the `context` parameter to define the build's context as the set of files located in the specified path. For more information, see "[Usage](https://github.com/docker/build-push-action#usage)" in the README of the `docker/build-push-action` repository.
-      # It uses the `tags` and `labels` parameters to tag and label the image with the output from the "meta" step.
-      - name: Build and push Docker image
-        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
-        with:
-          context: .
-          file: ./gradio.Dockerfile
-          push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
Dockerfile CHANGED
@@ -17,8 +17,7 @@ WORKDIR /workspace
 
 RUN git clone https://github.com/SWivid/F5-TTS.git \
     && cd F5-TTS \
-    && pip install --no-cache-dir -r requirements.txt \
-    && pip install --no-cache-dir -r requirements_eval.txt
+    && pip install -e .[eval]
 
 ENV SHELL=/bin/bash
 
README.md CHANGED
@@ -18,43 +18,46 @@
 
 ## Installation
 
-Clone the repository:
-
 ```bash
-git clone https://github.com/SWivid/F5-TTS.git
-cd F5-TTS
+# Create a python 3.10 conda env (you could also use virtualenv)
+conda create -n f5-tts python=3.10
+conda activate f5-tts
+
+# Install pytorch with your CUDA version, e.g.
+pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
 ```
 
-Install torch with your CUDA version, e.g. :
-
-```bash
-pip install torch==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
-pip install torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
-```
-
-Install other packages:
+Then you can choose from a few options below:
+
+### 1. Local editable
 
 ```bash
-pip install -r requirements.txt
+git clone https://github.com/SWivid/F5-TTS.git
+cd F5-TTS
+pip install -e .
 ```
 
-**[Optional]**: We provide [Dockerfile](https://github.com/SWivid/F5-TTS/blob/main/Dockerfile) and you can use the following command to build it.
+### 2. As a pip package
+
+```bash
+pip install git+https://github.com/SWivid/F5-TTS.git
+```
+
+### 3. Build from dockerfile
 ```bash
 docker build -t f5tts:v1 .
 ```
 
-### Development
+## Development
 
-When making a pull request, please use pre-commit to ensure code quality:
+Use pre-commit to ensure code quality (will run linters and formatters automatically)
 
 ```bash
 pip install pre-commit
 pre-commit install
 ```
 
-This will run linters and formatters automatically before each commit.
-
-Manually run using:
+When making a pull request, before each commit, run:
 
 ```bash
 pre-commit run --all-files
@@ -62,28 +65,6 @@ pre-commit run --all-files
 
 Note: Some model components have linting exceptions for E722 to accommodate tensor notation
 
-
-### As a pip package
-
-```bash
-pip install git+https://github.com/SWivid/F5-TTS.git
-```
-
-```python
-import gradio as gr
-from f5_tts.gradio_app import app
-
-with gr.Blocks() as main_app:
-    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
-
-    # ... other Gradio components
-
-    app.render()
-
-main_app.launch()
-
-```
-
 ## Prepare Dataset
 
 Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `f5_tts/model/dataset.py`.
@@ -147,6 +128,21 @@ export WANDB_MODE=offline
 
 ## Inference
 
+```python
+import gradio as gr
+from f5_tts.gradio_app import app
+
+with gr.Blocks() as main_app:
+    gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
+
+    # ... other Gradio components
+
+    app.render()
+
+main_app.launch()
+
+```
+
 The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
 
 Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
@@ -248,21 +244,7 @@ bash scripts/eval_infer_batch.sh
 Install packages for evaluation:
 
 ```bash
-pip install -r requirements_eval.txt
-```
-
-**Some Notes**
-
-For faster-whisper with CUDA 11:
-
-```bash
-pip install --force-reinstall ctranslate2==3.24.0
-```
-
-(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:
-
-```bash
-pip install faster-whisper==0.10.1
+pip install -e .[eval]
 ```
 
 Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
app.py DELETED
@@ -1,3 +0,0 @@
-from f5_tts.gradio_app import app
-
-app.queue().launch()
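The removed `app.py` was only a three-line launcher, and its behaviour remains reproducible from the installed package. A minimal standalone sketch, assuming `f5_tts.gradio_app` still exposes the Gradio app as `app` (as the README's new embedding example does):

```python
# Standalone launcher equivalent to the deleted app.py.
# Assumes f5_tts.gradio_app still exposes the Gradio Blocks as `app`.
from f5_tts.gradio_app import app

# queue() enables request queuing so long TTS generations are not dropped
app.queue().launch()
```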
gradio.Dockerfile DELETED
@@ -1,27 +0,0 @@
-FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel
-
-USER root
-
-ARG DEBIAN_FRONTEND=noninteractive
-
-LABEL github_repo="https://github.com/rsxdalv/F5-TTS"
-
-RUN set -x \
-    && apt-get update \
-    && apt-get -y install wget curl man git less openssl libssl-dev unzip unar build-essential aria2 tmux vim \
-    && apt-get install -y openssh-server sox libsox-fmt-all libsox-fmt-mp3 libsndfile1-dev ffmpeg \
-    && rm -rf /var/lib/apt/lists/* \
-    && apt-get clean
-
-WORKDIR /workspace
-
-RUN git clone https://github.com/rsxdalv/F5-TTS.git \
-    && cd F5-TTS \
-    && pip install --no-cache-dir -r requirements.txt
-
-ENV SHELL=/bin/bash
-
-WORKDIR /workspace/F5-TTS/f5_tts
-
-EXPOSE 7860
-CMD python gradio_app.py
pyproject.toml CHANGED
@@ -7,6 +7,7 @@ name = "f5-tts"
 dynamic = ["version"]
 description = "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
 readme = "README.md"
+license = {text = "MIT License"}
 classifiers = [
     "License :: OSI Approved :: MIT License",
     "Operating System :: OS Independent",
@@ -14,11 +15,10 @@ classifiers = [
 ]
 dependencies = [
     "accelerate>=0.33.0",
-    "cached_path @ git+https://github.com/rsxdalv/cached_path@main",
+    "bitsandbytes>0.37.0",
+    "cached_path",
     "click",
     "datasets",
-    "einops>=0.8.0",
-    "einx>=0.3.0",
     "ema_pytorch>=0.5.2",
     "gradio",
     "jieba",
@@ -40,13 +40,17 @@ dependencies = [
     "x_transformers>=1.31.14",
 ]
 
-[[project.authors]]
-name = "Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen"
+[project.optional-dependencies]
+eval = [
+    "faster_whisper==0.10.1",
+    "funasr",
+    "jiwer",
+    "zhconv",
+    "zhon",
+]
 
 [project.urls]
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"finetune-cli" = "f5_tts.finetune_cli:main"
 "inference-cli" = "f5_tts.inference_cli:main"
-"eval_infer_batch" = "f5_tts.scripts.eval_infer_batch:main"
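With both requirements files folded into `pyproject.toml`, evaluation dependencies now install through the new extra (`pip install -e .[eval]`, as the Dockerfile and README now do), and `finetune-cli` / `eval_infer_batch` are no longer declared as console scripts. A quick sanity check after installing is to list the entry points the package registers; a minimal sketch, assuming Python 3.10+ (the `group=` keyword of `importlib.metadata.entry_points` requires it):

```python
# List console scripts registered by the f5_tts package after `pip install -e .`
# Expected output: inference-cli -> f5_tts.inference_cli:main
from importlib.metadata import entry_points

for ep in entry_points(group="console_scripts"):
    if ep.module.split(".")[0] == "f5_tts":
        print(ep.name, "->", ep.value)
```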
requirements.txt DELETED
@@ -1,22 +0,0 @@
-accelerate>=0.33.0
-bitsandbytes>0.37.0
-cached_path
-click
-datasets
-ema_pytorch>=0.5.2
-gradio
-jieba
-librosa
-matplotlib
-numpy<=1.26.4
-pydub
-pypinyin
-safetensors
-soundfile
-tomli
-torchdiffeq
-tqdm>=4.65.0
-transformers
-vocos
-wandb
-x_transformers>=1.31.14
requirements_eval.txt DELETED
@@ -1,5 +0,0 @@
-faster_whisper
-funasr
-jiwer
-zhconv
-zhon
src/f5_tts/finetune_cli.py CHANGED
@@ -1,10 +1,11 @@
 import argparse
+import os
+import shutil
+
+from cached_path import cached_path
 from f5_tts.model import CFM, UNetT, DiT, Trainer
 from f5_tts.model.utils import get_tokenizer
 from f5_tts.model.dataset import load_dataset
-from cached_path import cached_path
-import shutil
-import os
 
 # -------------------------- Dataset Settings --------------------------- #
 target_sample_rate = 24000
src/f5_tts/inference_cli.py CHANGED
@@ -29,7 +29,7 @@ parser.add_argument(
     "-c",
     "--config",
     help="Configuration file. Default=inference-cli.toml",
-    default=os.path.join(files('f5_tts').joinpath('data'), 'inference-cli.toml')
+    default=os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml"),
 )
 parser.add_argument(
     "-m",
@@ -168,8 +168,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
         remove_silence_for_generated_wav(f.name)
     print(f.name)
 
+
 def main():
     main_process(ref_audio, ref_text, gen_text, ema_model, remove_silence)
 
+
 if __name__ == "__main__":
     main()
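Besides normalizing quote style, the new `--config` default illustrates the pattern this commit uses throughout: resolving packaged data via `importlib.resources.files` rather than repo-relative paths, so the CLI finds its default config from any working directory once the package is installed. A minimal sketch of the same lookup, assuming the package ships `data/inference-cli.toml`:

```python
# Resolve a data file bundled inside the installed f5_tts package,
# mirroring the CLI's new default for --config.
import os
from importlib.resources import files

config_path = os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml")
print(config_path)  # absolute path, independent of the current working directory
```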
src/f5_tts/model/utils.py CHANGED
@@ -122,7 +122,7 @@ def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
         - if use "byte", set to 256 (unicode byte range)
     """
     if tokenizer in ["pinyin", "char"]:
-        tokenizer_path = os.path.join(files('f5_tts').joinpath('data'), f"{dataset_name}_{tokenizer}/vocab.txt")
+        tokenizer_path = os.path.join(files("f5_tts").joinpath("data"), f"{dataset_name}_{tokenizer}/vocab.txt")
         with open(tokenizer_path, "r", encoding="utf-8") as f:
             vocab_char_map = {}
             for i, char in enumerate(f):
src/f5_tts/scripts/eval_infer_batch.py CHANGED
@@ -36,6 +36,7 @@ target_rms = 0.1
 
 tokenizer = "pinyin"
 
+
 def main():
     # ---------------------- infer setting ---------------------- #
 
@@ -54,7 +55,6 @@ def main():
 
     args = parser.parse_args()
 
-
     seed = args.seed
     dataset_name = args.dataset
     exp_name = args.expname
@@ -67,14 +67,12 @@ def main():
 
     testset = args.testset
 
-
     infer_batch_size = 1  # max frames. 1 for ddp single inference (recommended)
     cfg_strength = 2.0
     speed = 1.0
     use_truth_duration = False
    no_ref_audio = False
 
-
     if exp_name == "F5TTS_Base":
         model_cls = DiT
         model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
@@ -83,23 +81,21 @@ def main():
         model_cls = UNetT
         model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
 
-
-    datapath = files('f5_tts').joinpath('data')
+    datapath = files("f5_tts").joinpath("data")
 
     if testset == "ls_pc_test_clean":
-        metalst = os.path.join(datapath,"librispeech_pc_test_clean_cross_sentence.lst")
+        metalst = os.path.join(datapath, "librispeech_pc_test_clean_cross_sentence.lst")
         librispeech_test_clean_path = "<SOME_PATH>/LibriSpeech/test-clean"  # test-clean path
         metainfo = get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path)
 
     elif testset == "seedtts_test_zh":
-        metalst = os.path.join(datapath,"seedtts_testset/zh/meta.lst")
+        metalst = os.path.join(datapath, "seedtts_testset/zh/meta.lst")
         metainfo = get_seedtts_testset_metainfo(metalst)
 
     elif testset == "seedtts_test_en":
-        metalst = os.path.join(datapath,"seedtts_testset/en/meta.lst")
+        metalst = os.path.join(datapath, "seedtts_testset/en/meta.lst")
         metainfo = get_seedtts_testset_metainfo(metalst)
 
-
     # path to save genereted wavs
     if seed is None:
         seed = random.randint(-10000, 10000)
@@ -112,7 +108,6 @@ def main():
         f"{'_no-ref-audio' if no_ref_audio else ''}"
     )
 
-
     # -------------------------------------------------#
 
     use_ema = True
@@ -200,5 +195,6 @@ def main():
     timediff = time.time() - start
     print(f"Done batch inference in {timediff / 60 :.2f} minutes.")
 
+
 if __name__ == "__main__":
     main()