- .github/workflows/publish-docker-image.yaml +0 -61
- Dockerfile +1 -2
- README.md +34 -52
- app.py +0 -3
- gradio.Dockerfile +0 -27
- pyproject.toml +11 -7
- requirements.txt +0 -22
- requirements_eval.txt +0 -5
- src/f5_tts/finetune_cli.py +4 -3
- src/f5_tts/inference_cli.py +3 -1
- src/f5_tts/model/utils.py +1 -1
- src/f5_tts/scripts/eval_infer_batch.py +6 -10
.github/workflows/publish-docker-image.yaml
DELETED
@@ -1,61 +0,0 @@
-name: Create and publish a Docker image
-
-# Configures this workflow to run every time a change is pushed to the branch called `release`.
-on:
-  push:
-    branches: ['main']
-
-# Defines two custom environment variables for the workflow. These are used for the Container registry domain, and a name for the Docker image that this workflow builds.
-env:
-  REGISTRY: ghcr.io
-  IMAGE_NAME: ${{ github.repository }}
-
-# There is a single job in this workflow. It's configured to run on the latest available version of Ubuntu.
-jobs:
-  build-and-push-image:
-    runs-on: ubuntu-latest
-    # Sets the permissions granted to the `GITHUB_TOKEN` for the actions in this job.
-    permissions:
-      contents: read
-      packages: write
-      #
-    steps:
-      - name: Checkout repository
-        uses: actions/checkout@v4
-      - name: Free Up GitHub Actions Ubuntu Runner Disk Space 🔧
-        uses: jlumbroso/free-disk-space@main
-        with:
-          # This might remove tools that are actually needed, if set to "true" but frees about 6 GB
-          tool-cache: false
-
-          # All of these default to true, but feel free to set to "false" if necessary for your workflow
-          android: true
-          dotnet: true
-          haskell: true
-          large-packages: false
-          swap-storage: false
-          docker-images: false
-      # Uses the `docker/login-action` action to log in to the Container registry registry using the account and password that will publish the packages. Once published, the packages are scoped to the account defined here.
-      - name: Log in to the Container registry
-        uses: docker/login-action@65b78e6e13532edd9afa3aa52ac7964289d1a9c1
-        with:
-          registry: ${{ env.REGISTRY }}
-          username: ${{ github.actor }}
-          password: ${{ secrets.GITHUB_TOKEN }}
-      # This step uses [docker/metadata-action](https://github.com/docker/metadata-action#about) to extract tags and labels that will be applied to the specified image. The `id` "meta" allows the output of this step to be referenced in a subsequent step. The `images` value provides the base name for the tags and labels.
-      - name: Extract metadata (tags, labels) for Docker
-        id: meta
-        uses: docker/metadata-action@9ec57ed1fcdbf14dcef7dfbe97b2010124a938b7
-        with:
-          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
-      # This step uses the `docker/build-push-action` action to build the image, based on your repository's `Dockerfile`. If the build succeeds, it pushes the image to GitHub Packages.
-      # It uses the `context` parameter to define the build's context as the set of files located in the specified path. For more information, see "[Usage](https://github.com/docker/build-push-action#usage)" in the README of the `docker/build-push-action` repository.
-      # It uses the `tags` and `labels` parameters to tag and label the image with the output from the "meta" step.
-      - name: Build and push Docker image
-        uses: docker/build-push-action@f2a1d5e99d037542a71f64918e516c093c6f3fc4
-        with:
-          context: .
-          file: ./gradio.Dockerfile
-          push: true
-          tags: ${{ steps.meta.outputs.tags }}
-          labels: ${{ steps.meta.outputs.labels }}
Dockerfile
CHANGED
@@ -17,8 +17,7 @@ WORKDIR /workspace
 
 RUN git clone https://github.com/SWivid/F5-TTS.git \
     && cd F5-TTS \
-    && pip install
-    && pip install --no-cache-dir -r requirements_eval.txt
+    && pip install -e .[eval]
 
 ENV SHELL=/bin/bash
 
README.md
CHANGED
@@ -18,43 +18,46 @@
|
|
18 |
|
19 |
## Installation
|
20 |
|
21 |
-
Clone the repository:
|
22 |
-
|
23 |
```bash
|
24 |
-
|
25 |
-
|
|
|
|
|
|
|
|
|
26 |
```
|
27 |
|
28 |
-
|
|
|
|
|
29 |
|
30 |
```bash
|
31 |
-
|
32 |
-
|
|
|
33 |
```
|
34 |
|
35 |
-
|
36 |
|
37 |
```bash
|
38 |
-
pip install -
|
39 |
```
|
40 |
|
41 |
-
|
42 |
```bash
|
43 |
docker build -t f5tts:v1 .
|
44 |
```
|
45 |
|
46 |
-
|
47 |
|
48 |
-
|
49 |
|
50 |
```bash
|
51 |
pip install pre-commit
|
52 |
pre-commit install
|
53 |
```
|
54 |
|
55 |
-
|
56 |
-
|
57 |
-
Manually run using:
|
58 |
|
59 |
```bash
|
60 |
pre-commit run --all-files
|
@@ -62,28 +65,6 @@ pre-commit run --all-files
|
|
62 |
|
63 |
Note: Some model components have linting exceptions for E722 to accommodate tensor notation
|
64 |
|
65 |
-
|
66 |
-
### As a pip package
|
67 |
-
|
68 |
-
```bash
|
69 |
-
pip install git+https://github.com/SWivid/F5-TTS.git
|
70 |
-
```
|
71 |
-
|
72 |
-
```python
|
73 |
-
import gradio as gr
|
74 |
-
from f5_tts.gradio_app import app
|
75 |
-
|
76 |
-
with gr.Blocks() as main_app:
|
77 |
-
gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
|
78 |
-
|
79 |
-
# ... other Gradio components
|
80 |
-
|
81 |
-
app.render()
|
82 |
-
|
83 |
-
main_app.launch()
|
84 |
-
|
85 |
-
```
|
86 |
-
|
87 |
## Prepare Dataset
|
88 |
|
89 |
Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `f5_tts/model/dataset.py`.
|
@@ -147,6 +128,21 @@ export WANDB_MODE=offline
|
|
147 |
|
148 |
## Inference
|
149 |
|
150 |
The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
|
151 |
|
152 |
Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
|
@@ -248,21 +244,7 @@ bash scripts/eval_infer_batch.sh
|
|
248 |
Install packages for evaluation:
|
249 |
|
250 |
```bash
|
251 |
-
pip install -
|
252 |
-
```
|
253 |
-
|
254 |
-
**Some Notes**
|
255 |
-
|
256 |
-
For faster-whisper with CUDA 11:
|
257 |
-
|
258 |
-
```bash
|
259 |
-
pip install --force-reinstall ctranslate2==3.24.0
|
260 |
-
```
|
261 |
-
|
262 |
-
(Recommended) To avoid possible ASR failures, such as abnormal repetitions in output:
|
263 |
-
|
264 |
-
```bash
|
265 |
-
pip install faster-whisper==0.10.1
|
266 |
```
|
267 |
|
268 |
Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
|
|
|
18 |
|
19 |
## Installation
|
20 |
|
|
|
|
|
21 |
```bash
|
22 |
+
# Create a python 3.10 conda env (you could also use virtualenv)
|
23 |
+
conda create -n f5-tts python=3.10
|
24 |
+
conda activate f5-tts
|
25 |
+
|
26 |
+
# Install pytorch with your CUDA version, e.g.
|
27 |
+
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
|
28 |
```
|
29 |
|
30 |
+
Then you can choose from a few options below:
|
31 |
+
|
32 |
+
### 1. Local editable
|
33 |
|
34 |
```bash
|
35 |
+
git clone https://github.com/SWivid/F5-TTS.git
|
36 |
+
cd F5-TTS
|
37 |
+
pip install -e .
|
38 |
```
|
39 |
|
40 |
+
### 2. As a pip package
|
41 |
|
42 |
```bash
|
43 |
+
pip install git+https://github.com/SWivid/F5-TTS.git
|
44 |
```
|
45 |
|
46 |
+
### 3. Build from dockerfile
|
47 |
```bash
|
48 |
docker build -t f5tts:v1 .
|
49 |
```
|
50 |
|
51 |
+
## Development
|
52 |
|
53 |
+
Use pre-commit to ensure code quality (will run linters and formatters automatically)
|
54 |
|
55 |
```bash
|
56 |
pip install pre-commit
|
57 |
pre-commit install
|
58 |
```
|
59 |
|
60 |
+
When making a pull request, before each commit, run:
|
|
|
|
|
61 |
|
62 |
```bash
|
63 |
pre-commit run --all-files
|
|
|
65 |
|
66 |
Note: Some model components have linting exceptions for E722 to accommodate tensor notation
|
67 |
|
68 |
## Prepare Dataset
|
69 |
|
70 |
Example data processing scripts for Emilia and Wenetspeech4TTS, and you may tailor your own one along with a Dataset class in `f5_tts/model/dataset.py`.
|
|
|
128 |
|
129 |
## Inference
|
130 |
|
131 |
+
```python
|
132 |
+
import gradio as gr
|
133 |
+
from f5_tts.gradio_app import app
|
134 |
+
|
135 |
+
with gr.Blocks() as main_app:
|
136 |
+
gr.Markdown("# This is an example of using F5-TTS within a bigger Gradio app")
|
137 |
+
|
138 |
+
# ... other Gradio components
|
139 |
+
|
140 |
+
app.render()
|
141 |
+
|
142 |
+
main_app.launch()
|
143 |
+
|
144 |
+
```
|
145 |
+
|
146 |
The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://huggingface.co/SWivid/F5-TTS) and [🤖 Model Scope](https://www.modelscope.cn/models/SWivid/F5-TTS_Emilia-ZH-EN), or automatically downloaded with `inference-cli` and `gradio_app`.
|
147 |
|
148 |
Currently support 30s for a single generation, which is the **TOTAL** length of prompt audio and the generated. Batch inference with chunks is supported by `inference-cli` and `gradio_app`.
|
|
|
244 |
Install packages for evaluation:
|
245 |
|
246 |
```bash
|
247 |
+
pip install -e .[eval]
|
248 |
```
|
249 |
|
250 |
Update the path with your batch-inferenced results, and carry out WER / SIM evaluations:
|
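The `eval` extra installed above pulls in the packages listed under `[project.optional-dependencies]` in `pyproject.toml`. A minimal sketch for checking that they can be found after installation follows. The helper `missing_modules` is hypothetical, and the sketch assumes each import name matches the distribution name listed in the extra.

```python
import importlib.util

# Assumption: import names match the distributions listed under the `eval` extra.
EVAL_MODULES = ["faster_whisper", "funasr", "jiwer", "zhconv", "zhon"]


def missing_modules(names):
    """Return the subset of `names` that cannot be located by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(EVAL_MODULES)
    if missing:
        print("Missing evaluation packages:", ", ".join(missing))
    else:
        print("All evaluation packages are importable.")
```

This only confirms that the modules can be located; it does not check the pinned `faster_whisper==0.10.1` version.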
app.py
DELETED
@@ -1,3 +0,0 @@
-from f5_tts.gradio_app import app
-
-app.queue().launch()
gradio.Dockerfile
DELETED
@@ -1,27 +0,0 @@
-FROM pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel
-
-USER root
-
-ARG DEBIAN_FRONTEND=noninteractive
-
-LABEL github_repo="https://github.com/rsxdalv/F5-TTS"
-
-RUN set -x \
-    && apt-get update \
-    && apt-get -y install wget curl man git less openssl libssl-dev unzip unar build-essential aria2 tmux vim \
-    && apt-get install -y openssh-server sox libsox-fmt-all libsox-fmt-mp3 libsndfile1-dev ffmpeg \
-    && rm -rf /var/lib/apt/lists/* \
-    && apt-get clean
-
-WORKDIR /workspace
-
-RUN git clone https://github.com/rsxdalv/F5-TTS.git \
-    && cd F5-TTS \
-    && pip install --no-cache-dir -r requirements.txt
-
-ENV SHELL=/bin/bash
-
-WORKDIR /workspace/F5-TTS/f5_tts
-
-EXPOSE 7860
-CMD python gradio_app.py
pyproject.toml
CHANGED
@@ -7,6 +7,7 @@ name = "f5-tts"
 dynamic = ["version"]
 description = "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
 readme = "README.md"
+license = {text = "MIT License"}
 classifiers = [
     "License :: OSI Approved :: MIT License",
     "Operating System :: OS Independent",
@@ -14,11 +15,10 @@ classifiers = [
 ]
 dependencies = [
     "accelerate>=0.33.0",
-    "
+    "bitsandbytes>0.37.0",
+    "cached_path",
     "click",
     "datasets",
-    "einops>=0.8.0",
-    "einx>=0.3.0",
     "ema_pytorch>=0.5.2",
     "gradio",
     "jieba",
@@ -40,13 +40,17 @@ dependencies = [
     "x_transformers>=1.31.14",
 ]
 
-[
-
+[project.optional-dependencies]
+eval = [
+    "faster_whisper==0.10.1",
+    "funasr",
+    "jiwer",
+    "zhconv",
+    "zhon",
+]
 
 [project.urls]
 Homepage = "https://github.com/SWivid/F5-TTS"
 
 [project.scripts]
-"finetune-cli" = "f5_tts.finetune_cli:main"
 "inference-cli" = "f5_tts.inference_cli:main"
-"eval_infer_batch" = "f5_tts.scripts.eval_infer_batch:main"
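Entries under `[project.scripts]` use the `module:attribute` form, so `"inference-cli" = "f5_tts.inference_cli:main"` means the installed command imports `f5_tts.inference_cli` and calls its `main`. A rough sketch of that lookup is shown below. `resolve_entry_point` is a hypothetical helper written for illustration (it is not the installer's actual code), and standard-library targets stand in for `f5_tts` so the sketch runs without the package installed.

```python
import importlib


def resolve_entry_point(spec):
    """Resolve a 'package.module:attribute' string to the object it names."""
    module_name, _, attr_path = spec.partition(":")
    obj = importlib.import_module(module_name)
    # The attribute part may itself be dotted, e.g. "Class.method".
    for part in attr_path.split("."):
        if part:
            obj = getattr(obj, part)
    return obj


if __name__ == "__main__":
    # With the package installed, "f5_tts.inference_cli:main" would be resolved
    # the same way; a standard-library target is used here instead.
    basename = resolve_entry_point("os.path:basename")
    print(basename("ckpts/model.pt"))
```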
requirements.txt
DELETED
@@ -1,22 +0,0 @@
-accelerate>=0.33.0
-bitsandbytes>0.37.0
-cached_path
-click
-datasets
-ema_pytorch>=0.5.2
-gradio
-jieba
-librosa
-matplotlib
-numpy<=1.26.4
-pydub
-pypinyin
-safetensors
-soundfile
-tomli
-torchdiffeq
-tqdm>=4.65.0
-transformers
-vocos
-wandb
-x_transformers>=1.31.14
requirements_eval.txt
DELETED
@@ -1,5 +0,0 @@
-faster_whisper
-funasr
-jiwer
-zhconv
-zhon
src/f5_tts/finetune_cli.py
CHANGED
@@ -1,10 +1,11 @@
 import argparse
+import os
+import shutil
+
+from cached_path import cached_path
 from f5_tts.model import CFM, UNetT, DiT, Trainer
 from f5_tts.model.utils import get_tokenizer
 from f5_tts.model.dataset import load_dataset
-from cached_path import cached_path
-import shutil
-import os
 
 # -------------------------- Dataset Settings --------------------------- #
 target_sample_rate = 24000
src/f5_tts/inference_cli.py
CHANGED
@@ -29,7 +29,7 @@ parser.add_argument(
     "-c",
     "--config",
     help="Configuration file. Default=inference-cli.toml",
-    default=os.path.join(files(
+    default=os.path.join(files("f5_tts").joinpath("data"), "inference-cli.toml"),
 )
 parser.add_argument(
     "-m",
@@ -168,8 +168,10 @@ def main_process(ref_audio, ref_text, text_gen, model_obj, remove_silence):
             remove_silence_for_generated_wav(f.name)
         print(f.name)
 
+
 def main():
     main_process(ref_audio, ref_text, gen_text, ema_model, remove_silence)
 
+
 if __name__ == "__main__":
     main()
src/f5_tts/model/utils.py
CHANGED
@@ -122,7 +122,7 @@ def get_tokenizer(dataset_name, tokenizer: str = "pinyin"):
                 - if use "byte", set to 256 (unicode byte range)
     """
     if tokenizer in ["pinyin", "char"]:
-        tokenizer_path = os.path.join(files(
+        tokenizer_path = os.path.join(files("f5_tts").joinpath("data"), f"{dataset_name}_{tokenizer}/vocab.txt")
         with open(tokenizer_path, "r", encoding="utf-8") as f:
             vocab_char_map = {}
             for i, char in enumerate(f):
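The changed lines in `inference_cli.py` and `utils.py` build paths from `files("f5_tts").joinpath("data")`, which locates the data directory inside the installed package rather than relative to the current working directory. A small sketch of the same pattern follows, assuming `files` is `importlib.resources.files` (Python 3.9+) and that the package is installed as a regular directory on disk. `packaged_path` is a hypothetical helper, and the standard-library `json` package stands in for `f5_tts`.

```python
import os
from importlib.resources import files


def packaged_path(package, *parts):
    """Build a filesystem path to a resource that ships inside `package`."""
    # files() returns a Traversable rooted at the package directory; for a
    # regular on-disk install it is path-like, so os.path.join accepts it.
    return os.path.join(files(package), *parts)


if __name__ == "__main__":
    # `json` is a stand-in so the sketch runs without f5_tts installed.
    print(packaged_path("json", "decoder.py"))
```

Because the path is resolved from the package location, the same call works for both an editable install and a regular `pip install`, provided the data files are included in the distribution.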
src/f5_tts/scripts/eval_infer_batch.py
CHANGED
@@ -36,6 +36,7 @@ target_rms = 0.1
 
 tokenizer = "pinyin"
 
+
 def main():
     # ---------------------- infer setting ---------------------- #
 
@@ -54,7 +55,6 @@ def main():
 
     args = parser.parse_args()
 
-
     seed = args.seed
     dataset_name = args.dataset
     exp_name = args.expname
@@ -67,14 +67,12 @@ def main():
 
     testset = args.testset
 
-
     infer_batch_size = 1  # max frames. 1 for ddp single inference (recommended)
     cfg_strength = 2.0
     speed = 1.0
     use_truth_duration = False
     no_ref_audio = False
 
-
     if exp_name == "F5TTS_Base":
         model_cls = DiT
         model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
@@ -83,23 +81,21 @@ def main():
         model_cls = UNetT
         model_cfg = dict(dim=1024, depth=24, heads=16, ff_mult=4)
 
-
-    datapath = files('f5_tts').joinpath('data')
+    datapath = files("f5_tts").joinpath("data")
 
     if testset == "ls_pc_test_clean":
-        metalst = os.path.join(datapath,"librispeech_pc_test_clean_cross_sentence.lst")
+        metalst = os.path.join(datapath, "librispeech_pc_test_clean_cross_sentence.lst")
         librispeech_test_clean_path = "<SOME_PATH>/LibriSpeech/test-clean"  # test-clean path
         metainfo = get_librispeech_test_clean_metainfo(metalst, librispeech_test_clean_path)
 
     elif testset == "seedtts_test_zh":
-        metalst = os.path.join(datapath,"seedtts_testset/zh/meta.lst")
+        metalst = os.path.join(datapath, "seedtts_testset/zh/meta.lst")
         metainfo = get_seedtts_testset_metainfo(metalst)
 
     elif testset == "seedtts_test_en":
-        metalst = os.path.join(datapath,"seedtts_testset/en/meta.lst")
+        metalst = os.path.join(datapath, "seedtts_testset/en/meta.lst")
         metainfo = get_seedtts_testset_metainfo(metalst)
 
-
     # path to save genereted wavs
     if seed is None:
         seed = random.randint(-10000, 10000)
@@ -112,7 +108,6 @@ def main():
         f"{'_no-ref-audio' if no_ref_audio else ''}"
     )
 
-
     # -------------------------------------------------#
 
     use_ema = True
@@ -200,5 +195,6 @@ def main():
         timediff = time.time() - start
         print(f"Done batch inference in {timediff / 60 :.2f} minutes.")
 
+
 if __name__ == "__main__":
     main()