IMS Toucan is a toolkit for teaching, training and using state-of-the-art Speech Synthesis models, developed at the
**Institute for Natural Language Processing (IMS), University of Stuttgart, Germany**. Everything is pure Python and
PyTorch based to keep it as simple and beginner-friendly as possible while remaining powerful.

The PyTorch modules of [Tacotron 2](https://arxiv.org/abs/1712.05884)
and [FastSpeech 2](https://arxiv.org/abs/2006.04558) are taken from
[ESPnet](https://github.com/espnet/espnet), and the PyTorch modules of [HiFiGAN](https://arxiv.org/abs/2010.05646) are taken
from the [ParallelWaveGAN repository](https://github.com/kan-bayashi/ParallelWaveGAN),
which is also authored by the brilliant [Tomoki Hayashi](https://github.com/kan-bayashi).

For a version of the toolkit that includes TransformerTTS instead of Tacotron 2 and MelGAN instead of HiFiGAN, check out
the TransformerTTS and MelGAN branches. They are kept separate to keep the code clean, simple and minimal.

---
## Contents

- [New Features](#new-features)
- [Demonstration](#demonstration)
- [Installation](#installation)
  + [Basic Requirements](#basic-requirements)
  + [Speaker Embedding](#speaker-embedding)
  + [espeak-ng](#espeak-ng)
- [Creating a new Pipeline](#creating-a-new-pipeline)
  * [Build a HiFi-GAN Pipeline](#build-a-hifi-gan-pipeline)
  * [Build a FastSpeech 2 Pipeline](#build-a-fastspeech-2-pipeline)
- [Training a Model](#training-a-model)
- [Creating a new InferenceInterface](#creating-a-new-inferenceinterface)
- [Using a trained Model for Inference](#using-a-trained-model-for-inference)
- [FAQ](#faq)
- [Citation](#citation)

---
## New Features

- [As shown in this paper](http://festvox.org/blizzard/bc2021/BC21_DelightfulTTS.pdf), vocoders can be used to perform
  super-resolution and spectrogram inversion simultaneously. We added this to our HiFi-GAN vocoder. It now takes 16kHz
  spectrograms as input, but produces 48kHz waveforms.
- We officially introduced IMS Toucan in
  [our contribution to the Blizzard Challenge 2021](http://festvox.org/blizzard/bc2021/BC21_IMS.pdf). Check out the
  bottom of the readme for a bibtex entry.
- We now use articulatory representations of phonemes as the input for all models. This allows us to easily use
  multilingual data.
- We provide a checkpoint trained with [model agnostic meta learning](https://arxiv.org/abs/1703.03400) from which you
  should be able to fine-tune a model with very little data in almost any language.
- We now use a small self-contained Aligner that is trained with CTC, inspired by
  [this implementation](https://github.com/as-ideas/DeepForcedAligner). This allows us to get rid of the dependence on
  autoregressive models. Tacotron 2 is thus no longer in this branch, but it is still present in other branches,
  similar to TransformerTTS.

---
## Demonstration

[Here are two sentences](https://drive.google.com/file/d/1ltAyR2EwAbmDo2hgkx1mvUny4FuxYmru/view?usp=sharing)
produced by Tacotron 2 combined with HiFi-GAN, trained on
[Nancy Krebs](https://www.cstr.ed.ac.uk/projects/blizzard/2011/lessac_blizzard2011/) using this toolkit.

[Here is some speech](https://drive.google.com/file/d/1mZ1LvTlY6pJ5ZQ4UXZ9jbzB651mufBrB/view?usp=sharing)
produced by FastSpeech 2 and MelGAN trained on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/)
using this toolkit.

And [here is a sentence](https://drive.google.com/file/d/1FT49Jf0yyibwMDbsEJEO9mjwHkHRIGXc/view?usp=sharing)
produced by TransformerTTS and MelGAN trained on [Thorsten](https://github.com/thorstenMueller/deep-learning-german-tts)
using this toolkit.

[Here is some speech](https://drive.google.com/file/d/14nPo2o1VKtWLPGF7e_0TxL8XGI3n7tAs/view?usp=sharing)
produced by a multi-speaker FastSpeech 2 with MelGAN trained on
[LibriTTS](https://research.google/tools/datasets/libri-tts/) using this toolkit. Fans of the videogame Portal may
recognize who was used as the reference speaker for this utterance.

[Interactive Demo of our entry to the Blizzard Challenge 2021.](https://colab.research.google.com/drive/1bRaySf8U55MRPaxqBr8huWrzCOzlxVqw)
This demo is based on an older version of the toolkit, though: it uses FastSpeech 2 with MelGAN as the vocoder and is
trained on 5 hours of Spanish.

---
## Installation

#### Basic Requirements

To install this toolkit, clone it onto the machine you want to use it on. The machine should have at least one GPU if
you intend to train models on it; for inference, you can get by without a GPU.
Navigate to the directory you have cloned. We are going to create and activate a
[conda virtual environment](https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html)
to install the basic requirements into. After creating the environment, the command you need to use to activate the
virtual environment is displayed. The commands below show everything you need to do.
| ``` | |
| conda create --prefix ./toucan_conda_venv --no-default-packages python=3.8 | |
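# activate the environment before the pip installs; conda displays the exact command after creation (e.g. conda activate ./toucan_conda_venv)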
pip install --no-cache-dir -r requirements.txt
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
```
#### Speaker Embedding

As [NVIDIA has shown](https://arxiv.org/pdf/2110.05798.pdf), you get better results by fine-tuning a pretrained model on
a new speaker, rather than training a multispeaker model. We have thus dropped support for zero-shot multispeaker models
using speaker embeddings. However, we still
use [Speechbrain's ECAPA-TDNN](https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb) for a cycle consistency loss to
make adapting to new speakers a bit faster.

In the current version of the toolkit, no further action should be required. Note, however, that the first time you use
the multispeaker setup, an internet connection is required to download the pretrained models.
#### espeak-ng

Finally, you need to have espeak-ng installed on your system, because it is used as the backend for the phonemizer. If
you replace the phonemizer, you don't need it. On most Linux environments it will already be installed, and if it is
not and you have sufficient rights, you can install it by simply running

```
apt-get install espeak-ng
```
---

## Creating a new Pipeline

To create a new pipeline to train a HiFiGAN vocoder, you only need a set of audio files. To create a new pipeline for
FastSpeech 2, you need audio files, corresponding text labels, and an already trained Aligner model to estimate the
duration information that FastSpeech 2 needs as input. Let's go through them in order of increasing complexity.
### Build a HiFi-GAN Pipeline

In the directory called *Utility* there is a file called *file_lists.py*. In this file you should write a function that
returns a list of all the absolute paths to each of the audio files in your dataset as strings.
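As a rough sketch of what such a function could look like (the function name and the dataset path below are
placeholders, not part of the toolkit):

```python
# In Utility/file_lists.py -- a minimal example; adjust the path and file extension to your data.
import os
from glob import glob


def get_file_list_mydataset():
    # return the absolute paths to all audio files of the dataset as a list of strings
    return sorted(glob(os.path.join("/data/mydataset/wavs", "*.wav")))
```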
Then go to the directory *TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has
HiFiGAN in its name. We will use this as reference and only make the necessary changes to use the new dataset. Import
the function you have just written as *get_file_list*. Now look out for a variable called *model_save_dir*. This is the
default directory that checkpoints will be saved into, unless you specify another one when calling the training script.
Change it to whatever you like.
Now you need to add your newly created pipeline to the pipeline dictionary in the file *run_training_pipeline.py* in the
top level of the toolkit. In this file, import the *run* function from the pipeline you just created and give it a
speaking name. Now in the *pipeline_dict*, add your imported function as value and use as key a shorthand that makes
sense. And just like that you're done.
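For illustration, the registration could look roughly like this (the module name `HiFiGAN_MyDataset` and the shorthand
are made up for this example):

```python
# In run_training_pipeline.py
from TrainingInterfaces.TrainingPipelines.HiFiGAN_MyDataset import run as hifigan_mydataset

pipeline_dict = {
    # ... existing pipelines stay as they are ...
    "hifi_mydataset": hifigan_mydataset,  # shorthand you will pass on the command line
}
```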
### Build a FastSpeech 2 Pipeline

In the directory called *Utility* there is a file called *path_to_transcript_dicts.py*. In this file you should write a
function that returns a dictionary with the absolute paths to each of the audio files in your dataset as strings as the
keys and the textual transcriptions of the corresponding audios as the values.
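A sketch of what such a function might look like, assuming a hypothetical `metadata.csv` with lines of the form
`filename|transcript` (adapt the parsing to however your dataset stores its transcriptions):

```python
# In Utility/path_to_transcript_dicts.py -- illustrative only; paths and file layout are placeholders.
import os


def build_path_to_transcript_dict_mydataset():
    root = "/data/mydataset"
    path_to_transcript = {}
    with open(os.path.join(root, "metadata.csv"), encoding="utf8") as metadata:
        for line in metadata:
            filename, transcript = line.strip().split("|", 1)
            path_to_transcript[os.path.join(root, "wavs", filename + ".wav")] = transcript
    return path_to_transcript
```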
Then go to the directory *TrainingInterfaces/TrainingPipelines*. In there, make a copy of any existing pipeline that has
FastSpeech 2 in its name. We will use this copy as reference and only make the necessary changes to use the new dataset.
Import the function you have just written as *build_path_to_transcript_dict*. Since the data will be processed a
considerable amount, a cache will be built and saved as a file for quick and easy restarts. So find the variable
*cache_dir* and adapt it to your needs. The same goes for the variable *save_dir*, which is where the checkpoints will
be saved to. This is only a default value; you can override it with a command line argument when calling the pipeline
later, in case you want to fine-tune from a checkpoint and thus save into a different directory.
In your new pipeline file, look out for the line in which the *acoustic_model* is loaded. Change the path to the
checkpoint of an Aligner model. It can either be the one that is supplied with the toolkit in the download script, or
one that you trained yourself. In the example pipelines, the one that we provide is fine-tuned to the dataset it is
applied to before it is used to extract durations.
Since we are using text here, we have to make sure that the text processing is adequate for the language. So check in
*Preprocessing/TextFrontend* whether the TextFrontend already has a language ID (e.g. 'en' and 'de') for the language of
your dataset. If not, you'll have to implement handling for it, but that should be fairly simple if you do it
analogously to what is already there. Now back in the pipeline, change the *lang* argument in the creation of the
dataset and in the call to the train loop function to the language ID that matches your data.
Now navigate to the implementation of the *train_loop* that is called in the pipeline. In this file, find the function
called *plot_progress_spec*. This function will produce spectrogram plots during training, which is the most important
way to monitor the progress of the training. In there, you may need to add an example sentence for the language of the
data you are using. It should all be pretty clear from looking at it.
Once this is done, all that remains is to make the new pipeline available to the *run_training_pipeline.py* file in the
top level of the toolkit. In said file, import the *run* function from the pipeline you just created and give it a
speaking name. Now in the *pipeline_dict*, add your imported function as value and use as key a shorthand that makes
sense. And that's it.

---
## Training a Model

Once you have a pipeline built, training is super easy. Just activate your virtual environment and run the command
below. You might want to use something like nohup to keep it running after you log out from the server (then you should
also add -u as an option to python) and add an & to start it in the background. Also, you might want to redirect stdout
and stderr into a file using >, but all of that is just standard shell usage and has nothing to do with the toolkit.
| ``` | |
| python run_training_pipeline.py <shorthand of the pipeline> | |
| ``` | |
You can supply any of the following arguments, but don't have to (although for training you should definitely specify at
least a GPU ID).
| ``` | |
| --gpu_id <ID of the GPU you wish to use, as displayed with nvidia-smi, default is cpu> | |
| --resume_checkpoint <path to a checkpoint to load> | |
| --resume (if this is present, the furthest checkpoint available will be loaded automatically) | |
| --finetune (if this is present, the provided checkpoint will be fine-tuned on the data from this pipeline) | |
| --model_save_dir <path to a directory where the checkpoints should be saved> | |
| ``` | |
After every epoch, some logs will be written to the console. If the loss becomes NaN, you'll need to use a smaller
learning rate or more warmup steps in the arguments of the call to the train loop in the pipeline you are running.

If you get CUDA out of memory errors, you need to decrease the batch size in the arguments of the call to the train
loop in the pipeline you are running. Try decreasing the batch size in small steps until you no longer get out of
memory errors. Decreasing the batch size may also require you to use a smaller learning rate. The use of GroupNorm
should make it so that the training remains mostly stable.
Speaking of plots: in the directory you specified for saving the model, checkpoint files and self-explanatory
visualization data will appear. Since the checkpoints are quite big, only the five most recent ones will be kept.
Training will stop after 500,000 steps for FastSpeech 2 and after 2,500,000 steps for HiFiGAN. Depending on the machine
and configuration you are using, this will take multiple days, so verify that everything works on small tests before
running the big thing. If you want to stop earlier, just kill the process; since everything is daemonic, all the child
processes should die with it. In case there are some ghost processes left behind, you can use the following command to
find them and kill them manually.
| ``` | |
| fuser -v /dev/nvidia* | |
| ``` | |
After training is complete, it is recommended to run *run_weight_averaging.py*. If you made no changes to the
architectures and stuck to the default directory layout, it will automatically load any models you produced with one
pipeline, average their parameters to get a slightly more robust model and save the result as *best.pt* in the same
directory where all the corresponding checkpoints lie. This also compresses the file size significantly, so you should
do this and then use the *best.pt* model for inference.
---

## Creating a new InferenceInterface

To build a new *InferenceInterface*, which you can then use for super simple inference, we're going to use an existing
one as a template again. Make a copy of the *InferenceInterface*. Change the name of the class in the copy and change
the paths to the models to use the trained models of your choice. Instantiate the model with the same hyperparameters
that you used when you created it in the corresponding training pipeline. The last thing to check is the language that
you supply to the text frontend. Make sure it matches what you used during training.
With your newly created *InferenceInterface*, you can use your trained models pretty much anywhere, e.g. in other
projects. All you need is the *Utility* directory, the *Layers* directory, the *Preprocessing* directory and the
*InferenceInterfaces* directory (and of course your model checkpoint). That's all the code you need; it works
standalone.
---

## Using a trained Model for Inference

An *InferenceInterface* contains two useful methods: *read_to_file* and *read_aloud*.
- *read_to_file* takes as input a list of strings and a filename. It will synthesize the sentences in the list,
  concatenate them with a short pause in between, and write the result to the filepath you supply as the other argument.
- *read_aloud* takes just a string, which it will then convert to speech and immediately play using the system's
  speakers. If you set the optional argument *view* to *True* when calling it, it will also show a plot of the phonemes
  it produced, the spectrogram it came up with, and the wave it created from that spectrogram. So all the
  representations can be seen: text to phoneme, phoneme to spectrogram and finally spectrogram to wave.
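As a rough usage sketch (the class name `MyModel_InferenceInterface` and the constructor arguments are placeholders;
take the real ones from the InferenceInterface copy you created):

```python
import torch

from InferenceInterfaces.MyModel_InferenceInterface import MyModel_InferenceInterface

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = MyModel_InferenceInterface(device=device)  # hypothetical constructor signature

# write two sentences into one file, with a short pause in between
tts.read_to_file(["Hello there.", "This speech was synthesized with IMS Toucan."], "output.wav")

# play a sentence directly and show the phoneme / spectrogram / wave plots
tts.read_aloud("This one is played through the speakers.", view=True)
```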
Those methods are used in the demo code in the toolkit. In *run_interactive_demo.py* and *run_text_to_file_reader.py*,
you can import *InferenceInterfaces* that you created and add them to the dictionary in each of the files with a
shorthand that makes sense. In the interactive demo, you can just call the python script, then type in the shorthand
when prompted and immediately listen to your synthesis saying whatever you put in next (be wary of out of memory errors
for too long inputs). In the text reader demo script, you have to call the function that wraps around the
*InferenceInterface* and supply the shorthand of your choice. It should be pretty clear from looking at it.

---
## FAQ

Here are a few points that were brought up by users:

- My error message shows GPU0, even though I specified a different GPU - The way GPU selection works is that the
  specified GPU is set as the only visible device, in order to avoid backend stuff accidentally running on different
  GPUs. So internally the program will name the device GPU0, because it is the only GPU it can see. It is actually
  running on the GPU you specified.
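For the curious, the mechanism is typically along these lines (a simplified illustration of the general pattern, not a
verbatim excerpt from the toolkit):

```python
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "3"  # what something like --gpu_id 3 effectively does

import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count())    # 1, because only one device is visible
    print(torch.cuda.current_device())  # 0, which is why logs talk about GPU0 / cuda:0
```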
---

This toolkit has been written by Florian Lux (except for the PyTorch modules taken
from [ESPnet](https://github.com/espnet/espnet) and
[ParallelWaveGAN](https://github.com/kan-bayashi/ParallelWaveGAN), as mentioned above), so if you come across problems
or questions, feel free to [write a mail](mailto:[email protected]). Also let me know if you do something
cool with it. Thank you for reading.
## Citation

```
@inproceedings{lux2021toucan,
  title={{The IMS Toucan system for the Blizzard Challenge 2021}},
  author={Florian Lux and Julia Koch and Antje Schweitzer and Ngoc Thang Vu},
  year={2021},
  booktitle={Proc. Blizzard Challenge Workshop},
  volume={2021},
  publisher={{Speech Synthesis SIG}}
}
```