# tts-arabic-pytorch
TTS models (Tacotron2, FastPitch), trained on [Nawar Halabi](https://github.com/nawarhalabi)'s [Arabic Speech Corpus](http://en.arabicspeechcorpus.com/), including the [HiFi-GAN vocoder](https://github.com/jik876/hifi-gan) for direct TTS inference.
<div align="center">
<img src="https://user-images.githubusercontent.com/28433296/227660976-0d1e2033-276e-45e5-b232-a5a9b6b3f2a8.png" width="95%"></img>
</div>
Papers:
Tacotron2 | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions ([arXiv](https://arxiv.org/abs/1712.05884))
FastPitch | FastPitch: Parallel Text-to-speech with Pitch Prediction ([arXiv](https://arxiv.org/abs/2006.06873))
HiFi-GAN | HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis ([arXiv](https://arxiv.org/abs/2010.05646))
## Audio Samples
You can listen to some audio samples [here](https://nipponjo.github.io/tts-arabic-samples).
## Multispeaker model (in progress)
Multispeaker weights are available for the FastPitch model.
Currently, another male voice and two female voices have been added.
Audio samples can be found [here](https://nipponjo.github.io/tts-arabic-speakers). Download weights [here](https://drive.google.com/u/0/uc?id=18IYUSRXvLErVjaDORj_TKzUxs90l61Ja&export=download).
The multispeaker dataset was created by synthesizing data with [Coqui](https://github.com/coqui-ai)'s [XTTS-v2](https://huggingface.co/coqui/XTTS-v2) model and a mix of voices from the [Tunisian_MSA](https://www.openslr.org/46/) dataset.
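Below is a minimal usage sketch for a multispeaker checkpoint. The checkpoint file name and the `speaker_id` argument are assumptions here; check the downloaded weights and the `FastPitch2Wave` interface (introduced below) for the actual names.
```python
from models.fastpitch import FastPitch2Wave

# hypothetical file name -- use the checkpoint from the download link above
model = FastPitch2Wave('pretrained/fastpitch_ar_ms.pth')
model = model.cuda()

# speaker_id selects one of the added voices (argument name is an assumption)
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي", speaker_id=1)
```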
## Quick Setup
The models were trained with the MSE loss as described in the papers. I also trained the models with an additional adversarial loss (adv). The difference is not large, but I find that the (adv) versions often sound a bit clearer. You can compare them yourself.
Download the pretrained weights for the Tacotron2 model ([mse](https://drive.google.com/u/0/uc?id=1GCu-ZAcfJuT5qfzlKItcNqtuVNa7CNy9&export=download) | [adv](https://drive.google.com/u/0/uc?id=1FusCFZIXSVCQ9Q6PLb91GIkEnhn_zWRS&export=download)).
Download the pretrained weights for the FastPitch model ([mse](https://drive.google.com/u/0/uc?id=1sliRc62wjPTnPWBVQ95NDUgnCSH5E8M0&export=download) | [adv](https://drive.google.com/u/0/uc?id=1-vZOhi9To_78-yRslC6sFLJBUjwgJT-D&export=download)).
Download the [HiFi-GAN vocoder](https://github.com/jik876/hifi-gan) weights ([link](https://drive.google.com/u/0/uc?id=1zSYYnJFS-gQox-IeI71hVY-fdPysxuFK&export=download)). Either put them into `pretrained/hifigan-asc-v1` or edit the following lines in `configs/basic.yaml`.
```yaml
# vocoder
vocoder_state_path: pretrained/hifigan-asc-v1/hifigan-asc.pth
vocoder_config_path: pretrained/hifigan-asc-v1/config.json
```
This repo includes the diacritization models [Shakkala](https://github.com/Barqawiz/Shakkala) and [Shakkelha](https://github.com/AliOsm/shakkelha).
The weights can be downloaded [here](https://drive.google.com/u/1/uc?id=1MIZ_t7pqAQP-R3vwWWQTJMER8yPm1uB1&export=download). There is also a [separate repo](https://github.com/nipponjo/arabic-vocalization) and a [package](https://github.com/nipponjo/arabic_vocalizer).
-> Alternatively, [download all models](https://drive.google.com/u/1/uc?id=1FD2J-xUk48JPF9TeS8ZKHzDC_ZNBfLd8&export=download) and put the content of the zip file into the `pretrained` folder.
## Required packages:
`torch torchaudio pyyaml`
~ for training: `librosa matplotlib tensorboard`
~ for the demo app: `fastapi "uvicorn[standard]"`
## Using the models
The `Tacotron2`/`FastPitch` models from `models.tacotron2`/`models.fastpitch` are wrappers that simplify text-to-mel inference. The `Tacotron2Wave`/`FastPitch2Wave` models include the [HiFi-GAN vocoder](https://github.com/jik876/hifi-gan) for direct text-to-speech inference.
## Inferring the Mel spectrogram
```python
from models.tacotron2 import Tacotron2
model = Tacotron2('pretrained/tacotron2_ar_adv.pth')
model = model.cuda()
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي")
```
```python
from models.fastpitch import FastPitch
model = FastPitch('pretrained/fastpitch_ar_adv.pth')
model = model.cuda()
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي")
```
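If you want to inspect the result, the mel spectrogram can be plotted with `matplotlib` (listed under the training packages). This is a sketch that assumes `ttmel` returns a 2-D torch tensor of shape (mel bins, frames) on the GPU:
```python
import matplotlib.pyplot as plt

mel = mel_spec.detach().cpu().numpy()  # assumption: ttmel returns a torch tensor
plt.imshow(mel, origin='lower', aspect='auto')
plt.xlabel('frames')
plt.ylabel('mel bins')
plt.savefig('mel_spec.png')
```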
## End-to-end Text-to-Speech
```python
from models.tacotron2 import Tacotron2Wave
model = Tacotron2Wave('pretrained/tacotron2_ar_adv.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي")
wave_list = model.tts(["صِفر", "واحِد", "إِثنان", "ثَلاثَة", "أَربَعَة", "خَمسَة", "سِتَّة", "سَبعَة", "ثَمانِيَة", "تِسعَة", "عَشَرَة"])
```
```python
from models.fastpitch import FastPitch2Wave
model = FastPitch2Wave('pretrained/fastpitch_ar_adv.pth')
model = model.cuda()
wave = model.tts("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي")
wave_list = model.tts(["صِفر", "واحِد", "إِثنان", "ثَلاثَة", "أَربَعَة", "خَمسَة", "سِتَّة", "سَبعَة", "ثَمانِيَة", "تِسعَة", "عَشَرَة"])
```
By default, Arabic letters are converted using the [Buckwalter transliteration](https://en.wikipedia.org/wiki/Buckwalter_transliteration), which can also be used directly.
```python
wave = model.tts(">als~alAmu Ealaykum yA Sadiyqiy")
wave_list = model.tts(["Sifr", "wAHid", "<i^nAn", "^alA^ap", ">arbaEap", "xamsap", "sit~ap", "sabEap", "^amAniyap", "tisEap", "Ea$arap"])
```
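To listen to the generated audio, the waveform can be written to disk with `torchaudio`. A sketch assuming `wave` is a 1-D torch tensor and that the models run at 22050 Hz (check `configs/basic.yaml` for the actual sample rate):
```python
import torchaudio

# assumptions: `wave` is a 1-D torch tensor; 22050 Hz sample rate
torchaudio.save('sample.wav', wave.unsqueeze(0).cpu(), 22050)
```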
## Unvocalized text
```python
text_unvoc = "اللغة العربية هي أكثر اللغات السامية تحدثا، وإحدى أكثر اللغات انتشارا في العالم"
wave_shakkala = model.tts(text_unvoc, vowelizer='shakkala')
wave_shakkelha = model.tts(text_unvoc, vowelizer='shakkelha')
```
### Inference from text file
```bash
python inference.py
# default parameters:
python inference.py --list data/infer_text.txt --out_dir samples/results --model fastpitch --checkpoint pretrained/fastpitch_ar_adv.pth --batch_size 2 --denoise 0
```
## Testing the model
To test the model run:
```bash
python test.py
# default parameters:
python test.py --model fastpitch --checkpoint pretrained/fastpitch_ar_adv.pth --out_dir samples/test
```
## Processing details
This repo uses Nawar Halabi's [Arabic-Phonetiser](https://github.com/nawarhalabi/Arabic-Phonetiser) but simplifies the result such that different contexts are ignored (see `text/symbols.py`). Further, a doubled consonant is represented as consonant + doubling-token.
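As an illustration of the doubling convention (token names here are made up; see `text/symbols.py` for the actual symbol inventory):
```python
# Illustrative sketch only -- not the repo's phonetiser code.
def encode_consonant(phoneme: str, doubled: bool) -> list:
    """Represent a consonant with shadda as consonant + doubling token."""
    tokens = [phoneme]
    if doubled:
        tokens.append('<doubling>')  # hypothetical token name
    return tokens

# e.g. the doubled 't' in "sit~ap" -> ['t', '<doubling>']
```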
The Tacotron2 model can sometimes struggle to pronounce the last phoneme of a sentence when it ends in an unvocalized consonant. The pronunciation is more reliable if one appends a word-separator token at the end and cuts it off using the alignment weights (details in `models.networks`). This option is implemented as a default postprocessing step that can be disabled by setting `postprocess_mel=False`.
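For example, assuming the flag is accepted by the inference call (see `models.networks` for where it is actually handled):
```python
# assumption: postprocess_mel can be passed through the ttmel call
mel_spec = model.ttmel("اَلسَّلامُ عَلَيكُم يَا صَدِيقِي", postprocess_mel=False)
```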
## Training the model
Before training, the audio files must be resampled; the released models were trained on files preprocessed with `scripts/preprocess_audio.py`.
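For reference, resampling a single file with `torchaudio` looks like the sketch below. This only illustrates the resampling step; `scripts/preprocess_audio.py` should be used for the actual preprocessing, and the 22050 Hz target rate and file paths are assumptions:
```python
import torchaudio
from torchaudio.transforms import Resample

wav, sr = torchaudio.load('path/to/input.wav')          # hypothetical paths
wav_22k = Resample(orig_freq=sr, new_freq=22050)(wav)   # assumed target rate
torchaudio.save('path/to/output.wav', wav_22k, 22050)
```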
To train the model with options specified in the config file run:
```bash
python train.py
# default parameters:
python train.py --config configs/nawar.yaml
```
## Web app
The web app uses the FastAPI library. To run the app you need the following packages:
`fastapi` for the backend API | `uvicorn` for serving the app
Install with: `pip install fastapi "uvicorn[standard]"`
Run with: `python app.py`
Preview:
<div align="center">
<img src="https://user-images.githubusercontent.com/28433296/212092260-57b2ced3-da69-48ad-8be7-50e621423687.png" width="66%"></img>
</div>
## Acknowledgements
I referred to NVIDIA's [Tacotron2 implementation](https://github.com/NVIDIA/tacotron2) for details on model training.
The FastPitch files stem from NVIDIA's [DeepLearningExamples](https://github.com/NVIDIA/DeepLearningExamples/).