Mastering-Python-HF committed on
Commit 02dd47f · 1 Parent(s): 490f29b

Update README.md
Files changed (1)
  1. README.md +26 -12
README.md CHANGED
@@ -9,20 +9,23 @@ pipeline_tag: text-to-speech

# NVIDIA FastPitch Multispeaker (en-US)

- FastPitch [1] is a fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration. Additionally, it uses an unsupervised speech-text aligner [2]. See the model architecture section for complete architecture details.

- It is also compatible with NVIDIA Riva for production-grade server deployments.

## Usage
The model is available for use in the NeMo toolkit [3] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend you install it after you've installed the latest PyTorch version.
```
pip install nemo-toolkit['all']
```

## Instantiate the model
- Note: This model generates only spectrograms and a vocoder is needed to convert the spectrograms to waveforms. In this example HiFiTTS_HiFiGAN is used.

```
from huggingface_hub import hf_hub_download
@@ -53,28 +56,39 @@ audio = model.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 44100)
```

- ## Input
This model accepts batches of text.

- ## Output
This model generates mel spectrograms.

## Model Architecture
- FastPitch is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantics of the utterance, and ultimately more engaging to the listener. FastPitch is based on a fully-parallel Transformer architecture, with a much higher real-time factor than Tacotron2 for the mel-spectrogram synthesis of a typical utterance. It uses an unsupervised speech-text aligner.

## Training
- The NeMo toolkit [3] was used to train the models for 1000 epochs. These models are trained with this example script and this base config.

## Datasets
- This model is trained on LJSpeech sampled at 22050 Hz, and has been tested on generating female English voices with an American accent.

## Performance
No performance information is available at this time.

## Limitations
- This checkpoint only works well with vocoders that were trained on 22050 Hz data. Otherwise, the generated audio may be scratchy or choppy-sounding.

## References
- -[1] FastPitch: Parallel Text-to-speech with Pitch Prediction
- -[2] One TTS Alignment To Rule Them All
- -[3] NVIDIA NeMo Toolkit

# NVIDIA FastPitch Multispeaker (en-US)

+ FastPitch [1] is a fully-parallel transformer architecture with prosody control over pitch and individual phoneme duration. Additionally, it uses an unsupervised speech-text aligner [2]. See the [model architecture](#model-architecture) section for complete architecture details.

+ It is also compatible with NVIDIA Riva for [production-grade server deployments](#deployment-with-nvidia-riva).

## Usage
+
The model is available for use in the NeMo toolkit [3] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

To train, fine-tune, or play with the model, you will need to install NVIDIA NeMo. We recommend you install it after you've installed the latest PyTorch version.
+
```
pip install nemo-toolkit['all']
```
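After installing, a quick sanity check is to import the TTS collection and list the pretrained FastPitch checkpoints NeMo can fetch from NGC on its own. This is a minimal sketch using NeMo's generic `list_available_models()` helper; it is independent of the Hugging Face checkpoint this card describes:
```
# Sanity check after `pip install nemo-toolkit['all']`:
# list the FastPitch checkpoints NeMo itself can download from NGC.
from nemo.collections.tts.models import FastPitchModel

for info in FastPitchModel.list_available_models():
    print(info.pretrained_model_name)
```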

## Instantiate the model
+
+ Note: This model generates only spectrograms, and a vocoder is needed to convert the spectrograms to waveforms. In this example HiFiGAN is used.

```
from huggingface_hub import hf_hub_download

@@ -53,28 +56,39 @@ audio = model.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 44100)
```
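The hunk above shows only the first and last lines of the card's instantiation example; the middle of the script (downloading the `.nemo` files and restoring the models) is unchanged and therefore hidden by the diff. Purely as an illustrative sketch of the same FastPitch plus HiFi-GAN flow (not the card's exact code), the steps look roughly like this; the NGC model names and the speaker index are assumptions:
```
# Illustrative sketch only: the NGC aliases below are assumed for a 44100 Hz
# multispeaker FastPitch + HiFi-GAN pair and are not taken from this card.
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_model = FastPitchModel.from_pretrained("tts_en_fastpitch_multispeaker")   # assumed name
vocoder = HifiGanModel.from_pretrained("tts_en_hifitts_hifigan_ft_fastpitch")  # assumed name
spec_model.eval()
vocoder.eval()

# Text -> tokens -> mel spectrogram; speaker=0 picks the first training voice (assumption).
parsed = spec_model.parse("You can type your sentence here to get NeMo to produce speech.")
spectrogram = spec_model.generate_spectrogram(tokens=parsed, speaker=0)

# Mel spectrogram -> waveform, saved at the 44100 Hz rate this card targets.
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
sf.write("speech.wav", audio.to('cpu').detach().numpy()[0], 44100)
```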

+ ## Colab example
+
+ #### Link: [nvidia_tts_en_fastpitch_multispeaker](https://colab.research.google.com/drive/1ZJFCMVVjl7VtfVGlkQ-G1cXKyaucBzJf?usp=sharing)
+
+ ### Input
+
This model accepts batches of text.

+ ### Output
+
This model generates mel spectrograms.

## Model Architecture
+
+ FastPitch multispeaker is a fully-parallel text-to-speech model based on FastSpeech, conditioned on fundamental frequency contours. The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive, better match the semantics of the utterance, and ultimately more engaging to the listener. FastPitch is based on a fully-parallel Transformer architecture, with a much higher real-time factor than Tacotron2 for the mel-spectrogram synthesis of a typical utterance. It uses an unsupervised speech-text aligner.
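Because the pitch and duration values are predicted at inference time, prosody can be nudged without retraining. A minimal sketch of speaking-rate control, assuming the `speaker` and `pace` keyword arguments of `generate_spectrogram` in recent NeMo releases and the same assumed NGC model name as above:
```
# Sketch of prosody control at inference time; the model name and kwargs are assumptions.
from nemo.collections.tts.models import FastPitchModel

spec_model = FastPitchModel.from_pretrained("tts_en_fastpitch_multispeaker")  # assumed name
spec_model.eval()

parsed = spec_model.parse("Prosody can be adjusted when the spectrogram is generated.")
# `pace` rescales the predicted phoneme durations, changing the speaking rate (1.0 is the default).
slower = spec_model.generate_spectrogram(tokens=parsed, speaker=0, pace=0.8)
faster = spec_model.generate_spectrogram(tokens=parsed, speaker=0, pace=1.2)
```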

## Training
+
+ The NeMo toolkit [3] was used to train the models for 1000 epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/fastpitch.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/fastpitch_align_v1.05.yaml).
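For reference, a training or fine-tuning run with that script is launched through Hydra overrides on the base config. The sketch below is only the rough shape of the command, not a verified recipe; override names can differ between NeMo versions, and the manifest and supplementary-data paths are placeholders:
```
python examples/tts/fastpitch.py --config-name=fastpitch_align_v1.05 \
    train_dataset=<path/to/train_manifest.json> \
    validation_datasets=<path/to/val_manifest.json> \
    sup_data_path=<path/to/sup_data> \
    exp_manager.exp_dir=<path/to/experiments> \
    trainer.max_epochs=1000
```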

## Datasets
+
+ This model is trained on HiFiTTS sampled at 44100 Hz, and has been tested on generating multispeaker English voices with American and UK accents.

## Performance
+
No performance information is available at this time.

## Limitations
+ This checkpoint only works well with vocoders that were trained on 44100 Hz data. Otherwise, the generated audio may be scratchy or choppy-sounding.

## References
+ - [1] [FastPitch: Parallel Text-to-speech with Pitch Prediction](https://arxiv.org/abs/2006.06873)
+ - [2] [One TTS Alignment To Rule Them All](https://arxiv.org/abs/2108.10447)
+ - [3] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)