|
|
---
license: mit
license_link: https://huggingface.co/nvidia/BigVGAN/blob/main/LICENSE
tags:
- neural-vocoder
- audio-generation
library_name: PyTorch
pipeline_tag: audio-to-audio
---
|
|
|
|
|
## BigVGAN with a different mel spectrogram input
|
|
These BigVGAN checkpoints come from continued training of [nvidia/bigvgan_v2_24khz_100band_256x](https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x), with the input mel spectrogram generated by the following code from [vocos](https://github.com/gemelo-ai/vocos/blob/c859e3b7b534f3776a357983029d34170ddd6fc3/vocos/feature_extractors.py#L28C1-L49C24):
|
|
|
|
|
```py
import torch
import torchaudio

# FeatureExtractor (base class) and safe_log (log with a small clip floor)
# are defined in the vocos package.
from vocos.feature_extractors import FeatureExtractor
from vocos.modules import safe_log


class MelSpectrogramFeatures(FeatureExtractor):
    def __init__(self, sample_rate=24000, n_fft=1024, hop_length=256, n_mels=100, padding="center"):
        super().__init__()
        if padding not in ["center", "same"]:
            raise ValueError("Padding must be 'center' or 'same'.")
        self.padding = padding
        self.mel_spec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
            center=padding == "center",
            power=1,
        )

    def forward(self, audio, **kwargs):
        if self.padding == "same":
            pad = self.mel_spec.win_length - self.mel_spec.hop_length
            audio = torch.nn.functional.pad(audio, (pad // 2, pad // 2), mode="reflect")
        mel = self.mel_spec(audio)
        features = safe_log(mel)
        return features
```
|
|
|
|
|
Training used segment_size=65536 (unchanged) and batch_size=24 (vs. 32 used by the NVIDIA team). Final eval PESQ is 4.340, compared to 4.362 for the NVIDIA checkpoint evaluated with its own mel spectrogram code.
|
|
|
|
|
<center><img src="https://huggingface.co/cckm/bigvgan_melspec/resolve/main/assets/bigvgan_pesq.png" width="800"></center> |