|
|
---
license: mit
license_link: https://huggingface.co/nvidia/BigVGAN/blob/main/LICENSE
tags:
- neural-vocoder
- audio-generation
library_name: PyTorch
pipeline_tag: audio-to-audio
---
|
|
|
|
|
## BigVGAN with a different mel spectrogram input
|
|
These BigVGAN checkpoints come from continued training of [nvidia/bigvgan_v2_24khz_100band_256x](https://huggingface.co/nvidia/bigvgan_v2_24khz_100band_256x), with the input mel spectrogram generated by the following code from [vocos](https://github.com/gemelo-ai/vocos/blob/c859e3b7b534f3776a357983029d34170ddd6fc3/vocos/feature_extractors.py#L28C1-L49C24):
|
|
|
|
|
```py
import torch
import torchaudio

# FeatureExtractor (base class) and safe_log (log with a small clip floor)
# are defined in the vocos package.
from vocos.feature_extractors import FeatureExtractor
from vocos.modules import safe_log


class MelSpectrogramFeatures(FeatureExtractor):
    def __init__(self, sample_rate=24000, n_fft=1024, hop_length=256, n_mels=100, padding="center"):
        super().__init__()
        if padding not in ["center", "same"]:
            raise ValueError("Padding must be 'center' or 'same'.")
        self.padding = padding
        self.mel_spec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate,
            n_fft=n_fft,
            hop_length=hop_length,
            n_mels=n_mels,
            center=padding == "center",
            power=1,
        )

    def forward(self, audio, **kwargs):
        if self.padding == "same":
            pad = self.mel_spec.win_length - self.mel_spec.hop_length
            audio = torch.nn.functional.pad(audio, (pad // 2, pad // 2), mode="reflect")
        mel = self.mel_spec(audio)
        features = safe_log(mel)
        return features
```
|
|
|
|
|
Training used segment_size=65536 (unchanged) and batch_size=24 (vs. 32 used by the NVIDIA team). Final eval PESQ is 4.340, compared to 4.362 for the NVIDIA checkpoint evaluated with its own mel spectrogram code.
|
|
|
|
|
<center><img src="https://huggingface.co/cckm/bigvgan_melspec/resolve/main/assets/bigvgan_pesq.png" width="800"></center> |