{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MusicGen-Style\n",
"Welcome to MusicGen-Style's demo jupyter notebook. Here you will find a series of self-contained examples of how to use MusicGen-Style in different settings.\n",
"\n",
"First, we start by initializing MusicGen-Style."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from audiocraft.models import MusicGen\n",
"from audiocraft.models import MultiBandDiffusion\n",
"\n",
"USE_DIFFUSION_DECODER = False\n",
"\n",
"model = MusicGen.get_pretrained('facebook/musicgen-style')\n",
"if USE_DIFFUSION_DECODER:\n",
" mbd = MultiBandDiffusion.get_mbd_musicgen()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us configure the generation parameters. Specifically, you can control the following:\n",
"* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.\n",
"* `top_k` (int, optional): top_k used for sampling. Defaults to 250.\n",
"* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.0.\n",
"* `temperature` (float, optional): softmax temperature parameter. Defaults to 1.0.\n",
"* `duration` (float, optional): duration of the generated waveform. Defaults to 30.0.\n",
"* `cfg_coef` (float, optional): coefficient used for classifier free guidance. Defaults to 3.0.\n",
"* `cfg_coef_beta` (float, optional): If not None, we use double CFG. cfg_coef_beta is the parameter that pushes the text. Defaults to None, user should start at 5.\n",
" If the generated music adheres to much to the text, the user should reduce this parameter. If the music adheres too much to the style conditioning, \n",
" the user should increase it\n",
"\n",
"When left unchanged, MusicGen will revert to its default parameters.\n",
"\n",
"These are the conditioner parameters for the style conditioner:\n",
"* `eval_q` (int): integer between 1 and 6 included that tells how many quantizers are used in the RVQ bottleneck\n",
" of the style conditioner. The higher eval_q is, the more style information passes through the model.\n",
"* `excerpt_length` (float): float between 1.5 and 4.5 that indicates which length is taken from the audio \n",
" conditioning to extract style. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.set_generation_params(\n",
" use_sampling=True,\n",
" top_k=250,\n",
" duration=30\n",
")"
]
},
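{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the style conditioner parameters described above, they can be set with `model.set_style_conditioner_params` (the values below are example settings only; the same call is used again in the style-conditioned examples further down)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.set_style_conditioner_params(\n",
"    eval_q=3, # number of RVQ quantizers used by the style conditioner (1 to 6)\n",
"    excerpt_length=3., # length in seconds of the excerpt used to extract the style\n",
")"
]
},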
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The model can perform text-to-music, style-to-music and text-and-style-to-music.\n",
"* Text-to-music can be done using `model.generate`, or `model.generate_with_chroma` with the wav condition being None. \n",
"* Style-to-music and Text-and-Style-to-music can be done using `model.generate_with_chroma`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text-to-Music"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from audiocraft.utils.notebook import display_audio\n",
"\n",
"model.set_generation_params(\n",
" duration=8, # generate 8 seconds, can go up to 30\n",
" use_sampling=True, \n",
" top_k=250,\n",
" cfg_coef=3., # Classifier Free Guidance coefficient \n",
" cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning\n",
")\n",
"\n",
"output = model.generate(\n",
" descriptions=[\n",
" '80s pop track with bassy drums and synth',\n",
" '90s rock song with loud guitars and heavy drums',\n",
" 'Progressive rock drum and bass solo',\n",
" 'Punk Rock song with loud drum and power guitar',\n",
" 'Bluesy guitar instrumental with soulful licks and a driving rhythm section',\n",
" 'Jazz Funk song with slap bass and powerful saxophone',\n",
" 'drum and bass beat with intense percussions'\n",
" ],\n",
" progress=True, return_tokens=True\n",
")\n",
"display_audio(output[0], sample_rate=32000)\n",
"if USE_DIFFUSION_DECODER:\n",
" out_diffusion = mbd.tokens_to_wav(output[1])\n",
" display_audio(out_diffusion, sample_rate=32000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Style-to-Music\n",
"For Style-to-Music, we don't need double CFG. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torchaudio\n",
"from audiocraft.utils.notebook import display_audio\n",
"\n",
"model.set_generation_params(\n",
" duration=8, # generate 8 seconds, can go up to 30\n",
" use_sampling=True, \n",
" top_k=250,\n",
" cfg_coef=3., # Classifier Free Guidance coefficient \n",
" cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning\n",
")\n",
"\n",
"model.set_style_conditioner_params(\n",
" eval_q=1, # integer between 1 and 6\n",
" # eval_q is the level of quantization that passes\n",
" # through the conditioner. When low, the models adheres less to the \n",
" # audio conditioning\n",
" excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt\n",
" )\n",
"\n",
"melody_waveform, sr = torchaudio.load(\"../assets/electronic.mp3\")\n",
"melody_waveform = melody_waveform.unsqueeze(0).repeat(2, 1, 1)\n",
"output = model.generate_with_chroma(\n",
" descriptions=[None, None], \n",
" melody_wavs=melody_waveform,\n",
" melody_sample_rate=sr,\n",
" progress=True, return_tokens=True\n",
")\n",
"display_audio(output[0], sample_rate=32000)\n",
"if USE_DIFFUSION_DECODER:\n",
" out_diffusion = mbd.tokens_to_wav(output[1])\n",
" display_audio(out_diffusion, sample_rate=32000)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Text-and-Style-to-Music\n",
"For Text-and-Style-to-Music, if we use simple classifier free guidance, the models tends to ignore the text conditioning. We then, introduce double classifier free guidance \n",
"$$l_{\\text{double CFG}} = l_{\\emptyset} + \\alpha [l_{style} + \\beta(l_{text, style} - l_{style}) - l_{\\emptyset}]$$\n",
"\n",
"For $\\beta=1$ we retrieve classic CFG but if $\\beta > 1$ we boost the text condition"
]
},
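{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal numerical sketch of the formula above (the logits and helper function below are illustrative, not part of the audiocraft API), we can check that double CFG reduces to classic CFG when $\\beta = 1$:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"alpha, beta = 3.0, 5.0 # play the roles of cfg_coef and cfg_coef_beta\n",
"# hypothetical logits for the three conditioning settings\n",
"l_empty = torch.randn(4) # unconditional\n",
"l_style = torch.randn(4) # style only\n",
"l_text_style = torch.randn(4) # text and style\n",
"\n",
"def double_cfg(l_empty, l_style, l_text_style, alpha, beta):\n",
"    return l_empty + alpha * (l_style + beta * (l_text_style - l_style) - l_empty)\n",
"\n",
"# with beta = 1 we recover classic CFG on the joint (text, style) condition\n",
"classic_cfg = l_empty + alpha * (l_text_style - l_empty)\n",
"print(torch.allclose(double_cfg(l_empty, l_style, l_text_style, alpha, 1.0), classic_cfg))"
]
},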
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torchaudio\n",
"from audiocraft.utils.notebook import display_audio\n",
"\n",
"model.set_generation_params(\n",
" duration=8, # generate 8 seconds, can go up to 30\n",
" use_sampling=True, \n",
" top_k=250,\n",
" cfg_coef=3., # Classifier Free Guidance coefficient \n",
" cfg_coef_beta=5., # double CFG is necessary for text-and-style conditioning\n",
" # Beta in the double CFG formula. between 1 and 9. When set to 1 \n",
" # it is equivalent to normal CFG. \n",
")\n",
"\n",
"model.set_style_conditioner_params(\n",
" eval_q=1, # integer between 1 and 6\n",
" # eval_q is the level of quantization that passes\n",
" # through the conditioner. When low, the models adheres less to the \n",
" # audio conditioning\n",
" excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt\n",
" )\n",
"\n",
"melody_waveform, sr = torchaudio.load(\"../assets/electronic.mp3\")\n",
"melody_waveform = melody_waveform.unsqueeze(0).repeat(3, 1, 1)\n",
"\n",
"descriptions = [\"8-bit old video game music\", \"Chill lofi remix\", \"80s New wave with synthesizer\"]\n",
"\n",
"output = model.generate_with_chroma(\n",
" descriptions=descriptions,\n",
" melody_wavs=melody_waveform,\n",
" melody_sample_rate=sr,\n",
" progress=True, return_tokens=True\n",
")\n",
"display_audio(output[0], sample_rate=32000)\n",
"if USE_DIFFUSION_DECODER:\n",
" out_diffusion = mbd.tokens_to_wav(output[1])\n",
" display_audio(out_diffusion, sample_rate=32000)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
},
"vscode": {
"interpreter": {
"hash": "b02c911f9b3627d505ea4a19966a915ef21f28afb50dbf6b2115072d27c69103"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}