{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MusicGen-Style\n",
    "Welcome to MusicGen-Style's demo jupyter notebook. Here you will find a series of self-contained examples of how to use MusicGen-Style in different settings.\n",
    "\n",
    "First, we start by initializing MusicGen-Style."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.models import MusicGen\n",
    "from audiocraft.models import MultiBandDiffusion\n",
    "\n",
    "USE_DIFFUSION_DECODER = False\n",
    "\n",
    "model = MusicGen.get_pretrained('facebook/musicgen-style')\n",
    "if USE_DIFFUSION_DECODER:\n",
    "    mbd = MultiBandDiffusion.get_mbd_musicgen()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let us configure the generation parameters. Specifically, you can control the following:\n",
    "* `use_sampling` (bool, optional): use sampling if True, else do argmax decoding. Defaults to True.\n",
    "* `top_k` (int, optional): top_k used for sampling. Defaults to 250.\n",
    "* `top_p` (float, optional): top_p used for sampling, when set to 0 top_k is used. Defaults to 0.0.\n",
    "* `temperature` (float, optional): softmax temperature parameter. Defaults to 1.0.\n",
    "* `duration` (float, optional): duration of the generated waveform. Defaults to 30.0.\n",
    "* `cfg_coef` (float, optional): coefficient used for classifier free guidance. Defaults to 3.0.\n",
    "* `cfg_coef_beta` (float, optional): If not None, we use double CFG. cfg_coef_beta is the parameter that pushes the text. Defaults to None, user should start at 5.\n",
    "    If the generated music adheres to much to the text, the user should reduce this parameter. If the music adheres too much to the style conditioning, \n",
    "    the user should increase it\n",
    "\n",
    "When left unchanged, MusicGen will revert to its default parameters.\n",
    "\n",
    "These are the conditioner parameters for the style conditioner:\n",
    "* `eval_q` (int): integer between 1 and 6 included that tells how many quantizers are used in the RVQ bottleneck\n",
    "    of the style conditioner. The higher eval_q is, the more style information passes through the model.\n",
    "* `excerpt_length` (float): float between 1.5 and 4.5 that indicates which length is taken from the audio \n",
    "    conditioning to extract style. \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.set_generation_params(\n",
    "    use_sampling=True,\n",
    "    top_k=250,\n",
    "    duration=30\n",
    ")"
   ]
  },
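  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick illustration of the conditioner parameters described above, here is a minimal sketch of setting them with `model.set_style_conditioner_params` (the same call used in the sections below); the values here are illustrative examples, not tuned defaults."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative values only, not tuned defaults.\n",
    "model.set_style_conditioner_params(\n",
    "    eval_q=3, # 1 to 6: more quantizers let more style information through\n",
    "    excerpt_length=3., # seconds (1.5 to 4.5) taken from the audio conditioning\n",
    ")"
   ]
  },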
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model can perform text-to-music, style-to-music and text-and-style-to-music.\n",
    "* Text-to-music can be done using `model.generate`, or `model.generate_with_chroma` with the wav condition being None. \n",
    "* Style-to-music and Text-and-Style-to-music can be done using `model.generate_with_chroma`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text-to-Music"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from audiocraft.utils.notebook import display_audio\n",
    "\n",
    "model.set_generation_params(\n",
    "    duration=8, # generate 8 seconds, can go up to 30\n",
    "    use_sampling=True, \n",
    "    top_k=250,\n",
    "    cfg_coef=3., # Classifier Free Guidance coefficient \n",
    "    cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning\n",
    ")\n",
    "\n",
    "output = model.generate(\n",
    "    descriptions=[\n",
    "        '80s pop track with bassy drums and synth',\n",
    "        '90s rock song with loud guitars and heavy drums',\n",
    "        'Progressive rock drum and bass solo',\n",
    "        'Punk Rock song with loud drum and power guitar',\n",
    "        'Bluesy guitar instrumental with soulful licks and a driving rhythm section',\n",
    "        'Jazz Funk song with slap bass and powerful saxophone',\n",
    "        'drum and bass beat with intense percussions'\n",
    "    ],\n",
    "    progress=True, return_tokens=True\n",
    ")\n",
    "display_audio(output[0], sample_rate=32000)\n",
    "if USE_DIFFUSION_DECODER:\n",
    "    out_diffusion = mbd.tokens_to_wav(output[1])\n",
    "    display_audio(out_diffusion, sample_rate=32000)"
   ]
  },
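  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As noted above, text-to-music also works through `model.generate_with_chroma` when the wav condition is None. Below is a minimal sketch of that path, assuming (as the note above suggests) that `melody_wavs` accepts None entries; `melody_sample_rate` is then just a placeholder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: text-only generation through generate_with_chroma, with None wav conditions.\n",
    "output = model.generate_with_chroma(\n",
    "    descriptions=['80s pop track with bassy drums and synth'],\n",
    "    melody_wavs=[None], # no style conditioning\n",
    "    melody_sample_rate=32000, # placeholder, unused when all wavs are None\n",
    "    progress=True, return_tokens=True\n",
    ")\n",
    "display_audio(output[0], sample_rate=32000)"
   ]
  },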
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Style-to-Music\n",
    "For Style-to-Music, we don't need double CFG. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchaudio\n",
    "from audiocraft.utils.notebook import display_audio\n",
    "\n",
    "model.set_generation_params(\n",
    "    duration=8, # generate 8 seconds, can go up to 30\n",
    "    use_sampling=True, \n",
    "    top_k=250,\n",
    "    cfg_coef=3., # Classifier Free Guidance coefficient \n",
    "    cfg_coef_beta=None, # double CFG is only useful for text-and-style conditioning\n",
    ")\n",
    "\n",
    "model.set_style_conditioner_params(\n",
    "    eval_q=1, # integer between 1 and 6\n",
    "              # eval_q is the level of quantization that passes\n",
    "              # through the conditioner. When low, the models adheres less to the \n",
    "              # audio conditioning\n",
    "    excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt\n",
    "    )\n",
    "\n",
    "melody_waveform, sr = torchaudio.load(\"../assets/electronic.mp3\")\n",
    "melody_waveform = melody_waveform.unsqueeze(0).repeat(2, 1, 1)\n",
    "output = model.generate_with_chroma(\n",
    "    descriptions=[None, None], \n",
    "    melody_wavs=melody_waveform,\n",
    "    melody_sample_rate=sr,\n",
    "    progress=True, return_tokens=True\n",
    ")\n",
    "display_audio(output[0], sample_rate=32000)\n",
    "if USE_DIFFUSION_DECODER:\n",
    "    out_diffusion = mbd.tokens_to_wav(output[1])\n",
    "    display_audio(out_diffusion, sample_rate=32000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Text-and-Style-to-Music\n",
    "For Text-and-Style-to-Music, if we use simple classifier free guidance, the models tends to ignore the text conditioning. We then, introduce double classifier free guidance \n",
    "$$l_{\\text{double CFG}} = l_{\\emptyset} + \\alpha [l_{style} + \\beta(l_{text, style} - l_{style}) - l_{\\emptyset}]$$\n",
    "\n",
    "For $\\beta=1$ we retrieve classic CFG but if $\\beta > 1$ we boost the text condition"
   ]
  },
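  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the formula concrete, here is a minimal sketch of the double CFG combination itself. The tensor names (`l_empty`, `l_style`, `l_text_style`) are hypothetical placeholders for the unconditional, style-only, and text-and-style logits; this is not AudioCraft's internal code. `alpha` plays the role of `cfg_coef` and `beta` of `cfg_coef_beta`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "def double_cfg(l_empty, l_style, l_text_style, alpha=3.0, beta=5.0):\n",
    "    # l_double_cfg = l_empty + alpha * [l_style + beta * (l_text_style - l_style) - l_empty]\n",
    "    return l_empty + alpha * (l_style + beta * (l_text_style - l_style) - l_empty)\n",
    "\n",
    "# Sanity check: beta=1 reduces to classic CFG on the text-and-style logits.\n",
    "l_empty, l_style, l_text_style = (torch.randn(2, 4) for _ in range(3))\n",
    "assert torch.allclose(\n",
    "    double_cfg(l_empty, l_style, l_text_style, alpha=3.0, beta=1.0),\n",
    "    l_empty + 3.0 * (l_text_style - l_empty),\n",
    ")"
   ]
  },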
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torchaudio\n",
    "from audiocraft.utils.notebook import display_audio\n",
    "\n",
    "model.set_generation_params(\n",
    "    duration=8, # generate 8 seconds, can go up to 30\n",
    "    use_sampling=True, \n",
    "    top_k=250,\n",
    "    cfg_coef=3., # Classifier Free Guidance coefficient \n",
    "    cfg_coef_beta=5., # double CFG is necessary for text-and-style conditioning\n",
    "                   # Beta in the double CFG formula. between 1 and 9. When set to 1 \n",
    "                   # it is equivalent to normal CFG. \n",
    ")\n",
    "\n",
    "model.set_style_conditioner_params(\n",
    "    eval_q=1, # integer between 1 and 6\n",
    "              # eval_q is the level of quantization that passes\n",
    "              # through the conditioner. When low, the models adheres less to the \n",
    "              # audio conditioning\n",
    "    excerpt_length=3., # the length in seconds that is taken by the model in the provided excerpt\n",
    "    )\n",
    "\n",
    "melody_waveform, sr = torchaudio.load(\"../assets/electronic.mp3\")\n",
    "melody_waveform = melody_waveform.unsqueeze(0).repeat(3, 1, 1)\n",
    "\n",
    "descriptions = [\"8-bit old video game music\", \"Chill lofi remix\", \"80s New wave with synthesizer\"]\n",
    "\n",
    "output = model.generate_with_chroma(\n",
    "    descriptions=descriptions,\n",
    "    melody_wavs=melody_waveform,\n",
    "    melody_sample_rate=sr,\n",
    "    progress=True, return_tokens=True\n",
    ")\n",
    "display_audio(output[0], sample_rate=32000)\n",
    "if USE_DIFFUSION_DECODER:\n",
    "    out_diffusion = mbd.tokens_to_wav(output[1])\n",
    "    display_audio(out_diffusion, sample_rate=32000)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  },
  "vscode": {
   "interpreter": {
    "hash": "b02c911f9b3627d505ea4a19966a915ef21f28afb50dbf6b2115072d27c69103"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}