Fixed duration?
Hi!
I've been running some optimisation tests, and after trying different parameters for a while, it appears that regardless of the specified length the model always returns audio of about 11 seconds. Is that expected?
Also, the demo implementation mentioned in the ARM optimisations tutorial seems to specify that same length (see here ).
Before I go further trying to get this working in my code, I'd like to confirm that this matches what you see on your end. If anyone else here has managed otherwise, I'd appreciate the help.
Thank you!
I guess it's mentioned that variable length is within 11 s for this model.
Hi
@edwixx
thanks for your reply.
Do you mean you saw somewhere that the model only supports 11 seconds? I must have missed it.
I've had a similar experience. I'm prototyping on iOS, and whatever value I supply for the audio-length input tensor, the audio always comes out at 7 seconds.
Hey! Author here, to go into more detail about this:
SAOS is designed, like the rest of the stable audio series, to take in a "seconds_total" parameter that controls the amount of audio that should be generated, in this case between 0-11s (it's actually 11.88s, or 256 latents in our VAE, to be specific). That means that no matter how much audio you want to generate between 0-11s, the model will always generate 256 latents of content, where for seconds_total < 11 the rest should be silence. For downstream use-cases, this approach is chosen because it allows you to generate less audio than you need without frequent graph recalculations for compiled models, and it lets you train on variable-length inputs in a scalable fashion.
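To make the fixed-window behaviour concrete, here's a minimal sketch of how you'd trim the output on the client side. The `trim_to_seconds` helper, the 44.1 kHz sample rate, and the zero-filled stand-in for a real generation call are my assumptions for illustration; the 256-latent / 11.88 s figures are from the post above.

```python
import numpy as np

SAMPLE_RATE = 44100      # assumed output sample rate
TOTAL_LATENTS = 256      # fixed latent count per generation (per the post)
TOTAL_SECONDS = 11.88    # 256 latents == 11.88 s of audio

def trim_to_seconds(audio: np.ndarray, seconds_total: float) -> np.ndarray:
    """Keep only the first `seconds_total` seconds of the fixed-length
    window; the model pads the remainder of the 11.88 s with silence."""
    n_samples = round(seconds_total * SAMPLE_RATE)
    return audio[..., :n_samples]

# Stand-in for a real generation call: the model always emits the full
# fixed-length window regardless of the requested seconds_total.
full_window = np.zeros((2, round(TOTAL_SECONDS * SAMPLE_RATE)))  # stereo
clip = trim_to_seconds(full_window, seconds_total=5.0)
print(clip.shape)  # (2, 220500)
```

This is also why seconds_total < 11 doesn't speed anything up by itself: the diffusion pass always produces all 256 latents, and the trim happens after decoding.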
HOWEVER, note that if you want to hack at the model, you theoretically can get it to run on data sizes that are smaller than 11.88s (256 latents) without training, and longer than that with some finetuning (theoretically you might be able to get it to 0-shot generate longer than 11s of audio, but since it wasn't trained on that, the quality will probably suffer). For <11s, I've briefly played around on my end with changing the data size itself (generating 128 latents instead, or 5.94s), and this gets you the expected speedups without much quality degradation in my brief tests (though I wouldn't be shocked if quality took a dip under closer evaluation).
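If you do experiment with smaller data sizes, the latent arithmetic is just a linear mapping. A small sketch, using only the 256-latent / 11.88 s ratio given above (the helper name is mine):

```python
# 256 latents cover 11.88 s, so the latent rate is 256 / 11.88 ≈ 21.55 latents/s.
LATENT_RATE = 256 / 11.88

def latents_for(seconds: float) -> int:
    """Nearest whole latent count for a target duration."""
    return round(seconds * LATENT_RATE)

def seconds_for(latents: int) -> float:
    """Duration covered by a given latent count."""
    return latents / LATENT_RATE

print(latents_for(5.94))   # 128, the half-size case from the post
print(seconds_for(256))    # 11.88, the full window
```

So halving the data size to 128 latents gives you the 5.94 s window mentioned above, and any other target duration rounds to the nearest latent the same way.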