Noissy is a latent diffusion model built to generate music from text inputs. It uses pretrained text encoders to analyze and interpret the language, converting it into corresponding musical patterns. The model is being trained on large datasets of music-text pairs, allowing it to create coherent and musically relevant outputs.
The pipeline chains two deep learning models to achieve this: a VAE and a diffusion model built around a custom UNet.
The first stage uses a VAE to convert Mel spectrograms into latent-space encodings. These encodings compress the audio information while preserving its essential structure, enabling more efficient processing in the later stages.
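As a rough illustration, the encoding step might look like the sketch below. This is a minimal example, not the project's actual architecture: the `SpectrogramVAE` class, its layer sizes, and the latent shape are all illustrative assumptions.

```python
# Minimal sketch of the first stage: a VAE compresses a Mel spectrogram
# into a latent encoding. The class name and layer sizes are assumptions.
import torch
import torch.nn as nn

class SpectrogramVAE(nn.Module):
    def __init__(self, latent_channels=8):
        super().__init__()
        # Convolutional encoder: downsamples the spectrogram spatially.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Two heads produce the mean and log-variance of the latent.
        self.to_mu = nn.Conv2d(64, latent_channels, kernel_size=1)
        self.to_logvar = nn.Conv2d(64, latent_channels, kernel_size=1)

    def encode(self, mel):
        h = self.encoder(mel)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample a latent from N(mu, sigma^2).
        return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

# A (batch, 1, n_mels, frames) Mel spectrogram becomes a compact latent.
mel = torch.randn(1, 1, 128, 512)
z = SpectrogramVAE().encode(mel)   # shape (1, 8, 32, 128)
```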
In the second stage, the latent representation is progressively noised following a diffusion schedule. The noised latent encoding is then passed through a custom UNet for denoising.
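The forward (noising) process can be sketched under standard DDPM assumptions. The linear beta schedule and 1000-step count below are common defaults, not confirmed details of Noissy's training setup.

```python
# Sketch of the second stage's forward diffusion: the VAE latent z is
# noised with a linear beta schedule; the UNet learns to predict the noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(z, t):
    """Forward diffusion q(z_t | z_0) in closed form."""
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z)
    z_t = a.sqrt() * z + (1.0 - a).sqrt() * eps
    return z_t, eps

# Training step (illustrative): z, t, text_emb, and unet are assumed to
# come from the surrounding pipeline.
# z_t, eps = add_noise(z, t)
# loss = torch.nn.functional.mse_loss(unet(z_t, t, text_emb), eps)
```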
The custom UNet module can be configured into any shape the user wants, and attention blocks can be added so that audio is generated conditioned on the text.
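To make the configurability concrete, the sketch below builds a UNet-style downsampling path from plain constructor arguments, with per-level flags for cross-attention. The function name, defaults, and use of `nn.MultiheadAttention` are assumptions for illustration, not the actual `CustomUNet` interface.

```python
import torch.nn as nn

# Hypothetical sketch of a shape-configurable UNet down path: the user
# picks channel widths per level and which levels get a cross-attention
# block for text conditioning.
def build_down_path(in_channels=8, channels=(64, 128, 256),
                    use_attention=(False, True, True), cond_dim=768):
    levels = []
    prev = in_channels
    for ch, attn in zip(channels, use_attention):
        block = [nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1),
                 nn.SiLU()]
        if attn:
            # Cross-attention against the text embedding (kdim/vdim =
            # cond_dim) conditions generation on the prompt.
            block.append(nn.MultiheadAttention(
                embed_dim=ch, num_heads=4,
                kdim=cond_dim, vdim=cond_dim, batch_first=True))
        levels.append(nn.ModuleList(block))
        prev = ch
    return nn.ModuleList(levels)
```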
We have a diverse range of music tracks and corresponding labels, which need to be transformed into a form that can be processed by our model.
The first step of this process is to resample all the songs to a constant sampling rate and then split them into 30-second chunks so that they are all of equal length.
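A minimal version of this step, assuming torchaudio (the project's actual tooling isn't specified) and a 22.05 kHz target rate, could look like:

```python
import torchaudio

TARGET_SR = 22050          # assumed target sampling rate
CHUNK_SECONDS = 30

def load_chunks(path):
    wav, sr = torchaudio.load(path)
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    chunk_len = TARGET_SR * CHUNK_SECONDS
    # Drop the trailing remainder so every chunk is exactly 30 s long.
    n = wav.shape[-1] // chunk_len
    return [wav[..., i * chunk_len:(i + 1) * chunk_len] for i in range(n)]
```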
Using a short-time Fourier transform (STFT), these 30-second chunks are first converted into the frequency domain and then transformed into Mel spectrograms.
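Assuming torchaudio again, the STFT and Mel filterbank can be applied in one step with the built-in transform; the `n_fft`, `hop_length`, and `n_mels` values here are illustrative:

```python
import torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050,   # must match the resampling rate above
    n_fft=1024,          # STFT window size
    hop_length=256,      # stride between successive frames
    n_mels=128,          # number of Mel frequency bins
)
# `chunk` is one 30-second waveform from the chunking step above.
# mel = to_mel(chunk)   # (channels, n_mels, frames)
```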