
How Meta’s AI Generates Music Based on a Reference Melody


MusicGen, analyzed


On June 13th, 2023, Meta (formerly Facebook) made waves in the music and AI communities with the release of their generative music model, MusicGen. The model not only surpasses Google's MusicLM, which was released earlier this year, in terms of capabilities, but is also trained on licensed music data and open-sourced for non-commercial use.

This means that you can not only read the research paper and listen to demos, but also copy the code from GitHub or experiment with the model in a web app on HuggingFace.

In addition to generating audio from a text prompt, MusicGen can also generate music based on a given reference melody, a feature known as melody conditioning. In this blog post, I'll show how Meta implemented this useful and interesting functionality in their model. But before we get into that, let's first look at how melody conditioning works in practice.

Base Track

The following is a short electronic music snippet that I produced for this article. It features electronic drums, two dominant 808 basses, and two syncopated synths. When listening to it, try to identify the "main melody" of the track.

Using MusicGen, I can now generate music in other genres that sticks to the same main melody. All I need for that is my base track and a text prompt describing how the new piece should sound.
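In practice, this takes only a few lines of Python with Meta's open-source audiocraft package. The sketch below follows the usage pattern published in the project's repository at the time of writing; the checkpoint name `facebook/musicgen-melody` and the file `base_track.wav` are assumptions on my part, and exact function names may differ between versions.

```python
import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the melody-conditioned variant of MusicGen.
model = MusicGen.get_pretrained("facebook/musicgen-melody")
model.set_generation_params(duration=10)  # length of the generated clip in seconds

# The base track provides the reference melody; the prompt describes the target style.
descriptions = ["classic reggae track with an electronic guitar solo"]
melody, sr = torchaudio.load("base_track.wav")

# Generate one variation per description, all conditioned on the same reference melody.
wav = model.generate_with_chroma(
    descriptions, melody[None].expand(len(descriptions), -1, -1), sr
)

for idx, one_wav in enumerate(wav):
    # Writes variant_0.wav, variant_1.wav, ... with loudness normalization.
    audio_write(f"variant_{idx}", one_wav.cpu(), model.sample_rate, strategy="loudness")
```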

Orchestral Variant

A grand orchestral arrangement with thunderous percussion, epic brass fanfares, and soaring strings, creating a cinematic atmosphere fit for a heroic battle.

Reggae Variant

classic reggae track with an electronic guitar solo

Jazz Variant

smooth jazz, with a saxophone solo, piano chords, and snare full drums

How Good are the Results?

Although MusicGen doesn't adhere closely to my text prompts and creates music that's slightly different from what I asked for, the generated pieces still accurately reflect the requested genre and, more importantly, each piece showcases its own interpretation of the main melody from the base track.

While the results are not perfect, I find the capabilities of this model quite impressive. The fact that MusicGen has been one of the most popular models on HuggingFace ever since its release further underlines its significance. With that said, let's dive deeper into the technical aspects of how melody conditioning works.

Three text-music pairs as they’re used for training models like MusicLM or MusicGen. Image by creator.

Virtually all current generative music models follow the same procedure during training. They are supplied with a large database of music tracks accompanied by corresponding text descriptions. The model learns the connection between words and sounds, as well as how to convert a given text prompt into a coherent and enjoyable piece of music. During training, the model optimizes its own compositions by comparing them to the real music tracks in the dataset. This allows it to identify its strengths and the areas that need improvement.

The difficulty is that once a machine learning model is trained for a particular task, such as text-to-music generation, it is limited to that task. While it is possible to make MusicGen perform certain tasks that it was not explicitly trained for, like continuing a given piece of music, it cannot be expected to handle every music generation request. For example, it cannot simply take a melody and transform it into a different genre. That would be like throwing potatoes into a toaster and expecting fries to come out. Instead, a separate model must be trained to implement this functionality.

Let's explore how Meta adapted the training procedure to enable MusicGen to generate variations of a given melody based on a text prompt. There are several challenges associated with this approach, however. One of the primary obstacles is the ambiguity in identifying "the melody" of a song and representing it in a computationally meaningful way. Still, for the purpose of understanding the new training procedure at a broader level, let's assume there is a consensus on what constitutes "the melody" and that it can easily be extracted and fed to the model. In this scenario, the adjusted training method can be outlined as follows:

Three text-music-melody pairs as they were used for teaching MusicGen melody-conditioned generation.

For each track in the database, the first step is to extract its melody. Then, the model is fed both the track's text description and its corresponding melody and asked to recreate the original track. Essentially, this approach simplifies the original training objective, where the model was tasked with recreating the track based on the text alone.

To understand why we do this, let's ask ourselves what the AI model learns in this training procedure. In essence, it learns how a melody can be turned into a full piece of music based on a text description. This means that after training, we can provide the model with a melody and ask it to compose a piece of music in any genre, mood, or instrumentation. To the model, this is the same "semi-blind" generation task it has successfully completed countless times during training.
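To make the idea concrete, here is a highly simplified, hypothetical sketch of what one melody-conditioned training step could look like. The names (`model`, `text_encoder`, `extract_chroma`, the batch keys) are placeholders of my own, not Meta's actual training code; the point is only that the model sees both conditions and is penalized for failing to reconstruct the original track.

```python
import torch.nn.functional as F

def training_step(model, text_encoder, extract_chroma, batch, optimizer):
    """One schematic melody-conditioned training step (not Meta's actual code)."""
    audio_tokens = batch["audio_tokens"]            # discrete codes of the reference track
    text_emb = text_encoder(batch["descriptions"])  # condition 1: the text description
    chroma = extract_chroma(batch["waveforms"])     # condition 2: the extracted melody

    # The model must reconstruct the original track from both conditions.
    # Dropping the chroma condition recovers the plain text-to-music objective.
    logits = model(audio_tokens[:, :-1], text_emb, chroma)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        audio_tokens[:, 1:].reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```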

Having grasped the technique Meta employed to teach the model melody-conditioned music generation, we still need to tackle the challenge of precisely defining what constitutes "the melody."

The truth is, there is no objective way to determine or extract "the melody" of a polyphonic musical piece, except when all instruments are playing in unison. While there is often a prominent instrument such as a voice, guitar, or violin, that doesn't necessarily mean the other instruments are not part of "the melody." Take Queen's "Bohemian Rhapsody" as an example. When you think of the song, you might first recall Freddie Mercury's main vocal melodies. However, does that mean the piano in the intro, the background singers in the middle section, and the electric guitar before "So you think you can stone me […]" are not part of the melody?

One method for extracting "the melody" of a song is to treat the most prominent melody as the most dominant one, typically identified as the loudest melody in the mix. The chromagram is a widely used representation that visually displays the most dominant musical notes throughout a track. Below, you can find the chromagram of the reference track, first with the full instrumentation and then with drums and bass excluded. On the left side, the most relevant notes for the melody (B, F#, G) are highlighted in blue.

Both chromagrams accurately depict the primary melody notes, with the version of the track without drums and bass providing a clearer visualization of the melody. Meta's study made the same observation, which led them to use their source separation tool (DEMUCS) to remove any disturbing rhythmic elements from the track. This process results in a sufficiently representative rendition of "the melody," which can then be fed to the model.
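If you want to inspect your own tracks, a chromagram like the ones above can be computed with a standard audio library such as librosa. This is a minimal sketch assuming you have already separated the stems (for example with Demucs) and exported a version without drums and bass; the file name is a placeholder.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a stem with drums and bass already removed (e.g. separated beforehand with Demucs).
y, sr = librosa.load("base_track_no_drums_bass.wav")

# Compute the chromagram: energy per pitch class (C, C#, ..., B) over time.
chroma = librosa.feature.chroma_stft(y=y, sr=sr)

# Visualize which pitch classes dominate the track over time.
fig, ax = plt.subplots(figsize=(10, 4))
img = librosa.display.specshow(chroma, y_axis="chroma", x_axis="time", sr=sr, ax=ax)
fig.colorbar(img, ax=ax)
ax.set(title="Chromagram of the reference track (drums and bass removed)")
plt.show()
```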

In summary, we can now connect the pieces to understand the underlying process when asking MusicGen to perform melody-conditioned generation. Here's a visual representation of the workflow:

How MusicGen produces a melody-conditioned music output. Image by creator.

While MusicGen shows promising advancements in melody conditioning, it is important to acknowledge that the technology is still a work in progress. Chromagrams, even with drums and bass removed, offer an imperfect representation of a track's melody. One limitation is that chromagrams categorize all notes into the 12 western pitch classes, meaning they capture the transition between two pitch classes but not the direction (up or down) of the melody.

For example, the melodic interval of moving from C4 up to G4 (a perfect fifth) differs significantly from moving from C4 down to G3 (a perfect fourth). However, in a chromagram, both intervals would look the same. The problem gets worse with octave jumps, as the chromagram would indicate that the melody stayed on the same note. Consider how a chromagram would misinterpret the emotional octave jump performed by Céline Dion in "My Heart Will Go On" during the line "wher-e-ver you are" as a static melodic movement. To demonstrate this, just look at the chromagram for the chorus of A-ha's "Take On Me", below. Does this reflect your idea of the song's melody?
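A tiny example makes the octave ambiguity obvious: a chroma representation folds MIDI note numbers into 12 pitch classes, so an upward fifth, a downward fourth, and even an octave jump become indistinguishable.

```python
# Why chromagrams lose octave information: MIDI note numbers fold into 12 pitch classes.
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(midi_note: int) -> str:
    """Map a MIDI note number to its pitch class (octave information is discarded)."""
    return PITCH_CLASSES[midi_note % 12]

c4, g3, g4, c5 = 60, 55, 67, 72

print(pitch_class(c4), pitch_class(g4))  # C G  -> upward perfect fifth
print(pitch_class(c4), pitch_class(g3))  # C G  -> downward perfect fourth, looks identical
print(pitch_class(c4), pitch_class(c5))  # C C  -> octave jump looks like no movement at all
```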

A chromagram of the chorus in "Take On Me" (A-ha), bass and drums removed. Image by creator.

Another challenge is the inherent bias of the chromagram. It performs well at capturing the melody of some songs while completely missing the mark on others. This bias is systematic rather than random: songs with dominant melodies, minimal interval jumps, and unison playing are better represented by the chromagram than songs with complex melodies spread across multiple instruments and featuring large interval jumps.

Moreover, the limitations of the generative AI model itself are worth noting. The output audio still exhibits noticeable differences from human-made music, and maintaining a consistent style over a six-second interval remains a struggle. Furthermore, MusicGen falls short of faithfully capturing the more intricate elements of the text prompt, as evidenced by the examples provided earlier. It will require further technological advancements for melody-conditioned generation to reach a level where it can be used not just for amusement and inspiration, but also for generating end-user-friendly music.


How can we improve the AI?

From my perspective, one of the primary concerns that future research on melody-conditioned music generation should address is the extraction and representation of "the melody" of a track. While the chromagram is a well-established and straightforward signal processing method, there are plenty of newer, experimental approaches that use deep learning for this purpose. It would be exciting to see companies like Meta draw inspiration from these advancements, many of which are covered in a comprehensive 72-page review by Reddy et al. (2022).

Regarding the quality of the model itself, both the audio quality and the comprehension of text inputs can be enhanced by scaling up the model and the training data, as well as by developing more efficient algorithms for this specific task. In my view, the release of MusicLM in January 2023 resembles a "GPT-2 moment." We are starting to see the capabilities of these models, but significant improvements are still needed in various respects. If this analogy holds, we can anticipate the release of a music generation model comparable to GPT-3 sooner than we might expect.

How does this impact musicians?

As is often the case with generative music AI, concerns arise about the potential negative impact on the work and livelihoods of music creators. I expect that, in the future, it will become increasingly difficult to earn a living by creating variations of existing melodies. This is especially evident in scenarios such as jingle production, where companies can effortlessly generate numerous variations of a signature jingle melody at minimal cost for new ad campaigns or personalized advertisements. Undoubtedly, this poses a threat to musicians who rely on such activities as a significant source of income. I reiterate my plea for creatives producing music valued for its objective musical qualities rather than its subjective, human qualities (such as stock music or jingles) to explore alternative income sources to prepare for the future.

On the positive side, melody-conditioned music generation is an incredible tool for enhancing human creativity. If someone comes up with a catchy and memorable melody, they can quickly generate examples of how it might sound in various genres. This process can help identify the right genre and style to bring the music to life. Furthermore, it offers an opportunity to revisit past projects in one's music catalogue and explore their potential when translated into different genres or styles. Finally, this technology lowers the entry barrier for creatively inclined individuals without formal musical training. Anyone can now come up with a melody, hum it into a smartphone microphone, and share remarkable arrangements of their ideas with friends and family, or even try to reach a wider audience.

The question of whether AI music generation is beneficial to our societies remains open for debate. However, I firmly believe that melody-conditioned music generation is one of the use cases of this technology that genuinely enhances the work of both professional and aspiring creatives. It adds value by offering new avenues for exploration. I'm eagerly looking forward to further advancements in this field in the near future.
