Exploring Music Transcription with Multi-Modal Language Models


Using Qwen2-Audio to transcribe music into sheet music

Image by creator

Automatic music transcription is the process of converting audio files like MP3 and WAV into sheet music, guitar tablature, or any format a musician might want in order to learn a song on their instrument.

We’ll go over the best current tools for doing this, which happen to be deep learning-based, and a novel approach to the problem.

The current state of the art for this task comes from Magenta, an open-source research project developed by the now defunct (as of April 2023) Google Brain Team.

They released a paper, Sequence-to-Sequence Piano Transcription with Transformers, in 2021 which used a T5-inspired transformer model (similar to “t5-small”) with 54 million parameters and the Maestro dataset, achieving great results. The problem is approached as a sequence-to-sequence task using an encoder-decoder Transformer architecture. The encoder processes mel spectrogram frames as input and produces embeddings, while the decoder uses these embeddings via cross-attention to autoregressively generate a sequence of MIDI-like tokens. Their vocabulary consisted of four types of tokens (sketched in code after the list):

  • Note tokens (128 values for MIDI pitches)
  • Velocity tokens (128 values including zero for note-off)
  • Time tokens (6,000 values in 10ms bins for absolute timing)
  • EOS token (to mark sequence end)
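As a rough illustration, the vocabulary can be thought of as contiguous integer ranges, with each note event encoded as a handful of tokens. The block layout and ids below are assumptions for illustration, not the paper’s exact scheme:

# Assumed layout for illustration only: the block sizes match the vocabulary
# described above, but the ordering and ids are not necessarily the paper's.
TIME_TOKENS = 6000       # absolute time in 10ms bins
VELOCITY_TOKENS = 128    # velocity 0 doubles as note-off
NOTE_TOKENS = 128        # one per MIDI pitch
EOS_TOKEN_ID = TIME_TOKENS + VELOCITY_TOKENS + NOTE_TOKENS  # final id marks end of sequence

def encode_note_event(time_ms: int, velocity: int, pitch: int) -> list[int]:
    # One note event becomes a [time, velocity, pitch] triple of token ids
    time_id = time_ms // 10
    velocity_id = TIME_TOKENS + velocity
    pitch_id = TIME_TOKENS + VELOCITY_TOKENS + pitch
    return [time_id, velocity_id, pitch_id]

# e.g. a C4 (MIDI pitch 60) played at 1.5s with velocity 80:
# encode_note_event(1500, 80, 60) -> [150, 6080, 6188]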

See the image below for a visualisation of the architecture and an example sequence of their custom MIDI tokens:

Figure 1. from Sequence-to-Sequence Piano Transcription with Transformers paper

Our model is a generic encoder-decoder Transformer architecture where each input position contains a single spectrogram frame and each output position contains an event from our MIDI-like vocabulary. Output tokens are autoregressively sampled from the decoder, at each step taking the token with maximum probability.

In 2022, they released a paper, MT3: Multi-Task Multitrack Music Transcription. This used the same approach as the previous work but added additional instrument tokens to represent different instruments. Again, they used a similar T5 model and achieved strong performance on most of the datasets it was trained on, notably Slakh, Maestro and MusicNet.

MR-MT3 was released the following year as a slight improvement to MT3.

Compute/GPU resources

Huge resources were needed to train this from scratch, despite the model being much smaller than even the smallest language models. The 2021 paper noted:

“We trained all models on 32 TPUv3 cores, resulting in a per-core batch size of 8. Based on validation set results, overfitting did not seem to be a problem, so we allowed training to progress for 400K steps, which took about 2.5 days for our baseline models.”

The MT3 paper doesn’t provide the same level of detail on training, stating only that they train for 1 million steps.

Other limitations

These models have some inherent limitations in their output flexibility. While language models typically have large vocabularies (often 30,000+ tokens) that are extensively pre-trained on diverse natural language data, MT3 and similar music transcription models use a much smaller, specialised token vocabulary (only a few thousand tokens) focused solely on musical events. This specialisation means that adding new tokens, such as for new instruments or playing techniques like palm muting on guitars or pizzicato on violins, is likely difficult: it requires significant retraining to integrate these new tokens effectively with the existing vocabulary, and often requires substantial training data demonstrating these techniques. This differs from large language models, which can often describe such musical nuances in natural language without modification, as they’ve encountered these concepts during their broad pre-training.

Transfer learning and zero-shot

We can leverage transfer learning from large open-source pre-trained audio and language models. Examples of music generation models include OpenAI’s Jukebox and Meta’s MusicGen.

GPT-4o is designed to handle text, audio and images “natively”. Although OpenAI has not released the technical details, it’s assumed that some weights in the network process all modalities. It’s possible that the model uses a decoder-only architecture, like language-only GPT models, without the need for encoder components to convert different modalities to a dense representation first. This design allows the model to seamlessly process and interpret inputs like text and images together, potentially offering performance advantages both computationally and in terms of model understanding.

Many multi-modal models take a simpler approach reminiscent of the encoder-decoder architecture: they combine two pre-trained models — an encoder for the specific input modality (like ViT for vision or an audio encoder for sound) and a Large Language Model (such as LLaMA, Gemma, or Qwen). These models are connected through projection layers that align their representations in a shared latent space, often using just a single linear layer. These projection layers learn to convert the encoder’s output into a format that matches the LLM’s expected input dimensions and characteristics. The projection creates new embeddings/tokens from the input modality that can then be injected into the LLM’s input sequence. LLaVA is a prime example of this architecture for vision-language tasks, while Spotify’s Llark and Qwen-Audio apply the same principle using audio encoders instead of vision encoders.

Here’s some pseudocode showing how the models are stitched together:

# Extract features from final layer of audio encoder
# Shape: [batch_size, audio_seq_len, encoder_dim=1024]
audio_features = audio_model(audio_input)

# Project audio features to match LLM's embedding dimension
# Shape: [batch_size, audio_seq_len, llm_embed_dim=4096]
audio_embeddings = projection_layer(audio_features)

# Get text embeddings from LLM's embedding layer
# Shape: [batch_size, text_seq_len, llm_embed_dim=4096]
text_embeddings = llm.embed_text(text_input)

# Concatenate along sequence length dimension
# Shape: [batch_size, audio_seq_len + text_seq_len, llm_embed_dim=4096]
combined_input = concatenate([audio_embeddings, text_embeddings], dim=1)

# Feed them into the LLM as normal for generation
output = llm(combined_input)
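As a concrete (but illustrative) version of the projection step above, a single linear layer in PyTorch is often all that’s needed. The dimensions here are assumptions matching the pseudocode, not any particular model’s:

import torch
import torch.nn as nn

class AudioProjection(nn.Module):
    """Minimal sketch: map audio-encoder features into the LLM's embedding space."""
    def __init__(self, encoder_dim: int = 1024, llm_embed_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_embed_dim)

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # [batch, audio_seq_len, encoder_dim] -> [batch, audio_seq_len, llm_embed_dim]
        return self.proj(audio_features)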

Overview of architecture

Llark uses OpenAI’s Jukebox and Qwen2-Audio uses OpenAI’s Whisper for the audio towers. Jukebox is a music generation model, but it can also take audio clips as input and output a continuation of the clip. Whisper is used for transcribing voice to text.

Given their purpose, the choice of audio module is clear: Llark specialises in music analysis, while Qwen2-Audio primarily focuses on responding to voice instructions with some basic audio and music analysis capabilities.

Determining the optimal source for extracting embeddings from large pre-trained models involves research and experimentation. Additionally, deciding whether to fine-tune the entire module or freeze parts of it is an important design choice. For instance, LLaVA’s training strategy involves freezing the vision tower and focusing on fine-tuning the projection layer and language model. We’ll go over this aspect of each model below.

Llark: why Jukebox? Are these embeddings the best available as of September 2024?

Determining the optimal location to extract embeddings from large models typically requires extensive probing. This involves testing various activations or extracted layers of the model on different classification tasks through a process of trial and error. For music generation models, this could include tasks like genre recognition, instrument detection and emotion detection, as well as analysis of harmonic structures and temporal patterns. Many commercial embedding models (like OpenAI’s embedding models) are trained specifically for embedding generation with specialised architectures and training objectives, rather than being fine-tuned versions of existing language models.

The two largest publicly available music generation and music continuation (i.e. able to take audio as input) models are Jukebox and MusicGen. MusicGen is newer and faster, and therefore seemed like the obvious choice to me. However, according to this paper on probing MusicGen, embeddings extracted from Jukebox appear to outperform MusicGen on average in classification tasks. The findings from this paper led the authors of Llark to use the following approach for extracting embeddings:

  1. Embeddings are derived from the output of the 36th layer of the Jukebox encoder, following the approach described in Castellon et al. (2021)
  2. Original Jukebox encoding:
    * 4800-dimensional vectors at 345Hz
    * For a 25s clip: over 4.14 * 10⁷ floating-point values
  3. The authors use a downsampling approach: mean-pooling within 100ms frames (see the pooling sketch below), resulting in:
    * Downsampled frequency: 10Hz
    * Embedding size: 1.2 × 10⁶ for a 25s audio clip. This means a 2D array with shape [240, 4800].
    * Retains temporal information (unlike Castellon et al., who average over the time dimension)

(The downsampled embedding size is roughly 6x larger than the CLIP ViT-L14 models used in many multimodal vision models.)
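Here’s a minimal sketch of that mean-pooling step, assuming a [time_steps, 4800] array of Jukebox activations (the exact frame counts and trimming behaviour are assumptions):

import numpy as np

def downsample_embeddings(embeddings: np.ndarray,
                          source_hz: float = 345.0,
                          window_ms: float = 100.0) -> np.ndarray:
    # embeddings: [time_steps, 4800] activations from Jukebox's 36th layer
    frames_per_window = int(round(source_hz * window_ms / 1000.0))  # ~34 frames per 100ms
    n_windows = embeddings.shape[0] // frames_per_window
    trimmed = embeddings[: n_windows * frames_per_window]
    # Mean-pool within each 100ms window -> roughly 10 rows per second of audio
    return trimmed.reshape(n_windows, frames_per_window, -1).mean(axis=1)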

Qwen2Audio: Whisper

The embedding extraction for Qwen2-Audio isn’t covered in detail in the paper. Whisper is an encoder-decoder architecture where the encoder generates deeply learned representations of the audio and the decoder decodes those representations to text (the transcription). In Qwen2-Audio, it appears they extract embeddings from the final layer of Whisper’s encoder, although they don’t mention whether they freeze it during training.
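As an illustrative sketch (not Qwen2-Audio’s actual code), pulling final-layer encoder features from Whisper via the transformers library could look like this, where audio_array is a placeholder for a 16kHz mono waveform and the model size is an arbitrary choice:

import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-large-v3")
whisper = WhisperModel.from_pretrained("openai/whisper-large-v3")

# Convert raw audio to log-mel input features, then run only the encoder
inputs = feature_extractor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    encoder_outputs = whisper.encoder(inputs.input_features)
audio_features = encoder_outputs.last_hidden_state  # [batch, seq_len, hidden_dim]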

Pre-trained weights, training data and datasets

Unfortunately, Spotify has not provided any datasets or their trained model weights to the public, noting:

“With respect to inputs: the inputs to our model are public, open-source, Creative Commons-licensed audio and associated annotations. However, each individual audio file can have its own, potentially more restrictive license. Many of the audio files include “no derivatives” licenses. We encourage users of the datasets to familiarize themselves with the restrictions of these licenses; in order to honor such licenses, we do not release any derivatives from the training data in this paper (including query-response pairs or trained model weights).”

They used the next datasets:

  • MusicCaps (Agostinelli et al., 2023)
  • YouTube8M-MusicTextClips (McKee et al., 2023)
  • MusicNet (Thickstun et al., 2017)
  • FMA (Defferrard et al., 2017)
  • MTG-Jamendo (Bogdanov et al., 2019)
  • MagnaTagATune (Law et al., 2009)

Llark details its training data generation process in the following extract:

“We use variants of ChatGPT to extract the instruction-tuning data for all experiments. However, the exact language model used varies by dataset. We select the OpenAI model as follows: We use GPT-4 for all reasoning tasks. We found that GPT-4 was much more adept at following the complex instructions in the Reasoning task family. For datasets with more than 25k samples, we limit Reasoning data to a random subsample of 25k tracks.”

This results in Q&A data like this:

Example text inputs and outputs from LLark, for the provided audio.

The datasets used for training Qwen2-Audio are not shared either, but the trained model is widely available and is also implemented in the transformers library:
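(A minimal loading sketch; generation details such as chat templating and audio preprocessing are omitted here.)

from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")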

For this project, fine-tuning from a pre-trained Llark model would have been optimal, given its reportedly good performance against the evaluation benchmarks Spotify stated in the paper.

However, given they didn’t release the weights for it, it’s unfeasible to start training a model like this from scratch without a good bit of expertise and money. Spotify trained it on:

Our model is trained on 4 80GB NVIDIA A100 GPUs. Training takes roughly 54 hours.

This could cost around $700 using a provider like LambdaLabs.

Because of the above, I went with Qwen. However, Qwen2-Audio doesn’t perform that well on basic music tasks like tempo and instrument detection. I detail this in the evaluation section below. This means the model is probably not large enough or pre-trained enough to achieve this task, but my hope is that I can at least set a starting point and framework for fine-tuning on this task in the future. As Alibaba state in their Qwen2-Audio blog post:

We also plan to build larger Qwen2-Audio models to explore the scaling laws of audio language models.

For my own learning though, I did have a go at re-creating the model using torch and pre-trained models from the transformers library.

I also created datasets for Q&A data and embeddings. I generated short-form Q&A data for the URMP dataset, e.g.: “What is the tempo of this track?”, “What instruments are playing in this audio?”.

Here’s a notebook for running Jukebox in a Colab environment to take advantage of the cheap T4 GPUs. I uploaded both the Q&A and embeddings datasets to Hugging Face here.

Here’s a notebook with Llark replicated.

Transcription format

I chose ABC music notation as the output format that the language model is expected to transcribe the music in. Here’s an example of it:

X:1
M:4/4
L:1/16
K:none
Q:67

V:1 name="Electric Bass (finger)"
%%octave-default C4
GAA^2E3A2A^4 A^2E2 A2A^4A^2 E2 | A2A^4 |

V:2 name="Vivid Acoustic Piano"
%%octave-default C5
[E3C3][E3C3][E3C3] [E3C3][A^,2E2A^2] | [E3A^3][E3A^3][E3A^3][E3A^3][E3A^3] |
[E3A^3][E3A^3][E3A^3] [E3A^3][E3A^3] | [E3A^3][E3A^3][E3A^3][E3A^3][E3A^3] |
[E3A^3][E3A^3][E3A^3] [E3A^3][E3A^3] | [E3A^3] |

V:3 name="Electric Guitar (jazz)"
%%octave-default C5
E'3C'3A^4E'3C'3 | A^4E'3 C'3A^4E'3C'3 | A^4 E'3C'3A^4 E'3C'3 | A^4E'3C'3A^4E'3C'3 |
A^4E'3C'3 A^4E'3C'3 | A^4 |

In this notation we have the time signature and tempo defined at the top, denoted by ‘M’ and ‘Q’. The ‘L’ indicates the default note length of the notation, in this case a sixteenth note, which is the norm. We then define each instrument and the default octave it should adhere to when writing its notes. Here’s a summary of the key syntactical points for writing notes in ABC music notation:

  • Notes are represented by letters A-G, with lowercase letters indicating higher octaves
  • Sharps are denoted by ^ before the note, flats by _
  • Natural signs are represented by =
  • Note length is indicated by numbers after the note (C2 is twice as long as C)
  • Dotted notes use a . after the note (C. is a dotted quarter note)
  • Rests are represented by z, with numbers for duration (z2 is a half rest)
  • Chords are enclosed in square brackets [CEG]
  • Ties are shown with a hyphen (-)
  • Bar lines are represented by |
  • Broken rhythms use > or < between notes (C>D means dotted-C eighth note followed by D sixteenth note)

Why ABC?

The reasons for choosing this notation are:

  1. It’s a minimalist format for writing music
  2. It’s widely used and popular; language models already have good comprehension of ABC notation due to extensive pre-training on it.
  3. It’s flexible and can easily be extended to include tempo changes, time signature changes, additional playing styles like those mentioned above, etc.

I converted the MIDI files provided by the datasets to ABC notation using this library. A notebook for creating the datasets is here.

To evaluate both the original model and each stage of fine-tuning I performed thereafter, I randomly selected 30 samples of varying complexity from the URMP dataset and ran the model three times on each sample, manually examining all responses.

Through manual testing, I found the optimal decoding parameters to be a temperature of 0.7 and a top_p of 1.2. The maximum number of tokens to return was capped at 2048. Adjusting the max seemed to make little difference to performance.

The original model performed poorly on this evaluation set. While it occasionally predicted the tempo and instruments correctly, it mostly failed to do so. A text file with the evaluation results is available here.

Given this starting point, it’s unlikely that we’ll see strong results from this experiment without a robust pre-trained model. However, the goal is to develop strategies that can be applied in the future as more advanced pre-trained models become available.

I first attempted fine-tuning with basic cross-entropy loss. Supervised fine-tuning with cross-entropy loss is a quick way to start teaching the model, but a basic loss function like this has limitations, as we will see below. The intuition behind this stage of training is that it would nudge the model in the right direction, and it would pick up any patterns or any customised ABC notation the dataset may have which the model has not seen before.

Cross-entropy loss with teacher forcing

First, we trained it in the typical supervised fine-tuning manner for language models. I used the SFTTrainer from the trl library for this, which uses cross-entropy loss with teacher forcing, defined step by step below (a minimal setup sketch follows the list):

  1. The model predicts the next token in the sequence.
  2. The loss is calculated based on the difference between the predicted probabilities (logits) and the actual next token.
  3. For the next prediction, the model is given the actual correct token (ground truth), rather than its own prediction. This is known as teacher forcing; it helps stabilise training and significantly speeds it up, especially in the early stages.
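As a rough sketch of the setup (argument names vary across trl versions, and the model and train_dataset objects here are placeholders):

from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

peft_config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,                      # the pre-trained model being fine-tuned
    train_dataset=train_dataset,      # prompts paired with target ABC transcriptions
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="./sft-abc",
        learning_rate=2e-5,
        max_grad_norm=1.0,            # gradient clipping
        num_train_epochs=1,
    ),
)
trainer.train()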

The results from this training phase were poor. It degraded the performance of the original model. The model, which previously handled tempo and instrument recognition well, now mostly got these wrong. It also began producing garbled text output with endless repetition. This occurred even when setting a low learning rate, applying gradient clipping, and using low LoRA ranks to mitigate large changes to the model. Overall, the model seemed very sensitive to the training applied.

However, while this training phase may offer some improvements, it won’t lead to optimal performance due to the limitations of our basic loss function. This function struggles to fully capture the nuances of the model’s performance. For example, when using teacher forcing, instrument predictions can yield deceptively low loss across certain token sections. If an instrument name begins with “V”, the model might confidently predict “Violin” or “Viola” based on our dataset, regardless of accuracy. Additionally, the loss function may not accurately reflect near-misses, such as predicting a tempo of 195 instead of 200: a small difference that is reasonably accurate but potentially penalised heavily, depending on the distribution of probabilities among the logits. It’s possible that neighbouring numbers also have high probabilities.

RLHF with PPO

Because of these limitations, we can create our own custom loss function that can more accurately score the response from the model. That is, given a predicted sequence from the model, the loss function could give it a score between 0 and 1 on how good it is.

However, integrating this custom loss function into supervised fine-tuning presents a significant challenge. The problem stems from the non-differentiability introduced by the custom loss function, which prevents the direct calculation of gradients. Let’s break this down:

In traditional SFT with cross-entropy loss:

  • The model outputs logits (raw scores) for each token in its vocabulary
  • These logits directly represent the model’s prediction probabilities
  • The loss function compares these probabilities to the ground truth
  • Gradients can be computed directly through this comparison
  • The chain rule of calculus allows us to propagate these gradients back through the model

With our custom loss function:

  • The model must first generate complete text output
  • This generation process involves sampling from probability distributions
  • Our loss function then analyses this text output (checking tempo, notes, etc.)
  • This creates a non-differentiable step between the model’s logits and our loss calculation
  • The sampling and text analysis steps break the gradient chain needed for backpropagation

To overcome this, reinforcement learning techniques like Proximal Policy Optimisation (PPO) can be employed. PPO is specifically designed to handle non-differentiable loss functions and can optimise the model by considering the entire policy (the model’s output distribution), rather than relying on gradient information from logits.

Note: there are lots of great articles on here explaining PPO!

The key insight of PPO is that instead of attempting to directly backpropagate through the non-differentiable steps, it:

  1. Treats the model’s outputs as actions in a reinforcement learning framework
  2. Uses the custom loss function as a reward signal
  3. Updates the model’s policy (its probability distributions over tokens) to maximise expected reward
  4. Does this while ensuring the updated policy doesn’t deviate too far from the current one

This approach allows us to effectively train the model with the custom loss function, ensuring performance improvements without disrupting the core training dynamics. The PPO algorithm’s conservative update strategy helps maintain stability during training, which is especially important when working with large language models.

Normally, this scoring function would be implemented as a separate LLM in the form of a “reward model”, as is commonly done when fine-tuning models via RLHF, a breakthrough first introduced when ChatGPT came out. Because of the nature of this task, we can manually write code to score the responses, which uses fewer resources and is quicker.
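To make this concrete, here’s a rough sketch of a PPO loop using trl’s classic PPOTrainer API. Class and argument names differ across trl versions, model_id, tokenizer and dataloader are placeholders, and abc_loss stands in for the custom scoring code described below:

import torch
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(learning_rate=2e-5, batch_size=8)
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_id)
ppo_trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)

for batch in dataloader:
    query_tensors = batch["input_ids"]
    # Sample full text completions from the current policy
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=2048)
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    # Reward = 1 - custom loss, so better transcriptions score higher
    rewards = [torch.tensor(1.0 - abc_loss(pred, ref))
               for pred, ref in zip(responses, batch["target_abc"])]
    # PPO step: move the policy towards higher reward without drifting too far
    ppo_trainer.step(query_tensors, response_tensors, rewards)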

For time signature and tempo recognition this is straightforward to calculate. We extract all predicted items with regex, for example extracting the metre:

import re

def extract_metre(self, abc_string):
    return re.search(r'M:(\S+)', abc_string).group(1)  # e.g. "4/4"

The model should learn the syntax and structure we want it to output in the SFT stage. If it outputs something that causes our regex to not find anything or to error, we can just skip that sample, assuming it makes up a small minority of the dataset.

We extract the predicted tempo and write a function that is more forgiving of small errors but penalises larger errors more heavily:

  • For small differences (≤10 BPM), it uses linear scaling.
  • For larger differences, it switches to exponential scaling.
  • The final loss is capped between 0 and 1.

Let’s break down the key components of this custom loss:

Code for the custom loss is here

1. Metre Loss

The metre loss focuses on the time signature of the piece. It compares the predicted metre with the ground truth, considering both the numerator and denominator individually, as well as their ratio. This approach allows for a nuanced evaluation that can handle various time signatures accurately.

The metre loss uses a mix of linear and exponential scaling to penalise differences. Small discrepancies result in a linear increase in loss, while larger differences lead to an exponential increase, capped at a maximum value of 1.

2. Tempo Loss

Tempo loss evaluates the accuracy of the predicted beats per minute (BPM). Similar to the metre loss, it uses a mix of linear and exponential scaling.

For small tempo differences (≤10 BPM), the function applies linear scaling. Larger differences trigger exponential scaling, ensuring that significant tempo mismatches are penalised more heavily.
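A rough sketch of how this scaling could look (the linked code above is the actual implementation; the constants here are illustrative):

import math

def tempo_loss(predicted_bpm: float, true_bpm: float) -> float:
    diff = abs(predicted_bpm - true_bpm)
    if diff <= 10:
        # Linear region: small mistakes are only mildly penalised
        loss = 0.05 * diff                                # reaches 0.5 at 10 BPM off
    else:
        # Exponential region: larger mistakes approach the cap quickly
        loss = 0.5 + 0.5 * (1 - math.exp(-(diff - 10) / 20))
    return min(max(loss, 0.0), 1.0)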

3. Pitch Loss

The pitch loss is perhaps the most crucial component, as it assesses the accuracy of the transcribed notes. This function uses the Levenshtein distance to compare the sequence of notes in each voice.

The pitch loss calculation accounts for multiple voices, matching each predicted voice to the closest ground-truth voice. This approach allows for flexibility in voice ordering while still maintaining accuracy in the overall pitch content.
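A simplified sketch of the per-voice comparison, using a normalised edit distance over note sequences (the real function also handles voice matching and ABC parsing):

def edit_distance(a: list[str], b: list[str]) -> int:
    # Classic Levenshtein distance with a rolling one-row table
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev = dp[0]
        dp[0] = i
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (x != y))      # substitution (free if notes match)
            prev = cur
    return dp[-1]

def voice_pitch_loss(pred_notes: list[str], true_notes: list[str]) -> float:
    # Normalise by the longer sequence so the loss stays in [0, 1]
    if not pred_notes and not true_notes:
        return 0.0
    return edit_distance(pred_notes, true_notes) / max(len(pred_notes), len(true_notes))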

4. Instrument Loss

The instrument loss evaluates the accuracy of instrument selection for each voice.

This function considers exact matches and instruments from the same family, and uses string similarity for more nuanced comparisons. It provides a comprehensive assessment of how well the model identifies and assigns instruments to each voice.
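A hedged sketch of this kind of scoring, using difflib for the string-similarity fallback (the family groupings and partial-credit values are made up for illustration):

from difflib import SequenceMatcher

INSTRUMENT_FAMILIES = {
    "violin": "strings", "viola": "strings", "cello": "strings",
    "flute": "woodwind", "clarinet": "woodwind", "oboe": "woodwind",
}

def instrument_loss(predicted: str, truth: str) -> float:
    pred, true = predicted.lower().strip(), truth.lower().strip()
    if pred == true:
        return 0.0                        # exact match
    same_family = (INSTRUMENT_FAMILIES.get(pred) is not None
                   and INSTRUMENT_FAMILIES.get(pred) == INSTRUMENT_FAMILIES.get(true))
    if same_family:
        return 0.3                        # partial credit for the right family
    # Otherwise fall back to string similarity, e.g. "Electric Guitar (jazz)" vs "Electric Guitar"
    return 1.0 - SequenceMatcher(None, pred, true).ratio()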

5. Combining the Losses

The final loss is a weighted combination of these individual components:

total_loss = (0.5 * pitch_loss +
              0.15 * metre_loss +
              0.15 * tempo_loss +
              0.2 * instrument_loss)

This weighting scheme prioritises pitch accuracy while still considering other important aspects of music transcription.

PPO training generally requires a lot more memory than SFT, for a few reasons:

  1. Multiple policy evaluations — PPO needs to maintain both the current policy (model weights) and an “old” policy to compute the probability ratio between them. This effectively doubles the model parameters in memory.
  2. Experience buffer — PPO stores a buffer of experiences (states, actions, rewards, etc.) to perform updates in mini-batches. This buffer can be quite large and takes significant memory.
  3. Advantage estimation — Computing advantages requires keeping track of value estimates and returns across trajectories, adding another layer of memory overhead.
  4. Additional optimisation objectives — PPO tracks multiple loss components (policy loss, value loss, entropy bonus) and their gradients, whereas SFT has a single loss.

Because of the above, we are more limited than with SFT in the size of the models we can train and how much it costs. Whereas I could do the above training on an A100 40GB in Colab, the PPO training needed more memory. I trained on an H100 80GB, which could train a LoRA with a rank of 128 and a batch size of 8.

My hyperparameter sweep was narrow: I went with what seemed most intuitive, using batch sizes ranging from 1 to 16 and learning rates from 2e-5 to 2e-4.

The model made no improvements on the task. The text file with the results is here.

I tracked various training metrics using Weights & Biases (WandB). Key metrics included the policy loss, value loss, total loss, KL divergence, and the reward model’s score.

For all hyperparameter runs, the logs showed no improvement in the rewards or the loss over time. The KL divergence remained within the pre-defined threshold.
