Audio embeddings for music recommendation?
Streaming platforms (Spotify, Apple Music, etc.) need to be able to recommend new songs to their users. The better the recommendations, the better the listening experience.
There are various ways these platforms can build their recommendation systems. Modern systems mix different recommendation methods together into a hybrid structure.
Think back to when you first joined Spotify: you were probably asked which genres you like. Based on the genres you choose, Spotify recommends some songs. Recommendations based on song metadata like this are known as content-based filtering. Collaborative filtering can also be used, which groups together users who behave similarly and then transfers recommendations between them.
The two methods above lean heavily on user behaviour. Another method, which is increasingly being used by large streaming services, is to use deep learning to represent songs in learned embedding spaces. This allows songs to be represented in a high-dimensional embedding space that captures rhythm, timbre, texture, and production style. Similarity between songs can then be computed easily, which scales better than classical collaborative filtering approaches when dealing with hundreds of millions of users and tens of millions of tracks.
With the rise of LLMs, word and phrase embeddings have become mainstream and are relatively well understood. But how do embeddings work for songs, and what problem are they solving? The rest of this post focuses on how audio becomes a model input, what architectural choices encode musical features, how contrastive learning shapes the geometry of the embedding space, and how a song recommender system built on an embedding might work in practice.
How does audio become an input to a neural network?
Raw audio files like MP3 are fundamentally a waveform, a rapidly varying time series. Learning from these files directly is possible, but it is usually data-hungry and computationally expensive. We can instead convert .mp3 files into mel-spectrograms, which are much better suited as inputs to a neural network.
Mel-spectrograms are a way of representing an audio file's frequency content over time, adapted to how humans perceive sound. A mel-spectrogram is a 2D representation where the x-axis corresponds to time, the y-axis corresponds to mel-scaled frequency bands, and every value represents the log-scaled energy in that band at that moment.

The colors and shapes we see on a mel-spectrogram can tell us meaningful musical information. Brighter colors indicate higher energy at that frequency and time, and darker colors indicate lower energy. Thin horizontal bands indicate sustained pitches and often correspond to sustained notes (vocals, strings, synth pads). Tall vertical streaks indicate energy across many frequencies at once, concentrated in time; these can represent drum snares and claps.
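As a concrete illustration, here is a minimal sketch of how an MP3 could be turned into such a representation. It assumes the librosa library (the exact tooling isn't prescribed here), and the function name and parameters are illustrative:

import librosa
import numpy as np

def mp3_to_mel_spectrogram(path, sr=22050, n_mels=128):
    """Load an audio file and convert it to a log-scaled (dB) mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    # Power mel-spectrogram with shape (n_mels, n_frames)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    # Convert power to decibels; with ref=np.max the values fall roughly in [-80, 0]
    return librosa.power_to_db(mel, ref=np.max)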
Now we can begin to think about how convolutional neural networks can learn to recognise features of these audio representations. At this point, the key challenge becomes: how do we train a model to recognise that two short audio excerpts belong to the same song without labels?
Chunking and Contrastive Learning
Before we jump into the architecture of the CNN we used, we will take some time to discuss how we load the spectrogram data into the network, and how we set up the loss function without labels.
At a very high level, we feed the spectrograms into the CNN, a lot of matrix multiplication happens inside, and we are left with a 128-dimensional vector that is a latent representation of the physical features of that audio file. But how do we set up the batching and loss so that the network is able to judge which songs are similar?
Let's start with the batching. We have a dataset of songs (from the FMA small dataset) that we have converted into spectrograms. We make use of the tensorflow.keras.utils.Sequence class to randomly select 8 songs from the dataset. We then randomly "chunk" each spectrogram by selecting a 128 x 129 rectangle, which represents a small portion of each song, as depicted below.

This means that every batch we feed into the network has shape (8, 128, 129, 1) (batch size, mel frequency dimension, time chunk, channel dimension). By feeding chunks of songs instead of whole songs, the model sees different parts of the same songs across training epochs. This prevents the model from overfitting to a particular moment in each track. Using short samples from each song also encourages the network to learn local musical texture (timbre, rhythmic density) rather than long-range structure.
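To make this concrete, here is a minimal sketch of how such a batching class might look; the class name and internal details are illustrative rather than the exact implementation used:

import numpy as np
import tensorflow as tf

class ChunkSequence(tf.keras.utils.Sequence):
    """Yields batches of random 128 x 129 chunks taken from precomputed mel-spectrograms."""

    def __init__(self, spectrograms, batch_size=8, chunk_frames=129):
        self.spectrograms = spectrograms   # list of (128, n_frames) arrays
        self.batch_size = batch_size
        self.chunk_frames = chunk_frames

    def __len__(self):
        return len(self.spectrograms) // self.batch_size

    def __getitem__(self, idx):
        # Randomly pick songs, then a random time offset within each one
        chosen = np.random.choice(len(self.spectrograms), self.batch_size, replace=False)
        chunks = []
        for i in chosen:
            spec = self.spectrograms[i]
            start = np.random.randint(0, spec.shape[1] - self.chunk_frames)
            chunks.append(spec[:, start:start + self.chunk_frames])
        # Resulting batch shape: (batch_size, 128, 129, 1)
        return np.expand_dims(np.stack(chunks), axis=-1)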
Next, we make use of a contrastive learning objective. Contrastive loss was introduced in 2005 by Chopra et al. to learn an embedding space where similar pairs (positive pairs) have a low Euclidean distance, and dissimilar pairs (negative pairs) are separated by at least a certain margin. We use a similar concept by applying the InfoNCE loss.
We create two stochastic "views" of each batch. What this really means is that we create two augmentations of the batch, each with random, normally distributed noise added. This is done simply, with the following function:
@tf.function
def augment(x):
    """Tiny time-frequency noise."""
    noise = tf.random.normal(shape=tf.shape(x), mean=0.0, stddev=0.05)
    # Mel spectrograms in dB typically lie in the range [-80, 0]
    return tf.clip_by_value(x + noise, -80.0, 0.0)
Embeddings of the same audio sample should be more similar to each other than to embeddings of every other sample in the batch.
So for a batch of size 8, we compute the similarity of every embedding from the first view with every embedding from the second view, resulting in an 8×8 similarity matrix.
We define the two L2-normalised augmented batches as
\[ z_i, z_j \in \mathbb{R}^{N \times d} \]
Each row (a 128-D embedding, in our case) of the two batches is L2-normalised, that is,
\[ \Vert z_i^{(k)} \Vert_2 = 1 \]
We can then compute the similarity of every embedding from the first view with every embedding from the second view, resulting in an N×N similarity matrix. This matrix is defined as:
\[ S = \frac{1}{\tau} z_i z_j^T \]
where every element of S is the similarity between the embedding of song k and the embedding of song l across the two augmentations. This can be defined element-wise as:
\[
S_{kl} = \frac{1}{\tau} \langle z_i^{(k)}, z_j^{(l)} \rangle
= \frac{1}{\tau} \cos(z_i^{(k)}, z_j^{(l)})
\]
where \( \tau \) is a temperature parameter. This means that the diagonal entries (the similarity between chunks from the same song) are the positive pairs, and the off-diagonal entries are the negative pairs.
Then for each row k of the similarity matrix, we compute:
\[
\ell_k = -\log
\frac{\exp(S_{kk})}{\sum_{l=1}^{N} \exp(S_{kl})}
\]
This is a softmax cross-entropy loss, where the numerator is the similarity between the positive pair of chunks and the denominator is the sum of all the similarities across the row.
Finally, we average the loss over the batch, giving us the full loss objective:
\[
L = -\frac{1}{N}
\sum_{k=1}^{N}
\log
\frac{
\exp\left(
\frac{1}{\tau}
\langle z_i^{(k)}, z_j^{(k)} \rangle
\right)
}{
\sum_{l=1}^{N}
\exp\left(
\frac{1}{\tau}
\langle z_i^{(k)}, z_j^{(l)} \rangle
\right)
}
\]
Minimising the contrastive loss encourages the model to assign the highest similarity to matching augmented views while suppressing similarity to all other samples in the batch. This simultaneously pulls representations of the same audio closer together and pushes representations of different audio further apart, shaping a structured embedding space without requiring explicit labels.
This loss function is neatly described by the following Python function:
def contrastive_loss(z_i, z_j, temperature=0.1):
    """
    Compute InfoNCE loss between two batches of embeddings.
    z_i, z_j: (batch_size, embedding_dim)
    """
    z_i = tf.math.l2_normalize(z_i, axis=1)
    z_j = tf.math.l2_normalize(z_j, axis=1)
    logits = tf.matmul(z_i, z_j, transpose_b=True) / temperature
    labels = tf.range(tf.shape(logits)[0])
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    return tf.reduce_mean(loss)
Now that we have built some intuition for how we load batches into the model and how minimising our loss function clusters similar sounds together, we can dive into the structure of the CNN.
A simple CNN architecture
We have chosen a fairly simple convolutional neural network architecture for this task. CNNs originated with Yann LeCun and his team, who created LeNet for handwritten digit recognition. CNNs are great at learning to understand images, and we have converted each song into an image-like format that works well with them.
The first convolution layer applies 32 small filters across the spectrogram. At this point, the network is mostly learning very local patterns: things like short bursts of energy, harmonic lines, or sudden changes that often correspond to note onsets or percussion. Batch normalization keeps the activations well-behaved during training, and max pooling reduces the resolution slightly so the model doesn't overreact to tiny shifts in time or frequency.
The second block increases the number of filters to 64 and starts combining those low-level patterns into more meaningful structures. Here, the network begins to pick up on broader textures, repeating rhythmic patterns, and consistent timbral features. Pooling again compresses the representation while keeping the most important activations.
By the third convolution layer, the model is working with 128 channels. These feature maps tend to reflect higher-level aspects of the sound, such as overall spectral balance or instrument-like textures. At this stage, the precise position of a feature matters less than whether it appears at all.

Global average pooling removes the remaining time-frequency structure by averaging each feature map down to a single value. This forces the network to summarize what patterns are present in the chunk, rather than where they occur, and produces a fixed-size vector regardless of input length.
A dense layer then maps this summary into a 128-dimensional embedding. This is the space where similarity is learned: chunks that sound alike should end up close together, while dissimilar sounds are pushed apart.
Finally, the embedding is L2-normalized so that all vectors lie on the unit sphere. This makes cosine similarity easy to compute and keeps distances in the embedding space consistent during contrastive training.
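Putting these blocks together, a minimal Keras sketch of the encoder described above might look like the following. The kernel sizes and pooling settings are assumptions; only the filter counts, the pooling stages, global average pooling, the dense projection, and the L2 normalisation come from the description:

import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(input_shape=(128, 129, 1), embedding_dim=128):
    """CNN encoder: three conv blocks, global average pooling, L2-normalised embedding."""
    inputs = tf.keras.Input(shape=input_shape)

    # Block 1: 32 filters learn local patterns (onsets, harmonic lines)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D()(x)

    # Block 2: 64 filters combine low-level patterns into textures
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling2D()(x)

    # Block 3: 128 filters capture higher-level spectral characteristics
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)

    # Summarise what is present in the chunk, not where it occurs
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(embedding_dim)(x)
    # Project onto the unit sphere so cosine similarity is a simple dot product
    outputs = layers.Lambda(lambda t: tf.math.l2_normalize(t, axis=1))(x)
    return tf.keras.Model(inputs, outputs, name="audio_encoder")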
At a high level, this model learns about music in much the same way that a convolutional neural network learns about images. Instead of pixels arranged by height and width, the input here is a mel-spectrogram arranged by frequency and time.
How do we know the model is any good?
Everything we've talked about so far has been quite abstract. How do we actually know that the mel-spectrogram representations, the model architecture and the contrastive learning have done a decent job of creating meaningful embeddings?
One common way of understanding the embedding space we have created is to visualise it in a lower-dimensional space, one that humans can actually see. This technique is called dimensionality reduction, and it is useful when trying to understand high-dimensional data.

Two techniques we can use are PCA (Principal Component Analysis) and t-SNE (t-distributed Stochastic Neighbor Embedding). PCA is a linear method that preserves global structure, making it useful for understanding the overall shape and major directions of variation in an embedding space. t-SNE is a non-linear method that prioritises local neighbourhood relationships, which makes it better at revealing small clusters of similar points but less reliable for interpreting global distances. Because of this, PCA is better for assessing whether an embedding space is coherent overall, while t-SNE is better for checking whether similar items tend to group together locally.
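As a rough sketch of how these projections could be produced with scikit-learn, assuming the embeddings and genre labels are already available as NumPy arrays (the function name is illustrative):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def plot_projections(embeddings, genre_labels):
    """Project embeddings to 2-D with PCA and t-SNE, coloured by genre."""
    pca_2d = PCA(n_components=2).fit_transform(embeddings)
    tsne_2d = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    for ax, proj, title in zip(axes, (pca_2d, tsne_2d), ("PCA", "t-SNE")):
        for genre in np.unique(genre_labels):
            mask = genre_labels == genre
            ax.scatter(proj[mask, 0], proj[mask, 1], s=5, label=genre)
        ax.set_title(title)
    axes[0].legend(markerscale=3, fontsize="small")
    plt.show()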
As mentioned above, I trained this CNN on the FMA small dataset, which includes genre labels for each song. When we visualise the embedding space, we can group points by genre, which helps us make some statements about the quality of the embedding space.
The two-dimensional projections give different but complementary views of the learned embedding space. Neither plot shows perfectly separated genre clusters, which is expected and actually desirable for a music similarity model.
In the PCA projection, genres are heavily mixed and form a smooth, continuous shape rather than distinct groups. This suggests that the embeddings capture gradual differences in musical characteristics such as timbre and rhythm, rather than memorising genre labels. Because PCA preserves global structure, this indicates that the embedding space is coherent and organised in a meaningful way.
The t-SNE projection focuses on local relationships. Here, tracks from the same genre are more likely to appear near one another, forming small, loose clusters. At the same time, there is still significant overlap between genres, reflecting the fact that many songs share characteristics across genre boundaries.

Overall, these visualisations suggest that the embeddings work well for similarity-based tasks. PCA shows that the space is globally well-structured, while t-SNE shows that locally similar songs tend to group together, both of which are important properties for a music recommendation system. To further evaluate the quality of the embeddings we could also look at recommendation-related evaluation metrics, like NDCG and recall@k.
Turning the project into a usable music recommendation app
Lastly, we will spend some time talking about how we can actually turn this trained model into something usable. To illustrate how a CNN like this might be used in practice, I have created a very simple song recommender web app. The app takes an uploaded MP3 file, computes its embedding, and returns a list of the most similar tracks based on cosine similarity. Rather than treating the model in isolation, I designed the pipeline end-to-end: audio preprocessing, spectrogram generation, embedding inference, similarity search, and result presentation. This mirrors how such a system would be used in practice, where models must operate reliably on unseen inputs rather than curated datasets.
The embeddings from the FMA small dataset are precomputed and stored offline, allowing recommendations to be generated quickly using cosine similarity rather than running the model repeatedly. Chunk-level embeddings are aggregated into a single song-level representation, ensuring consistent behaviour for tracks of different lengths.
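A minimal sketch of that lookup step, assuming the precomputed library embeddings are already L2-normalised and stored as a NumPy array (the function and variable names are illustrative):

import numpy as np

def recommend(query_chunk_embeddings, library_embeddings, track_ids, top_k=10):
    """Aggregate chunk embeddings into one song vector, then rank the library by cosine similarity."""
    # Mean-pool the chunk-level embeddings and re-normalise to unit length
    query = query_chunk_embeddings.mean(axis=0)
    query /= np.linalg.norm(query)

    # With unit-length rows, cosine similarity is just a matrix-vector product
    scores = library_embeddings @ query
    best = np.argsort(scores)[::-1][:top_k]
    return [(track_ids[i], float(scores[i])) for i in best]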
The result is a lightweight web application that demonstrates how a learned representation can be integrated into a real recommendation workflow.
This is a very simple illustration of how embeddings could be used in a real recommendation system, but it doesn't capture the whole picture. Modern recommendation systems combine both audio embeddings and collaborative filtering, as mentioned at the start of this article.
Audio embeddings capture what things sound like and collaborative filtering captures who likes what. A combination of the two, together with additional ranking models, can create a hybrid system that balances acoustic similarity and personal taste.
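As a toy illustration of that idea (not part of this project, and with a made-up weighting scheme), a hybrid score could simply blend the two signals:

import numpy as np

def hybrid_score(audio_similarity, cf_affinity, alpha=0.5):
    """Blend acoustic similarity (what it sounds like) with a
    collaborative-filtering affinity (who likes what)."""
    # alpha controls how much weight the acoustic signal gets
    return alpha * np.asarray(audio_similarity) + (1 - alpha) * np.asarray(cf_affinity)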
Data sources and Images
This project uses the FMA Small dataset, a publicly available subset of the Free Music Archive (FMA) dataset introduced by Defferrard et al. The dataset consists of short music clips released under Creative Commons licenses and is widely used for academic research in music information retrieval.
All schematic diagrams in this article were generated by the author using AI-assisted image generation tools and are used in accordance with the tools' terms, which permit commercial use. The images were created from original prompts and do not reference copyrighted works, fictional characters, or real individuals.
