Generating Song Recommendations
Introduction
Goal
Data Pipeline
Summarization Model
Description Comparison
Musical Feature Comparison
User Flow
Results
Conclusion and Future Works

Jaykumar Patel, Janvi Patel, Aniketh Devarasetty, Malvika Vaidya, Seann Robbins, Tanner Hudnall

Introduction

In this article, we’ll showcase our initial attempt at building a song recommendation model. We’ll give an overview of our dataset and how it was used. We’ll then explain the training of our model, which generates a short list of song recommendations given an input song. Finally, we’ll discuss the limitations of our current model and how we hope to improve it in the future.

While researching, we found that this task has been attempted before. However, most recommendation systems that we researched did not take the meaning behind lyrics into account. For instance, this article looks at the song features provided by the Spotipy library, which doesn’t provide lyrics but only musical features such as “danceability”, “loudness”, “key”, etc. It goes on to perform analysis on those features with techniques such as clustering, t-SNE, and PCA. This tutorial by GeeksforGeeks takes a similar approach.

Also, this article looks at a hybrid of a “content”-based approach (which recommends songs based on what the user listened to in the past) and a “collaborative” approach (which recommends songs based on what other, similar users liked). However, the article’s approach also fails to account for a song’s lyrics or meaning when looking at features.

Moreover, we noticed an issue where Spotify’s recommendation model overemphasizes the user’s listening history when recommending songs. We wanted to address this by focusing specifically on the song and its features (rather than the user’s history) when recommending songs.

Goal

We wanted to capture the meaning of songs in addition to other musical features, something that previous models failed to do. This would include song lyrics as well as the information the Spotify API provides, such as “danceability”, “energy”, “key”, etc. We will refer to this information that the Spotify API provides as “musical features” from here on.

Ideally, our model would be able to analyze lyrics, generate a short description or summary of the input song, and recommend up to 5 songs that are similar to the input song in both description and musical features.

[Figure: model design]

Data Pipeline

We looked in various places for data, such as Kaggle and Hugging Face. However, we only found datasets that included either the lyrics OR the musical features. In order to create an NLP summarization model that could generate a description of an input song, we would need a dataset that paired song lyrics with song descriptions for training, which we also couldn’t find. Moreover, the KNN model would need the musical features. Therefore, the biggest challenge of this project was gathering the data required to train our models.

Here is the final approach of our data pipeline. For each song’s lyrics, we prompted ChatGPT with:

“Given the following, generate a 2–4 sentence summary without including the song name or the artist name in the summary:”

We formulated this prompt to generate descriptions from the intrinsic features of songs, without identifying features (such as the artist name or song name), in order to promote diversity and a wider range of recommendations.
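
As a rough illustration, such a labeling step could be scripted against the OpenAI API. This is a minimal sketch, not our exact code; the model name and the `describe_song` helper are assumptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

PROMPT = (
    "Given the following, generate a 2-4 sentence summary without including "
    "the song name or the artist name in the summary:\n\n{lyrics}"
)

def describe_song(lyrics: str) -> str:
    """Ask the chat model for a short, identity-free description of a song."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any chat model would work
        messages=[{"role": "user", "content": PROMPT.format(lyrics=lyrics)}],
    )
    return response.choices[0].message.content
```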

After feeding our Kaggle dataset through this pipeline, we were left with roughly 5,000 songs, each with the song’s identifying information, musical features, lyrics, and description. This preprocessed dataset forms the basis for developing our two goal models. We will hereafter refer to it as simply “the dataset”.

Summarization Model

One of the biggest tasks was training a summarization model that could be used to generate a song description for the input song. For this, we used the idea of transfer learning.

We used the t5-small model to generate the summary for the lyrics of the user’s input song. T5 is an encoder-decoder model that is pre-trained to perform various tasks such as translation and summarization. All the model needs is an input text with a prefix naming its task, such as “summarize: ”. However, the model does need some fine-tuning to produce the best output for a specific application. So, we trained the t5-small model to summarize a song’s lyrics, using our dataset of ~5,000 English songs and their ChatGPT-generated target summaries. The model was trained for 40 epochs and reached a cross-entropy loss of roughly 2. Although this loss is relatively high, it is a reasonable number for training on only 5,000 samples.
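
A minimal sketch of one fine-tuning step with the Hugging Face transformers library is shown below; the learning rate and sequence lengths are illustrative assumptions, not the values we tuned:

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed learning rate

def training_step(lyrics: str, target_summary: str) -> float:
    # T5 learns its task from a text prefix; the labels drive the
    # cross-entropy loss that the model reports.
    inputs = tokenizer("summarize: " + lyrics, return_tensors="pt",
                       truncation=True, max_length=512)
    labels = tokenizer(target_summary, return_tensors="pt",
                       truncation=True, max_length=128).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```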

The model’s parameters were then saved to a torch file. After the parameters were loaded back into the default pretrained summarizer, the model could generate a summary given some lyrics as input.
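
Continuing the sketch above, saving and restoring the weights and then generating could look like this (the file name and generation settings are assumptions):

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Save the fine-tuned weights, then load them into a fresh t5-small.
torch.save(model.state_dict(), "summarizer.pt")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
model.load_state_dict(torch.load("summarizer.pt"))
model.eval()

def summarize(lyrics: str) -> str:
    inputs = tokenizer("summarize: " + lyrics, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```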

Description Comparison

We needed a way to compare the description of the input song to the descriptions of the ~5,000 songs in our dataset in order to determine the top 500 most similar songs.

One way to compare two texts is to convert them into a vectorized format and then compute the cosine similarity between them, which shows how similar they are in content. This technique measures the cosine of the angle between the two vectors; the vectors of more similar texts lie closer together in space.
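
For reference, the cosine similarity of two vectors A and B is their dot product divided by the product of their magnitudes:

similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

A score near 1 means the two texts use very similar weighted terms, while a score near 0 means they share almost none.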

Therefore, in order to compare the song summaries, all of the summaries have to be converted into a vector representation. Two possible approaches are CountVectorizer and TfidfVectorizer. CountVectorizer simply counts the occurrences of each word in a document, creating a sparse matrix where each row represents a document and each column represents a word. TfidfVectorizer (which stands for Term Frequency-Inverse Document Frequency Vectorizer), on the other hand, takes into account the frequency of a term within a document as well as the inverse of the term’s frequency across all documents. This inverse document frequency represents the ‘rarity’ of the term, so this method accounts for both the frequency and the rarity of a word, giving higher weight to more important, unique words and lower weight to common words.
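
Concretely, the standard weighting for a term t in document d, out of N documents, is:

tfidf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is how often t appears in d and df(t) is how many documents contain t. (scikit-learn uses a smoothed variant of this formula and normalizes each vector, but the idea is the same.)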

The summaries in the dataset and the predicted summary from the model were vectorized using scikit-learn’s TfidfVectorizer. The cosine similarity was then computed between each entry in the dataset and the predicted summary. The top 500 entries with the highest cosine similarity scores represent the 500 songs in the dataset that are closest in meaning and content to the predicted summary of the input song.
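
A minimal sketch of this filtering step, assuming the dataset summaries are held in a plain list of strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_similar_indices(predicted_summary, dataset_summaries, k=500):
    # Fit one shared vocabulary over the dataset plus the new summary,
    # then score the new summary against every dataset entry.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(dataset_summaries + [predicted_summary])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return scores.argsort()[::-1][:k]  # indices of the k closest summaries
```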

This vectorizing and cosine similarity approach appeared to perform well. For a given predicted summary, “A man and a woman enter a romantic relationship that is looked down upon by their families. They must fight for their love,” one of the top recommendations was a song called “Again” by Wande Coal. The ChatGPT-generated summary of this song is “The song is about a man who is in love with a woman and wants to spend the rest of his life with her. He tells her not to listen to what others say about him and promises to make her happy.” These two texts are relatively similar in content in that they both capture the romantic and slightly forbidden aspect of the relationship.

Another approach to vectorizing the song summaries is to use word embeddings. A BERT transformer model can be trained on the song summaries on some auxiliary task. After the model is trained, each summary can be passed into the model as a document, and its CLS document embedding extracted from one of the output layers of the model. A CLS document embedding is the vector representation of the CLS token, a special token that captures the entire document. This would give us a vector representation of each summary, after which the cosine similarities can be computed.
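
A sketch of extracting CLS embeddings with an off-the-shelf BERT from Hugging Face (skipping the auxiliary fine-tuning step described above):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def cls_embedding(summary: str) -> torch.Tensor:
    inputs = tokenizer(summary, return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    # The [CLS] token sits at position 0 of the last hidden layer and
    # serves as a whole-document representation.
    return outputs.last_hidden_state[0, 0]
```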

We found that this technique was not necessary for the scope of the project. TfidfVectorizer captured the meaning of the summaries decently well and produced strong output.

Musical Feature Comparison

We then needed a way to determine the top 5 most similar songs out of the 500 based on musical features. Since the musical features are numeric, we were able to use scikit-learn’s K-Nearest-Neighbors.

After filtering out the songs with the least similar lyrics, the next step is to choose the songs with the most similar musical features. A convenient way to measure those similarities is through Spotify’s track feature classifications, which quantify a song’s specific features such as “acousticness” or “energy”. Using the Spotify API, we can gather these metrics for any input song and then run the following algorithm to find similar songs.
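
Fetching these metrics could look like the sketch below, using the Spotipy client mentioned earlier; the search-then-lookup flow is our assumption about how a track is resolved to its features:

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Reads SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET from the environment.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

def get_audio_features(track_name: str, artist: str) -> dict:
    result = sp.search(q=f"track:{track_name} artist:{artist}",
                       type="track", limit=1)
    track_id = result["tracks"]["items"][0]["id"]
    # Returns danceability, energy, key, loudness, acousticness, etc.
    return sp.audio_features([track_id])[0]
```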

K-Nearest-Neighbors is an unsupervised machine learning algorithm that finds the k data points nearest to a given point, based on the similarity between the features of the input point and the points in the dataset. We can employ this technique to match the features of the user’s input song against all of the filtered songs and find the top k songs with the closest musical features. For our project, we decided that finding the top 5 neighboring songs would yield a reasonable output, all of which should have musical features similar to the user’s song.
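
A minimal sketch with scikit-learn, assuming the musical features have already been collected into numeric arrays (the scaling step is our own assumption, added so features on different scales compare fairly):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

def top_k_by_features(input_features, candidate_features, k=5):
    # Standardize so features on large scales (e.g. loudness in dB)
    # don't dominate the 0-1 features like danceability.
    scaler = StandardScaler().fit(candidate_features)
    knn = NearestNeighbors(n_neighbors=k)
    knn.fit(scaler.transform(candidate_features))
    _, indices = knn.kneighbors(scaler.transform(np.atleast_2d(input_features)))
    return indices[0]  # row indices of the k nearest songs
```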

User Flow

Here is what the process looks like for a user wanting recommendations. They would input a song that they like. We would use the LyricsGenius API to get the input song’s lyrics. We would then use those lyrics to generate a description with our NLP summarization model, and compare that description with the descriptions of the songs in our dataset to determine the top 500 most similar songs. Finally, we would get the musical features of the input song using Spotify’s API and compare them with those of the 500 songs to get the top 5 recommendations.
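
Tying the earlier sketches together, the whole flow could be glued up roughly as follows. Here `dataset_summaries`, `dataset_features`, `FEATURE_NAMES`, and the helper functions are the hypothetical pieces defined above, not our exact code:

```python
import lyricsgenius

FEATURE_NAMES = ["danceability", "energy", "key", "loudness",
                 "acousticness", "valence"]  # assumed feature columns

genius = lyricsgenius.Genius()  # reads GENIUS_ACCESS_TOKEN from the environment

def recommend(song_name: str, artist: str) -> list:
    lyrics = genius.search_song(song_name, artist).lyrics
    summary = summarize(lyrics)                       # fine-tuned t5-small
    shortlist = top_similar_indices(summary, dataset_summaries, k=500)
    candidates = [dataset_features[i] for i in shortlist]
    query = [get_audio_features(song_name, artist)[f] for f in FEATURE_NAMES]
    nearest = top_k_by_features(query, candidates, k=5)
    return [shortlist[i] for i in nearest]  # dataset indices of the top 5 songs
```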

Results

The summarizer model had a cross-entropy loss of roughly 2 on the validation set. We additionally used evaluation metrics such as ROUGE-1, ROUGE-2, and ROUGE-L. ROUGE-1 and ROUGE-2 count the unigrams and bigrams, respectively, that match between the original lyrics and the prediction, while the ROUGE-L score is based on the longest common subsequence. Our scores for these metrics were 0.25, 0.07, and 0.17 respectively, where higher values are more favorable.
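
For reference, these metrics can be computed with the rouge-score package; a small sketch with made-up strings:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  use_stemmer=True)
scores = scorer.score(
    "the man promises to make her happy",        # reference text
    "he promises that he will make her happy",   # model prediction
)
print(scores["rouge1"].fmeasure,
      scores["rouge2"].fmeasure,
      scores["rougeL"].fmeasure)
```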

Although the loss is relatively high and the ROUGE scores are low, human evaluation showed that the generated summaries were reasonably comparable to the ChatGPT-generated ones.
