
Your First Recommendation System: From Data Preparation to ML Debugging and Improvements Assessment


Image by Vska from freepik.com

So you've started developing your first production recommendation system. Even if you have experience in programming and ML, you may be overwhelmed by the volume of new information: model selection, metrics selection, inference problems and quality assurance.

We cover the steps for creating the first working version of an ML model, including data processing, model selection, metrics selection, ML debugging, results interpretation and evaluation of improvements.

The code for the article is here; it can serve as a starting point for your own work. The file rec_sys.ipynb contains a step-by-step guide.

To start, we need a minimal dataset consisting of three entities: user, item and rating. Each record of this dataset tells us about a user's interaction with an item. For this article we have chosen the MovieLens dataset [1] (used with permission), with 100k records containing 943 unique users and 1682 movies. In this dataset, user is UserId, item is MovieId and rating is Rating.

MovieLens also contains metadata for each movie, including its genres. We'll need this to interpret the predictions.

Here are the preprocessing steps specific to recommendation systems. I will skip the obvious steps such as dropping NaN values, dropping duplicates and data cleaning.

Rating generation

If we don't have a rating, create a rating column with the value 1.

df['rating'] = 1

If the rating isn't explicit, it can also be created with various aggregating functions, for instance based on the number of interactions, duration, etc.
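As an illustration of the aggregation idea, here is a minimal sketch that turns a raw event log into one rating per user-item pair. The `events` frame, its `watch_minutes` column and the choice of the interaction count as the rating are all assumptions made for the example, not part of the article's code.

```python
import pandas as pd

# Hypothetical interaction log: one row per viewing event.
events = pd.DataFrame({
    "user": ["u1", "u1", "u1", "u2", "u2"],
    "item": ["m1", "m1", "m2", "m1", "m3"],
    "watch_minutes": [30, 45, 10, 90, 5],
})

# Collapse the log to one row per (user, item) pair.
implicit = (
    events.groupby(["user", "item"], as_index=False)
    .agg(n_interactions=("watch_minutes", "count"),
         total_minutes=("watch_minutes", "sum"))
)

# One possible implicit rating: the number of interactions.
# Total duration (total_minutes) could be used in the same way.
implicit["rating"] = implicit["n_interactions"]
print(implicit)
```

Which aggregate works best depends on the product, so it is worth comparing a couple of variants on the same metrics.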

Entity encoding

If the item and user fields contain objects (for example, strings), these fields must be converted to a numeric format. A good way to do that is to use LabelEncoder.

from sklearn.preprocessing import LabelEncoder
u_transf = LabelEncoder()
item_transf = LabelEncoder()
# encoding
df['user'] = u_transf.fit_transform(df['user'])
df['item'] = item_transf.fit_transform(df['item'])
# decoding
df['item'] = item_transf.inverse_transform(df['item'])
df['user'] = u_transf.inverse_transform(df['user'])

Sparsity index

The sparsity index should be lowered to train a quality model.

What do high values of this index tell us? They mean that we have many users who haven't watched many movies, and also movies with a small audience. The more inactive users and unpopular movies we have, the higher the index.
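The formula is not shown in the text, so the sketch below assumes the common definition: the share of empty cells in the user-item matrix. The function name `sparsity_index` and the column names are hypothetical.

```python
import pandas as pd

def sparsity_index(df, user_col="user", item_col="item"):
    """Share of empty cells in the user-item matrix (0 = dense, 1 = empty).

    Assumed definition: 1 - observed pairs / (n_users * n_items).
    """
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    # Count each (user, item) pair once, even if it appears several times.
    n_pairs = len(df[[user_col, item_col]].drop_duplicates())
    return 1 - n_pairs / (n_users * n_items)

toy = pd.DataFrame({"user": [0, 0, 1, 2], "item": [0, 1, 1, 2]})
toy_sparsity = sparsity_index(toy)  # 3 users, 3 items, 4 observed pairs
print(round(toy_sparsity, 3))
```

Computing this value before and after filtering makes it easy to see how much each threshold changes the index.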

This situation occurs most often when, for instance, the number of users suddenly increases, or when we drastically expand the movie library and have no views at all for the new movies.

Reducing sparsity is critical for training. Say you've loaded your data, tried to train a model, and you're getting extremely low metrics. You don't need to start searching for special hyperparameters or looking for other, better models. Start by checking the sparsity index.

You can see from the graph that reducing this index by almost 2% has a very positive effect on the metrics.

The graph shows that 81% of users are inactive (they have watched fewer than 20 movies). They should be removed, and this function will help us with that:

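# USER_COL and ITEM_COL are the column-name constants used throughout the project
def reduce_sparsity(df, min_items_per_user, min_user_per_item, user_col=USER_COL, item_col=ITEM_COL):
    # Keep users with more than min_items_per_user interactions
    user_counts = df[user_col].value_counts()
    good_users = user_counts[user_counts > min_items_per_user].index
    df = df[df[user_col].isin(good_users)]

    # Keep items with more than min_user_per_item interactions (counted after the user filter)
    item_counts = df[item_col].value_counts()
    good_items = item_counts[item_counts > min_user_per_item].index
    df = df[df[item_col].isin(good_items)].reset_index(drop=True)

    return df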

So we had to remove some users and movies, but this will allow us to train the model better. It's best to reduce the sparsity level carefully and choose it for each dataset based on the situation. In my experience, a sparsity index of about 98% is already sufficient for training.

There are good articles detailing popular metrics for recommender systems, for instance “Recommender Systems: Machine Learning Metrics and Business Metrics” by Zuzanna Deutschman and “Automatic Evaluation of Recommendation Systems: Coverage, Novelty and Diversity” by Zahra Ahmad. For this article I decided to focus on four metrics that can serve as a minimum set to get you started.


Precision@k

P = (relevant elements) / k

This is a simple metric that doesn't take the order of predictions into account, so its values will be higher than MAP. It is also sensitive to changes in the model, which is useful for model monitoring and evaluation. It is easier to interpret, so we include it in our list of metrics.
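The formula P = (relevant elements) / k can be sketched for a single user as follows. The helper name `precision_at_k` and the toy item ids are made up for the example; the article's own `evaluate` function may compute it differently.

```python
def precision_at_k(recommended, relevant, k):
    """Share of the top-k recommended items that are relevant; order is ignored."""
    relevant = set(relevant)
    top_k = list(recommended)[:k]
    return sum(item in relevant for item in top_k) / k

recommended = ["m1", "m2", "m3", "m4", "m5"]  # model output, best first
relevant = {"m2", "m5", "m9"}                 # items the user interacted with in the test set
p_at_5 = precision_at_k(recommended, relevant, k=5)
print(p_at_5)  # 0.4
```

The dataset-level value is usually the mean of this number over all test users.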

MAP (Mean Average Precision)

Unlike the previous metric, this one is sensitive to the order of predictions: the closer to the top of the recommendation list a mistake is, the greater the penalty.
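To make the order sensitivity concrete, here is a sketch using one common definition of average precision at k (normalised by min(k, number of relevant items)). Other normalisations exist, so treat the exact formula as an assumption; the function names are hypothetical.

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one user: precision is accumulated at each rank where a hit occurs."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(list(recommended)[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(k, len(relevant))

def mean_average_precision(recs_by_user, relevant_by_user, k):
    """MAP@k: the mean of AP@k over all users that have test items."""
    users = list(relevant_by_user)
    total = sum(
        average_precision_at_k(recs_by_user.get(u, []), relevant_by_user[u], k)
        for u in users
    )
    return total / len(users)

# Both users get the same single hit, so their Precision@3 is identical (1/3),
# but the hit is ranked first for u1 and last for u2.
recs = {"u1": ["a", "b", "c"], "u2": ["c", "b", "a"]}
truth = {"u1": {"a"}, "u2": {"a"}}
ap_u1 = average_precision_at_k(recs["u1"], truth["u1"], k=3)
ap_u2 = average_precision_at_k(recs["u2"], truth["u2"], k=3)
map_at_3 = mean_average_precision(recs, truth, k=3)
print(ap_u1, ap_u2, map_at_3)
```

In this toy case the hit at rank 1 scores 1.0 while the same hit at rank 3 scores one third, which is exactly the penalty described above.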


Coverage

Coverage = number of unique items in recommendations / number of all unique items

This metric shows the percentage of movies used by the recommendation system. This is usually very important for businesses, to make sure that the content (in this case, movies) on their site is used to its full potential.
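A minimal sketch of the coverage ratio, assuming the predictions are a long-format DataFrame with one row per recommended item. The function name and column names are hypothetical.

```python
import pandas as pd

def coverage(preds, catalog_items, item_col="item"):
    """Unique recommended items divided by the number of unique items in the catalog."""
    return preds[item_col].nunique() / len(set(catalog_items))

preds = pd.DataFrame({"user": [1, 1, 2, 2], "item": [10, 11, 11, 12]})
catalog = range(10, 20)  # ten items in the library
cov = coverage(preds, catalog)
print(cov)  # 0.3
```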


Diversity

The aim of this metric is to calculate how diverse the recommendations are.

In the paper “Automatic Evaluation of Recommendation Systems: Coverage, Novelty and Diversity” by Zahra Ahmad, diversity is the average similarity for top_n.

In this article, however, diversity is treated differently: as the average number of unique genres. High diversity values mean that users have a chance to discover new genres, diversify their experience and spend more time on the site. As a rule, this increases the retention rate and has a positive impact on revenue. Calculated this way, the metric is highly interpretable for the business, unlike an abstract mean similarity ratio.
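One way to read this genre-based definition is sketched below: count the unique genres in each user's recommendations, then average over users. The averaging step, the function name and the pipe-separated `genres` column (the format used by MovieLens `movies.csv`) are assumptions; the article's `evaluate` function may aggregate differently.

```python
import pandas as pd

def genre_diversity(preds, genres, user_col="user", item_col="item", genre_col="genres"):
    """Mean number of unique genres per user's recommendation list."""
    merged = preds.merge(genres[[item_col, genre_col]], on=item_col, how="inner")
    # Split "Action|Comedy" into separate rows, one genre per row.
    exploded = merged.assign(genre=merged[genre_col].str.split("|")).explode("genre")
    return exploded.groupby(user_col)["genre"].nunique().mean()

preds = pd.DataFrame({"user": [1, 1, 2], "item": [10, 11, 12]})
genres = pd.DataFrame({
    "item": [10, 11, 12],
    "genres": ["Action|Comedy", "Comedy|Drama", "Drama"],
})
div = genre_diversity(preds, genres)  # user 1 sees 3 genres, user 2 sees 1
print(div)
```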

Interpretation of metrics

There is an excellent repository on recommender systems that contains not only the models themselves but also an evaluation of their metrics. Studying that table gives us an understanding of the possible range of the metrics and builds intuition for model evaluation. For instance, Precision@k values below 0.02 should generally be considered bad.

So we have quality metrics that are tied to rank and ones that are not. We also have metrics that relate indirectly to business and revenue, as well as to the use and availability of content. Now we can move on to choosing the model.

ALS matrix factorization

This is a great model to start with. The algorithm is implemented in Spark and is comparatively simple.

During training, the model initialises the User Matrix and the Item Matrix and trains them so as to minimise the error of reconstructing the Rating matrix. Each vector of the User Matrix represents a user, and each vector of the Item Matrix represents a specific item. Accordingly, the prediction is the scalar (dot) product of the corresponding vectors from the User and Item Matrices.
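The idea can be illustrated with a small NumPy sketch of alternating least squares on a toy rating matrix. This is not the Spark implementation: the function `als`, its hyperparameters and the toy data are assumptions chosen only to show the alternating ridge-regression updates and the dot-product prediction.

```python
import numpy as np

def als(R, mask, n_factors=2, reg=0.1, n_iters=30, seed=0):
    """Toy ALS: fix item vectors and solve for users, then the reverse, and repeat."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, n_factors))  # User Matrix
    V = rng.normal(scale=0.1, size=(n_items, n_factors))  # Item Matrix
    ridge = reg * np.eye(n_factors)
    for _ in range(n_iters):
        for u in range(n_users):          # update each user vector
            idx = mask[u]
            if idx.any():
                Vu = V[idx]
                U[u] = np.linalg.solve(Vu.T @ Vu + ridge, Vu.T @ R[u, idx])
        for i in range(n_items):          # update each item vector
            idx = mask[:, i]
            if idx.any():
                Ui = U[idx]
                V[i] = np.linalg.solve(Ui.T @ Ui + ridge, Ui.T @ R[idx, i])
    return U, V

# Toy ratings; 0 marks an unobserved cell.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0
U, V = als(R, mask)
pred = U @ V.T  # every prediction is a dot product of a user and an item vector
rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(round(rmse, 3))
```

Each inner update is an independent small linear solve, which is what makes the algorithm easy to parallelise on a cluster.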

It's a great model to start with because it is very easy to implement. At the research stage it is often better to start with simple models: this one trains quickly, which shortens the iteration time and speeds up the project considerably. The model also won't overload memory and will cope with a large amount of data, which saves on infrastructure in the long run.

Bilateral Variational Autoencoder (BiVAE)

The model is based on the paper “Bilateral variational autoencoder for collaborative filtering” by Quoc Tuan TRUONG, Aghiles SALAH, Hady W. LAUW.

The model is broadly similar to the previous one: during training, the user matrix Theta and the item matrix Beta are learned.

However, the structure of the model is much more complex than in ALS. There is a User encoder and an Item encoder, each consisting of a sequence of linear layers. Their task is to learn the hidden variables Theta and Beta respectively. Decoding and inference are done by the scalar (dot) product of these two variables. The loss function (the evidence lower bound in this case) is computed twice: first between the generated user vectors and the actual values, then in the same way for the item encoder.

This model was chosen as the best one in the comparison table. It is part of the Cornac model zoo for recommender systems, with PyTorch under the hood. The model has a custom implementation of its training mode. It is slower than its predecessor and will require more spending on support and infrastructure, but perhaps its high metrics are worth it.


Most popular

Yes, the simplest model, and in some situations the most effective.


Although this approach seems too simple, it will nevertheless allow us to compare metrics and, for instance, find out how far ML models have moved from such a simple baseline. Having such a model can justify or disprove the need for an ML implementation.
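A popularity baseline of this kind can be sketched in a few lines. This is an illustration, not the article's `MostPopular` class: the function name and column names are hypothetical, and the sketch does not filter out items a user has already seen.

```python
import numpy as np
import pandas as pd

def most_popular(train, top_k, user_col="user", item_col="item"):
    """Recommend the same top_k most frequently seen items to every user."""
    top_items = train[item_col].value_counts().head(top_k).index.to_numpy()
    users = train[user_col].unique()
    return pd.DataFrame({
        user_col: np.repeat(users, len(top_items)),
        item_col: np.tile(top_items, len(users)),
    })

train = pd.DataFrame({"user": [1, 1, 2, 2, 3], "item": [10, 11, 10, 12, 10]})
popular_preds = most_popular(train, top_k=1)  # item 10 has the most interactions
print(popular_preds)
```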

Random model

This model simply returns random items. It also creates the necessary contrast when evaluating the metrics and predictions of ML models.
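For comparison, a random baseline might look like the sketch below. It is not the article's `RandomModel`: the function name, the fixed seed and the column names are assumptions for the example.

```python
import numpy as np
import pandas as pd

def random_recommendations(users, items, top_k, seed=42):
    """Give every user top_k distinct items drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    items = np.asarray(items)
    rows = [
        (user, item)
        for user in users
        for item in rng.choice(items, size=top_k, replace=False)
    ]
    return pd.DataFrame(rows, columns=["user", "item"])

random_preds = random_recommendations(users=[1, 2, 3], items=[10, 11, 12, 13], top_k=2)
print(random_preds)
```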

So, we selected four models for the experiment. One is optimized for speed, another for quality, and two others serve for comparison and a better understanding of the results. We are now ready to begin training.

We'll train the four models at once to make it convenient to compare them. We'll use the settings from the recommenders repository.

import json
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

from models import RandomModel, MostPopular, ModelALS, BIVAE, evaluate
from setup import ITEM_COL, TOP_K_METRICS, TOP_K_PRED


def main(out_folder='outputs'):
    # Keep only the first three columns: user, item, rating
    df = pd.read_csv('personalize.csv.zip', compression='zip').iloc[:, :3]

    # Movie metadata (genres) is passed to the metric evaluation
    genres = pd.read_csv('movies.csv').rename({"movieId": ITEM_COL}, axis=1).dropna()

    train, test = train_test_split(df, test_size=None, train_size=0.75, random_state=42)

    metrics = {}
    for model_cls in [BIVAE, ModelALS, RandomModel, MostPopular]:
        model = model_cls()
        # Assumption: the training call is missing from this listing; the
        # wrappers in models.py are assumed to expose fit(train)
        model.fit(train)
        preds = model.transform(TOP_K_PRED)

        preds.to_csv(Path(out_folder) / f"{model_cls.__name__}_preds.csv", index=False)
        metrics[model_cls.__name__] = evaluate(train, test, preds, genres, TOP_K_METRICS)

    with open(Path(out_folder) / 'metrics.json', 'w') as fp:
        json.dump(metrics, fp)


if __name__ == "__main__":
    main()
Predictions for recommender systems have their own specifics. Predicting all items for all users gives a matrix of n_users x n_items. Accordingly, we can predict only for the users and items that were present at training time.

After that, two steps are required:

  1. Remove from the predictions the items that were in the training set.
  2. Sort by predicted rating in descending order for each user. Otherwise, the metrics will be bad.

This is an important point, since people often forget to remove seen items (the train items). This has a negative impact on the metrics, because the top of the list will be filled with items that are not in the test dataset. In addition, users will have a negative experience because the model recommends what they have already seen.
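The two post-processing steps, removing seen items and sorting per user, can be sketched with pandas as follows. The long-format `scores` frame, the `score` column and the function name `postprocess` are assumptions for the example.

```python
import pandas as pd

def postprocess(scores, train, top_k, user_col="user", item_col="item", score_col="score"):
    """Drop (user, item) pairs seen in train, then keep the top_k highest scores per user."""
    seen = train[[user_col, item_col]].drop_duplicates()
    # Anti-join: rows marked "left_only" have no match in the training set.
    merged = scores.merge(seen, on=[user_col, item_col], how="left", indicator=True)
    unseen = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
    ranked = unseen.sort_values([user_col, score_col], ascending=[True, False])
    return ranked.groupby(user_col).head(top_k).reset_index(drop=True)

scores = pd.DataFrame({
    "user": [1, 1, 1, 2, 2, 2],
    "item": ["a", "b", "c", "a", "b", "c"],
    "score": [0.9, 0.8, 0.1, 0.2, 0.7, 0.5],
})
train = pd.DataFrame({"user": [1, 2], "item": ["a", "b"]})
top = postprocess(scores, train, top_k=1)
print(top)
```

Here the highest-scoring item for each user is one they have already seen, so without the filter the top of both lists would be wasted.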

As we can see, the BIVAE model showed the best precision metrics: it was able to adapt very precisely to the tastes of users. But good precision has a downside: Coverage and Diversity are worse than for the ALS model. This means that a large amount of content will likely never get a chance to be seen by users. In this particular case BIVAE still looks preferable.

Sometimes, however, Diversity is more important to the business than Precision. This can happen, for instance, if users on your site watch only romantic comedies and ignore other genres, but you would like to encourage your audience to watch other genres.

The same can be said about the MostPopular model, which performs better than the ALS machine learning model. It may seem there is no need for all this ML complexity when we have a ready-made model! But if we look carefully, we see that its Coverage is very low, and as the amount of content grows, this share will usually fall even further. For instance, we now have only 1682 movies, but what happens if tomorrow the business decides to expand the library by 100k movies? The Coverage percentage for that model would drop even more. The same rule works in the opposite direction: the less data you have, the more likely it is that a simple MostPopular model will work.

It's also interesting to look at RandomModel: in terms of the Precision metric it doesn't look too bad compared with ALS, and its Coverage is 100%. Again, don't jump to conclusions. The small number of movies in this dataset contributes to this success.

In the end, the suitable model, with high quality and acceptable Coverage and Diversity, is BIVAE. We can build our base recommendation system on this model.

Sometimes debugging ML can be very difficult. Where and how do you look for problems if the metrics are not very good? In the code? In the data? In the choice of model or its hyperparameters?

Here are some suggestions:

  1. If you have low metrics, for instance Precision@k below 0.1, and you don't know whether the reason is the data, the model or the metric calculation, you can take the MovieLens dataset and train your model on it. If its metrics are low on MovieLens too, the cause is in the model; if the metrics are good, the likely cause lies in the preprocessing and postprocessing stages.
  2. If the metrics of the Random and MostPopular models are close to those of the ML models, it's worth checking the data: maybe the number of unique items is too low. This can also happen if we have very little data, or there may be a bug in the training code.
  3. Precision@k values higher than 0.5 look too high, and it's worth checking whether there is a bug in the script or whether we are lowering the sparsity index too much.
  4. Always compare how many users and items you have left after lowering the sparsity index. Sometimes in the pursuit of quality you can lose almost all users, so you should look for a compromise.
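The last check in the list is easy to automate. The sketch below compares the number of unique users and items before and after filtering; the function name `retention_report` and the toy data are hypothetical.

```python
import pandas as pd

def retention_report(before, after, user_col="user", item_col="item"):
    """Count unique users and items before and after filtering, with the share kept."""
    report = {}
    for name, col in (("users", user_col), ("items", item_col)):
        n_before = before[col].nunique()
        n_after = after[col].nunique()
        report[name] = {
            "before": n_before,
            "after": n_after,
            "kept_share": n_after / n_before,
        }
    return report

before = pd.DataFrame({"user": [1, 1, 1, 2, 3], "item": [10, 11, 12, 10, 10]})
# Example filter: keep only users with more than one interaction.
counts = before["user"].value_counts()
after = before[before["user"].isin(counts[counts > 1].index)]
report = retention_report(before, after)
print(report)
```

In this toy case two of the three users are lost while every item survives, which is the kind of imbalance worth noticing before training.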

So what do we do next if we want to get the model to production? We need to figure out what else has to be done and how much work it will take.

BIVAE optimization and functionality expansion

  • There are no cold-start mechanisms out of the box. Since we lowered the sparsity index, many users simply won't have predictions.
  • A batch predict algorithm needs to be implemented. Currently the mechanism predicts for one user at a time, i.e. batch_size=1, which naturally slows the work down considerably.
  • Using metadata about users and objects (movies).
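As a rough illustration of the batch prediction item, the sketch below scores users in chunks with a single matrix multiplication per chunk. It assumes the trained user and item vectors are available as NumPy arrays and that scores are dot products, as described for both models above; it does not use the Cornac API, and the function name and batch size are hypothetical.

```python
import numpy as np

def batch_top_k(user_vecs, item_vecs, top_k, batch_size=256):
    """Return the indices of the top_k highest-scoring items for every user."""
    n_users = user_vecs.shape[0]
    top_items = np.empty((n_users, top_k), dtype=np.int64)
    for start in range(0, n_users, batch_size):
        batch = user_vecs[start:start + batch_size]
        scores = batch @ item_vecs.T                       # shape: (batch, n_items)
        # Sort each row by descending score and keep the first top_k columns.
        top_items[start:start + batch_size] = np.argsort(-scores, axis=1)[:, :top_k]
    return top_items

rng = np.random.default_rng(0)
user_vecs = rng.normal(size=(7, 4))    # 7 users, 4 latent factors
item_vecs = rng.normal(size=(20, 4))   # 20 items
top = batch_top_k(user_vecs, item_vecs, top_k=3, batch_size=4)
print(top.shape)  # (7, 3)
```

Seen items would still need to be masked before ranking, and the batch size bounds the memory used by the score matrix.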


  • Advanced data diversification algorithms may be required, which can sometimes take as long to build as the model itself.


  • development of recent metrics
  • data quality check

Of course, this list may differ in each case; I have listed only the most likely improvements. Perhaps, after compiling such a list, you will want to use another model that has such features out of the box.

  1. We tested different models, including the latest solutions, to check their effectiveness, and learned how to select the most suitable model based on technical and business metrics.
  2. We made the first predictions, which can already be shown to users.
  3. In addition, we outlined a further action plan for developing the model and releasing it into production.

You can find all the code for the article here.

I would like to tell you more about the BiVAE architecture, about cold start and increasing diversity, about downstream tasks such as item/user similarity, about how quality can be controlled, about online and offline inference, and about how to move all of this into pipelines. But this is far beyond the scope of the article. If readers like the article, I may release a sequel where I go into everything in more detail.



[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872

All images unless otherwise noted are by the author.


