
Building Better ML Systems. Chapter 4: Model Deployment and Beyond


When deploying a model to production, there are two important questions to ask:

  1. Should the model return predictions in real time?
  2. Can the model be deployed to the cloud?

The first question forces us to choose between real-time and batch inference, and the second between cloud and edge computing.

Real-Time vs. Batch Inference

Real-time inference is a straightforward and intuitive way to work with a model: you give it an input, and it returns a prediction. This approach is used when a prediction is required immediately. For instance, a bank might use real-time inference to verify whether a transaction is fraudulent before finalizing it.

Batch inference, on the other hand, is cheaper to run and easier to implement. Previously collected inputs are processed all at once. Batch inference is used for evaluations (when running on static test datasets), ad-hoc campaigns (such as selecting customers for email marketing campaigns), or in situations where immediate predictions are not essential. Batch inference can also be a cost or speed optimization of real-time inference: you precompute predictions upfront and return them when requested.
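To make the precompute-and-look-up idea concrete, here is a minimal sketch of a batch inference job in Python; the trained model, the user_id column, and the in-memory cache are assumptions made purely for illustration.

```python
# A minimal sketch of batch inference as a cost/speed optimization: predictions
# are precomputed for all known users and later served by a simple lookup.
import pandas as pd


def run_batch_inference(model, features: pd.DataFrame) -> dict:
    """Precompute a prediction for every row and key it by user_id."""
    predictions = model.predict(features.drop(columns=["user_id"]))
    return dict(zip(features["user_id"], predictions))


# Nightly job (e.g., triggered by a scheduler):
# cache = run_batch_inference(trained_model, user_features)

# At request time, return the precomputed value instead of calling the model:
# def get_prediction(user_id):
#     return cache.get(user_id)  # None for users not seen in the batch run
```

In a real system, the cache would typically live in a database or key-value store rather than in memory, but the pattern is the same.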

Real-time vs. Batch Inference. Image by Author

Running real-time inference is far more difficult and expensive than batch inference. This is because the model must always be up and return predictions with low latency. It requires a clever infrastructure and monitoring setup that may be unique even for different projects within the same company. Therefore, if getting a prediction immediately is not critical for the business, stick with batch inference and be happy.

However, for many companies, real-time inference does make a difference in terms of accuracy and revenue. This is true for search engines, recommendation systems, and ad click predictions, so investing in real-time inference infrastructure is more than justified.

For more details on real-time vs. batch inference, check out these posts:
Deploy machine learning models in production environments by Microsoft
Batch Inference vs Online Inference by Luigi Patruno

Cloud vs. Edge Computing

In cloud computing, data is usually transferred over the internet and processed on a centralized server. In edge computing, on the other hand, data is processed on the device where it was generated, with each device handling its own data in a decentralized way. Examples of edge devices are phones, laptops, and cars.

Cloud vs. Edge Computing. Image by Author

Streaming services like Netflix and YouTube typically run their recommender systems in the cloud. Their apps and websites send user data to data servers to get recommendations. Cloud computing is relatively easy to set up, and you can scale computing resources almost indefinitely (or at least as long as it is economically sensible). However, cloud infrastructure heavily depends on a stable internet connection, and sensitive user data should not be transferred over the internet.

Edge computing was developed to overcome cloud limitations and is able to work where cloud computing cannot. A self-driving engine runs on the car, so it can still work fast without a stable internet connection. Smartphone authentication systems (like iPhone’s FaceID) run on smartphones because transferring sensitive user data over the internet is not a good idea, and users need to be able to unlock their phones without an internet connection. However, for edge computing to be viable, the edge device must be sufficiently powerful, or alternatively, the model must be lightweight and fast. This gave rise to model compression methods, such as low-rank approximation, knowledge distillation, pruning, and quantization. If you want to learn more about model compression, here is a great place to start: Awesome ML Model Compression.
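As a taste of what model compression looks like in practice, here is a minimal sketch of post-training dynamic quantization in PyTorch; the tiny network below is an illustrative stand-in for a real edge model, not a recommendation.

```python
# A minimal sketch of one compression technique: post-training dynamic
# quantization in PyTorch. The small network is purely illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# Replace Linear layer weights with int8 versions; activations are quantized on the fly.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# The quantized model is smaller and usually faster on CPU, at the cost of a small
# accuracy drop that should be measured before shipping it to the device.
with torch.no_grad():
    output = quantized_model(torch.randn(1, 128))
```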

For a deeper dive into Edge and Cloud Computing, read these posts:
What’s the Difference Between Edge Computing and Cloud Computing? by NVIDIA
Edge Computing vs Cloud Computing: Major Differences by Mounika Narang

Easy Deployment & Demo

“Production is a spectrum. For some teams, production means generating nice plots from notebook results to show to the business team. For other teams, production means keeping your models up and running for millions of users per day.” Chip Huyen, Why data scientists shouldn’t need to know Kubernetes

Deploying models to serve millions of users is a task for a large team, so as a Data Scientist / ML Engineer, you won’t be left alone.

However, sometimes you do need to deploy alone. Maybe you are working on a pet or study project and would like to create a demo. Maybe you are the first Data Scientist / ML Engineer in the company and need to bring some business value before the company decides to scale the Data Science team. Maybe all your colleagues are so busy with their own tasks that you are asking yourself whether it is easier to deploy yourself rather than wait for support. You are not the first and definitely not the last to face these challenges, and there are solutions to help you.

To deploy a model, you need a server (instance) where the model will be running, an API to communicate with the model (send inputs, get predictions), and (optionally) a user interface to accept input from users and show them predictions.

Google Colab is Jupyter Notebook on steroids. It is a great tool to create demos that you can share. It does not require any specific installation from users, it offers free servers with a GPU to run the code, and you can easily customize it to accept any inputs from users (text files, images, videos). It is very popular among students and ML researchers (here is how DeepMind researchers use it). If you are interested in learning more about Google Colab, start here.

FastAPI is a framework for building APIs in Python. You may have heard of Flask; FastAPI is similar, but simpler to code, more specialized towards APIs, and faster. For more details, check out the official documentation. For practical examples, read APIs for Model Serving by Goku Mohandas.
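To give a flavor of how little code this takes, here is a minimal sketch of a FastAPI prediction endpoint; the model.pkl file and its predict() interface are hypothetical placeholders.

```python
# A minimal sketch of serving a model with FastAPI; the pickled model and its
# predict() interface are assumptions made for illustration.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # hypothetical pre-trained model artifact
    model = pickle.load(f)


class PredictionRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(request: PredictionRequest):
    # The model is assumed to accept a 2D array-like and return one prediction per row.
    prediction = model.predict([request.features])[0]
    return {"prediction": float(prediction)}
```

Run it with uvicorn main:app and send a POST request to /predict with a JSON body like {"features": [1.0, 2.0, 3.0]}.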

Streamlit is an easy tool to create web applications. It is simple, I really mean it. And the applications turn out nice and interactive, with images, plots, input windows, buttons, sliders,… Streamlit offers a Community Cloud where you can publish apps for free. To get started, refer to the official tutorial.
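Here is a minimal sketch of what a Streamlit demo can look like; the price formula is a dummy stand-in for a real trained model.

```python
# A minimal sketch of a Streamlit demo app; the price formula below is a dummy
# stand-in for a real model's predict() call.
import streamlit as st

st.title("Apartment Price Demo")

area = st.slider("Area, m²", min_value=10, max_value=300, value=50)
rooms = st.number_input("Number of rooms", min_value=1, max_value=10, value=2)

if st.button("Predict"):
    # In a real app, load the trained model once and call model.predict() here.
    price = 1_000 * area + 5_000 * rooms  # dummy formula instead of a real model
    st.write(f"Estimated price: ${price:,.0f}")
```

Save it as app.py and launch it with streamlit run app.py.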

Cloud Platforms. Google and Amazon do a great job of making the deployment process painless and accessible. They offer paid end-to-end solutions to train and deploy models (storage, compute instances, APIs, monitoring tools, workflows,…). These solutions are easy to start with and also have broad enough functionality to support specific needs, so many companies build their production infrastructure with cloud providers.

If you would like to learn more, here are the resources to study:
Deploy your side-projects at scale for mainly nothing by Alex Olivier
Deploy models for inference by Amazon
Deploy a model to an endpoint by Google

Monitoring

Like all software systems in production, ML systems must be monitored. Monitoring helps quickly detect and localize bugs and prevent catastrophic system failures.

Technically, monitoring means collecting logs, calculating metrics from them, displaying these metrics on dashboards like Grafana, and setting up alerts for when metrics fall outside expected ranges.

What metrics should be monitored? Since an ML system is a subclass of a software system, start with operational metrics. Examples are CPU/GPU utilization of the machine, its memory and disk space; the number of requests sent to the application, response latency, and error rate; and network connectivity. For a deeper dive into monitoring operational metrics, check out the post An Introduction to Metrics, Monitoring, and Alerting by Justin Ellingwood.
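As a tiny illustration, here is a sketch of turning raw request logs into two common operational metrics; the log format and the hard-coded records are assumptions for illustration only.

```python
# A minimal sketch of computing an error rate and p95 latency from request logs;
# the log format and the sample records are purely illustrative.
import numpy as np

request_logs = [
    {"status": 200, "latency_ms": 38},
    {"status": 200, "latency_ms": 52},
    {"status": 500, "latency_ms": 731},
]  # in practice these records come from your logging pipeline

error_rate = sum(log["status"] >= 500 for log in request_logs) / len(request_logs)
p95_latency = np.percentile([log["latency_ms"] for log in request_logs], 95)

print(f"error_rate={error_rate:.1%}, p95_latency={p95_latency:.0f} ms")
# These values would be pushed to a dashboard (e.g., Grafana) and alerted on
# when they fall outside expected ranges.
```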

While operational metrics are about machine, network, and application health, ML-related metrics check model accuracy and input consistency.

Accuracy is the most important thing we care about. The model might still return predictions, but those predictions could be entirely off-base, and you won’t know it until the model is evaluated. If you are fortunate enough to work in a domain where natural labels become available quickly (as in recommender systems), simply collect these labels as they arrive, evaluate the model, and do so continuously. However, in many domains, labels might either take a long time to arrive or not arrive at all. In such cases, it is helpful to monitor something that could indirectly indicate a potential drop in accuracy.

Why could model accuracy drop at all? The most widespread reason is that production data has drifted from the training/test data. In the Computer Vision domain, you can visually see that the data has drifted: images have become darker or lighter, resolution has changed, or there are now more indoor images than outdoor ones.

To automatically detect data drift (it is also called “data distribution shift”), continuously monitor model inputs and outputs. The inputs to the model should be consistent with those used during training; for tabular data, this means that the column names as well as the mean and variance of the features must be the same. Monitoring the distribution of model predictions is also helpful. In classification tasks, for instance, you can track the proportion of predictions for each class. If there is a notable change (say, a model that previously categorized 5% of instances as Class A now categorizes 20% as such), it is a sign that something definitely happened. To learn more about data drift, check out this great post by Chip Huyen: Data Distribution Shifts and Monitoring.
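Here is a minimal sketch of such a drift check for tabular data, assuming training and production features are available as pandas DataFrames with identical column names; the alert helper is a hypothetical placeholder.

```python
# A minimal sketch of a data drift check that compares the distribution of each
# numeric feature in production against its training distribution.
import pandas as pd
from scipy.stats import ks_2samp


def detect_drift(train_df: pd.DataFrame, prod_df: pd.DataFrame, alpha: float = 0.01) -> dict:
    """Flag numeric columns whose production distribution differs from training."""
    drifted = {}
    for col in train_df.select_dtypes("number").columns:
        # The Kolmogorov-Smirnov test compares the two empirical distributions.
        statistic, p_value = ks_2samp(train_df[col].dropna(), prod_df[col].dropna())
        if p_value < alpha:
            drifted[col] = {"ks_statistic": round(statistic, 3), "p_value": p_value}
    return drifted


# Hypothetical usage inside a scheduled monitoring job:
# drifted = detect_drift(train_features, last_day_features)
# if drifted:
#     send_alert(f"Data drift detected in: {list(drifted)}")  # send_alert is a placeholder
```

With many features and frequent checks, some statistically significant differences will appear by chance, so in practice thresholds are tuned and alerts are aggregated rather than fired per column.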

There is much more to say about monitoring, but we must move on. You can check these posts if you feel you need more information:
Monitoring Machine Learning Systems by Goku Mohandas
A Comprehensive Guide on How to Monitor Your Models in Production by Stephen Oladele

Model Updates

If you deploy a model to production and do nothing to it, its accuracy diminishes over time. In most cases, this is explained by data distribution shifts. The input data may change format. User behavior constantly changes without any valid reasons. Epidemics, crises, and wars may suddenly occur and break all the rules and assumptions that worked previously. “Change is the only constant.” - Heraclitus

That is why production models must be regularly updated. There are two types of updates: model updates and data updates. During a model update, the algorithm or training strategy is changed. A model update does not need to happen regularly; it is usually done ad hoc, when the business task changes, a bug is found, or the team has time for research. In contrast, a data update is when the same algorithm is trained on newer data. Regular data updates are a must for any ML system.

A prerequisite for regular data updates is setting up an infrastructure that can support automatic dataflows, model training, evaluation, and deployment.

It is crucial to highlight that data updates should happen with little to no manual intervention. Manual effort should be reserved primarily for data annotation (while ensuring that data flow to and from annotation teams is fully automated), possibly for making final deployment decisions, and for addressing any bugs that may surface during the training and deployment phases.

Once the infrastructure is set up, the frequency of updates is merely a value you need to adjust in the config file. How often should the model be updated with newer data? The answer is: as frequently as is feasible and economically sensible. If increasing the frequency of updates brings more value than it costs, definitely go for the increase. However, in some scenarios, training every hour may not be feasible, even if it would be highly profitable. For instance, if a model depends on human annotations, this process can become a bottleneck.

Training from scratch or fine-tuning on new data only? It is not a binary decision but rather a mix of both. Frequently fine-tuning the model makes sense since it is cheaper and quicker than training from scratch. However, occasionally, training from scratch is also necessary. It is crucial to understand that fine-tuning is primarily an optimization of cost and time. Typically, companies start with the simple approach of training from scratch, gradually incorporating fine-tuning as the project expands and evolves.

To find out more about model updates, check out this post:
To retrain, or not to retrain? Let’s get analytical about ML model updates by Emeli Dral et al.

Testing in Production

Before the model is deployed to production, it must be thoroughly evaluated. We already discussed pre-production (offline) evaluation in the previous post (see the section “Model Evaluation”). However, you never know how the model will perform in production until you deploy it. This gave rise to testing in production, which is also known as online evaluation.

Testing in production does not mean recklessly swapping out your reliable old model for a newly trained one and then anxiously awaiting the first predictions, ready to roll back at the slightest hiccup. Never do that. There are smarter and safer strategies to test your model in production without risking money or customers.

A/B testing is the most popular approach in the industry. With this method, traffic is randomly divided between the existing and the new model in some proportion. Both models make predictions for real users; the predictions are saved and later carefully inspected. It is useful to compare not only model accuracies but also some business-related metrics, like conversion or revenue, which may sometimes be negatively correlated with accuracy.

A/B testing relies heavily on statistical hypothesis testing. If you want to learn more about it, here is the post for you: A/B Testing: A Complete Guide to Statistical Testing by Francesco Casalegno. For an engineering implementation of A/B tests, check out the Online AB test pattern.
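To make this concrete, here is a minimal sketch of deterministic traffic splitting plus a conversion-rate comparison; the user IDs, traffic share, and aggregated counts are illustrative assumptions, not real data.

```python
# A minimal sketch of an A/B test: hash-based traffic splitting and a z-test on
# conversion rates. All numbers below are made up for illustration.
import hashlib

from statsmodels.stats.proportion import proportions_ztest


def assign_variant(user_id: str, new_model_share: float = 0.5) -> str:
    """Route a user to the new or the existing model, consistently across requests."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_model_share * 100 else "existing"


# Hypothetical aggregated results: conversions and number of users per variant.
conversions = [530, 596]   # existing model, new model
users = [10_000, 10_000]

# Two-sided z-test for the difference in conversion rates.
stat, p_value = proportions_ztest(count=conversions, nobs=users)
print(f"z = {stat:.2f}, p-value = {p_value:.3f}")
```

Hashing the user ID (rather than assigning variants at random per request) keeps each user in the same group for the whole experiment.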

Shadow deployment is the safest way to test the model. The idea is to send all the traffic to the existing model and return its predictions to the end user in the usual way, while at the same time also sending all the traffic to a new (shadow) model. The shadow model’s predictions are not used anywhere, only stored for future analysis.
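A minimal sketch of what this looks like inside a request handler is shown below; predict_existing, predict_shadow, and log_shadow_prediction are hypothetical helpers standing in for your serving and logging code.

```python
# A minimal sketch of shadow deployment; all three helper functions are
# hypothetical placeholders for real serving and logging code.
def handle_request(features):
    # The user always receives the existing model's prediction.
    prediction = predict_existing(features)

    # The shadow model sees the same traffic, but its output is only logged
    # for later offline comparison.
    try:
        shadow_prediction = predict_shadow(features)
        log_shadow_prediction(features, prediction, shadow_prediction)
    except Exception:
        pass  # a shadow-model failure must never affect the user-facing path

    return prediction
```

In practice, the shadow call is often made asynchronously (or from a copy of the traffic) so it does not add latency to the user-facing path.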

A/B Testing vs. Shadow Deployment. Image by Author

Canary release. You may think of it as “dynamic” A/B testing. A new model is deployed in parallel with the existing one. At the beginning, only a small share of traffic is sent to the new model, for instance, 1%; the other 99% is still served by the existing model. If the new model’s performance is good enough, its share of traffic is gradually increased and evaluated again, and increased again and evaluated, until all traffic is served by the new model. If at some stage the new model does not perform well, it is removed from production and all traffic is directed back to the existing model.
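Here is a minimal sketch of the rollout loop described above; route_traffic(), new_model_is_healthy(), the share schedule, and the soak time are all hypothetical placeholders.

```python
# A minimal sketch of a canary release loop; the helpers and the numbers are
# hypothetical placeholders, not a production-ready rollout controller.
import time


def canary_rollout(shares=(0.01, 0.05, 0.25, 0.5, 1.0), soak_seconds=3600):
    for share in shares:
        route_traffic(new_model_share=share)    # e.g., update a feature flag or load balancer
        time.sleep(soak_seconds)                # let the new model serve real traffic
        if not new_model_is_healthy():          # compare accuracy and business metrics
            route_traffic(new_model_share=0.0)  # send all traffic back to the existing model
            return False
    return True                                 # all traffic is now served by the new model
```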

Here is the post that explains it a bit more:
Shadow Deployment Vs. Canary Release of ML Models by Bartosz Mikulski.

Conclusion

In this chapter, we learned about a whole new set of challenges that arise once the model is deployed to production. The operational and ML-related metrics of the model must be continuously monitored to quickly detect and fix bugs if they arise. The model must be regularly retrained on newer data because its accuracy diminishes over time, primarily due to data distribution shifts. We discussed the high-level decisions to make before deploying the model, real-time vs. batch inference and cloud vs. edge computing, each of which has its own benefits and limitations. We covered tools for easy deployment and demos for the infrequent cases when you need to do it alone. We learned that the model must be evaluated in production in addition to offline evaluations on static datasets. You never know how the model will work in production until you actually release it. This problem gave rise to “safe” and controlled production tests: A/B tests, shadow deployments, and canary releases.

This was also the final chapter of the “Building Better ML Systems” series. If you have stayed with me from the beginning, you know now that an ML system is much more than just a fancy algorithm. I truly hope this series was helpful, expanded your horizons, and taught you how to build better ML systems.

Thanks for reading!
