Introduction to ML Deployment: Flask, Docker & Locust

Photo by İsmail Enes Ayhan on Unsplash

You’ve spent a lot of time on EDA, carefully crafted your features, tuned your model for days, and finally have something that performs well on the test set. Now what? Now, my friend, we need to deploy the model. After all, any model that stays in a notebook has a value of zero, no matter how good it is.

It might feel overwhelming to learn this part of the data science workflow, especially if you don’t have a lot of software engineering experience. Fear not, this post’s main purpose is to get you started by introducing one of the most popular frameworks for deployment in Python: Flask. In addition, you’ll learn how to containerise the deployment and measure its performance, two steps that are frequently overlooked.

First things first, let’s clarify what I mean by deployment in this post. ML deployment is the process of taking a trained model and integrating it into a production system (the server in the diagram below), making it available for use by end-users or other systems.

Model deployment diagram. Image by author.

Keep in mind that in reality, the deployment process is much more complicated than simply making the model available to end-users. It also involves service integration with other systems, selection of an appropriate infrastructure, load balancing and optimisation, and robust testing of all of these components. Most of these steps are out of scope for this post and should ideally be handled by experienced software/ML engineers. Nevertheless, it’s important to have some understanding of these areas, which is why this post will cover containerisation, inference speed testing, and load handling.

All of the code can be found in this GitHub repo. I’ll show fragments from it, but make sure to pull it and experiment with it, that’s the best way to learn. To run the code you’ll need docker, flask, fastapi, and locust installed. There might be some additional dependencies to install, depending on the environment you’re running this code in.

To make the learning more practical, this post will walk you through a simple demo deployment of a loan default prediction model. The model training process is out of scope for this post, so an already trained and serialised CatBoost model is available in the GitHub repo. The model was trained on the pre-processed U.S. Small Business Administration dataset (CC BY-SA 4.0 license). Feel free to explore the data dictionary to understand what each of the columns means.

This project focuses primarily on the serving part, i.e. making the model available to other systems. Hence, the model will actually be deployed on your local machine, which is fine for testing but suboptimal for the real world. Here are the main steps that the deployments for Flask and FastAPI will follow:

  1. Create API endpoint (using Flask or FastAPI)
  2. Containerise the application (endpoint) using Docker
  3. Run the Docker image locally, making a server
  4. Test the server performance
Project flow diagram. Image by author.

Sounds exciting, right? Well, let’s start then!

Flask is a popular and widely adopted web framework for Python due to its lightweight nature and minimal installation requirements. It offers a straightforward way to develop REST APIs, which are ideal for serving machine learning models.

The typical workflow with Flask involves defining a prediction HTTP endpoint and linking it to a specific Python function that receives data as input and generates predictions as output. This endpoint can then be accessed by users and other applications.

If you’re interested in simply creating a prediction endpoint, it’s going to be quite easy. All you need to do is deserialise the model, create the Flask application object, and specify the prediction endpoint with the POST method. You can find more details about POST and other methods here.
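Here’s a minimal sketch of what such an app.py could look like. The model file name and the fallback dummy model are assumptions for illustration; in the repo, the serialised CatBoost model is loaded directly.

```python
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)


class _DummyModel:
    """Stand-in so the sketch runs without the serialised model file (assumption)."""

    def predict_proba(self, df):
        return [[0.9, 0.1]] * len(df)


# Deserialise the trained model; the file name is an assumption, adjust to your repo.
try:
    from catboost import CatBoostClassifier

    model = CatBoostClassifier()
    model.load_model("loan_catboost_model.cbm")
except Exception:
    model = _DummyModel()


@app.route("/predict", methods=["POST"])
def predict():
    # Read the json payload with the loan application attributes
    data = request.get_json()
    # Wrap the single application into a one-row DataFrame
    df = pd.DataFrame(data, index=[0])
    # Return the predicted probability of default as json
    proba = model.predict_proba(df)[0][1]
    return jsonify({"default_probability": float(proba)})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8989, debug=True)
```

The port (8989) matches the one used in the docker commands later in the post.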

The most important part of the code is the predict function. It reads the json input, which in this case is a bunch of attributes describing a loan application. It then takes this data, transforms it into a DataFrame, and passes it through the model. The resulting probability of default is then formatted back into json and returned. When this app is deployed locally, we can get a prediction by sending a request with json-formatted data to the URL. Let’s try it out! To launch the server, we can simply run the Python file with the command below.

python app.py
Expected output. Screenshot by author.

When this command is run, you should see a message that your app is running at the given address. For now, let’s ignore the big red warning and test the app. To check if the app is working as expected, we can send a test request (loan application data) to the app and see if we get a response (default probability prediction) in return.
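A test request can be sent with a short script along these lines. The payload field names here are hypothetical placeholders; match them to the features your model was actually trained on.

```python
import requests

URL = "http://localhost:8989/predict"

# Hypothetical loan application; field names must match the model's features
application = {
    "Term": 84,
    "NoEmp": 5,
    "NewExist": 1,
    "CreateJob": 0,
    "UrbanRural": 1,
}


def get_prediction(url=URL, payload=application):
    """Send one loan application and return the parsed json response."""
    response = requests.post(url, json=payload, timeout=5)
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(get_prediction())
```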

If you managed to get a response with a probability, congrats! You’ve deployed the model using your own computer as a server. Now, let’s kick it up a notch and package your deployment app using Docker.

Containerisation is the process of encapsulating your application and all of its dependencies (including Python) into a self-contained, isolated package that can run consistently across different environments (e.g. locally, in the cloud, on your friend’s laptop, etc.). You can achieve this with Docker, and all you need to do is properly specify the Dockerfile, build the image, and then run it. The Dockerfile gives instructions to your container, e.g. which version of Python to use, which packages to install, and which commands to run. There’s a great video tutorial about Docker if you’re interested in finding out more.

Here’s what it might look like for the Flask application above.
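The sketch below assumes an app.py entrypoint, a requirements.txt in the repo root, and a Python 3.10 base image; adjust these to your project.

```dockerfile
FROM python:3.10-slim

WORKDIR /app

# Install the dependencies first so this layer is cached between builds
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code and the serialised model
COPY . .

# The port the Flask app listens on
EXPOSE 8989

CMD ["python", "app.py"]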

Now, we can build the image using the docker build command.

docker build -t default-service:v01 .

-t gives you an option to name your docker image and provide a tag for it, so this image’s name is default-service with a tag of v01. The dot at the end refers to the PATH argument that needs to be provided. It’s the location of your model, application code, etc. Since I assume that you’re building this image in the directory with all of the code, PATH is set to . which means the current directory. It might take a while to build this image, but once it’s done, you should be able to see it when you run docker images.

Let’s run the Dockerised app using the following command:

docker run -it --rm -p 8989:8989 default-service:v01

The -it flag makes the Docker image run in interactive mode, meaning that you’ll be able to see the code logs in the shell and stop the image when needed using Ctrl+C. --rm ensures that the container is automatically removed when you stop the image. Finally, -p makes the ports from inside the Docker image available outside of it. The command above maps port 8989 from inside Docker to the localhost, making our endpoint available at the same address.

Now that our model is successfully deployed using Flask and the deployment container is up and running (at least locally), it’s time to evaluate its performance. At this point, our focus is on serving metrics such as response time and the server’s capacity to handle requests per second, rather than ML metrics like RMSE or F1 score.

Testing Using Script

To obtain a rough estimate of response latency, we can create a script that sends several requests to the server and measures the time taken (usually in milliseconds) for the server to return a prediction. However, it’s important to note that the response time is not constant, so we need to measure the median latency to estimate the time users usually wait to receive a response, and the 95th latency percentile to measure the worst-case scenarios.
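A minimal sketch of such a script could look as follows. The URL, payload fields, and number of requests are assumptions; the repo’s actual measure_response.py may differ.

```python
import time

import numpy as np
import requests

URL = "http://localhost:8989/predict"

# Hypothetical payload; match the fields to your model's features
APPLICATION = {"Term": 84, "NoEmp": 5, "NewExist": 1, "CreateJob": 0, "UrbanRural": 1}


def summarise_latencies(latencies_ms):
    """Return the median and 95th percentile of a list of latencies (ms)."""
    return {
        "median_ms": float(np.percentile(latencies_ms, 50)),
        "p95_ms": float(np.percentile(latencies_ms, 95)),
    }


def measure(n_requests=300):
    """Send n_requests sequentially and time each round trip."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.post(URL, json=APPLICATION, timeout=5)
        latencies.append((time.perf_counter() - start) * 1000)
    return summarise_latencies(latencies)


if __name__ == "__main__":
    print(measure())
```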

This code resides in measure_response.py, so we can simply run this Python file to measure these latency metrics.

python measure_response.py
Latency metrics. Screenshot by author.

The median response time turned out to be 9 ms, but the worst-case scenario is more than 10x that. Whether this performance is satisfactory is up to you and the product manager, but at least now you’re aware of these metrics and can work further to improve them.

Testing Using Locust

Locust is a Python package designed to test the performance and scalability of web applications. We’re going to use Locust to generate a more advanced testing scenario, since it allows you to configure parameters like the number of users (i.e. loan applicants) spawned per second.

First things first, the package can be installed by running pip install locust in your terminal. Then, we need to define a test scenario that specifies what our imaginary user will do with our server. In our case it’s quite straightforward: the user will send a request with (json-formatted) information about their loan application and will receive a response from our deployed model.

As you can see, the Locust task is very similar to the test ping that we did above. The only difference is that it needs to be wrapped in a class that inherits from locust.HttpUser, and the performed task (send data and get a response) needs to be decorated with @task.

To start load testing, we simply need to run the command below.

locust -f app_test.py

When it launches, you’ll be able to access the testing UI, where you’ll need to specify the application’s URL, the number of users, and the spawn rate.

Locust UI. Screenshot by author.

A spawn rate of 5 with 100 users means that every second, 5 new users will start sending requests to your app, until their number reaches 100. This means that at its peak, our app will need to handle 100 requests per second. Now, let’s click the start button and move to the charts section of the UI. Below I’m going to present results from my machine, but they’ll certainly be different from yours, so make sure to run this on your own as well.

Locust 100 users test visualisation. Screenshot by author.

You’ll see that as the traffic builds up, your response time gets slower. There are going to be occasional peaks as well, so it’s important to understand when they occur and why. Most importantly, Locust helps us understand that our local server can handle 100 requests per second with a median response time of ~250ms.

We can keep stress testing our app and identify the load that it cannot manage. For this, let’s increase the number of users to 1000 and see what happens.

Locust 1000 users test visualisation. Screenshot by author.

It looks like the breaking point of my local server is ~180 concurrent users. This is an important piece of information that we were able to extract using Locust.

Good job on getting this far! I hope this post has provided you with a practical and insightful introduction to model deployment. By following this project or adapting it to your specific model, you should now have a thorough understanding of the essential steps involved in model deployment. In particular, you’ve gained knowledge on creating REST API endpoints for your model using Flask, containerising them with Docker, and systematically testing these endpoints using Locust.

In the next post, I’ll be covering FastAPI, BentoML, cloud deployment, and much more, so make sure to subscribe, clap, and leave a comment if something is unclear.


