Serving ML Models with TorchServe



This post walks you through the process of serving your PyTorch deep learning model with the TorchServe framework.

There are quite a few articles on this topic. However, they typically focus either on deploying TorchServe itself or on writing custom handlers and getting the end result. That motivated me to write this post. It covers both parts and provides an end-to-end example.
An image classification task is used as the example. By the end you'll be able to deploy a TorchServe server, serve a model, send any random picture of a piece of clothing and get back the predicted label of the clothing class. I believe this is what people expect from an ML model served as an API endpoint for classification.

Say, your data science team designed a wonderful DL model. It's a great accomplishment, no doubt. However, to get value out of it the model needs to be somehow exposed to the outside world (if it's not a Kaggle competition). This is called model serving. In this post I won't touch serving patterns for batch operations or streaming patterns purely based on streaming frameworks. I'll focus on one option: serving a model as an API (no matter whether this API is called by a streaming framework or by any custom service). More precisely, this option is the TorchServe framework.
So, when you decide to serve your model as an API you have at least the following options:

  • web frameworks such as Flask, Django, FastAPI, etc.
  • cloud services like AWS SageMaker endpoints
  • dedicated serving frameworks like TensorFlow Serving, NVIDIA Triton and TorchServe

All have their pros and cons, and the choice is not always straightforward. Let's explore the TorchServe option in practice.

The first part briefly describes how the model was trained. It's not essential for TorchServe, but I believe it helps to follow the end-to-end process. Then a custom handler is explained.
The second part focuses on deployment of the TorchServe framework.
The source code for this post is located here: git repo

For this toy example I chose an image classification task based on the FashionMNIST dataset. In case you're not familiar with the dataset, it's 70k grayscale 28×28 images of different clothes. There are 10 classes of clothes, so a DL classification model will return 10 logit values. For the sake of simplicity the model is based on the TinyVGG architecture (in case you want to visualize it with CNN Explainer): just a few convolution and max pooling layers with ReLU activation. The notebook model_creation_notebook in the repo shows the whole process of training and saving the model.
In brief, the notebook downloads the data, defines the model architecture, trains the model and saves the state dict with torch.save. There are two artifacts relevant to TorchServe: a class with the definition of the model architecture and the saved model (.pth file).

Two modules need to be prepared: the model file and the custom handler.

Model file
As per the documentation: “A model file should contain the model architecture. This file is mandatory in case of eager mode models.
This file should contain a single class that inherits from torch.nn.Module.”

So, let's just copy the class definition from the model training notebook and save it as a Python file (any name you prefer).
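
The original class definition is not reproduced here, so the following is only a minimal sketch of what such a model file could look like. The file name model.py, the class name FashionMnistTinyVGG and the number of hidden channels are assumptions for illustration, not taken from the repo. Every constructor argument has a default because TorchServe's base handler instantiates the model class without arguments before loading the state dict.

```python
# model.py (hypothetical file name)
import torch
from torch import nn


class FashionMnistTinyVGG(nn.Module):  # hypothetical class name
    """TinyVGG-style classifier for 1x28x28 grayscale images."""

    def __init__(self, in_channels: int = 1, hidden_units: int = 10, num_classes: int = 10):
        super().__init__()
        self.block_1 = nn.Sequential(
            nn.Conv2d(in_channels, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 28x28 -> 14x14
        )
        self.block_2 = nn.Sequential(
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden_units, hidden_units, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),  # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(hidden_units * 7 * 7, num_classes),  # one logit per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.block_2(self.block_1(x)))
```

The layer sizes must match the ones used in training, otherwise loading the saved state dict will fail.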

TorchServe offers some default handlers (e.g. image_classifier), but I doubt they can be used as is for real cases. So, most likely you will need to create a custom handler for your task. The handler defines how to preprocess data from the HTTP request, how to feed it into the model, how to postprocess the model's output and what to return as the final result in the response.
There are two options: a module level entry point and a class level entry point. See the official documentation here.
I'll implement the class level option. It basically means that I need to create a custom Python class and define two mandatory functions: initialize and handle.
First of all, to make it easier let's inherit from the BaseHandler class. The initialize function defines how to load the model. Since we don't have any specific requirements here, let's just use the definition from the super class.

The handle function defines how to process the data. In the simplest case the flow is: preprocess >> inference >> postprocess. In real applications you'll likely need to define your own preprocess and postprocess functions. For the inference function in this example I'll use the default definition from the super class.
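
A minimal sketch of such a class level handler is shown below. The class name FashionMnistHandler is hypothetical, and the import fallback is there only so the sketch can be loaded on a machine without TorchServe; a real handler needs the real BaseHandler. The preprocess and postprocess methods are expected to be overridden as described in the next sections.

```python
try:
    from ts.torch_handler.base_handler import BaseHandler
except ImportError:
    # Fallback only so this sketch can be imported where TorchServe is not
    # installed; it is not a working replacement for BaseHandler.
    BaseHandler = object


class FashionMnistHandler(BaseHandler):  # hypothetical class name
    """Class level entry point for TorchServe.

    TorchServe calls initialize() once when the model is loaded and
    handle() for every batch of requests.
    """

    # initialize() is inherited unchanged from BaseHandler: it loads the model.

    def handle(self, data, context):
        model_input = self.preprocess(data)         # request payload -> tensor
        model_output = self.inference(model_input)  # default BaseHandler logic
        return self.postprocess(model_output)       # tensor -> response list
```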

Preprocess function

Say, you built an app for image classification. The app sends a request to TorchServe with an image as payload. It's unlikely that the image always complies with the image format used for model training. Also, you'd probably train your model on batches of samples, so the tensor dimensions need to be adjusted. So, let's make a simple preprocess function: resize the image to the required shape, make it grayscale, transform it to a Torch tensor and make it a one-sample batch.
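
The steps above can be sketched as a standalone function; in the handler it would be the body of the preprocess method. This is an illustration under assumptions: the request arrives as a list of dicts with the raw bytes under the "data" or "body" key (the format TorchServe's built-in vision handler reads), only the first request of the batch is used, and pixel values are scaled to [0, 1], which is assumed to match the training transform.

```python
import io

import numpy as np
import torch
from PIL import Image

IMAGE_SIZE = (28, 28)  # FashionMNIST image size


def preprocess(data):
    """Turn the first request of a batch into a 1x1x28x28 float tensor."""
    row = data[0]
    raw = row.get("data") or row.get("body")          # uploaded file bytes
    image = Image.open(io.BytesIO(raw)).convert("L")  # make it grayscale
    image = image.resize(IMAGE_SIZE)                  # resize to the model input
    pixels = np.asarray(image, dtype=np.float32) / 255.0  # scale to [0, 1]
    # Add the channel and batch dimensions: (28, 28) -> (1, 1, 28, 28)
    return torch.from_numpy(pixels).unsqueeze(0).unsqueeze(0)
```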

Postprocess function

A multiclass classification model returns a list of logits or softmax probabilities. But in a real scenario you'd rather need the predicted class, or the predicted class with its probability, or maybe the top N predicted labels. Of course, you can do it somewhere in the main app or another service, but it means you bind the logic of your app to the ML training process. So, let's return the predicted class directly in the response.
(For the sake of simplicity the list of labels is hardcoded here. In the GitHub version the handler reads it from a config.)
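
A minimal sketch of such a postprocess step, again written as a standalone function rather than the original handler code. It assumes the model output is a tensor of shape (batch, 10) and uses the standard FashionMNIST class names in label order. The result is a list because TorchServe expects one response element per request in the batch.

```python
import torch

# Standard FashionMNIST class names, indexed by label id.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def postprocess(logits):
    """Map each row of logits to the name of its highest-scoring class."""
    predicted = torch.argmax(logits, dim=1)
    return [CLASS_NAMES[index] for index in predicted.tolist()]
```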

Okay, the model file and the handler are ready. Now let's deploy the TorchServe server. The code above assumes that you have already installed PyTorch. Another prerequisite is an installed JDK 11 (note that just a JRE is not enough, you need the JDK).
For TorchServe you need to install two packages: torchserve and torch-model-archiver.
After successful installation the first step is to prepare a .mar file, an archive with the model artifacts. The torch-model-archiver CLI is designed to do this. Type in the terminal:

torch-model-archiver --model-name fashion_mnist --version 1.0 --model-file path/ --serialized-file path/fashion_mnist_model.pth --handler path/

The arguments are the following:
model name: the name you want to give to the model
version: semantic version for versioning
model file: file with the class definition of the model architecture
serialized file: the .pth file with the saved model
handler: Python module with the handler

As a result, a .mar file named after the model name (in this example fashion_mnist.mar) is generated in the directory where the CLI command is executed. So, it's better to cd to your project directory before calling the command.

The next step, finally, is to start the server. Type in the terminal:

torchserve --start --model-store path --models fmnist=/path/fashion_mnist.mar 

model store: directory where the .mar files are located
models: name(s) of the model(s) and path to the corresponding .mar file

Note that the model name in the archiver defines how your .mar file will be named. The model name in torchserve defines the API endpoint name used to invoke the model. So, those names can be the same or different, it's up to you.

After those two commands the server should be up and running. By default TorchServe uses three ports: 8080, 8081 and 8082 for inference, management and metrics respectively. Go to your browser, curl or Postman and send a health check request to the inference API.
If TorchServe works correctly you should see {"status": "Healthy"}
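
The same check can be scripted. The sketch below assumes the server runs locally on the default inference port and uses the /ping route of the TorchServe inference API; the function names are made up for illustration.

```python
import json
import urllib.request


def parse_status(body):
    """Extract the status string from a health check response body."""
    return json.loads(body.decode("utf-8"))["status"]


def check_health(base_url="http://localhost:8080", timeout=5.0):
    """Call the health check route and return the reported status."""
    with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as response:
        return parse_status(response.read())
```

With a running server, check_health() should return "Healthy".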


A couple of hints for possible issues:
1. If after the torchserve --start command you see errors in the log mentioning “No module named captum”, then install it manually. I encountered this error with torchserve 0.7.1.

2. It may happen that some port is already in use by another process. Then you will likely see a ‘Partially healthy’ status and some errors in the log.
To check which process uses the port on a Mac, type (for example for 8081):

sudo lsof -i :8081
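
If you only need to know whether a port is taken, and not by which process, a small cross-platform check is possible as well. This sketch is not from the original post; it simply tries to open a TCP connection to the port on the local machine.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise
        return sock.connect_ex((host, port)) == 0
```

For example, port_in_use(8081) tells you whether the default management port is already occupied before you start TorchServe.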

One option is to kill the process to free the port. But that is not always a good idea if the process is somehow important.
Instead, it's possible to specify a new port for TorchServe in a simple config file. Say, you have some application that is already running on port 8081. Let's change the default port for the TorchServe management API by creating a torch_config file with just one line that sets the new port.


(you can choose any free port)
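
The one-line file from the original post is not shown above. As a sketch only: TorchServe reads its settings from a properties file, and the address of the management API is set with the management_address key. The helper name and the port 8085 in the usage note are arbitrary choices for illustration.

```python
from pathlib import Path


def write_torchserve_config(path, management_port):
    """Write a one-line TorchServe config that moves the management API."""
    line = f"management_address=http://127.0.0.1:{management_port}\n"
    Path(path).write_text(line, encoding="utf-8")
    return line
```

Calling write_torchserve_config("torch_config", 8085) produces a file with the single line management_address=http://127.0.0.1:8085.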

Next we need to let TorchServe know about the config. First, stop the unhealthy server with

torchserve --stop

Then restart it as

torchserve --start --model-store path --models fmnist=/path/fashion_mnist.mar --ts-config path/torch_config

At this step it's assumed that the server is up and running correctly. Let's pass a random clothes image to the inference API and get the predicted label.
The endpoint for inference in this example is http://localhost:8080/predictions/fmnist
Let's curl it and pass an image as

curl -X POST http://localhost:8080/predictions/fmnist -T /path_to_image/image_file

For example, with the sample image from the repo:

curl -X POST http://localhost:8080/predictions/fmnist -T tshirt4.jpg

(the -X flag specifies the method, POST here; the -T flag transfers a file)
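
If the caller is a Python service rather than curl, the same request could look roughly like this. The sketch uses only the standard library and the endpoint from this example; the function names are hypothetical, and the response is assumed to be the plain label text returned by the custom handler.

```python
import urllib.request

PREDICT_URL = "http://localhost:8080/predictions/fmnist"


def build_prediction_request(image_bytes, url=PREDICT_URL):
    """Build a POST request with the raw image bytes as the body."""
    return urllib.request.Request(
        url,
        data=image_bytes,
        method="POST",
        # Set explicitly: urllib would otherwise default to a form content type.
        headers={"Content-Type": "application/octet-stream"},
    )


def predict(image_path, url=PREDICT_URL, timeout=10.0):
    """Send an image file to the inference endpoint and return the response text."""
    with open(image_path, "rb") as image_file:
        request = build_prediction_request(image_file.read(), url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, predict("tshirt4.jpg") sends the sample image from the repo.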

The response should contain the predicted label.


Well, by following along with this blog post we were able to create a REST API endpoint to which we can send an image and get the predicted label of the image. By repeating the same procedure on a server instead of a local machine, one can use it to create an endpoint for a user-facing app, for other services or, for instance, for a streaming ML application (see this interesting paper for a reason why you probably shouldn't do that).

Stay tuned, in the next part I'll expand the example: let's make a mock Flask app for business logic and invoke an ML model served via TorchServe (and deploy everything with Kubernetes).
A simple use case: a user-facing app with tons of business logic and many different features. Say, one feature is uploading an image to apply a desired style to it with a style transfer ML model. The ML model can be served with TorchServe, and thus the ML part will be completely decoupled from the business logic and other features in the main app.

