models continue to grow in scope and accuracy, even tasks once dominated by traditional algorithms are gradually being taken over by deep learning models. Algorithmic pipelines — workflows that take an input, process it through a series of algorithms, and produce an output — increasingly rely on a number of AI-based components. These AI models often have significantly different resource requirements than their classical counterparts, such as higher memory usage, reliance on specialized hardware accelerators, and increased computational demands.
In this post, we address a common challenge: efficiently processing large-scale inputs through algorithmic pipelines that include deep learning models. A typical solution is to run multiple independent jobs, each responsible for processing a single input. This setup is commonly managed with job orchestration frameworks (e.g., Kubernetes). However, when deep learning models are involved, this approach can become inefficient, as loading and executing the same model in each individual process can lead to resource contention and scaling limitations. As AI models become increasingly prevalent in algorithmic pipelines, it is crucial that we revisit the design of such solutions.
In this post we evaluate the benefits of centralized inference serving, where a dedicated inference server handles prediction requests from multiple parallel jobs. We define a toy experiment in which we run an image-processing pipeline based on a ResNet-152 image classifier on 1,000 individual images. We compare the runtime performance and resource utilization of the following two implementations:
- Decentralized inference — each job loads and runs the model independently.
- Centralized inference — all jobs send inference requests to a dedicated inference server.
To keep the experiment focused, we make several simplifying assumptions:
- Instead of using a full-fledged job orchestrator (like Kubernetes), we implement parallel process execution using Python's multiprocessing module.
- While real-world workloads often span multiple nodes, we run everything on a single node.
- Real-world workloads typically include multiple algorithmic components. We limit our experiment to a single component — a ResNet-152 classifier running on a single input image.
- In a real-world use case, each job would process a unique input image. To simplify our experiment setup, each job will process the same kitten.jpg image.
- We will use a minimal deployment of a TorchServe inference server, relying mostly on its default settings. Similar results are expected with alternative inference server solutions such as NVIDIA Triton Inference Server or LitServe.
The code is shared for demonstrative purposes only. Please do not interpret our choice of TorchServe — or any other component of our demonstration — as an endorsement of its use.
Toy Experiment
We conduct our experiments on an Amazon EC2 c5.2xlarge instance, with 8 vCPUs and 16 GiB of memory, running a PyTorch Deep Learning AMI (DLAMI). We activate the PyTorch environment using the following command:
source /opt/pytorch/bin/activate
Step 1: Creating a TorchScript Model Checkpoint
We begin by creating a ResNet-152 model checkpoint. Using TorchScript, we serialize both the model definition and its weights into a single file:
import torch
from torchvision.models import resnet152, ResNet152_Weights
model = resnet152(weights=ResNet152_Weights.DEFAULT)
model = torch.jit.script(model)
model.save("resnet-152.pt")
Step 2: Model Inference Function
Our inference function performs the following steps:
- Load the ResNet-152 model.
- Load an input image.
- Preprocess the image to match the input format expected by the model, following the implementation defined here.
- Run inference to classify the image.
- Post-process the model output to return the top five label predictions, following the implementation defined here.
We define a constant MAX_THREADS hyperparameter that we use to limit the number of threads used for model inference in each process. This is intended to prevent resource contention between the multiple jobs.
import os, time, psutil
import multiprocessing as mp
import torch
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image
def predict(image_id):
    # Limit each process to 1 thread
    MAX_THREADS = 1
    os.environ["OMP_NUM_THREADS"] = str(MAX_THREADS)
    os.environ["MKL_NUM_THREADS"] = str(MAX_THREADS)
    torch.set_num_threads(MAX_THREADS)

    # Load the model
    model = torch.jit.load('resnet-152.pt').eval()

    # Define image preprocessing steps
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])
    ])

    # Load the image
    image = Image.open('kitten.jpg').convert("RGB")

    # Preprocess
    image = transform(image).unsqueeze(0)

    # Perform inference
    with torch.no_grad():
        output = model(image)

    # Post-process: top-5 class probabilities
    probabilities = F.softmax(output[0], dim=0)
    probs, classes = torch.topk(probabilities, 5, dim=0)
    probs = probs.tolist()
    classes = classes.tolist()

    return dict(zip(classes, probs))
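As a quick sanity check before scaling up, we can call the function once and inspect the output. The snippet below is a hypothetical usage example (not part of the original experiment code); the returned keys are ImageNet class indices:

if __name__ == "__main__":
    start = time.time()
    top5 = predict(0)
    print(f"Top-5 predictions (class index: probability): {top5}")
    print(f"Single-image latency: {time.time() - start:.2f} seconds")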
Step 3: Running Parallel Inference Jobs
We define a function that spawns parallel processes, each processing a single image input. This function:
- Accepts the total number of images to process and the maximum number of concurrent jobs.
- Dynamically launches new processes when slots become available.
- Monitors CPU and memory usage throughout execution.
def process_image(image_id):
    print(f"Processing image {image_id} (PID: {os.getpid()})")
    predict(image_id)

def spawn_jobs(total_images, max_concurrent):
    start_time = time.time()
    max_mem_utilization = 0.
    max_utilization = 0.

    processes = []
    index = 0

    while index < total_images or processes:
        while len(processes) < max_concurrent and index < total_images:
            # Start a new process
            p = mp.Process(target=process_image, args=(index,))
            index += 1
            p.start()
            processes.append(p)

        # Sample memory and CPU utilization
        mem_usage = psutil.virtual_memory().percent
        max_mem_utilization = max(max_mem_utilization, mem_usage)
        cpu_util = psutil.cpu_percent(interval=0.1)
        max_utilization = max(max_utilization, cpu_util)

        # Remove completed processes from the list
        processes = [p for p in processes if p.is_alive()]

    total_time = time.time() - start_time
    print(f"\nTotal Processing Time: {total_time:.2f} seconds")
    print(f"Max CPU Utilization: {max_utilization:.2f}%")
    print(f"Max Memory Utilization: {max_mem_utilization:.2f}%")

spawn_jobs(total_images=1000, max_concurrent=32)
Estimating the Maximum Number of Processes
While the optimal number of maximum concurrent processes is best determined empirically, we can estimate an upper bound based on the 16 GiB of system memory and the size of the resnet-152.pt file (231 MB).
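As a rough back-of-the-envelope calculation (ignoring the memory footprint of the Python interpreter, the input tensors, and intermediate activations), dividing the available memory by the checkpoint size yields:

total_memory_mb = 16 * 1024  # 16 GiB of system memory
checkpoint_mb = 231          # size of the resnet-152.pt checkpoint
print(total_memory_mb // checkpoint_mb)  # ~70 processes

In practice, each process consumes considerably more than the checkpoint size alone, which is consistent with the memory saturation observed at around 50 concurrent processes.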
The table below summarizes the runtime results for several configurations:
Although memory becomes fully saturated at 50 concurrent processes, we observe that maximum throughput is achieved at 8 concurrent jobs — one per vCPU. This means that beyond this point, resource contention outweighs any potential gains from additional parallelism.
The Inefficiencies of Independent Model Execution
Running parallel jobs that each load and execute the model independently introduces significant inefficiencies and waste:
- Each process must allocate sufficient memory for storing its own copy of the AI model.
- AI models are compute-intensive. Executing them in many processes in parallel can lead to resource contention and reduced throughput.
- Loading the model checkpoint file and initializing the model in each process adds overhead and can further increase latency. In the case of our toy experiment, model initialization accounts for roughly 30%(!!) of the overall inference processing time, as illustrated by the timing sketch below.
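One way to verify this initialization overhead is to time the model-loading step separately from the forward pass. Below is a minimal timing sketch (not part of the experiment code; the exact numbers will vary with the environment):

import time
import torch

start = time.time()
model = torch.jit.load('resnet-152.pt').eval()
load_time = time.time() - start

start = time.time()
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
infer_time = time.time() - start

print(f"load: {load_time:.2f}s, inference: {infer_time:.2f}s, "
      f"load fraction: {load_time / (load_time + infer_time):.0%}")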
A more efficient alternative is to centralize inference execution using a dedicated model inference server. This approach would eliminate redundant model loading and reduce overall system resource utilization.
In the next section we will set up an AI model inference server and assess its impact on resource utilization and runtime performance.
Note: We could have modified our multiprocessing-based approach to share a single model across processes (e.g., using torch.multiprocessing or another solution based on shared memory). However, the inference server demonstration better reflects real-world production environments, where jobs often run in isolated containers.
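For completeness, below is a minimal sketch of what the shared-memory alternative might look like, using torch.multiprocessing to load the model once and share its weights across worker processes. This is shown for illustration only and is not part of our experiment:

import torch
import torch.multiprocessing as mp
from torchvision.models import resnet152, ResNet152_Weights

def run_job(model, image_id):
    # The model weights reside in shared memory; nothing is reloaded here
    with torch.no_grad():
        model(torch.randn(1, 3, 224, 224))

if __name__ == "__main__":
    model = resnet152(weights=ResNet152_Weights.DEFAULT).eval()
    model.share_memory()  # move the model parameters to shared memory
    processes = [mp.Process(target=run_job, args=(model, i)) for i in range(8)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()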
TorchServe Setup
The TorchServe setup described in this section loosely follows the resnet tutorial. Please refer to the official TorchServe documentation for more in-depth guidelines.
Installation
The PyTorch environment of our DLAMI comes preinstalled with the TorchServe executables. If you are running in a different environment, run the following installation command:
pip install torchserve torch-model-archiver
Making a Model Archive
The TorchServe Model Archiver packages the model and its associated files into a “.mar” file archive, the format required for deployment on TorchServe. We create a TorchServe model archive file based on our model checkpoint file and using the default image_classifier handler:
mkdir model_store
torch-model-archiver \
    --model-name resnet-152 \
    --serialized-file resnet-152.pt \
    --handler image_classifier \
    --version 1.0 \
    --export-path model_store
TorchServe Configuration
We create a TorchServe config.properties file to define how TorchServe should operate:
model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
      "marName": "resnet-152.mar"\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=1
# Job queue size (default is 100)
job_queue_size=100
After completing these steps, our working directory should look like this:
├── config.properties
├── kitten.jpg
├── model_store
│ ├── resnet-152.mar
├── multi_job.py
Starting TorchServe
In a separate shell we start our TorchServe inference server:
source /opt/pytorch/bin/activate
torchserve \
    --start \
    --disable-token-auth \
    --enable-model-api \
    --ts-config config.properties
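Before sending traffic, we can verify that the server is up and that the model has been registered. TorchServe exposes a health-check endpoint on the inference port (8080 by default) and a model-listing endpoint on the management port (8081 by default). A minimal check using the requests package:

import requests

# Health check on the inference API (port 8080 by default)
print(requests.get("http://127.0.0.1:8080/ping").json())
# List registered models via the management API (port 8081 by default)
print(requests.get("http://127.0.0.1:8081/models").json())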
Inference Request Implementation
We define an alternative prediction function that calls our inference service:
import requests

def predict_client(image_id):
    with open('kitten.jpg', 'rb') as f:
        image = f.read()
    response = requests.post(
        "http://127.0.0.1:8080/predictions/resnet-152",
        data=image,
        headers={'Content-Type': 'application/octet-stream'}
    )
    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error from inference server: {response.text}")
Scaling Up the Number of Concurrent Jobs
Now that inference requests are being processed by a central server, we can scale up parallel processing. Unlike the earlier approach, where each process loaded and executed its own model, we have sufficient CPU resources to allow for many more concurrent processes. Here we choose 100 processes, in accordance with the default capacity of the inference server:
spawn_jobs(total_images=1000, max_concurrent=100)
Results
The performance results are captured in the table below. Keep in mind that the comparative results can vary greatly based on the details of the AI model and the runtime environment.

By using a centralized inference server, we have not only increased overall throughput by more than 2X, but also freed up significant CPU resources for other computation tasks.
Next Steps
Now that we have demonstrated the benefits of a centralized inference serving solution, we can explore several ways to enhance and optimize the setup. Recall that our experiment was intentionally simplified to focus on demonstrating the utility of inference serving. In real-world deployments, additional enhancements may be required to tailor the solution to your specific needs.
- Custom Inference Handlers: While we used TorchServe's built-in image_classifier handler, defining a custom handler provides much greater control over the details of the inference implementation (see the sketch after this list).
- Advanced Inference Server Configuration: Inference server solutions typically include many features for tuning the service behavior based on the workload requirements. In the sections below we explore some of the features supported by TorchServe.
- Expanding the Pipeline: Real-world pipelines will typically include more algorithmic blocks and more sophisticated AI models than we used in our experiment.
- Multi-Node Deployment: While we ran our experiments on a single compute instance, production setups will typically include multiple nodes.
- Alternative Inference Servers: While TorchServe is a popular choice and relatively easy to set up, there are many alternative inference server solutions that may provide additional advantages and may better fit your needs. Importantly, it was recently announced that TorchServe will no longer be actively maintained. See the documentation for details.
- Alternative Orchestration Frameworks: In our experiment we used Python multiprocessing. Real-world workloads will typically use more advanced orchestration solutions.
- Utilizing Inference Accelerators: While we executed our model on a CPU, using an AI accelerator (e.g., an NVIDIA GPU, a Google Cloud TPU, or an AWS Inferentia) can drastically improve throughput.
- Model Optimization: Optimizing your AI models can greatly increase efficiency and throughput.
- Auto-Scaling for Inference Load: In some use cases inference traffic will fluctuate, requiring an inference server solution that can scale its capacity accordingly.
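As mentioned in the first bullet above, here is a rough sketch of what a custom TorchServe handler might look like. The file name (my_handler.py) and the request payload keys are assumptions made for illustration; please refer to the TorchServe documentation for the full handler contract:

# my_handler.py (hypothetical custom handler sketch)
import io
import torch
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image
from ts.torch_handler.base_handler import BaseHandler

class ResNetHandler(BaseHandler):
    def initialize(self, context):
        # BaseHandler loads the serialized model into self.model
        super().initialize(context)
        self.transform = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])

    def preprocess(self, data):
        # Each request is assumed to carry raw image bytes in its body
        images = []
        for row in data:
            payload = row.get("data") or row.get("body")
            image = Image.open(io.BytesIO(payload)).convert("RGB")
            images.append(self.transform(image))
        return torch.stack(images)

    def inference(self, batch):
        with torch.no_grad():
            return self.model(batch)

    def postprocess(self, output):
        probabilities = F.softmax(output, dim=1)
        probs, classes = torch.topk(probabilities, 5, dim=1)
        return [dict(zip(c.tolist(), p.tolist()))
                for c, p in zip(classes, probs)]

Such a handler would be passed to torch-model-archiver via the --handler flag in place of the built-in image_classifier.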
In the sections below we explore two simple ways to enhance our TorchServe-based inference server implementation. We leave the discussion of other enhancements to future posts.
Batch Inference with TorchServe
Many model inference service solutions support the option of grouping inference requests into batches. This usually results in increased throughput, especially when the model is running on a GPU.
We extend our TorchServe config.properties file to support batch inference with a batch size of up to 8 samples. Please see the official documentation for details on batch inference with TorchServe.
model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
      "marName": "resnet-152.mar",\
      "batchSize": 8,\
      "maxBatchDelay": 100,\
      "responseTimeout": 200\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=1
# Job queue size (default is 100)
job_queue_size=100
Results
We append the results to the table below:

Enabling batched inference increases the throughput by an additional 26.5%.
Multi-Worker Inference with TorchServe
Many model inference service solutions support creating multiple inference workers for each AI model. This allows fine-tuning the number of inference workers based on the expected load. Some solutions also support auto-scaling of the number of inference workers.
We extend our TorchServe setup by increasing the default_workers_per_model setting, which controls the number of inference workers assigned to our image classification model.
Importantly, we must limit the number of threads allocated to each worker to prevent resource contention. This is controlled by the number_of_netty_threads setting and by the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables. Here we set the number of threads equal to the number of vCPUs (8) divided by the number of workers.
model_store=model_store
load_models=resnet-152.mar
models={\
  "resnet-152": {\
    "1.0": {\
      "marName": "resnet-152.mar",\
      "batchSize": 8,\
      "maxBatchDelay": 100,\
      "responseTimeout": 200\
    }\
  }\
}

# Number of workers per model
default_workers_per_model=2
# Job queue size (default is 100)
job_queue_size=100
# Number of threads per worker
number_of_netty_threads=4
The modified TorchServe startup sequence appears below:
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
torchserve \
    --start \
    --disable-token-auth \
    --enable-model-api \
    --ts-config config.properties
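As an alternative to editing config.properties and restarting the server, the worker count can also be adjusted at runtime through TorchServe's management API (port 8081 by default). A minimal sketch using the requests package:

import requests

# Scale the resnet-152 model to 4 workers and wait for the change to complete
response = requests.put(
    "http://127.0.0.1:8081/models/resnet-152",
    params={"min_worker": 4, "max_worker": 4, "synchronous": "true"},
)
print(response.status_code, response.text)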
Results
In the table below we append the results of running with 2, 4, and 8 inference workers:

By configuring TorchServe to use multiple inference workers, we are able to increase the throughput by an additional 36%. This amounts to a 3.75X improvement over the baseline experiment.
Summary
This experiment highlights the potential impact of inference server deployment on multi-job deep learning workloads. Our findings suggest that using an inference server can improve system resource utilization, enable higher concurrency, and significantly increase overall throughput. Keep in mind that the precise benefits will depend greatly on the details of the workload and the runtime environment.
Designing the inference serving architecture is only one part of optimizing AI model execution. Please see some of our many posts covering a wide range of AI model optimization techniques.