Optimizing PyTorch Model Inference on CPU

As the reliance on AI models grows, so does the importance of optimizing their runtime performance. While the degree to which AI models will outperform human intelligence remains a heated topic of debate, their need for powerful and expensive compute resources is unquestionable, and even notorious.

In previous posts, we covered the subject of AI model optimization, primarily within the context of model training, and demonstrated how it can have a decisive impact on the cost and speed of AI model development. In this post, we focus our attention on AI model inference, where model optimization has an additional objective: to minimize the latency of inference requests and improve the user experience of the model consumer.

In this post, we will assume that the platform on which model inference is performed is a 4th Gen Intel® Xeon® Scalable CPU processor, more specifically, an Amazon EC2 c7i.xlarge instance (with 4 Intel Xeon vCPUs) running a dedicated Deep Learning Ubuntu (22.04) AMI and a CPU build of PyTorch 2.8.0. Of course, the choice of a model deployment platform is one of the many important decisions taken when designing an AI solution, alongside the choice of model architecture, development framework, training accelerator, data format, deployment strategy, etc., each of which must be made with consideration of the associated costs and runtime speed. The choice of a CPU processor for running model inference may seem surprising in an era in which the variety of dedicated AI inference accelerators is constantly growing. Nevertheless, as we will see, there are some occasions when the best (and most cost-effective) option may very well be a good old-fashioned CPU.

We will introduce a toy image-classification model and proceed to demonstrate some of the optimization opportunities for AI model inference on an Intel® Xeon® CPU. The deployment of an AI model typically involves a full inference server solution, but for the sake of simplicity, we will limit our discussion to just the model's core execution. For a primer on model inference serving, please see our previous post: The Case for Centralized AI Model Inference Serving.

Our intention in this post is to demonstrate that: 1) a few simple optimization techniques can result in meaningful performance gains, and 2) reaching such results does not require specialized expertise in performance analyzers (such as Intel® VTune™ Profiler) or in the inner workings of the low-level compute kernels. Importantly, the process of AI model optimization can differ considerably based on the model architecture and runtime environment. Optimizing for training will differ from optimizing for inference. Optimizing a transformer model will differ from optimizing a CNN model. Optimizing a 22-billion-parameter model will differ from optimizing a 100-million-parameter model. Optimizing a model to run on a GPU will differ from optimizing it for a CPU. Even different generations of the same CPU family may have different computation components and, consequently, different optimization techniques. While the high-level steps for optimizing a given model on a given instance are fairly standard, the specific course it will take and the end result can vary greatly based on the project at hand.

The code snippets we share are intended for demonstration purposes. Please do not rely on their accuracy or their optimality. Please do not interpret our mention of any tool or technique as an endorsement for its use. Ultimately, the best design choices for your use case will depend greatly on the details of your project and, given the extent of the potential impact on performance, should be evaluated with the appropriate time and attention.

Why CPU?

With the ever-increasing variety of hardware solutions for executing AI/ML model inference, our choice of a CPU may seem surprising. In this section, we describe some scenarios in which a CPU may be the preferred platform for inference.

  1. Accessibility: Using dedicated AI accelerators, such as GPUs, typically requires dedicated deployment and maintenance or, alternatively, access to such instances on a cloud service platform. CPUs, on the other hand, are everywhere. Designing a solution to run on a CPU provides much greater flexibility and increases the opportunities for deployment.
  2. Availability: Even if your algorithm can access an AI accelerator, there is the question of availability. AI accelerators are in extremely high demand, and even when/if you are able to acquire one, whether on-prem or in the cloud, you may choose to prioritize it for tasks that are even more resource intensive, such as AI model training.
  3. Reduced Latency: There are many situations in which your AI model is only one component in a pipeline of software algorithms running on a common CPU. While the AI model may perform significantly faster on an AI accelerator, when taking into account the time required to send an inference request over the network, it is quite possible that running it on the same CPU will be faster.
  4. Underuse of the Accelerator: AI accelerators are typically quite expensive. To justify their cost, your goal should be to keep them fully occupied, minimizing their idle time. In some cases, the inference load will not justify the cost of an expensive AI accelerator.
  5. Model Architecture: These days, we tend to automatically assume that AI models will perform significantly better on AI accelerators than on CPUs. And while, more often than not, this is indeed the case, your model may include layers that perform better on CPU. For example, sequential algorithms such as Non-Maximum Suppression (NMS) and the Hungarian matching algorithm tend to perform better on CPU than GPU and are often offloaded onto the CPU even when a GPU is available (e.g., see here). If your model contains many such layers, running it on a CPU may not be such a bad option.

Why Intel Xeon?

Intel® Xeon® Scalable CPU processors include built-in accelerators for the matrix and convolution operators that are common in typical AI/ML workloads. These include AVX-512 (introduced in Gen1), the VNNI extension (Gen2), and AMX (Gen4). The AMX engine, in particular, includes specialized hardware instructions for executing AI models using bfloat16 and int8 precision data types. The acceleration engines are tightly integrated with Intel's optimized software stack, which includes oneDNN, OpenVINO, and the Intel Extension for PyTorch (IPEX). These libraries utilize the dedicated Intel® Xeon® hardware capabilities to optimize model execution with minimal code changes.
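Before relying on these accelerators, it can be useful to verify that the relevant instruction-set extensions are actually exposed on your instance. Below is a minimal check, assuming a Linux host where the CPU feature flags appear in /proc/cpuinfo:

# Read the CPU feature flags reported by the Linux kernel
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()

# Standard Linux flag names for the features mentioned above
for feature in ["avx512f", "avx512_vnni", "amx_tile", "amx_bf16", "amx_int8"]:
    print(f"{feature}: {'available' if feature in cpu_flags else 'not found'}")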

Despite the arguments made in this section, the choice of inference vehicle should be made after considering all available options and after assessing the opportunities for optimization on each. In the next sections, we will introduce a toy experiment and explore some of the optimization opportunities on CPU.

Inference Experiment

In this section, we define a toy AI model inference experiment comprising a ResNet50 image classification model, a randomly generated input batch, and a simple benchmarking utility which we use to report the average number of input samples processed per second (SPS).

import torch, torchvision
import time


def get_model():
    model = torchvision.models.resnet50()
    model = model.eval()
    return model


def get_input(batch_size):
    batch = torch.randn(batch_size, 3, 224, 224)
    return batch


def get_inference_fn(model):
    def infer_fn(batch):
        with torch.inference_mode():
            output = model(batch)
        return output
    return infer_fn


def benchmark(infer_fn, batch):
    # warm-up
    for _ in range(10):
        _ = infer_fn(batch)

    iters = 100

    start = time.time()
    for _ in range(iters):
        _ = infer_fn(batch)
    end = time.time()

    return (end - start) / iters


batch_size = 1
model = get_model()
batch = get_input(batch_size)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

The baseline performance of our toy model is 22.76 samples per second (SPS).

Model Inference Optimization

In this section, we apply a number of optimizations to our toy experiment and assess their impact on runtime performance. Our focus will be on optimization techniques that can be applied with relative ease. While it is quite likely that additional performance gains could be achieved, these may require much greater specialization and a more significant time investment.

Our focus will be on optimizations that do not change the model architecture; optimization techniques such as model distillation and model pruning are outside the scope of this post. Also out of scope are methods for optimizing specific model components, e.g., by implementing custom PyTorch operators.

In a previous post, we discussed AI model optimization on Intel Xeon CPUs in the context of training workloads. In this section, we will revisit some of the techniques mentioned there, this time in the context of AI model inference. We will complement these with optimization techniques that are unique to inference settings, including model compilation for inference, INT8 quantization, and multi-worker inference.

The order in which we present the optimization methods is not binding. In fact, some of the techniques are interdependent; for example, increasing the number of inference workers could impact the optimal choice of batch size.

Optimization 1: Batched Inference

A common method for increasing resource utilization while reducing the average inference response time is to group input samples into batches. In real-world scenarios, we need to make sure to cap the batch size so that we meet the service-level response time requirements, but for the purposes of our experiment we ignore this requirement. Experimenting with different batch sizes, we find that a batch size of 8 results in a throughput of 26.28 SPS, 15% higher than the baseline result, as shown in the sketch below.

Note that in the case that the shapes of the input samples vary, batching requires more handling (e.g., see here).
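Below is a minimal sketch of the batched run, reusing the utilities defined above; only the batch size changes relative to the baseline:

# Batched inference: identical to the baseline run, with batch_size set to 8
batch_size = 8
model = get_model()
batch = get_input(batch_size)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")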

Optimization 2: Channels-Last Memory Format

By default in PyTorch, 4D tensors are stored in NCHW format, i.e., the four dimensions represent the batch size, channels, height, and width, respectively. However, the channels-last, or NHWC, format (i.e., batch size, height, width, and channels) exhibits better performance on CPU. Adjusting our inference script to apply the channels-last optimization is a simple matter of setting the memory format of both the model and the input to torch.channels_last, as shown below:

def get_model(channels_last=False):
    model = torchvision.models.resnet50()
    if channels_last:
        model = model.to(memory_format=torch.channels_last)
    model = model.eval()
    return model

def get_input(batch_size, channels_last=False):
    batch = torch.randn(batch_size, 3, 224, 224)
    if channels_last:
        batch = batch.to(memory_format=torch.channels_last)
    return batch


batch_size = 8
model = get_model(channels_last=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

Applying the channels-last memory optimization results in an additional boost of 25% in throughput.

The impact of this optimization is most noticeable on models that have many convolutional layers. It is not expected to make a noticeable impact on other model architectures (e.g., transformer models).

Please see the PyTorch documentation for more details on the memory format optimization and the Intel documentation for details on how this is implemented internally in oneDNN.

Optimization 3: Automatic Mixed Precision

Modern Intel® Xeon® Scalable processors (from Gen3) include native support for the bfloat16 data type, a 16-bit floating-point alternative to the standard float32. We can take advantage of this by applying PyTorch's automatic mixed precision package, torch.amp, as demonstrated below:

def get_inference_fn(model, enable_amp=False):
    def infer_fn(batch):
        with torch.inference_mode(), torch.amp.autocast(
                'cpu',
                dtype=torch.bfloat16,
                enabled=enable_amp
        ):
            output = model(batch)
        return output
    return infer_fn

batch_size = 8
model = get_model(channels_last=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

The result of applying mixed precision is a throughput of 86.95 samples per second, 2.6 times the previous experiment and 3.8 times the baseline result.

Note that using a reduced-precision floating-point type can affect numerical accuracy, and its effect on model quality must be evaluated.

Optimization 4: Memory Allocation Optimization

Typical AI/ML workloads require the allocation and access of large blocks of memory. A number of optimization techniques are aimed at tuning the way memory is allocated and used during model execution. One common step is to replace the default system allocator (ptmalloc) with alternative memory allocation libraries, such as Jemalloc and TCMalloc, which have been shown to perform better on common AI/ML workloads (e.g., see here). To install TCMalloc run:

sudo apt-get install google-perftools

We configure its use via the LD_PRELOAD environment variable:

LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 python main.py

This optimization results in another significant performance boost: 117.54 SPS, 35% higher than our previous experiment!
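If you would rather experiment with Jemalloc, the flow is similar. The package name and library path below are typical for Ubuntu 22.04 and may differ on your system:

sudo apt-get install libjemalloc2
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 python main.py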

Optimization 5: Enable Huge Page Allocations

By default, the Linux kernel allocates memory in blocks of 4 KB, commonly known as pages. The mapping between virtual and physical memory addresses is managed by the CPU's Memory Management Unit (MMU), which uses a small hardware cache called the Translation Lookaside Buffer (TLB). The TLB is limited in the number of entries it can hold. When you have many small pages (as in large neural network models), the number of TLB cache misses can climb quickly, increasing latency and slowing down the program. A common solution to address this is to use "huge pages", blocks of 2 MB (or 1 GB) per page. This reduces the number of TLB entries required, improving memory access efficiency and lowering allocation latency. We enable huge page allocations by setting the following environment variable:

export THP_MEM_ALLOC_ENABLE=1

In the case of our model, the impact is negligible. Nevertheless, this is an important optimization for many AI/ML workloads.
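To check whether transparent huge pages are enabled at the operating-system level, you can inspect the standard sysfs entry. The path below is typical for Linux kernels and is shown for reference:

cat /sys/kernel/mm/transparent_hugepage/enabled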

Optimization 6: IPEX

Intel® Extension for PyTorch (IPEX) is a library extension for PyTorch with the latest performance optimizations for Intel hardware. To install it we run:

pip install intel_extension_for_pytorch

In the code block below, we demonstrate the basic use of the ipex.optimize API.

import intel_extension_for_pytorch as ipex

def get_model(channels_last=False, ipex_optimize=False):
    model = torchvision.models.resnet50()

    if channels_last:
        model = model.to(memory_format=torch.channels_last)

    model = model.eval()

    if ipex_optimize:
        model = ipex.optimize(model, dtype=torch.bfloat16)

    return model
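As in the previous experiments, we rerun the benchmark with the IPEX-optimized model. Below is a minimal sketch reusing the utilities defined above; we keep mixed precision enabled, as is recommended when passing dtype=torch.bfloat16 to ipex.optimize:

# Benchmark the IPEX-optimized model (channels-last + bfloat16 + autocast)
batch_size = 8
model = get_model(channels_last=True, ipex_optimize=True)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"\nAverage samples per second: {(batch_size/avg_time):.2f}")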

The resultant throughput is 159.31 SPS, for an additional 36% performance boost.

Please see the official documentation for more details on the many optimizations that IPEX has to offer.

Optimization 7: Model Compilation

Another popular PyTorch optimization is torch.compile. Introduced in PyTorch 2.0, this just-in-time (JIT) compilation feature performs kernel fusion and other optimizations. In a previous post, we covered PyTorch compilation in great detail, including some of its many features, controls, and limitations. Here we demonstrate its basic use:

def get_model(channels_last=False, ipex_optimize=False, compile=False):
    model = torchvision.models.resnet50()

    if channels_last:
        model = model.to(memory_format=torch.channels_last)

    model = model.eval()

    if ipex_optimize:
        model = ipex.optimize(model, dtype=torch.bfloat16)

    if compile:
        model = torch.compile(model)

    return model

Applying torch.compile to the IPEX-optimized model results in a throughput of 144.5 SPS, which is lower than our previous experiment. In the case of our model, IPEX and torch.compile do not coexist well. When applying just torch.compile, the throughput is 133.36 SPS.

The general takeaway from this experiment is that, for a given model, any two optimization techniques may interfere with one another. This necessitates evaluating the impact of multiple configurations on the runtime performance of a given model in order to find the best one.

Optimization 8: Auto-tune Environment Setup With torch.backends.xeon.run_cpu

There are a number of environment settings that control thread and memory management and can be used to further fine-tune the runtime performance of an AI/ML workload. Rather than setting these manually, PyTorch offers the torch.backends.xeon.run_cpu script, which does this automatically. In preparation for using this script, we install Intel's threading and multiprocessing libraries, oneTBB and Intel OpenMP. We also add a symbolic link to our TCMalloc installation.

# install TBB
sudo apt install -y libtbb12
# install openMP
pip install intel-openmp
# link to tcmalloc
sudo ln -sf /usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 /usr/lib/libtcmalloc.so
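With these dependencies in place, we launch our benchmark through the run_cpu module. Below is a minimal invocation, assuming the benchmark script above is saved as main.py:

python -m torch.backends.xeon.run_cpu main.py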

In the case of our toy model, using torch.backends.xeon.run_cpu increases the throughput to 162.15 SPS, a slight increase over our previous maximum of 159.31 SPS.

Please see the PyTorch documentation for more features of torch.backends.xeon.run_cpu and more details on the environment variables it applies.

Optimization 9: Multi-worker Inference

Another popular technique for increasing resource utilization and scale is to load multiple instances of the AI model and run them in parallel in separate processes. Although this technique is more commonly applied on machines with many CPUs (separated into multiple NUMA nodes), not on our small 4-vCPU instance, we include it here for the sake of demonstration. In the script below we run two instances of our model in parallel:

python -m torch.backends.xeon.run_cpu --ninstances 2 main.py

This results in a throughput of 169.4 SPS, an additional modest but meaningful 4% increase.

Optimization 10: INT8 Quantization

INT8 quantization is another common technique for accelerating AI model inference execution. In INT8 quantization, the floating-point data types of the model weights and activations are replaced by 8-bit integers. Intel's Xeon processors include dedicated accelerators for processing INT8 operations (e.g., see here). INT8 quantization can result in a meaningful increase in speed and a lower memory footprint. Importantly, the reduced bit-precision can have a significant impact on the quality of the model output. There are many different approaches to INT8 quantization, some of which include calibration or retraining. There are also a wide variety of tools and libraries for applying quantization. A full discussion of quantization is beyond the scope of this post.

Since in this post we are interested only in the potential performance impact, we demonstrate one quantization scheme using TorchAO, without consideration of the impact on model quality. In the code block below, we implement PyTorch 2 Export Quantization with X86 Backend through Inductor. Please see the documentation for the full details:

from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.x86_inductor_quantizer as xiq

def quantize_model(model):
    x = torch.randn(4, 3, 224, 224).contiguous(
                            memory_format=torch.channels_last)
    example_inputs = (x,)
    batch_dim = torch.export.Dim("batch")
    with torch.no_grad():
        exported_model = torch.export.export(
            model,
            example_inputs,
            dynamic_shapes=((batch_dim,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC,
                             torch.export.Dim.STATIC),
                            )
        ).module()
    quantizer = xiq.X86InductorQuantizer()
    quantizer.set_global(xiq.get_default_x86_inductor_quantization_config())
    prepared_model = prepare_pt2e(exported_model, quantizer)
    prepared_model(*example_inputs)
    converted_model = convert_pt2e(prepared_model)
    optimized_model = torch.compile(converted_model)
    return optimized_model


batch_size = 8
model = get_model(channels_last=True)
model = quantize_model(model)
batch = get_input(batch_size, channels_last=True)
infer_fn = get_inference_fn(model, enable_amp=True)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

This results in a throughput of 172.67 SPS.

Please see here for more details on quantization in PyTorch.

Optimization 11: Graph Compilation and Execution With ONNX

There are a number of third-party libraries that specialize in compiling PyTorch models into graph representations and optimizing them for runtime performance on target inference devices. One of the most popular libraries for this is Open Neural Network Exchange (ONNX). ONNX performs ahead-of-time compilation of AI/ML models and executes them using a dedicated runtime library.

While ONNX compilation support is included in PyTorch, we require the following library for executing an ONNX model:

pip install onnxruntime

In the code block below, we demonstrate ONNX compilation and model execution:

def export_to_onnx(model, onnx_path="resnet50.onnx"):
    dummy_input = torch.randn(4, 3, 224, 224)
    batch = torch.export.Dim("batch")
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],
        output_names=["output"],
        dynamic_shapes=((batch,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC,
                         torch.export.Dim.STATIC),
                        ),
        dynamo=True
    )
    return onnx_path

def onnx_infer_fn(onnx_path):
    import onnxruntime as ort

    sess = ort.InferenceSession(
        onnx_path,
        providers=["CPUExecutionProvider"]
    )
    input_name = sess.get_inputs()[0].name

    def infer_fn(batch):
        result = sess.run(None, {input_name: batch})
        return result
    return infer_fn

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
batch = get_input(batch_size).numpy()
infer_fn = onnx_infer_fn(onnx_path)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

The resultant throughput is 44.92 SPS, far lower than in our previous experiments. In the case of our toy model, the ONNX runtime does not provide a benefit.

Optimization 12: Graph Compilation and Execution with OpenVINO

Another open-source toolkit aimed at deploying highly performant AI solutions is OpenVINO. OpenVINO is highly optimized for model execution on Intel hardware, e.g., by fully leveraging the Intel AMX instructions. A common way to apply OpenVINO in PyTorch is to first convert the model to ONNX:

from openvino import Core

def compile_openvino_model(onnx_path):
    core = Core()
    model = core.read_model(onnx_path)
    compiled = core.compile_model(model, "CPU")
    return compiled

def openvino_infer_fn(compiled_model):
    def infer_fn(batch):
        result = compiled_model([batch])[0]
        return result
    return infer_fn

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
ovm = compile_openvino_model(onnx_path)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(ovm)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

The result of this optimization is a throughput of 297.33 SPS, nearly twice as fast as our previous best experiment!

Please see the official documentation for more details on OpenVINO.

Optimization 13: INT8 Quantization in OpenVINO with NNCF

As our final optimization, we revisit INT8 quantization, this time within the framework of OpenVINO compilation. As before, there are a number of methods for performing quantization, aimed at minimizing the impact on model quality. Here we demonstrate the basic flow using the NNCF library as documented here.

class RandomDataset(torch.utils.data.Dataset):

    def __len__(self):
        return 10000

    def __getitem__(self, idx):
        return torch.randn(3, 224, 224)

def nncf_quantize(onnx_path):
    import nncf

    core = Core()
    onnx_model = core.read_model(onnx_path)
    calibration_loader = torch.utils.data.DataLoader(RandomDataset())
    input_name = onnx_model.inputs[0].get_any_name()
    transform_fn = lambda data_item: {input_name: data_item.numpy()}
    calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)
    quantized_model = nncf.quantize(onnx_model, calibration_dataset)
    return core.compile_model(quantized_model, "CPU")

batch_size = 8
model = get_model()
onnx_path = export_to_onnx(model)
q_model = nncf_quantize(onnx_path)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(q_model)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

This results in a throughput of 482.46 (!) SPS, another dramatic improvement and over 18 times faster than our baseline experiment.

Results

We summarize the results of our experiments in the table below:

ResNet50 Inference Experiment (by Author)

In the case of our toy model, the optimization steps we demonstrated resulted in huge performance gains. Importantly, the impact of each optimization can vary greatly based on the details of the model. You may find that some of these techniques do not apply to your model, or do not result in improved performance. For example, when we reapply the same sequence of optimizations to a Vision Transformer (ViT) model, the resultant performance boost is 8.41X, still significant, but lower than the 18.36X of our experiment. Please see the appendix to this post for details.

Our focus has been on runtime performance, but it is critical that you also evaluate the impact of each optimization on other metrics that are important to you, most importantly model quality. There are, undoubtedly, many more optimization techniques that can be applied; we have merely scratched the surface. Hopefully, the techniques demonstrated here will provide a useful starting point for optimizing your own models.

Summary

This post continues our series on the important topic of AI/ML model runtime performance analysis and optimization. Our focus in this post was on model inference on Intel® Xeon® CPU processors. Given the ubiquity and prevalence of CPUs, the ability to execute models on them in a reliable and performant manner can be extremely compelling. As we have shown, by applying a number of relatively simple techniques, we can achieve considerable gains in model performance, with profound implications for inference costs and inference latency.

Please don't hesitate to reach out with comments, questions, or corrections.

Appendix: Vision Transformer Optimization

To demonstrate how the impact of the runtime optimizations we discussed depends on the details of the AI/ML model, we reran our experiment on a Vision Transformer (ViT) model from the popular timm library:

from timm.models.vision_transformer import VisionTransformer

def get_model(channels_last=False, ipex_optimize=False, compile=False):
    model = VisionTransformer()

    if channels_last:
        model = model.to(memory_format=torch.channels_last)

    model = model.eval()

    if ipex_optimize:
        model = ipex.optimize(model, dtype=torch.bfloat16)

    if compile:
        model = torch.compile(model)

    return model

One modification in this experiment was to apply OpenVINO compilation directly to the PyTorch model rather than to an intermediate ONNX model. This was due to the fact that OpenVINO compilation failed on the ViT ONNX model. The revised NNCF quantization and OpenVINO compilation sequence is shown below:

import openvino as ov
import nncf


batch_size = 8
model = get_model()
calibration_loader = torch.utils.data.DataLoader(RandomDataset())
calibration_dataset = nncf.Dataset(calibration_loader)

# quantize PyTorch model
model = nncf.quantize(model, calibration_dataset)
ovm = ov.convert_model(model, example_input=torch.randn(1, 3, 224, 224))
ovm = ov.compile_model(ovm)
batch = get_input(batch_size).numpy()
infer_fn = openvino_infer_fn(ovm)
avg_time = benchmark(infer_fn, batch)
print(f"nAverage samples per second: {(batch_size/avg_time):.2f}")

The table below summarizes the results of the optimizations discussed in this post when applied to the ViT model:

Vision Transformer Inference Experiment (by Author)