Boosting PyTorch Inference on CPU: From Post-Training Quantization to Multithreading
Problem Statement: Deep Learning Inference under Limited Time and Computation Constraints
Approaching Deep Learning Inference on CPU
Model Selection
Post-Training Quantization
Multithreading with ThreadPoolExecutor
Summary
Enjoyed This Story?
References

For an in-depth explanation of post-training quantization and a comparison of ONNX Runtime and OpenVINO, I recommend this article:

This section takes a closer look at two popular post-training quantization techniques:

ONNX Runtime

One popular approach to speed up inference on CPU was to convert the final models to the ONNX (Open Neural Network Exchange) format [2, 7, 9, 10, 14, 15].

The relevant steps to quantize and speed up inference on CPU with ONNX Runtime are shown below:

Install ONNX Runtime

pip install onnxruntime

Convert PyTorch Model to ONNX

import torch
import torchvision

# Define your model here
model = ...

# Train model here
...

# Define dummy input (on the same device as the trained model)
dummy_input = torch.randn(1, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT, device="cuda")

# Export PyTorch model to ONNX format and name the input "input"
# so it can be referenced by name at inference time
torch.onnx.export(model, dummy_input, "model.onnx", input_names=["input"])

Make predictions with an ONNX Runtime session

import onnxruntime as rt

# Define X_test as a NumPy float32 array with shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
X_test = ...

# Define ONNX Runtime session
sess = rt.InferenceSession("model.onnx")

# Make prediction (passing None returns all model outputs)
y_pred = sess.run(None, {"input": X_test})[0]
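
Note that the exported model.onnx above still holds FP32 weights. If you also want INT8 post-training quantization within the ONNX Runtime ecosystem, the sketch below shows one option using dynamic quantization from onnxruntime.quantization; the file names are placeholders, and for convolution-heavy vision models static quantization with calibration data may be the better fit.

import onnxruntime as rt
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamically quantize the FP32 ONNX model weights to INT8
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)

# The quantized model is used with an InferenceSession exactly as before
sess = rt.InferenceSession("model_int8.onnx")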

OpenVINO

An equally popular approach to speed up inference on CPU was to use OpenVINO (Open Visual Inference and Neural Network Optimization) [5, 6, 12], as shown in this Kaggle Notebook:

The relevant steps to quantize and speed up a Deep Learning model with OpenVINO are shown below:

Install OpenVINO

pip install openvino-dev[onnx]

Convert PyTorch Model to ONNX (see Step 1 of ONNX Runtime)

Convert ONNX Model to OpenVINO

mo --input_model model.onnx

This will output an XML file and a BIN file, of which we'll be using the XML file in the following step.

Quantize to INT8 using OpenVINO

import openvino.runtime as ov

core = ov.Core()
openvino_model = core.read_model(model='model.xml')
compiled_model = core.compile_model(openvino_model, device_name="CPU")
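
The code above loads the converted model and compiles it for CPU, but the weights are still in floating point at this stage. If you want the weights actually quantized to INT8, one option (an addition here, not taken from the competition write-ups) is post-training quantization with NNCF (pip install nncf), which works directly on OpenVINO models; calibration_images is a hypothetical list of preprocessed inputs:

import nncf
import numpy as np
import openvino.runtime as ov
from openvino.runtime import serialize

core = ov.Core()
openvino_model = core.read_model(model="model.xml")

# Define a small, representative list of preprocessed inputs here,
# each with shape (N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
calibration_images = ...

def transform_fn(image):
    # Add the batch dimension expected by the model
    return np.expand_dims(image, 0)

calibration_dataset = nncf.Dataset(calibration_images, transform_fn)

# Run NNCF post-training quantization and save the INT8 model
quantized_model = nncf.quantize(openvino_model, calibration_dataset)
serialize(quantized_model, "model_int8.xml")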

Make predictions with an OpenVINO inference request

# Define X_test as a NumPy float32 array with shape (BATCH_SIZE, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
X_test = ...

# Create inference request
infer_request = compiled_model.create_infer_request()

# Make prediction and read the first model output
results = infer_request.infer(inputs=[X_test])
y_pred = results[compiled_model.output(0)]

Comparison: ONNX vs. OpenVINO vs. Alternatives

Both ONNX Runtime and OpenVINO are frameworks optimized for deploying models on CPUs. The inference times of a neural network quantized with ONNX Runtime and OpenVINO are reported to be comparable [12].
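
If you want to check this claim for your own model, a minimal timing sketch is shown below. It assumes the model.onnx and model.xml files produced in the steps above, an ONNX input named "input", and placeholder input dimensions:

import time

import numpy as np
import onnxruntime as rt
import openvino.runtime as ov

# Placeholder input dimensions -- replace with your model's values
N_CHANNELS, IMG_WIDTH, IMG_HEIGHT = 3, 224, 224
X_test = np.random.rand(1, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT).astype(np.float32)
N_RUNS = 100

# Time ONNX Runtime inference on CPU
sess = rt.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
start = time.perf_counter()
for _ in range(N_RUNS):
    sess.run(None, {"input": X_test})
print(f"ONNX Runtime: {(time.perf_counter() - start) / N_RUNS:.4f} s per inference")

# Time OpenVINO inference on CPU
core = ov.Core()
compiled_model = core.compile_model(core.read_model("model.xml"), device_name="CPU")
infer_request = compiled_model.create_infer_request()
start = time.perf_counter()
for _ in range(N_RUNS):
    infer_request.infer(inputs=[X_test])
print(f"OpenVINO: {(time.perf_counter() - start) / N_RUNS:.4f} s per inference")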

Some competitors used PyTorch JIT [3] or TorchScript [1] as alternatives to speed up inference on CPU. However, other competitors shared that ONNX was considerably faster than TorchScript [10].
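
For completeness, here is a minimal TorchScript sketch for CPU inference via tracing; the model variable and the input dimensions are placeholders from the earlier steps:

import torch

# Placeholder input dimensions -- replace with your model's values
N_CHANNELS, IMG_WIDTH, IMG_HEIGHT = 3, 224, 224

# model is the trained PyTorch model from the steps above
model = model.eval().cpu()

# Trace the model into TorchScript with a dummy CPU input and save it
dummy_input = torch.randn(1, N_CHANNELS, IMG_WIDTH, IMG_HEIGHT)
traced_model = torch.jit.trace(model, dummy_input)
traced_model.save("model_traced.pt")

# Load the traced model and run CPU inference
loaded_model = torch.jit.load("model_traced.pt", map_location="cpu")
with torch.no_grad():
    y_pred = loaded_model(dummy_input)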

Multithreading with ThreadPoolExecutor

Another popular approach to speed up inference on CPU was to use multithreading with ThreadPoolExecutor [2, 3, 9, 15] in addition to post-training quantization, as shown in this Kaggle Notebook:

This enabled competitors to run multiple inferences at the same time.

In the following example of ThreadPoolExecutor from the competition, we have a list of audio files to run inference on.

audios = ['audio_1.ogg',
          'audio_2.ogg',
          # ...,
          'audio_n.ogg']

Next, you define an inference function that takes an audio file as input and returns the predictions.

def predict(audio_path):
    # Define any preprocessing of the audio file here
    ...

    # Make predictions
    ...

    return predictions

With the list of audio files (audios) and the inference function (predict()), you can now use ThreadPoolExecutor to run multiple inferences at the same time (in parallel) instead of sequentially, which will give you a nice boost in inference time.

import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    dicts = list(executor.map(predict, audios))

Summary

There are many more lessons to be learned from reviewing the learning resources Kagglers have created during the “BirdCLEF 2023” competition. There are also many different solutions to this type of problem statement.

In this article, we focused on the general approach that was popular among many competitors:

  • Model Selection: Select the model size according to the best trade-off between performance and inference time. Also, leverage bigger and smaller models in your ensemble.
  • Post-Training Quantization: Post-training quantization can lead to faster inference times because the datatypes of the model weights and activations are optimized to the hardware. However, this can come at the cost of a slight loss in model performance.
  • Multithreading: Run multiple inferences in parallel instead of sequentially. This will give you a boost in inference time.

If you are interested in how to approach audio classification with Deep Learning, which was the main aspect of this competition, check out the write-up of the BirdCLEF 2022 competition:
