Speed up your models with 🤗 Optimum Intel and OpenVINO

Ella Charlaix, Julien Simon

Last July, we announced that Intel and Hugging Face would collaborate on building state-of-the-art yet simple hardware acceleration tools for Transformer models.

Today, we’re very happy to announce that we added Intel OpenVINO to Optimum Intel. You can now easily perform inference with OpenVINO Runtime on a wide range of Intel processors (see the full list of supported devices) using Transformers models hosted either on the Hugging Face hub or locally. You can also quantize your model with the OpenVINO Neural Network Compression Framework (NNCF), and reduce its size and prediction latency in just minutes.

This first release is based on OpenVINO 2022.2 and enables inference for a large quantity of PyTorch models using our OVModels. Post-training static quantization and quantization aware training can be applied on many encoder models (BERT, DistilBERT, etc.). More encoder models will be supported in the upcoming OpenVINO release. Currently, the quantization of Encoder Decoder models is not enabled; however, this restriction should be lifted with our integration of the next OpenVINO release.

Let us show you how to get started in minutes!



Quantizing a Vision Transformer with Optimum Intel and OpenVINO


In this example, we’ll run post-training static quantization on a Vision Transformer (ViT) model fine-tuned for image classification on the food101 dataset.

Quantization is a process that lowers memory and compute requirements by reducing the bit width of model parameters. Reducing the number of bits means that the resulting model requires less memory at inference time, and that operations like matrix multiplication can be performed faster thanks to integer arithmetic.
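To make the idea concrete, here is a small NumPy sketch (purely illustrative, not part of Optimum Intel) of how affine int8 quantization maps floating-point values to 8-bit integers with a scale and zero-point:

import numpy as np

# Illustrative affine (asymmetric) quantization of a float32 tensor to uint8.
weights = np.random.randn(4, 4).astype(np.float32)

scale = (weights.max() - weights.min()) / 255.0        # step between integer levels
zero_point = int(np.round(-weights.min() / scale))     # integer that represents 0.0

# Quantize: float32 -> uint8
q_weights = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)

# Dequantize: uint8 -> float32 approximation of the original values
deq_weights = (q_weights.astype(np.float32) - zero_point) * scale

print("max absolute error:", np.abs(weights - deq_weights).max())

Each uint8 value takes one byte instead of four, and integer matrix multiplications are cheaper on hardware that supports them, at the cost of a small approximation error.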

First, let’s create a virtual environment and install all dependencies.

virtualenv openvino
source openvino/bin/activate
pip install pip --upgrade
pip install optimum[openvino,nncf] torchvision evaluate

Next, moving to a Python environment, we import the appropriate modules and download the original model as well as its processor.

from transformers import AutoImageProcessor, AutoModelForImageClassification
​
model_id = "juliensimon/autotrain-food101-1471154050"
model = AutoModelForImageClassification.from_pretrained(model_id)
processor = AutoImageProcessor.from_pretrained(model_id)


Post-training static quantization requires a calibration step where data is fed through the network in order to compute the quantized activation parameters. Here, we take 300 samples from the original dataset to build the calibration dataset.

from optimum.intel.openvino import OVQuantizer
​
quantizer = OVQuantizer.from_pretrained(model)
calibration_dataset = quantizer.get_calibration_dataset(
    "food101",
    num_samples=300,
    dataset_split="train",
)

As usual with image datasets, we need to apply the same image transformations that were used at training time. We use the preprocessing defined in the processor. We also define a data collation function to feed the model batches of properly formatted tensors.

import torch
from torchvision.transforms import (
    CenterCrop,
    Compose,
    Normalize,
    Resize,
    ToTensor,
)
​
normalize = Normalize(mean=processor.image_mean, std=processor.image_std)
size = processor.size["height"]
_val_transforms = Compose(
    [
        Resize(size),
        CenterCrop(size),
        ToTensor(),
        normalize,
    ]
)
def val_transforms(example_batch):
    example_batch["pixel_values"] = [_val_transforms(pil_img.convert("RGB")) for pil_img in example_batch["image"]]
    return example_batch
​
calibration_dataset.set_transform(val_transforms)
​
def collate_fn(examples):
    pixel_values = torch.stack([example["pixel_values"] for example in examples])
    labels = torch.tensor([example["label"] for example in examples])
    return {"pixel_values": pixel_values, "labels": labels}

For our first attempt, we use the default configuration for quantization. You can also specify the number of samples to use during the calibration step, which is 300 by default.

from optimum.intel.openvino import OVConfig
​
quantization_config = OVConfig()
quantization_config.compression["initializer"]["range"]["num_init_samples"] = 300

We’re now ready to quantize the model. The OVQuantizer.quantize() method quantizes the model and exports it to the OpenVINO format. The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can run on any target Intel® device.

save_dir = "quantized_model"


quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    data_collator=collate_fn,
    remove_unused_columns=False,
    save_directory=save_dir,
)
processor.save_pretrained(save_dir)
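As a quick sanity check, you can list the files written to the output directory. Note that the exact IR file names mentioned in the comment below are an assumption and may vary across Optimum Intel versions:

import os

# List the exported files; you should see the OpenVINO IR pair
# (typically an .xml topology file and a .bin weights file) along with
# the configuration and processor files.
for file_name in sorted(os.listdir(save_dir)):
    size_mb = os.path.getsize(os.path.join(save_dir, file_name)) / 1e6
    print(f"{file_name}: {size_mb:.1f} MB")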

A minute or two later, the model has been quantized. We can then easily load it with our OVModelForXxx classes, the equivalent of the Transformers AutoModelForXxx classes found in the transformers library. Likewise, we can create pipelines and run inference with OpenVINO Runtime.

from transformers import pipeline
from optimum.intel.openvino import OVModelForImageClassification
​
ov_model = OVModelForImageClassification.from_pretrained(save_dir)
ov_pipe = pipeline("image-classification", model=ov_model, image_processor=processor)
outputs = ov_pipe("http://farm2.staticflickr.com/1375/1394861946_171ea43524_z.jpg")
print(outputs)

To verify that quantization did not have a negative impact on accuracy, we applied an evaluation step to compare the accuracy of the original model with its quantized counterpart. We evaluate both models on a subset of the dataset (taking only 20% of the evaluation dataset). We observed little to no loss in accuracy, with both models having an accuracy of 87.6.

from datasets import load_dataset
from evaluate import evaluator


eval_dataset = load_dataset("food101", split="validation").select(range(5050))
task_evaluator = evaluator("image-classification")

ov_eval_results = task_evaluator.compute(
    model_or_pipeline=ov_pipe,
    data=eval_dataset,
    metric="accuracy",
    label_mapping=ov_pipe.model.config.label2id,
)

trfs_pipe = pipeline("image-classification", model=model, image_processor=processor)
trfs_eval_results = task_evaluator.compute(
    model_or_pipeline=trfs_pipe,
    data=eval_dataset,
    metric="accuracy",
    label_mapping=trfs_pipe.model.config.label2id,
)
print(trfs_eval_results, ov_eval_results)

Looking at the quantized model, we see that its memory size decreased by 3.8x, from 344MB to 90MB. Running a quick benchmark on 5050 image predictions, we also notice a 2.4x speedup in latency, from 98ms to 41ms per sample. That’s not bad for a few lines of code!

⚠️ An important thing to mention is that the model is compiled just before the first inference, which will inflate the latency of the first inference. So before running your own benchmark, make sure to first warm up your model by doing at least one prediction.
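If you’d like to reproduce a rough timing yourself, here is a minimal benchmark sketch reusing the pipeline and the image URL from above. It is only a sketch: the numbers you get will depend on your hardware, and a proper benchmark would use more runs and a representative set of images.

import time
import requests
from PIL import Image

# Load the image once so network time is not counted in the measurement.
url = "http://farm2.staticflickr.com/1375/1394861946_171ea43524_z.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Warmup: the first call triggers model compilation, so exclude it from timing.
ov_pipe(image)

num_runs = 20
start = time.perf_counter()
for _ in range(num_runs):
    ov_pipe(image)
latency_ms = (time.perf_counter() - start) / num_runs * 1000
print(f"average latency: {latency_ms:.1f} ms per prediction")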

You can find the resulting model hosted on the Hugging Face hub. To load it, you can easily do as follows:

from optimum.intel.openvino import OVModelForImageClassification
​
ov_model = OVModelForImageClassification.from_pretrained("echarlaix/vit-food101-int8")



Now it's your turn


As you can see, it’s pretty easy to speed up your models with 🤗 Optimum Intel and OpenVINO. If you’d like to get started, please visit the Optimum Intel repository, and don’t forget to give it a star ⭐. You’ll also find additional examples there. If you’d like to dive deeper into OpenVINO, the Intel documentation has you covered.


Give it a try and let us know what you think. We’d love to hear your feedback on the Hugging Face forum, and please feel free to request features or file issues on GitHub.

Have fun with 🤗 Optimum Intel, and thanks for reading.


