Intel and Hugging Face Partner to Democratize Machine Learning Hardware Acceleration

Julien Simon

The mission of Hugging Face is to democratize good machine learning and maximize its positive impact across industries and society. Not only do we try to advance Transformer models, but we also work hard on simplifying their adoption.

Today, we’re excited to announce that Intel has officially joined our Hardware Partner Program. Thanks to the Optimum open-source library, Intel and Hugging Face will collaborate to build state-of-the-art hardware acceleration to train, fine-tune and predict with Transformers.

Transformer models are increasingly large and complex, which can cause production challenges for latency-sensitive applications like search or chatbots. Unfortunately, latency optimization has long been a hard problem for Machine Learning (ML) practitioners. Even with deep knowledge of the underlying framework and hardware platform, it takes a lot of trial and error to figure out which knobs and features to leverage.

Intel provides a complete foundation for accelerated AI with the Intel Xeon Scalable CPU platform and a wide range of hardware-optimized AI software tools, frameworks, and libraries. Thus, it made perfect sense for Hugging Face and Intel to join forces and collaborate on building powerful model optimization tools that let users achieve the best performance, scale, and productivity on Intel platforms.

“We’re excited to work with Hugging Face to bring the latest innovations of Intel Xeon hardware and Intel AI software to the Transformers community, through open source integration and integrated developer experiences,” says Wei Li, Intel Vice President & General Manager, AI and Analytics.

In recent months, Intel and Hugging Face have collaborated on scaling Transformer workloads. We published detailed tuning guides and benchmarks on inference (part 1, part 2) and achieved single-digit millisecond latency for DistilBERT on the latest Intel Xeon Ice Lake CPUs. On the training side, we added support for Habana Gaudi accelerators, which deliver up to 40% better price-performance than GPUs.

The next logical step was to expand on this work and share it with the ML community. Enter the Optimum Intel open source library! Let’s take a deeper look at it.



Get Peak Transformers Performance with Optimum Intel

Optimum is an open-source library created by Hugging Face to simplify Transformer acceleration across a growing range of training and inference devices. Thanks to built-in optimization techniques, you can start accelerating your workloads in minutes, using ready-made scripts or applying minimal changes to your existing code. Beginners can use Optimum out of the box with excellent results. Experts can keep tweaking for maximum performance.

Optimum Intel is part of Optimum and builds on top of the Intel Neural Compressor (INC). INC is an open-source library that delivers unified interfaces across multiple deep learning frameworks for popular network compression techniques, such as quantization, pruning, and knowledge distillation. This tool supports automatic accuracy-driven tuning strategies to help users quickly build the best quantized model.

With Optimum Intel, you can apply state-of-the-art optimization techniques to your Transformers with minimal effort. Let’s look at a complete example.



Case study: Quantizing DistilBERT with Optimum Intel

In this example, we’ll run post-training quantization on a DistilBERT model fine-tuned for classification. Quantization is a process that shrinks memory and compute requirements by reducing the bit width of model parameters. For example, you can often replace 32-bit floating-point parameters with 8-bit integers at the expense of a small drop in prediction accuracy.
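
To make the idea concrete, here is a minimal, self-contained sketch of affine 8-bit quantization applied to a handful of made-up FP32 values. It is not part of the workflow below and only uses NumPy; the sample values and variable names are purely illustrative.

import numpy as np

# Toy FP32 weights
weights_fp32 = np.array([-1.2, -0.3, 0.0, 0.7, 1.5], dtype=np.float32)

# Map the FP32 range onto the signed 8-bit range [-128, 127]
scale = (weights_fp32.max() - weights_fp32.min()) / 255.0
zero_point = np.round(-128 - weights_fp32.min() / scale)
weights_int8 = np.clip(np.round(weights_fp32 / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize to see the small rounding error introduced by quantization
weights_dequant = (weights_int8.astype(np.float32) - zero_point) * scale
print(weights_int8)      # 8-bit values, 4x smaller than FP32
print(weights_dequant)   # close to the original weights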

We have already fine-tuned the original model to classify product reviews for shoes according to their star rating (from 1 to 5 stars). You can view this model and its quantized version on the Hugging Face hub. You can also test the original model in this Space.

Let’s get started! All code is available in this notebook.

As usual, the first step is to install all required libraries. It’s worth mentioning that we have to work with a CPU-only version of PyTorch for the quantization process to work properly.

pip -q uninstall torch -y 
pip -q install torch==1.11.0+cpu --extra-index-url https://download.pytorch.org/whl/cpu
pip -q install transformers datasets optimum[neural-compressor] evaluate --upgrade

Then, we prepare an evaluation dataset to assess model performance during quantization. Starting from the dataset we used to fine-tune the original model, we only keep a few thousand reviews and their labels and save them to local storage.
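
The exact preparation code lives in the notebook; a rough sketch of this step, with the dataset name taken from the loading code below and the split, subset size, and local path as illustrative assumptions, could look like this:

from datasets import load_dataset

# Keep a few thousand reviews and their labels, then save them to local storage
# (split, subset size, and path are assumptions for illustration)
reviews = load_dataset("prashantgrao/amazon-shoe-reviews", split="train")
eval_subset = reviews.shuffle(seed=42).select(range(3000))
eval_subset.save_to_disk("./eval_dataset")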

Next, we load the original model, its tokenizer, and the evaluation dataset from the Hugging Face hub.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "juliensimon/distilbert-amazon-shoe-reviews"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(model_name)
eval_dataset = load_dataset("prashantgrao/amazon-shoe-reviews", split="test").select(range(300))

Next, we define an evaluation function that computes model metrics on the evaluation dataset. This allows the Optimum Intel library to compare these metrics before and after quantization. For this purpose, the Hugging Face evaluate library is very convenient!

import evaluate

# Compute accuracy on the evaluation dataset; Optimum Intel calls this
# function before and after each quantization step
def eval_func(model):
    task_evaluator = evaluate.evaluator("text-classification")
    results = task_evaluator.compute(
        model_or_pipeline=model,
        tokenizer=tokenizer,
        data=eval_dataset,
        metric=evaluate.load("accuracy"),
        label_column="labels",
        label_mapping=model.config.label2id,
    )
    return results["accuracy"]
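
As a quick sanity check, you can call this function on the original model to get a baseline accuracy before quantization; the result should be close to the Baseline column of the tuning log shown further down.

# Accuracy of the original FP32 model on the evaluation subset
baseline_accuracy = eval_func(model)
print(f"Baseline accuracy: {baseline_accuracy:.4f}")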

We then set up the quantization job using a configuration. You can find details on this configuration in the Neural Compressor documentation. Here, we opt for post-training dynamic quantization with an acceptable accuracy drop of 5%. If accuracy drops by more than the allowed 5%, different parts of the model will be quantized until an acceptable drop in accuracy is reached, or until the maximum number of trials, here set to 10, is reached.

from neural_compressor.config import AccuracyCriterion, PostTrainingQuantConfig, TuningCriterion

tuning_criterion = TuningCriterion(max_trials=10)
accuracy_criterion = AccuracyCriterion(tolerable_loss=0.05)
# Load the quantization configuration detailing the quantization we want to apply
quantization_config = PostTrainingQuantConfig(
    approach="dynamic",
    accuracy_criterion=accuracy_criterion,
    tuning_criterion=tuning_criterion,
)

We can now launch the quantization job and save the resulting model and its configuration file to local storage.

from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel.neural_compressor import INCQuantizer

# The directory where the quantized model shall be saved
save_dir = "./model_inc"
quantizer = INCQuantizer.from_pretrained(model=model, eval_fn=eval_func)
quantizer.quantize(quantization_config=quantization_config, save_directory=save_dir)

The log tells us that Optimum Intel has quantized 38 Linear and 2 Embedding operators.

[INFO] |******Mixed Precision Statistics*****|
[INFO] +----------------+----------+---------+
[INFO] |    Op Type     |  Total   |   INT8  |
[INFO] +----------------+----------+---------+
[INFO] |   Embedding    |    2     |    2    |
[INFO] |     Linear     |    38    |    38   |
[INFO] +----------------+----------+---------+

Comparing the first layer of the original model (model.distilbert.transformer.layer[0]) and its quantized version (inc_model.distilbert.transformer.layer[0]), we see that Linear has indeed been replaced by DynamicQuantizedLinear, its quantized equivalent.
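
The module dumps below can be reproduced with two print statements, assuming the quantized model has been reloaded as inc_model (see the loading snippet at the end of this post):

# Compare the first Transformer block of both models
print(model.distilbert.transformer.layer[0])       # original FP32 model
print(inc_model.distilbert.transformer.layer[0])   # quantized model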

# Original model

TransformerBlock(
  (attention): MultiHeadSelfAttention(
    (dropout): Dropout(p=0.1, inplace=False)
    (q_lin): Linear(in_features=768, out_features=768, bias=True)
    (k_lin): Linear(in_features=768, out_features=768, bias=True)
    (v_lin): Linear(in_features=768, out_features=768, bias=True)
    (out_lin): Linear(in_features=768, out_features=768, bias=True)
  )
  (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (ffn): FFN(
    (dropout): Dropout(p=0.1, inplace=False)
    (lin1): Linear(in_features=768, out_features=3072, bias=True)
    (lin2): Linear(in_features=3072, out_features=768, bias=True)
  )
  (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)
# Quantized model

TransformerBlock(
  (attention): MultiHeadSelfAttention(
    (dropout): Dropout(p=0.1, inplace=False)
    (q_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_channel_affine)
    (k_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_channel_affine)
    (v_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_channel_affine)
    (out_lin): DynamicQuantizedLinear(in_features=768, out_features=768, dtype=torch.qint8, qscheme=torch.per_channel_affine)
  )
  (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (ffn): FFN(
    (dropout): Dropout(p=0.1, inplace=False)
    (lin1): DynamicQuantizedLinear(in_features=768, out_features=3072, dtype=torch.qint8, qscheme=torch.per_channel_affine)
    (lin2): DynamicQuantizedLinear(in_features=3072, out_features=768, dtype=torch.qint8, qscheme=torch.per_channel_affine)
  )
  (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
)

Very well, but how does this impact accuracy and prediction time?

Before and after each quantization step, Optimum Intel runs the evaluation function on the current model. The accuracy of the quantized model is now a bit lower (0.546) than that of the original model (0.574). We also see that the evaluation step of the quantized model was 1.34x faster than on the original model. Not bad for a few lines of code!

[INFO] |**********************Tune Result Statistics**********************|
[INFO] +--------------------+----------+---------------+------------------+
[INFO] |     Info Type      | Baseline | Tune 1 result | Best tune result |
[INFO] +--------------------+----------+---------------+------------------+
[INFO] |      Accuracy      | 0.5740   |    0.5460     |     0.5460       |
[INFO] | Duration (seconds) | 13.1534  |    9.7695     |     9.7695       |
[INFO] +--------------------+----------+---------------+------------------+
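
If you want to reproduce the speedup measurement yourself, a rough sketch is to time the same evaluation function on both models, assuming the quantized model has been reloaded as inc_model (see the loading snippet below) and can be passed to the evaluation function like a regular model; absolute numbers will of course vary with your hardware.

import time

# Time the evaluation function on the original and on the quantized model
start = time.perf_counter()
eval_func(model)
fp32_duration = time.perf_counter() - start

start = time.perf_counter()
eval_func(inc_model)
int8_duration = time.perf_counter() - start

print(f"Speedup: {fp32_duration / int8_duration:.2f}x")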

You can find the resulting model hosted on the Hugging Face hub. To load a quantized model hosted locally or on the 🤗 hub, you can do as follows:

from optimum.intel.neural_compressor import INCModelForSequenceClassification

inc_model = INCModelForSequenceClassification.from_pretrained(save_dir)
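
From there, the quantized model can be used for inference; a minimal sketch with the transformers pipeline API might look like this (the sample review is made up for illustration):

from transformers import pipeline

# Classify a sample review with the quantized model
classifier = pipeline("text-classification", model=inc_model, tokenizer=tokenizer)
print(classifier("These shoes are comfortable and fit perfectly!"))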



We’re only getting started

In this example, we showed you how to easily quantize models post-training with Optimum Intel, and that’s just the beginning. The library also supports other types of quantization as well as pruning, a technique that zeroes or removes model parameters that have little or no impact on the predicted outcome.

We’re excited to partner with Intel to bring Hugging Face users peak efficiency on the latest Intel Xeon CPUs and Intel AI libraries. Please give Optimum Intel a star to get updates, and stay tuned for many upcoming features!

Many thanks to Ella Charlaix for her help on this post.


