Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon




SetFit is a promising solution for a common modeling problem: how to deal with a lack of labeled data for training. Developed with Hugging Face’s research partners at Intel Labs and the UKP Lab, SetFit is an efficient framework for few-shot fine-tuning of Sentence Transformers models.

SetFit achieves high accuracy with little labeled data – for instance, SetFit outperforms GPT-3.5 in a 3-shot setting, and with 5 shots it also outperforms 3-shot GPT-4 on the Banking 77 financial intent dataset.

In comparison with LLM based methods, SetFit has two unique benefits:

🗣 No prompts or verbalizers: few-shot in-context learning with LLMs requires handcrafted prompts, which makes the results brittle, sensitive to phrasing, and dependent on user expertise. SetFit dispenses with prompts altogether by generating rich embeddings directly from a small number of labeled text examples.

🏎 Fast to train: SetFit doesn’t depend on LLMs such as GPT-3.5 or Llama2 to achieve high accuracy. As a result, it is typically an order of magnitude (or more) faster to train and run inference with.

For more details on SetFit, take a look at our paper, blog, code, and data.
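
To give a sense of what few-shot fine-tuning with SetFit looks like in practice, here is a minimal training sketch. It assumes the setfit>=1.0 Trainer API; the base model, dataset, and hyperparameters are purely illustrative:

from datasets import load_dataset
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset

# Simulate the few-shot regime: 8 labeled examples per class from sst2
dataset = load_dataset("SetFit/sst2")
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)

# Fine-tune a small Sentence Transformers backbone and fit a classification head
model = SetFitModel.from_pretrained("BAAI/bge-small-en-v1.5")
args = TrainingArguments(batch_size=16, num_epochs=1)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

# Predict labels for new sentences
print(model.predict(["a charming and affecting journey", "a dull, tedious mess"]))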

SetFit has been widely adopted by the AI developer community, with ~100k downloads per month and ~1500 SetFit models on the Hub, growing at an average of ~4 models per day!



Faster!

In this blog post, we’ll explain how you can speed up SetFit inference by 7.8x on Intel CPUs by optimizing your SetFit model with 🤗 Optimum Intel. We’ll show how you can achieve huge throughput gains by performing a simple post-training quantization step on your model. This enables production-grade deployment of SetFit solutions using Intel Xeon CPUs.

Optimum Intel is an open-source library that accelerates end-to-end pipelines built with Hugging Face libraries on Intel hardware. Optimum Intel includes several techniques to accelerate models, such as low-bit quantization, model weight pruning, distillation, and an accelerated runtime.

The runtime and optimizations included in Optimum Intel take advantage of Intel® Advanced Vector Extensions 512 (Intel® AVX-512), Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) on Intel CPUs to accelerate models. Specifically, Intel AMX has built-in BFloat16 (bf16) and int8 GEMM accelerators in every core to accelerate deep learning training and inference workloads. AMX-accelerated inference is introduced in PyTorch 2.0 and Intel Extension for PyTorch (IPEX), along with other optimizations for various common operators.
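
To quickly check whether your CPU exposes these instruction sets (on Linux), you can inspect the CPU flags. This is an optional sanity check, not part of the original walkthrough:

# Optional check (Linux): look for AVX-512, VNNI and AMX flags in /proc/cpuinfo
with open("/proc/cpuinfo") as f:
    cpu_flags = f.read()

for feature in ["avx512f", "avx512_vnni", "amx_bf16", "amx_int8", "amx_tile"]:
    print(f"{feature}: {'yes' if feature in cpu_flags else 'not found'}")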

Optimizing pre-trained models can be done easily with Optimum Intel; many simple examples can be found here.
This blog post is accompanied by a notebook for a step-by-step walkthrough.



Step 1: Quantize the SetFit Model using 🤗 Optimum Intel

In order to optimize our SetFit model, we will apply quantization to the model body, using Intel Neural Compressor (INC), part of Optimum Intel.

Quantization is a very popular deep learning model optimization technique for improving inference speeds. It minimizes the number of bits required to represent the weights and/or activations in a neural network. This is done by converting a set of high-precision numbers into lower-bit data representations, such as INT8. Moreover, quantization enables faster computations in lower precision.
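
As a toy illustration of the idea (not part of the Optimum Intel workflow), here is how a float tensor could be mapped to INT8 values with a scale and zero-point:

import torch

# Toy affine quantization: represent float32 values with 8-bit integers
x = torch.tensor([0.1, -1.2, 3.4, 0.0])
qmin, qmax = -128, 127
scale = (x.max() - x.min()) / (qmax - qmin)
zero_point = int(qmin - torch.round(x.min() / scale))

x_int8 = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.int8)
x_dequant = (x_int8.float() - zero_point) * scale  # approximate reconstruction of x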

Specifically, we’ll apply post-training static quantization (PTQ). PTQ can reduce the memory footprint and latency of inference, while still preserving the accuracy of the model, using only a small unlabeled calibration set and without any training.
Before you begin, make sure you have all the necessary libraries installed and that your version of Optimum Intel is at least 1.14.0, since the functionality was introduced in that version:

pip install --upgrade-strategy eager optimum[ipex]



Prepare a Calibration Dataset

The calibration dataset should be representative of the distribution of unseen data. In general, preparing 100 samples is enough for calibration. We’ll use the rotten_tomatoes dataset in our case, since it’s composed of movie reviews, similar to our target dataset, sst2.

First, we’ll load 100 random samples from this dataset. Then, to prepare the dataset for quantization, we’ll need to tokenize each example. We won’t need the “text” and “label” columns, so let’s remove them.

from datasets import load_dataset

# Load 100 random samples from rotten_tomatoes for calibration
calibration_set = load_dataset("rotten_tomatoes", split="train").shuffle(seed=42).select(range(100))

# Tokenize each example with the SetFit model body's tokenizer
tokenizer = setfit_model.model_body.tokenizer

def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", max_length=512, truncation=True)

calibration_set = calibration_set.map(tokenize, remove_columns=["text", "label"])
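
Optionally, you can confirm that only the tokenized features remain after mapping:

# Only the tokenizer outputs should remain after removing "text" and "label"
print(calibration_set.column_names)  # e.g. ['input_ids', 'token_type_ids', 'attention_mask']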



Run Quantization

Before we run quantization, we need to define the desired quantization process – in our case, static post-training quantization – and use optimum.intel to run the quantization on our calibration dataset:

from optimum.intel import INCQuantizer
from neural_compressor.config import PostTrainingQuantConfig

# Extract the transformer body from the SetFit model
setfit_body = setfit_model.model_body[0].auto_model
quantizer = INCQuantizer.from_pretrained(setfit_body)
optimum_model_path = "/tmp/bge-small-en-v1.5_setfit-sst2-english_opt"
quantization_config = PostTrainingQuantConfig(approach="static", backend="ipex", domain="nlp")

# Run static post-training quantization with the calibration set and save the result
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_set,
    save_directory=optimum_model_path,
    batch_size=1,
)
tokenizer.save_pretrained(optimum_model_path)

That’s it! We now have a local copy of our quantized SetFit model. Let’s check it out.
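
Before benchmarking, you can optionally reload the quantized body and run a single forward pass as a quick sanity check (this snippet is illustrative; for this BERT-style body the first output holds the token-level embeddings):

from optimum.intel import IPEXModel
from transformers import AutoTokenizer

# Reload the quantized model body and its tokenizer from disk
quantized_body = IPEXModel.from_pretrained(optimum_model_path)
tok = AutoTokenizer.from_pretrained(optimum_model_path)

inputs = tok("this film is a delight", return_tensors="pt")
outputs = quantized_body(**inputs)
print(outputs[0].shape)  # token-level embeddings, e.g. (1, seq_len, 384) for bge-small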



Step 2: Benchmark Inference

In our notebook, we’ve set up a PerformanceBenchmark class to compute model latency and throughput, as well as an accuracy measure; a simplified sketch of such a class is shown after the list below. Let’s use it to benchmark our Optimum Intel model against two other commonly used methods:

  • Using PyTorch and 🤗 Transformers library with fp32.
  • Using Intel Extension for PyTorch (IPEX) runtime with bf16 and tracing the model using TorchScript.
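
The full PerformanceBenchmark implementation lives in the notebook; the hypothetical sketch below only illustrates the latency-measurement part of such a class (names and signatures are illustrative, not the notebook's exact code):

import time
import numpy as np

# Hypothetical, simplified latency benchmark; the notebook's class also measures
# throughput and accuracy and handles the different model variants.
class SimpleLatencyBenchmark:
    def __init__(self, predict_fn, dataset, optim_type):
        self.predict_fn = predict_fn   # callable mapping a list of texts to predictions
        self.dataset = dataset
        self.optim_type = optim_type

    def run_benchmark(self, n_runs=100):
        sample = [self.dataset["text"][0]]
        for _ in range(10):            # warm-up
            self.predict_fn(sample)
        latencies = []
        for _ in range(n_runs):
            start = time.perf_counter()
            self.predict_fn(sample)
            latencies.append((time.perf_counter() - start) * 1000)
        return {self.optim_type: {"latency_avg_ms": float(np.mean(latencies)),
                                  "latency_std_ms": float(np.std(latencies))}}

For instance, once setfit_model and test_dataset are loaded below, SimpleLatencyBenchmark(setfit_model.predict, test_dataset, "bge-small (transformers)").run_benchmark() would time single-sentence predictions for the baseline model.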

Load our test dataset, sst2, and run the benchmark using PyTorch and 🤗 Transformers library:

from datasets import load_dataset
from setfit import SetFitModel
test_dataset = load_dataset("SetFit/sst2")["validation"]

model_path = "dkorat/bge-small-en-v1.5_setfit-sst2-english"
setfit_model = SetFitModel.from_pretrained(model_path)
pb = PerformanceBenchmark(
    model=setfit_model,
    dataset=test_dataset,
    optim_type="bge-small (transformers)",
)
perf_metrics = pb.run_benchmark()

For the second benchmark, we’ll use Intel Extension for PyTorch (IPEX) with bf16 precision and TorchScript tracing.
To use IPEX we simply import the IPEX library and apply ipex.optimize() to the target model, which, in our case, is the SetFit (transformer) model body:

import torch
import intel_extension_for_pytorch as ipex
dtype = torch.bfloat16
body = ipex.optimize(setfit_model.model_body, dtype=dtype)

For TorchScript tracing, we generate a random sequence based on the model’s maximum input length, with tokens sampled from the tokenizer’s vocabulary:

tokenizer = setfit_model.model_body.tokenizer
d = generate_random_sequences(batch_size=1, length=tokenizer.model_max_length, vocab_size=tokenizer.vocab_size)

body = torch.jit.trace(body, (d,), check_trace=False, strict=False)
setfit_model.model_body = torch.jit.freeze(body)
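
The generate_random_sequences() helper is defined in the accompanying notebook; a minimal version might look like the following hypothetical sketch, assuming the SentenceTransformer body is traced with a dictionary of features:

import torch

def generate_random_sequences(batch_size, length, vocab_size):
    # Hypothetical helper: a maximum-length batch of random token ids, shaped like
    # the features dict a SentenceTransformer body expects
    input_ids = torch.randint(low=0, high=vocab_size, size=(batch_size, length))
    attention_mask = torch.ones_like(input_ids)
    return {"input_ids": input_ids, "attention_mask": attention_mask}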

Now let’s run the benchmark using our quantized Optimum model. We’ll first need to define a wrapper around our SetFit model which plugs in our quantized model body at inference (instead of the original model body). Then, we can run the benchmark using this wrapper.

from optimum.intel import IPEXModel

class OptimumSetFitModel:
    def __init__(self, setfit_model, model_body):
        model_body.tokenizer = setfit_model.model_body.tokenizer
        self.model_body = model_body
        self.model_head = setfit_model.model_head


optimum_model = IPEXModel.from_pretrained(optimum_model_path)
optimum_setfit_model = OptimumSetFitModel(setfit_model, model_body=optimum_model)

pb = PerformanceBenchmark(
    model=optimum_setfit_model,
    dataset=test_dataset,
    optim_type=f"bge-small (optimum-int8)",
    model_path=optimum_model_path,
    autocast_dtype=torch.bfloat16,
)
perf_metrics.update(pb.run_benchmark())



Results

Accuracy vs latency at batch size=1

                         bge-small (transformers)   bge-small (ipex-bfloat16)   bge-small (optimum-int8)
Model Size               127.32 MB                  63.74 MB                    44.65 MB
Accuracy on test set     88.4%                      88.4%                       88.1%
Latency (bs=1)           15.69 +/- 0.57 ms          5.67 +/- 0.66 ms            4.55 +/- 0.25 ms

When inspecting the performance at batch size 1, there’s a 3.45x reduction in latency with our optimized model. Note that this is achieved with virtually no drop in accuracy!
It’s also worth mentioning that the model size has shrunk by 2.85x.

We now move on to our main focus: the reported throughputs at different batch sizes.
Here, the optimization delivers even greater speedups. When comparing the best achievable throughput (at any batch size), the optimized model is 7.8x faster than the original fp32 transformers model!
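
The notebook sweeps over batch sizes to find each model's best throughput; a simplified version of that measurement might look like this (illustrative only, using setfit_model.predict as the entry point for whichever model variant you are timing):

import time

# Illustrative throughput sweep: samples per second at increasing batch sizes
texts = list(test_dataset["text"])
for batch_size in [1, 4, 16, 64, 128]:
    batch = texts[:batch_size]
    for _ in range(5):                 # warm-up
        setfit_model.predict(batch)
    n_iters = 20
    start = time.perf_counter()
    for _ in range(n_iters):
        setfit_model.predict(batch)
    elapsed = time.perf_counter() - start
    print(f"batch_size={batch_size}: {n_iters * batch_size / elapsed:.1f} samples/sec")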



Summary

In this blog post, we have shown how to use the quantization capabilities in 🤗 Optimum Intel to optimize SetFit models. After running a quick and easy post-training quantization procedure, we observed that the accuracy level was preserved, while inference throughput increased by 7.8x. This optimization method can be readily applied to any existing SetFit deployment running on Intel Xeon.



References

  • Lewis Tunstall, Nils Reimers, Unso Eun Seo Jo, Luke Bates, Daniel Korat, Moshe Wasserblat, Oren Pereg, 2022. “Efficient Few-Shot Learning Without Prompts”. https://arxiv.org/abs/2209.11055


