Accelerated Inference with Optimum and Transformers Pipelines



Philipp Schmid

Inference has landed in Optimum with support for Hugging Face Transformers pipelines, including text-generation using ONNX Runtime.

The adoption of BERT and Transformers continues to grow. Transformer-based models are now not only achieving state-of-the-art performance in Natural Language Processing but also in Computer Vision, Speech, and Time-Series. 💬 🖼 🎤 ⏳

Companies are now moving from the experimentation and research phase into the production phase in order to use Transformer models for large-scale workloads. But by default, BERT and its friends are relatively slow, big, and complex models compared to traditional Machine Learning algorithms.

To solve this challenge, we created Optimum, an extension of Hugging Face Transformers to accelerate the training and inference of Transformer models like BERT.

In this blog post, you will learn:

  1. What's Optimum? An ELI5
  2. New Optimum inference and pipeline features
  3. End-to-End tutorial on accelerating RoBERTa for Question-Answering including quantization and optimization
  4. Current Limitations
  5. Optimum Inference FAQ
  6. What's next?

Let’s start! 🚀



1. What’s Optimum? An ELI5

Hugging Face Optimum is an open-source library and an extension of Hugging Face Transformers that provides a unified API of performance optimization tools to achieve maximum efficiency when training and running models on accelerated hardware, including toolkits for optimized performance on Graphcore IPU and Habana Gaudi. Optimum can be used for accelerated training, quantization, graph optimization, and now inference as well, with support for transformers pipelines.



2. New Optimum inference and pipeline features

With the release of Optimum 1.2, we are adding support for inference and transformers pipelines. This allows Optimum users to leverage the same API they are used to from transformers with the power of accelerated runtimes, like ONNX Runtime.

Switching from Transformers to Optimum Inference
The Optimum Inference models are API compatible with Hugging Face Transformers models. This means you can simply replace your AutoModelForXxx class with the corresponding ORTModelForXxx class in Optimum. For example, this is how you can use a question answering model in Optimum:

from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # pytorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # onnx checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

query = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = optimum_qa(query, context)

In the first release, we added support for ONNX Runtime, but there is more to come!
These new ORTModelForXxx classes can now be used with the transformers pipelines. They are also fully integrated into the Hugging Face Hub to push and pull optimized checkpoints from the community. In addition, you can use the ORTQuantizer and ORTOptimizer to first quantize and optimize your model and then run inference on it.
Check out the end-to-end tutorial on accelerating RoBERTa for question-answering, including quantization and optimization, below for more details.



3. End-to-End tutorial on accelerating RoBERTa for Question-Answering including quantization and optimization

In this end-to-end tutorial on accelerating RoBERTa for question-answering, you will learn how to:

  1. Install Optimum for ONNX Runtime
  2. Convert a Hugging Face Transformers model to ONNX for inference
  3. Use the ORTOptimizer to optimize the model
  4. Use the ORTQuantizer to apply dynamic quantization
  5. Run accelerated inference using Transformers pipelines
  6. Evaluate the performance and speed

Let’s start 🚀

This tutorial was created and run on an m5.xlarge AWS EC2 Instance.



3.1 Install Optimum for ONNX Runtime

Our first step is to install Optimum with the onnxruntime utilities.

pip install "optimum[onnxruntime]==1.2.0"

This will install all required packages for us, including transformers, torch, and onnxruntime. If you are going to use a GPU, you can install Optimum with pip install optimum[onnxruntime-gpu] instead.



3.2 Convert a Hugging Face Transformers model to ONNX for inference

Before we can start optimizing, we need to convert our vanilla transformers model to the onnx format. To do this we will use the new ORTModelForQuestionAnswering class and call the from_pretrained() method with the from_transformers attribute. The model we are using is deepset/roberta-base-squad2, a RoBERTa model fine-tuned on the SQuAD2 dataset achieving an F1 score of 82.91, with question-answering as the feature (task).

from pathlib import Path
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"
onnx_path = Path("onnx")
task = "question-answering"


# load vanilla transformers model and convert it to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)


# save the converted model and the tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)


# test the model with a transformers pipeline (handle_impossible_answer=True for squad_v2)
optimum_qa = pipeline(task, model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)

We successfully converted our vanilla transformers model to onnx and used it with the transformers pipelines to run a first prediction. Now let's optimize it. 🏎

If you want to learn more about exporting transformers models, check out the documentation: Export 🤗 Transformers Models.



3.3 Use the ORTOptimizer to optimize the model

After saving our onnx checkpoint to onnx/, we can now use the ORTOptimizer to apply graph optimizations, such as operator fusion and constant folding, to reduce latency and accelerate inference.

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig


# create ORTOptimizer and define the optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=task)
optimization_config = OptimizationConfig(optimization_level=99)  # enable all available optimizations


# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)

To test the performance, we can use the ORTModelForQuestionAnswering class again, providing an additional file_name parameter to load our optimized model. (This also works for models available on the Hub.)

from optimum.onnxruntime import ORTModelForQuestionAnswering


# load the optimized model by pointing file_name at the optimized onnx file
opt_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-optimized.onnx")


# test the optimized model with a transformers pipeline
opt_optimum_qa = pipeline(task, model=opt_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = opt_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
print(prediction)

We will evaluate the performance changes in detail in step 3.6 Evaluate the performance and speed.



3.4 Use the ORTQuantizer to apply dynamic quantization

After we have optimized our model, we can accelerate it even more by quantizing it using the ORTQuantizer. The ORTQuantizer can be used to apply dynamic quantization to decrease the size of the model and speed up inference.

We use the avx512_vnni configuration since the instance is powered by an Intel Cascade Lake CPU supporting AVX-512 VNNI.

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig


# create ORTQuantizer and define the quantization configuration
quantizer = ORTQuantizer.from_pretrained(model_id, feature=task)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)


# apply the quantization configuration to the (already optimized) model
quantizer.export(
    onnx_model_path=onnx_path / "model-optimized.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=qconfig,
)

We can now compare the model size as well as some latency performance:

import os

size = os.path.getsize(onnx_path / "model.onnx")/(1024*1024)
print(f"Vanilla Onnx Model file size: {size:.2f} MB")
size = os.path.getsize(onnx_path / "model-quantized.onnx")/(1024*1024)
print(f"Quantized Onnx Model file size: {size:.2f} MB")



Model size comparison

We decreased the size of our model by almost 50%, from 473MB to 291MB. To run inference, we can use the ORTModelForQuestionAnswering class again, providing an additional file_name parameter to load our quantized model. (This also works for models available on the Hub.)


# load the quantized model by pointing file_name at the quantized onnx file
quantized_model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-quantized.onnx")


# test the quantized model with a transformers pipeline
quantized_optimum_qa = pipeline(task, model=quantized_model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = quantized_optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")
print(prediction)

Nice! The model predicted the same answer.



3.5 Run accelerated inference using Transformers pipelines

Optimum has built-in support for transformers pipelines. This allows us to leverage the same API that we know from using PyTorch and TensorFlow models. We have already used this feature in steps 3.2, 3.3 & 3.4 to test our converted and optimized models. At the time of writing, we support ONNX Runtime, with more to come in the future. An example of how to use the transformers pipelines can be found below.

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

# load the converted onnx model and tokenizer saved in section 3.2
tokenizer = AutoTokenizer.from_pretrained(onnx_path)
model = ORTModelForQuestionAnswering.from_pretrained(onnx_path)

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)

In addition to this, we added a pipelines API to Optimum to guarantee more safety for your accelerated models. Meaning if you try to use optimum.pipelines with an unsupported model or task, you will see an error. You can use optimum.pipelines as a drop-in replacement for transformers.pipelines.

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

tokenizer = AutoTokenizer.from_pretrained(onnx_path)
model = ORTModelForQuestionAnswering.from_pretrained(onnx_path)

optimum_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, handle_impossible_answer=True)
prediction = optimum_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")

print(prediction)



3.6 Evaluate the performance and speed

During this end-to-end tutorial on accelerating RoBERTa for Question-Answering including quantization and optimization, we created three different models: a vanilla converted model, an optimized model, and a quantized model.

As the last step of the tutorial, we want to take a detailed look at the performance and accuracy of our models. Applying optimization techniques like graph optimizations or quantization not only impacts performance (latency); they may also have an impact on the accuracy of the model. So accelerating your model comes with a trade-off.

Let's evaluate our models. Our transformers model deepset/roberta-base-squad2 was fine-tuned on the SQuAD2 dataset. This will be the dataset we use to evaluate our models.

from datasets import load_metric,load_dataset

metric = load_metric("squad_v2")
dataset = load_dataset("squad_v2")["validation"]

print(f"length of dataset {len(dataset)}")

We can now leverage the map function of datasets to iterate over the validation set of squad_v2 and run a prediction for each data point. Therefore we write an evaluate helper method which uses our pipelines and applies some transformations to work with the squad_v2 metric.

This could take quite some time (1.5h)

def evaluate(example):
  default = optimum_qa(question=example["question"], context=example["context"])
  optimized = opt_optimum_qa(question=example["question"], context=example["context"])
  quantized = quantized_optimum_qa(question=example["question"], context=example["context"])
  return {
      'reference': {'id': example['id'], 'answers': example['answers']},
      'default': {'id': example['id'],'prediction_text': default['answer'], 'no_answer_probability': 0.},
      'optimized': {'id': example['id'],'prediction_text': optimized['answer'], 'no_answer_probability': 0.},
      'quantized': {'id': example['id'],'prediction_text': quantized['answer'], 'no_answer_probability': 0.},
      }

result = dataset.map(evaluate)


Now let's compare the results:

default_acc = metric.compute(predictions=result["default"], references=result["reference"])
optimized = metric.compute(predictions=result["optimized"], references=result["reference"])
quantized = metric.compute(predictions=result["quantized"], references=result["reference"])

print(f"vanilla model: exact={default_acc['exact']}% f1={default_acc['f1']}%")
print(f"optimized model: exact={optimized['exact']}% f1={optimized['f1']}%")
print(f"quantized model: exact={quantized['exact']}% f1={quantized['f1']}%")




Our optimized & quantized model achieved an exact match of 78.75% and an F1 score of 81.83%, which is 99.61% of the original accuracy. Achieving 99% of the original model's accuracy is very good, especially since we used dynamic quantization.

Okay, let’s test the performance (latency) of our optimized and quantized model.

But first, let's extend our context and question to a more meaningful sequence length of 128.

context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I'm working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. Prior to now I designed and implemented cloud-native machine learning architectures for fin-tech and insurance firms. I discovered my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I'm focusing myself in the world NLP and the right way to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
query="As what's Philipp working?"

To keep it simple, we are going to use a python loop to calculate the avg/mean latency for our vanilla model and for the optimized and quantized model.

from time import perf_counter
import numpy as np

def measure_latency(pipe):
    latencies = []
    # warm up the pipeline
    for _ in range(10):
        _ = pipe(question=question, context=context)
    # timed run
    for _ in range(100):
        start_time = perf_counter()
        _ = pipe(question=question, context=context)
        latency = perf_counter() - start_time
        latencies.append(latency)
    
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    return f"Average latency (ms) - {time_avg_ms:.2f} +- {time_std_ms:.2f}"

print(f"Vanilla model {measure_latency(optimum_qa)}")
print(f"Optimized & Quantized model {measure_latency(quantized_optimum_qa)}")



Latency & F1 results

We managed to reduce our model latency from 117.61ms to 64.94ms, roughly 2x faster, while keeping 99.61% of the accuracy. Something we should keep in mind is that we used a mid-performant CPU instance with 2 physical cores. By switching to a GPU or a more performant CPU instance, e.g. one powered by Ice Lake, you can decrease the latency down to a few milliseconds.
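
If you want to compare your own numbers against the ones above, it helps to know what your instance exposes. A quick check could look like the following sketch (psutil is an extra dependency used only for this check, not elsewhere in this post):

import os
import psutil  # extra dependency, only used to report the physical core count

print(f"logical cores:  {os.cpu_count()}")
print(f"physical cores: {psutil.cpu_count(logical=False)}")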



4. Current Limitations

We just started supporting inference in https://github.com/huggingface/optimum, so we would like to share the current limitations as well. All of these limitations are on the roadmap and will be resolved in the near future.

  • Remote models > 2GB: Currently, only models smaller than 2GB can be loaded from the Hugging Face Hub. We are working on adding support for models > 2GB / multi-file models.
  • Seq2Seq tasks/models: We do not yet have support for seq2seq tasks, like summarization, or models like T5, mostly due to the limitation of the single-model support. But we are actively working to resolve it, to offer you the same experience you are used to in transformers.
  • Past key values: Generation models like GPT-2 use something called past key values, which are precomputed key-value pairs of the attention blocks that can be used to speed up decoding (see the sketch after this list). Currently the ORTModelForCausalLM is not using past key values.
  • No cache: Currently, when loading an optimized model (*.onnx), it will not be cached locally.
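
To illustrate what reusing past key values buys you, here is a plain PyTorch transformers sketch (not Optimum code; the model choice is just for the example) comparing generation with and without the cache:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Hello, my name is", return_tensors="pt")

# use_cache=True reuses the past key values of the attention blocks between decoding steps
with_cache = model.generate(**inputs, max_new_tokens=20, use_cache=True)
# use_cache=False recomputes attention over the full sequence at every step (slower)
without_cache = model.generate(**inputs, max_new_tokens=20, use_cache=False)

print(tokenizer.decode(with_cache[0]))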



5. Optimum Inference FAQ

Which tasks are supported?

You can find a list of all supported tasks in the documentation. The currently supported pipeline tasks are feature-extraction, text-classification, token-classification, question-answering, zero-shot-classification, and text-generation.
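
For example, a text-classification pipeline would look like the sketch below, assuming ORTModelForSequenceClassification follows the same from_pretrained pattern as the question-answering class used throughout this post (the model id is only an illustration):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification
from optimum.pipelines import pipeline

# assumption: the sequence classification class mirrors ORTModelForQuestionAnswering
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

onnx_clf = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(onnx_clf("I love using Optimum for accelerated inference!"))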

Which models are supported?

Any model that can be exported with transformers.onnx and has a supported task can be used. This includes, among others, BERT, ALBERT, GPT2, RoBERTa, XLM-RoBERTa, DistilBERT, and more.

Which runtimes are supported?

Currently, ONNX Runtime is supported. We are working on adding more in the future. Let us know if you are interested in a specific runtime.

How can I use Optimum with Transformers?

You can find an example and instructions in our documentation.

How can I use GPUs?

To be able to use GPUs, you just need to install optimum[onnxruntime-gpu], which will install the required GPU providers and use them by default.

How can I use a quantized and optimized model with pipelines?

You can load the optimized or quantized model using the new ORTModelForXXX classes with the from_pretrained method. You can learn more about it in our documentation.
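
As a minimal sketch, reusing the file_name pattern from sections 3.3 and 3.4 (the onnx/ path assumes you saved the checkpoints locally as in the tutorial above):

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForQuestionAnswering
from optimum.pipelines import pipeline

# point file_name at the quantized checkpoint produced in section 3.4
model = ORTModelForQuestionAnswering.from_pretrained("onnx", file_name="model-quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained("onnx")

quantized_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(quantized_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg."))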



6. What’s next?

What's next for Optimum, you ask? A lot of things. We are focused on making Optimum the reference open-source toolkit to work with transformers for acceleration & optimization. To achieve this, we will solve the current limitations, improve the documentation, create more content and examples, and push the boundaries for accelerating and optimizing transformers.

Some important features on the roadmap for Optimum, besides the current limitations, are:

  • Support for speech models (Wav2vec2) and speech tasks (automatic speech recognition)
  • Support for vision models (ViT) and vision tasks (image classification)
  • Improve performance by adding support for OrtValue and IOBinding
  • Easier ways to evaluate accelerated models
  • Add support for other runtimes and providers like TensorRT and AWS-Neuron

Thanks for reading! If you are as excited as I am about accelerating Transformers, making them efficient, and scaling them to billions of requests, you should apply, we are hiring. 🚀

If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.




