Accelerating Hugging Face Transformers with AWS Inferentia2

Philipp Schmid and Julien Simon
Within the last five years, Transformer models [1] have become the de facto standard for many machine learning (ML) tasks, such as natural language processing (NLP), computer vision (CV), speech, and more. Today, many data scientists and ML engineers rely on popular transformer architectures like BERT [2], RoBERTa [3], the Vision Transformer [4], or any of the 130,000+ pre-trained models available on the Hugging Face Hub to solve complex business problems with state-of-the-art accuracy.

However, for all their greatness, Transformers can be difficult to deploy in production. On top of the infrastructure plumbing typically associated with model deployment, which we largely solved with our Inference Endpoints service, Transformers are large models that routinely exceed the multi-gigabyte mark. Large language models (LLMs) like GPT-J-6B, Flan-T5, or OPT-30B are in the tens of gigabytes, not to mention behemoths like BLOOM, our very own LLM, which clocks in at 350 gigabytes.

Fitting these models on a single accelerator can be quite difficult, let alone getting the high throughput and low inference latency that applications like conversational assistants and search require. So far, ML experts have designed complex manual techniques to slice large models, distribute them across a cluster of accelerators, and optimize their latency. Unfortunately, this work is extremely difficult, time-consuming, and completely out of reach for many ML practitioners.

At Hugging Face, we are democratizing ML and always looking to partner with companies who also believe that every developer and organization should benefit from state-of-the-art models. For this purpose, we are excited to partner with Amazon Web Services to optimize Hugging Face Transformers for AWS Inferentia2, a new purpose-built inference accelerator that delivers unprecedented levels of throughput, latency, performance per watt, and scalability.



Introducing AWS Inferentia2

AWS Inferentia2 is the next generation of Inferentia1, which launched in 2019. Powered by Inferentia1, Amazon EC2 Inf1 instances delivered 25% higher throughput and 70% lower cost than comparable G5 instances based on the NVIDIA A10G GPU, and with Inferentia2, AWS is pushing the envelope again.

The new Inferentia2 chip delivers a 4x throughput increase and a 10x latency reduction compared to Inferentia1. Likewise, the new Amazon EC2 Inf2 instances offer up to 2.6x higher throughput, 8.1x lower latency, and 50% better performance per watt than comparable G5 instances. Inferentia2 gives you the best of both worlds: cost-per-inference optimization thanks to high throughput, and fast response times for your application thanks to low inference latency.

Inf2 instances are available in multiple sizes, equipped with between 1 and 12 Inferentia2 chips. When several chips are present, they are interconnected by blazing-fast direct Inferentia2-to-Inferentia2 connectivity for distributed inference on large models. For example, the largest instance size, inf2.48xlarge, has 12 chips and enough memory to load a 175-billion-parameter model like GPT-3 or BLOOM.

Thankfully, none of this comes at the expense of development complexity. With Optimum Neuron, you don't need to slice or modify your model. Thanks to the native integration with the AWS Neuron SDK, all it takes is a single line of code to compile your model for Inferentia2, so you can experiment in minutes! Test the performance your model could reach on Inferentia2 and see for yourself.
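For illustration, here is a minimal sketch of what that looks like with Optimum Neuron on an Inf2 instance. The model name, input shapes, and example sentence are our own choices for the example, and the exact class and argument names may differ slightly between optimum-neuron releases:

```python
# Sketch: compiling and running a Hub model on Inferentia2 with Optimum Neuron.
# Assumes an Inf2 instance with the AWS Neuron SDK and `optimum[neuron]` installed.
from transformers import AutoTokenizer
from optimum.neuron import NeuronModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # any compatible Hub model

# `export=True` compiles the model for the Neuron cores; static input shapes are required.
model = NeuronModelForSequenceClassification.from_pretrained(
    model_id,
    export=True,
    batch_size=1,
    sequence_length=128,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Inference then works just like a regular Transformers model,
# as long as inputs are padded to the compiled sequence length.
inputs = tokenizer(
    "AWS Inferentia2 makes Transformer inference fast and affordable.",
    padding="max_length",
    max_length=128,
    return_tensors="pt",
)
outputs = model(**inputs)
print(outputs.logits)
```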

Speaking of which, let's show you how several Hugging Face models run on Inferentia2. Benchmarking time!



Benchmarking Hugging Face Models on AWS Inferentia 2

We evaluated some of the most popular NLP models from the Hugging Face Hub, including BERT, RoBERTa, and DistilBERT, as well as vision models like the Vision Transformer.

The first benchmark compares the performance of Inferentia1, Inferentia2, and GPUs. We ran all experiments on AWS with the following instance types:

  • Inferentia1 – inf1.2xlarge powered by a single Inferentia chip.
  • Inferentia2 – inf2.xlarge powered by a single Inferentia2 chip.
  • GPU – g5.2xlarge powered by a single NVIDIA A10G GPU.

Note: we did not optimize the models for the GPU environment; all models were evaluated in fp32.

When it comes to benchmarking Transformer models, two metrics are most commonly used:

  • Latency: the time it takes for the model to perform a single prediction (pre-process, prediction, post-process).
  • Throughput: the number of executions performed in a fixed amount of time for one benchmark configuration.

We looked at latency across different setups and models to understand the benefits and tradeoffs of the new Inferentia2 instance. If you want to run the benchmark yourself, we created a GitHub repository with all the information and scripts to do so.
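As a rough illustration of how such numbers can be collected, here is a minimal timing sketch. The `predict` callable stands in for any compiled model plus its pre- and post-processing; the function name, warmup count, and iteration count are our own assumptions, not taken from the benchmark scripts:

```python
# Sketch: measuring p95 latency and throughput for a single benchmark configuration.
import time
import numpy as np

def benchmark(predict, payload, warmup=10, iterations=300):
    """Times `predict(payload)` and reports p95 latency (ms) and throughput (inferences/s)."""
    for _ in range(warmup):  # warm up caches and lazy initialization
        predict(payload)

    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        predict(payload)
        latencies.append(time.perf_counter() - start)

    p95_latency_ms = float(np.percentile(latencies, 95)) * 1000
    throughput = iterations / sum(latencies)
    return {"p95_latency_ms": p95_latency_ms, "throughput_per_s": throughput}
```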



Results

The benchmark confirms that the performance improvements claimed by AWS can be reproduced and validated on real use cases and examples. On average, AWS Inferentia2 delivers 4.5x better latency than NVIDIA A10G GPUs and 4x better latency than Inferentia1 instances.

We ran 144 experiments on 6 different model architectures.

In each experiment, we collected numbers for p95 latency. You can find the full details of the benchmark in this spreadsheet: HuggingFace: Benchmark Inferentia2.

Let's highlight a few insights from the benchmark.



BERT-base

Here is the latency comparison for running BERT-base on each of the infrastructure setups, with a logarithmic scale for latency. It is remarkable to see how Inferentia2 outperforms all other setups by ~6x for sequence lengths up to 256.

Figure 1. BERT-base p95 latency



Vision Transformer

Here is the latency comparison for running ViT-base on the different infrastructure setups. Inferentia2 delivers 2x better latency than the NVIDIA A10G, which could greatly help companies move from traditional architectures, like CNNs, to Transformers for real-time applications.

Figure 2. ViT p95 latency



Conclusion

Transformer models have emerged as the go-to solution for many machine learning tasks. However, deploying them in production has been difficult due to their large size and latency requirements. Thanks to AWS Inferentia2 and the collaboration between Hugging Face and AWS, developers and organizations can now leverage the benefits of state-of-the-art models without the need for extensive machine learning expertise. You can start testing for as little as $0.76/hour.

The initial benchmarking results are promising, and show that Inferentia2 delivers superior latency performance compared to both Inferentia1 and NVIDIA A10G GPUs. This latest breakthrough promises to make high-quality machine learning models available to a much wider audience, delivering AI accessibility to everyone.


