Millisecond Latency using Hugging Face Infinity and modern CPUs

December 2022 Update: Infinity is no longer offered by Hugging Face as a commercial inference solution. To deploy and accelerate your models, we recommend the following newer solutions:



Introduction

Transfer learning has changed Machine Learning by reaching new levels of accuracy from Natural Language Processing (NLP) to Audio and Computer Vision tasks. At Hugging Face, we work hard to make these new complex models and large checkpoints as easily accessible and usable as possible. But while researchers and data scientists have converted to the new world of Transformers, few companies have been able to deploy these large, complex models in production at scale.

The main bottleneck is the latency of predictions, which can make large deployments expensive to run and real-time use cases impractical. Solving this is a difficult engineering challenge for any Machine Learning Engineering team and requires using advanced techniques to optimize models all the way down to the hardware.

With Hugging Face Infinity, we offer a containerized solution that makes it easy to deploy low-latency, high-throughput, hardware-accelerated inference pipelines for the most popular Transformer models. Companies can get both the accuracy of Transformers and the efficiency needed for large-volume deployments, all in an easy-to-use package. In this blog post, we want to share detailed performance results for Infinity running on the latest generation of Intel Xeon CPUs, to achieve optimal cost, efficiency, and latency for your Transformer deployments.



What is Hugging Face Infinity

Hugging Face Infinity is a containerized solution for customers to deploy end-to-end optimized inference pipelines for State-of-the-Art Transformer models, on any infrastructure.

Hugging Face Infinity consists of two main services:

  • The Infinity Container is a hardware-optimized inference solution delivered as a Docker container.
  • Infinity Multiverse is a Model Optimization Service through which a Hugging Face Transformer model is optimized for the Target Hardware. Infinity Multiverse is compatible with Infinity Container.

The Infinity Container is built specifically to run on a Target Hardware architecture and exposes an HTTP /predict endpoint to run inference.
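As an illustration, a minimal Python client for such a /predict endpoint might look like the sketch below. The localhost URL and port are assumptions for a locally running container, and the payload schema follows the {"inputs": ...} convention used by the hosted demo endpoint later in this post.

```python
import json
import urllib.request


def build_predict_request(base_url: str, text: str) -> urllib.request.Request:
    """Build a POST request for an Infinity Container's /predict endpoint."""
    payload = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it would be: urllib.request.urlopen(req)
req = build_predict_request("http://localhost:8080", "I like you. I love you")
print(req.full_url)      # http://localhost:8080/predict
print(req.get_method())  # POST
```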

Figure 1. Infinity Overview

An Infinity Container is designed to serve 1 Model and 1 Task. A Task corresponds to machine learning tasks as defined in the Transformers Pipelines documentation. As of the writing of this blog post, supported tasks include feature extraction/document embedding, ranking, sequence classification, and token classification.

You can find more details about Hugging Face Infinity at hf.co/infinity, and if you are interested in testing it yourself, you can sign up for a free trial at hf.co/infinity-trial.




Benchmark

Inference performance benchmarks often only measure the execution of the model. In this blog post, and when discussing the performance of Infinity, we always measure the end-to-end pipeline including pre-processing, prediction, and post-processing. Please keep this in mind when comparing these results with other latency measurements.

Figure 2. Infinity End-to-End Pipeline



Environment

As a benchmark environment, we are going to use the Amazon EC2 C6i instances, which are compute-optimized instances powered by the 3rd generation of Intel Xeon Scalable processors. These new Intel-based instances use the Ice Lake process technology and support Intel AVX-512, Intel Turbo Boost, and Intel Deep Learning Boost.

In addition to superior performance for machine learning workloads, the Intel Ice Lake C6i instances offer great cost-performance and are our recommendation for deploying Infinity on Amazon Web Services. To learn more, visit the EC2 C6i instance page.



Methodologies

When it comes to benchmarking BERT-like models, the two most widely adopted metrics are:

  • Latency: time it takes for a single prediction of the model (pre-process, prediction, post-process)
  • Throughput: number of executions performed in a fixed amount of time for one benchmark configuration, respecting physical CPU cores, sequence length, and batch size
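A sketch of how these two metrics could be collected for one configuration is shown below. Here `run_pipeline` and its ~2 ms sleep are placeholders standing in for the real end-to-end pipeline call, not Infinity internals.

```python
import statistics
import time


def run_pipeline(text: str) -> None:
    """Stand-in for the end-to-end call (pre-process, predict, post-process)."""
    time.sleep(0.002)  # placeholder: pretend each prediction takes ~2 ms


def benchmark(n_requests: int = 50) -> dict:
    latencies_ms = []
    start = time.perf_counter()
    for _ in range(n_requests):
        t0 = time.perf_counter()
        run_pipeline("I like you. I love you")
        latencies_ms.append((time.perf_counter() - t0) * 1000)
    elapsed = time.perf_counter() - start
    return {
        "throughput_rps": n_requests / elapsed,        # requests per second
        "avg_ms": statistics.mean(latencies_ms),
        "p95_ms": statistics.quantiles(latencies_ms, n=100)[94],
        "max_ms": max(latencies_ms),
    }


results = benchmark()
print(results)
```

Reporting percentiles (p95, p99) alongside the average matters because tail latency, not the mean, usually determines whether a real-time use case is viable.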

These two metrics will be used to benchmark Hugging Face Infinity across different setups in this blog post, to understand the benefits and tradeoffs.




Results

To run the benchmark, we created an Infinity Container for the EC2 C6i instance (Ice Lake) and optimized a DistilBERT model for sequence classification using Infinity Multiverse.

This Ice Lake-optimized Infinity Container can achieve up to 34% better latency & throughput compared to existing Cascade Lake-based instances, and up to 800% better latency & throughput compared to vanilla transformers running on Ice Lake.

The benchmark we created consists of 192 different experiments and configurations. We ran experiments for:

  • Physical CPU cores: 1, 2, 4, 8
  • Sequence length: 8, 16, 32, 64, 128, 256, 384, 512
  • Batch size: 1, 2, 4, 8, 16, 32
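The 192 configurations are simply the cross product of the three dimensions above (4 × 8 × 6):

```python
from itertools import product

cores = [1, 2, 4, 8]
seq_lens = [8, 16, 32, 64, 128, 256, 384, 512]
batch_sizes = [1, 2, 4, 8, 16, 32]

# One experiment per (cores, sequence length, batch size) combination
configs = list(product(cores, seq_lens, batch_sizes))
print(len(configs))  # 192
```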

In each experiment, we collect numbers for:

  • Throughput (requests per second)
  • Latency (min, max, avg, p90, p95, p99)

You can find the full data of the benchmark in this Google spreadsheet: 🤗 Infinity: CPU Ice-Lake Benchmark.

In this blog post, we will highlight a few results of the benchmark, including the best latency and throughput configurations.

In addition to this, we deployed the DistilBERT model we used for the benchmark as an API endpoint on 2 physical cores. You can test it and get a feeling for the performance of Infinity. Below you can find a curl command showing how to send a request to the hosted endpoint. The API returns an x-compute-time HTTP header, which contains the duration of the end-to-end pipeline.

curl --request POST -i \
  --url https://infinity.huggingface.co/cpu/distilbert-base-uncased-emotion \
  --header 'Content-Type: application/json' \
  --data '{"inputs":"I like you. I love you"}'



Throughput

Below you can find the throughput comparison for running Infinity on 2 physical cores with batch size 1, compared with vanilla transformers.

Figure 3. Throughput: Infinity vs Transformers

| Sequence Length | Infinity    | Transformers | Improvement |
|-----------------|-------------|--------------|-------------|
| 8               | 248 req/sec | 49 req/sec   | +506%       |
| 16              | 212 req/sec | 50 req/sec   | +424%       |
| 32              | 150 req/sec | 40 req/sec   | +375%       |
| 64              | 97 req/sec  | 28 req/sec   | +346%       |
| 128             | 55 req/sec  | 18 req/sec   | +305%       |
| 256             | 27 req/sec  | 9 req/sec    | +300%       |
| 384             | 17 req/sec  | 5 req/sec    | +340%       |
| 512             | 12 req/sec  | 4 req/sec    | +300%       |
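The improvement column above appears to be the raw throughput ratio expressed as a percentage (Infinity req/sec ÷ Transformers req/sec × 100, truncated to an integer); reproducing it from the table values:

```python
# sequence length -> (Infinity req/sec, Transformers req/sec)
throughput = {
    8: (248, 49), 16: (212, 50), 32: (150, 40), 64: (97, 28),
    128: (55, 18), 256: (27, 9), 384: (17, 5), 512: (12, 4),
}

for seq_len, (infinity_rps, transformers_rps) in throughput.items():
    # Integer floor division avoids floating-point rounding surprises
    improvement = infinity_rps * 100 // transformers_rps
    print(f"seq_len={seq_len:>3}: +{improvement}%")
```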



Latency

Below, you can find the latency results for an experiment running Hugging Face Infinity on 2 physical cores with batch size 1. It is remarkable to see how robust and consistent Infinity is, with minimal deviation for p95, p99, or p100 (max latency). This result is confirmed for other experiments in the benchmark as well.

Figure 4. Latency (Batch=1, Physical Cores=2)




Conclusion

In this post, we showed how Hugging Face Infinity performs on the new Intel Ice Lake Xeon CPUs. We created a detailed benchmark with over 190 different configurations, sharing the results you can expect when using Hugging Face Infinity on CPU, what would be the best configuration to optimize your Infinity Container for latency, and what would be the best configuration to maximize throughput.

Hugging Face Infinity can deliver up to 800% higher throughput compared to vanilla transformers, and down to 1-4ms latency for sequence lengths up to 64 tokens.

The flexibility to optimize transformer models for throughput, latency, or both enables businesses to either reduce infrastructure costs for the same workload or to enable real-time use cases that were not possible before.

If you are interested in trying out Hugging Face Infinity, sign up for your trial at hf.co/infinity-trial.



