This article will show you how to easily deploy large language models with hundreds of billions of parameters like BLOOM on Habana® Gaudi®2 using 🤗 Optimum Habana, which is the bridge between Gaudi2 and the 🤗 Transformers library. As demonstrated in the benchmark presented in this post, this will enable you to run inference faster than with any GPU currently available on the market.
As models get bigger and bigger, deploying them into production to run inference has become increasingly challenging. Both hardware and software have seen a lot of innovation to address these challenges, so let’s dive in to see how to efficiently overcome them!
BLOOMZ
BLOOM is a 176-billion-parameter autoregressive model that was trained to complete sequences of text. It can handle 46 different languages and 13 programming languages. Designed and trained as part of the BigScience initiative, BLOOM is an open-science project that involved a large number of researchers and engineers all over the world. More recently, another model with the exact same architecture was released: BLOOMZ, a version of BLOOM fine-tuned on several tasks, leading to better generalization and zero-shot[^1] capabilities.
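To give a feel for what zero-shot prompting means here, a minimal sketch with plain 🤗 Transformers could look like the following (the model id and prompt are only illustrative, and even the smaller 7B checkpoint needs a machine with enough memory to load it):

```python
from transformers import pipeline

# Illustrative zero-shot prompt with the smaller BLOOMZ checkpoint: the instruction is
# given in natural language only, with no examples of the task.
generator = pipeline("text-generation", model="bigscience/bloomz-7b1")
print(generator("Translate to English: Je t'aime.", max_new_tokens=10)[0]["generated_text"])
```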
Such large models raise new challenges in terms of memory and speed for both training and inference. Even in 16-bit precision, one instance requires 352 GB to fit! You will likely struggle to find any device with that much memory at the moment, but state-of-the-art hardware like Habana Gaudi2 does make it possible to perform inference on BLOOM and BLOOMZ models with low latencies.
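The 352 GB figure is simply the parameter count times two bytes per 16-bit parameter, before even counting activations and the key-value cache:

```python
n_params = 176e9      # 176 billion parameters
bytes_per_param = 2   # 16-bit (bf16/fp16) weights
print(f"{n_params * bytes_per_param / 1e9:.0f} GB")  # 352 GB just for the weights
```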
Habana Gaudi2
Gaudi2 is the second-generation AI hardware accelerator designed by Habana Labs. A single server contains 8 accelerator devices (called Habana Processing Units, or HPUs) with 96GB of memory each, which provides room to fit very large models. However, hosting the model is not very interesting if the computation is slow. Fortunately, Gaudi2 shines on that front: it differs from GPUs in that its architecture enables the accelerator to perform General Matrix Multiplication (GeMM) and other operations in parallel, which speeds up deep learning workflows. These features make Gaudi2 a great candidate for LLM training and inference.
Habana’s SDK, SynapseAI™, supports PyTorch and DeepSpeed for accelerating LLM training and inference. The SynapseAI graph compiler optimizes the execution of the operations accumulated in the graph (e.g. operator fusion, data layout management, parallelization, pipelining and memory management, and graph-level optimizations).
Moreover, support for HPU graphs and DeepSpeed-inference was introduced only recently in SynapseAI, and both are well-suited for latency-sensitive applications, as shown in our benchmark below.
All these features are integrated into the 🤗 Optimum Habana library, so that deploying your model on Gaudi is very easy. Check out the quick-start page here.
If you would like to get access to Gaudi2, go to the Intel Developer Cloud and follow this guide.
Benchmarks
In this section, we provide an early benchmark of BLOOMZ on Gaudi2, first-generation Gaudi and Nvidia A100 80GB. Although these devices have quite a lot of memory, the model is so large that a single device is not enough to hold a single instance of BLOOMZ. To solve this issue, we use DeepSpeed, a deep learning optimization library that enables many memory and speed improvements to accelerate the model and make it fit on the devices. In particular, we rely here on DeepSpeed-inference: it introduces several features such as model (or pipeline) parallelism to make the most of the available devices. For Gaudi2, we use Habana’s DeepSpeed fork, which adds support for HPUs.
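As a rough sketch of what DeepSpeed-inference does, here is how a causal-LM checkpoint gets wrapped and sharded across devices with vanilla DeepSpeed. The model id, dtype and parallelism degree are illustrative, the script must be started with a distributed launcher, and the actual benchmark goes through the run_generation.py script and Habana’s fork described below:

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Illustrative checkpoint; the full 176B model additionally relies on sharded checkpoint
# loading that the benchmark script handles for us.
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1", torch_dtype=torch.bfloat16)

# Split the weights across the 8 devices of the process group (model parallelism).
engine = deepspeed.init_inference(model, mp_size=8, dtype=torch.bfloat16)
model = engine.module  # the sharded model, ready for `generate`
```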
Latency
We measured latencies (batch of one sample) for two different sizes of BLOOMZ, both with multi-billion parameters: the 176-billion-parameter checkpoint and the 7-billion-parameter one.
Runs were performed with DeepSpeed-inference in 16-bit precision on 8 devices and with a key-value cache enabled. Note that while CUDA graphs are not currently compatible with model parallelism in DeepSpeed (DeepSpeed v0.8.2, see here), HPU graphs are supported in Habana’s DeepSpeed fork. All benchmarks perform greedy generation of 100-token outputs. The input prompt is:
“DeepSpeed is a machine learning framework”
which consists of seven tokens with BLOOM’s tokenizer.
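The prompt length and the generation settings above can be checked with a few lines of 🤗 Transformers (the commented-out `generate` call assumes a model loaded as in the earlier sketch):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
prompt = "DeepSpeed is a machine learning framework"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
print(input_ids.shape[-1])  # 7 tokens

# Greedy generation of 100 tokens with a key-value cache, as in the benchmark:
# outputs = model.generate(input_ids, max_new_tokens=100, do_sample=False, use_cache=True)
```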
The results for inference latency are displayed in the table below (the unit is seconds).
| Model | Number of devices | Gaudi2 latency (seconds) | A100-80GB latency (seconds) | First-gen Gaudi latency (seconds) |
|---|---|---|---|---|
| BLOOMZ | 8 | 3.103 | 4.402 | / |
| BLOOMZ-7B | 8 | 0.734 | 2.417 | 3.321 |
| BLOOMZ-7B | 1 | 0.772 | 2.119 | 2.387 |
Update: the numbers above were updated with the releases of Optimum Habana 1.6 and SynapseAI 1.10, resulting in a 1.42x speedup on BLOOMZ with Gaudi2 compared to A100.
The Habana team recently introduced support for DeepSpeed-inference in SynapseAI 1.8, thereby quickly enabling inference for models with 100+ billion parameters. For the 176-billion-parameter checkpoint, Gaudi2 is 1.42x faster than A100 80GB. Smaller checkpoints present interesting results too: Gaudi2 is 2.89x faster than A100 for BLOOMZ-7B! It is also interesting to note that it manages to benefit from model parallelism, whereas A100 is faster on a single device.
We also ran these models on first-gen Gaudi. While it is slower than Gaudi2, it is interesting from a price perspective, as a DL1 instance on AWS costs roughly $13 per hour. Latency for BLOOMZ-7B on first-gen Gaudi is 2.387 seconds. Thus, for the 7-billion-parameter checkpoint, first-gen Gaudi offers a better price-performance ratio than A100, which costs more than $30 per hour!
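As a rough back-of-the-envelope check of that claim (assuming the single-device BLOOMZ-7B latencies from the table, one sequence generated at a time, and the approximate hourly prices quoted above), the cost per 100-token generation is about twice as low on first-gen Gaudi:

```python
def cost_per_generation(hourly_price_usd, latency_s):
    # Price of keeping the instance busy for one generation.
    return hourly_price_usd / 3600 * latency_s

print(f"First-gen Gaudi: ${cost_per_generation(13, 2.387):.4f}")  # ~$0.0086
print(f"A100 80GB:       ${cost_per_generation(30, 2.119):.4f}")  # ~$0.0177
```

This ignores batching and the fact that an instance hosts several devices, so it only serves to illustrate the price-performance argument.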
We expect the Habana team to optimize the performance of these models in upcoming SynapseAI releases. For example, in our last benchmark, we saw that Gaudi2 performs Stable Diffusion inference 2.2x faster than A100, and this has since been further improved to 2.37x with the latest optimizations provided by Habana. We will update these numbers as new versions of SynapseAI are released and integrated within Optimum Habana.
Running inference on a complete dataset
The script we wrote enables using your model to complete sentences over a whole dataset. This is useful for trying out BLOOMZ inference on Gaudi2 on your own data.
Here is an example with the tldr_news dataset. It contains both the headline and content of several articles (you can visualize it on the Hugging Face Hub). We kept only the content column and truncated each sample to the first 16 tokens so that the model generates the rest of the sequence with 50 new tokens. A sketch of this preprocessing follows the samples below. The first five samples look like this:
Batch n°1
Input: ['Facebook has released a report that shows what content was most widely viewed by Americans between']
Output: ['Facebook has released a report that shows what content was most widely viewed by Americans between January and June of this year. The report, which is based on data from the company’s mobile advertising platform, shows that the most popular content on Facebook was news, followed by sports, entertainment, and politics. The report also shows that the most']
--------------------------------------------------------------------------------------------------
Batch n°2
Input: ['A quantum effect called superabsorption allows a collection of molecules to absorb light more']
Output: ['A quantum effect called superabsorption allows a collection of molecules to absorb light more strongly than the sum of the individual absorptions of the molecules. This effect is due to the coherent interaction of the molecules with the electromagnetic field. The superabsorption effect has been observed in a number of systems, including liquid crystals, liquid crystals in']
--------------------------------------------------------------------------------------------------
Batch n°3
Input: ['A SpaceX Starship rocket prototype has exploded during a pressure test. It was']
Output: ['A SpaceX Starship rocket prototype has exploded during a pressure test. It was the first time a Starship prototype had been tested in the air. The explosion occurred at the SpaceX facility in Boca Chica, Texas. The Starship prototype was being tested for its ability to withstand the pressure of flight. The explosion occurred at']
--------------------------------------------------------------------------------------------------
Batch n°4
Input: ['Scalene is a high-performance CPU and memory profiler for Python.']
Output: ['Scalene is a high-performance CPU and memory profiler for Python. It is designed to be a lightweight, portable, and easy-to-use profiler. Scalene is a Python package that can be installed on any platform that supports Python. Scalene is a lightweight, portable, and easy-to-use profiler']
--------------------------------------------------------------------------------------------------
Batch n°5
Input: ['With the rise of cheap small "Cube Satellites", startups are now']
Output: ['With the rise of cheap small "Cube Satellites", startups are now able to launch their own satellites for a fraction of the cost of a traditional launch. This has led to a proliferation of small satellites, which are now being used for a wide range of applications. The most common use of small satellites is for communications,']
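For reference, here is a minimal sketch of that preprocessing and generation with plain 🤗 Datasets and Transformers. The Hub identifier and split of tldr_news are assumptions, and the benchmark itself goes through the run_generation.py script described in the next section:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-7b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-7b1")

# Assumed Hub identifier for the tldr_news dataset; we only use the "content" column.
dataset = load_dataset("JulesBelveze/tldr_news", split="test")

for sample in dataset.select(range(5)):
    # Keep only the first 16 tokens of the article and let the model write the rest.
    inputs = tokenizer(sample["content"], return_tensors="pt", truncation=True, max_length=16)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False, use_cache=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```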
In the next section, we explain how to use the script we wrote to perform this benchmark or to apply it to any dataset you like from the Hugging Face Hub!
How to reproduce these results?
The script used for benchmarking BLOOMZ on Gaudi2 and first-gen Gaudi is available here. Before running it, please make sure that the latest versions of SynapseAI and the Gaudi drivers are installed, following the instructions given by Habana.
Then, run the following:
```bash
git clone https://github.com/huggingface/optimum-habana.git
cd optimum-habana && pip install . && cd examples/text-generation
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.9.0
```
Finally, you can launch the script as follows:
```bash
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path bigscience/bloomz --use_hpu_graphs --use_kv_cache --max_new_tokens 100
```
For multi-node inference, you can follow this guide from the documentation of Optimum Habana.
You can also load any dataset from the Hugging Face Hub to get prompts that will be used for generation, using the argument --dataset_name my_dataset_name.
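For example, a run over the tldr_news dataset used above could look like this (the dataset identifier is the assumed Hub name, and the other arguments mirror the command above):

```bash
python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_generation.py --model_name_or_path bigscience/bloomz-7b1 --use_hpu_graphs --use_kv_cache --max_new_tokens 50 --dataset_name JulesBelveze/tldr_news
```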
This benchmark was performed with Transformers v4.28.1, SynapseAI v1.9.0 and Optimum Habana v1.5.0.
For GPUs, here is the script that led to the results that were previously presented in this blog post (and here are the instructions to use it). To use CUDA graphs, static shapes are necessary, and this is not supported in 🤗 Transformers. You can use this repo written by the Habana team to enable them.
Conclusion
We saw in this article that Habana Gaudi2 performs BLOOMZ inference faster than Nvidia A100 80GB, and there is no need to write a complicated script, as 🤗 Optimum Habana provides easy-to-use tools to run inference with multi-billion-parameter models on HPUs. Future releases of Habana’s SynapseAI SDK are expected to speed up performance, so we will update this benchmark regularly as LLM inference optimizations on SynapseAI continue to advance. We are also looking forward to the performance benefits that will come with FP8 inference on Gaudi2.
We also presented the results achieved with first-generation Gaudi. For smaller models, it can perform on par with or even better than A100 for almost a third of its price. It is a good alternative option to GPUs for running inference with such a big model as BLOOMZ.
If you are interested in accelerating your Machine Learning training and inference workflows using the latest AI hardware accelerators and software libraries, check out our Expert Acceleration Program. To learn more about Habana solutions, read about our partnership and contact them here. To learn more about Hugging Face efforts to make AI hardware accelerators easy to use, check out our Hardware Partner Program.
Thanks for reading! If you have any questions, feel free to contact me, either through GitHub or on the forum. You can also connect with me on LinkedIn.
[^1]: “Zero-shot” refers to the ability of a model to complete a task on new or unseen input data, i.e. without having been provided any training examples of this kind of data. We provide the model with a prompt and a sequence of text that describes what we want the model to do, in natural language. Zero-shot classification excludes any examples of the desired task being completed. This differs from single- or few-shot classification, as these tasks include a single example or a few examples of the selected task.
