This article shows how to get incredibly fast per-token throughput when generating with the 176B parameter BLOOM model.
As the model needs 352GB in bf16 (bfloat16) weights (176*2), the most efficient set-up is 8x80GB A100 GPUs. Also 2x8x40GB A100s or 2x8x48GB A6000s can be used. The main reason for using these GPUs is that at the time of this writing they provide the largest GPU memory, but other GPUs can be used as well. For example, 24x32GB V100s would work.
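If you want to sanity-check these numbers, the weight-memory arithmetic is simple enough to do in a few lines (activations and the past-token cache need additional room on top of the weights):

```python
# Back-of-the-envelope weight-memory math for BLOOM-176B
# (activations and the past-token cache need extra room on top of this)
params = 176e9  # 176B parameters

for dtype, bytes_per_param in {"bf16/fp16": 2, "int8": 1}.items():
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f}GB of weights")
# bf16/fp16: ~352GB of weights
# int8: ~176GB of weights
```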
Using a single node will typically deliver the fastest throughput since intra-node GPU linking hardware is usually faster than inter-node links, but it's not always the case.
If you don't have that much hardware, it's still possible to run BLOOM inference on smaller GPUs by using CPU or NVMe offload, but of course the generation time will be much slower.
We are also going to cover the 8-bit quantized solutions, which require half the GPU memory at the cost of slightly slower throughput. We will discuss the BitsAndBytes and Deepspeed-Inference libraries there.
Benchmarks
Without any further delay, let's show some numbers.
For the sake of consistency, unless stated otherwise, the benchmarks in this article were all run on the same 8x80GB A100 node with 512GB of CPU memory on the Jean Zay HPC. Jean Zay users enjoy very fast IO of about 3GB/s read speed (GPFS). This is important for checkpoint loading time: a slow disk will result in slow loading time, especially since we are concurrently doing IO in multiple processes.
All benchmarks perform greedy generation of 100 token outputs:
Generate args {'max_length': 100, 'do_sample': False}
The input prompt consists of just a few tokens. Previous token caching is on as well, as it would be quite slow to recalculate the past tokens all the time.
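To make the setup concrete, here is a minimal sketch of such a greedy-generation call with transformers (shown with the small bigscience/bloom-560m checkpoint so it runs on a single GPU; the benchmarks themselves load bigscience/bloom via the scripts discussed below, and the prompt here is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# small stand-in checkpoint so the sketch runs anywhere; the benchmarks use bigscience/bloom
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

generate_kwargs = {"max_length": 100, "do_sample": False}  # greedy decoding

inputs = tokenizer(["DeepSpeed is a machine learning framework"], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, **generate_kwargs, use_cache=True)  # past-token cache on
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```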
First, let's have a quick look at how long it took to get ready to generate, i.e. how long it took to load and prepare the model:
| project | secs |
|---|---|
| accelerate | 121 |
| ds-inference shard-int8 | 61 |
| ds-inference shard-fp16 | 60 |
| ds-inference unsharded | 662 |
| ds-zero | 462 |
Deepspeed-Inference comes with pre-sharded weight repositories, for which the loading takes about 1 minute. Accelerate's loading time is excellent as well, at just about 2 minutes. The other solutions are much slower here.
The loading time may or may not matter, since once loaded you can repeatedly generate tokens without additional loading overhead.
Next, the most important benchmark: token generation throughput. The throughput metric here is simple: how long it took to generate 100 new tokens, divided by 100 and the batch size (i.e. divided by the total number of generated tokens).
Here is the throughput in msecs on 8x80GB GPUs:
| project \ bs | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| accelerate bf16 | 230.38 | 31.78 | 17.84 | 10.89 | oom | | | |
| accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | oom | | | |
| ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | oom | |
| ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | oom |
| ds-zero bf16 | 283 | 34.88 | oom | | | | | |
where OOM == Out of Memory condition, i.e. the batch size was too big to fit into GPU memory.
Getting an under-1-msec-per-token throughput with Deepspeed-Inference's Tensor Parallelism (TP) and custom fused CUDA kernels! That's absolutely amazing! Though using this solution for other models it hasn't been tried on may require some developer time to make it work.
Accelerate is super fast as well. It uses a very simple approach of naive Pipeline Parallelism (PP), and since it's so simple it should work out of the box with any model.
Since Deepspeed-ZeRO can process multiple generation streams in parallel, its throughput can be further divided by 8 or 16, depending on whether 8 or 16 GPUs were used during the generate call. And, of course, it means that it can process a batch size of 64 in the case of 8x80 A100s (the table above), so the effective throughput is about 4 msecs and all 3 solutions end up very close to one another.
Let's revisit how these numbers were calculated. Generating 100 new tokens for a batch size of 128 took 8832 msecs in real time when using Deepspeed-Inference in fp16 mode. So to calculate the throughput we did: walltime/(batch_size*new_tokens), or 8832/(128*100) = 0.69.
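In code form, the metric used throughout the tables is simply:

```python
def msec_per_token(walltime_msec, batch_size, new_tokens=100):
    # wall time divided by the total number of generated tokens
    return walltime_msec / (batch_size * new_tokens)

# e.g. ds-inference fp16: 100 new tokens at batch size 128 took 8832 msecs
print(round(msec_per_token(8832, 128), 2))  # 0.69
```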
Now let's look at the power of the quantized int8-based models provided by Deepspeed-Inference and BitsAndBytes, as they require only half the GPU memory of bfloat16 or float16 inference.
Throughput in msecs on 4x80GB A100:
| project \ bs | 1 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|
| accelerate int8 | 284.15 | 40.14 | 21.97 | oom | | |
| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | oom |
To reproduce the benchmark results, simply add --benchmark to any of the 3 scripts discussed below.
Solutions
First, check out the demo repository:
git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference
In this article we are going to use 3 scripts located under bloom-inference-scripts/.
The framework-specific solutions are presented in alphabetical order:
HuggingFace Accelerate
Accelerate handles big models for inference in the following way:
- Instantiate the model with empty weights.
- Analyze the size of each layer and the available space on each device (GPUs, CPU) to decide where each layer should go.
- Load the model checkpoint bit by bit and put each weight on its device.
It then ensures the model runs properly with hooks that transfer the inputs and outputs to the right device, and that the model weights offloaded to the CPU (or even the disk) are loaded onto a GPU just before the forward pass, before being offloaded again once the forward pass is finished.
In a situation where there are multiple GPUs with enough space to accommodate the whole model, it switches control from one GPU to the next until all layers have run. Only one GPU works at any given time, which sounds very inefficient, yet it produces decent throughput despite the idling of the GPUs.
It is also very flexible since the same code can run on any given setup. Accelerate will use all available GPUs first, then offload to the CPU until the RAM is full, and finally to the disk. Offloading to CPU or disk will make things slower. As an example, users have reported running BLOOM with no code changes on just 2 A100s with a throughput of 15s per token, as compared to 10 msecs on 8x80 A100s.
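For the curious, here is a minimal sketch of those three steps using Accelerate's API directly; the checkpoint path is a placeholder, and the actual bloom-accelerate-inference.py script relies on the higher-level from_pretrained(..., device_map="auto") route that wraps the same machinery:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")

# 1. instantiate the model with empty (meta) weights - no memory is allocated yet
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# 2. + 3. compute a device map from layer sizes and available GPU/CPU memory,
# then stream the checkpoint in, placing each weight directly on its target device
model = load_checkpoint_and_dispatch(
    model,
    "/path/to/bloom-checkpoint",             # placeholder: local folder with the downloaded weights
    device_map="auto",
    no_split_module_classes=["BloomBlock"],  # keep each transformer block on a single device
)
```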
You can learn more about this solution in the Accelerate documentation.
Setup
pip install transformers>=4.21.3 accelerate>=0.12.0
Run
The simplest execution is:
python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark
To activate the 8-bit quantized solution from BitsAndBytes, first install bitsandbytes:
pip install bitsandbytes
and then add --dtype int8 to the previous command line:
python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark
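Under the hood, --dtype int8 boils down to transformers' bitsandbytes integration; roughly speaking (a sketch, not the exact script code), it amounts to:

```python
from transformers import AutoModelForCausalLM

# load_in_8bit quantizes the linear-layer weights to int8 with bitsandbytes,
# roughly halving the GPU memory needed for the weights
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)
```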
If you have more than 4 GPUs you can tell it to use only 4 with:
CUDA_VISIBLE_DEVICES=0,1,2,3 python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark
The highest batch size we were able to run without OOM was 40 in this case. If you look inside the script, you will see we had to tweak the memory allocation map to free the first GPU to handle only activations and the previous tokens' cache.
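That tweak looks roughly like the following sketch (the values are illustrative, not the exact ones from the script): GPU 0 gets no weight budget in the max_memory map so it stays free for activations and the cache:

```python
import torch
from transformers import AutoModelForCausalLM

n_gpus = torch.cuda.device_count()

# illustrative memory map: no weights on GPU 0, ~51GiB of weights on each other GPU,
# spill anything left over to CPU RAM
max_memory = {0: "0GiB", **{i: "51GiB" for i in range(1, n_gpus)}, "cpu": "400GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)
```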
DeepSpeed-Inference
DeepSpeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels to deliver a super-fast <1msec per token inference on a big batch size of 128.
Setup
pip install deepspeed>=0.7.3
Run
1. The fastest approach is to use a TP-pre-sharded (TP = Tensor Parallel) checkpoint that takes only ~1min to load, as compared to 10min for the non-pre-sharded bloom checkpoint:
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16
1a. If you want to run the original bloom checkpoint, which once loaded will run at the same throughput as the previous solution, but for which the loading will take 10-20min:
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom
2a. The 8-bit quantized version requires you to have only half the GPU memory of the normal half-precision version:
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
Here we used microsoft/bloom-deepspeed-inference-int8 and also told the script to run in int8.
And of course, just 4x80GB A100 GPUs are now sufficient:
deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
The highest batch size we were able to run without OOM was 128 in this case.
You can see two factors at play leading to the better performance here.
- The throughput here was improved by using Tensor Parallelism (TP) instead of the Pipeline Parallelism (PP) used by Accelerate. Because Accelerate is meant to be very generic, it is unfortunately also hard to maximize GPU utilization with it. All computations are done first on GPU 0, then on GPU 1, etc. until GPU 8, which means 7 GPUs are idle all the time. DeepSpeed-Inference on the other hand uses TP, meaning it sends tensors to all GPUs, computes part of the generation on each GPU, and then all GPUs communicate the results to each other before moving on to the next layer. That means all GPUs are active at once, but they need to communicate much more.
- DeepSpeed-Inference also uses custom CUDA kernels to avoid allocating too much memory and doing tensor copying to and from GPUs. The effect of this is lower memory requirements and fewer kernel launches, which improves the throughput and allows for larger batch sizes, leading to higher overall throughput.
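For reference, the core of the DeepSpeed-Inference setup looks roughly like this simplified sketch (the real bloom-ds-inference.py is more involved, e.g. it avoids materializing the full fp16 weights in every process):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# simplified: load the model in half precision
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.float16)

# shard the model with Tensor Parallelism across the GPUs and inject the fused CUDA kernels
model = deepspeed.init_inference(
    model,
    mp_size=8,                        # number of GPUs to split each tensor across
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # use DeepSpeed's custom fused kernels
)
# model.module is then used for generate() as usual, launched with `deepspeed --num_gpus 8 ...`
```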
If you are interested in more examples, you can take a look at Accelerate GPT-J inference with DeepSpeed-Inference on GPUs or Accelerate BERT inference with DeepSpeed-Inference on GPUs.
Deepspeed ZeRO-Inference
Deepspeed ZeRO uses a magical sharding approach which can take almost any model and scale it across a few or hundreds of GPUs, and then do training or inference on it.
Setup
pip install deepspeed
Run
Note that the script currently runs the same inputs on all GPUs, but you could run a different stream on each GPU and get n_gpu times faster throughput. You can't do that with Deepspeed-Inference.
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark
Please remember that with ZeRO the user can generate multiple unique streams at the same time, and thus the overall performance should be throughput in secs/token divided by the number of participating GPUs, i.e. 8x to 16x faster depending on whether 8 or 16 GPUs were used!
You can also try the offloading solutions with just one small GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets.
CPU-Offload (1x GPU):
deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --cpu_offload --benchmark
NVMe-Offload (1x GPU):
deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --nvme_offload_path=/path/to/nvme_offload --benchmark
Make sure to adjust /path/to/nvme_offload to point somewhere with ~400GB of free space on a fast NVMe drive.
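For orientation, the ZeRO-3 inference configuration behind these runs looks roughly like the following sketch (the values are illustrative; the offload_param block is what the --cpu_offload and --nvme_offload_path options control):

```python
# illustrative ZeRO-3 inference config: stage-3 parameter sharding, with the
# parameters optionally offloaded to CPU RAM or an NVMe drive
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",                      # "cpu", "nvme" or "none"
            "nvme_path": "/path/to/nvme_offload",  # only relevant for NVMe offload
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}

# the engine wraps the model and generation then runs on its .module, e.g.:
# ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
```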
Additional Client and Server Solutions
At transformers-bloom-inference you will find more very efficient solutions, including server solutions.
Here are some previews.
Server solutions:
More client-side solutions:
As this blog post is likely to become outdated if you read it months after it was published, please
use transformers-bloom-inference to find the most up-to-date solutions.
Blog credits
Huge thanks to the following kind folks who asked good questions and helped improve the readability of the article:
Olatunji Ruwase and Philipp Schmid.
