This article shows how to get incredibly fast per-token throughput when generating with the 176B parameter BLOOM model.
As the model needs 352GB in bf16 (bfloat16) weights (176*2), the most efficient set-up is 8x80GB A100 GPUs. Also 2x8x40GB A100s or 2x8x48GB A6000s can be used. The main reason for using these GPUs is that at the time of this writing they provide the largest GPU memory, but other GPUs can be used as well. For example, 24x32GB V100s would work.
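If you want to sanity-check these numbers, the weight-memory arithmetic is simple enough to do in a few lines (activations and the past-token cache need additional room on top of the weights):

```python
# Back-of-the-envelope weight-memory math for BLOOM-176B
# (activations and the past-token cache need extra room on top of this)
params = 176e9  # 176B parameters

for dtype, bytes_per_param in {"bf16/fp16": 2, "int8": 1}.items():
    print(f"{dtype}: ~{params * bytes_per_param / 1e9:.0f}GB of weights")
# bf16/fp16: ~352GB of weights
# int8: ~176GB of weights
```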
Using a single node will typically deliver the fastest throughput since intra-node GPU linking hardware is usually faster than inter-node links, but it's not always the case.
If you don't have that much hardware, it's still possible to run BLOOM inference on smaller GPUs by using CPU or NVMe offload, but of course the generation time will be much slower.
We are also going to cover the 8-bit quantized solutions, which require half the GPU memory at the cost of slightly slower throughput. We will discuss the BitsAndBytes and Deepspeed-Inference libraries there.
Benchmarks
Without any further delay, let's show some numbers.
For the sake of consistency, unless stated otherwise, the benchmarks in this article were all run on the same 8x80GB A100 node with 512GB of CPU memory on the Jean Zay HPC. Jean Zay users enjoy very fast IO of about 3GB/s read speed (GPFS). This is important for checkpoint loading time: a slow disk will result in slow loading time, especially since we are concurrently doing IO in multiple processes.
All benchmarks perform greedy generation of 100 token outputs:
Generate args {'max_length': 100, 'do_sample': False}
The input prompt consists of just a few tokens. Previous token caching is on as well, as it would be quite slow to recalculate the past tokens all the time.
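To make the setup concrete, here is a minimal sketch of such a greedy-generation call with transformers (shown with the small bigscience/bloom-560m checkpoint so it runs on a single GPU; the benchmarks themselves load bigscience/bloom via the scripts discussed below, and the prompt here is purely illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# small stand-in checkpoint so the sketch runs anywhere; the benchmarks use bigscience/bloom
model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()

generate_kwargs = {"max_length": 100, "do_sample": False}  # greedy decoding

inputs = tokenizer(["DeepSpeed is a machine learning framework"], return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(**inputs, **generate_kwargs, use_cache=True)  # past-token cache on
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```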
First, let's have a quick look at how long it took to get ready to generate, i.e. how long it took to load and prepare the model:
| project | secs |
|---|---|
| accelerate | 121 |
| ds-inference shard-int8 | 61 |
| ds-inference shard-fp16 | 60 |
| ds-inference unsharded | 662 |
| ds-zero | 462 |
Deepspeed-Inference comes with pre-sharded weight repositories, for which the loading takes about 1 minute. Accelerate's loading time is excellent as well, at just about 2 minutes. The other solutions are much slower here.
The loading time may or may not matter, since once loaded you can repeatedly generate tokens without additional loading overhead.
Next, the most important benchmark: token generation throughput. The throughput metric here is simple: how long it took to generate 100 new tokens, divided by 100 and the batch size (i.e. divided by the total number of generated tokens).
Here is the throughput in msecs on 8x80GB GPUs:
| project \ bs | 1 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| accelerate bf16 | 230.38 | 31.78 | 17.84 | 10.89 | oom | | | |
| accelerate int8 | 286.56 | 40.92 | 22.65 | 13.27 | oom | | | |
| ds-inference fp16 | 44.02 | 5.70 | 3.01 | 1.68 | 1.00 | 0.69 | oom | |
| ds-inference int8 | 89.09 | 11.44 | 5.88 | 3.09 | 1.71 | 1.02 | 0.71 | oom |
| ds-zero bf16 | 283 | 34.88 | oom | | | | | |
where OOM == Out of Memory condition, i.e. the batch size was too big to fit into GPU memory.
Getting an under-1-msec-per-token throughput with Deepspeed-Inference's Tensor Parallelism (TP) and custom fused CUDA kernels! That's absolutely amazing! Though using this solution for other models it hasn't been tried on may require some developer time to make it work.
Accelerate is super fast as well. It uses a very simple approach of naive Pipeline Parallelism (PP), and since it's so simple it should work out of the box with any model.
Since Deepspeed-ZeRO can process multiple generation streams in parallel, its throughput can be further divided by 8 or 16, depending on whether 8 or 16 GPUs were used during the generate call. And, of course, it means that it can process a batch size of 64 in the case of 8x80 A100s (the table above), so the effective throughput is about 4 msecs and all 3 solutions end up very close to one another.
Let's revisit how these numbers were calculated. Generating 100 new tokens for a batch size of 128 took 8832 msecs in real time when using Deepspeed-Inference in fp16 mode. So to calculate the throughput we did: walltime/(batch_size*new_tokens), or 8832/(128*100) = 0.69.
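In code form, the metric used throughout the tables is simply:

```python
def msec_per_token(walltime_msec, batch_size, new_tokens=100):
    # wall time divided by the total number of generated tokens
    return walltime_msec / (batch_size * new_tokens)

# e.g. ds-inference fp16: 100 new tokens at batch size 128 took 8832 msecs
print(round(msec_per_token(8832, 128), 2))  # 0.69
```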
Now let's look at the power of the quantized int8-based models provided by Deepspeed-Inference and BitsAndBytes, as they require only half the GPU memory of bfloat16 or float16 inference.
Throughput in msecs on 4x80GB A100:
| project \ bs | 1 | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|---|
| accelerate int8 | 284.15 | 40.14 | 21.97 | oom | | |
| ds-inference int8 | 156.51 | 20.11 | 10.38 | 5.50 | 2.96 | oom |
To reproduce the benchmark results, simply add --benchmark to any of the 3 scripts discussed below.
Solutions
First, check out the demo repository:
git clone https://github.com/huggingface/transformers-bloom-inference
cd transformers-bloom-inference
In this article we are going to use 3 scripts located under bloom-inference-scripts/.
The framework-specific solutions are presented in alphabetical order:
HuggingFace Accelerate
Accelerate handles big models for inference in the following way:
- Instantiate the model with empty weights.
- Analyze the size of each layer and the available space on each device (GPUs, CPU) to decide where each layer should go.
- Load the model checkpoint bit by bit and put each weight on its device.
It then ensures the model runs properly with hooks that transfer the inputs and outputs to the right device, and that the model weights offloaded to the CPU (or even the disk) are loaded onto a GPU just before the forward pass, before being offloaded again once the forward pass is finished.
In a situation where there are multiple GPUs with enough space to accommodate the whole model, it switches control from one GPU to the next until all layers have run. Only one GPU works at any given time, which sounds very inefficient, yet it produces decent throughput despite the idling of the GPUs.
It is also very flexible since the same code can run on any given setup. Accelerate will use all available GPUs first, then offload to the CPU until the RAM is full, and finally to the disk. Offloading to CPU or disk will make things slower. As an example, users have reported running BLOOM with no code changes on just 2 A100s with a throughput of 15s per token, as compared to 10 msecs on 8x80 A100s.
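For the curious, here is a minimal sketch of those three steps using Accelerate's API directly; the checkpoint path is a placeholder, and the actual bloom-accelerate-inference.py script relies on the higher-level from_pretrained(..., device_map="auto") route that wraps the same machinery:

```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigscience/bloom")

# 1. instantiate the model with empty (meta) weights - no memory is allocated yet
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# 2. + 3. compute a device map from layer sizes and available GPU/CPU memory,
# then stream the checkpoint in, placing each weight directly on its target device
model = load_checkpoint_and_dispatch(
    model,
    "/path/to/bloom-checkpoint",             # placeholder: local folder with the downloaded weights
    device_map="auto",
    no_split_module_classes=["BloomBlock"],  # keep each transformer block on a single device
)
```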
You can learn more about this solution in the Accelerate documentation.
Setup
pip install transformers>=4.21.3 accelerate>=0.12.0
Run
The simplest execution is:
python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --batch_size 1 --benchmark
To activate the 8-bit quantized solution from BitsAndBytes, first install bitsandbytes:
pip install bitsandbytes
and then add --dtype int8 to the previous command line:
python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark
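Under the hood, --dtype int8 boils down to transformers' bitsandbytes integration; roughly speaking (a sketch, not the exact script code), it amounts to:

```python
from transformers import AutoModelForCausalLM

# load_in_8bit quantizes the linear-layer weights to int8 with bitsandbytes,
# roughly halving the GPU memory needed for the weights
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    load_in_8bit=True,
)
```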
If you have more than 4 GPUs you can tell it to use only 4 with:
CUDA_VISIBLE_DEVICES=0,1,2,3 python bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom --dtype int8 --batch_size 1 --benchmark
The highest batch size we were able to run without OOM was 40 in this case. If you look inside the script, you will see we had to tweak the memory allocation map to free the first GPU to handle only activations and the previous tokens' cache.
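That tweak looks roughly like the following sketch (the values are illustrative, not the exact ones from the script): GPU 0 gets no weight budget in the max_memory map so it stays free for activations and the cache:

```python
import torch
from transformers import AutoModelForCausalLM

n_gpus = torch.cuda.device_count()

# illustrative memory map: no weights on GPU 0, ~51GiB of weights on each other GPU,
# spill anything left over to CPU RAM
max_memory = {0: "0GiB", **{i: "51GiB" for i in range(1, n_gpus)}, "cpu": "400GiB"}

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",
    max_memory=max_memory,
    torch_dtype=torch.bfloat16,
)
```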
DeepSpeed-Inference
DeepSpeed-Inference uses Tensor-Parallelism and efficient fused CUDA kernels to deliver a super-fast <1msec per token inference on a big batch size of 128.
Setup
pip install deepspeed>=0.7.3
Run
1. The fastest approach is to use a TP-pre-sharded (TP = Tensor Parallel) checkpoint that takes only ~1min to load, as compared to 10min for the non-pre-sharded bloom checkpoint:
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16
1a. If you want to run the original bloom checkpoint, which once loaded will run at the same throughput as the previous solution, but for which the loading will take 10-20min:
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom
2a. The 8-bit quantized version requires you to have only half the GPU memory of the normal half-precision version:
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
Here we used microsoft/bloom-deepspeed-inference-int8 and also told the script to run in int8.
And of course, just 4x80GB A100 GPUs are now sufficient:
deepspeed --num_gpus 4 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-int8 --dtype int8
The highest batch size we were able to run without OOM was 128 in this case.
You can see two factors at play leading to the better performance here.
- The throughput here was improved by using Tensor Parallelism (TP) instead of the Pipeline Parallelism (PP) used by Accelerate. Because Accelerate is meant to be very generic, it is unfortunately also hard to maximize GPU utilization with it. All computations are done first on GPU 0, then on GPU 1, etc. until GPU 8, which means 7 GPUs are idle all the time. DeepSpeed-Inference on the other hand uses TP, meaning it sends tensors to all GPUs, computes part of the generation on each GPU, and then all GPUs communicate the results to each other before moving on to the next layer. That means all GPUs are active at once, but they need to communicate much more.
- DeepSpeed-Inference also uses custom CUDA kernels to avoid allocating too much memory and doing tensor copying to and from GPUs. The effect of this is lower memory requirements and fewer kernel launches, which improves the throughput and allows for larger batch sizes, leading to higher overall throughput.
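For reference, the core of the DeepSpeed-Inference setup looks roughly like this simplified sketch (the real bloom-ds-inference.py is more involved, e.g. it avoids materializing the full fp16 weights in every process):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# simplified: load the model in half precision
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom", torch_dtype=torch.float16)

# shard the model with Tensor Parallelism across the GPUs and inject the fused CUDA kernels
model = deepspeed.init_inference(
    model,
    mp_size=8,                        # number of GPUs to split each tensor across
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # use DeepSpeed's custom fused kernels
)
# model.module is then used for generate() as usual, launched with `deepspeed --num_gpus 8 ...`
```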
If you are interested in more examples, you can take a look at Accelerate GPT-J inference with DeepSpeed-Inference on GPUs or Accelerate BERT inference with DeepSpeed-Inference on GPUs.
Deepspeed ZeRO-Inference
Deepspeed ZeRO uses a magical sharding approach which can take almost any model and scale it across a few or hundreds of GPUs, and then do training or inference on it.
Setup
pip install deepspeed
Run
Note that the script currently runs the same inputs on all GPUs, but you could run a different stream on each GPU and get n_gpu times faster throughput. You can't do that with Deepspeed-Inference.
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 1 --benchmark
Please remember that with ZeRO the user can generate multiple unique streams at the same time, and thus the overall performance should be throughput in secs/token divided by the number of participating GPUs, i.e. 8x to 16x faster depending on whether 8 or 16 GPUs were used!
You can also try the offloading solutions with just one small GPU, which will take a long time to run, but if you don't have 8 huge GPUs this is as good as it gets.
CPU-Offload (1x GPU):
deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --cpu_offload --benchmark
NVMe-Offload (1x GPU):
deepspeed --num_gpus 1 bloom-inference-scripts/bloom-ds-zero-inference.py --name bigscience/bloom --batch_size 8 --nvme_offload_path=/path/to/nvme_offload --benchmark
Make sure to adjust /path/to/nvme_offload to point somewhere with ~400GB of free space on a fast NVMe drive.
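For orientation, the ZeRO-3 inference configuration behind these runs looks roughly like the following sketch (the values are illustrative; the offload_param block is what the --cpu_offload and --nvme_offload_path options control):

```python
# illustrative ZeRO-3 inference config: stage-3 parameter sharding, with the
# parameters optionally offloaded to CPU RAM or an NVMe drive
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",                      # "cpu", "nvme" or "none"
            "nvme_path": "/path/to/nvme_offload",  # only relevant for NVMe offload
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
}

# the engine wraps the model and generation then runs on its .module, e.g.:
# ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
```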
Additional Client and Server Solutions
At transformers-bloom-inference you will find more very efficient solutions, including server solutions.
Here are some previews.
Server solutions:
More client-side solutions:
As this blog post is likely to become outdated if you read it months after it was published, please
use transformers-bloom-inference to find the most up-to-date solutions.
Blog credits
Huge thanks to the following kind folks who asked good questions and helped improve the readability of the article:
Olatunji Ruwase and Philipp Schmid.
