
vLLM: PagedAttention for 24x Faster LLM Inference


Almost all large language models (LLMs) rely on the Transformer neural architecture. While this architecture is praised for its efficiency, it has some well-known computational bottlenecks.

During decoding, one of these bottlenecks is the computation of the attention, which relies on pairs of key-value tensors for every token of the input. All these tensors have to be stored in memory.

Note: I won’t explain in this article the role of these key-value pairs. It’s one of the most complicated and interesting aspects of the Transformer architecture. If you are not familiar with it, I strongly recommend reading The Illustrated Transformer by Jay Alammar.

As LLMs accept longer and longer inputs, e.g., the LLM Claude accepts 100k token-long inputs, the memory consumed by these tensors can become very large.

Naively storing all these tensors in memory leads to memory over-reservation and fragmentation. This fragmentation can make memory access very inefficient, especially for long sequences of tokens. As for over-reservation, the system does it to make sure it has allocated enough memory for the tensors, even if it doesn’t consume all of it.

To alleviate these issues, UC Berkeley proposes PagedAttention.

PagedAttention is implemented in vLLM (Apache 2.0 license), which is deployed by LMSYS, an organization for open research founded by students and faculty from UC Berkeley with the help of UCSD and CMU.

In this article, I explain what PagedAttention is and why it significantly speeds up decoding. Toward the end of the article, I show how to get started with vLLM to exploit PagedAttention for inference and for serving LLMs on your computer.

Kwon et al. (2023) propose PagedAttention.

The goal is to store key-value tensors more efficiently in the non-contiguous spaces of the GPU VRAM.

In short, the idea behind PagedAttention is to create contiguous virtual blocks mapped to physical blocks in the GPU memory.

Each block is designed to store the key-value tensors for a predefined number of tokens. All the blocks are virtually contiguous and mapped to non-contiguous physical blocks, allocated on demand during inference, in the fragmented GPU memory. A simple index table is also created in memory to associate virtual blocks with physical blocks.

The PagedAttention kernel fetches these blocks as needed. This is efficient because the system fetches small numbers of key-value tensors at a time, thanks to the limited size of the blocks.
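
To make this more concrete, here is a minimal Python sketch of such an index table. The class and names are my own illustration of the idea, not vLLM’s actual implementation, and the random choice of physical blocks just mimics allocation in fragmented memory:

import random

BLOCK_SIZE = 4  # number of tokens whose key-value tensors fit in one block

class BlockTable:
    """Maps contiguous virtual blocks to non-contiguous physical blocks in GPU memory."""

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))  # physical block ids still available
        self.virtual_to_physical = []  # entry i: physical block backing virtual block i

    def append_token(self, token_index):
        # A new physical block is only allocated when a new virtual block starts.
        if token_index % BLOCK_SIZE == 0:
            chosen = random.choice(self.free_blocks)  # physical blocks need not be contiguous
            self.free_blocks.remove(chosen)
            self.virtual_to_physical.append(chosen)
        return self.virtual_to_physical[token_index // BLOCK_SIZE]

table = BlockTable(num_physical_blocks=1024)
for i in range(11):  # 11 tokens -> 3 virtual blocks (4 + 4 + 3 tokens)
    print(f"token {i} -> physical block {table.append_token(i)}")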

Let’s take the following prompt for illustration:

the cat is sleeping in the kitchen and the dog is

We have key-value tensors for each token. With PagedAttention, we can (arbitrarily) set the block size at 4. Each block contains 4 key-value tensors, except the last one, which contains only 3 key-value tensors. The blocks are virtually contiguous but are not necessarily contiguous in the GPU memory, as illustrated by the figure in the introduction of this article.

For the computation of attention, for each query token, the system fetches the blocks one by one, as illustrated below.

Illustration of virtual blocks containing key-value tensors for up to 4 tokens — Image by the author

By fetching key-value tensors block by block, instead of the entire sequence of tensors, the computation of attention is much faster.
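
Here is a toy NumPy sketch of what block-by-block attention looks like for a single query token. It only illustrates the fetching pattern; it is not vLLM’s kernel:

import numpy as np

# Toy block-wise attention for a single query token (the 1/sqrt(d) scaling
# and multi-head details are omitted for brevity).
d = 8                                                    # head dimension
block_sizes = [4, 4, 3]                                  # 11 cached tokens split into 3 blocks
keys = [np.random.randn(n, d) for n in block_sizes]      # key tensors, one array per block
values = [np.random.randn(n, d) for n in block_sizes]    # value tensors, one array per block
q = np.random.randn(d)                                   # query of the current token

scores = np.concatenate([k @ q for k in keys])           # fetch and score one block at a time
weights = np.exp(scores - scores.max())
weights /= weights.sum()                                 # softmax over all cached tokens
output = weights @ np.concatenate(values)                # attention output for this token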

Another advantage of PagedAttention is that the virtual blocks can be shared when sampling during inference. All the sequences generated in parallel via sampling or beam search can use the same virtual blocks, avoiding duplicates.

In their experiments, LMSYS observed a 55% reduction in memory usage for beam search decoding.
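
A rough sketch of the sharing idea, with arbitrary block numbers of my own: parallel sequences reference the same physical blocks for the shared prompt, and reference counting tells the allocator when a block can be freed.

# Two sequences sampled in parallel from the same prompt: they reference the same
# physical blocks for the shared prompt and only get new blocks for their own tokens.
shared_prompt_blocks = [3, 17, 42]               # physical blocks holding the prompt's KV tensors
sequence_a_blocks = shared_prompt_blocks + [5]   # block 5: tokens sampled for sequence A
sequence_b_blocks = shared_prompt_blocks + [11]  # block 11: tokens sampled for sequence B

# A reference count per physical block tells the allocator when a block can be freed.
ref_count = {3: 2, 17: 2, 42: 2, 5: 1, 11: 1}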

Before trying it ourselves, let’s have a look at the performance reported by the authors (UC Berkeley/LMSYS) when using PagedAttention implemented in vLLM, compared to the text generation inference library developed by Hugging Face.

Performance of LLaMa models for output completion tasks for the original Hugging Face library (HF), text generation inference library (TGI), and vLLM with PagedAttention (vLLM) — Plots by UC Berkeley and LMSYS

vLLM looks much faster according to these results, especially in the case of multiple output completions. The difference between TGI and vLLM increases with bigger models. This is expected since bigger models require more memory and are thus more impacted by memory fragmentation.

Overall, vLLM is up to 24x faster than the Hugging Face Transformers library.

Note: Actually, I’m also impressed by the improvement from HF to TGI. I haven’t covered TGI yet on my blog, but I’ll probably write a guide about it. TGI is used in production at Hugging Face. While it seems much slower than vLLM, TGI has other advantages, such as support for many more models and features.

Note: vLLM doesn’t support CUDA 12 yet. Use a lower version, such as 11.8.
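
If you are not sure which CUDA version your environment uses, a quick way to check it, assuming PyTorch is already installed, is:

import torch

# Prints the CUDA version PyTorch was built against, e.g. "11.8" (None for a CPU-only build).
print(torch.version.cuda)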

In this section, I’ll only go through the basics of how to set up and run vLLM on your computer. For more advanced usage, you can have a look at the vLLM documentation.

As I write this article, vLLM only supports a few types of models:

  • GPT-2
  • GPT-NeoX and Pythia based
  • LLaMa based
  • OPT based

You can add support for other models by following these instructions.

In the code below, I use Dolly V2 (MIT license). It’s a chat model based on Pythia and trained by Databricks.

I selected the smallest version with 3 billion parameters. It can run on a consumer GPU with 24 GB of VRAM, e.g., an NVIDIA RTX 3080/3090.

The most straightforward way to install vLLM is with pip:

pip install vllm

Note: This can take up to 10 minutes.

But in my case, on both my computer and Google Colab, pip failed to install the vllm library. The authors of vLLM confirm that there is a problem with some nvcc versions and environments. Nonetheless, for most configurations, pip should install vLLM without any problem.

If you are in the same situation as me, the workaround is simply to use a Docker image. This one worked for me:

docker run --gpus all -it --rm --shm-size=8g nvcr.io/nvidia/pytorch:22.12-py3

Note: Once inside the Docker container, the authors recommend removing PyTorch before installing vLLM: pip uninstall torch. Then, “pip install vllm” should work.

Then, we can start writing Python.

We first need to import vllm, and then we load the model with vllm. The inference is triggered by llm.generate().

from vllm import LLM

prompts = ["Tell me about gravity"] #You may put several prompts on this list
llm = LLM(model="databricks/dolly-v2-3b") # Load the model
outputs = llm.generate(prompts) # Trigger inference
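
If you want more control over generation, you can also pass a SamplingParams object to llm.generate(). The parameter values below are just an example; the loop shows how to read the generated text from the returned outputs:

from vllm import LLM, SamplingParams

llm = LLM(model="databricks/dolly-v2-3b")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=200)

outputs = llm.generate(["Tell me about gravity"], sampling_params)
for output in outputs:
    print(output.prompt)           # the original prompt
    print(output.outputs[0].text)  # the generated completion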

You can also use vLLM for serving LLMs. It works similarly to TGI. It’s also much simpler than running the NVIDIA Triton Inference Server, which I described in a previous article.

You first need to start the server:

python -m vllm.entrypoints.openai.api_server --model databricks/dolly-v2-3b

Note: The server will listen on port 8000. Make sure it is available, or change it in the vLLM configuration.

Then, you can query the server with prompts as follows:

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "databricks/dolly-v2-3b",
"prompt": "Tell me about gravity",
"max_tokens": 200
}'
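
If you prefer to query the server from Python rather than curl, you can send the same request with the requests library. The response follows the OpenAI completions format, so the generated text should be under choices[0].text:

import requests

# Same request as the curl command above, sent from Python.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "databricks/dolly-v2-3b",
        "prompt": "Tell me about gravity",
        "max_tokens": 200,
    },
)
print(response.json()["choices"][0]["text"])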

And that’s it! You have a very efficient LLM server running on your computer.

PagedAttention significantly speeds up inference. It’s another step toward cheaper AI with LLMs.

In further experiments, I confirmed that vLLM is very efficient with batches of prompts. To fully benefit from vLLM, consider optimizing your batching strategy for inference.

While beam search with large beams may have been prohibitive with standard attention computation, beam search with PagedAttention is faster and more memory efficient.

One of my next experiments will be to combine PagedAttention with QLoRa to reduce memory usage. It should be straightforward. This would make running LLMs on consumer hardware even more efficient.
