Meet vLLM: UC Berkeley's Open Source Framework for Super Fast and Cheap LLM Serving

The framework shows remarkable performance improvements compared with frameworks like Hugging Face's Transformers.

Image Credit: UC Berkeley

I recently started an AI-focused educational newsletter that already has over 160,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Despite the remarkable progress in large language models (LLMs), many elements of their lifecycle management remain a challenge. This is especially relevant with the new generation of open-source LLMs that have been rapidly emerging in the market. Among those elements, serving remains particularly difficult. Given the architectural complexities of LLMs, it is quite common to experience poor performance even when running on expensive hardware. To address some of these challenges, a team from UC Berkeley open-sourced vLLM, a framework designed to accelerate the inference and serving performance of LLMs. The framework showed remarkable performance gains compared with mainstream frameworks such as Hugging Face's Transformers. The core of vLLM is based on a remarkably creative new algorithm.

The vLLM team identified a memory bottleneck that hinders the performance of LLM serving. During the autoregressive decoding process, the LLM generates attention key and value tensors for all input tokens, which are stored in GPU memory in order to generate subsequent tokens. To tackle this issue, vLLM introduces PagedAttention, an attention algorithm inspired by the concept of virtual memory and paging in operating systems. Unlike traditional attention algorithms, PagedAttention allows continuous keys and values to be stored in non-contiguous memory space. Specifically, PagedAttention divides the KV cache of each sequence into blocks, with each block containing the keys and values for a fixed number of tokens. During attention computation, the PagedAttention kernel efficiently identifies and fetches these blocks.
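To make the idea concrete, the toy NumPy sketch below computes attention for a single query over keys and values scattered across a pool of fixed-size blocks; the block size, pool layout, and function names are assumptions for illustration and this is not vLLM's actual CUDA kernel.

import numpy as np

BLOCK_SIZE = 4   # tokens per block (assumed)
NUM_BLOCKS = 8   # physical blocks in the pool (assumed)
HEAD_DIM = 64

# Physical KV pool: [num_blocks, block_size, head_dim] for keys and values.
key_pool = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)
value_pool = np.random.randn(NUM_BLOCKS, BLOCK_SIZE, HEAD_DIM).astype(np.float32)

# Block table for one sequence: logical block i -> physical block id.
block_table = [5, 2, 7]   # physically non-contiguous blocks
seq_len = 10              # only 10 of the 12 reserved slots are filled

def paged_attention(query, block_table, seq_len):
    # Gather this sequence's keys/values block by block, then attend over them.
    keys = np.concatenate([key_pool[b] for b in block_table])[:seq_len]
    values = np.concatenate([value_pool[b] for b in block_table])[:seq_len]
    scores = keys @ query / np.sqrt(HEAD_DIM)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over the sequence
    return weights @ values           # [head_dim]

query = np.random.randn(HEAD_DIM).astype(np.float32)
print(paged_attention(query, block_table, seq_len).shape)   # (64,)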

Image Credit: UC Berkeley

Because the blocks no longer need to be contiguous in memory, PagedAttention gains greater flexibility in managing keys and values, much like virtual memory in operating systems. Think of the blocks as pages, tokens as bytes, and sequences as processes. The logically contiguous blocks of a sequence are mapped to physically non-contiguous blocks through a block table. As new tokens are generated, physical blocks are allocated on demand.
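The bookkeeping behind this mapping can be sketched in a few lines of Python; the class and variable names below are hypothetical rather than vLLM's internal implementation, but they show how a sequence's block table grows one physical block at a time as tokens arrive.

BLOCK_SIZE = 16   # tokens per block (assumed)

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))   # unused physical blocks
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.seq_lens = {}       # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:                 # last block is full (or none exists yet)
            table.append(self.free_blocks.pop())     # allocate a new physical block on demand
        self.seq_lens[seq_id] = length + 1

manager = BlockManager(num_physical_blocks=64)
for _ in range(40):                   # 40 tokens -> ceil(40 / 16) = 3 blocks
    manager.append_token(seq_id=0)
print(manager.block_tables[0])        # the three physical blocks backing this sequence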

Image Credit: UC Berkeley

With PagedAttention, memory waste occurs only in the last block of a sequence. In practice, this results in near-optimal memory utilization, with a mere 4% waste. Such an improvement in memory efficiency offers significant advantages: it enables batching more sequences together, increases GPU utilization, and consequently enhances throughput, as demonstrated in the performance results above.
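A quick back-of-the-envelope calculation shows why the waste stays so small; the block size and sequence length below are assumed values for illustration, while the roughly 4% figure above is the vLLM team's own measurement.

block_size = 16    # tokens per block (assumed)
seq_len = 403      # a hypothetical generated sequence length
num_blocks = -(-seq_len // block_size)    # ceiling division -> 26 blocks
allocated = num_blocks * block_size       # 416 token slots reserved
wasted = allocated - seq_len              # 13 unused slots, all in the last block
print(f"wasted fraction: {wasted / allocated:.1%}")   # wasted fraction: 3.1%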

PagedAttention also provides an additional advantage: efficient memory sharing. For example, during parallel sampling, multiple output sequences are generated from the same prompt. In such cases, the computation and memory for the prompt can be shared among the output sequences.

Image Credit: UC Berkeley

PagedAttention naturally facilitates memory sharing through its block table. PagedAttention's memory sharing significantly reduces the memory overhead of complex sampling algorithms like parallel sampling and beam search, reducing their memory usage by up to 55%.
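In vLLM's Python API, parallel sampling only requires asking for more than one completion per prompt through SamplingParams; the sketch below assumes the same lmsys/vicuna-7b-v1.3 model used in the examples that follow.

from vllm import LLM, SamplingParams

# Request three sampled completions per prompt; the prompt's KV-cache blocks
# can be shared across the three output sequences.
sampling_params = SamplingParams(n=3, temperature=0.8, max_tokens=64)
llm = LLM(model="lmsys/vicuna-7b-v1.3")
outputs = llm.generate(["The capital of France is"], sampling_params)
for completion in outputs[0].outputs:
    print(completion.text)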

The process of using vLLM is fundamentally simple. The framework supports both online and offline inference. For offline inference, you only need to import the library and call the generate method.

from vllm import LLM

prompts = ["Hello, my name is", "The capital of France is"] # Sample prompts.
llm = LLM(model="lmsys/vicuna-7b-v1.3") # Create an LLM.
outputs = llm.generate(prompts) # Generate texts from the prompts.
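The generate call returns one result object per prompt; a minimal way to inspect them, based on vLLM's documented RequestOutput objects, is a short loop:

for output in outputs:
    print(f"Prompt: {output.prompt!r} -> Generated: {output.outputs[0].text!r}")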

For online inference, vLLM lets you launch a server and query it using an OpenAI-compatible API.

$ python -m vllm.entrypoints.openai.api_server --model lmsys/vicuna-7b-v1.3

$ curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "lmsys/vicuna-7b-v1.3",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "temperature": 0
    }'
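Since the server speaks the OpenAI protocol, it can also be queried from Python with the openai client instead of curl; the sketch below assumes version 1.x of the openai package, and the api_key value is a placeholder because the local server does not validate it.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="lmsys/vicuna-7b-v1.3",
    prompt="San Francisco is a",
    max_tokens=7,
    temperature=0,
)
print(completion.choices[0].text)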

In a remarkable demonstration of performance improvements, the UC Berkeley team compared vLLM against Hugging Face's Transformers library for the LLaMA-7B model on an NVIDIA A10G GPU and for LLaMA-13B on an NVIDIA A100 GPU (40GB). vLLM delivered over 20x higher throughput across these benchmarks.

Image Credit: UC Berkeley

To evaluate the performance of vLLM yourself, you can try the online versions deployed for the Chatbot Arena and the Vicuna Demo.

vLLM is some of the impressive libraries within the LLM space that has been flying completely under the radar. Hopefully, we’ll see vLLM incorporated into LLM platforms within the near future.
