TL;DR: KVPress packs the latest KV cache compression techniques, enabling memory-efficient long-context LLMs. 🚀
One of the key features of Large Language Models (LLMs) is their context window: the maximum number of tokens they can process in a single request. As LLMs evolve, their context windows have become increasingly larger.
Larger context windows unlock incredible possibilities:
- In-context retrieval: Seamlessly referencing large amounts of text within a single query.
- In-context learning: Adapting behavior to specific examples within the same session.
- Extended reasoning: Handling very long chains of thought without breaking context.
However, these extended windows come at a cost: the memory taken up by the long context in the KV Cache becomes hard to manage. For instance, handling 1M tokens with Llama 3-70B in float16 demands 330 GB for the KV Cache, rendering it infeasible for many applications.
In this blog post, we'll address one solution to this problem: compressing the KV Cache for more efficient generation. To achieve this, we'll explore:
- What the KV Cache is and why it matters.
- KVPress, a powerful toolkit from NVIDIA designed to compress the KV Cache effectively.
- The inner workings of KVPress and how it achieves compression.
Before getting started, you can explore KVPress in this Space (you'll find examples at the end if needed).
What is the KV Cache and Why Does it Matter?

Figure 1: Key Value cache contained in the attention module (Source: NVIDIA)
In autoregressive models, text generation happens token by token, with each prediction relying on all preceding tokens for context. For example:
- To generate token 1000, the model must consider the representations of tokens 1 to 999.
- To generate token 1001, the same information (tokens 1 to 999) must be processed again, along with token 1000.
This repetitive computation becomes inefficient as the sequence grows, especially for large models. The KV Cache optimizes this process by storing the intermediate results, the keys (K) and values (V) from the attention layers, so the model can reuse them for future tokens instead of recalculating them.
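To make this concrete, here is a toy sketch of KV caching for a single attention head in plain PyTorch (a simplified illustration of the idea, not how transformers implements it): only the newest token's query, key, and value are computed at each step, while previously computed keys and values are reused from the cache.
import torch

# Toy KV cache for one attention head (no batching, no masking, random weights).
head_dim = 64
Wq, Wk, Wv = (torch.randn(head_dim, head_dim) for _ in range(3))
k_cache, v_cache = [], []

def attend(new_hidden):
    # Compute q, k, v only for the newest token, then reuse all cached keys/values.
    q, k, v = new_hidden @ Wq, new_hidden @ Wk, new_hidden @ Wv
    k_cache.append(k)
    v_cache.append(v)
    K, V = torch.stack(k_cache), torch.stack(v_cache)   # (seq_len, head_dim)
    weights = (K @ q / head_dim**0.5).softmax(dim=-1)    # attention over all cached tokens
    return weights @ V                                    # context vector for the new token

for token_embedding in torch.randn(5, head_dim):          # process 5 tokens, one at a time
    _ = attend(token_embedding)
print(len(k_cache))  # 5 cached key/value pairs, growing linearly with sequence length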
The Problem: KV Cache and its Linearly Scaling Burden
As powerful as the KV Cache is, it comes with a major drawback: it scales linearly with the size of the context window. While this might not sound alarming at first, let's break it down to see why it becomes a serious bottleneck.
The Size of the KV Cache
The values stored in the KV Cache come from all the attention blocks used by the model. Therefore, its size depends on the model architecture, which dictates the number of attention layers and heads. More concretely, the memory consumed by the KV Cache is determined by the following equation:
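(The notation below is a reconstruction chosen to match the factors discussed in the rest of this post; the leading factor of 2 accounts for storing both keys and values.)

$$\text{KV Cache size (bytes)} = 2 \times \text{precision} \times n_{\text{layers}} \times n_{\text{kv\_heads}} \times d_{\text{head}} \times \text{context length}$$

Here, precision is the number of bytes per stored value (2 for float16/bfloat16), n_layers is the number of attention layers, n_kv_heads is the number of key-value heads per layer, and d_head is the dimension of each head.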
Each of these factors contributes to the explosion in memory usage. To make this more tangible, let's consider a concrete example: Llama 3-70B running in bfloat16 precision (as recommended by the model authors) with a context size of 1M tokens:
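Plugging in Llama 3-70B's architecture values (80 layers, 8 key-value heads, head dimension 128; these figures come from the public model configuration rather than from this post) with 2 bytes per value and 1M tokens gives

$$2 \times 2\,\text{bytes} \times 80 \times 8 \times 128 \times 10^{6} \approx 330\,\text{GB}$$

which matches the KV Cache figure quoted earlier.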
Since bfloat16 uses 2 bytes per parameter, the model weights alone require 140 GB (70B × 2 bytes). This means that running the model with a 1M-token context demands roughly 470 GB of memory, with the KV Cache alone accounting for a staggering 70% of this total.
KVPress: A toolkit for KV Cache Compression
As we've seen, the KV Cache is both a critical enabler and a significant bottleneck for deploying large language models (LLMs) with long context windows. Addressing the linearly scaling memory problem requires innovative compression techniques, and that is exactly where KVPress steps in.
KVPress, developed by NVIDIA, is a Python toolkit designed to address the memory challenges of large KV Caches by providing a set of state-of-the-art compression techniques. It also integrates with other approaches, such as KV Cache quantization, a method built into the transformers library to reduce memory usage (the precision term in the equation above), further expanding its utility (details here).
For researchers working on compression, KVPress offers a flexible and modular framework, making it easy to understand and extend with new methods. For developers, KVPress simplifies the process of deploying these cutting-edge techniques, enabling quick and efficient integration into real-world applications.
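For reference, KV Cache quantization can be enabled directly at generation time in transformers. The exact API has evolved across transformers versions, so treat the snippet below as a sketch (it assumes a recent transformers release and the quanto backend) and check the linked documentation for details:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("A very long context ...", return_tensors="pt").to(model.device)
# Quantize the KV Cache to 4 bits, shrinking the precision term of the formula above
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))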
KVPress in Action
At its core, KVPress relies on presses, compression algorithms specifically designed to reduce the memory footprint of the KV Cache.
Many of these presses rely on a score, computed in each attention head, to prune the KV pairs with the lowest importance. For instance, KnormPress scores KV pairs by the L2 norm of their keys (paper), and SnapKVPress prunes the KV pairs that receive low attention weights from the latest queries (paper).
These presses are seamlessly integrated into the attention layers of the model using forward hooks.
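To make the idea concrete, here is a minimal, self-contained sketch of score-based pruning for a single attention head. This is not the KVPress implementation or API (in KVPress the equivalent logic runs inside forward hooks attached to the attention layers); the scoring below loosely mimics SnapKVPress by averaging the attention that the most recent queries pay to each cached KV pair.
import torch

def prune_kv_pairs(keys, values, scores, compression_ratio=0.5):
    # Keep only the highest-scoring fraction of KV pairs for one attention head.
    # keys, values: (seq_len, head_dim); scores: (seq_len,), higher = more important.
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * (1 - compression_ratio)))
    keep = scores.topk(n_keep).indices.sort().values  # preserve original token order
    return keys[keep], values[keep]

# Toy scores in the spirit of SnapKVPress: average attention from the latest queries.
seq_len, head_dim, window = 1024, 128, 16
keys, values = torch.randn(seq_len, head_dim), torch.randn(seq_len, head_dim)
recent_queries = torch.randn(window, head_dim)
attn = (recent_queries @ keys.T / head_dim**0.5).softmax(dim=-1)  # (window, seq_len)
scores = attn.mean(dim=0)                                         # one score per KV pair
small_keys, small_values = prune_kv_pairs(keys, values, scores, compression_ratio=0.5)
print(small_keys.shape)  # torch.Size([512, 128])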

Figure 2: KV Compression visualized (Source: NVIDIA)
During text generation, they dynamically compress the KV Cache, reducing memory usage without compromising the model’s ability to generate coherent and accurate outputs. Each press is characterised by a compression_ratio attribute, which determines the degree of compression applied to the KV Cache.
These presses integrate seamlessly with a custom transformers pipeline, enabling easy application and experimentation.
Here's how you can use one of the many available presses, ExpectedAttentionPress, with the KVPress pipeline. This press prunes the KV pairs associated with the lowest expected attention weight for future queries.
from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    model_kwargs={"attn_implementation": "sdpa"},
)

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"  # optional

press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]
Try it directly in this Hugging Face Space or in this Google Colab notebook!
By targeting the pre-filling phase, KVPress ensures that the cache is compressed when it is largest, helping reduce memory overhead for sequences with tens or even hundreds of thousands of tokens.
The plot below demonstrates the GPU memory savings achieved with KVPress compression as prompt lengths increase. For shorter prompts, most of the memory is allocated to the model weights, roughly 15 GB for Llama 3.1 8B in bfloat16. However, as prompt lengths grow, the KV cache becomes a major contributor to memory consumption. At a 128k context length, applying KVPress with a 50% compression ratio reduces peak memory usage from 45 GB to 37 GB. This smaller KV cache also improves decoding speed, from 11 tokens per second to 17 tokens per second on an A100 GPU (source).
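As a rough sanity check, applying the formula from earlier with Llama 3.1 8B's architecture (32 layers, 8 key-value heads, head dimension 128; values taken from the public model configuration) gives an uncompressed KV Cache of about

$$2 \times 2\,\text{bytes} \times 32 \times 8 \times 128 \times 131072 \approx 17\,\text{GB}$$

at 128k tokens, so a 50% compression ratio frees roughly 8 to 9 GB, consistent with the 45 GB to 37 GB drop reported above.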

Figure 3: Memory usage vs Context length (Source: NVIDIA)
The research community has been actively developing various techniques for KV cache compression. KVPress encourages researchers to contribute their methods and already provides more than a dozen presses.
To evaluate the performance of these presses, KVPress includes a simple CLI for benchmarking them on standard long-context datasets such as RULER, InfiniteBench, and Loogle. The plot below benchmarks 9 different presses on the RULER dataset with a 4k context length and different compression ratios. The best-performing press on this dataset is a combination of AdaKVPress (paper) and ExpectedAttentionPress, a new, unpublished pruning technique created by the authors of KVPress (more information here).

Figure 4: Average rating vs Compression ratio (Source: NVIDIA)
The growing context windows of LLMs unlock new possibilities but pose significant memory challenges due to the linearly scaling KV Cache. KVPress addresses this by compressing the cache during the critical pre-filling phase.
While KVPress improves memory efficiency, higher compression ratios can impact model accuracy, as shown in the benchmark plot. Further research is needed to develop more effective compression algorithms that minimize these trade-offs.
With its seamless integration into the transformers library and modular design, KVPress empowers researchers and developers to handle long-context LLMs efficiently and to design new compression techniques. It is a practical solution for scaling LLMs without overwhelming memory resources, ensuring innovation stays accessible as models grow.
