At Hugging Face, we’re excited to share with you a brand new feature that is going to take your language models to the next level: KV Cache Quantization.
TL;DR: KV Cache Quantization reduces memory usage for long-context text generation in LLMs with minimal impact on quality, offering customizable trade-offs between memory efficiency and generation speed.
Have you ever tried generating a lengthy piece of text with your language model, only to hit a wall due to pesky memory limitations? As language models continue to grow in size and capabilities, supporting longer generations can start to really eat up memory. It’s a common frustration, especially when you’re dealing with limited resources. That’s where kv cache quantization swoops in to save the day.
So, what exactly is kv cache quantization? If you’re not familiar with the term, don’t sweat it! Let’s break it down into two pieces: kv cache and quantization.
Key-value cache, or kv cache, is needed to optimize generation in autoregressive models, where the model predicts text token by token. This process can be slow since the model can generate only one token at a time, and each new prediction depends on the previous context. That means, to predict token number 1000 in the generation, you need information from the previous 999 tokens, which comes in the form of some matrix multiplications over the representations of those tokens. But to predict token number 1001, you also need the same information from the first 999 tokens, plus additional information from token number 1000. That is where the key-value cache is used to optimize the sequential generation process, by storing previous calculations to reuse for subsequent tokens, so they don’t need to be computed again.
More concretely, the key-value cache acts as a memory bank for autoregressive generative models, where the model stores key-value pairs derived from self-attention layers for previously processed tokens. In the transformer architecture, self-attention layers calculate attention scores by multiplying queries with keys, producing weighted sums of value vectors as outputs. By storing this information, the model can avoid redundant computations and instead retrieve the keys and values of previous tokens from the cache. For a visual explanation of this concept, take a look at how the key-value cache functions in the image below. When calculating the attention scores for the K+1th token, we don’t need to recompute all of the previous keys and values, but rather take them from the cache and concatenate them to the current vector. This results in faster and more efficient text generation.
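To make the idea concrete, here is a minimal sketch of cache reuse in plain PyTorch (single attention head, no projections; the names are illustrative and not the actual Transformers internals):

```python
import torch

def attend_with_cache(q_new, k_new, v_new, cache=None):
    # cache holds the keys/values of all previously processed tokens
    if cache is not None:
        k_all = torch.cat([cache["k"], k_new], dim=1)  # reuse cached keys
        v_all = torch.cat([cache["v"], v_new], dim=1)  # reuse cached values
    else:
        k_all, v_all = k_new, v_new
    # attention of the newest token over every token seen so far
    scores = (q_new @ k_all.transpose(-1, -2)) / k_all.shape[-1] ** 0.5
    out = scores.softmax(dim=-1) @ v_all
    # the updated cache is handed to the next generation step
    return out, {"k": k_all, "v": v_all}

# toy loop: "generate" 3 tokens with a single head of dimension 8
cache = None
for _ in range(3):
    q, k, v = (torch.randn(1, 1, 8) for _ in range(3))
    out, cache = attend_with_cache(q, k, v, cache)
print(cache["k"].shape)  # torch.Size([1, 3, 8]): one cached key per processed token
```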

Moving on to the second term, quantization is just a fancy word for reducing the precision of numerical values to save memory. During quantization, each numerical value is rounded or truncated to fit within the reduced precision format, which may lead to a loss of information. However, careful selection of quantization parameters and techniques can minimize this loss while still achieving satisfactory performance. There are different quantization methods, so if you’re curious to learn more, be sure to check out our previous blog post for a deeper dive into the world of quantization.
Even though the kv cache speeds up autoregressive generation, it can become a memory bottleneck with long context lengths or high batch sizes. Let’s estimate how much memory we need to store the kv cache for an input of sequence length 10000 tokens for a 7B Llama-2 model. The memory required to store the kv cache of one token is roughly 2 * 2 * num_layers * num_key_value_heads * head_dim, where the first 2 accounts for keys and values and the second 2 is the number of bytes we need (assuming the model is loaded in float16). So if we have a context of length 10000 tokens, we would need
2 * 2 * 32 * 32 * 128 * 10000 ≈ 5GB
of memory just to store the previous key-value cache, which is almost one third of the memory required to store the model parameters in half precision.
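The same back-of-the-envelope estimate in a few lines of Python:

```python
# Rough kv cache size for Llama-2-7B with a 10,000-token context in float16
num_layers = 32
num_key_value_heads = 32
head_dim = 128
bytes_per_value = 2        # float16
seq_len = 10_000

kv_cache_bytes = 2 * bytes_per_value * num_layers * num_key_value_heads * head_dim * seq_len
print(f"{kv_cache_bytes / 1024**3:.2f} GiB")  # ~4.88 GiB, i.e. roughly 5 GB
```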
Therefore, by compressing the kv cache into a more compact form we can save a lot of memory and run longer-context generation on consumer GPUs. In our experiments, we were able to significantly reduce the memory footprint without sacrificing too much quality by quantizing the kv cache into lower precision formats. With this new quantization feature, we can now support longer generations without running out of memory, which means you can expand your model’s context length without worrying about hitting a memory constraint.
Implementation Details
Key-value cache quantization in Transformers was largely inspired by the KIVI: A Tuning-Free Asymmetric 2bit Quantization for kv Cache paper. The paper introduced 2-bit asymmetric quantization for large language models without quality degradation. KIVI quantizes the key cache per-channel and the value cache per-token, because the authors showed that for LLMs keys have higher magnitudes of outliers in some channels, while values don’t show such a pattern. Therefore, the relative error between the quantized and original precision is much smaller when keys are quantized per-channel and values per-token.
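As a rough illustration of the difference (symmetric quantization is used here for brevity, whereas KIVI itself is asymmetric), per-channel quantization computes one scale per channel of the head dimension, while per-token quantization computes one scale per token:

```python
import torch

x = torch.randn(16, 128)  # (num_tokens, head_dim) slice of a key or value tensor

# per-channel (as KIVI does for keys): one scale per head_dim column
scale_channel = x.abs().amax(dim=0, keepdim=True) / 7   # 7 = max value of signed int4
q_channel = torch.clamp((x / scale_channel).round(), -8, 7)

# per-token (as KIVI does for values): one scale per token row
scale_token = x.abs().amax(dim=1, keepdim=True) / 7
q_token = torch.clamp((x / scale_token).round(), -8, 7)
```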
In the method we integrated into Transformers, keys and values are both quantized per-token. The main bottleneck when quantizing per-token is the need to quantize and de-quantize keys and values every time a new token is added, that is, at every generation step. That can cause a slowdown in generation. To overcome this issue, we decided to retain a fixed-size residual cache that stores keys and values in their original precision. When the residual cache reaches its maximum capacity, the stored keys and values are quantized and the residual cache content is discarded. This small trick also helps preserve accuracy, since some part of the most recent keys and values is always stored in their original precision. The main consideration is the memory-efficiency trade-off when setting the residual cache length: the residual cache stores keys and values in their original precision, which can lead to an overall increase in memory usage. We found that using a residual length of 128 works well as a baseline.
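Below is a simplified sketch of the residual-cache logic (illustrative only, not the actual Transformers implementation; quantize_fn stands in for whichever backend does the low-bit packing). At each generation step, the already-quantized blocks are de-quantized and concatenated with the residual entries before attention is computed.

```python
import torch

class ResidualKVCache:
    """Keep the most recent keys/values in original precision; quantize them in blocks."""

    def __init__(self, residual_length=128):
        self.residual_length = residual_length
        self.quantized_blocks = []   # blocks that have already been quantized
        self.residual = []           # most recent entries, kept in original precision

    def update(self, new_kv, quantize_fn):
        self.residual.append(new_kv)
        if len(self.residual) >= self.residual_length:
            # the residual cache is full: quantize it as one block and empty it
            block = torch.stack(self.residual)
            self.quantized_blocks.append(quantize_fn(block))
            self.residual = []
```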
So, given a key or value tensor of shape (batch size, num of heads, num of tokens, head dim), we group it into (num of groups, group size) and perform affine quantization as follows:
X_Q = round(X / S) - Z

where:
- X_Q is the quantized tensor
- S is the scale, calculated as S = (maxX - minX) / (max_val_for_precision - min_val_for_precision)
- Z is the zero point, calculated as Z = round(-minX / S)
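Here is a minimal sketch of such grouped affine quantization in PyTorch (illustrative only, not the quanto or HQQ kernels; the zero point is applied here so that quantized values land in the unsigned range [0, 2^nbits - 1], and the sign convention for Z varies slightly between write-ups):

```python
import torch

def quantize_affine(x, nbits=4, group_size=64):
    """Grouped asymmetric (affine) quantization along the last dimension."""
    max_val = 2**nbits - 1
    groups = x.reshape(-1, group_size)                     # (num of groups, group size)
    x_min = groups.min(dim=-1, keepdim=True).values
    x_max = groups.max(dim=-1, keepdim=True).values
    scale = (x_max - x_min) / max_val                      # S
    zero_point = torch.round(-x_min / scale)               # Z
    x_q = torch.clamp(torch.round(groups / scale) + zero_point, 0, max_val)
    return x_q.reshape(x.shape), scale, zero_point

def dequantize_affine(x_q, scale, zero_point, group_size=64):
    groups = x_q.reshape(-1, group_size)
    return ((groups - zero_point) * scale).reshape(x_q.shape)

key = torch.randn(1, 32, 10, 128)                          # (batch, heads, tokens, head dim)
k_q, s, z = quantize_affine(key)
print((key - dequantize_affine(k_q, s, z)).abs().max())    # small reconstruction error
```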
Currently, kv cache quantization works with the quanto backend with int2 and int4 precisions, and with the HQQ backend with int2, int4 and int8 precisions. For more details about quanto, refer to the previous blogpost. Although we don’t currently support more quantization backends, we’re open to community contributions that could help integrate them. Specifically, quantization methods that don’t need calibration data and can dynamically calculate lower-bit tensors on the fly can be easily integrated. Moreover, you can indicate the most common quantization parameters in the config, and thus have the freedom to tweak the quantization process, e.g. decide whether to perform per-channel or per-token quantization depending on your use case.
Comparing performance of fp16 and quantized cache
We know visuals speak louder than words, so we’ve prepared some comparison plots to give you a snapshot of how quantization stacks up against fp16 precision. These plots show you at a glance how the model’s generation holds up in terms of quality when we tweak the precision settings for the kv cache. We calculated the perplexity of the Llama2-7b-chat model on the PG-19 dataset with the following quantization parameters: nbits=4, group_size=64, residual_length=128, per_token=True.
We can see that the int4 cache performs almost the same as the original fp16 precision for both backends, while the quality degrades when using int2. The script to reproduce the results is available here.

The same conclusion holds when evaluating performance on the LongBench benchmark and comparing it to the results from the KIVI paper. Int4 quanto precision is comparable to, and even slightly outperforms, fp16 on all the datasets in the table below (higher is better).
| Dataset | KIVI fp16 | KIVI int2 | Transformers fp16 | Quanto int4 | Quanto int2 |
|---|---|---|---|---|---|
| TREC | 63.0 | 67.5 | 63.0 | 63.0 | 55.0 |
| SAMSum | 41.12 | 42.18 | 41.12 | 41.3 | 14.04 |
| TriviaQA | NA | NA | 84.28 | 84.76 | 63.64 |
| HotPotQA | NA | NA | 30.08 | 30.04 | 17.3 |
| Passage_retrieval_en | NA | NA | 8.5 | 9.5 | 4.82 |
Now, let’s talk about the trade-off between memory savings and speed. When we quantize the kv cache in models, we make them less memory hungry, but sometimes that comes at a small cost to generation speed. While quantizing the cache to int4 can offer roughly a 2.5x memory saving, generation speed starts to decrease with higher batch sizes. One has to decide whether using a quantized kv cache and potentially sacrificing a bit of speed is worth the trade-off for the significant gains in memory efficiency. It’s all about finding the approach that best fits your specific use case and priorities.
Below are the performance metrics for the kv cache in original precision and quantized format. The script to obtain the following figures is available here.



Wondering what happens when we throw weight quantization into the mix? Sure, combining these techniques can further slim down your model’s memory footprint, but there’s a catch: it can slow things down even more. In fact, our experiments show that weight quantization together with kv cache quantization can lead to a threefold decrease in speed. But we’re constantly tinkering away to find ways to make this combo work seamlessly. And while we don’t currently have optimized kernels in the quanto library, we’re open to community contributions that could help improve computational efficiency. Our goal is to ensure your model runs smoothly while maintaining low latency and high accuracy.
It’s also worth noting that the initial processing of the input prompt (aka the pre-fill stage) still requires computing all the key-value matrices in one go for the entire input, which may be another memory bottleneck for long contexts. This is the reason why the latency associated with generating the first token tends to be higher compared to subsequent tokens. There are other strategies to decrease the memory burden of the pre-fill stage by optimizing the attention computation, such as Local Windowed Attention or Flash Attention. If you run out of memory during the pre-fill stage, you can use FlashAttention in 🤗 Transformers together with kv cache quantization to decrease memory usage even more for long input prompts. See the docs for more information on that.
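For example, a minimal sketch of combining the two (it assumes the flash-attn package is installed and the GPU supports it; the cache_config values mirror the usage example further below):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",  # reduces pre-fill memory for long prompts
    device_map="cuda:0",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
inputs = tokenizer("Some very long prompt...", return_tensors="pt").to(model.device)

# combine Flash Attention (pre-fill) with kv cache quantization (decoding)
out = model.generate(
    **inputs,
    max_new_tokens=20,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
```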
If you’re curious about how many tokens we can fit in the context when memory usage is pushed to its limits: a quantized kv cache can support up to 128k tokens with Flash Attention enabled on an 80GB A100. For the cache in half precision, the maximum capacity is 40k tokens.
How to use quantized kv cache in 🤗 Transformers?
To use kv cache quantization in 🤗 Transformers, we have to install the external dependency first by running pip install quanto. To activate quantization of the kv cache, we have to pass cache_implementation="quantized" and indicate the quantization parameters in a cache config in dictionary format. And that’s all we need to start using kv cache quantization. Moreover, since quanto is device agnostic, you can quantize and run your model regardless of whether you are on CPU/GPU/MPS (Apple Silicon).
Here you can find a short Colab notebook with usage examples.
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
>>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16, device_map="cuda:0")
>>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"backend": "quanto", "nbits": 4})
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. It's a great way to express myself and rel
>>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
Conclusion
There are many more methods to reduce memory usage from the key-value cache, including MultiQueryAttention, GroupedQueryAttention, and recent kv cache retrieval methods. While some of these methods are tied to model architecture decisions, others can be applied post-training. Quantization is one such post-training optimization technique, and we can draw the following conclusions from our short blogpost:
- Memory vs Speed trade-off: By quantizing the kv cache into lower precision formats, memory usage is significantly reduced, allowing for longer text generations without encountering memory constraints. But users need to decide whether giving up a little bit of generation speed suits their use case.
- Maintained Accuracy: Despite the reduction in precision, kv cache quantization in int4 preserves model accuracy to a satisfactory extent, ensuring that generated text remains contextually relevant and coherent.
- Flexibility: Users have the flexibility to choose between different precision formats based on their specific requirements, allowing for customization to fit various use cases and priorities.
- Potential for Further Optimization: While kv cache quantization provides significant benefits on its own, it can also be combined with other optimization techniques, such as weight quantization, to further enhance memory efficiency and computational speed.
Acknowledgment
Special thanks to Younes and Marc for their assistance and advice on quantization techniques. Their expertise greatly contributed to the development of this feature.
Additionally, I would like to thank Joao for his invaluable support.
Additional Resources
- Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, & Xia Hu. (2023). KIVI: Plug-and-play 2bit KV Cache Quantization with Streaming Asymmetric Quantization.
- Blogpost from Databricks on LLM Inference Performance Engineering: Best Practices
- Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, & Amir Gholami. (2024). KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.
- T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
- A. Gholami, S. Kim, Z. Dong, Z. Yao, M. W. Mahoney, and K. Keutzer, (2021). A Survey of Quantization Methods for Efficient Neural Network Inference.
