Optimizing Inference for Long Context and Large Batch Sizes with NVFP4 KV Cache

Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and KV cache, we can reduce the memory footprint and compute cost, directly improving throughput, latency, and achievable context length.

This blog introduces NVFP4 KV cache quantization, a new KV cache format that enables significant performance gains on NVIDIA Blackwell GPUs. NVFP4 cuts KV cache memory footprint by up to 50% and can effectively double context budgets, unlocking larger batch sizes, longer sequences, and higher cache-hit rates. These gains come with <1% accuracy loss across code-generation, knowledge, and long-context benchmarks.

In the sections that follow, we’ll explore how this optimization delivers tangible gains for inference workloads and strengthens the stacking effects of the NVIDIA extreme co-design stack.

What’s KV cache? 

Large language models (LLMs) rely on an autoregressive process of generating tokens one at a time based on all previous tokens. This process allows the model to consider the sequence’s full context, which is at the heart of why LLMs perform so well at natural language modeling tasks. The same behavior, however, results in significant compute inefficiencies as models recalculate each preceding token’s attention projections, known as the key and value tensors, every time a new token is generated.

Figure 1 below provides a simplified representation of the attention computations with and without a KV cache. Since previous tokens are masked from attending to future tokens, the key and value vectors for all past tokens (including the original input sequence) never change. As a result, recomputing them and redoing the associated matrix-multiply-add (MMA) operations for every new token is redundant and wastes computation.

Figure 1. A GIF of how key value caching reduces the work done by self attention in an autoregressive transformer. The top panel, labeled “No KV Cache,” shows that each new step recomputes queries, keys, values, and the full attention output for all tokens seen so far. The bottom panel, labeled “With KV Cache,” shows that only the current token’s query is newly computed, while all past keys and values are loaded from the cache, so the attention and output matrices are much smaller and redundant computation is avoided.

The KV cache was introduced to alleviate the compute bottleneck created by having to regenerate key and value vectors for every previously seen token. By paying a price in memory footprint and bandwidth, those K/V tensors are stored once and then fetched directly during attention, rather than recomputed. In practice, the cache sits behind a fixed-size memory pool, as shown in Figure 2 below.

Figure 2. Incoming tokens query a fixed memory pool of K/V tensors (the KV cache); cache hits reuse stored values to reduce compute, while cache misses trigger K/V recomputation and potential eviction when memory limits are reached.

When that pool fills, the KV cache manager evicts portions of older context. If a future request references an evicted span, the system takes a cache miss and is forced to recompute the missing K/V tensors. The net effect is that the actual performance gain hinges on cache-hit rate: high hit rates preserve the intended compute savings, while lower hit rates push the model back toward the very recomputation path the KV cache was meant to eliminate.

During inference, this cache is populated and used across two distinct phases. In the prefill phase, the model ingests the entire input sequence, running large, highly parallel matrix-multiply-and-accumulate (MMA) operations to compute attention and storing the resulting key and value vectors for all input tokens into the KV cache. The model then enters the decode phase, where it generates new tokens one at a time; each step requires a full forward pass, but the attention blocks now fetch key and value vectors for all previous tokens from the KV cache, compute the current token’s key and value vectors, and append them back into the cache so they can be reused on the next decoding step.
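
To make the two phases concrete, here is a minimal single-head decode-step sketch in PyTorch. It is illustrative only; the function name and shapes are our own, not a framework API, and production engines use batched, multi-head, paged caches. Only the new token’s query, key, and value are computed; every earlier key and value comes straight from the cache.

import torch

# Toy single-head decode step (illustrative only; real engines use batched,
# multi-head, paged KV caches).
def decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v):
    q_t = x_t @ W_q                                # (1, d) query for the new token
    k_t = x_t @ W_k                                # (1, d) new key
    v_t = x_t @ W_v                                # (1, d) new value
    cache_k = torch.cat([cache_k, k_t], dim=0)     # append the new K/V to the cache
    cache_v = torch.cat([cache_v, v_t], dim=0)
    scores = (q_t @ cache_k.T) / cache_k.shape[-1] ** 0.5   # attend over all cached keys
    attn = torch.softmax(scores, dim=-1)
    out = attn @ cache_v                           # (1, d) context vector for this token
    return out, cache_k, cache_v

# Usage: start from an empty cache and feed tokens one at a time
d = 64
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
cache_k = cache_v = torch.empty(0, d)
for _ in range(4):
    x_t = torch.randn(1, d)                        # hidden state of the newest token
    out, cache_k, cache_v = decode_step(x_t, W_q, W_k, W_v, cache_k, cache_v)

The cache grows by one row of keys and one row of values per generated token, which is exactly the memory cost the next section addresses.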

Optimizing KV cache with NVFP4

One of the newest opportunities to optimize KV cache performance is through NVFP4 and the NVIDIA TensorRT Model Optimizer. This new feature allows the KV cache to be quantized from its native 16-bit precision down to 4-bit.

KV cache quantization is not entirely new; FP8 KV caches are already well established in production. However, the increasing size of models and the scale of inference deployments mean that the KV cache can still become a significant bottleneck during prefill and decode. Quantizing the KV cache alleviates the burden on multiple components of the inference pipeline, impacting compute, memory capacity, and memory bandwidth:

  • Memory capacity: NVFP4 KV cache reduces the memory footprint of the KV cache by about 50% compared to FP8 KV cache. This enables larger context lengths, batch sizes, and user concurrency (see the back-of-the-envelope sizing sketch after this list).
  • Memory bandwidth: During the decode phase, which involves many reads and writes of the KV cache and puts significant pressure on memory bandwidth, a smaller KV cache consumes less memory bandwidth.
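
To put the memory-capacity point in perspective, the short sketch below estimates KV cache size for a hypothetical deployment. Every parameter is an illustrative placeholder rather than a specific model’s configuration, and the NVFP4 figure ignores the small per-block scale-factor overhead.

# Back-of-the-envelope KV cache sizing; all parameters are illustrative,
# not the configuration of a specific model.
num_layers   = 60
num_kv_heads = 8           # grouped-query attention
head_dim     = 128
seq_len      = 32 * 1024   # 32K-token context
batch_size   = 16

def kv_cache_gib(bytes_per_element: float) -> float:
    # Factor of 2 covers both keys and values
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bytes_per_element / 2**30

print(f"FP16 : {kv_cache_gib(2.0):6.1f} GiB")
print(f"FP8  : {kv_cache_gib(1.0):6.1f} GiB")
print(f"NVFP4: {kv_cache_gib(0.5):6.1f} GiB")   # before per-block scale-factor overhead

Halving the bytes per element halves the footprint, which is what lets the same HBM budget hold roughly twice the context.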

The current implementation of NVFP4 KV cache requires that values be dequantized from NVFP4 to FP8 before the attention and context matrix math. The new token’s key and value vectors are quantized to NVFP4 before being appended to the KV cache (Figure 3).

Figure 3. KV cache-driven attention flow showing where quantization and dequantization occur during inference.
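
The sketch below marks where those conversions sit in a single decode step under the flow in Figure 3. The quantize/dequantize helpers are placeholders of our own; real deployments use fused attention kernels and hardware NVFP4/FP8 types that NumPy cannot represent. The point is only the ordering: quantize on append, dequantize before the attention and context matrix math.

import numpy as np

# Placeholder quantizers: these stubs only mark WHERE the conversions happen.
def quantize_to_nvfp4(x):
    return x   # stand-in for high precision -> NVFP4 packing

def dequantize_to_fp8(x):
    return x   # stand-in for NVFP4 -> FP8 expansion

def attention_step_nvfp4_kv(q_t, k_t, v_t, cached_k, cached_v):
    # 1. Quantize the new token's key/value to NVFP4 and append them to the cache
    cached_k = np.vstack([cached_k, quantize_to_nvfp4(k_t)])
    cached_v = np.vstack([cached_v, quantize_to_nvfp4(v_t)])
    # 2. Dequantize cached K/V to FP8 before the attention and context matrix math
    k = dequantize_to_fp8(cached_k)
    v = dequantize_to_fp8(cached_v)
    scores = q_t @ k.T / np.sqrt(q_t.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v, cached_k, cached_v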

The quantize API from Model Optimizer can be used to perform post-training quantization (PTQ) or quantization-aware training (QAT). To enable NVFP4 KV cache during PTQ or QAT, the same quantize API is used; only the quantization configuration needs to change.

The code snippet below prepares the model for quantization with an NVFP4 KV cache on top of FP8 weights and activations. To also get the benefit of 4-bit math, the model weights can be compressed to NVFP4 by changing quant_cfg to mtq.NVFP4_DEFAULT_CFG.

import modelopt.torch.quantization as mtq

# Configure FP8 quantization for weights/activations and NVFP4 for the KV cache
quant_cfg = mtq.FP8_DEFAULT_CFG
quant_cfg["quant_cfg"].update(mtq.NVFP4_KV_CFG["quant_cfg"])

# Define forward loop for calibration with a calibration dataset
def forward_loop(model):
    for data in calib_set:
        model(data)


# Quantize the model
model = mtq.quantize(model, quant_cfg, forward_loop)

# Model is ready for post-training quantization (PTQ) deployment

# (Optional) Quantization-aware training (QAT)
# Train the quantized model further to improve accuracy
# Adjust training parameters, e.g., lr, schedule, epochs
# HuggingFace and Megatron models supported
train(model, train_loader, optimizer, scheduler, ...)

How KV cache impacts performance

As mentioned above, the KV cache eliminates redundant recomputation for previously processed tokens at the cost of memory. Compressing the KV cache to NVFP4 reduces this cost by 50% and doubles the context budget over the current standard FP8 KV cache, allowing models to hold twice the context for inference. This benefits use cases that leverage textbook-scale sources and deep reasoning, which can otherwise quickly exhaust KV cache memory budgets.

Higher hit rates save prefill compute

During prefill, latency is heavily impacted by how much of the incoming request’s context is already resident in the KV cache. NVFP4 improves this by delivering higher effective cache-hit rates than FP8, because the 4-bit footprint allows roughly 2x more context to stay on-device. This reduces evictions and preserves larger spans of previously processed tokens. When the model can retrieve these KV entries directly instead of recomputing them, prefill experiences fewer stalls and higher sustained ingestion throughput, resulting in up to 3x lower time-to-first-token (TTFT) latency.
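
As a rough intuition for why hit rate moves TTFT, consider a toy first-order model in which prefill work scales with the fraction of the request’s context that is not already resident in the cache. The hit rates below are made-up numbers for illustration, not measurements.

# Toy first-order model (an assumption, not a measured relationship): prefill work,
# and hence TTFT, scales with the fraction of context that must be recomputed.
def relative_prefill_work(hit_rate: float) -> float:
    return 1.0 - hit_rate

fp8_hit   = 0.60   # hypothetical hit rate under a fixed HBM budget
nvfp4_hit = 0.80   # hypothetical hit rate with ~2x the context resident on-device

speedup = relative_prefill_work(fp8_hit) / relative_prefill_work(nvfp4_hit)
print(f"Toy TTFT improvement: {speedup:.1f}x")   # 2.0x under these made-up hit rates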

Figure 4. NVFP4 KV cache delivers up to 3x lower latency and a 20% higher cache-hit rate compared to FP8 KV cache, showing significant performance benefits as cache memory per GPU increases. Evaluation performed using Qwen3-Coder-480B-A35B.

As the KV cache grows, it captures more K/V tensors and naturally drives higher hit rates. This leads to a plateau effect where the latency and hit-rate delta between NVFP4 and FP8 narrows (Figure 4 above), an effect that is highly model- and context-length-dependent. But an ever-inflating, unoptimized KV cache consumes an increasing share of the HBM budget. NVFP4 restores efficiency by making KV caching dramatically more HBM-efficient, freeing budget for model weights and enabling stronger stacking benefits with other co-designed components across the stack: NVLink, kernel optimizations, and Wide Expert Parallelism.

How NVFP4 KV cache impacts accuracy

We observe an accuracy loss of less than 1%, compared to BF16 and FP8 baselines, on modern LLM benchmarks such as LiveCodeBench, MMLU-PRO, MBPP, and Ruler 64K. Specifically, near parity on LiveCodeBench shows that the quantization preserves precise multi-step code generation, where small numerical errors can easily turn into syntax, compilation, or logic failures.

Likewise, maintaining performance on Ruler 64K demonstrates robustness for long-context reasoning over 64K-token sequences, a setting where quantization noise typically accumulates. Together, these results indicate that the proposed format delivers efficiency gains without sacrificing end-to-end capability on difficult code and long-context workloads.

Bar chart: “Benchmarking Performance of Different KV Cache Precisions – Qwen3‑480B‑A35B.” The x‑axis lists four benchmarks (LiveCodeBench, MMLU‑PRO, MBPP, Ruler 64K), each with bars for FP16, FP8, and NVFP4 KV cache formats; the y‑axis shows benchmark accuracy from 50% to 100%. LiveCodeBench shows all three around 58%. On MMLU‑PRO, FP16 is about 78.2%, FP8 about 78.1%, and NVFP4 about 77.4%. On MBPP, FP16 is about 80.8%, FP8 about 79.7%, and NVFP4 about 79.9%. On Ruler 64K, FP16 is about 95.6%, FP8 about 95.5%, and NVFP4 about 94.6%, highlighting that reduced‑precision KV caches maintain accuracy very close to full FP16.
Figure 5. Benchmark comparison of FP16, FP8, and NVFP4 KV cache precisions on Qwen3‑480B‑A35B, showing FP8 and NVFP4 closely match FP16 accuracy across coding, knowledge, and long‑context tasks.

Another critical insight is how NVFP4 compares to MXFP4 for KV cache quantization. Figure 6 shows the impact on MMLU accuracy across BF16, FP8, NVFP4, and MXFP4. For the model tested, Llama 3.3 70B, we observe 5% higher accuracy when the KV cache is in NVFP4 versus MXFP4. These benefits come from NVFP4’s more granular block scaling and higher-precision E4M3 FP8 scaling factors, which together allow for lower quantization error during the dequantization step.
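
The effect of block size and scale-factor precision can be illustrated with a toy NumPy simulation of FP4 (E2M1) fake quantization. This is not the hardware quantizer: it ignores the FP8 E4M3 encoding of NVFP4’s block scales and its second-level per-tensor scale, and it only contrasts a finer-grained, freely chosen scale (NVFP4-like) with a coarser power-of-two scale (MXFP4-like) on synthetic data.

import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_fp4(x, block_size, pow2_scales):
    """Simulate FP4 quantize/dequantize with per-block scales (toy model)."""
    out = np.empty_like(x)
    for i in range(0, len(x), block_size):
        blk = x[i:i + block_size]
        amax = np.abs(blk).max() + 1e-12
        scale = amax / FP4_GRID[-1]                 # map block max to FP4 max (6.0)
        if pow2_scales:                             # MXFP4-style: power-of-two scale only
            scale = 2.0 ** np.ceil(np.log2(scale))
        # (NVFP4 additionally stores this scale in FP8 E4M3; omitted in this toy model)
        mags = np.abs(blk) / scale
        snapped = FP4_GRID[np.argmin(np.abs(mags[:, None] - FP4_GRID[None, :]), axis=1)]
        out[i:i + block_size] = np.sign(blk) * snapped * scale
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(8192).astype(np.float32)

for name, bs, p2 in [("NVFP4-like (block 16, fine-grained scale)", 16, False),
                     ("MXFP4-like (block 32, power-of-two scale)", 32, True)]:
    err = np.abs(fake_quant_fp4(x, bs, p2) - x).mean()
    print(f"{name}: mean abs error = {err:.4f}")

On Gaussian-distributed data, the finer blocks and non-power-of-two scales yield a lower mean reconstruction error, matching the intuition behind the accuracy gap in Figure 6.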

Bar chart: “Comparing Lower Precision Format KV Cache Accuracy,” with the y‑axis showing MMLU accuracy from 75% to 83% and the x‑axis showing KV cache format. FP8 reaches about 82.5% accuracy, NVFP4 about 81.9%, and MXFP4 about 77.8%; a bracket highlights that FP8 and NVFP4 provide roughly 5% better accuracy than MXFP4.
Figure 6. Comparison of FP8, NVFP4, and MXFP4 KV cache formats showing FP8 and NVFP4 delivering significantly higher MMLU accuracy than MXFP4.

Looking forward

NVFP4 KV cache is another practical step in the broader software–hardware co‑design of the NVIDIA inference stack. As the ecosystem around it matures, it can be combined with KV‑aware routing and offload in NVIDIA Dynamo and stacked with large‑scale expert parallelism in NVIDIA TensorRT‑LLM’s Wide‑EP to improve utilization across large MoE deployments.

On the hardware side, tighter KV cache optimization can better exploit the NVL72 scale‑up domain and NVLink fabric for multi‑agent inference and long‑context deep‑reasoning workloads. Together, these pieces make it more feasible to serve larger experts, longer sequences, and higher concurrency without giving up accuracy.

To begin applying these techniques, we recommend leveraging the Model Optimizer code samples and notebooks as a base recipe for custom quantization workflows. 

Kai Xu, Shengliang Xu, Tian Zheng, and Asma Kuriparambil Thekkumpate contributed to the engineering efforts described on this blog.


