Quantization is one of the strongest levers for large-scale inference. By reducing the precision of weights, activations, and the KV cache, we can cut memory footprint and compute cost, directly improving throughput, latency, and achievable context length.
This blog introduces NVFP4 KV cache quantization, a new KV cache format that enables significant performance gains on NVIDIA Blackwell GPUs. NVFP4 cuts KV cache memory footprint by up to 50% and can effectively double context budgets, unlocking larger batch sizes, longer sequences, and higher cache-hit rates. These gains come with less than 1% accuracy loss across code-generation, knowledge, and long-context benchmarks.
In the sections that follow, we explore how this optimization delivers tangible gains for inference workloads and strengthens the stacking effects of the NVIDIA extreme co-design stack.
What’s KV cache?
Large language models (LLMs) rely on an autoregressive process, generating tokens one at a time based on all previous tokens. This lets the model attend to the sequence's full context, which is at the heart of why LLMs perform so well on natural language modeling tasks. The same behavior, however, creates significant compute inefficiency if the model recalculates each preceding token's attention projections, known as the key and value tensors, every time a new token is generated.
Figure 1 below provides a simplified representation of the attention computations with and without KV cache. Because causal masking prevents previous tokens from attending to future tokens, the key and value vectors for all past tokens (including the original input sequence) never change. As a result, recomputing them and redoing the associated matrix-multiply-add (MMA) operations for every new token is redundant and wastes computation.


The KV cache was introduced to alleviate the compute bottleneck created by regenerating key and value vectors for every previously seen token. In exchange for memory footprint and bandwidth, those K/V tensors are stored once and then fetched directly during attention, rather than recomputed. In practice, the cache sits behind a fixed-size memory pool, as shown in Figure 2 below.


When that pool fills, the KV cache manager evicts portions of older context. If a future request references an evicted span, the system takes a cache miss and is forced to recompute the missing K/V tensors. The net effect is that the actual performance gain hinges on the cache-hit rate: high hit rates preserve the intended compute savings, while lower hit rates push the model back toward the very recomputation path the KV cache was meant to eliminate.
During inference, this cache is populated and used across two distinct phases. In the prefill phase, the model ingests the entire input sequence, running large, highly parallel MMA operations to compute attention, and stores the resulting key and value vectors for all input tokens in the KV cache. The model then enters the decode phase, where it generates new tokens one at a time. Each step requires a full forward pass, but the attention blocks now fetch the key and value vectors for all previous tokens from the KV cache, compute only the current token's key and value vectors, and append them back into the cache so they can be reused at the next decoding step.
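To make the two phases concrete, here is a minimal single-head sketch in PyTorch (toy dimensions, no batching, multi-head logic, or masking) of how prefill fills the cache once and each decode step appends to it instead of recomputing past K/V:

import torch

d_model = 64
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (1, d_model) hidden state of the newly generated token."""
    q = x_new @ W_q                               # query for the new token only
    k_new, v_new = x_new @ W_k, x_new @ W_v
    # Append the new token's K/V; past entries are fetched, never recomputed
    k_cache = torch.cat([k_cache, k_new], dim=0)
    v_cache = torch.cat([v_cache, v_new], dim=0)
    attn = torch.softmax(q @ k_cache.T / d_model**0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache

# Prefill: compute K/V for the whole prompt once and store them in the cache
prompt = torch.randn(16, d_model)                 # 16 input tokens
k_cache, v_cache = prompt @ W_k, prompt @ W_v

# Decode: each new token reuses the cached K/V of all previous tokens
out, k_cache, v_cache = decode_step(torch.randn(1, d_model), k_cache, v_cache)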
Optimizing KV cache with NVFP4
One of the latest opportunities to optimize KV cache performance is through NVFP4 and the NVIDIA TensorRT Model Optimizer. This new feature allows the KV cache to be quantized from its native 16-bit precision down to 4 bits.
KV cache quantization is not entirely new; FP8 KV caches are already widely used in production. However, the increasing size of models and the scale of inference deployments mean that the KV cache can still become a significant bottleneck during prefill and decode. Quantizing the KV cache relieves pressure on multiple components of the inference pipeline, impacting compute, memory capacity, and memory bandwidth:
- Memory capacity: NVFP4 KV cache reduces the memory footprint of the KV cache by about 50% compared to FP8 KV cache, enabling larger context lengths, batch sizes, and user concurrency (see the sizing sketch after this list).
- Memory bandwidth: The decode phase involves many reads and writes of the KV cache and puts significant pressure on memory bandwidth; a smaller KV cache consumes less of it.
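As a rough illustration of the memory-capacity point, the back-of-the-envelope sketch below sizes the KV cache per request. The model configuration (80 layers, 8 KV heads with GQA, head dimension 128) and the one-FP8-scale-per-16-element-block overhead are assumptions for illustration, not measurements:

# Approximate KV cache size per token: K and V for every layer and KV head
layers, kv_heads, head_dim = 80, 8, 128                    # hypothetical 70B-class config
elems_per_token = 2 * layers * kv_heads * head_dim

bytes_fp8_per_token = elems_per_token * 1.0                # FP8: 1 byte per element
bytes_nvfp4_per_token = elems_per_token * (0.5 + 1 / 16)   # 4-bit value + shared block scale

context = 128 * 1024                                       # one 128K-token request
print(f"FP8   KV cache: {bytes_fp8_per_token * context / 2**30:.1f} GiB")
print(f"NVFP4 KV cache: {bytes_nvfp4_per_token * context / 2**30:.1f} GiB")
# Roughly half the footprint, so the same HBM pool holds about twice the context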
The current implementation of NVFP4 KV cache requires that values be dequantized from NVFP4 to FP8 before the attention and context matrix math. The new token's key and value vectors are quantized to NVFP4 before being appended to the KV cache (Figure 3).
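The sketch below mimics that flow with an illustrative block-scaled 4-bit codec: 16-element blocks, one shared scale each. The helper names and the use of plain floats in place of the real FP8/NVFP4 kernels are stand-ins for illustration only:

import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def nvfp4_quantize(x, block=16):
    """Per-16-element block scaling; returns quantized codes and per-block scales."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True) / 6.0 + 1e-12
    idx = (x.abs() / scale).unsqueeze(-1).sub(FP4_GRID).abs().argmin(dim=-1)
    return FP4_GRID[idx] * x.sign(), scale

def nvfp4_dequantize(codes, scale):
    return (codes * scale).reshape(-1)

k_new = torch.randn(128)                     # new token's key vector (toy size)
codes, scale = nvfp4_quantize(k_new)         # quantize before appending to the cache
k_restored = nvfp4_dequantize(codes, scale)  # dequantize before the attention MMAs
print("max abs round-trip error:", (k_new - k_restored).abs().max().item())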


The quantize API from Model Optimizer can be used to perform post-training quantization (PTQ) or quantization-aware training (QAT). Enabling NVFP4 KV cache during PTQ or QAT uses the same quantize API and only requires changing the quantization configuration.
The code snippet below prepares the model for NVFP4 KV cache quantization on top of FP8 weights and activations. To also get the benefit of 4-bit math, the model weights can be compressed to NVFP4 by changing quant_cfg to mtq.NVFP4_DEFAULT_CFG.
import modelopt.torch.quantization as mtq

# Configure FP8 quantization for weights/activations and NVFP4 for the KV cache
quant_cfg = mtq.FP8_DEFAULT_CFG
quant_cfg["quant_cfg"].update(mtq.NVFP4_KV_CFG["quant_cfg"])

# Define the forward loop used for calibration
def forward_loop(model):
    for data in calib_set:
        model(data)

# Quantize the model
model = mtq.quantize(model, quant_cfg, forward_loop)
# Model is now ready for post-training quantization (PTQ) deployment

# (Optional) Quantization-aware training (QAT):
# train the quantized model further to improve accuracy
# Adjust training parameters, e.g., lr, schedule, epochs
# Hugging Face and Megatron models are supported
train(model, train_loader, optimizer, scheduler, ...)
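As noted above, the same flow can also compress the weights and activations to NVFP4 by swapping the base configuration; a minimal variant (reusing the same forward_loop) would look like this:

# NVFP4 weights/activations plus NVFP4 KV cache (4-bit math end to end)
quant_cfg = mtq.NVFP4_DEFAULT_CFG
quant_cfg["quant_cfg"].update(mtq.NVFP4_KV_CFG["quant_cfg"])
model = mtq.quantize(model, quant_cfg, forward_loop)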
How KV cache impacts performance
As mentioned above, the KV cache eliminates redundant recomputation for previously processed tokens at the cost of memory. Compressing the KV cache to NVFP4 reduces this cost by about 50% and doubles the context budget over the current standard FP8 KV cache, allowing models to hold twice the context for inference. This benefits use cases that draw on textbook-scale sources and deep reasoning, which would otherwise quickly exhaust KV cache memory budgets.
Higher hit rates save prefill compute
During prefill, latency is heavily impacted by how much of the incoming request's context is already resident in the KV cache. NVFP4 improves this by delivering higher effective cache-hit rates than FP8, because the 4-bit footprint allows roughly 2x more context to remain on-device. This reduces evictions and preserves larger spans of previously processed tokens. When the model can retrieve these KV entries directly instead of recomputing them, prefill sees fewer stalls and higher sustained ingestion throughput, yielding up to 3x faster time to first token (TTFT).
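A toy cache-pool simulation helps build intuition for why the extra on-device context matters. This is not the TensorRT-LLM block manager, just a simple LRU pool with assumed capacities and span sizes, showing how doubling the pool can turn constant thrashing into mostly hits:

from collections import OrderedDict

class ToyKVPool:
    def __init__(self, capacity_tokens):
        self.capacity = capacity_tokens
        self.blocks = OrderedDict()          # span_id -> cached token count
        self.hits = self.misses = 0

    def lookup(self, span_id, n_tokens):
        if span_id in self.blocks:           # cache hit: K/V fetched, no recompute
            self.blocks.move_to_end(span_id)
            self.hits += 1
            return
        self.misses += 1                     # cache miss: K/V must be recomputed
        while self.blocks and sum(self.blocks.values()) + n_tokens > self.capacity:
            self.blocks.popitem(last=False)  # evict the oldest span
        self.blocks[span_id] = n_tokens

for cap in (4096, 8192):                     # e.g., FP8-sized vs NVFP4-sized pool
    pool = ToyKVPool(capacity_tokens=cap)
    for turn in range(9):                    # three conversations taking turns
        pool.lookup(f"conversation-{turn % 3}", n_tokens=2048)
    print(f"capacity {cap}: hit rate {pool.hits / (pool.hits + pool.misses):.0%}")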


As the KV cache grows, it captures more K/V tensors and naturally drives higher hit rates. This results in a plateau effect where the latency and hit-rate delta between NVFP4 and FP8 narrows (Figure 4 above), though how quickly is highly model- and context-length-dependent. But an ever-growing, unoptimized KV cache consumes an increasing share of the HBM budget. NVFP4 restores efficiency by making KV caching dramatically more HBM-efficient, freeing budget for model weights and enabling stronger stacking benefits with other co-designed components across the stack, such as NVLink, kernel optimizations, and Wide Expert Parallelism.
How NVFP4 KV cache impacts accuracy
We observe an accuracy loss of less than 1%, compared to BF16 and FP8 baselines, on modern LLM benchmarks such as LiveCodeBench, MMLU-PRO, MBPP, and Ruler 64K. Specifically, near parity on LiveCodeBench shows that the quantization preserves precise multi-step code generation, where small numerical errors can easily turn into syntax, compilation, or logic failures.
Likewise, maintaining performance on Ruler 64K demonstrates robustness for long-context reasoning over 64K-token sequences, a setting where quantization noise typically accumulates. Together, these results indicate that the proposed format delivers efficiency gains without sacrificing end-to-end capability on difficult code and long-context workloads.


Another critical insight is how NVFP4 compares to MXFP4 for KV cache quantization. Figure 6 shows the impact on MMLU accuracy scores across BF16, FP8, NVFP4, and MXFP4. For the model tested, Llama 3.3 70B, we observe 5% higher accuracy when the KV cache is in NVFP4 versus MXFP4. These advantages come from NVFP4's more granular block scaling and higher-precision E4M3 FP8 scaling factors, which together allow for lower quantization error during dequantization.
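To see why the scaling scheme matters, the toy comparison below quantizes random data with the two block-scaling styles: 16-element blocks with an E4M3 scale for NVFP4 and 32-element blocks with a power-of-two (E8M0) scale for MXFP4. The E4M3 rounding and the power-of-two scale selection are simplified approximations, not the production kernels:

import math, torch

FP4 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])   # E2M1 magnitudes

def to_e4m3(x):   # crude E4M3 rounding: 3 mantissa bits, ignoring saturation/subnormals
    e = math.floor(math.log2(x))
    return round(x / 2**e * 8) / 8 * 2**e

def to_e8m0(x):   # E8M0: nearest power of two that does not clip the block maximum
    return 2.0 ** math.ceil(math.log2(x))

def roundtrip_error(x, block, scale_fn):
    x = x.reshape(-1, block)
    out = torch.empty_like(x)
    for i, row in enumerate(x):
        s = scale_fn(row.abs().max().item() / 6.0 + 1e-12)
        q = FP4[(row.abs() / s).unsqueeze(-1).sub(FP4).abs().argmin(-1)] * row.sign()
        out[i] = q * s                        # dequantized values
    return (out - x).abs().mean().item()

kv = torch.randn(8192)                        # stand-in for KV cache values
print("NVFP4-style error:", roundtrip_error(kv, 16, to_e4m3))
print("MXFP4-style error:", roundtrip_error(kv, 32, to_e8m0))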


Looking forward
NVFP4 KV cache is another practical step in the broader software-hardware co-design of the NVIDIA inference stack. As the ecosystem around it matures, it can be combined with KV-aware routing and offload in NVIDIA Dynamo and stacked with large-scale expert parallelism in NVIDIA TensorRT-LLM's Wide-EP to improve utilization across large MoE deployments.
On the hardware side, tighter KV cache optimization can better exploit the NVL72 scale-up domain and NVLink fabric for multi-agent inference and long-context deep-reasoning workloads. Together, these pieces make it more feasible to serve larger experts, longer sequences, and higher concurrency without giving up accuracy.
To start applying these techniques, we recommend using the Model Optimizer code samples and notebooks as a base recipe for custom quantization workflows.
Kai Xu, Shengliang Xu, Tian Zheng, and Asma Kuriparambil Thekkumpate contributed to the engineering efforts described in this blog.
