As AI models get larger and architectures more complex, researchers and engineers are continually finding new techniques to optimize the performance and overall cost of bringing AI systems to production.
Model optimization is a category of techniques focused on improving inference serving efficiency. These techniques represent some of the best “bang for buck” opportunities to optimize cost, improve user experience, and scale. They range from fast and effective approaches like model quantization to powerful multistep workflows like pruning and distillation.
This post covers the top five model optimization techniques enabled through NVIDIA Model Optimizer and how each contributes to improving the performance, total cost of ownership (TCO), and scalability of deployments on NVIDIA GPUs.
These techniques are among the most powerful and scalable levers currently available in Model Optimizer, and teams can apply them immediately to reduce cost per token, improve throughput, and accelerate inference at scale.


1. Post-training quantization
Post-training quantization (PTQ) is the fastest path to model optimization. You can take an existing model (FP16/BF16/FP8) and compress it to a lower-precision format (FP8, NVFP4, INT8, INT4) using a calibration dataset, without touching the original training loop. This is where most teams should begin: PTQ is straightforward to apply with Model Optimizer and delivers immediate latency and throughput wins, even on massive foundation models.
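The snippet below is a minimal sketch of the PTQ flow with Model Optimizer's PyTorch quantization API: run a small calibration loop through the model, then quantize it to the chosen low-precision recipe. The model loading and `calib_dataloader` are assumptions for illustration; check the Model Optimizer documentation for the configs available in your version.

```python
# Minimal PTQ sketch with NVIDIA TensorRT Model Optimizer (modelopt).
# Assumes `model` is an already-loaded PyTorch LLM (for example, a Hugging Face
# causal LM in BF16) and `calib_dataloader` yields tokenized input batches.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Feed a small calibration set through the model so per-layer
    # activation ranges can be collected for the target precision.
    for batch in calib_dataloader:
        model(batch["input_ids"])

# Pick a quantization recipe; FP8 shown here, with NVFP4/INT8/INT4 configs
# also available depending on the Model Optimizer version.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

The quantized model can then be exported for deployment with your serving stack; the original training loop is never touched.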


| Pros | Cons |
| --- | --- |
| – Fastest time to value<br>– Achievable with a small calibration dataset<br>– Memory, latency, and throughput gains stack with other optimizations<br>– Highly customizable quantization recipes (NVFP4 KV cache, for example) | – May require a different technique (QAT/QAD) if the quality floor drops below SLA |
To learn more, see Optimizing LLMs for Performance and Accuracy with Post-Training Quantization.
2. Quantization-aware training
Quantization-aware training (QAT) injects a brief, targeted fine-tuning phase in which the model learns to compensate for low-precision error. It simulates quantization noise in the forward pass while computing gradients in higher precision. QAT is the recommended next step when you need more accuracy than PTQ alone delivers.
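As a rough sketch of what that fine-tuning phase looks like, the example below quantizes the model first (as in the PTQ example) and then runs a short training loop so the weights adapt to the simulated quantization noise. The training dataloader, learning rate, step budget, and the exact NVFP4 config name are illustrative assumptions.

```python
# QAT sketch: quantize first, then briefly fine-tune with quantization
# simulated in the forward pass and gradients kept in higher precision.
import torch
import modelopt.torch.quantization as mtq

# Inserts (fake-)quantizer modules; the config name may differ across versions.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for step, batch in enumerate(train_dataloader):
    # The forward pass sees quantization noise; the backward pass updates
    # the underlying high-precision weights to compensate for it.
    loss = model(batch["input_ids"], labels=batch["labels"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 1_000:  # QAT typically needs only a brief, targeted phase
        break
```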


| Pros | Cons |
| --- | --- |
| – Recovers all or most of the accuracy lost at low precision<br>– Fully compatible with NVFP4, especially for FP4 stability | – Requires training budget and data<br>– Takes longer to implement than PTQ alone |
To learn more, see How Quantization-Aware Training Enables Low-Precision Accuracy Recovery.
3. Quantization-aware distillation
Quantization-aware distillation (QAD) goes one level beyond QAT. With this technique, the student model learns to account for quantization errors while simultaneously aligning to the full-precision teacher through a distillation loss. QAD builds on QAT by adding a teaching signal drawn from the principles of distillation, enabling you to extract the maximum quality possible while running at ultra-low precision at inference time. QAD is a good option for downstream tasks that notoriously suffer significant performance degradation after quantization.
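Conceptually, the training step looks like the sketch below: the quantized student is optimized against both the task labels and the full-precision teacher's output distribution. This is an illustration of the idea rather than the Model Optimizer distillation API; the models, dataloader, temperature, and loss weighting are all assumptions.

```python
# Conceptual QAD training step: quantized student + full-precision teacher.
# Assumes Hugging Face-style causal LMs; `student` has already been quantized
# as in the PTQ/QAT examples, and `teacher` is the original full-precision model.
import torch
import torch.nn.functional as F

teacher.eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
T, alpha = 2.0, 0.5  # distillation temperature and task/KD loss weighting

for batch in train_dataloader:
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    out = student(batch["input_ids"], labels=batch["labels"])
    # KL divergence between softened student and teacher distributions.
    kd_loss = F.kl_div(
        F.log_softmax(out.logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    loss = alpha * out.loss + (1 - alpha) * kd_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```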


| Pros | Cons |
| --- | --- |
| – Highest accuracy recovery<br>– Ideal for multistage post-training pipelines, with simple setup and robust convergence | – Additional training cycles after pretraining<br>– Larger memory footprint<br>– Slightly more complex pipeline to implement today |
To learn more, see How Quantization-Aware Training Enables Low-Precision Accuracy Recovery.
4. Speculative decoding
The decode step in inference is well known for suffering from sequential-processing bottlenecks. Speculative decoding tackles this directly by using a smaller or faster draft model (such as EAGLE-3) to propose multiple tokens ahead, then verifying them in parallel with the target model. This collapses sequential latency into single steps and dramatically reduces the number of forward passes required at long sequence lengths, without touching the model weights.
Speculative decoding is recommended when you want immediate generation speedups without retraining or quantization, and it stacks cleanly with the other optimizations in this list to compound throughput and latency gains.
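The toy loop below illustrates the draft-then-verify idea with greedy decoding. It is not the EAGLE-3 or Model Optimizer implementation, and it omits the KV caching and batching that production systems rely on; `draft_model` and `target_model` are assumed to be causal LMs that share a tokenizer.

```python
# Toy greedy speculative decoding loop (illustration only; batch size 1,
# no KV cache). The draft model proposes k tokens, the target model verifies
# them in a single forward pass, and the longest matching prefix is accepted.
import torch

@torch.no_grad()
def speculative_generate(target_model, draft_model, input_ids, max_new_tokens=64, k=4):
    generated = 0
    while generated < max_new_tokens:
        L = input_ids.shape[1]

        # 1. Draft model proposes k tokens autoregressively (cheap).
        draft_ids = input_ids
        for _ in range(k):
            logits = draft_model(draft_ids).logits[:, -1, :]
            draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)
        proposal = draft_ids[:, L:]  # the k drafted tokens

        # 2. Target model scores all k proposals in one parallel forward pass.
        target_logits = target_model(draft_ids).logits
        verify = target_logits[:, L - 1:-1, :].argmax(-1)  # target's picks at drafted positions

        # 3. Accept the longest prefix where draft and target agree.
        matches = (verify == proposal)[0].long()
        n_accept = int(matches.cumprod(0).sum().item())
        accepted = proposal[:, :n_accept]
        if n_accept == k:
            # All drafts accepted: take the target's next prediction as a bonus token.
            bonus = target_logits[:, -1, :].argmax(-1, keepdim=True)
        else:
            # First mismatch: substitute the target's own prediction.
            bonus = verify[:, n_accept:n_accept + 1]
        input_ids = torch.cat([input_ids, accepted, bonus], dim=-1)
        generated += n_accept + 1
    return input_ids
```

The speedup depends almost entirely on the acceptance rate: the more drafted tokens the target model agrees with, the fewer expensive target forward passes are needed per generated token.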


| Pros | Cons |
| --- | --- |
| – Radically reduces decode latency<br>– Stacks well with PTQ/QAT/QAD and NVFP4 | – Requires tuning (acceptance rate is everything)<br>– Second model or head required, depending on the variant |
To learn more, see An Introduction to Speculative Decoding for Reducing Latency in AI Inference.
5. Pruning plus knowledge distillation
Pruning is a structural optimization path. This technique removes weights, layers, and/or heads to make the model smaller. Distillation then teaches the new, smaller model how to think like the larger teacher. This multistep optimization strategy permanently changes the model's performance profile, because the baseline compute and memory footprint are permanently lowered.
Pruning plus knowledge distillation can be used when the other techniques in this list cannot deliver the memory or compute savings needed to meet application requirements. This approach also suits teams that are open to making more aggressive changes to an existing model to adapt it for specific, specialized downstream use cases.
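The sketch below shows the general shape of such a workflow: structurally prune a copy of the model (naively dropping every other transformer layer here), then distill it against the original full-size model as the teacher. This is a conceptual illustration rather than the Model Optimizer pruning API; real recipes rank layers, heads, and channels by importance before removing them, and the Llama-style `model.model.layers` layout is an assumption.

```python
# Conceptual pruning + knowledge distillation sketch (illustration only).
# Assumes a Llama-style Hugging Face model whose transformer blocks live in
# `model.model.layers`; real recipes score structures by importance first.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = model                  # the full-size model acts as the teacher
student = copy.deepcopy(model)   # the pruned copy becomes the student

# Structural (depth) pruning: keep every other transformer layer.
kept = nn.ModuleList(blk for i, blk in enumerate(student.model.layers) if i % 2 == 0)
student.model.layers = kept
student.config.num_hidden_layers = len(kept)

# Knowledge distillation: the pruned student learns to mimic the teacher's logits.
teacher.eval()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
for batch in train_dataloader:
    with torch.no_grad():
        t_logits = teacher(batch["input_ids"]).logits
    s_out = student(batch["input_ids"], labels=batch["labels"])
    kd = F.kl_div(F.log_softmax(s_out.logits, dim=-1),
                  F.softmax(t_logits, dim=-1), reduction="batchmean")
    loss = 0.5 * s_out.loss + 0.5 * kd
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```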


| Pros | Cons |
| --- | --- |
| – Reduces parameter count → permanent, structural cost savings<br>– Enables smaller models that still behave like large models | – Aggressive pruning without distillation can cause accuracy cliffs<br>– Requires more pipeline work than PTQ alone |
To learn more, see Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer.
Start with AI model optimization
Optimization techniques come in all shapes and sizes. This post highlights the top five model optimization techniques enabled through Model Optimizer.
- PTQ, QAT, QAD, and pruning plus distillation make your model intrinsically cheaper, smaller, and more memory efficient to operate.
- Speculative decoding makes generation intrinsically faster by collapsing sequential latency.
To get started and learn more, explore the deep-dive posts linked for each technique for technical explainers, performance insights, and Jupyter notebook walkthroughs.
