Large language models (LLMs) have set a high bar in natural language processing (NLP) tasks such as coding, reasoning, and math. However, their deployment remains resource-intensive, motivating growing interest in small language models (SLMs) that deliver strong performance at a fraction of the cost.
NVIDIA researchers and engineers have demonstrated an approach that combines structured weight pruning with knowledge distillation, a powerful strategy for compressing large models into smaller, efficient variants without significant loss in quality. For more details, see Compact Language Models via Pruning and Knowledge Distillation.
This post explains what model pruning and knowledge distillation are, how they work, and how you can apply them to your own models to achieve optimal performance using NVIDIA TensorRT Model Optimizer.
What’s model pruning?
Pruning is a model optimization technique that leverages the common over-parameterization of neural networks, which arises from training models with enough capacity to learn complex features and converge smoothly. Pruning systematically identifies and removes unimportant parameters such as weights, neurons, or even entire layers from a trained model.
This process can often eliminate a large fraction of a model's weights with minimal impact on accuracy, directly translating to a more compact model with faster inference and lower computational cost. Just as an arborist trims a tree to improve its health and growth, model pruning makes a model smaller and more efficient.
Depth pruning and width pruning are the two major approaches.
Depth pruning removes entire layers from the neural network, reducing the overall depth and complexity (Figure 1).


Width pruning eliminates internal structures such as individual neurons, attention heads, or embedding channels, slimming down the model’s width (Figure 2).


The core idea is to identify and remove the parts of the LLM that contribute least to its overall performance. Different methods are used to evaluate the importance of individual components, such as:
- Magnitude pruning: Sets weights with small absolute values to zero (see the sketch after this list).
- Activation-based pruning: Uses a calibration dataset to estimate the importance of various parts of the model based on their activations.
- Structural pruning: Removes entire structures, like layers or attention heads.
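To make the first of these concrete, here is a minimal magnitude-pruning sketch in PyTorch. It is illustrative only; the helper name and the 50% sparsity target are assumptions and not part of Model Optimizer. It zeroes out the smallest-magnitude weights of a single linear layer.

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    # Zero out the fraction of weights with the smallest absolute values
    threshold = torch.quantile(weight.abs().flatten(), sparsity)
    return weight * (weight.abs() > threshold)

# Example: prune 50% of a linear layer's weights by magnitude
layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))

In practice, structured approaches like the ones described here remove whole neurons, heads, or layers rather than scattered individual weights, because structured sparsity maps directly to smaller, faster dense computations.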
Research shows that width pruning typically achieves higher accuracy than depth pruning, though depth pruning often reduces inference latency more at the same number of parameters. The choice between depth pruning, width pruning, or a combination of both should depend on the desired balance between accuracy and latency. For more information, see LLM Pruning and Distillation in Practice: The Minitron Approach.
What’s knowledge distillation?
Knowledge distillation is a model compression technique that transfers knowledge from a larger “teacher” model to a smaller, more efficient “student” model (Figure 3). The goal is to create a compact model that retains the high performance of the larger model, making it suitable for deployment at a lower resource cost.


Knowledge distillation trains a compact student model to emulate a larger teacher, not by relying solely on hard labels, but by learning from the teacher’s guidance. This transfers rich, generalizable behavior so the student approaches the teacher’s accuracy while running far more efficiently.
Two common distillation styles, response-based and feature-based, differ in how each passes knowledge from teacher to student.
What’s response-based knowledge distillation?
Response-based knowledge distillation transfers a teacher model’s knowledge to a student by training the student to match the teacher’s soft output probabilities rather than only hard labels. These soft targets convey inter-class similarities, for example that “cat” is closer to “tiger” than to “car,” and the student is optimized to align with them using KL divergence.
The approach is simple to implement, requires no access to the teacher’s internal features, and is highly effective for classification tasks. In practice, it’s common to combine the distillation loss with standard cross-entropy on ground-truth labels and tune the loss weights to balance stability and fidelity, yielding compact models that preserve much of the teacher’s accuracy.
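As an illustration, the following minimal PyTorch sketch combines the two terms described above: a KL-divergence loss on temperature-softened teacher and student logits plus cross-entropy on ground-truth labels. The temperature and alpha values are illustrative assumptions, not settings used by Model Optimizer.

import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy on ground-truth labels
    ce_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination balances fidelity to the teacher against the task loss
    return alpha * kd_loss + (1.0 - alpha) * ce_loss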


What’s feature-based knowledge distillation?
Feature-based knowledge distillation transfers a teacher’s intermediate representations, such as hidden activations or feature maps, to guide a student toward learning similar internal structure, not only similar outputs. During training, selected teacher and student layers are paired and aligned; projection layers are often used when their dimensions differ.
This deeper, layer-level supervision provides richer signals than response-based KD and has proven effective across vision (for example, CNN feature maps) and NLP (for example, Transformer hidden states and attentions). Because it relies on internal activations, this method requires access to the teacher’s intermediate layers, plus careful layer selection and weighting alongside the standard task loss to balance stability and accuracy.
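The minimal sketch below shows the core mechanics under assumed, hypothetical hidden sizes (it is not the Model Optimizer API): a learned projection maps the student’s hidden states to the teacher’s width, and an MSE loss aligns the paired layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignLoss(nn.Module):
    # Aligns a student hidden state with a teacher hidden state of a different width
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projection layer used when student and teacher dimensions differ
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # Hidden states are [batch, seq_len, dim]; the teacher side is detached (frozen)
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())

# Example with hypothetical dimensions: student width 3584, teacher width 4096
align_loss = FeatureAlignLoss(student_dim=3584, teacher_dim=4096)
loss = align_loss(torch.randn(2, 16, 3584), torch.randn(2, 16, 4096))

This layer-level term is typically added to the standard task loss, often together with a response-based distillation term, with tunable weights.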


Pruning and distillation form a powerful pipeline for model compression, enabling the creation of SLMs that are well suited for deployment in production environments and edge applications. TensorRT Model Optimizer streamlines applying these techniques at scale, turning state-of-the-art LLMs into deployable, cost-effective solutions.
How to prune a model using TensorRT Model Optimizer
This section walks you through how to build a pipeline using TensorRT Model Optimizer. It includes dataset preparation, fine-tuning a teacher model on the WikiText dataset, and applying pruning and distillation techniques to produce a 6B-parameter model from Qwen3-8B. For more information, see the Qwen3-8B Pruning and Distillation with NeMo 2.0 Framework notebook.
Prior to pruning and distillation, it’s important to convert Hugging Face models to the NVIDIA NeMo checkpoint format and preprocess the dataset. For detailed instructions, refer to the model conversion and data preparation step.
Here, we demonstrate how to prune using both the depth pruning and width pruning approaches. The scripts provided can be run inside the NVIDIA NeMo framework container nvcr.io/nvidia/nemo:25.09.
Depth prune the model to create a student
The first approach trims the Qwen3 8B model from 36 to 24 layers (about 6B parameters) by automatically selecting the best 24 layers to keep, using a small calibration dataset of 1,024 samples.
The script below shows how to prune using a two-GPU pipeline-parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
    --devices 2 \
    --pp_size 2 \
    --restore_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --save_path Qwen3-8B-nemo-depth-pruned \
    --seq_length 4096 \
    --num_train_samples 1024 \
    --mbs 4 \
    --data_paths wikitext-data/wikitext-train_text_document \
    --target_num_layers 24
Width prune the model to create a student
The second, alternative approach to model size reduction is width pruning. This is achieved by shrinking key architectural dimensions: the MLP intermediate size (ffn_hidden_size) is reduced from 12,288 to 9,216, and the embedding size (hidden_size) from 4,096 to 3,584, also resulting in a 6B model.
Further reductions in the number of attention heads (num_attention_heads) and GQA query groups (num_query_groups) can be made as needed. The layer count (num_layers) can also be adjusted to achieve the desired model size.
The script below shows how to prune using a two-GPU pipeline-parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
    --devices 2 \
    --pp_size 2 \
    --restore_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --save_path Qwen3-8B-nemo-width-pruned \
    --seq_length 4096 \
    --num_train_samples 1024 \
    --mbs 4 \
    --data_paths wikitext-data/wikitext-train_text_document \
    --target_ffn_hidden_size 9216 \
    --target_hidden_size 3584
By trimming redundant or low-importance weights, pruning not only shrinks the model’s memory footprint but can also speed up inference. However, this process is usually followed by fine-tuning or retraining to recover any accuracy lost during the pruning phase and to ensure the pruned model maintains high performance on target tasks. This is where distillation comes in.
How to use TensorRT Model Optimizer for distillation
This example distills the Qwen3 depth- and width-pruned models using knowledge distillation with Model Optimizer and the NeMo 2.0 Framework.
When distilling knowledge from the teacher model into the depth-pruned model, the path of the student model is Qwen3-8B-nemo-depth-pruned. This path corresponds to the output of the depth-pruning step, as detailed in the NeMo distillation notebook.
The script below shows how to distill using a single-node, eight-GPU tensor-parallel setup. In practice, we recommend multinode training for faster results.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
    --name Qwen3-8B-nemo-depth-pruned-distill \
    --devices 8 \
    --num_nodes 1 \
    --tp_size 8 \
    --model_path Qwen3-8B-nemo-depth-pruned \
    --teacher_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --max_steps 40 \
    --warmup_steps 1 \
    --gbs 768 \
    --mbs 8 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --seq_length 4096 \
    --log_dir . \
    --log_interval 5 \
    --val_check_interval 5 \
    --limit_val_batches 2 \
    --data_paths wikitext-data/wikitext-train_text_document
When distilling knowledge from the teacher into the width-pruned model, the student model path (--model_path) is Qwen3-8B-nemo-width-pruned, as produced by the width-pruning step in the NeMo pruning notebook. Further details can be found in the NeMo distillation notebook.
The script below shows how to distill using a single-node, eight-GPU tensor-parallel setup. In practice, we recommend multinode training for faster results.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
    --name Qwen3-8B-nemo-width-pruned-distill \
    --devices 8 \
    --num_nodes 1 \
    --tp_size 8 \
    --model_path Qwen3-8B-nemo-width-pruned \
    --teacher_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --max_steps 40 \
    --warmup_steps 1 \
    --gbs 768 \
    --mbs 8 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --seq_length 4096 \
    --log_dir . \
    --log_interval 5 \
    --val_check_interval 5 \
    --limit_val_batches 2 \
    --data_paths wikitext-data/wikitext-train_text_document
For more comprehensive information, see the NeMo Framework distillation documentation. These resources will help you easily enable and integrate distillation into your workflow.
How do pruning and distillation impact model performance?
Experimental results for pruning and distillation from Qwen3 8B using Model Optimizer show that the Qwen3 depth-pruned 6B model is 30% faster than the Qwen3 4B model, and it also performs better on the MMLU (Massive Multitask Language Understanding) benchmark. Depth pruning was applied to reduce the model from 36 to 24 layers, resulting in a 6B model, using one NVIDIA H100 80 GB HBM3 GPU.
The pruned model is distilled from Qwen3-8B using the OptimalScale/ClimbMix data processed from the nvidia/ClimbMix pretraining dataset. The experiment uses 25% of the data, roughly 90B tokens. Distillation takes 8 hours on 96 nodes, each with eight NVIDIA H100 GPUs (about 6K GPU hours).


The 6B pruned model demonstrates a significant performance advancement compared with its 4B counterpart. Notably, the 6B pruned model achieves a 30% increase in speed, making it considerably more efficient for a range of computational tasks. For the throughput comparison, all models are quantized to FP8 precision using Model Optimizer and run with TensorRT-LLM.
Beyond its speed advantage, the 6B pruned model also exhibits superior accuracy, as evidenced by its higher score on the MMLU benchmark. With a score of 72.5, it surpasses the 4B model’s score of 70.0, indicating stronger understanding and capability across a broad range of language tasks.
This dual improvement in both speed and accuracy positions the 6B pruned model as a more robust and effective solution for applications requiring both rapid processing and high-quality results.
The pruned models were distilled on a pretraining dataset, so the resulting model is a base variant. Because it is a base model, we compared all models only on base-model benchmarks such as MMLU. Using these models for reasoning tasks in practice would also require post-training.
Start with pruning and knowledge distillation
Pruning and knowledge distillation are highly cost-effective methods for progressively shrinking LLMs while matching or exceeding baseline accuracy across domains, and they’re typically more data-efficient than either synthetic-data fine-tuning or full pretraining.
Ready to get started? Check out the Qwen3 8B Pruning and Distillation with NeMo 2.0 Framework notebook. Visit the NVIDIA/TensorRT-Model-Optimizer GitHub repo to learn more about pruning and distillation. For more details about model optimization techniques using TensorRT Model Optimizer, see related posts on post-training quantization, quantization-aware training, and speculative decoding.
