Large language models (LLMs) have set a high bar in natural language processing (NLP) tasks such as coding, reasoning, and math. However, their deployment remains resource-intensive, motivating growing interest in small language models (SLMs) that deliver strong performance at a fraction of the cost.
NVIDIA researchers and engineers have demonstrated an approach that combines structured weight pruning with knowledge distillation, a powerful strategy for compressing large models into smaller, efficient variants without significant loss in quality. For more details, see Compact Language Models via Pruning and Knowledge Distillation.
This post explains what model pruning and knowledge distillation are, how they work, and how you can apply them to your own models to achieve optimal performance using NVIDIA TensorRT Model Optimizer.
What’s model pruning?
Pruning is a model optimization technique that leverages the common over-parameterization of neural networks, which arises from training models with enough capacity to learn complex features and converge smoothly. Pruning systematically identifies and removes unimportant parameters such as weights, neurons, or even entire layers from a trained model.
This process can often eliminate a large fraction of a model's weights with minimal impact on accuracy, directly translating to a more compact model with faster inference and lower computational cost. Just as an arborist trims a tree to improve its health and growth, model pruning makes a model smaller and more efficient.
Depth pruning and width pruning are the two major approaches.
Depth pruning removes entire layers from the neural network, reducing the overall depth and complexity (Figure 1).


Width pruning eliminates internal structures such as individual neurons, attention heads, or embedding channels, slimming down the model’s width (Figure 2).


The core idea is to identify and remove the parts of the LLM that contribute least to its overall performance. Different methods are used to evaluate the importance of individual components, such as:
- Magnitude pruning: Sets weights with small absolute values to zero (see the sketch after this list).
- Activation-based pruning: Uses a calibration dataset to estimate the importance of various parts of the model based on their activations.
- Structural pruning: Removes entire structures, like layers or attention heads.
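To make the first of these concrete, here is a minimal magnitude-pruning sketch in PyTorch. It is illustrative only; the helper name and the 50% sparsity target are assumptions and not part of Model Optimizer. It zeroes out the smallest-magnitude weights of a single linear layer.

import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    # Zero out the fraction of weights with the smallest absolute values
    threshold = torch.quantile(weight.abs().flatten(), sparsity)
    return weight * (weight.abs() > threshold)

# Example: prune 50% of a linear layer's weights by magnitude
layer = torch.nn.Linear(1024, 1024)
with torch.no_grad():
    layer.weight.copy_(magnitude_prune(layer.weight, sparsity=0.5))

In practice, structured approaches like the ones described here remove whole neurons, heads, or layers rather than scattered individual weights, because structured sparsity maps directly to smaller, faster dense computations.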
Research shows that width pruning typically achieves higher accuracy than depth pruning, though depth pruning often reduces inference latency more at the same number of parameters. The choice between depth pruning, width pruning, or a combination of both should depend on the desired balance between accuracy and latency. For more information, see LLM Pruning and Distillation in Practice: The Minitron Approach.
What’s knowledge distillation?
Knowledge distillation is a model compression technique that transfers knowledge from a larger “teacher” model to a smaller, more efficient “student” model (Figure 3). The goal is to create a compact model that retains the high performance of the larger model, making it suitable for deployment at a lower resource cost.


Knowledge distillation trains a compact student model to emulate a larger teacher, not by relying solely on hard labels, but by learning from the teacher’s guidance. This transfers rich, generalizable behavior so the student approaches the teacher’s accuracy while running far more efficiently.
Two common distillation styles, response-based and feature-based, differ in how each passes knowledge from teacher to student.
What’s response-based knowledge distillation?
Response-based knowledge distillation transfers a teacher model’s knowledge to a student by training the student to match the teacher’s soft output probabilities rather than only hard labels. These soft targets convey inter-class similarities, for example that “cat” is closer to “tiger” than to “car,” and the student is optimized to align with them using KL divergence.
The approach is simple to implement, requires no access to the teacher’s internal features, and is highly effective for classification tasks. In practice, it’s common to combine the distillation loss with standard cross-entropy on ground-truth labels and tune the loss weights to balance stability and fidelity, yielding compact models that preserve much of the teacher’s accuracy.
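As an illustration, the following minimal PyTorch sketch combines the two terms described above: a KL-divergence loss on temperature-softened teacher and student logits plus cross-entropy on ground-truth labels. The temperature and alpha values are illustrative assumptions, not settings used by Model Optimizer.

import torch
import torch.nn.functional as F

def response_distillation_loss(student_logits, teacher_logits, labels,
                               temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened distributions
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    # Hard targets: standard cross-entropy on ground-truth labels
    ce_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination balances fidelity to the teacher against the task loss
    return alpha * kd_loss + (1.0 - alpha) * ce_loss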


What’s feature-based knowledge distillation?
Feature-based knowledge distillation transfers a teacher’s intermediate representations, such as hidden activations or feature maps, to guide a student toward learning similar internal structure, not only similar outputs. During training, selected teacher and student layers are paired and aligned; projection layers are often used when their dimensions differ.
This deeper, layer-level supervision provides richer signals than response-based KD and has proven effective across vision (for example, CNN feature maps) and NLP (for example, Transformer hidden states and attentions). Because it relies on internal activations, this method requires access to the teacher’s intermediate layers, plus careful layer selection and weighting alongside the standard task loss to balance stability and accuracy.
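The minimal sketch below shows the core mechanics under assumed, hypothetical hidden sizes (it is not the Model Optimizer API): a learned projection maps the student’s hidden states to the teacher’s width, and an MSE loss aligns the paired layers.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignLoss(nn.Module):
    # Aligns a student hidden state with a teacher hidden state of a different width
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Projection layer used when student and teacher dimensions differ
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden):
        # Hidden states are [batch, seq_len, dim]; the teacher side is detached (frozen)
        return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())

# Example with hypothetical dimensions: student width 3584, teacher width 4096
align_loss = FeatureAlignLoss(student_dim=3584, teacher_dim=4096)
loss = align_loss(torch.randn(2, 16, 3584), torch.randn(2, 16, 4096))

This layer-level term is typically added to the standard task loss, often together with a response-based distillation term, with tunable weights.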


Pruning and distillation form a powerful pipeline for model compression, enabling the creation of SLMs that are well suited for deployment in production environments and edge applications. TensorRT Model Optimizer streamlines applying these techniques at scale, turning state-of-the-art LLMs into deployable, cost-effective solutions.
How to prune a model using TensorRT Model Optimizer
This section walks you through how to build a pipeline using TensorRT Model Optimizer. It includes dataset preparation, fine-tuning a teacher model on the WikiText dataset, and applying pruning and distillation techniques to produce a 6B-parameter model from Qwen3-8B. For more information, see the Qwen3-8B Pruning and Distillation with NeMo 2.0 Framework notebook.
Prior to pruning and distillation, it’s important to convert Hugging Face models to the NVIDIA NeMo checkpoint format and preprocess the dataset. For detailed instructions, refer to the model conversion and data preparation step.
Here, we demonstrate how to prune using both the depth pruning and width pruning approaches. The scripts provided can be run inside the NVIDIA NeMo framework container nvcr.io/nvidia/nemo:25.09.
Depth prune the model to create a student
The first approach trims the Qwen3 8B model from 36 to 24 layers (about 6B parameters) by automatically selecting the best 24 layers to keep, using a small calibration dataset of 1,024 samples.
The script below shows how to prune using a two-GPU pipeline-parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
    --devices 2 \
    --pp_size 2 \
    --restore_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --save_path Qwen3-8B-nemo-depth-pruned \
    --seq_length 4096 \
    --num_train_samples 1024 \
    --mbs 4 \
    --data_paths wikitext-data/wikitext-train_text_document \
    --target_num_layers 24
Width prune the model to create a student
The second, alternative approach to model size reduction is width pruning. This is achieved by shrinking key architectural dimensions: the MLP intermediate size (ffn_hidden_size) is reduced from 12,288 to 9,216, and the embedding size (hidden_size) from 4,096 to 3,584, also resulting in a 6B model.
Further reductions in the number of attention heads (num_attention_heads) and GQA query groups (num_query_groups) can be made as needed. The layer count (num_layers) can also be adjusted to achieve the desired model size.
The script below shows how to prune using a two-GPU pipeline-parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
    --devices 2 \
    --pp_size 2 \
    --restore_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --save_path Qwen3-8B-nemo-width-pruned \
    --seq_length 4096 \
    --num_train_samples 1024 \
    --mbs 4 \
    --data_paths wikitext-data/wikitext-train_text_document \
    --target_ffn_hidden_size 9216 \
    --target_hidden_size 3584
By trimming redundant or low-importance weights, pruning not only shrinks the model’s memory footprint but can also speed up inference. However, this process is usually followed by fine-tuning or retraining to recover any accuracy lost during the pruning phase and to ensure the pruned model maintains high performance on target tasks. This is where distillation comes in.
How to use TensorRT Model Optimizer for distillation
This example distills the Qwen3 depth- and width-pruned models using knowledge distillation with Model Optimizer and the NeMo 2.0 Framework.
When distilling knowledge from the teacher model into the depth-pruned model, the path of the student model is Qwen3-8B-nemo-depth-pruned. This path corresponds to the output of the depth-pruning step, as detailed in the NeMo distillation notebook.
The script below shows how to distill using a single-node, eight-GPU tensor-parallel setup. In practice, we recommend multinode training for faster results.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
    --name Qwen3-8B-nemo-depth-pruned-distill \
    --devices 8 \
    --num_nodes 1 \
    --tp_size 8 \
    --model_path Qwen3-8B-nemo-depth-pruned \
    --teacher_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --max_steps 40 \
    --warmup_steps 1 \
    --gbs 768 \
    --mbs 8 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --seq_length 4096 \
    --log_dir . \
    --log_interval 5 \
    --val_check_interval 5 \
    --limit_val_batches 2 \
    --data_paths wikitext-data/wikitext-train_text_document
When distilling knowledge from the teacher into the width-pruned model, the student model path (--model_path) is Qwen3-8B-nemo-width-pruned, as produced by the width-pruning step in the NeMo pruning notebook. Further details can be found in the NeMo distillation notebook.
The script below shows how to distill using a single-node, eight-GPU tensor-parallel setup. In practice, we recommend multinode training for faster results.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
    --name Qwen3-8B-nemo-width-pruned-distill \
    --devices 8 \
    --num_nodes 1 \
    --tp_size 8 \
    --model_path Qwen3-8B-nemo-width-pruned \
    --teacher_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --max_steps 40 \
    --warmup_steps 1 \
    --gbs 768 \
    --mbs 8 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --seq_length 4096 \
    --log_dir . \
    --log_interval 5 \
    --val_check_interval 5 \
    --limit_val_batches 2 \
    --data_paths wikitext-data/wikitext-train_text_document
For more comprehensive information, see the NeMo Framework distillation documentation. These resources will help you easily enable and integrate distillation into your workflow.
How do pruning and distillation impact model performance?
Experimental results for pruning and distillation from Qwen3 8B using Model Optimizer show that the Qwen3 depth-pruned 6B model is 30% faster than the Qwen3 4B model, and it also performs better on the MMLU (Massive Multitask Language Understanding) benchmark. Depth pruning was applied to reduce the model from 36 to 24 layers, resulting in a 6B model, using one NVIDIA H100 80 GB HBM3 GPU.
The pruned model is distilled from Qwen3-8B using the OptimalScale/ClimbMix data processed from the nvidia/ClimbMix pretraining dataset. The experiment uses 25% of the data, roughly 90B tokens. Distillation takes 8 hours on 96 nodes, each with eight NVIDIA H100 GPUs (about 6K GPU hours).


The 6B pruned model demonstrates a significant performance advancement compared with its 4B counterpart. Notably, the 6B pruned model achieves a 30% increase in speed, making it considerably more efficient for a range of computational tasks. For the throughput comparison, all models are quantized to FP8 precision using Model Optimizer and run with TensorRT-LLM.
Beyond its speed advantage, the 6B pruned model also exhibits superior accuracy, as evidenced by its higher score on the MMLU benchmark. With a score of 72.5, it surpasses the 4B model’s score of 70.0, indicating stronger understanding and capability across a broad range of language tasks.
This dual improvement in both speed and accuracy positions the 6B pruned model as a more robust and effective solution for applications requiring both rapid processing and high-quality results.
The pruned models were distilled on a pretraining dataset, so the resulting model is a base variant. Because it is a base model, we compared all models only on base-model benchmarks such as MMLU. Using these models for reasoning tasks in practice would also require post-training.
Start with pruning and knowledge distillation
Pruning and knowledge distillation are highly cost-effective methods for progressively shrinking LLMs while matching or exceeding baseline accuracy across domains, and they’re typically more data-efficient than either synthetic-data fine-tuning or full pretraining.
Ready to get started? Check out the Qwen3 8B Pruning and Distillation with NeMo 2.0 Framework notebook. Visit the NVIDIA/TensorRT-Model-Optimizer GitHub repo to learn more about pruning and distillation. For more details about model optimization techniques using TensorRT Model Optimizer, see related posts on post-training quantization, quantization-aware training, and speculative decoding.
