Whether you're preparing for interviews or building Machine Learning systems at your job, model compression has become an essential skill. In the era of LLMs, where models keep getting larger, the challenges around compressing them to make them more efficient, smaller, and usable on lightweight machines have never been more relevant.
In this article, I'll go through four fundamental compression techniques that every ML practitioner should understand and master: pruning, quantization, low-rank factorization, and knowledge distillation, each offering unique benefits. I'll also include some minimal PyTorch code samples for each of these methods.
I hope you enjoy the article!
Model pruning
Pruning is probably the most intuitive compression technique. The idea is very simple: remove some of the weights of the network, either at random or by removing the "less important" ones. Of course, when we talk about "removing" weights in the context of neural networks, it means setting them to zero.
Structured vs unstructured pruning
Let's start with a simple heuristic: removing weights whose magnitude is smaller than a threshold.
$$ w'_{ij} = \begin{cases} w_{ij} & \text{if } |w_{ij}| \ge \theta_0 \\ 0 & \text{if } |w_{ij}| < \theta_0 \end{cases} $$
Of course, this isn't ideal because we would need a way to pick the right threshold for our problem! A more practical approach is to remove a specified proportion of the weights with the smallest magnitudes (norm) within one layer. There are two common ways of implementing pruning in a single layer:
- Structured pruning: remove entire components of the network (e.g. a random row from the weight tensor, or a random channel in a convolutional layer)
- Unstructured pruning: remove individual weights regardless of their position and of the structure of the tensor
We can also use global pruning with either of the two methods above. This removes the chosen proportion of weights across multiple layers, potentially with different removal rates per layer depending on the number of parameters in each layer.
PyTorch makes this pretty straightforward (by the way, you can find all code snippets in my GitHub repo).
import torch.nn.utils.prune as prune
# 1. Random unstructured pruning (20% of weights at random)
prune.random_unstructured(model.layer, name="weight", amount=0.2)
# 2. L1‑norm unstructured pruning (20% of smallest weights)
prune.l1_unstructured(model.layer, name="weight", amount=0.2)
# 3. Global unstructured pruning (40% of all weights by L1 norm across layers)
prune.global_unstructured(
    [(model.layer1, "weight"), (model.layer2, "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.4
)
# 4. Structured pruning (remove 30% of rows with lowest L2 norm)
prune.ln_structured(model.layer, name="weight", amount=0.3, n=2, dim=0)
Why does pruning work? The Lottery Ticket Hypothesis

I would like to conclude this section with a quick mention of the Lottery Ticket Hypothesis, which is both an application of pruning and an interesting explanation of how removing weights can often improve a model. I recommend reading the associated paper ([7]) for more details.
The authors use the following procedure:
- Train the full model to convergence
- Prune the smallest-magnitude weights (say, 10%)
- Reset the remaining weights to their original initialization values
- Retrain this pruned network
- Repeat the process multiple times
After doing this 30 times, you end up with only 0.9³⁰ ≈ 4% of the original parameters. And surprisingly, this network can do as well as the original one.
This suggests that there is significant parameter redundancy. In other words, there exists a sub-network (a "lottery ticket") that does most of the work!
Pruning is one way to unveil this sub-network.
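To make the procedure concrete, below is a minimal sketch of iterative magnitude pruning with weight rewinding on a toy model. The architecture, the number of rounds, and the pruning fraction are arbitrary choices, and train_to_convergence is a placeholder for your own training loop, not something from the paper's codebase.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Toy two-layer network; the experiments in [7] use much larger models
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
layers = [(model[0], "weight"), (model[2], "weight")]

# Save the original initialization of each weight tensor
init_weights = [module.weight.detach().clone() for module, _ in layers]

for round_idx in range(5):  # the paper repeats this many more times
    # 1. Train the (possibly already pruned) model to convergence
    # train_to_convergence(model)  # placeholder for your training loop

    # 2. Globally prune a fraction of the smallest-magnitude weights
    #    (the masks accumulate across rounds)
    prune.global_unstructured(layers, pruning_method=prune.L1Unstructured, amount=0.2)

    # 3. Rewind the surviving weights to their original initialization
    #    (after pruning, the raw values live in `weight_orig`; `weight_mask` keeps the zeros)
    with torch.no_grad():
        for (module, _), w0 in zip(layers, init_weights):
            module.weight_orig.copy_(w0)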
I also recommend this excellent video that covers the topic!
Quantization
While pruning focuses on removing parameters entirely, quantization takes a different approach: reducing the precision of each parameter.
Remember that every number in a computer is stored as a sequence of bits. A float32 value uses 32 bits (see the example picture below), whereas an 8-bit integer (int8) uses just 8 bits.

Most deep learning models are trained using 32-bit floating-point numbers (FP32). Quantization converts these high-precision values to lower-precision formats like 16-bit floating-point (FP16), 8-bit integers (INT8), and even 4-bit representations.
The savings here are obvious: INT8 requires 75% less memory than FP32. But how do we actually perform this conversion without destroying the model's performance?
The math behind quantization
To convert from a floating-point to an integer representation, we need to map the continuous range of values to a discrete set of integers. For INT8 quantization, we're mapping to 256 possible values (from -128 to 127).
Suppose our weights are normalized between -1.0 and 1.0 (common in deep learning):
$$ \text{scale} = \frac{\text{float\_max} - \text{float\_min}}{\text{int8\_max} - \text{int8\_min}} = \frac{1.0 - (-1.0)}{127 - (-128)} = \frac{2.0}{255} $$
Then, the quantized value is given by
$$ \text{quantized\_value} = \text{round}\left(\frac{\text{original\_value}}{\text{scale}} + \text{zero\_point}\right) $$
Here, zero_point = 0 because we want 0.0 to be mapped to 0. Rounding to the nearest integer then gives us values between -128 and 127.
And, you guessed it: to map the integers back to floats, we can use the inverse operation: $$ \text{float\_value} = (\text{integer\_value} - \text{zero\_point}) \times \text{scale} $$
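To make this arithmetic concrete, here is a minimal sketch of the quantize/dequantize round trip using plain tensor operations (not the torch.quantization API shown later); the variable names and the toy tensor are my own.
import torch

# Assume weights normalized to [-1.0, 1.0], as in the example above
w = torch.empty(4, 4).uniform_(-1.0, 1.0)

scale = (1.0 - (-1.0)) / (127 - (-128))   # = 2/255, as computed above
zero_point = 0                            # we want 0.0 to map to 0
q = torch.round(w / scale + zero_point).clamp(-128, 127).to(torch.int8)

# Inverse operation: back to (approximate) float values
w_hat = (q.to(torch.float32) - zero_point) * scale
print((w - w_hat).abs().max())            # error is at most about scale/2 ≈ 0.004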
How to apply quantization?
Quantization can be applied at different stages and with different strategies. Here are a few techniques worth knowing about:
- Post-training quantization (PTQ):
- Static Quantization: quantize both weights and activations offline (after training and before inference)
- Dynamic Quantization: quantize weights offline, but activations on the fly during inference. This differs from static quantization because the scaling factor is determined based on the values seen so far during inference.
- Quantization-aware training (QAT): simulate quantization during training by rounding values, while the calculations are still done with floating-point numbers. This makes the model learn weights that are more robust to the quantization that will be applied after training. Under the hood, the idea is to add "fake" quantization operations, x -> dequantize(quantize(x)): this new value is close to x, but it still teaches the model to tolerate the 8-bit rounding and clipping noise.
import torch.quantization as tq
# 1. Post‑training static quantization (weights + activations offline)
model.eval()
model.qconfig = tq.get_default_qconfig('fbgemm') # assign a static quantization config
tq.prepare(model, inplace=True)
# we need a calibration dataset to determine the ranges of activation values
with torch.no_grad():
    for data, _ in calibration_data:
        model(data)
tq.convert(model, inplace=True) # convert to a completely int8 model
# 2. Post‑training dynamic quantization (weights offline, activations on‑the‑fly)
dynamic_model = tq.quantize_dynamic(
    model,
    {torch.nn.Linear, torch.nn.LSTM},  # layers to quantize
    dtype=torch.qint8
)
# 3. Quantization‑Aware Training (QAT)
model.train()
model.qconfig = tq.get_default_qat_qconfig('fbgemm') # set up the QAT config
tq.prepare_qat(model, inplace=True) # insert fake‑quant modules
# [here, train or fine‑tune the model as usual]
qat_model = tq.convert(model.eval(), inplace=False) # convert to real int8 after QAT
Quantization is very flexible! You can apply different precision levels to different parts of the model. For instance, you might quantize most linear layers to 8-bit for maximum speed and memory savings, while leaving critical components (e.g. attention heads, or batch-norm layers) at 16-bit or full precision.
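As a quick sketch of this flexibility with the eager-mode API used above: a qconfig can be assigned per module, and (as far as I know) setting it to None on a submodule keeps that part in floating point. The submodule name classifier here is a placeholder for whichever layer you want to protect.
import torch.quantization as tq

model.eval()
model.qconfig = tq.get_default_qconfig('fbgemm')   # default: quantize everything
model.classifier.qconfig = None                    # hypothetical submodule kept in full precision
tq.prepare(model, inplace=True)
# ... run calibration data through the model, as in the static example above ...
tq.convert(model, inplace=True)                    # only modules with a qconfig are converted to int8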
Low-Rank Factorization
Now let's talk about low-rank factorization, a technique that has been popularized with the rise of LLMs.
The key observation: many weight matrices in neural networks have an effective rank much lower than their dimensions suggest. In plain English, this means there is a lot of redundancy in the parameters.
The linear algebra behind low-rank factorization
Take a weight matrix W. Every real matrix can be represented using a Singular Value Decomposition (SVD):
$$ W = U \Sigma V^T $$
where Σ is a diagonal matrix containing the singular values in non-increasing order. The number of positive singular values corresponds to the rank r of the matrix W.

To approximate W with a matrix of rank k < r, we keep the k largest elements of Σ, together with the corresponding first k columns of U and first k columns of V:
$$ \begin{aligned} W_k &= U_k\,\Sigma_k\,V_k^T \\[6pt] &= \underbrace{U_k\,\Sigma_k^{1/2}}_{A\in\mathbb{R}^{m\times k}}\ \underbrace{\Sigma_k^{1/2}\,V_k^T}_{B\in\mathbb{R}^{k\times n}}. \end{aligned} $$
See how the new matrix can be decomposed as the product of A and B, with the total number of parameters now being m*k + k*n = k*(m+n) instead of m*n! This is a huge improvement, especially when k is much smaller than m and n. For example, with m = n = 1024 and k = 64, we go from about 1M parameters down to about 131k.
In practice, this is equivalent to replacing the linear layer x → Wx with two consecutive ones: x → A(Bx).
In PyTorch
We can apply low-rank factorization either before training (parameterizing each linear layer as two smaller matrices, which is not really a compression method but a design choice) or after training (applying a truncated SVD to the weight matrices). The second approach is by far the most common one and is implemented below.
import torch
# 1. Extract weight and select rank
W = model.layer.weight.data # (m, n)
k = 64 # desired rank
# 2. Compute an approximate low-rank SVD
U, S, V = torch.svd_lowrank(W, q=k)  # U: (m, k), S: (k,), V: (n, k)
# 3. Form the factors A and B
A = U * S.sqrt()                   # (m, k)
B = V.t() * S.sqrt().unsqueeze(1)  # (k, n)
# 4. Replace with two linear layers and insert the matrices A and B
orig = model.layer
model.layer = torch.nn.Sequential(
    torch.nn.Linear(orig.in_features, k, bias=False),
    torch.nn.Linear(k, orig.out_features, bias=False),
)
model.layer[0].weight.data.copy_(B)
model.layer[1].weight.data.copy_(A)
LoRA: an application of low-rank approximation

I think it is important to mention LoRA (Low-Rank Adaptation): you have probably heard of it if you have been following LLM fine-tuning developments. Though not strictly a compression technique, LoRA has become extremely popular for adapting large language models and making fine-tuning very efficient.
The idea is simple: during fine-tuning, rather than modifying the original model weights W, LoRA freezes them and learns trainable low-rank updates:
$$ W' = W + \Delta W = W + AB $$
where A and B are low-rank matrices. This allows for task-specific adaptation with only a fraction of the parameters.
Even better: QLoRA takes this further by combining quantization with low-rank adaptation!
Again, this is a very flexible technique and can be applied at various stages. Usually, LoRA is applied only to specific layers (for example, the attention layers' weights).
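To illustrate the idea, here is a minimal sketch of a LoRA-style wrapper around a frozen linear layer (not the official peft implementation; the rank, scaling, and initialization choices are arbitrary):
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the original weights W
            p.requires_grad = False
        # Trainable low-rank update: delta_W = B @ A, initialized so that delta_W = 0
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # W'x = Wx + (B A)x, computed without ever materializing delta_W
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

# Example: wrap one attention projection (hypothetical attribute path)
# model.attention.q_proj = LoRALinear(model.attention.q_proj, rank=8)
Only A and B receive gradients, so the number of trainable parameters drops from m*n to k*(m+n), exactly the same counting argument as in the truncated SVD above.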
Knowledge Distillation

Knowledge distillation takes a fundamentally different approach from what we have seen so far. Instead of modifying an existing model's parameters, it transfers the "knowledge" from a large, complex model (the "teacher") to a smaller, more efficient model (the "student"). The goal is to train the student model to mimic the behavior and replicate the performance of the teacher, often an easier task than solving the original problem from scratch.
The distillation loss
Let's explain some concepts in the case of a classification problem:
- The teacher model is typically a large, complex model that achieves high performance on the task at hand
- The student model is a second, smaller model with a different architecture, tailored to the same task
- Soft targets: these are the teacher model's predictions (probabilities, not labels!). They will be used by the student model to mimic the teacher's behavior. Note that we use the raw predictions rather than the labels because they also contain information about the confidence of the predictions
- Temperature: on top of the teacher's predictions, we also use a coefficient T (called the temperature) in the softmax function to extract more information from the soft targets. Increasing T softens the distribution and helps the student model give more importance to the probabilities the teacher assigns to the incorrect classes
In practice, it is pretty straightforward to train the student model. We combine the usual loss (standard cross-entropy based on hard labels) with the "distillation" loss (based on the teacher's soft targets):
$$ L_{\text{total}} = \alpha L_{\text{hard}} + (1 - \alpha) L_{\text{distill}} $$
The distillation loss is nothing but the KL divergence between the teacher and student distributions:
$$ L_{\text{distill}} = D_{\text{KL}}(q_{\text{teacher}} \,\|\, q_{\text{student}}) = \sum_i q_{\text{teacher}, i} \log \left( \frac{q_{\text{teacher}, i}}{q_{\text{student}, i}} \right) $$
As with the other methods, it is possible and encouraged to adapt this framework to the use case: for example, one can also compare logits and activations from intermediate layers of the student and teacher networks, instead of only comparing the final outputs, as sketched below.
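As an illustration, here is a minimal sketch of a feature-matching term between one intermediate layer of each model, captured with forward hooks. The attribute paths (teacher_model.encoder, student_model.encoder) and the hidden dimensions are assumptions about your architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

features = {}

def save_output(key):
    def hook(module, inputs, output):
        features[key] = output
    return hook

# Hypothetical intermediate layers; adapt the attribute paths to your models
teacher_model.encoder.register_forward_hook(save_output("teacher"))
student_model.encoder.register_forward_hook(save_output("student"))

# Optional projection if the student's hidden size differs from the teacher's
proj = nn.Linear(student_hidden_dim, teacher_hidden_dim)

def feature_distillation_loss(inputs):
    with torch.no_grad():
        teacher_model(inputs)   # fills features["teacher"]
    student_model(inputs)       # fills features["student"]
    return F.mse_loss(proj(features["student"]), features["teacher"].detach())
This extra term can simply be added to the total loss above with its own weighting coefficient.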
Knowledge distillation in practice
Just like the previous techniques, there are two options:
- Offline distillation: the pre-trained teacher model is fixed, and a separate student model is trained to mimic it. Both models are completely separate, and the teacher's weights remain frozen during the distillation process.
- Online distillation: both models are trained simultaneously, with knowledge transfer happening during the joint training process.
And below, a simple way to apply offline distillation (the last code block of this article 🙂):
import torch
import torch.nn.functional as F

def distillation_loss_fn(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Standard cross-entropy loss with hard labels
    student_loss = F.cross_entropy(student_logits, labels)
    # Distillation loss with soft targets (KL divergence)
    soft_teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # kl_div expects log-probabilities as its first argument!
    distill_loss = F.kl_div(
        soft_student_log_probs,
        soft_teacher_probs.detach(),  # don't compute gradients for the teacher
        reduction='batchmean'
    ) * (temperature ** 2)  # optional scaling factor
    # Combine the losses according to the formula
    total_loss = alpha * student_loss + (1 - alpha) * distill_loss
    return total_loss

teacher_model.eval()
student_model.train()
with torch.no_grad():
    teacher_logits = teacher_model(inputs)
student_logits = student_model(inputs)
loss = distillation_loss_fn(student_logits, teacher_logits, labels, temperature=T, alpha=alpha)
loss.backward()
optimizer.step()
Conclusion
Thanks for reading this article! In the era of LLMs, with billions or even trillions of parameters, model compression has become a fundamental concept, essential in almost every scenario to make models more efficient and easily deployable.
But as we've seen, model compression isn't just about reducing model size: it's about making thoughtful design decisions. Whether choosing between online and offline methods, compressing the entire network, or targeting specific layers or channels, each choice significantly impacts performance and usability. Most models now combine several of these techniques (check out this model, for instance).
Beyond introducing you to the main methods, I hope this article also inspires you to experiment and develop your own creative solutions!
Don't forget to check out the GitHub repository, where you will find all of the code snippets and a side-by-side comparison of the four compression methods discussed in this article.
References
- [1] Hu, E., et al. (2021). Low-Rank Adaptation of Large Language Models.
- [2] Lightning AI. Accelerating Large Language Models with Mixed Precision Techniques.
- [3] TensorFlow Blog. Pruning API in TensorFlow Model Optimization Toolkit, May 2019.
- [4] Towards AI. A Gentle Introduction to Knowledge Distillation, Aug 2022.
- [5] Ju, A. ML Algorithm: Singular Value Decomposition (SVD).
- [6] Algorithmic Simplicity. THIS is why large language models can understand the world, Apr 2023.
- [7] Frankle, J., & Carbin, M. (2019). The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks.