While working on my Knowledge Distillation problem for intent classification, I faced a puzzling roadblock. My setup involved a teacher model, RoBERTa-large (fine-tuned on my intent classification data), and a student model that I was attempting to train without losing too much accuracy compared with the teacher.
I experimented with multiple mapping techniques: connecting every 2nd teacher layer to a student layer, averaging two teacher layers into one, and even assigning custom weights (like 0.3 to l1 and 0.7 to l2). But no matter what combination I tried, the student model never matched the teacher's accuracy.
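For illustration, here is roughly what the weighted variant can look like in PyTorch. This is not my actual code: the layer pairing, the shared projection, and the 0.3/0.7 split are just the example from above.

```python
import torch.nn.functional as F

def weighted_pair_loss(teacher_hidden, student_hidden, proj, pairs):
    """Blend two teacher layers (0.3 / 0.7) into one target per student layer.

    teacher_hidden / student_hidden: tuples of (batch, seq_len, dim) tensors,
        one per layer (e.g. from output_hidden_states=True).
    proj: a torch.nn.Linear mapping the student dim to the teacher dim.
    pairs: list of (student_idx, teacher_idx_1, teacher_idx_2) triples.
    """
    loss = 0.0
    for s_idx, t1, t2 in pairs:
        target = 0.3 * teacher_hidden[t1] + 0.7 * teacher_hidden[t2]
        loss = loss + F.mse_loss(proj(student_hidden[s_idx]), target)
    return loss
```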
That's when I began exploring how to map the most informative layers. I wanted a way to quantify which layers of the teacher model truly matter for distillation.
Curious, I decided to adapt the SpectralKD idea, originally demonstrated on images, to text data, and for the first time my student model started thinking almost like its teacher.
Source: Author
Here's the layer intensity graph of my fine-tuned RoBERTa-large model. Based on the spectral insights, I selected layers 1–9 and 21–23 for my student model during knowledge distillation, the ones carrying the richest information.
I can't share my dataset or code for confidentiality reasons, but I'll walk you through how the paper's image-based approach inspired my text-based adaptation, and how you might think about doing the same.
Behind the Scenes: How FFT Reveals a Model’s Spectral Soul
So, let's start with spectral intensity, and slowly dive into the real magician here: the Fast Fourier Transform (FFT).
In the SpectralKD paper, the authors introduce a framework that lets us see Vision Transformers (ViTs) in terms of not just what they predict, but also how information flows through their layers. Instead of relying on intuition or visualisation, they use spectral analysis, a way to measure the frequency richness of the model's internal representations.
Imagine each Transformer layer as a musician in an orchestra: some layers play high notes (fine details), while others play low notes (broad features). The FFT lets us listen to each player individually and pick out who carries the strongest melody, i.e., the most information-rich signal.
Source: Author
Step 1: Feature maps, the raw material
A layer's feature map is a tensor X ∈ Rᴮ ˣ ᶜ ˣ ᴴ ˣ ᵂ, where B is the batch size, C is the number of channels, and H, W are the spatial height and width.
Step 2: Applying the Fourier Transform
The authors apply a 1-dimensional FFT along the channel dimension to translate these real-valued activations into the frequency domain: F(X)=FFT(X)
This means: for each spatial location (b, h, w), a 1D FFT is computed across all channels. The result is a complex-valued tensor (since the FFT outputs real + imaginary parts). F(X) therefore tells us how much of each frequency is present in that layer's representation.
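To make this concrete, here's a tiny PyTorch sketch (the shapes are illustrative, not the paper's exact ones) showing that the channel-wise FFT yields a complex-valued tensor:

```python
import torch

feat = torch.randn(8, 384, 14, 14)   # (B, C, H, W): an illustrative feature map
freq = torch.fft.fft(feat, dim=1)    # 1D FFT along the C (channel) dimension
print(freq.shape, freq.dtype)        # torch.Size([8, 384, 14, 14]) torch.complex64
```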
And if you're wondering why the FFT in the first place, hold that thought: later in this blog, we're going to uncover exactly why it's the right tool to measure a model's inner intensity.
Step 3: Measuring frequency strength
The frequency strength at each position is the magnitude |F(X)| = √(Re(F(X))² + Im(F(X))²), where Re(F(X)) is the real part and Im(F(X)) is the imaginary part.
Step 4: Averaging across the map
Now we want to summarize this intensity across all positions in the layer:
This step gives us the average intensity of a single channel.
Then you simply average over all channels. Voilà! Now you have the spectral intensity of a single layer of the Vision Transformer.
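Putting Steps 1 through 4 together, a minimal sketch of the per-layer computation might look like this (my own paraphrase in PyTorch, with an illustrative feature-map shape rather than the authors' code):

```python
import torch

def spectral_intensity(feature_map: torch.Tensor) -> float:
    """Spectral intensity of one layer's feature map of shape (B, C, H, W).

    Step 2: 1D FFT along the channel dimension.
    Step 3: magnitude from the real and imaginary parts.
    Step 4: average over batch, spatial positions, and channels.
    """
    freq = torch.fft.fft(feature_map, dim=1)                  # complex tensor, same shape
    magnitude = torch.sqrt(freq.real ** 2 + freq.imag ** 2)   # frequency strength
    return magnitude.mean().item()                            # one scalar per layer

# Usage: rank the layers of a ViT by intensity. `feature_maps` is a hypothetical
# list holding one (B, C, H, W) tensor per transformer block.
# intensities = [spectral_intensity(f) for f in feature_maps]
```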
Peeking into the Frequency Realm: The Fourier Lens of SpectralKD
Let’s look into the Fast Fourier Transform:
Xₖ = Σₙ₌₀ᴺ⁻¹ xₙ · e⁻ʲ²πᵏⁿ/ᴺ, where xₙ is the input sequence (your signal, feature, or activation pattern), Xₖ is the frequency component at frequency index k, and N is the number of points in the sequence (i.e., the number of channels or features).
Each term e⁻ʲ²πᵏⁿ/ᴺ acts as a rotating phasor, a tiny complex wave spinning through the signal space, and together they form one of the most beautiful ideas in signal processing.
Source: Author (a rotating phasor e⁻ʲ²πᵏⁿ/ᴺ multiplied by g(t) in the complex plane)
Source: Author (averaging all the points in the complex plane gives the center of mass of the phasor trace, which peaks only at a particular frequency k; in the case above, k = 3)
OMG! What just happened here? Let me break it down.
When you multiply your hidden activations xₙ (say, across channels or feature dimensions) by this phasor, you're essentially asking:
"Hey, layer, how much of the frequency-k pattern do you contain in your representations?"
Each frequency k corresponds to a distinct pattern scale across the feature dimensions.
Lower k values capture broad, smooth semantic structures (like topic-level context), while higher k values capture rapid, fine-grained variations (like token-level nuances or syntactic signals).
Now here's the fun part: if a layer resonates with a particular frequency pattern, the multiplication by the phasor aligns perfectly, and the sum in the Fourier formula produces a strong response for that k.
If not, the rotations cancel out, meaning that frequency doesn't play a big role in that layer's representation.
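You can verify this resonance-versus-cancellation behaviour with a toy signal: a pure wave at k = 3 (matching the figure above) lights up exactly that bin and almost nothing else.

```python
import torch

N = 64
n = torch.arange(N, dtype=torch.float32)
signal = torch.cos(2 * torch.pi * 3 * n / N)   # a pure "k = 3" pattern
magnitude = torch.fft.fft(signal).abs()

print(magnitude.argmax().item())   # 3  -> the phasor at k = 3 aligns ("resonates")
print(magnitude[5].item())         # ~0 -> rotations at other k cancel out
```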
So the Fourier Transform isn't adding anything new; it's simply revealing how our layer encodes information across different scales of abstraction.
It’s like zooming out and realizing:
Some layers hum quietly with smooth, conceptual meanings (low frequencies),
Others buzz with sharp, detailed interactions between tokens (high frequencies).
The FFT basically turns a layer's hidden states into a frequency fingerprint: a map of what kinds of information that layer is specializing in.
And that's exactly what SpectralKD uses to figure out which layers are most informative during knowledge distillation.
From Vision to Language: How Spectral Intensity Guided My Intent Classifier
Source: Author
Let a layer activation tensor be X ∈ Rᴺ ˣ ᴸ ˣ ᴴ, where:
N = number of samples (batch size)
L = sequence length (number of tokens/time steps)
H = hidden dimension (number of channels/features produced by the layer)
Each sample i has an activation matrix Xᵢ ∈ Rᴸ ˣ ᴴ (sequence positions × hidden features).
Now, just as before, you can compute the FFT of that Xᵢ, measure the frequency strength using the real and imaginary components, average across the channels, and then across each layer (a minimal sketch follows the definitions below).
Frequency strength:
Frequency across channels:
Frequency across a layer:
Here, K is the number of frequency bins retained.
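Since I can't share my code, here is a minimal sketch of how such a per-layer intensity could be computed for text, assuming HuggingFace-style hidden states and taking the FFT along the hidden dimension to mirror the channel-wise FFT of the ViT case (the k_bins argument plays the role of K above; the variable names are illustrative):

```python
import torch

def layer_spectral_intensity(hidden: torch.Tensor, k_bins: int = 64) -> float:
    """Spectral intensity of one transformer layer for text.

    hidden: (N, L, H) activations -- N samples, L tokens, H hidden features.
    The FFT is taken along the hidden dimension, only the first k_bins
    frequency bins are kept, and the magnitudes are averaged over samples,
    tokens, and bins to give one scalar per layer.
    """
    freq = torch.fft.fft(hidden, dim=-1)                      # complex (N, L, H)
    magnitude = torch.sqrt(freq.real ** 2 + freq.imag ** 2)   # frequency strength
    return magnitude[..., :k_bins].mean().item()

# Illustrative usage with a HuggingFace encoder (names are assumptions, not my setup):
# outputs = model(**batch, output_hidden_states=True)
# intensities = [layer_spectral_intensity(h) for h in outputs.hidden_states[1:]]
# Rank or plot `intensities` to pick the most informative teacher layers.
```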
Conclusion
Their analysis shows two major insights:
Not all layers contribute equally. In uniform transformer architectures, only a few early and final layers show strong spectral activity, the real "hotspots" of information flow.
Different transformer types, similar melodies. Despite architectural variations, both hierarchical and uniform transformers share surprisingly similar spectral patterns, hinting at a universal way these models learn and represent knowledge.
Building on these findings, SpectralKD introduces a simple, parameter-free knowledge distillation (KD) strategy. By selectively aligning the spectral behavior of early and final layers between a teacher and a student model, the student learns to mirror the teacher's behavior, even in intermediate layers that were never explicitly aligned.
The results reported in the paper are striking: the distilled student (DeiT-Tiny) doesn't just match performance on benchmarks like ImageNet-1K, it also mirrors the teacher's spectral behavior, capturing both local and global information with remarkable fidelity.
Ultimately, SpectralKD bridges interpretability and distillation, offering a fresh way to visualize what happens inside transformers during learning. It opens a new line of research the authors call "distillation dynamics": a journey into how knowledge itself flows, oscillates, and harmonizes between teacher and student networks.