This is the final day of my Machine Learning Advent Calendar.
Before closing this series, I would like to sincerely thank everyone who followed it, shared feedback, and supported it, especially the Towards Data Science team.
Ending this calendar with Transformers is not a coincidence. The Transformer is not just a fancy name. It is the backbone of modern Large Language Models.
There’s quite a bit to say about RNNs, LSTMs, and GRUs. They played a key historical role in sequence modeling. But today, modern LLMs are overwhelmingly based on Transformers.
The name Transformer itself marks a rupture. From a naming perspective, the authors could have chosen something like Attention Neural Networks, consistent with Recurrent Neural Networks or Convolutional Neural Networks. As a Cartesian mind, I would have appreciated a more consistent naming structure. But naming aside, the conceptual shift introduced by Transformers fully justifies the distinction.
Transformers can be used in different ways. Encoder architectures are commonly used for classification. Decoder architectures are used for next-token prediction, and therefore for text generation.
In this article, we will focus on one core idea only: how the attention matrix transforms input embeddings into something more meaningful.
In the previous article, we introduced 1D Convolutional Neural Networks for text. We saw that a CNN scans a sentence using small windows and reacts when it recognizes local patterns. This approach is already very powerful, but it has a clear limitation: a CNN only looks locally.
Today, we move one step further.
The Transformer answers a fundamentally different question.
What if every word could look at all the other words at once?
1. The same word in two different contexts
To understand why attention is needed, we will start with a simple idea.
We will use two different input sentences, each containing the word mouse, but used in different contexts.
In the first input, mouse appears in a sentence dominated by animal-related words. In the second input, mouse appears in a sentence dominated by tech-related words.
At the input level, we deliberately use the same embedding for the word “mouse” in both cases. This is important. At this stage, the model does not know which meaning is intended.
The embedding for mouse contains both:
- a strong animal component
- a strong tech component
This ambiguity is intentional. Without context, mouse could refer to an animal or to a computer device.
All other words provide clearer signals. The animal-related words are strongly animal, the tech-related words are strongly tech, and the remaining words mainly carry grammatical information or are weakly informative on their own.
At this point, nothing in the input embeddings allows the model to decide which meaning of mouse is correct.
In the next chapter, we will see how the attention matrix performs this transformation, step by step.
2. Self-attention: how context is injected into embeddings
2.1 Self-attention, not just attention
We first clarify what kind of attention we are using here. This chapter focuses on self-attention.
Self-attention means that each word looks at the other words of the same input sequence.
In this simplified example, we make an additional pedagogical choice. We assume that Queries and Keys are directly equal to the input embeddings. In other words, there are no learned weight matrices for Q and K in this chapter.
This is a deliberate simplification. It allows us to focus entirely on the attention mechanism, without introducing extra parameters. Similarity between words is computed directly from their embeddings.
Conceptually, this means:
Q = Input
K = Input
Only the Value vectors are used later to propagate information to the output.
In real Transformer models, Q, K, and V are all obtained through learned linear projections. Those projections add flexibility, but they don’t change the logic of attention itself. The simplified version shown here captures the core idea.
Here is the complete picture that we will decompose.

2.2 From input embeddings to raw attention scores
We start from the input embedding matrix, where each row corresponds to a word and every column corresponds to a semantic dimension.
The first operation is to compare every word with every other word. This is done by computing dot products between Queries and Keys.
Because Queries and Keys are equal to the input embeddings in this example, this step reduces to computing dot products between input vectors.
All dot products are computed directly using a matrix multiplication:
Scores = Input × Inputᵀ
Each cell of this matrix answers a simple question: how similar are these two words, given their embeddings?
At this stage, the values are raw scores. They are not probabilities, and they do not yet have a direct interpretation as weights.
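As a minimal NumPy sketch of this step, the snippet below uses three hypothetical word vectors (the values are illustrative, not taken from the article's Excel example). With Q = K = Input, the raw scores are a single matrix product.

```python
import numpy as np

# Toy input embeddings: one row per word, one column per semantic dimension.
# The three words and their values are purely illustrative.
X = np.array([
    [0.9, 0.1, 0.0, 0.2],   # a clearly animal-related word
    [0.6, 0.0, 0.6, 0.1],   # the ambiguous word "mouse": animal + tech components
    [0.0, 0.1, 0.9, 0.2],   # a clearly tech-related word
])

# With Q = K = Input, the raw attention scores are just
# the dot product of every word with every other word.
scores = X @ X.T            # shape: (3, 3)
print(scores)
```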

2.3 Scaling and normalization
Raw dot products can grow large as the embedding dimension increases. To keep values in a stable range, the scores are scaled by the square root of the embedding dimension.
ScaledScores = Scores / √d
This scaling step is not conceptually deep, but it is practically important. It prevents the next step, the softmax, from becoming too sharp.

Once scaled, a softmax is applied row by row. This converts raw scores into positive values that sum to one.
The result is the attention matrix.
And attention is all you need.
Each row of this matrix describes how much attention a given word pays to every other word in the sentence.
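Continuing the same hypothetical sketch, scaling by √d and a row-wise softmax turn the raw scores into the attention matrix:

```python
d = X.shape[1]                            # embedding dimension (4 here)
scaled = scores / np.sqrt(d)              # keep the values in a stable range

# Row-wise softmax: each row becomes positive weights that sum to one.
exp = np.exp(scaled - scaled.max(axis=1, keepdims=True))   # shift for numerical stability
attention = exp / exp.sum(axis=1, keepdims=True)

print(attention)                          # the attention matrix
print(attention.sum(axis=1))              # each row sums to 1
```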

2.4 Interpreting the attention matrix
The attention matrix is the central object of self-attention.
For a given word, its row in the attention matrix answers the question: when updating this word, which other words matter, and how much?
For example, the row corresponding to mouse assigns higher weights to words that are semantically related in the current context. In the animal-related sentence, mouse attends more to animal-related words. In the tech-related sentence, it attends more to technical words.
The mechanism is identical in both cases. Only the surrounding words change the outcome.
2.5 From attention weights to output embeddings
The attention matrix itself is not the output. It is a set of weights.
To produce the output embeddings, we combine these weights with the Value vectors.
Output = Attention × V
In this simplified example, the Value vectors are taken directly from the input embeddings. Each output word vector is therefore a weighted average of the input vectors, with weights given by the corresponding row of the attention matrix.
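Still with the same toy sketch, and with Values taken directly from the input embeddings, the mixing step is one more matrix product:

```python
V = X                        # simplified example: Values are the input embeddings
output = attention @ V       # each output row is a weighted average of the input rows

# The row of the ambiguous word is now a mix of its own embedding
# and the embeddings of the words it attends to most.
print(output)
```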
For a word like mouse, this means that its final representation becomes a mixture of:
- its own embedding
- the embeddings of the words it attends to most
This is the exact moment where context is injected into the representation.

At the end of self-attention, the embeddings are no longer ambiguous.
The word mouse no longer has the same representation in both sentences. Its output vector reflects its context. In one case, it behaves like an animal. In the other, it behaves like a technical object.
Nothing in the embedding table changed. What changed is how information was combined across words.
This is the core idea of self-attention, and the foundation on which Transformer models are built.
If we now compare the two examples, on the left and on the right, the effect of self-attention becomes explicit.
In both cases, the input embedding of mouse is identical. Yet the final representation differs. In the animal-related sentence, the output embedding of mouse is dominated by the animal dimension. In the tech-related sentence, the technical dimension becomes more prominent. Nothing in the embedding table changed. The difference comes entirely from how attention redistributed weights across words before mixing the values.
This comparison highlights the role of self-attention: it does not change words in isolation, but reshapes their representations by taking the full context into account.

3. Learning how to mix information

3.1 Introducing learned weights for Q, K, and V
Until now, we have focused on the mechanics of self-attention itself. We now introduce a crucial element: learned weights.
In a real Transformer, Queries, Keys, and Values are not taken directly from the input embeddings. Instead, they are produced by learned linear transformations.
For every word embedding, the model computes:
Q = Input × W_Q
K = Input × W_K
V = Input × W_V
These weight matrices are learned during training.
At this stage, we keep the same dimensionality: the input embeddings, Q, K, V, and the output embeddings all have the same number of dimensions. This makes the role of attention easier to understand: it modifies representations without changing the space they live in.
Conceptually, these weights allow the model to decide two things, as sketched in the code after this list:
- which aspects of a word matter for comparison (Q and K)
- which aspects of a word should be transmitted to others (V)
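As a rough sketch of this full version, the snippet below reuses the toy matrix X from earlier and replaces the learned weight matrices with random placeholders, only to show the shapes and the flow, not real trained values:

```python
rng = np.random.default_rng(0)
d = X.shape[1]

# In a real model these matrices are learned during training;
# random values are used here only to illustrate shapes and data flow.
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

Q = X @ W_Q
K = X @ W_K
V = X @ W_V

scores = Q @ K.T / np.sqrt(d)                               # scaled dot products
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)               # row-wise softmax

output = weights @ V         # same dimensionality as the input in this setup
```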

3.2 What the model actually learns
The attention mechanism itself is fixed. Dot products, scaling, softmax, and matrix multiplications always work the same way. What the model actually learns are the projections.
By adjusting the Q and K weights, the model learns how to measure relationships between words for a given task. By adjusting the V weights, it learns what information should be propagated when attention is high. The structure defines how information flows, while the weights define what information flows.
Because the attention matrix depends on Q and K, it is partially interpretable. We can inspect which words attend to which others and observe patterns that often align with syntax or semantics.
This becomes clear when comparing the same word in two different contexts. In both examples, the word mouse starts with the exact same input embedding, containing both an animal and a tech component. On its own, it is ambiguous.
What changes is not the word, but the attention it receives. In the animal-related sentence, attention emphasizes animal-related words. In the tech-related sentence, attention shifts toward technical words. The mechanism and the weights are identical in both cases, yet the output embeddings differ. The difference comes entirely from how the learned projections interact with the surrounding context.
This is precisely why the attention matrix is interpretable: it reveals which relationships the model has learned to consider meaningful for the task.

3.3 Changing the dimensionality on purpose
Nothing, however, forces Q, K, and V to have the same dimensionality as the input.
The Value projection, in particular, can map embeddings into a space of a different size. When this happens, the output embeddings inherit the dimensionality of the Value vectors.
This is not a theoretical curiosity. It is exactly what happens in real models, especially in multi-head attention. Each head operates in its own subspace, often with a smaller dimension, and the results are later concatenated into a larger representation.
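A minimal sketch of this idea, again with random placeholder weights on the toy matrix X: each head projects the Values into a smaller, hypothetical subspace, and the head outputs are concatenated at the end.

```python
rng = np.random.default_rng(1)
d_model = X.shape[1]          # input dimension (4 in the toy example)
n_heads = 2
d_head = d_model // n_heads   # each head works in a smaller subspace (2 here)

heads = []
for _ in range(n_heads):
    # Random stand-ins for the learned per-head projections.
    W_Qh = rng.normal(size=(d_model, d_head))
    W_Kh = rng.normal(size=(d_model, d_head))
    W_Vh = rng.normal(size=(d_model, d_head))

    Qh, Kh, Vh = X @ W_Qh, X @ W_Kh, X @ W_Vh
    s = Qh @ Kh.T / np.sqrt(d_head)
    a = np.exp(s - s.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)
    heads.append(a @ Vh)      # each head outputs d_head-dimensional vectors

# The head outputs are concatenated back into a d_model-sized representation.
multi_head_output = np.concatenate(heads, axis=1)   # shape: (3, 4)
```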
So attention can do two things:
- mix information across words
- reshape the space in which this information lives
This explains why Transformers scale so well.
They do not rely on fixed features. They learn:
- how to compare words
- how to route information
- how to project meaning into different spaces

The attention matrix controls where information flows.
The learned projections control what information flows and how it is transformed.
Together, they form the core mechanism behind modern language models.
Conclusion
This Advent Calendar was built around a simple idea: understanding machine learning models by looking at how they actually transform data.
Transformers are a fitting way to close this journey. They do not rely on fixed rules or local patterns, but on learned relationships between all elements of a sequence. Through attention, they turn static embeddings into contextual representations, which is the foundation of modern language models.
Thanks again to everyone who followed this series, shared feedback, and supported it, especially the Towards Data Science team.
Merry Christmas 🎄
All the Excel files are available through this Kofi link. Your support means a lot to me. The price will increase during the month, so early supporters get the best value.
Discount codes are hidden across the articles from Day 19 to Day 24. Find them and pick the one you prefer.

