Gall's Law
A complex system that works is invariably found to have evolved from a simple system that worked
John Gall
This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models. We'll achieve
this by iteratively improving our approach to encoding position, arriving at Rotary Positional Encoding (RoPE), used in the latest Llama 3.2 release and most modern transformers. This post aims to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry and an understanding of self-attention is assumed.
Problem Statement
You shall know a word by the company it keeps
John Rupert Firth
As with all problems, it's best to first understand exactly what we're trying to achieve. The self-attention mechanism in transformers is used to understand relationships
between tokens in a sequence. Self-attention is a set operation, which
means it is permutation equivariant. If we don't
enrich self-attention with positional information, many important relationships cannot be determined.
This is best demonstrated with an example.
Motivating Example
Consider this sentence with the same word in different positions:
Intuitively, "dog" refers to two different entities. Let's see what happens if we first tokenize the sentence, map the tokens to the real token embeddings of Llama 3.2 1B, and pass them through torch.nn.MultiheadAttention.
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The dog chased another dog"
tokens = tok(text, return_tensors="pt")["input_ids"]
embeddings = model.embed_tokens(tokens)
hdim = embeddings.shape[-1]

W_q = nn.Linear(hdim, hdim, bias=False)
W_k = nn.Linear(hdim, hdim, bias=False)
W_v = nn.Linear(hdim, hdim, bias=False)
mha = nn.MultiheadAttention(embed_dim=hdim, num_heads=4, batch_first=True)

# Use random (untrained) attention weights
with torch.no_grad():
    for param in mha.parameters():
        nn.init.normal_(param, std=0.1)

output, _ = mha(W_q(embeddings), W_k(embeddings), W_v(embeddings))

# The two occurrences of "dog" sit at token positions 2 and 5 (position 0 is the BOS token)
dog1_out = output[0, 2]
dog2_out = output[0, 5]
print(f"Dog outputs identical?: {torch.allclose(dog1_out, dog2_out, atol=1e-6)}")
```
As we can see, without any positional information, the output of a (multi-headed)
self-attention operation is identical for the same token appearing in
different positions, despite the tokens clearly representing distinct entities. Let's begin designing a method of enhancing self-attention with positional information, so that it can determine relationships between words encoded by
their positions.
How should an ideal positional encoding scheme behave?
Desirable Properties
Let's try to define some desirable properties that will make the optimization
process as easy as possible.
Property 1 – Unique encoding for every position (across sequences)
Each position needs a unique encoding that remains consistent regardless of sequence length – a token at position 5 should have the same encoding whether the current sequence is of length 10 or 10,000.
Property 2 – Linear relation between two encoded positions
The relationship between positions should be mathematically simple. If we know the encoding for position $p$, it should be straightforward to compute the encoding for position $p+k$, making it easier for the model to learn positional patterns.
If you think about how we represent numbers on a number line, it's easy to understand that 5 is 2 steps away from 3, or that 10 is 5 steps away from 15. The same intuitive relationship should exist in our encodings.
Property 3 – Generalizes to longer sequences than those encountered in training
To increase our models' utility in the real world, they should generalize outside
their training distribution. Therefore, our encoding scheme needs to be
adaptable enough to handle unexpected input lengths without
violating any of our other desirable properties.
Property 4 – Generated by a deterministic process the model can learn
It would be ideal if our positional encodings could be drawn from a
deterministic process. This should allow the model to learn the mechanism
behind our encoding scheme efficiently.
Property 5 – Extensible to multiple dimensions
With multimodal models becoming the norm, it's crucial that our positional
encoding scheme can naturally extend from $1D$ to $nD$. This will allow models to consume data like images or brain scans, which are $2D$ and $3D$
respectively.
Now that we know the ideal properties (henceforth referred to as Properties 1–5), let's start designing and iterating on our encoding scheme.
Integer Position Encoding
The first approach that might jump to mind is to simply add the integer value of the token position to each component of the token embedding, with values ranging from $0$ to $L-1$, where $L$ is the
length of our current sequence.
In the above animation, we create the positional encoding vector for the token from its index and add it to the token embedding. The embedding values here are a subset of the real values from Llama 3.2 1B, and we can observe that they are clustered around 0. This
is desirable to avoid vanishing or exploding gradients during training, and is therefore something we want to maintain throughout the model.
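To make this concrete, here is a minimal sketch of the naïve scheme using toy values (the tensor shapes and variable names here are illustrative, not taken from the real model):

```python
import torch

# Toy example: token embeddings clustered around 0, like the real Llama 3.2 1B values.
seq_len, hdim = 6, 8
embeddings = torch.randn(seq_len, hdim) * 0.02

# Integer positional "encoding": add each token's index to every component of its embedding.
positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # shape (seq_len, 1)
encoded = embeddings + positions

print(encoded[5])  # the position value (5.0) dwarfs the ~0.02-scale semantic values
```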
It's clear that our current naïve approach is going to cause problems: the magnitude of the position value
vastly exceeds the actual values of our input. This means the signal-to-noise
ratio is very low, and it's hard for the model to separate the semantic
information from the positional information.
With this new knowledge, a natural follow-on might be to normalize the position value by $\frac{1}{N}$. This constrains the values between 0 and 1, but introduces another problem. If we choose $N$ to be the length of the current sequence, then the position values will be completely different for each sequence of differing length, violating Property 1.
Is there a better way to ensure our numbers are between 0 and 1?
If we thought really hard about this for a while, we might come up with the idea of switching
from decimal to binary numbers.
Binary Position Encoding
Instead of adding our (potentially normalized) integer position to each
component of the embedding, we could instead convert it into its binary
representation and s t r e t c h the value out to match our embedding dimension, as demonstrated below.
We have converted the position of interest (252) into its binary representation
(11111100) and added each bit to the corresponding component of the
token embedding. The least significant bit (LSB) will cycle between 0 and 1 for every
subsequent token, whilst the most significant bit (MSB) will cycle every $2^{n-1}$ tokens, where $n$ is the number of bits.
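As a rough sketch (the function name and the LSB-first bit order are my own choices for illustration), the binary scheme looks something like this:

```python
import torch

def binary_position_encoding(pos: int, dim: int) -> torch.Tensor:
    """Encode `pos` as its bits, least significant bit first, one bit per
    embedding component (assumes dim is at least the number of bits needed)."""
    bits = [(pos >> i) & 1 for i in range(dim)]
    return torch.tensor(bits, dtype=torch.float32)

print(binary_position_encoding(252, 8))  # tensor([0., 0., 1., 1., 1., 1., 1., 1.])
```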
You can see the positional encoding vector for different indices in the animation below[^1].
We have solved the value range problem, and we now have unique encodings that are
consistent across different sequence lengths. What happens if we plot a low-dimensional version of our token embedding and visualize the addition of our binary positional vector for different position values?
We can see that the result is very "jumpy" (as we would expect from the
discrete nature of binary). The optimization process prefers smooth, continuous and
predictable changes. Do we know any functions with similar value ranges that are smooth and continuous?
If we looked around a little, we would notice that both $\sin$ and $\cos$ fit the bill!
Sinusoidal positional encoding
The above animation visualizes our position embedding if each component is
alternately drawn from $\sin$ and $\cos$ with steadily increasing
wavelengths. If you compare it with the previous animation, you'll notice a striking similarity!
We have now arrived at Sinusoidal Embeddings, originally defined in the Attention Is All You Need paper.
Let's take a look at the equations:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the token's position index, $2i$ and $2i+1$ are the even and odd component indices
in the positional encoding vector, and $d$ is the model dimension. $10000$ is the base wavelength (henceforth referred to as $\theta$), which we stretch or compress as a function of the component index. I encourage you to plug in some realistic values to get a feel for this
geometric progression.
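Here is a small sketch of what "plugging in some values" could look like in code (a straightforward reading of the equations above, not an optimized implementation):

```python
import torch

def sinusoidal_encoding(seq_len: int, d_model: int, base: float = 10000.0) -> torch.Tensor:
    """Build the sinusoidal positional encoding matrix (assumes d_model is even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even component indices 2i
    freqs = 1.0 / base ** (i / d_model)                            # geometric progression of frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freqs)                           # even components use sin
    pe[:, 1::2] = torch.cos(pos * freqs)                           # odd components use cos
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
print(pe.shape, pe.min().item(), pe.max().item())  # values stay within [-1, 1]
```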
There are a few parts of these equations that are confusing at first glance. How did the
authors choose $10000$? And why are we using $\sin$ and $\cos$ for the even and odd components respectively?
It seems that using $10000$ for the base wavelength was determined experimentally[^2]. Deciphering the use of both $\sin$ and $\cos$ is more involved, but crucial
for our iterative approach to understanding. The key here is our desire for a linear relation between two encoded positions (Property 2). To understand how using $\sin$ and $\cos$ in tandem produces this linear relation, we will have to dive into some trigonometry.
Consider a sequence of sine and cosine pairs, each associated with a frequency $\omega_i$. Our goal is to find a linear transformation matrix $\mathbf{M}$ that can shift these sinusoidal functions by a fixed offset $k$:

$$\mathbf{M} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i (p + k)) \\ \cos(\omega_i (p + k)) \end{bmatrix}$$
The frequencies follow a geometric progression that decreases with the dimension index $i$, defined as:

$$\omega_i = \frac{1}{10000^{2i/d}}$$
To find this transformation matrix, we can express it as a general 2×2 matrix with unknown coefficients $u_1$, $v_1$, $u_2$ and $v_2$:

$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i (p + k)) \\ \cos(\omega_i (p + k)) \end{bmatrix}$$
By applying the trigonometric addition theorem to the right-hand side, we can expand this into:

$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i p)\cos(\omega_i k) + \cos(\omega_i p)\sin(\omega_i k) \\ \cos(\omega_i p)\cos(\omega_i k) - \sin(\omega_i p)\sin(\omega_i k) \end{bmatrix}$$
This expansion gives us a system of two equations by matching coefficients:

$$\begin{align*} u_1 \sin(\omega_i p) + v_1 \cos(\omega_i p) &= \cos(\omega_i k)\sin(\omega_i p) + \sin(\omega_i k)\cos(\omega_i p) \\ u_2 \sin(\omega_i p) + v_2 \cos(\omega_i p) &= -\sin(\omega_i k)\sin(\omega_i p) + \cos(\omega_i k)\cos(\omega_i p) \end{align*}$$

By comparing the terms with $\sin(\omega_i p)$ and $\cos(\omega_i p)$ on both sides, we can solve for the unknown coefficients:

$$u_1 = \cos(\omega_i k), \quad v_1 = \sin(\omega_i k), \quad u_2 = -\sin(\omega_i k), \quad v_2 = \cos(\omega_i k)$$
These solutions give us our final transformation matrix $\mathbf{M}_k$:

$$\mathbf{M}_k = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}$$
If you've done any game programming before, you might notice that the
result of our derivation is oddly familiar. That's right, it's the Rotation Matrix![^3]
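If you want to convince yourself numerically, here is a quick sanity check of the derivation (the values of $\omega$, $p$ and $k$ are arbitrary):

```python
import math
import torch

omega, p, k = 0.3, 7.0, 5.0  # arbitrary frequency, position and offset

# The transformation matrix we derived; note it depends only on the offset k.
M_k = torch.tensor([[math.cos(omega * k),  math.sin(omega * k)],
                    [-math.sin(omega * k), math.cos(omega * k)]])

enc_p = torch.tensor([math.sin(omega * p), math.cos(omega * p)])
enc_p_plus_k = torch.tensor([math.sin(omega * (p + k)), math.cos(omega * (p + k))])

# M_k maps the encoding at position p to the encoding at position p + k.
print(torch.allclose(M_k @ enc_p, enc_p_plus_k, atol=1e-6))  # True
```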
So the encoding scheme designed by Noam Shazeer in Attention Is All You Need was already encoding relative position as a rotation back in 2017! It took another 4 years to go from Sinusoidal Encoding to RoPE, despite rotations already being on the table…
Absolute vs Relative Position Encoding
With the knowledge that rotations are important here in hand, let's
return to our motivating example and try to discover some intuitions for our next iteration.
$$\begin{align*} &\hspace{0.3em}\text{-2} \hspace{1.4em} \text{-1} \hspace{1.7em} 0 \hspace{2.6em} 1 \hspace{2.4em} 2 \\ &\text{The dog } \color{#699C52}\text{chased } \color{black}\text{another dog} \end{align*}$$
Above we can see the absolute positions of our tokens, and the relative
positions from "chased" to every other token. With Sinusoidal Encoding, we
generated a separate vector that represents the absolute position,
and using some trigonometric trickery we were able to encode relative positions.
When we're trying to understand these sentences, does it matter that this word is the 2157th word in this blog post? Or do we care about its relationship to the words around it? The absolute position of a word rarely matters for meaning – what matters is how words relate to each other.
Positional encoding in context
From this point on, it's key to consider positional encoding in the context of
self-attention. To reiterate, the self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.
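As a reminder, scaled dot-product attention, as defined in Attention Is All You Need, computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are the query, key and value projections of our (position-enriched) token embeddings, and $d_k$ is the key dimension.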
In all of our previous iterations, we have generated a separate positional encoding
vector and added it to our token embedding prior to our $Q$, $K$ and $V$ projections.
By adding the positional information directly to our token embedding, we are
polluting the semantic information with the positional information. We should
be trying to encode the positional information without modifying the norm. Shifting to multiplicative encoding is the
key.
Using the dictionary analogy, when looking up a word (query) in our dictionary (keys), nearby words should have more influence than distant ones. The influence of one token upon another is determined by the $QK^T$ dot product – so that's exactly where we should focus our positional encoding!
The geometric interpretation of the dot product, $\mathbf{q} \cdot \mathbf{k} = \|\mathbf{q}\|\,\|\mathbf{k}\|\cos\theta$, gives us a powerful insight.
We can modulate the result of the dot product of two vectors purely by
increasing or decreasing the angle between them. Furthermore, by rotating a
vector, we have absolutely zero impact on its norm, which encodes
the semantic information of our token.
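A tiny sketch makes this concrete: rotating a 2D vector leaves its norm untouched while changing its dot product with another vector (the vectors and the angle below are arbitrary):

```python
import math
import torch

def rotate_2d(v: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2D vector by `angle` radians."""
    R = torch.tensor([[math.cos(angle), -math.sin(angle)],
                      [math.sin(angle),  math.cos(angle)]])
    return R @ v

q = torch.tensor([1.5, 0.5])
k = torch.tensor([0.8, 1.2])
q_rot = rotate_2d(q, 0.7)

print(torch.isclose(q.norm(), q_rot.norm()))  # True -> the "semantic" norm is preserved
print(torch.dot(q, k).item(), torch.dot(q_rot, k).item())  # different -> the attention score changes
```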
So now that we know where to focus our attention, and have seen from another angle why a
rotation is a sensible "channel" through which to encode our positional
information, let's put it all together!
Rotary Positional Encoding
Rotary Positional Encoding, or RoPE, was defined in the
RoFormer paper (Jianlin Su designed it independently on his blog here and here).
While it may seem like voodoo if you skip straight to the end result, by thinking about Sinusoidal Encoding in the
context of self-attention (and more specifically dot products), we can see how
it all comes together.
Much like in Sinusoidal Encoding, we decompose our vectors $\mathbf{q}$ or $\mathbf{k}$ (post-projection, instead of the pre-projection $\mathbf{x}$) into 2D pairs/chunks. Rather than encoding absolute position directly by adding a vector drawn from sinusoidal functions of slowly decreasing frequencies, we cut to the chase and encode relative position by multiplying each pair with the rotation matrix.
Let $\mathbf{q}$ or $\mathbf{k}$ be our input vector at position $p$. We create a block diagonal matrix
where $\mathbf{M}_i$ is the corresponding rotation matrix for that component
pair's desired rotation:

$$R(\mathbf{q}, p) = \begin{bmatrix} \mathbf{M}_1 & & & \\ & \mathbf{M}_2 & & \\ & & \ddots & \\ & & & \mathbf{M}_{d/2} \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_d \end{bmatrix}$$
Much the same as in Sinusoidal Encoding, $\mathbf{M}_i$ is simply:

$$\mathbf{M}_i = \begin{bmatrix} \cos(\omega_i p) & \sin(\omega_i p) \\ -\sin(\omega_i p) & \cos(\omega_i p) \end{bmatrix}$$
In practice, we don't use a matrix multiplication to compute RoPE, as it would be
computationally inefficient with such a sparse matrix. Instead, we can directly apply the rotations to pairs of elements independently, taking advantage of the regular pattern in the computation:

$$R(\mathbf{q}, p) = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \\ \vdots \\ q_{d-1} \\ q_d \end{bmatrix} \odot \begin{bmatrix} \cos(\omega_1 p) \\ \cos(\omega_1 p) \\ \cos(\omega_2 p) \\ \cos(\omega_2 p) \\ \vdots \\ \cos(\omega_{d/2} p) \\ \cos(\omega_{d/2} p) \end{bmatrix} + \begin{bmatrix} q_2 \\ -q_1 \\ q_4 \\ -q_3 \\ \vdots \\ q_d \\ -q_{d-1} \end{bmatrix} \odot \begin{bmatrix} \sin(\omega_1 p) \\ \sin(\omega_1 p) \\ \sin(\omega_2 p) \\ \sin(\omega_2 p) \\ \vdots \\ \sin(\omega_{d/2} p) \\ \sin(\omega_{d/2} p) \end{bmatrix}$$
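Here is a minimal sketch of RoPE applied to a single query/key vector, following the pairwise form above (for clarity only – production implementations batch this over heads and sequence positions, and often use an equivalent "rotate-half" layout; the function name here is my own):

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive pairs (x_1, x_2), (x_3, x_4), ... of a single query/key
    vector by angles that grow with `pos` and shrink with the pair index."""
    d = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # omega_i, one per 2D pair
    angles = pos * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)

    x1, x2 = x[0::2], x[1::2]          # first and second element of each pair
    out = torch.empty_like(x)
    out[0::2] = x1 * cos + x2 * sin    # matches the rotation matrix above
    out[1::2] = -x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)

# The attention score depends only on the relative offset (10 - 7 == 13 - 10):
s1 = torch.dot(rope(q, 10), rope(k, 7))
s2 = torch.dot(rope(q, 13), rope(k, 10))
print(torch.allclose(s1, s2, atol=1e-4))  # True
```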
That's all there is to it! By artfully applying our rotations to 2D chunks of $\mathbf{q}$ and $\mathbf{k}$ prior to their dot product, and switching from additive to
multiplicative position encoding, we gain a large performance boost in evaluations[^4].
Extending RoPE to $n$-Dimensions
We have explored the $1D$ case for RoPE, and by this point I hope you have gained an
intuitive understanding of an admittedly unintuitive component of transformers.
Finally, let's explore extending it to higher dimensions, such as images.
A natural first intuition might be to directly use the $(x, y)$ coordinate pairs from the image. This might seem intuitive – after all, we were almost arbitrarily pairing up our components previously. However, this would be a mistake!
In the $1D$ case, we encode the relative position through a rotation of pairs
of values from our input vector. For $2D$ data, we need to encode both horizontal and vertical relative positions, say $m$ and $n$, independently. RoPE's brilliance lies in how it handles multiple dimensions. Instead of trying to encode all positional information in a single rotation, we pair components within the same dimension and rotate those, otherwise we would be intermixing the $x$ and $y$ offset information. By handling each dimension independently, we maintain the natural structure of the space, and this generalizes to as many dimensions as required! A sketch of this axial approach is shown below.
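Here is a rough sketch of one common way to do this for images: split the vector in half, rotate one half with the row index and the other with the column index (the function names are illustrative):

```python
import torch

def rope_1d(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Same pairwise rotation as the 1D sketch above."""
    d = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    cos, sin = torch.cos(pos * freqs), torch.sin(pos * freqs)
    out = torch.empty_like(x)
    out[0::2] = x[0::2] * cos + x[1::2] * sin
    out[1::2] = -x[0::2] * sin + x[1::2] * cos
    return out

def rope_2d(x: torch.Tensor, row: int, col: int) -> torch.Tensor:
    """Axial 2D RoPE sketch: one half of the components encodes the vertical
    (row) offset, the other half encodes the horizontal (column) offset."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[:half], row), rope_1d(x[half:], col)])

q = torch.randn(64)
print(rope_2d(q, row=3, col=17).shape)  # torch.Size([64])
```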
The long run of positional encoding
Is RoPE the final incarnation of positional encoding? A recent paper from DeepMind deeply analyses RoPE and highlights some fundamental problems. TL;DR: RoPE isn't a perfect solution, and the models mostly focus on the lower frequencies, but the paper shows that removing (not rotating) the lowest frequencies improves performance on Gemma 2B!
I anticipate some future breakthroughs, perhaps taking inspiration from
signal processing with ideas like wavelets or hierarchical implementations. As models
are increasingly quantized for deployment, I’d also expect to see some
innovation in encoding schemes that remain robust under low-precision arithmetic.
Conclusion
Positional encoding has been, and continues to be, treated as an afterthought in
transformers. I believe we should view it differently – self-attention has an
Achilles heel that has been repeatedly patched.
I hope this blog post showed you that you too could have discovered state-of-the-art
positional encoding, despite it being unintuitive at first. In a follow-up
post I would like to explore practical implementation details for RoPE in order to
maximise performance.
This post was originally published here.
References
[^1]: Binary and Sinusoidal animations are reproductions of animations contained
in this video.
[^2]: Using $\theta = 10000$ gives us $2\pi \cdot 10000 \approx 63{,}000$ unique positions, i.e. a
theoretical upper bound on the context length of ~63,000.
[^3]: Pieces of this post are based on this fantastic
post
by Amirhossein Kazemnejad.
[^4]: For empirical evidence, see this great post by EleutherAI.
