Gall's Law
A complex system that works is invariably found to have evolved from a simple system that worked
John Gall
This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models. We'll achieve
this by iteratively improving our approach to encoding position, arriving at Rotary Positional Encoding (RoPE), used in the latest Llama 3.2 release and most modern transformers. This post aims to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry and an understanding of self-attention is assumed.
Problem Statement
You shall know a word by the company it keeps
John Rupert Firth
As with all problems, it's best to first understand exactly what we're trying to achieve. The self-attention mechanism in transformers is used to understand relationships
between tokens in a sequence. Self-attention is a set operation, which
means it is permutation equivariant. If we don't
enrich self-attention with positional information, many important relationships cannot be determined.
This is best demonstrated with an example.
Motivating Example
Consider this sentence with the same word in different positions:
Intuitively, "dog" refers to two different entities. Let's see what happens if we first tokenize the sentence, map the tokens to the real token embeddings of Llama 3.2 1B, and pass them through torch.nn.MultiheadAttention.
```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The dog chased another dog"
tokens = tok(text, return_tensors="pt")["input_ids"]
embeddings = model.embed_tokens(tokens)
hdim = embeddings.shape[-1]

W_q = nn.Linear(hdim, hdim, bias=False)
W_k = nn.Linear(hdim, hdim, bias=False)
W_v = nn.Linear(hdim, hdim, bias=False)
mha = nn.MultiheadAttention(embed_dim=hdim, num_heads=4, batch_first=True)

# Use random (untrained) attention weights
with torch.no_grad():
    for param in mha.parameters():
        nn.init.normal_(param, std=0.1)

output, _ = mha(W_q(embeddings), W_k(embeddings), W_v(embeddings))

# The two occurrences of "dog" sit at token positions 2 and 5 (position 0 is the BOS token)
dog1_out = output[0, 2]
dog2_out = output[0, 5]
print(f"Dog outputs identical?: {torch.allclose(dog1_out, dog2_out, atol=1e-6)}")
```
As we can see, without any positional information, the output of a (multi-headed)
self-attention operation is identical for the same token appearing in
different positions, despite the tokens clearly representing distinct entities. Let's begin designing a method of enhancing self-attention with positional information, so that it can determine relationships between words encoded by
their positions.
How should an ideal positional encoding scheme behave?
Desirable Properties
Let's try to define some desirable properties that will make the optimization
process as easy as possible.
Property 1 – Unique encoding for every position (across sequences)
Each position needs a unique encoding that remains consistent regardless of sequence length – a token at position 5 should have the same encoding whether the current sequence is of length 10 or 10,000.
Property 2 – Linear relation between two encoded positions
The relationship between positions should be mathematically simple. If we know the encoding for position $p$, it should be straightforward to compute the encoding for position $p+k$, making it easier for the model to learn positional patterns.
If you think about how we represent numbers on a number line, it's easy to understand that 5 is 2 steps away from 3, or that 10 is 5 steps away from 15. The same intuitive relationship should exist in our encodings.
Property 3 – Generalizes to longer sequences than those encountered in training
To increase our models' utility in the real world, they should generalize outside
their training distribution. Therefore, our encoding scheme needs to be
adaptable enough to handle unexpected input lengths without
violating any of our other desirable properties.
Property 4 – Generated by a deterministic process the model can learn
It would be ideal if our positional encodings could be drawn from a
deterministic process. This should allow the model to learn the mechanism
behind our encoding scheme efficiently.
Property 5 – Extensible to multiple dimensions
With multimodal models becoming the norm, it's crucial that our positional
encoding scheme can naturally extend from $1D$ to $nD$. This will allow models to consume data like images or brain scans, which are $2D$ and $3D$
respectively.
Now that we know the ideal properties (henceforth referred to as Properties 1–5), let's start designing and iterating on our encoding scheme.
Integer Position Encoding
The first approach that might jump to mind is to simply add the integer value of the token position to each component of the token embedding, with values ranging from $0$ to $L-1$, where $L$ is the
length of our current sequence.
In the above animation, we create the positional encoding vector for the token from its index and add it to the token embedding. The embedding values here are a subset of the real values from Llama 3.2 1B, and we can observe that they are clustered around 0. This
is desirable to avoid vanishing or exploding gradients during training, and is therefore something we want to maintain throughout the model.
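To make this concrete, here is a minimal sketch of the naïve scheme using toy values (the tensor shapes and variable names here are illustrative, not taken from the real model):

```python
import torch

# Toy example: token embeddings clustered around 0, like the real Llama 3.2 1B values.
seq_len, hdim = 6, 8
embeddings = torch.randn(seq_len, hdim) * 0.02

# Integer positional "encoding": add each token's index to every component of its embedding.
positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # shape (seq_len, 1)
encoded = embeddings + positions

print(encoded[5])  # the position value (5.0) dwarfs the ~0.02-scale semantic values
```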
It's clear that our current naïve approach is going to cause problems: the magnitude of the position value
vastly exceeds the actual values of our input. This means the signal-to-noise
ratio is very low, and it's hard for the model to separate the semantic
information from the positional information.
With this new knowledge, a natural follow-on might be to normalize the position value by $\frac{1}{N}$. This constrains the values between 0 and 1, but introduces another problem. If we choose $N$ to be the length of the current sequence, then the position values will be completely different for each sequence of differing length, violating Property 1.
Is there a better way to ensure our numbers are between 0 and 1?
If we thought really hard about this for a while, we might come up with the idea of switching
from decimal to binary numbers.
Binary Position Encoding
Instead of adding our (potentially normalized) integer position to each
component of the embedding, we could instead convert it into its binary
representation and s t r e t c h the value out to match our embedding dimension, as demonstrated below.
We have converted the position of interest (252) into its binary representation
(11111100) and added each bit to the corresponding component of the
token embedding. The least significant bit (LSB) will cycle between 0 and 1 for every
subsequent token, whilst the most significant bit (MSB) will cycle every $2^{n-1}$ tokens, where $n$ is the number of bits.
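As a rough sketch (the function name and the LSB-first bit order are my own choices for illustration), the binary scheme looks something like this:

```python
import torch

def binary_position_encoding(pos: int, dim: int) -> torch.Tensor:
    """Encode `pos` as its bits, least significant bit first, one bit per
    embedding component (assumes dim is at least the number of bits needed)."""
    bits = [(pos >> i) & 1 for i in range(dim)]
    return torch.tensor(bits, dtype=torch.float32)

print(binary_position_encoding(252, 8))  # tensor([0., 0., 1., 1., 1., 1., 1., 1.])
```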
You can see the positional encoding vector for different indices in the animation below[^1].
We have solved the value range problem, and we now have unique encodings that are
consistent across different sequence lengths. What happens if we plot a low-dimensional version of our token embedding and visualize the addition of our binary positional vector for different position values?
We can see that the result is very "jumpy" (as we would expect from the
discrete nature of binary). The optimization process prefers smooth, continuous and
predictable changes. Do we know any functions with similar value ranges that are smooth and continuous?
If we looked around a little, we would notice that both $\sin$ and $\cos$ fit the bill!
Sinusoidal positional encoding
The above animation visualizes our position embedding if each component is
alternately drawn from $\sin$ and $\cos$ with steadily increasing
wavelengths. If you compare it with the previous animation, you'll notice a striking similarity!
We have now arrived at Sinusoidal Embeddings, originally defined in the Attention Is All You Need paper.
Let's take a look at the equations:

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \qquad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the token's position index, $2i$ and $2i+1$ are the even and odd component indices
in the positional encoding vector, and $d$ is the model dimension. $10000$ is the base wavelength (henceforth referred to as $\theta$), which we stretch or compress as a function of the component index. I encourage you to plug in some realistic values to get a feel for this
geometric progression.
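Here is a small sketch of what "plugging in some values" could look like in code (a straightforward reading of the equations above, not an optimized implementation):

```python
import torch

def sinusoidal_encoding(seq_len: int, d_model: int, base: float = 10000.0) -> torch.Tensor:
    """Build the sinusoidal positional encoding matrix (assumes d_model is even)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even component indices 2i
    freqs = 1.0 / base ** (i / d_model)                            # geometric progression of frequencies
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freqs)                           # even components use sin
    pe[:, 1::2] = torch.cos(pos * freqs)                           # odd components use cos
    return pe

pe = sinusoidal_encoding(seq_len=128, d_model=64)
print(pe.shape, pe.min().item(), pe.max().item())  # values stay within [-1, 1]
```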
There are a few parts of these equations that are confusing at first glance. How did the
authors choose $10000$? And why are we using $\sin$ and $\cos$ for the even and odd components respectively?
It seems that using $10000$ for the base wavelength was determined experimentally[^2]. Deciphering the use of both $\sin$ and $\cos$ is more involved, but crucial
for our iterative approach to understanding. The key here is our desire for a linear relation between two encoded positions (Property 2). To understand how using $\sin$ and $\cos$ in tandem produces this linear relation, we will have to dive into some trigonometry.
Consider a sequence of sine and cosine pairs, each associated with a frequency $\omega_i$. Our goal is to find a linear transformation matrix $\mathbf{M}$ that can shift these sinusoidal functions by a fixed offset $k$:

$$\mathbf{M} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i (p + k)) \\ \cos(\omega_i (p + k)) \end{bmatrix}$$
The frequencies follow a geometric progression that decreases with the dimension index $i$, defined as:

$$\omega_i = \frac{1}{10000^{2i/d}}$$
To find this transformation matrix, we can express it as a general 2×2 matrix with unknown coefficients $u_1$, $v_1$, $u_2$ and $v_2$:

$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i (p + k)) \\ \cos(\omega_i (p + k)) \end{bmatrix}$$
By applying the trigonometric addition theorem to the right-hand side, we can expand this into:

$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i p)\cos(\omega_i k) + \cos(\omega_i p)\sin(\omega_i k) \\ \cos(\omega_i p)\cos(\omega_i k) - \sin(\omega_i p)\sin(\omega_i k) \end{bmatrix}$$
This expansion gives us a system of two equations by matching coefficients:

$$\begin{align*} u_1 \sin(\omega_i p) + v_1 \cos(\omega_i p) &= \cos(\omega_i k)\sin(\omega_i p) + \sin(\omega_i k)\cos(\omega_i p) \\ u_2 \sin(\omega_i p) + v_2 \cos(\omega_i p) &= -\sin(\omega_i k)\sin(\omega_i p) + \cos(\omega_i k)\cos(\omega_i p) \end{align*}$$

By comparing the terms with $\sin(\omega_i p)$ and $\cos(\omega_i p)$ on both sides, we can solve for the unknown coefficients:

$$u_1 = \cos(\omega_i k), \quad v_1 = \sin(\omega_i k), \quad u_2 = -\sin(\omega_i k), \quad v_2 = \cos(\omega_i k)$$
These solutions give us our final transformation matrix $\mathbf{M}_k$:

$$\mathbf{M}_k = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}$$
If you've done any game programming before, you might notice that the
result of our derivation is oddly familiar. That's right, it's the Rotation Matrix![^3]
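If you want to convince yourself numerically, here is a quick sanity check of the derivation (the values of $\omega$, $p$ and $k$ are arbitrary):

```python
import math
import torch

omega, p, k = 0.3, 7.0, 5.0  # arbitrary frequency, position and offset

# The transformation matrix we derived; note it depends only on the offset k.
M_k = torch.tensor([[math.cos(omega * k),  math.sin(omega * k)],
                    [-math.sin(omega * k), math.cos(omega * k)]])

enc_p = torch.tensor([math.sin(omega * p), math.cos(omega * p)])
enc_p_plus_k = torch.tensor([math.sin(omega * (p + k)), math.cos(omega * (p + k))])

# M_k maps the encoding at position p to the encoding at position p + k.
print(torch.allclose(M_k @ enc_p, enc_p_plus_k, atol=1e-6))  # True
```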
So the encoding scheme designed by Noam Shazeer in Attention Is All You Need was already encoding relative position as a rotation back in 2017! It took another 4 years to go from Sinusoidal Encoding to RoPE, despite rotations already being on the table…
Absolute vs Relative Position Encoding
With the knowledge that rotations are important here in hand, let's
return to our motivating example and try to discover some intuitions for our next iteration.
$$\begin{align*} &\hspace{0.3em}\text{-2} \hspace{1.4em} \text{-1} \hspace{1.7em} 0 \hspace{2.6em} 1 \hspace{2.4em} 2 \\ &\text{The dog } \color{#699C52}\text{chased } \color{black}\text{another dog} \end{align*}$$
Above we can see the absolute positions of our tokens, and the relative
positions from "chased" to every other token. With Sinusoidal Encoding, we
generated a separate vector that represents the absolute position,
and using some trigonometric trickery we were able to encode relative positions.
When we're trying to understand these sentences, does it matter that this word is the 2157th word in this blog post? Or do we care about its relationship to the words around it? The absolute position of a word rarely matters for meaning – what matters is how words relate to each other.
Positional encoding in context
From this point on, it's key to consider positional encoding in the context of
self-attention. To reiterate, the self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.
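As a reminder, scaled dot-product attention, as defined in Attention Is All You Need, computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ are the query, key and value projections of our (position-enriched) token embeddings, and $d_k$ is the key dimension.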
In all of our previous iterations, we have generated a separate positional encoding
vector and added it to our token embedding prior to our $Q$, $K$ and $V$ projections.
By adding the positional information directly to our token embedding, we are
polluting the semantic information with the positional information. We should
be trying to encode the positional information without modifying the norm. Shifting to multiplicative encoding is the
key.
Using the dictionary analogy, when looking up a word (query) in our dictionary (keys), nearby words should have more influence than distant ones. The influence of one token upon another is determined by the $QK^T$ dot product – so that's exactly where we should focus our positional encoding!
The geometric interpretation of the dot product, $\mathbf{q} \cdot \mathbf{k} = \|\mathbf{q}\|\,\|\mathbf{k}\|\cos\theta$, gives us a powerful insight.
We can modulate the result of the dot product of two vectors purely by
increasing or decreasing the angle between them. Furthermore, by rotating a
vector, we have absolutely zero impact on its norm, which encodes
the semantic information of our token.
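A tiny sketch makes this concrete: rotating a 2D vector leaves its norm untouched while changing its dot product with another vector (the vectors and the angle below are arbitrary):

```python
import math
import torch

def rotate_2d(v: torch.Tensor, angle: float) -> torch.Tensor:
    """Rotate a 2D vector by `angle` radians."""
    R = torch.tensor([[math.cos(angle), -math.sin(angle)],
                      [math.sin(angle),  math.cos(angle)]])
    return R @ v

q = torch.tensor([1.5, 0.5])
k = torch.tensor([0.8, 1.2])
q_rot = rotate_2d(q, 0.7)

print(torch.isclose(q.norm(), q_rot.norm()))  # True -> the "semantic" norm is preserved
print(torch.dot(q, k).item(), torch.dot(q_rot, k).item())  # different -> the attention score changes
```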
So now that we know where to focus our attention, and have seen from another angle why a
rotation is a sensible "channel" through which to encode our positional
information, let's put it all together!
Rotary Positional Encoding
Rotary Positional Encoding, or RoPE, was defined in the
RoFormer paper (Jianlin Su designed it independently on his blog here and here).
While it may seem like voodoo if you skip straight to the end result, by thinking about Sinusoidal Encoding in the
context of self-attention (and more specifically dot products), we can see how
it all comes together.
Much like in Sinusoidal Encoding, we decompose our vectors $\mathbf{q}$ or $\mathbf{k}$ (post-projection, instead of the pre-projection $\mathbf{x}$) into 2D pairs/chunks. Rather than encoding absolute position directly by adding a vector drawn from sinusoidal functions of slowly decreasing frequencies, we cut to the chase and encode relative position by multiplying each pair with the rotation matrix.
Let $\mathbf{q}$ or $\mathbf{k}$ be our input vector at position $p$. We create a block diagonal matrix
where $\mathbf{M}_i$ is the corresponding rotation matrix for that component
pair's desired rotation:

$$R(\mathbf{q}, p) = \begin{bmatrix} \mathbf{M}_1 & & & \\ & \mathbf{M}_2 & & \\ & & \ddots & \\ & & & \mathbf{M}_{d/2} \end{bmatrix} \begin{bmatrix} q_1 \\ q_2 \\ \vdots \\ q_d \end{bmatrix}$$
Much the same as in Sinusoidal Encoding, $\mathbf{M}_i$ is simply:

$$\mathbf{M}_i = \begin{bmatrix} \cos(\omega_i p) & \sin(\omega_i p) \\ -\sin(\omega_i p) & \cos(\omega_i p) \end{bmatrix}$$
In practice, we don't use a matrix multiplication to compute RoPE, as it would be
computationally inefficient with such a sparse matrix. Instead, we can directly apply the rotations to pairs of elements independently, taking advantage of the regular pattern in the computation:

$$R(\mathbf{q}, p) = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \\ \vdots \\ q_{d-1} \\ q_d \end{bmatrix} \odot \begin{bmatrix} \cos(\omega_1 p) \\ \cos(\omega_1 p) \\ \cos(\omega_2 p) \\ \cos(\omega_2 p) \\ \vdots \\ \cos(\omega_{d/2} p) \\ \cos(\omega_{d/2} p) \end{bmatrix} + \begin{bmatrix} q_2 \\ -q_1 \\ q_4 \\ -q_3 \\ \vdots \\ q_d \\ -q_{d-1} \end{bmatrix} \odot \begin{bmatrix} \sin(\omega_1 p) \\ \sin(\omega_1 p) \\ \sin(\omega_2 p) \\ \sin(\omega_2 p) \\ \vdots \\ \sin(\omega_{d/2} p) \\ \sin(\omega_{d/2} p) \end{bmatrix}$$
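Here is a minimal sketch of RoPE applied to a single query/key vector, following the pairwise form above (for clarity only – production implementations batch this over heads and sequence positions, and often use an equivalent "rotate-half" layout; the function name here is my own):

```python
import torch

def rope(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive pairs (x_1, x_2), (x_3, x_4), ... of a single query/key
    vector by angles that grow with `pos` and shrink with the pair index."""
    d = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)  # omega_i, one per 2D pair
    angles = pos * freqs
    cos, sin = torch.cos(angles), torch.sin(angles)

    x1, x2 = x[0::2], x[1::2]          # first and second element of each pair
    out = torch.empty_like(x)
    out[0::2] = x1 * cos + x2 * sin    # matches the rotation matrix above
    out[1::2] = -x1 * sin + x2 * cos
    return out

q, k = torch.randn(64), torch.randn(64)

# The attention score depends only on the relative offset (10 - 7 == 13 - 10):
s1 = torch.dot(rope(q, 10), rope(k, 7))
s2 = torch.dot(rope(q, 13), rope(k, 10))
print(torch.allclose(s1, s2, atol=1e-4))  # True
```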
That's all there is to it! By artfully applying our rotations to 2D chunks of $\mathbf{q}$ and $\mathbf{k}$ prior to their dot product, and switching from additive to
multiplicative position encoding, we gain a large performance boost in evaluations[^4].
Extending RoPE to $n$-Dimensions
We have explored the $1D$ case for RoPE, and by this point I hope you have gained an
intuitive understanding of an admittedly unintuitive component of transformers.
Finally, let's explore extending it to higher dimensions, such as images.
A natural first intuition might be to directly use the $(x, y)$ coordinate pairs from the image. This might seem intuitive – after all, we were almost arbitrarily pairing up our components previously. However, this would be a mistake!
In the $1D$ case, we encode the relative position through a rotation of pairs
of values from our input vector. For $2D$ data, we need to encode both horizontal and vertical relative positions, say $m$ and $n$, independently. RoPE's brilliance lies in how it handles multiple dimensions. Instead of trying to encode all positional information in a single rotation, we pair components within the same dimension and rotate those, otherwise we would be intermixing the $x$ and $y$ offset information. By handling each dimension independently, we maintain the natural structure of the space, and this generalizes to as many dimensions as required! A sketch of this axial approach is shown below.
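Here is a rough sketch of one common way to do this for images: split the vector in half, rotate one half with the row index and the other with the column index (the function names are illustrative):

```python
import torch

def rope_1d(x: torch.Tensor, pos: int, base: float = 10000.0) -> torch.Tensor:
    """Same pairwise rotation as the 1D sketch above."""
    d = x.shape[-1]
    freqs = 1.0 / base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    cos, sin = torch.cos(pos * freqs), torch.sin(pos * freqs)
    out = torch.empty_like(x)
    out[0::2] = x[0::2] * cos + x[1::2] * sin
    out[1::2] = -x[0::2] * sin + x[1::2] * cos
    return out

def rope_2d(x: torch.Tensor, row: int, col: int) -> torch.Tensor:
    """Axial 2D RoPE sketch: one half of the components encodes the vertical
    (row) offset, the other half encodes the horizontal (column) offset."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[:half], row), rope_1d(x[half:], col)])

q = torch.randn(64)
print(rope_2d(q, row=3, col=17).shape)  # torch.Size([64])
```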
The long run of positional encoding
Is RoPE the final incarnation of positional encoding? A recent paper from DeepMind deeply analyses RoPE and highlights some fundamental problems. TL;DR: RoPE isn't a perfect solution, and the models mostly focus on the lower frequencies, but the paper shows that removing (not rotating) the lowest frequencies improves performance on Gemma 2B!
I anticipate some future breakthroughs, perhaps taking inspiration from
signal processing with ideas like wavelets or hierarchical implementations. As models
are increasingly quantized for deployment, I’d also expect to see some
innovation in encoding schemes that remain robust under low-precision arithmetic.
Conclusion
Positional encoding has been, and continues to be, treated as an afterthought in
transformers. I believe we should view it differently – self-attention has an
Achilles heel that has been repeatedly patched.
I hope this blog post showed you that you too could have discovered state-of-the-art
positional encoding, despite it being unintuitive at first. In a follow-up
post I would like to explore practical implementation details for RoPE in order to
maximise performance.
This post was originally published here.
References
[^1]: Binary and Sinusoidal animations are reproductions of animations contained
in this video.
[^2]: Using $\theta = 10000$ gives us $2\pi \cdot 10000 \approx 63{,}000$ unique positions, i.e. a
theoretical upper bound on the context length of ~63,000.
[^3]: Pieces of this post are based on this fantastic
post
by Amirhossein Kazemnejad.
[^4]: For empirical evidence, see this great post by EleutherAI.
