You could have designed state-of-the-art positional encoding

Christopher Fleetwood

Gall’s Law
A complex system that works is invariably found to have evolved from a simple system that worked.
John Gall

This post walks you through the step-by-step discovery of state-of-the-art positional encoding in transformer models. We'll achieve
this by iteratively improving our approach to encoding position, arriving at Rotary Positional Encoding (RoPE), used in the latest Llama 3.2 release and most modern transformers. This post aims to limit the mathematical knowledge required to follow along, but some basic linear algebra, trigonometry, and an understanding of self-attention is expected.



Problem Statement

You shall know a word by the company it keeps
John Rupert Firth

As with all problems, it is best to start by understanding exactly what we are trying to achieve. The self-attention mechanism in transformers is used to understand relationships
between tokens in a sequence. Self-attention is a set operation, which
means it is permutation equivariant. If we do not
enrich self-attention with positional information, many important relationships
cannot be determined.

This is best demonstrated by an example.



Motivating Example

Consider this sentence with the same word in different positions:

"The dog chased another dog"

Intuitively, each "dog" refers to a different entity. Let's see what happens if we first tokenize the sentence, map the tokens to the real token embeddings of Llama 3.2 1B, and pass them through torch.nn.MultiheadAttention.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

model_id = "meta-llama/Llama-3.2-1B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The dog chased one other dog"
tokens = tok(text, return_tensors="pt")["input_ids"]
embeddings = model.embed_tokens(tokens)
hdim = embeddings.shape[-1]

W_q = nn.Linear(hdim, hdim, bias=False)
W_k = nn.Linear(hdim, hdim, bias=False)
W_v = nn.Linear(hdim, hdim, bias=False)
mha = nn.MultiheadAttention(embed_dim=hdim, num_heads=4, batch_first=True)

with torch.no_grad():
    for param in mha.parameters():
        nn.init.normal_(param, std=0.1) 

output, _ = mha(W_q(embeddings), W_k(embeddings), W_v(embeddings))

# Indices 2 and 5 are the two "dog" tokens (index 0 is the BOS token)
dog1_out = output[0, 2]
dog2_out = output[0, 5]
print(f"Dog outputs identical?: {torch.allclose(dog1_out, dog2_out, atol=1e-6)}")  # True

As we can see, without any positional information, the output of a (multi-headed) self-attention operation is identical for the same token in
different positions, despite those tokens clearly representing distinct entities. Let's begin designing a method of enhancing self-attention with positional information, so that it can determine relationships between words encoded by
their positions.

How should an ideal positional encoding scheme behave?



Desirable Properties

Let's try to define some desirable properties that will make the optimization
process as easy as possible.



Property 1 – Unique encoding for every position (across sequences)

Each position needs a unique encoding that remains consistent regardless of sequence length – a token at position 5 should have the same encoding whether the current sequence is of length 10 or 10,000.



Property 2 – Linear relation between two encoded positions

The relationship between positions should be mathematically simple. If we know the encoding for position $p$, it should be straightforward to compute the encoding for position $p+k$.

If you think about how we represent numbers on a number line, it is easy to understand that 5 is 2 steps away from 3, or that 10 is 5 steps away from 15. The same intuitive relationship should exist in our encodings.



Property 3 – Generalizes to longer sequences than those encountered in training

To increase our models' utility in the real world, they should generalize outside
their training distribution. Therefore, our encoding scheme must be
adaptable enough to handle unexpected input lengths, without
violating any of our other desirable properties.



Property 4 – Generated by a deterministic process the model can learn

It would be ideal if our positional encodings could be drawn from a
deterministic process. This should allow the model to learn the mechanism
behind our encoding scheme efficiently.



Property 5 – Extensible to multiple dimensions

With multimodal models becoming the norm, it is crucial that our positional
encoding scheme can naturally extend from $1D$ to $nD$. This will allow models to consume data like images or brain scans, which are $2D$ and $4D$
respectively.

Now that we know the ideal properties (henceforth referred to as $Pr_n$), let's start designing and iterating on our encoding scheme.



Integer Position Encoding

The first approach that might jump to mind is simply to add the integer value of the token position to each component of the token embedding, with values ranging from $0 \rightarrow L$, where $L$ is the length of the current sequence.

In the animation above, we create the positional encoding vector for the token "chased" from its index and add it to our token embedding. The embedding values here are a subset of the real values from Llama 3.2 1B. We can observe that they are clustered around 0. This
is desirable to avoid vanishing or exploding gradients during training, and is therefore something we want to maintain throughout the model.

It is clear that our current naïve approach is going to cause problems. The magnitude of the position value
vastly exceeds the actual values of our input. This means the signal-to-noise
ratio is very low, and it is hard for the model to separate the semantic
information from the positional information.
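To make this concrete, here is a minimal sketch (using toy, randomly generated embeddings rather than the real Llama values) of integer position encoding and the scale mismatch it creates:

import torch

# Integer position encoding: add the raw position index to every component.
torch.manual_seed(0)
seq_len, dim = 10, 8
embeddings = torch.randn(seq_len, dim) * 0.02          # small values, like real token embeddings
positions = torch.arange(seq_len, dtype=torch.float32)

int_encoded = embeddings + positions[:, None]           # broadcast position across the embedding

print(f"embedding scale: {embeddings.abs().mean().item():.4f}")   # tiny
print(f"encoded scale:   {int_encoded.abs().mean().item():.4f}")  # dominated by the position value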

With this new knowledge, a natural follow-on might be to normalize the position value by $\frac{1}{N}$, where $N$ is the sequence length. This constrains the values to the range $[0, 1]$, but it violates our first property: a token's encoding now depends on the total sequence length, so a token at position 5 gets a different encoding in a sequence of length 10 than in one of length 100.

Is there a better way to ensure our numbers are between 0 and 1?
If we thought really hard about this for a while, we might come up with switching
from decimal to binary numbers.



Binary Position Encoding

Instead of adding our (potentially normalized) integer position to each
component of the embedding, we could convert it into its binary
representation and s t r e t c h our value out to match our embedding dimension, as demonstrated below.

We have converted the position of interest (252) into its binary representation
(11111100) and added each bit to the corresponding component of the
token embedding. The least significant bit (LSB) will cycle between 0 and 1 for every
subsequent token, whilst the most significant bit (MSB) will cycle every $2^{n-1}$ tokens, where $n$ is the number of bits.
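As a quick sketch (ordering the bits least-significant first, which is an arbitrary choice here), such an encoding might look like this:

import torch

# Binary position encoding: each bit of the position index becomes one component.
def binary_position_encoding(position: int, dim: int) -> torch.Tensor:
    # Bit i cycles with period 2^(i+1): the LSB flips every token,
    # higher bits flip progressively less often.
    bits = [(position >> i) & 1 for i in range(dim)]
    return torch.tensor(bits, dtype=torch.float32)

print(binary_position_encoding(252, 8))  # tensor([0., 0., 1., 1., 1., 1., 1., 1.])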

We have solved the value range problem, and we now have unique encodings that are
consistent across different sequence lengths. What happens if we plot a low-dimensional version of our token embedding and visualize the addition of our binary positional vector for different positions?

We can see that the result is very "jumpy" (as we might expect from the
discrete nature of binary). The optimization process prefers smooth, continuous and
predictable changes. Do we know any functions with similar value ranges that are smooth and continuous?

If we looked around a little, we might notice that both $\sin$ and $\cos$ fit the bill!



Sinusoidal positional encoding

The animation above visualizes our position embedding if each component is
alternately drawn from $\sin$ and $\cos$ with gradually increasing
wavelengths. If you compare it with the previous animation, you will notice a striking similarity!

We have arrived at Sinusoidal embeddings, originally defined in the Attention Is All You Need paper.
Let's look at the equations:

$$PE_{(pos,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) \quad PE_{(pos,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the token position index, $i$ is the component index
in the positional encoding vector, and $d$ is the model dimension. 10,000 is the base wavelength (henceforth referred to as $\theta$), which we stretch or compress as a function of the component index.
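Here is a minimal sketch of these equations in code (the sequence length and model dimension below are arbitrary choices for illustration):

import torch

# Sinusoidal positional encoding, following the equations above.
def sinusoidal_position_encoding(seq_len: int, d: int, base: float = 10_000.0) -> torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32)[:, None]  # (seq_len, 1)
    two_i = torch.arange(0, d, 2, dtype=torch.float32)[None, :]      # (1, d/2): 2i for each pair
    angles = positions / base ** (two_i / d)                         # pos / 10000^(2i/d)
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(angles)  # even components
    pe[:, 1::2] = torch.cos(angles)  # odd components
    return pe

pe = sinusoidal_position_encoding(seq_len=128, d=64)
print(pe.shape)  # torch.Size([128, 64])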

There are a few parts of this equation that are confusing at first glance. How did the
authors choose 10,000? And why do we use $\sin$ for even components and $\cos$ for odd ones?

It appears that using 10,000 for the base wavelength was determined experimentally [^2]. Deciphering the use of both $\sin$ and $\cos$ together is more involved, but the key lies in our desire for a linear relation between two encoded positions ($Pr_2$). To see how the pair produces this linear relation, we need a little trigonometry.

Consider a sequence of sine and cosine pairs, each associated with a frequency $\omega_i$. Our goal is to find a linear transformation matrix $\mathbf{M}$ that can shift these sinusoidal functions by a fixed offset $k$:

$$\mathbf{M} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i(p + k)) \\ \cos(\omega_i(p + k)) \end{bmatrix}$$

The frequencies $\omega_i$ decrease geometrically with the component index $i$ and are defined as:

$$\omega_i = \frac{1}{10000^{2i/d}}$$

To find this transformation matrix, we can express it as a general 2×2 matrix with unknown coefficients $u_1$, $v_1$, $u_2$ and $v_2$:

$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i(p+k)) \\ \cos(\omega_i(p+k)) \end{bmatrix}$$

By applying the trigonometric addition theorem to the right-hand side, we can expand this into:

$$\begin{bmatrix} u_1 & v_1 \\ u_2 & v_2 \end{bmatrix} \cdot \begin{bmatrix} \sin(\omega_i p) \\ \cos(\omega_i p) \end{bmatrix} = \begin{bmatrix} \sin(\omega_i p)\cos(\omega_i k) + \cos(\omega_i p)\sin(\omega_i k) \\ \cos(\omega_i p)\cos(\omega_i k) - \sin(\omega_i p)\sin(\omega_i k) \end{bmatrix}$$

This expansion gives us a system of two equations by matching coefficients:

$$\begin{aligned} u_1\sin(\omega_i p) + v_1\cos(\omega_i p) &= \cos(\omega_i k)\sin(\omega_i p) + \sin(\omega_i k)\cos(\omega_i p) \\ u_2\sin(\omega_i p) + v_2\cos(\omega_i p) &= -\sin(\omega_i k)\sin(\omega_i p) + \cos(\omega_i k)\cos(\omega_i p) \end{aligned}$$

By comparing the coefficients of $\sin(\omega_i p)$ and $\cos(\omega_i p)$ on both sides, we can solve for the unknown coefficients:

$$\begin{aligned} u_1 &= \cos(\omega_i k) & v_1 &= \sin(\omega_i k) \\ u_2 &= -\sin(\omega_i k) & v_2 &= \cos(\omega_i k) \end{aligned}$$

These solutions give us our final transformation matrix $\mathbf{M_k}$:

$$\mathbf{M_k} = \begin{bmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{bmatrix}$$

If you've done any game programming before, you might notice that the
result of our derivation is oddly familiar. That's right, it's the Rotation Matrix! [^3]
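If you want to convince yourself, here is a quick numeric check (my own addition, with arbitrarily chosen $\omega$, $p$ and $k$) that the derived matrix really shifts the encoding from position $p$ to position $p + k$:

import math
import torch

omega, p, k = 0.05, 7.0, 3.0

M_k = torch.tensor([
    [ math.cos(omega * k), math.sin(omega * k)],
    [-math.sin(omega * k), math.cos(omega * k)],
])
enc_p        = torch.tensor([math.sin(omega * p), math.cos(omega * p)])
enc_p_plus_k = torch.tensor([math.sin(omega * (p + k)), math.cos(omega * (p + k))])

print(torch.allclose(M_k @ enc_p, enc_p_plus_k, atol=1e-6))  # True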

So the encoding scheme designed by Noam Shazeer in Attention Is All You Need was already encoding relative position as a rotation back in 2017! It took another 4 years to go from Sinusoidal Encoding to RoPE, despite rotations already being on the table…



Absolute vs Relative Position Encoding

With the knowledge that rotations are important here in hand, let's
return to our motivating example and try to find some intuitions for our next iteration.

Absolute positions:                  The (0)   dog (1)   chased (2)   another (3)   dog (4)

Relative positions from "chased":    The (-2)  dog (-1)  chased (0)   another (1)   dog (2)

Above we can see the absolute positions of our tokens, and the relative
positions from "chased" to every other token. With Sinusoidal Encoding, we
generated a separate vector that represents the absolute position,
and using some trigonometric trickery we were able to encode relative positions.

When we're trying to understand these sentences, does it matter that this word is the 2157th word in this blog post? Or do we care about its relationship to the words around it? The absolute position of a word rarely matters for meaning – what matters is how words relate to each other.



Positional encoding in context

From this point on, it is key to consider positional encoding in the context of
self-attention. To reiterate, the self-attention mechanism enables the model to weigh the importance of different elements in an input sequence and dynamically adjust their influence on the output.

$$\text{Attn}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

In all our previous iterations, we've generated a separate positional encoding
vector and added it to our token embedding prior to our $Q$, $K$ and $V$ projections.
By adding the positional information directly to our token embedding, we are
polluting the semantic information with the positional information. We should
be trying to encode the positional information without modifying the norm. Shifting to a multiplicative approach is the
key.

Using the dictionary analogy, when looking up a word (query) in our dictionary (keys), nearby words should have more influence than distant ones. The influence of one token upon another is determined by the $QK^T$ dot product – so that is exactly where we should focus our positional encoding!

$$\vec{a} \cdot \vec{b} = |\vec{a}| |\vec{b}| \cos \theta$$

The geometric interpretation of the dot product shown above gives us a magnificent insight.
We can modulate the result of the dot product of two vectors purely by
increasing or decreasing the angle between them. Furthermore, by rotating a
vector, we have absolutely zero impact on its norm, which encodes
the semantic information of our token.
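A tiny sketch (with toy 2D vectors of my own choosing) makes this concrete: rotating $\vec{q}$ changes its dot product with $\vec{k}$ while leaving its norm untouched:

import math
import torch

def rotate_2d(v: torch.Tensor, angle: float) -> torch.Tensor:
    R = torch.tensor([
        [math.cos(angle), -math.sin(angle)],
        [math.sin(angle),  math.cos(angle)],
    ])
    return R @ v

q = torch.tensor([1.5, 0.5])
k = torch.tensor([0.8, 1.2])

for angle in [0.0, math.pi / 4, math.pi / 2]:
    q_rot = rotate_2d(q, angle)
    print(f"angle={angle:.2f}  dot={torch.dot(q_rot, k).item():.3f}  norm={q_rot.norm().item():.3f}")
# The dot product varies with the angle; the norm stays constant.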

So now that we know where to focus our attention, and have seen from another angle why a
rotation might be a sensible "channel" through which to encode our positional
information, let's put it all together!



Rotary Positional Encoding

Rotary Positional Encoding, or RoPE, was defined in the
RoFormer paper (Jianlin Su designed it independently on his blog here and here).
While it may seem like voodoo if you skip straight to the end result, by thinking about Sinusoidal Encoding in the
context of self-attention (and more specifically dot products), we can see how
it all comes together.

Much like in Sinusoidal Encoding, we decompose our vectors $\mathbf{q}$ and $\mathbf{k}$ (rather than the pre-projection $\mathbf{x}$) into 2D pairs/chunks. Rather than encoding absolute position directly by adding a vector drawn from sinusoidal functions of slowly decreasing frequencies, we cut to the chase and encode relative position by multiplying each pair with the rotation matrix.

Let $\mathbf{q}$ or $\mathbf{k}$ be our input vector at position $p$. We create a block diagonal matrix
where $\mathbf{M_i}$ is the rotation matrix for the corresponding pair of components:

$$R(\mathbf{q}, p) = \begin{pmatrix} \mathbf{M_1} & & & \\ & \mathbf{M_2} & & \\ & & \ddots & \\ & & & \mathbf{M_{d/2}} \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \\ \vdots \\ q_d \end{pmatrix}$$

Much the same as in Sinusoidal Encoding, $\mathbf{M_i}$ is simply a rotation matrix:

$$\mathbf{M_i} = \begin{bmatrix} \cos(\omega_i p) & \sin(\omega_i p) \\ -\sin(\omega_i p) & \cos(\omega_i p) \end{bmatrix}$$

In practice, we don't use a matrix multiplication to compute RoPE, as it would be
computationally inefficient with such a sparse matrix. Instead, we can directly apply the rotations to pairs of elements independently, taking advantage of the regular pattern in the computation:

$$R_{\Theta,p}^d q = \begin{pmatrix} q_1 \\ q_2 \\ q_3 \\ q_4 \\ \vdots \\ q_{d-1} \\ q_d \end{pmatrix} \otimes \begin{pmatrix} \cos p\theta_1 \\ \cos p\theta_1 \\ \cos p\theta_2 \\ \cos p\theta_2 \\ \vdots \\ \cos p\theta_{d/2} \\ \cos p\theta_{d/2} \end{pmatrix} + \begin{pmatrix} -q_2 \\ q_1 \\ -q_4 \\ q_3 \\ \vdots \\ -q_d \\ q_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \sin p\theta_1 \\ \sin p\theta_1 \\ \sin p\theta_2 \\ \sin p\theta_2 \\ \vdots \\ \sin p\theta_{d/2} \\ \sin p\theta_{d/2} \end{pmatrix}$$
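Below is a minimal sketch of this element-wise formulation in code. It pairs adjacent components $(q_1, q_2), (q_3, q_4), \dots$ exactly as written above; production implementations (such as the Hugging Face Llama code) often use an equivalent layout that pairs the first and second halves of the vector instead:

import torch

# RoPE applied element-wise to a single vector at position `pos`.
def rope(x: torch.Tensor, pos: int, base: float = 10_000.0) -> torch.Tensor:
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,) frequencies
    angles = pos * theta
    cos = torch.repeat_interleave(torch.cos(angles), 2)  # (d,): each cos repeated for its pair
    sin = torch.repeat_interleave(torch.sin(angles), 2)  # (d,)

    # Build (-q2, q1, -q4, q3, ...) to pair with the sin terms.
    x_pairs = x.reshape(-1, 2)
    x_rotated = torch.stack((-x_pairs[:, 1], x_pairs[:, 0]), dim=-1).reshape(-1)

    return x * cos + x_rotated * sin

q = torch.randn(8)
q_rope = rope(q, pos=5)
print(q_rope.shape)                                        # torch.Size([8])
print(torch.allclose(q.norm(), q_rope.norm(), atol=1e-5))  # True: rotations preserve the norm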

That's all there is to it! By artfully applying our rotations to 2D chunks of $\mathbf{q}$ and $\mathbf{k}$ prior to their dot product, and switching from additive to
multiplicative, we can gain a big performance boost in evaluations [^4].



Extending RoPE to n-Dimensions

We've explored the $1D$ case for RoPE, and by this point I hope you've gained an
intuitive understanding of an admittedly unintuitive component of transformers.
Finally, let's explore extending it to higher dimensions, such as images.

A natural first intuition could be to directly use the $\begin{bmatrix} x \\ y \end{bmatrix}$ coordinate pair as the quantity we rotate by. However, this would entangle the two axes into a single angle and lose information about the independent horizontal and vertical offsets.

In the $1D$ case, we encode the relative position $m - n$ through a rotation. For $2D$ data such as images, we instead want to encode the horizontal and vertical relative positions independently. The standard approach is to split the embedding dimensions in half, applying 1D RoPE with the $x$ coordinate to one half and with the $y$ coordinate to the other, so that each axis gets its own set of rotations.
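As a hedged sketch of this idea (reusing the rope function from the previous sketch, with an arbitrary toy dimension), encoding the two axes independently could look like this:

import torch

# 2D RoPE sketch: rotate one half of the dimensions by the x position
# and the other half by the y position.
def rope_2d(q: torch.Tensor, x_pos: int, y_pos: int) -> torch.Tensor:
    d = q.shape[-1]
    assert d % 4 == 0, "each axis needs a whole number of 2D pairs"
    q_x, q_y = q[: d // 2], q[d // 2 :]
    return torch.cat((rope(q_x, pos=x_pos), rope(q_y, pos=y_pos)))

q = torch.randn(16)
print(rope_2d(q, x_pos=3, y_pos=7).shape)  # torch.Size([16])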



The future of positional encoding

Is RoPE the final incarnation of positional encoding? This recent paper from DeepMind deeply analyses RoPE and highlights some fundamental problems. TLDR: RoPE isn't a perfect solution, and the models mostly attend to the lower frequencies, but the paper shows that removing (rather than rotating) the lowest frequencies improves performance on Gemma 2B!

I anticipate some future breakthroughs, perhaps taking inspiration from
signal processing with ideas like wavelets or hierarchical implementations. As models
are increasingly quantized for deployment, I’d also expect to see some
innovation in encoding schemes that remain robust under low-precision arithmetic.



Conclusion

Positional encoding has been, and continues to be, treated as an afterthought in
transformers. I believe we should view it differently – self-attention has an
Achilles heel that has been repeatedly patched.

I hope this blog post showed you that you too could have discovered state-of-the-art positional encoding, despite it being unintuitive at first. In a follow-up
post I'd like to explore practical implementation details for RoPE in order to
maximise performance.




References

[^1]: Binary and Sinusoidal animations are reproductions of animations contained
in this video.

[^2]: Using $\theta = 10000$.

[^3]: Pieces of this post are based on this fantastic post by Amirhossein Kazemnejad.

[^4]: For empirical evidence, see this great post by EleutherAI.


