
Deep Dive into Transformers by Hand ✍︎


Explore the key ideas behind the power of transformers

There was a recent development in our neighborhood.

A ‘Robo-Truck,’ as my son likes to call it, has made its new home on our street.

It’s a Tesla Cybertruck, and I have tried to explain that name to my son again and again, but he insists on calling it Robo-Truck. Now every time I look at Robo-Truck and hear that name, it reminds me of the movie Transformers, where robots could transform to and from cars.

And isn’t it uncanny that Transformers as we know them today could very well be on their way to powering these Robo-Trucks? It’s almost a full-circle moment. But where am I going with all this?

Well, I’m heading to the destination: Transformers. Not the robot-car ones, but the neural network ones. And you are invited!

Image by author (Our Transformer, ‘Robtimus Prime’. Colors as mandated by my artist son.)

What are Transformers?

Transformers are essentially neural networks: neural networks that specialize in learning context from the data.

But what makes them special is the presence of mechanisms that eliminate the need for labeled datasets and for convolution or recurrence in the network.

What are these special mechanisms?

There are many. But the two mechanisms that are truly the force behind transformers are attention weighting and feed-forward networks (FFN).

What is attention weighting?

Attention weighting is a technique by which the model learns which parts of the incoming sequence it should focus on. Think of it as the ‘Eye of Sauron’, scanning everything at all times and throwing light on the parts that are relevant.

Fun fact: apparently, the researchers almost named the Transformer model ‘Attention-Net’, given that attention is such a crucial part of it.

What is an FFN?

In the context of transformers, an FFN is essentially a regular multilayer perceptron acting on a batch of independent data vectors. Combined with attention, it produces the right ‘position-dimension’ combination.

So, without further ado, let’s dive into how attention weighting and the FFN make transformers so powerful.

This discussion is based on Prof. Tom Yeh’s wonderful AI by Hand Series on Transformers. (All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)

So here we go:

The key ideas here: attention weighting and the feed-forward network (FFN).

Keeping those in mind, suppose we are given:

  • 5 input features from a previous block (a 3×5 matrix here, where X1, X2, X3, X4 and X5 are the features, and each of the three rows denotes one of their characteristics).

[1] Obtain attention weight matrix A

The first step in the process is to obtain the attention weight matrix A. This is the part where the self-attention mechanism comes into play. What it is trying to do is find the most relevant parts of this input sequence.

We do it by feeding the input features into the query-key (QK) module. For simplicity, the details of the QK module are not included here.
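For the curious, below is a minimal sketch of one standard formulation of such a module, scaled dot-product attention from ‘Attention Is All You Need’. Everything here (W_Q, W_K, the softmax direction, the values) is an assumption for illustration, not the exercise’s actual module:

```python
import numpy as np

# Hedged sketch of a standard QK module (scaled dot-product attention).
# W_Q, W_K and all values are illustrative assumptions.
rng = np.random.default_rng(0)

d, n = 3, 5                                   # 3 characteristics (rows), 5 positions (columns)
X = rng.integers(0, 4, (d, n)).astype(float)  # illustrative input features

W_Q = rng.standard_normal((d, d))             # query projection (assumed)
W_K = rng.standard_normal((d, d))             # key projection (assumed)

Q = W_Q @ X                                   # one query per position (column)
K = W_K @ X                                   # one key per position (column)

scores = K.T @ Q / np.sqrt(d)                 # 5x5: similarity of every key to every query
A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax down each column

print(A.shape)                                # (5, 5); each column sums to 1
```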

[2] Attention Weighting

Once we have the attention weight matrix A (5×5), we multiply the input features (3×5) by it to obtain the attention-weighted features Z.

The important part here is that the features are combined based on their positions P1, P2 and P3, i.e. horizontally.

To break it down further, consider this calculation performed row-wise:

P1 × A1 = Z1 → Position [1,1] = 11

P1 × A2 = Z2 → Position [1,2] = 6

P1 × A3 = Z3 → Position [1,3] = 7

P1 × A4 = Z4 → Position [1,4] = 7

P1 × A5 = Z5 → Position [1,5] = 5

.

.

.

P2 × A4 = Z4 → Position [2,4] = 3

P3 × A5 = Z5 → Position [3,5] = 1

For instance:

It seems a little tedious at first, but follow the multiplication row-wise and the result should be pretty straightforward.

The cool thing is that, thanks to the way our attention weight matrix A is arranged, the new features Z become combinations of X as below:

Z1 = X1 + X2

Z2 = X2 + X3

Z3 = X3 + X4

Z4 = X4 + X5

Z5 = X5 + X1

(Hint: look at the positions of the 0s and 1s in matrix A.)
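To make step [2] concrete, here is a minimal NumPy sketch. The 0/1 pattern in A is inferred from the combinations listed above; the values in X are made up for illustration and are not the ones from the hand exercise:

```python
import numpy as np

# Step [2] as a matrix product: Z = X @ A.
X = np.array([[1., 2., 0., 3., 1.],
              [0., 1., 2., 1., 0.],
              [2., 0., 1., 0., 1.]])   # 3x5: rows = characteristics, columns = X1..X5 (illustrative)

A = np.zeros((5, 5))
for j in range(5):
    A[j, j] = 1.0                      # X_j contributes to Z_j ...
    A[(j + 1) % 5, j] = 1.0            # ... and so does its neighbor X_{j+1}

Z = X @ A                              # 3x5 attention-weighted features

assert np.allclose(Z[:, 0], X[:, 0] + X[:, 1])   # Z1 = X1 + X2
assert np.allclose(Z[:, 4], X[:, 4] + X[:, 0])   # Z5 = X5 + X1
```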

[3] FFN: First Layer

The next step is to feed the attention-weighted features into the feed-forward neural network.

However, the difference here lies in combining the values across dimensions, as opposed to across positions as in the previous step. It is done as below:

What this does is look at the data from the other direction.

– In the attention step, we combined our input on the basis of the original features to obtain new features.

– In this FFN step, we consider their characteristics, i.e. we combine the features vertically to obtain our new matrix.

E.g.: P1(1,1) × Z1(1,1) + P2(1,2) × Z1(2,1) + P3(1,3) × Z1(3,1) + b(1) = 11, where b is the bias.

Once more, element-wise row operations to the rescue. Notice that the number of dimensions of the new matrix is increased to 4 here.
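Here is the same first layer as a NumPy sketch. Only the 3-to-4 shape change comes from the exercise; the weights W1, b1 and the values in Z are assumed for illustration:

```python
import numpy as np

# FFN first layer: mix the rows (characteristics) of Z, raising 3 dimensions to 4.
rng = np.random.default_rng(1)

Z = rng.integers(0, 4, (3, 5)).astype(float)  # attention-weighted features (illustrative)
W1 = rng.standard_normal((4, 3))              # combines across dimensions (vertically)
b1 = rng.standard_normal((4, 1))              # one bias per output dimension

H = W1 @ Z + b1                               # 4x5 matrix
# e.g. H[0, 0] = W1[0,0]*Z[0,0] + W1[0,1]*Z[1,0] + W1[0,2]*Z[2,0] + b1[0]
```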

[4] ReLU

Our favorite step: ReLU, where the negative values obtained in the previous matrix are returned as zero and the positive values remain unchanged.
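In NumPy, this step is a one-liner (the values in H below are illustrative):

```python
import numpy as np

# ReLU: negatives become zero, positives pass through unchanged.
H = np.array([[ 2., -1.,  0. ],
              [-3.,  4., -0.5]])
H_relu = np.maximum(0.0, H)   # [[2., 0., 0.], [0., 4., 0.]]
```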

[5] FFN: Second Layer

Finally, we pass it through the second layer, where the dimensionality of the resultant matrix is reduced from 4 back to 3.

The output here is ready to be fed to the next block (note its similarity to the original matrix), and the entire process is repeated from the beginning.
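A matching sketch of the second layer; again, only the 4-to-3 shape change comes from the exercise, and the weights are assumed:

```python
import numpy as np

# FFN second layer: project back from 4 dimensions to 3.
rng = np.random.default_rng(2)

H_relu = rng.random((4, 5))          # output of the ReLU step (illustrative)
W2 = rng.standard_normal((3, 4))
b2 = rng.standard_normal((3, 1))

out = W2 @ H_relu + b2               # 3x5: same shape as the input X, ready for the next block
```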

The two key things to remember here are:

  1. The attention layer combines across positions (horizontally).
  2. The feed-forward layer combines across dimensions (vertically).

And that is the secret sauce behind the power of transformers: the ability to analyze data from different directions.
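Putting both directions together, here is a minimal end-to-end sketch of one such block, with all weights assumed for illustration and a generic column-normalized A standing in for the QK module:

```python
import numpy as np

def transformer_block(X, A, W1, b1, W2, b2):
    # One block as described above (residual connections and normalization omitted).
    Z = X @ A                            # [2] combine across positions (horizontally)
    H = np.maximum(0.0, W1 @ Z + b1)     # [3]+[4] combine across dimensions, then ReLU
    return W2 @ H + b2                   # [5] project back to the input dimensionality

rng = np.random.default_rng(3)
X = rng.random((3, 5))
A = rng.random((5, 5))
A /= A.sum(axis=0, keepdims=True)        # columns sum to 1, like attention weights

out = transformer_block(X, A,
                        rng.standard_normal((4, 3)), rng.standard_normal((4, 1)),
                        rng.standard_normal((3, 4)), rng.standard_normal((3, 1)))
print(out.shape)                         # (3, 5): ready for the next block
```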

To summarize the ideas above, here are the key points:

  1. The transformer architecture can be perceived as a combination of the attention layer and the feed-forward layer.
  2. The attention layer combines the features to produce a new feature. E.g., think of combining two robots, Robo-Truck and Optimus Prime, to get a new robot, Robtimus Prime.
  3. The feed-forward (FFN) layer combines the parts or characteristics of a feature to produce new parts/characteristics. E.g., the wheels of Robo-Truck and the ion laser of Optimus Prime could produce a wheeled laser.

Neural networks have existed for quite a while now. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) had been reigning supreme, but things took quite an eventful turn once Transformers were introduced in the year 2017. Since then, the field of AI has grown at an exponential rate, with new models, new benchmarks, and new learnings coming in every day. And only time will tell if this phenomenal idea will someday pave the way for something even greater: a real ‘Transformer’.

But for now, it would not be wrong to say that an idea can really transform how we live!

Image by author

P.S. If you would like to work through this exercise on your own, here is the blank template for your use.

Blank template for the hand exercise

Now go have some fun and create your own Robtimus Prime!
