## The complex math behind transformer models, in simple words

It is no secret that the transformer architecture was a breakthrough in the field of Natural Language Processing (NLP). It overcame the limitation of seq-to-seq models like RNNs, which were incapable of capturing long-term dependencies in text. The transformer architecture turned out to be the foundation of revolutionary architectures like BERT, GPT, and T5 and their variants. As many say, NLP is in the midst of a golden era, and it would not be wrong to say that the transformer model is where it all began.

## Need for Transformer Architecture

As the saying goes, necessity is the mother of invention. Traditional seq-to-seq models were no good when it came to working with long texts. **It means the model tends to forget the learnings from the earlier parts of the input sequence as it moves on to process the latter part of the input sequence.** This loss of information is undesirable.

Although gated architectures like LSTMs and GRUs showed some improvement in handling long-term dependencies by **discarding useless information along the way to remember the important information**, it still wasn’t enough. The world needed something more powerful, and in 2015, **attention mechanisms were introduced by Bahdanau et al.** They were used together with RNN/LSTM models to mimic the human behaviour of focusing on selective things while ignoring the rest. Bahdanau suggested assigning relative importance to every word in a sentence so that the model focuses on the important words and ignores the rest. This emerged to be a large improvement over encoder-decoder models for neural machine translation tasks, and soon enough, the attention mechanism was applied to other tasks as well.

## The Era of Transformer Models

Transformer models are entirely based on an attention mechanism, also referred to as **“self-attention”**. This architecture was introduced to the world in the 2017 paper “**Attention Is All You Need**”. It consists of an Encoder-Decoder architecture.

On a high level,

- The *encoder* is responsible for accepting the input sentence and converting it into a hidden representation, with all useless information discarded.
- The *decoder* accepts this hidden representation and tries to generate the target sentence.

In this article, we’ll delve into a detailed breakdown of the Encoder component of the Transformer model. In the next article, we will look at the Decoder component in detail. Let’s start!

The encoder block of the transformer consists of a stack of N encoders that work sequentially. The output of one encoder is the input for the next encoder, and so on. The output of the last encoder is the final representation of the input sentence, which is fed to the decoder block.

Each encoder block can be further split into two components, as shown in the figure below.

Let us look into each of these components one by one in detail to understand how the encoder block works. The first component in the encoder block is **multi-head attention**, but before we hop into the details, let us first understand an underlying concept: *self-attention*.

## Self-Attention Mechanism

The first question that may pop up in everyone’s mind: *Are attention and self-attention different concepts?* Yes, they are. (Duh!)

Traditionally, **attention mechanisms** came into existence for the task of neural machine translation, as discussed in the previous section. So essentially, the attention mechanism was applied to map the source and target sentences. Since seq-to-seq models perform the translation task token by token, the attention mechanism helps us identify which token(s) from the source sentence to *focus more on* while generating token x for the target sentence. For this, it uses hidden state representations from the encoder and decoder to calculate attention scores, and generates context vectors based on these scores as input for the decoder. If you wish to learn more about the attention mechanism, please hop on to this article (brilliantly explained!).

Coming back to **self-attention**, the basic idea is to calculate the attention scores while mapping the source sentence to itself. If you have a sentence like,

“The boy didn’t cross the *road* because *it* was too wide.”

It is easy for us humans to understand that the word “it” refers to “road” in the above sentence, but how do we make our language model understand this relationship as well? That is where **self-attention** comes into the picture!

On a high level, every word in the sentence is compared against every other word in the sentence to quantify the relationships and understand the context. For representational purposes, you can refer to the figure below.

Let us see in detail how this self-attention is actually calculated.

*Generate embeddings for the input sentence*

Find embeddings of all the words and convert them into an input matrix. These embeddings can be generated via simple tokenisation and one-hot encoding, or by embedding algorithms like BERT, etc. The **dimension of the input matrix** will be equal to *sentence length x embedding dimension*. Let us call this *input matrix X* for future reference.

*Transform input matrix into Q, K & V*

For calculating self-attention, we need to transform X (the input matrix) into three new matrices:

- Query (Q)
- Key (K)
- Value (V)

To calculate these three matrices, we randomly initialise three weight matrices, namely **Wq, Wk, & Wv**. The input matrix X is multiplied with these weight matrices Wq, Wk, & Wv to obtain the values of Q, K & V respectively. The optimal values of the weight matrices are learned during training, yielding more accurate values for Q, K & V.
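To make this concrete, here is a minimal NumPy sketch of this step. The sizes (a 4-word sentence, 8-dimensional embeddings) and the random weights are made up for illustration; in a real model the weights are learned:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                  # sentence length x embedding dimension (made-up sizes)
X = rng.normal(size=(seq_len, d_model))  # input matrix X (word embeddings)

# Randomly initialised weight matrices; in practice these are learned during training.
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, d_model))
Wv = rng.normal(size=(d_model, d_model))

# Project X into the query, key, and value matrices.
Q, K, V = X @ Wq, X @ Wk, X @ Wv
print(Q.shape, K.shape, V.shape)         # each is (4, 8)
```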

*Calculate the dot product of Q and K-transpose*

From the figure above, we can infer that qi, ki, and vi represent the values of Q, K, and V for the i-th word in the sentence.

The first row of the output matrix tells you how word1, represented by q1, is related to the rest of the words in the sentence via the dot product. The higher the value of the dot product, the more related the words are. For an intuition of why this dot product is calculated, you can think of the Q (query) and K (key) matrices in terms of information retrieval. Here,

- Q or Query = the term you are searching for
- K or Key = the set of keywords in the search engine against which Q is compared and matched

Since in the previous step we are calculating the dot product of two matrices, i.e. performing a multiplication operation, there is a chance that the values might explode. To make sure this doesn’t happen and that gradients stay stable, we divide the dot product of Q and K-transpose by the square root of the key dimension (dk).

*Normalise the values using softmax*

Normalisation using the softmax function results in values between 0 and 1. Cells with a high scaled dot product are amplified further, whereas low values are suppressed, making the distinction between matched word pairs clearer. The resultant output matrix can be considered a *score matrix S*.
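The scaling and softmax steps can be sketched as follows, again with made-up sizes and random Q and K standing in for real projections:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dk = 4, 8                       # made-up sizes
Q = rng.normal(size=(seq_len, dk))
K = rng.normal(size=(seq_len, dk))

# Dot product of Q and K-transpose, scaled by sqrt(dk) to keep values stable.
scores = Q @ K.T / np.sqrt(dk)           # shape: (seq_len, seq_len)

# Softmax over each row -> values in (0, 1) that sum to 1 per word.
S = np.exp(scores - scores.max(axis=-1, keepdims=True))
S = S / S.sum(axis=-1, keepdims=True)    # score matrix S

print(S.shape)                           # (4, 4)
print(S.sum(axis=-1))                    # each row sums to 1
```

Subtracting the row maximum before exponentiating is the standard numerically stable way to compute softmax; it does not change the result.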

*Calculate the attention matrix Z*

The value matrix V is multiplied by the score matrix S obtained in the previous step to calculate the attention matrix Z.

*But wait, why multiply?*

Suppose Si = [0.9, 0.07, 0.03] is the score-matrix row for the i-th word of a sentence. This vector is multiplied with the V matrix to calculate Zi (the attention representation of the i-th word).

*Zi = [0.9 * V1 + 0.07 * V2 + 0.03 * V3]*

Can we say that to understand the context of the i-th word, we should mostly focus on word1 (i.e. V1), as 90% of the attention score comes from V1? We can clearly identify the important words where **more attention** must be paid to understand the context of the i-th word.

Hence, we can conclude that the higher the contribution of a word to the Zi representation, the more critical and related the two words are to one another.
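The weighted sum can be verified with a tiny sketch; the score row Si comes from the example above, while the value vectors V1, V2, V3 are made up:

```python
import numpy as np

Si = np.array([0.9, 0.07, 0.03])   # score row for the i-th word (from the example)
V = np.array([[1.0, 0.0],          # V1, V2, V3: made-up 2-dimensional value vectors
              [0.0, 1.0],
              [1.0, 1.0]])

# Zi = 0.9 * V1 + 0.07 * V2 + 0.03 * V3
Zi = Si @ V
print(Zi)                          # [0.93, 0.1]
```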

Now that we know how to calculate the self-attention matrix, let us understand the concept of the **multi-head attention mechanism**.

*Multi-head Attention Mechanism*

What will happen if your score matrix is biased toward a particular word representation? It will mislead your model, and the results will not be as accurate as we expect. Let us see an example to understand this better.

S1: “*All is well*”

Z(well) = 0.6 * V(all) + 0.0 * V(is) + 0.4 * V(well)

S2: “*The dog ate the food since it was hungry*”

Z(it) = 0.0 * V(the) + 1.0 * V(dog) + 0.0 * V(ate) + …… + 0.0 * V(hungry)

In the S1 case, while calculating Z(well), more importance is given to V(all) than to V(well) itself. There is no guarantee of how accurate this will be.

In the S2 case, while calculating Z(it), all the importance is given to V(dog), whereas the scores for the rest of the words are 0.0, including V(it) itself. This looks acceptable, because the word “it” is ambiguous; it makes sense to relate it more to another word than to the word itself. That was the entire purpose of this exercise of calculating self-attention: to handle the context of ambiguous words in the input sentences.

In other words, we can say that if the current word is ambiguous, then it is okay to give more importance to some other word while calculating self-attention, but in other cases it can be misleading for the model. So, what do we do now?

*What if we calculate multiple attention matrices instead of a single one, and derive the final attention matrix from them?*

That is precisely what **multi-head attention** is all about! We calculate multiple versions of the attention matrix z1, z2, z3, ….., zm and concatenate them to derive the final attention matrix. That way, we can be more confident about our attention matrix.
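A minimal sketch of the idea, with made-up sizes and random weights; note that real implementations also multiply the concatenated heads by a learned output weight matrix, which is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
d_head = d_model // num_heads            # each head works in a smaller subspace

X = rng.normal(size=(seq_len, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(num_heads):
    # Each head has its own (randomly initialised) Wq, Wk, Wv.
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Z = softmax(Q @ K.T / np.sqrt(d_head)) @ V   # one attention matrix z_i
    heads.append(Z)

# Concatenate z1..zm to form the final attention matrix.
Z_multi = np.concatenate(heads, axis=-1)
print(Z_multi.shape)                     # (4, 8)
```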

Moving on to the next important concept,

## Positional Encoding

In seq-to-seq models, the input sentence is fed word by word to the network, which allows the model to track the positions of words relative to other words.

But transformer models follow a different approach. Instead of feeding inputs word by word, all words are fed in parallel, which helps in reducing the training time and learning long-term dependencies. But with this approach, the word order is lost. However, to understand the meaning of a sentence correctly, word order is extremely important. To overcome this problem, a new matrix called “**positional encoding**” (P) is introduced.

This matrix P is sent along with the input matrix X to incorporate the information related to word order. For obvious reasons, the dimensions of the X and P matrices are the same.

To calculate positional encoding, the following formula is used:

PE(pos, 2i) = sin(pos / 10000^(2i/d))

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

In the above formula,

- **pos** = position of the word in the sentence
- **d** = dimension of the word/token embedding
- **i** = index over each dimension in the embedding

In the calculations, d is fixed, but pos and i vary. If d = 512, then i ∈ [0, 255], since we take 2i.
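A small sketch of this formula in NumPy (assuming an even embedding dimension d):

```python
import numpy as np

def positional_encoding(max_len, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
    P = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]    # column vector of positions
    i = np.arange(0, d, 2)               # even dimension indices 2i
    angle = pos / (10000 ** (i / d))
    P[:, 0::2] = np.sin(angle)           # sine on even dimensions
    P[:, 1::2] = np.cos(angle)           # cosine on odd dimensions
    return P

P = positional_encoding(max_len=6, d=512)
print(P.shape)                           # (6, 512)
print(P[0, :4])                          # pos=0 -> [sin(0), cos(0), ...] = [0., 1., 0., 1.]
```

Because each dimension pair uses a different frequency, every position gets a distinct encoding vector.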

This video covers positional encoding in depth if you want to know more about it.

Visual Guide to Transformer Neural Networks — (Part 1) Position Embeddings

I am using some visuals from the above video to explain this concept in my own words.

The above figure shows an example of a positional encoding vector along with different variable values.

The above figure shows how the values of **PE(pos, 2i)** vary *if i is kept constant and only pos varies*. As we know, the *sinusoidal wave* is a periodic function that repeats itself after a fixed interval. We can see that the encoding vectors for pos = 0 and pos = 6 are identical. This is not desirable, as we would want *different positional encoding vectors for different values of pos*.

This can be achieved by *varying the frequency of the sinusoidal wave*.

As the value of i varies, the frequency of the sinusoidal wave also varies, resulting in different waves and hence different values for every positional encoding vector. This is exactly what we wanted to achieve.

The positional encoding matrix (P) is added to the input matrix (X) and fed to the encoder.

The next component of the encoder is the **feedforward network**.

## Feedforward Network

This sublayer in the encoder block is a classic neural network with two dense layers and a ReLU activation. It accepts input from the multi-head attention layer, performs non-linear transformations on it, and finally generates contextualised vectors. The fully connected layer is responsible for considering every attention head and learning relevant information from them. Since the attention vectors are independent of one another, they can be passed through the network in a parallelised way.
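A minimal sketch of this sublayer, with made-up sizes and random weights standing in for learned ones (biases initialised to zero):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32        # d_ff is the (made-up) hidden width

# Two dense layers with a ReLU in between; weights are learned in practice.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

Z = rng.normal(size=(seq_len, d_model))  # stand-in for the multi-head attention output

hidden = np.maximum(0, Z @ W1 + b1)      # first dense layer + ReLU activation
out = hidden @ W2 + b2                   # second dense layer, back to d_model
print(out.shape)                         # (4, 8)
```

The same two layers are applied independently to every position, which is why this step parallelises so easily.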

The last component of the Encoder block is the **Add & Norm component**.

**Add & Norm component**

This is a *residual connection* followed by *layer normalisation*. The residual connection ensures that no important information from the input of a sub-layer is lost in processing, while the normalisation layer promotes faster model training and prevents the values from drifting heavily.

Within the encoder, there are two Add & Norm layers:

- one connects the input of the multi-head attention sub-layer to its output
- one connects the input of the feedforward network sub-layer to its output
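A minimal sketch of this component; the learned scale and shift parameters that real layer normalisation also carries are omitted here for brevity:

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    # Residual connection: add the sub-layer's input back to its output...
    y = x + sublayer_out
    # ...then layer-normalise each position's vector (zero mean, unit variance).
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))              # sub-layer input
sub = rng.normal(size=(4, 8))            # sub-layer output (attention or FFN)
y = add_and_norm(x, sub)
print(y.mean(axis=-1))                   # ~0 for every position
```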

With this, we conclude the inner workings of the Encoder. To summarise the article, let us quickly go over the steps the encoder performs:

- Generate embeddings or tokenised representations of the input sentence. This will be our input matrix X.
- Generate the positional encodings to preserve the information related to the word order of the input sentence, and add them to the input matrix X.
- Randomly initialise three matrices, Wq, Wk, & Wv, i.e. the weights of query, key & value. These weights will be updated during the training of the transformer model.
- Multiply the input matrix X with each of Wq, Wk, & Wv to generate the Q (query), K (key) and V (value) matrices.
- Calculate the dot product of Q and K-transpose, scale the product by dividing it by the square root of dk (the key dimension), and finally normalise it using the softmax function.
- Calculate the attention matrix Z by multiplying the V (value) matrix with the output of the softmax function.
- Pass this attention matrix to the feedforward network to perform non-linear transformations and generate contextualised embeddings.
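The steps above can be sketched end to end as a toy single-head encoder stack; all sizes are made up, weights are random rather than learned, and multi-head attention is reduced to a single head for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(y, eps=1e-6):
    return (y - y.mean(axis=-1, keepdims=True)) / (y.std(axis=-1, keepdims=True) + eps)

def positional_encoding(max_len, d):
    P = np.zeros((max_len, d))
    pos = np.arange(max_len)[:, None]
    i = np.arange(0, d, 2)
    P[:, 0::2] = np.sin(pos / 10000 ** (i / d))
    P[:, 1::2] = np.cos(pos / 10000 ** (i / d))
    return P

def encoder_block(X, rng):
    seq_len, d = X.shape
    # Self-attention (single head here for brevity).
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Z = softmax(Q @ K.T / np.sqrt(d)) @ V
    X = layer_norm(X + Z)                      # Add & Norm 1
    # Feedforward network: dense -> ReLU -> dense.
    W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))
    F = np.maximum(0, X @ W1) @ W2
    return layer_norm(X + F)                   # Add & Norm 2

seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)

for _ in range(2):                             # a stack of N = 2 encoders
    X = encoder_block(X, rng)

print(X.shape)                                 # (4, 8): final representation fed to the decoder
```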

In the next article, we will understand how the Decoder component of the Transformer model works.

That would be all for this article. I hope you found it useful. If you did, please don’t forget to clap and share it with your friends.