Understanding Transformers


A simple breakdown of “Attention is All You Need”¹

The transformer came out in 2017. There have been many, many articles explaining how it works, but I often find them either going too deep into the maths or too shallow on the details. I end up spending as much time googling (or chatGPT-ing) as I do reading, which isn’t the best way to understand a topic. That brought me to writing this article, where I try to explain the most revolutionary aspects of the transformer while keeping it succinct and simple for anyone to read.

This article assumes a general understanding of machine learning principles.

Transformers, transforming. Image source: DALL-E (we’re learning about gen AI anyway)

The ideas behind the Transformer led us to the era of Generative AI

Transformers represented a new architecture of sequence transduction models. A sequence model is a type of model that transforms an input sequence into an output sequence. This input sequence can be of various data types, such as characters, words, tokens, bytes, numbers, phonemes (speech recognition), and may also be multimodal¹.

Before transformers, sequence models were largely based on recurrent neural networks (RNNs), long short-term memory (LSTM), gated recurrent units (GRUs) and convolutional neural networks (CNNs). They often contained some form of attention mechanism to account for the context provided by items in various positions of a sequence.

RNN Illustration. Image source: Christopher Olah
  • RNNs: The model processes the data sequentially, so anything learned from the previous computation is accounted for in the next computation². However, its sequential nature causes a few problems: the model struggles to capture long-term dependencies in longer sequences (the vanishing or exploding gradient problem), and it prevents parallel processing of the input sequence, as you cannot train on different chunks of the input at the same time (batching) without losing the context of the previous chunks. This makes it more computationally expensive to train.
LSTM and GRU overview. Image source: Christopher Olah
  • LSTMs and GRUs: Made use of gating mechanisms to preserve long-term dependencies³. The model has a cell state which contains the relevant information from the whole sequence. The cell state changes through gates such as the forget, input and output gates (LSTM), and the update and reset gates (GRU). These gates decide, at each sequential iteration, how much information from the previous state should be kept, how much information from the new update should be added, and then which part of the new cell state should be kept overall. While this improves the vanishing gradient issue, the models still work sequentially and hence train slowly due to limited parallelisation, especially as sequences get longer.
  • CNNs: Process data in a more parallel fashion, but still technically operate sequentially. They are effective at capturing local patterns but struggle with long-term dependencies because of the way convolution works: the number of operations required to capture the relationship between two input positions increases with the distance between them.

Hence, introducing the Transformer, which relies entirely on the attention mechanism and does away with recurrence and convolutions. Attention is what the model uses to focus on different parts of the input sequence at each step of generating an output. The Transformer was the first model to use attention without sequential processing, allowing for parallelisation and hence faster training without losing long-term dependencies. It also performs a constant number of operations between input positions, regardless of how far apart they are.

Transformer architecture. Image source: Attention is All You Need

The important features of the transformer are: tokenisation, the embedding layer, the attention mechanism, the encoder and the decoder. Let’s imagine an input sequence in French, “Je suis etudiant”, and a target output sequence in English, “I am a student” (I’m blatantly copying from this link, which explains the process very descriptively).

Tokenisation

The input sequence of words is converted into tokens, typically 3–4 characters long.
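To make this concrete, here’s a toy sketch in Python. This is not a real tokeniser (real ones, like BPE, learn their subword pieces from data); the vocabulary, IDs and the 4-character split below are all made up purely for illustration.

```python
# Toy illustration of tokenisation: a made-up subword vocabulary maps
# pieces of the input text to integer IDs. Everything here is invented
# for clarity; real tokenisers learn their vocabulary from data.
toy_vocab = {"je": 0, "suis": 1, "etud": 2, "iant": 3, "<unk>": 4}

def tokenise(text):
    tokens = []
    for word in text.lower().split():
        # naively split each word into chunks of up to 4 characters
        tokens.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return tokens

tokens = tokenise("Je suis etudiant")
token_ids = [toy_vocab.get(t, toy_vocab["<unk>"]) for t in tokens]
print(tokens)      # ['je', 'suis', 'etud', 'iant']
print(token_ids)   # [0, 1, 2, 3]
```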

Embeddings

The input and output sequences are mapped to a sequence of continuous representations, z, which represents the input and output embeddings. Each token will be represented by an embedding to capture some sort of meaning, which helps in computing its relationship to other tokens; this embedding will be represented as a vector. To create these embeddings, we use the vocabulary of the training dataset, which contains every unique token being used to train the model. We then determine an appropriate embedding dimension, which corresponds to the size of the vector representation for each token; higher embedding dimensions can capture more complex, diverse and intricate meanings and relationships. The size of the embedding matrix, for vocabulary size V and embedding dimension D, hence becomes V x D, with each row being a high-dimensional vector representing one token.

At initialisation, these embeddings are typically set randomly; more accurate embeddings are then learned during training as the embedding matrix is updated.

Positional encodings are added to these embeddings because the transformer doesn’t have a built-in sense of the order of tokens.
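Here’s a minimal numpy sketch of both steps: looking up each token’s embedding and adding the sinusoidal positional encodings from the paper. The vocabulary size, embedding dimension and token IDs are toy values; a real model learns the embedding matrix during training.

```python
import numpy as np

V, D = 10_000, 512                                 # toy vocabulary size and embedding dimension
embedding_matrix = np.random.randn(V, D) * 0.01    # V x D matrix, learned during training

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from 'Attention is All You Need'."""
    pos = np.arange(seq_len)[:, None]                   # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000, (2 * i) / d_model)   # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions use cosine
    return pe

token_ids = np.array([0, 1, 2, 3])                 # toy IDs from the tokenisation step
x = embedding_matrix[token_ids]                    # (4, D) token embeddings
x = x + positional_encoding(len(token_ids), D)     # inject order information
```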

Computing attention scores for the token “it”. As you can see, the model is paying most attention to the tokens “The” and “Animal”. Image source: Jay Alammar

Attention mechanism

Self-attention is the mechanism by which each token in a sequence computes attention scores with every other token in the sequence, to capture relationships between all tokens regardless of their distance from one another. I’m going to avoid too much maths in this article, but you can read up here about the different matrices formed to compute attention scores and hence capture relationships between every pair of tokens.

These attention scores lead to a new set of representations⁴ for each token, which is then used in the next layer of processing. During training, the weight matrices are updated through back-propagation, so the model can better account for relationships between tokens.
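Here’s a minimal numpy sketch of scaled dot-product self-attention, softmax(QKᵀ/√d_k)V, which is the core computation described in the paper. The projection matrices below are randomly initialised stand-ins for weights that would normally be learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v            # project tokens into queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ V                             # new representation for every token

seq_len, d_model, d_k = 4, 512, 64
x = np.random.randn(seq_len, d_model)              # token embeddings + positional encodings
W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.01 for _ in range(3))
out = self_attention(x, W_q, W_k, W_v)             # (4, 64)
```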

Multi-head attention is simply an extension of self-attention. Several sets of attention scores are computed in parallel, the results are concatenated and transformed, and the resulting representation enhances the model’s ability to capture various complex relationships between tokens.
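And a sketch of the multi-head version: the model dimension is split across several heads, attention runs once per head, and the head outputs are concatenated and projected. Again, the weights are random placeholders for what would be learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads=8):
    """Toy multi-head self-attention: run attention per head, concatenate, project."""
    seq_len, d_model = x.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        # each head has its own (here randomly initialised, normally learned) projections
        W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.01 for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ V)                      # (seq_len, d_k)
    concat = np.concatenate(heads, axis=-1)            # (seq_len, d_model)
    W_o = np.random.randn(d_model, d_model) * 0.01     # final output projection
    return concat @ W_o

x = np.random.randn(4, 512)
out = multi_head_attention(x)                          # (4, 512)
```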

Encoder

Input embeddings (built from the input sequence) with positional encodings are fed into the encoder. The encoder consists of 6 layers, with each layer containing 2 sub-layers: multi-head attention and a feed-forward network. There is also a residual connection, which results in the output of each sub-layer being LayerNorm(x + Sublayer(x)) as shown. The output of the encoder is a sequence of vectors which are contextualised representations of the inputs after accounting for attention scores. These are then fed to the decoder.
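A minimal sketch of that LayerNorm(x + Sublayer(x)) pattern for one encoder layer. The attention stand-in below is just a random linear map so the snippet runs on its own; in the real model it would be the multi-head self-attention shown earlier.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: two linear layers with a ReLU in between."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, attention, ffn):
    """One encoder layer: each sub-layer is wrapped as LayerNorm(x + Sublayer(x))."""
    x = layer_norm(x + attention(x))   # sub-layer 1: multi-head self-attention
    x = layer_norm(x + ffn(x))         # sub-layer 2: feed-forward network
    return x

d_model, d_ff, seq_len = 512, 2048, 4
x = np.random.randn(seq_len, d_model)
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

# stand-in for self-attention (a random linear map), just to keep the sketch self-contained
attention = lambda h: h @ (np.random.randn(d_model, d_model) * 0.01)
out = encoder_layer(x, attention, lambda h: feed_forward(h, W1, b1, W2, b2))

# The full encoder stacks 6 such layers; the output of one layer feeds the next.
```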

Decoder

Output embeddings (generated from the target output sequence) with positional encodings are fed into the decoder. The decoder also contains 6 layers, and there are two differences from the encoder.

First, the output embeddings go through masked multi-head attention, which means that the embeddings from subsequent positions in the sequence are ignored when computing the attention scores. This is because when we generate the current token (at position i), we should ignore all output tokens at positions after i. Furthermore, the output embeddings are offset to the right by one position, so that the predicted token at position i only depends on outputs at positions before it.
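The masking itself is simple: before the softmax, every score for a future position is set to minus infinity, so it ends up with an attention weight of zero. A small sketch of just that step:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
scores = np.random.randn(seq_len, seq_len)          # raw attention scores (Q K^T / sqrt(d_k))

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)            # future positions get -inf...
weights = softmax(scores, axis=-1)                  # ...and therefore weight 0 after softmax

print(np.round(weights, 2))
# Row i has non-zero weights only for columns 0..i, e.g. row 0 is [1. 0. 0. 0.]
```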

For example, let’s say the input is “je suis étudiant à l’école” and the target output is “i am a student at school”. When predicting the token for “student”, the encoder takes the embeddings of the full input “je suis étudiant à l’école”, while the decoder masks the tokens after “a”, so that the prediction of “student” only considers the previous tokens in the sentence, namely “i am a”. This trains the model to predict tokens sequentially. Of course, the tokens “at school” provide added context for the model’s prediction, but we are training the model to capture this context from the input token “étudiant” and the subsequent input tokens “à l’école”.

How is the decoder getting this context? Well, that brings us to the second difference: the second multi-head attention layer in the decoder takes in the contextualised representations of the inputs (the encoder output) before everything is passed into the feed-forward network, to ensure that the output representations capture the full context of the input tokens and prior outputs. This gives us a sequence of vectors corresponding to each target token, which are contextualised target representations.
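A sketch of that encoder–decoder (“cross”) attention: the queries come from the decoder’s representations, while the keys and values come from the encoder output. The weights and shapes are placeholders for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_x, encoder_out, W_q, W_k, W_v):
    """Queries from the decoder, keys/values from the encoder output."""
    Q = decoder_x @ W_q                 # what the decoder is looking for
    K = encoder_out @ W_k               # what each input token offers
    V = encoder_out @ W_v               # the input information to mix together
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V                  # target representations with full input context

d_model, d_k = 512, 64
encoder_out = np.random.randn(6, d_model)   # 6 input tokens ("je suis étudiant à l'école")
decoder_x = np.random.randn(3, d_model)     # 3 target tokens generated so far ("i am a")
W_q, W_k, W_v = (np.random.randn(d_model, d_k) * 0.01 for _ in range(3))
out = cross_attention(decoder_x, encoder_out, W_q, W_k, W_v)   # (3, 64)
```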

Image source: Jay Alammar

The prediction using the Linear and Softmax layers

Now, we want to use those contextualised target representations to work out what the next token is. Using the contextualised target representations from the decoder, the linear layer projects the sequence of vectors into a much larger logits vector, which has the same length as our model’s vocabulary, say length L. The linear layer contains a weight matrix which, when multiplied with the decoder outputs and added to a bias vector, produces a logits vector of size 1 x L. Each cell holds the score of a unique token, and the softmax layer then normalises this vector so that it sums to one; each cell now represents the probability of each token. The highest-probability token is chosen, and voilà! we have our predicted token.
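In code, that last step is just a matrix multiply, a bias add, a softmax and an argmax. The dimensions below are toy values, and the weights are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

d_model, L = 512, 37_000                       # toy model dimension and vocabulary size
h = np.random.randn(d_model)                   # contextualised representation of the last position
W = np.random.randn(d_model, L) * 0.01         # linear layer weight matrix
b = np.zeros(L)                                # bias vector

logits = h @ W + b                             # the 1 x L logits vector, one score per vocabulary token
probs = softmax(logits)                        # normalised so the vector sums to 1
next_token_id = int(np.argmax(probs))          # pick the highest-probability token
```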

Training the model

Next, we compare the predicted token probabilities to the actual token probabilities (which is simply a vector of 0s for every token apart from the target token, which has probability 1.0). We calculate an appropriate loss function for each token prediction and average this loss over the entire target sequence. We then back-propagate this loss through all of the model’s parameters to calculate the gradients, and use an appropriate optimisation algorithm to update the model parameters. Hence, for the classic transformer architecture, this leads to updates of

  1. The embedding matrix
  2. The different matrices used to compute attention scores
  3. The matrices associated with the feed-forward neural networks
  4. The linear matrix used to produce the logits vector

Matrices 2–4 are weight matrices, and there are additional bias terms associated with each output which are also updated during training.

Note: the linear matrix and the embedding matrix are sometimes transposes of each other. This is the case in the Attention is All You Need paper; the technique is known as “weight tying”. The number of parameters to train is thereby reduced.

This represents one epoch of training. Training comprises multiple epochs, with the number depending on the size of the dataset, the size of the model, and the model’s task.
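To make the loss part concrete, here’s a sketch of the cross-entropy computation for one target sequence, assuming we already have the model’s predicted probability distribution at each position. The vocabulary size and target IDs are toy values; in a real framework, back-propagation and the optimiser step would follow automatically.

```python
import numpy as np

L, T = 37_000, 5                           # toy vocabulary size and target sequence length
rng = np.random.default_rng(0)

# Predicted probabilities for each target position (each row sums to 1).
logits = rng.normal(size=(T, L))
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

target_ids = np.array([12, 845, 3, 999, 7])    # the "actual" tokens, each with probability 1.0

# Cross-entropy loss per position, averaged over the whole target sequence.
loss = -np.mean(np.log(probs[np.arange(T), target_ids]))
print(loss)

# In a real framework, loss.backward() would compute gradients for all the
# matrices listed above, and an optimiser (e.g. Adam) would update them.
```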

As we mentioned earlier, the problems with RNNs, CNNs, LSTMs and the rest include the lack of parallel processing, their sequential architecture, and inadequate capturing of long-term dependencies. The transformer architecture above solves these problems as…

  1. The attention mechanism allows the whole sequence to be processed in parallel rather than sequentially. With self-attention, each token in the input sequence attends to every other token in the input sequence (of that mini-batch, explained next). This captures all relationships at the same time, rather than in a sequential manner.
  2. Mini-batching of the input within each epoch allows parallel processing, faster training, and easier scalability of the model. In a large text full of examples, mini-batches represent a smaller collection of those examples. The examples in the dataset are shuffled before being put into mini-batches, and reshuffled at the beginning of each epoch. Each mini-batch is passed into the model at the same time (see the sketch after this list).
  3. By using positional encodings, the order of tokens in a sequence is still accounted for, and distances between tokens are treated equally regardless of how far apart they are, since attention performs a constant number of operations between any two positions.
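Here’s what that parallelism looks like in practice: one batched matrix multiplication scores every token against every other token, for every sequence in the mini-batch, with no sequential loop. The shapes are toy values.

```python
import numpy as np

batch, seq_len, d_k = 32, 10, 64     # a mini-batch of 32 sequences, processed together

Q = np.random.randn(batch, seq_len, d_k)
K = np.random.randn(batch, seq_len, d_k)
V = np.random.randn(batch, seq_len, d_k)

# One batched matrix multiplication scores every token against every other token
# in every sequence of the mini-batch at the same time -- no sequential loop needed.
scores = Q @ np.transpose(K, (0, 2, 1)) / np.sqrt(d_k)   # (32, 10, 10)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                                        # (32, 10, 64)
```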

As shown in the paper, the results were fantastic.

Welcome to the world of transformers.

The transformer architecture was introduced in 2017 by Ashish Vaswani and his co-authors at Google Brain. The Generative Pre-trained Transformer (GPT) was introduced by OpenAI in 2018. The primary difference is that GPTs don’t contain an encoder stack in their architecture. The encoder-decoder makeup is helpful when we are directly converting one sequence into another sequence. GPT was designed to focus on generative capabilities, and it did away with the encoder while keeping the rest of the components (a decoder-style stack) similar.

Image source: Improving Language Understanding by Generative Pre-Training

The GPT model is pre-trained on a large corpus of text, unsupervised, to learn relationships between all words and tokens⁵. After fine-tuning for various use cases (such as a general-purpose chatbot), it has proven to be extremely effective at generative tasks.

Example

When you ask it a question, the steps for prediction are largely the same as for a regular transformer. If you ask it the question “How does GPT predict responses”, these words are tokenised, embeddings are generated, attention scores are computed, probabilities for the next token are calculated, and a token is chosen to be the next predicted token. For example, the model might generate the response step by step, starting with “GPT predicts responses by…” and continuing based on probabilities until it forms a complete, coherent response. (Guess what, that last sentence was from ChatGPT.)
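That token-by-token loop is easy to sketch. Below, `model` is a hypothetical stand-in that returns a next-token distribution (a real GPT computes this with its decoder stack, and usually samples from the distribution rather than always taking the argmax); the token IDs and end-of-sequence token are invented for illustration.

```python
import numpy as np

VOCAB_SIZE, END_TOKEN = 50_000, 0

def model(token_ids):
    """Hypothetical stand-in: a real GPT would return next-token probabilities
    computed by its decoder stack. Here we just return a random distribution."""
    logits = np.random.randn(VOCAB_SIZE)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def generate(prompt_ids, max_new_tokens=20):
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(token_ids)                  # probabilities for the next token
        next_id = int(np.argmax(probs))           # greedy pick (real models often sample instead)
        token_ids.append(next_id)                 # feed the prediction back in
        if next_id == END_TOKEN:                  # stop at an end-of-sequence token
            break
    return token_ids

output = generate([101, 2129, 2515, 14246])       # toy IDs standing in for the prompt
```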
