Large Language Models, GPT-1 — Generative Pre-Trained Transformer

Diving deep into the working structure of the first version of the gigantic GPT models

2017 was a historic year in machine learning. Researchers from the Google Brain team introduced the Transformer, which rapidly outperformed most of the existing approaches in deep learning. The famous attention mechanism became the key component of future models derived from the Transformer. The amazing fact about the Transformer's architecture is its vast flexibility: it can be efficiently used for a wide variety of machine learning task types, including NLP, image, and video processing problems.

The original Transformer can be decomposed into two parts called the encoder and the decoder. As the name suggests, the goal of the encoder is to encode an input sequence in the form of a vector of numbers: a low-level format that machines can understand. On the other hand, the decoder takes the encoded sequence and, by applying a language modeling task, generates a new sequence.

Encoders and decoders can be used individually for specific tasks. The two most famous models deriving their parts from the original Transformer are BERT (Bidirectional Encoder Representations from Transformer), consisting of encoder blocks, and GPT (Generative Pre-Trained Transformer), composed of decoder blocks.

Transformer architecture

In this article, we will discuss GPT and understand how it works. From a high-level perspective, it is important to understand that the GPT architecture consists of a set of Transformer blocks, as illustrated in the diagram above, except for the fact that it does not have any input encoders.

As for most LLMs, GPT's framework consists of two stages: pre-training and fine-tuning. Let us study how they are organised.

1. Pre-training

Loss function

As the paper states, "We use a standard language modeling objective to maximize the following likelihood":

Pre-training loss function.
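Written out explicitly, the objective looks as follows (a reconstruction of the paper's formula, with u_i denoting the corpus tokens, k the context window size, and Θ the model parameters):

```latex
L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```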

In this formula, at each step, the model outputs the probability distribution over all possible tokens being the next token i for the sequence consisting of the last k context tokens. Then, the logarithm of the probability for the true token is taken and used as one of the several values in the sum above for the loss function.

The parameter k is called the context window size.

The mentioned loss function is also referred to as the log-likelihood.

Encoder models (e.g. BERT) predict tokens based on the context from both sides, while decoder models (e.g. GPT) only use the previous context, otherwise they would not be able to learn to generate text.

GPT diagram during pre-training

The intuition behind the loss function

Since the expression for the log-likelihood may not be easy to comprehend, this section explains in detail how it works.

As the name suggests, GPT is a generative model, indicating that its ultimate goal is to generate a new sequence during inference. To achieve this, during training an input sequence is embedded and split into several substrings of equal size k. After that, for every substring, the model is asked to predict the next token by generating the output probability distribution (by using the final softmax layer) built over all vocabulary tokens. Each token in this distribution is mapped to the probability that exactly this token is the true next token in the subsequence.
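As a rough sketch of how such training pairs could be constructed (hypothetical tokens and a toy window size, not GPT's actual data pipeline):

```python
# Build (context, next-token) training pairs from a tokenized sequence.
# The tokens below are hypothetical; in GPT they would be BPE token IDs.
tokens = ["We", "split", "this", "string", "into", "sub", "strings"]

k = 3  # context window size

pairs = []
for i in range(k, len(tokens)):
    context = tokens[i - k:i]   # the last k tokens
    target = tokens[i]          # the token the model must predict
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
```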

To make things clearer, let us look at the example below, in which we are given the following string:

We split this string into substrings of length k = 3. For each of these substrings, the model outputs a probability distribution for the language modeling task. The predicted distributions are shown in the table below:

In each distribution, the probability corresponding to the true token in the sequence is taken (highlighted in yellow) and used for the loss calculation. The final loss equals the sum of the logarithms of the true token probabilities.

GPT tries to maximize its loss, so higher loss values correspond to better algorithm performance.

From the example distributions above, it is clear that high predicted probabilities corresponding to true tokens add larger values to the loss function, demonstrating better performance of the algorithm.
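As a minimal numerical illustration (the probabilities of the true tokens below are made up, in the spirit of the table above):

```python
import math

# Hypothetical probabilities assigned to the true next token for each substring.
true_token_probs = [0.8, 0.6, 0.9]

objective = sum(math.log(p) for p in true_token_probs)
print(objective)  # ≈ -0.84; higher (closer to 0) means better predictions
```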

Subtlety behind the loss function

We have now understood the intuition behind GPT's pre-training loss function. However, the expression for the log-likelihood was originally derived from another formula that could be much easier to interpret!

Let us assume that the model performs the same language modeling task. However, this time, the loss function will maximize the product of all predicted probabilities. It is a reasonable choice, since all the output predicted probabilities for different subsequences are independent.

Multiplication of probabilities as the loss value for the previous example
Computed loss value

Since probability is defined in the range [0, 1], this loss function also takes values in that range. The highest value of 1 indicates that the model predicted all of the correct tokens with 100% confidence, thus it can fully restore the whole sequence. Therefore,

The product of probabilities, as the loss function for a language modeling task, maximizes the probability of correctly restoring the whole sequence(s).

General formula for product probability in language modeling
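In the same notation as before, this alternative objective would look like (a reconstruction):

```latex
L(\mathcal{U}) = \prod_{i} P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```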

If this loss function is so simple and seems to have such a nice interpretation, why is it not used in GPT and other LLMs? The problem comes down to computational limits:

  • In the formula, a set of probabilities is multiplied. The values they represent are usually very low and close to 0, especially at the beginning of the pre-training step, when the algorithm has not learned anything yet and thus assigns near-random probabilities to its tokens.
  • In real life, models are trained in batches and not on single examples. This means that the total number of probabilities in the loss expression can be very high.

As a consequence, a lot of tiny values are multiplied. Unfortunately, computers with their floating-point arithmetic cannot compute such expressions precisely. That is why the loss function is slightly transformed by putting a logarithm in front of the whole product. The reasoning behind doing so lies in two useful logarithm properties:

  • The logarithm is monotonic. This means that higher loss will still correspond to better performance and lower loss will correspond to worse performance. Therefore, maximizing L or log(L) does not require modifications to the algorithm.
Natural logarithm plot
  • The logarithm of a product is equal to the sum of the logarithms of its factors, i.e. log(ab) = log(a) + log(b). This rule can be used to decompose the product of probabilities into a sum of logarithms:
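In the notation used above, the decomposition reads (a reconstruction, not the paper's exact rendering):

```latex
\log \prod_{i} P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
= \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
```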

We can notice that simply by introducing the logarithmic transformation, we have obtained the same formula used for the original loss function in GPT! Given that and the observations above, we can conclude an important fact:

The log-likelihood loss function in GPT maximizes the logarithm of the probability of correctly predicting all the tokens in the input sequence.
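A small sketch illustrates why the logarithm matters numerically; the probabilities below are made up, and the point is only the floating-point behaviour:

```python
import math

# Many small probabilities, as at the beginning of pre-training.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0: the true value 1e-500 underflows to zero

log_sum = sum(math.log(p) for p in probs)
print(log_sum)    # ≈ -1151.3: easily representable as a float
```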

Text generation

Once GPT is pre-trained, it can already be used for text generation. GPT is an autoregressive model, meaning that it uses previously predicted tokens as input for the prediction of the next tokens.

On each iteration, GPT takes an initial sequence and predicts the most probable next token for it. After that, the sequence and the predicted token are concatenated and passed as input to again predict the next token, and so on. The process lasts until the [end] token is predicted or the maximum input size is reached.

Autoregressive completion of a sentence with GPT
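A minimal sketch of this generation loop, assuming a hypothetical model that returns a probability for every vocabulary token and a tokenize helper (neither is GPT's actual API):

```python
def generate(model, tokenize, prompt, max_len=512, end_token="[end]"):
    # Autoregressive (greedy) generation: repeatedly append the most
    # probable next token and feed the extended sequence back in.
    tokens = tokenize(prompt)
    while len(tokens) < max_len:
        probs = model(tokens)                # dict: token -> probability
        next_token = max(probs, key=probs.get)
        if next_token == end_token:
            break
        tokens.append(next_token)
    return tokens
```

In practice, sampling strategies other than the greedy argmax above can be used, but the overall loop stays the same.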

2. Fine-tuning

After pre-training, GPT can already capture linguistic knowledge from input sequences. However, to make it perform better on downstream tasks, it needs to be fine-tuned on a supervised problem.

For fine-tuning, GPT accepts a labelled dataset where each example contains an input sequence x with a corresponding label y which needs to be predicted. Every example is passed through the model, which outputs its hidden representation h at the last layer. The resulting vector is then passed to an added linear layer with learnable parameters W and then through a softmax layer.

The loss function used for fine-tuning is very similar to the one mentioned in the pre-training phase, but this time it evaluates the probability of observing the target value y instead of predicting the next token. Ultimately, the evaluation is done for several examples in the batch, for which the log-likelihood is then calculated.

Loss function for downstream task
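Reconstructed from the paper's description, where h denotes the final-layer representation of the input tokens x^1, ..., x^m and W the parameters of the added linear layer:

```latex
P(y \mid x^1, \ldots, x^m) = \mathrm{softmax}\left(h\, W\right), \qquad
L_2(\mathcal{C}) = \sum_{(x,\, y)} \log P\left(y \mid x^1, \ldots, x^m\right)
```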

Additionally, the authors of the paper found it useful to include the auxiliary language modeling objective used for pre-training in the fine-tuning loss function as well. According to them, it:

  • improves the model’s generalization;
  • accelerates convergence.
GPT diagram during fine-tuning. Image adapted by the author.

Finally, the fine-tuning loss function takes the following form (α is a weight):

Fine-tuning loss function
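In the notation of the two losses above, the combined objective is (a reconstruction; the paper itself denotes the weight by λ, while the text above calls it α):

```latex
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \alpha \cdot L_1(\mathcal{C})
```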

There exist a lot of approaches in NLP for fine-tuning a model, and some of them require changes in the model's architecture. The obvious downside of such techniques is that it becomes much harder to use transfer learning. Moreover, they also require a lot of customizations to be made to the model, which is not practical at all.

On the other hand, GPT uses a traversal-style approach: for different downstream tasks, GPT does not require changes in its architecture but only in the input format. The original paper demonstrates visualised examples of input formats accepted by GPT on various downstream problems. Let us go through them individually.

Classification

This is the simplest downstream task. The input sequence is wrapped with [start] and [end] tokens (which are trainable) and then passed to GPT.

Classification pipeline for fine-tuning. Image adapted by the author.

Textual entailment

Textual entailment, or natural language inference (NLI), is the problem of determining whether the first sentence (premise) logically entails the second one (hypothesis) or not. For modeling that task, the premise and hypothesis are concatenated and separated by a delimiter token ($).

Textual entailment pipeline for fine-tuning. Image adapted by the author.

Semantic similarity

The goal of similarity tasks is to understand how semantically close a pair of sentences are to each other. Normally, the compared sentence pairs do not have an order. Taking that into account, the authors propose concatenating the sentences in both possible orders and feeding the resulting sequences to GPT. The two hidden Transformer outputs are then added element-wise and passed to the final linear layer.

Semantic similarity pipeline for fine-tuning. Image adapted by the author.

Question answering & Multiple choice answering

Multiple choice answering is the task of correctly selecting one or several answers to a given question based on the provided context information.

For GPT, each possible answer is concatenated with the context and the question. All of the concatenated strings are then independently passed through the Transformer, whose outputs from the linear layer are aggregated, and the final prediction is chosen based on the resulting answer probability distribution.

Multiple choice answering pipeline for fine-tuning. Image adapted by the author.
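The traversal-style inputs above could be assembled as plain strings along the following lines (an illustrative sketch; the bracketed tokens and the $ delimiter follow the formats described above, while the helper functions themselves are hypothetical):

```python
def classification_input(text):
    return f"[start] {text} [end]"

def entailment_input(premise, hypothesis):
    return f"[start] {premise} $ {hypothesis} [end]"

def similarity_inputs(sentence_a, sentence_b):
    # Both orders are fed to GPT; the two outputs are later added element-wise.
    return [
        f"[start] {sentence_a} $ {sentence_b} [end]",
        f"[start] {sentence_b} $ {sentence_a} [end]",
    ]

def multiple_choice_inputs(context, question, answers):
    # One sequence per candidate answer; each is scored independently.
    return [
        f"[start] {context} {question} $ {answer} [end]"
        for answer in answers
    ]
```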

GPT is pre-trained on the BookCorpus dataset containing 7k books. This dataset was chosen on purpose, since it mostly consists of long stretches of text, allowing the model to better capture language information over long distances. Speaking of architecture and training details, the model has the following parameters:

  • Number of Transformer blocks: 12
  • Embedding size: 768
  • Number of attention heads: 12
  • FFN hidden state size: 3072
  • Optimizer: Adam (learning rate set to 2.5e-4)
  • Activation function: GELU
  • Byte-pair encoding with a vocabulary size of 40k is used
  • Total number of parameters: 120M

Finally, GPT is pre-trained for 100 epochs with a batch size of 64 on contiguous sequences of 512 tokens.

Most of the hyperparameters used for fine-tuning are the same as those used during pre-training. However, for fine-tuning, the learning rate is decreased to 6.25e-5 with the batch size set to 32. In most cases, 3 fine-tuning epochs were enough for the model to produce strong performance.

Byte-pair encoding helps deal with unknown tokens: it iteratively constructs the vocabulary at the subword level, meaning that any unknown token can then be split into a combination of learned subword representations.
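As a simplified illustration of the subword idea (a greedy longest-match over a toy vocabulary, not the actual BPE merge procedure used by GPT):

```python
def split_into_subwords(word, vocab):
    # Greedily take the longest known subword prefix at each step.
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fall back to a single character if nothing matches.
            pieces.append(word[i])
            i += 1
    return pieces

# Toy vocabulary: the unseen word "transformers" can still be represented.
vocab = {"trans", "form", "er", "s"}
print(split_into_subwords("transformers", vocab))  # ['trans', 'form', 'er', 's']
```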

Combining the power of Transformer blocks with an elegant architecture design, GPT has become one of the most fundamental models in machine learning. It established new state-of-the-art results on 9 out of 12 benchmarks and became an important foundation for its future gigantic successors: GPT-2, GPT-3, GPT-4, ChatGPT, etc.

All images are by the author unless noted otherwise.
