
De-coded: Transformers explained in plain English


No code, maths, or mention of Keys, Queries and Values

Since their introduction in 2017, transformers have emerged as a dominant force in the field of Machine Learning, revolutionising the capabilities of major translation and autocomplete services.

Recently, the popularity of transformers has soared even higher with the advent of large language models like OpenAI’s ChatGPT, GPT-4, and Meta’s LLaMA. These models, which have garnered immense attention and excitement, are all built on the foundation of the transformer architecture. By leveraging the power of transformers, these models have achieved remarkable breakthroughs in natural language understanding and generation, exposing these capabilities to the general public.

Despite many good resources which break down how transformers work, I found myself in a position where I understood how the mechanics worked mathematically but found it difficult to explain how a transformer works intuitively. After conducting many interviews, chatting with my colleagues, and giving a lightning talk on the topic, it seems that many people share this problem!

In this blog post, I shall aim to provide a high-level explanation of how transformers work without relying heavily on code or mathematics. My goal is to avoid confusing technical jargon and comparisons with previous architectures. Whilst I’ll try to keep things as simple as possible, this won’t be easy, as transformers are quite complex, but I hope it will provide a better intuition of what they do and how they do it.

A transformer is a type of neural network architecture which is well suited to tasks that involve processing sequences as inputs. Perhaps the most common example of a sequence in this context is a sentence, which we can think of as an ordered set of words.

The aim of these models is to create a numerical representation for each element within a sequence, encapsulating essential information about the element and its neighbouring context. The resulting numerical representations can then be passed on to downstream networks, which can leverage this information to perform various tasks, including generation and classification.

By creating such rich representations, these models enable downstream networks to better understand the underlying patterns and relationships within the input sequence, which enhances their ability to generate coherent and contextually relevant outputs.

The key advantage of transformers lies in their ability to handle long-range dependencies within sequences, as well as being highly efficient; they are capable of processing sequences in parallel. This is particularly useful for tasks such as machine translation, sentiment analysis, and text generation.

Image generated by the Azure OpenAI Service DALL-E model with the following prompt: “The green and black Matrix code in the form of Optimus Prime”

To feed an input into a transformer, we must first convert it into a sequence of tokens; a set of integers that represent our input.

As transformers were first applied in the NLP domain, let’s consider this scenario first. The simplest way to convert a sentence into a series of tokens is to define a vocabulary which acts as a lookup table, mapping words to integers; we can reserve a specific number to represent any word which is not contained in this vocabulary, so that we can always assign an integer value.
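
As a rough sketch of this idea (the tiny vocabulary below is made up purely for illustration), such a lookup table might look something like this:

```python
# A toy vocabulary acting as a lookup table from words to integers.
# Index 0 is reserved for any word that is not in the vocabulary.
vocabulary = {"<unk>": 0, "hello": 1, "there": 2, "the": 3, "weather": 4, "nice": 5, "today": 6, "in": 7}

def naive_tokenize(sentence: str) -> list[int]:
    # Lower-case, strip basic punctuation and map each word to its id,
    # falling back to the reserved <unk> index for unseen words.
    words = [w.strip(",.?!") for w in sentence.lower().split()]
    return [vocabulary.get(w, vocabulary["<unk>"]) for w in words]

print(naive_tokenize("Hello there, isn't the weather nice today in Drosval?"))
# "isn't" and "drosval" are not in the vocabulary, so both map to 0.
```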

In practice, this is a naïve way of encoding text, as words such as cat and cats are treated as completely different tokens, despite being singular and plural descriptions of the same animal! To overcome this, different tokenisation strategies, such as byte-pair encoding, have been devised which break words up into smaller chunks before indexing them. Additionally, it is often useful to add special tokens to represent characteristics such as the start and end of a sentence, to provide additional context to the model.

Let’s consider the following example, to better understand the tokenization process.

“Hello there, isn’t the weather nice today in Drosval?”

Drosval is a name generated by GPT-4 using the following prompt: “Can you create a fictional place name that sounds like it could belong to David Gemmell’s Drenai universe?”; chosen deliberately as it shouldn’t appear in the vocabulary of any trained model.

Using the bert-base-uncased tokenizer from the transformers library, this is converted to a sequence of integer tokens.

The integers that represent each word will change depending on the specific model training and tokenization strategy. Decoding these, we can see the word (or word chunk) that each token represents.
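
A short sketch of this tokenise-and-decode round trip, using the Hugging Face transformers library, might look something like the following; the exact integer ids depend on the tokenizer used:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "Hello there, isn't the weather nice today in Drosval?"

# Convert the sentence into a sequence of integer token ids.
token_ids = tokenizer(sentence)["input_ids"]
print(token_ids)

# Decode each id back into the (sub-)word it represents; expect special
# tokens, split-up contractions, and sub-word chunks for unknown words.
print(tokenizer.convert_ids_to_tokens(token_ids))
```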

Interestingly, we can see that this is not the same as our input. Special tokens have been added, our abbreviation has been split into multiple tokens, and our fictional place name is represented by different ‘chunks’. As we used the ‘uncased’ model, we have also lost all capitalization context.

However, whilst we used a sentence for our example, transformers are not limited to text inputs; this architecture has also demonstrated good results on vision tasks. To convert an image into a sequence, the authors of ViT sliced the image into non-overlapping 16×16 pixel patches and concatenated these into a long vector before passing it into the model. If we were using a transformer in a recommender system, one approach could be to use the item ids of the last n items browsed by a user as an input to our network. If we can create a meaningful representation of input tokens for our domain, we can feed this into a transformer network.
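
As a rough illustration of the ViT-style approach (the image size below is arbitrary, and the real model also projects each flattened patch through a learned layer), slicing an image into non-overlapping patches might look something like this:

```python
import numpy as np

# A dummy RGB image of height 224 and width 224.
image = np.random.rand(224, 224, 3)

patch_size = 16
h, w, c = image.shape

# Slice the image into non-overlapping 16x16 patches and flatten each
# patch into a single vector, giving a sequence of (224/16)**2 = 196 'tokens'.
patches = image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size * patch_size * c)

print(patches.shape)  # (196, 768): a sequence of 196 patch 'tokens'
```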

Embedding our tokens

Once we have a sequence of integers which represents our input, we can convert them into embeddings. Embeddings are a way of representing information that can be easily processed by machine learning algorithms; they aim to capture the meaning of the token being encoded in a compressed format, by representing the information as a sequence of numbers. Initially, embeddings are initialised as sequences of random numbers, and meaningful representations are learned during training. However, these embeddings have an inherent limitation: they do not take into account the context in which the token appears. There are two aspects to this.
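
As a minimal sketch (using PyTorch, with a vocabulary size matching bert-base-uncased and an embedding dimension chosen for illustration), such an embedding table could be created as follows:

```python
import torch
import torch.nn as nn

# An embedding table for a vocabulary of 30,522 tokens, where each token is
# represented by 768 numbers. The weights start as random numbers and
# meaningful values are learned during training.
embedding_layer = nn.Embedding(num_embeddings=30522, embedding_dim=768)

token_ids = torch.tensor([7592, 2045])  # two arbitrary token ids
token_embeddings = embedding_layer(token_ids)
print(token_embeddings.shape)  # torch.Size([2, 768])
```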

Depending on the task, when we embed our tokens, we may also wish to preserve their ordering; this is especially important in domains such as NLP, otherwise we essentially end up with a bag of words approach. To overcome this, we apply positional encoding to our embeddings. Whilst there are multiple ways of creating positional embeddings, the main idea is that we have another set of embeddings which represent the position of each token in the input sequence, and these are combined with our token embeddings.
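
A minimal sketch of one such approach, learned position embeddings that are simply added to the token embeddings, might look something like this (the sizes are arbitrary):

```python
import torch
import torch.nn as nn

embedding_dim = 768
max_sequence_length = 512

token_embedding = nn.Embedding(30522, embedding_dim)
position_embedding = nn.Embedding(max_sequence_length, embedding_dim)

token_ids = torch.tensor([101, 7592, 2045, 102])  # an example token sequence
positions = torch.arange(token_ids.shape[0])      # [0, 1, 2, 3]

# Combine the token embeddings with an embedding for each position, so that
# the same token receives a different representation at different positions.
embeddings = token_embedding(token_ids) + position_embedding(positions)
print(embeddings.shape)  # torch.Size([4, 768])
```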

The other issue is that tokens can have different meanings depending on the tokens that surround them. Consider the following sentences:

It’s dark, who turned off the light?

Wow, this parcel is really light!

Here, the word light is used in two different contexts, where it has completely different meanings! However, it is likely that, depending on the tokenisation strategy, the embedding will be the same. In a transformer, this is handled by its attention mechanism.

Perhaps the most important mechanism used by the transformer architecture is known as attention, which enables the network to understand which parts of the input sequence are the most relevant for the given task. For each token in the sequence, the attention mechanism identifies which other tokens are important for understanding the current token in the given context. Before we explore how this is implemented within a transformer, let’s start simple and try to understand what the attention mechanism is trying to achieve conceptually, to build our intuition.

One way to understand attention is to think of it as a method which replaces each token embedding with an embedding that includes information about its neighbouring tokens, instead of using the same embedding for every token regardless of its context. If we knew which tokens were relevant to the current token, one way of capturing this context would be to create a weighted average (or, more generally, a linear combination) of these embeddings.

Let’s consider a simple example of how this could look for one of the sentences we saw earlier. Before attention is applied, the embeddings in the sequence have no context of their neighbours; therefore, we can visualise the embedding for the word light as a linear combination in which its own embedding receives a weight of one and every other embedding a weight of zero.

Here, we can see that our weights are just the identity matrix. After applying our attention mechanism, we would like to learn a weight matrix such that our light embedding can be expressed as a weighted combination of the embeddings of all of the tokens in the sentence.

This time, larger weights are given to the embeddings that correspond to the most relevant parts of the sequence for our chosen token, which should ensure that the most important context is captured in the new embedding vector.

Embeddings which contain information about their current context are often known as contextualised embeddings, and this is ultimately what we are trying to create.
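
To make the idea of a linear combination concrete, here is a small sketch with made-up weights, showing how the embedding for light could be replaced by a weighted average of the embeddings in its sentence:

```python
import torch

# Random embeddings stand in for the learned embeddings of the tokens in
# "Wow, this parcel is really light!" (values are made up for illustration).
tokens = ["wow", "this", "parcel", "is", "really", "light"]
embeddings = torch.randn(len(tokens), 768)

# Before attention: the weights form an identity row, so the embedding for
# 'light' is just its own context-free embedding.
identity_weights = torch.tensor([0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

# After attention: larger (made-up) weights are given to the most relevant
# tokens, producing a contextualised embedding for 'light'.
learned_weights = torch.tensor([0.05, 0.05, 0.4, 0.05, 0.15, 0.3])

contextualised_light = learned_weights @ embeddings  # weighted sum, shape (768,)
print(contextualised_light.shape)
```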

Now that we have a high-level understanding of what attention is trying to achieve, let’s explore how this is actually implemented in the following section.

There are multiple types of attention, and the main differences lie in the way that the weights used to perform the linear combination are calculated. Here, we will consider scaled dot-product attention, as introduced in the original paper, as this is the most common approach. In this section, assume that all of our embeddings have been positionally encoded.

Recalling that our aim is to create contextualised embeddings using linear combinations of our original embeddings, let’s start simple and assume that we can encode all of the necessary information into our learned embedding vectors, so that all we need to calculate are the weights.

To calculate the weights, we must first determine which tokens are relevant to each other. To achieve this, we need to establish a notion of similarity between two embeddings. One way to represent this similarity is by using the dot product, where we would like to learn embeddings such that higher scores indicate that two words are more similar.

As, for each token, we need to calculate its relevance with every other token in the sequence, we can generalise this to a matrix multiplication, which provides us with our weight matrix; these weights are often referred to as attention scores. To ensure that our weights sum to one, we also apply the SoftMax function. However, as matrix multiplications can produce arbitrarily large numbers, this could lead to the SoftMax function returning very small gradients for large attention scores, which may result in the vanishing gradient problem during training. To counteract this, the attention scores are multiplied by a scaling factor before applying the SoftMax.

Now, to get our contextualised embedding matrix, we can multiply our attention scores with our original embedding matrix, which is the equivalent of taking linear combinations of our embeddings.
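
Putting this together, a minimal sketch of this simplified attention calculation (which, in keeping with the simplification above, uses the embeddings themselves to calculate the weights) might look something like this:

```python
import math

import torch
import torch.nn.functional as F

sequence_length, embedding_dim = 6, 768

# Positionally encoded embeddings for each token in the sequence
# (random values stand in for learned embeddings).
embeddings = torch.randn(sequence_length, embedding_dim)

# Compare every token with every other token using dot products, then scale
# the scores to avoid very small gradients from the SoftMax.
scores = embeddings @ embeddings.T / math.sqrt(embedding_dim)

# SoftMax ensures that each row of weights sums to one.
attention_weights = F.softmax(scores, dim=-1)

# Each contextualised embedding is a linear combination (weighted average)
# of the original embeddings.
contextualised_embeddings = attention_weights @ embeddings
print(contextualised_embeddings.shape)  # torch.Size([6, 768])
```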
