
How GPT works: A Metaphoric Explanation of Key, Value, Query in Attention, using a Tale of Potion


Source: Generated by Midjourney.

The backbone of ChatGPT is the GPT model, which is built using the Transformer architecture. The backbone of the Transformer is the attention mechanism. For many people, the toughest concepts to grok in attention are Key, Value, and Query. In this post, I’ll use an analogy of potions to internalize these concepts. Even if you already understand the math of the Transformer mechanically, I hope that by the end of this post you can develop a more intuitive understanding of the inner workings of GPT from end to end.

This explanation requires no math background. For the technically inclined, I add more technical explanations in [brackets]. You can safely skip the notes in [brackets] and the side notes in quote blocks like this one. Throughout my writing, I make up human-readable interpretations of the intermediate states of the Transformer model to aid the explanation, but GPT doesn’t think exactly like that.

[When I talk about “attention”, I exclusively mean “self-attention”, as that is what’s behind GPT. But the same analogy explains the general concept of “attention” just as well.]

The Set Up

GPT can spew out paragraphs of coherent content because it does one task superbly well: “Given a text, what word comes next?” Let’s role-play GPT: “Sarah lies still on the bed, feeling ____”. Can you fill in the blank?

One reasonable answer, among many, is “tired”. In the rest of this post, I’ll unpack how GPT arrives at this answer. (For fun, I put this prompt into ChatGPT and it wrote a short story out of it.)

The Analogy: (Key, Value, Query), or (Tag, Potion, Recipe)

You feed the above prompt to GPT. In GPT, each word comes with three things: Key, Value, and Query, whose values are learned by devouring the entire web of text during the training of the GPT model. It’s the interaction among these three ingredients that allows GPT to make sense of a word in the context of a text. So what do they do, really?

Source: created by the author.

Let’s set up our analogy of alchemy. For each word, we have:

  • Potion (aka “value”): The potion contains rich information about the word. For illustrative purposes, imagine that the potion of the word “lies” contains information like “tired; dishonesty; can have a positive connotation if it’s a white lie; …”. The word “lies” can take on multiple meanings, e.g. “tell lies” (related to dishonesty) or “lies down” (related to tiredness). You can only tell the true meaning from the context of a text. Right now, the potion contains information for both meanings, because it doesn’t yet have the context of a text.
  • Recipe (aka “query”): The alchemist of a given word, e.g. “lies”, goes over all the nearby words. He finds some of those words relevant to his own word “lies”, and he is tasked with filling an empty flask with the potions of those words. The alchemist has a recipe listing specific criteria that identify which potions he should pay attention to.
  • Tag (aka “key”): Each potion (value) comes with a tag (key). If the tag (key) matches the alchemist’s recipe (query) well, the alchemist will pay attention to that potion.
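For the technically inclined, here is a minimal sketch in plain Python/NumPy (not the actual GPT code) of where the recipe, tag, and potion come from: each word’s query, key, and value are simply learned linear projections of that word’s embedding. The dimensions, embeddings, and projection matrices below are made-up placeholders; in a real GPT they are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4  # toy sizes; real GPTs use hundreds or thousands
tokens = ["Sarah", "lies", "still", "on", "the", "bed", ",", "feeling"]

# Stand-in word embeddings (in a real model, these are learned lookup rows).
X = rng.normal(size=(len(tokens), d_model))

# Learned projection matrices (random placeholders here).
W_q = rng.normal(size=(d_model, d_k))  # produces the "recipe"
W_k = rng.normal(size=(d_model, d_k))  # produces the "tag"
W_v = rng.normal(size=(d_model, d_k))  # produces the "potion"

Q = X @ W_q  # one recipe (query) per word
K = X @ W_k  # one tag (key) per word
V = X @ W_v  # one potion (value) per word
print(Q.shape, K.shape, V.shape)  # (8, 4) each
```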

Attention: the Alchemist’s Potion Mixology

The potions with their tags. Source: created by the author.

In the first step (attention), the alchemists of all the words each go out on their own quests to fill their flasks with potions from relevant words.

Let’s take the alchemist of the word “lies” as an example. He knows from previous experience (after being pre-trained on the entire web of text) that the words that help interpret “lies” in a sentence usually take the form: “some flat surfaces, words related to dishonesty, words related to resting”. He writes these criteria down in his recipe (query) and looks for the tags (keys) on the potions of the other words. If a tag is very similar to his criteria, he’ll pour a lot of that potion into his flask; if the tag is not similar, he’ll pour little or none of it.

He finds that the tag for “bed” says “a flat piece of furniture”. That’s similar to “some flat surfaces” in his recipe! He pours the potion for “bed” into his flask. The potion (value) for “bed” contains information like “tired, restful, sleepy, sick”.

The alchemist for the word “lies” continues his search. He finds that the tag for the word “still” says “related to resting” (among other connotations of the word “still”). That matches his criterion “words related to resting”, so he pours in part of the potion from “still”, which contains information like “restful, silent, stationary”.

He looks at the tags of “on”, “Sarah”, “the”, and “feeling” and doesn’t find them relevant. So he doesn’t pour any of their potions into his flask.

Remember, he needs to check his own potion too. The tag of his own potion “lies” says “a verb related to resting”, which matches his recipe. So he pours some of his own potion into the flask as well; it contains information like “tired; dishonesty; can have a positive connotation if it’s a white lie; …”.

By the end of his quest through the words in the text, his flask is full.

Source: created by the author.

Unlike the original potion for “lies”, this mixed potion now takes into account the context of this very specific sentence. Namely, it has a lot of the “tired, exhausted” elements and only a pinch of “dishonest”.

With this quest, we’ve just explained the most important equation of the Transformer, the underlying architecture of GPT:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q is Query; K is Key; V is Value. Source: Attention Is All You Need
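Continuing the sketch above, here is that equation as a small, self-contained NumPy function. It is an illustration under toy assumptions (random placeholder Q, K, V, and no masking yet), not a production implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # recipe-vs-tag similarity for every pair of words
    weights = softmax(scores, axis=-1)  # how much of each potion to pour (each row sums to 1)
    return weights @ V                  # the mixed potion for every word

# Toy example: 8 words, d_k = 4 (random placeholders standing in for learned values).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
mixed = attention(Q, K, V)
print(mixed.shape)  # (8, 4): one context-aware "mixed potion" per word
```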

Advanced notes:

1. Each alchemist looks at every bottle, including his own [Q·K.transpose()].

2. The alchemist can match his recipe (query) against a tag (key) quickly and make a fast decision. [The similarity between query and key is determined by a dot product, which is a fast operation.] Moreover, all the alchemists do their quests in parallel, which also speeds things up. [Q·K.transpose() is a matrix multiplication, which is parallelizable. Speed is a winning feature of the Transformer compared to its predecessor, the Recurrent Neural Network, which computes sequentially.]

3. The alchemist is picky. He only selects the top few potions, instead of blending in a little bit of everything. [We apply softmax to Q·K.transpose(). Softmax pulls the inputs toward more extreme values and collapses many of them to near zero.]

4. At this stage, the alchemist doesn’t consider the ordering of words. Whether it’s “Sarah lies still on the bed, feeling” or “still bed the Sarah feeling on lies”, the filled flask (the output of attention) will be the same. [In the absence of “positional encoding”, Attention(Q, K, V) is independent of word positions.]

5. The flask always comes back 100% full, no more, no less. [The softmax weights are normalized to sum to 1.]

6. The alchemist’s recipe and the potions’ tags must speak the same language. [The Query and Key must have the same dimension so that their dot product is defined. The Value can take on a different dimension if you wish.]

7. Technically astute readers may point out that we didn’t do masking. I don’t want to clutter the analogy with too many details, but I’ll explain it here. In GPT’s (masked) self-attention, each word can only see the previous words. So in the sentence “Sarah lies still on the bed, feeling”, “lies” only sees “Sarah”; “still” only sees “Sarah” and “lies”. The alchemist of “still” can’t reach into the potions of “on”, “the”, “bed”, and “feeling”.
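Here is a minimal sketch of how that masking is typically implemented (again with toy placeholder values): positions a word should not see get a score of negative infinity before the softmax, so they end up with zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Masked (causal) self-attention: each word attends only to itself and earlier words."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = future words
    scores = np.where(mask, -np.inf, scores)          # future words get -inf, so softmax weight 0
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
out = causal_attention(Q, K, V)  # "still" (position 2) only mixes potions from positions 0-2
```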

Feed Forward: Chemistry on the Mixed Potions

Up to this point, the alchemist has simply been pouring potions from other bottles. In other words, he pours the potion of “lies” (“tired; dishonesty; …”) as a uniform mixture into the flask; he can’t distill out the “tired” part and discard the “dishonest” part just yet. [Attention simply sums the different V’s together, weighted by the softmax.]

Source: created by the author.

Now comes the real chemistry (feed forward). The alchemist mixes everything together and does some synthesis. He notices interactions between ingredients like “sleepy” and “restful”. He also notices that “dishonesty” is only mentioned in a single potion. He knows from past experience how to make some ingredients interact with one another and how to discard the one-off ones. [The feed-forward layer is a linear (and then non-linear) transformation applied to the mixed potion, i.e. the attention output. The feed-forward layer is the basic building block of neural networks. You can think of it as the “thinking” step in the Transformer, while the earlier mixology step is simply “collecting”.]

The resulting potion after his processing becomes much more useful for the task of predicting the next word. Intuitively, it represents richer properties of this word in the context of its sentence, in contrast to the starting potion (value), which is out of context.
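For the technically inclined, here is a minimal sketch of a position-wise feed-forward layer in the spirit of the original Transformer (two linear maps with a ReLU in between). The weights are random placeholders standing in for learned parameters, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 16  # toy sizes; the original Transformer used 512 and 2048

# Random placeholder weights; in a real model these are learned.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    """Position-wise feed-forward: applied to each word's mixed potion independently."""
    hidden = np.maximum(0, x @ W1 + b1)  # linear map, then ReLU non-linearity (the "thinking")
    return hidden @ W2 + b2              # project back to the model dimension

mixed_potions = rng.normal(size=(8, d_model))  # stand-in for the attention output
processed = feed_forward(mixed_potions)        # same shape, now "synthesized"
```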

The Final Linear and Softmax Layer: the Assembly of Alchemists

How do we get from here to the final output, which is to predict that the next word after “Sarah lies still on the bed, feeling ___” is “tired”?

So far, each alchemist has been working independently, tending only to his own word. Now the alchemists of the different words assemble, stack their flasks in the original word order, and present them to the final linear and softmax layers of the Transformer. What do I mean by this? Here, we must depart from the metaphor.

This final linear layer synthesizes information across the different words. Based on the pre-training data, one plausible thing to learn is that the immediately preceding word matters most for predicting the next word. For instance, the linear layer might weight the last flask (“feeling”’s flask) heavily.

Combined with the softmax layer, this step then assigns every word in the vocabulary a probability of being the next word after “Sarah lies still on the bed, feeling…”. For instance, non-English words will receive probabilities near 0, while words like “tired”, “sleepy”, and “exhausted” will receive high probabilities. We then pick the top winner as the final answer.
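Here is a minimal sketch of this last step, assuming a tiny made-up vocabulary and random placeholder weights. For simplicity it projects only the last word’s processed flask onto the vocabulary, whereas a real GPT uses a learned output layer over tens of thousands of tokens.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d_model = 4
vocab = ["tired", "sleepy", "exhausted", "happy", "table", "dishonest"]  # toy vocabulary

# The processed flask of the last word ("feeling"); placeholder values here.
last_word_state = rng.normal(size=d_model)

# Final linear layer projecting to vocabulary scores (random placeholder weights).
W_out = rng.normal(size=(d_model, len(vocab)))
logits = last_word_state @ W_out

probs = softmax(logits)                   # one probability per vocabulary word
next_word = vocab[int(np.argmax(probs))]  # pick the top winner as the prediction
print(dict(zip(vocab, probs.round(3))), "->", next_word)
```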

Source: created by the author.

Recap

Now you’ve built a minimalist GPT!

To recap: in the attention step, you determine which words (including itself) each word should pay attention to, based on how well that word’s query (recipe) matches the other words’ keys (tags). You mix together those words’ values (potions) in proportion to the attention the word pays to them. You process this mixture to do some “thinking” (feed forward). Once each word is processed, you combine the mixtures from all the words to do more “thinking” (linear layer) and make the final prediction of what the next word should be.
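Putting the earlier sketches together, here is a single toy forward pass showing the order of operations (one head, one layer, no positional encoding, random placeholder weights throughout). It is a recap illustration of the flow, not a faithful GPT implementation, so its prediction here is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["Sarah", "lies", "still", "on", "the", "bed", ",", "feeling", "tired", "dishonest"]
d_model = 8

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Random placeholder parameters (learned in a real GPT).
E = rng.normal(size=(len(vocab), d_model))            # word embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W_ff1 = rng.normal(size=(d_model, 4 * d_model))
W_ff2 = rng.normal(size=(4 * d_model, d_model))
W_out = rng.normal(size=(d_model, len(vocab)))        # final linear layer

prompt = ["Sarah", "lies", "still", "on", "the", "bed", ",", "feeling"]
X = E[[vocab.index(w) for w in prompt]]               # look up embeddings

# 1. Masked self-attention: each word mixes the potions it is allowed to see.
Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)
scores = np.where(np.triu(np.ones_like(scores, dtype=bool), k=1), -np.inf, scores)
X = softmax(scores) @ V

# 2. Feed forward: do the "thinking" on each mixed potion.
X = np.maximum(0, X @ W_ff1) @ W_ff2

# 3. Final linear + softmax on the last position: predict the next word.
probs = softmax(X[-1] @ W_out)
print("predicted next word:", vocab[int(np.argmax(probs))])  # arbitrary, since weights are random
```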

Source: created by the author.

Side note: the term “decoder” is a vestige of the original paper, as the Transformer was first used for machine translation tasks. You “encode” the source language into embeddings, and “decode” from the embeddings into the target language.
