How GPT works: A Metaphoric Explanation of Key, Value, Query in Attention, using a Tale of Potion

Source: Generated by Midjourney.

The backbone of ChatGPT is the GPT model, which is built using the Transformer architecture. The backbone of the Transformer is the Attention mechanism. The toughest concept in Attention for many people to grok is Key, Value, and Query. In this post, I’ll use an analogy of potions to internalize these concepts. Even if you already understand the math of the Transformer mechanically, I hope that by the end of this post you can develop a more intuitive understanding of the inner workings of GPT from end to end.

This explanation requires no math background. For the technically inclined, I add more technical explanations in [square brackets]. You can safely skip the notes in [brackets] and the side notes in quote blocks like this one. Throughout my writing, I make up human-readable interpretations of the intermediate states of the Transformer model to help the explanation, but GPT doesn’t think exactly like that.

[When I talk about “attention”, I exclusively mean “self-attention”, as that is what’s behind GPT. But the same analogy explains the general concept of “attention” just as well.]

The Setup

GPT can spew out paragraphs of coherent content because it does one task superbly well: “Given a text, what word comes next?” Let’s role-play GPT: “Sarah lies still on the bed, feeling ____”. Can you fill in the blank?

One reasonable answer, among many, is “tired”. In the rest of the post, I’ll unpack how GPT arrives at this answer. (For fun, I put this prompt into ChatGPT and it wrote a short story out of it.)

The Analogy: (Key, Value, Query), or (Tag, Potion, Recipe)

You feed the above prompt to GPT. In GPT, each word comes with three things: Key, Value, Query, whose values are learned from devouring the entire web of text during the training of the GPT model. It is the interaction among these three ingredients that allows GPT to make sense of a word in the context of a text. So what do they do, really?

Source: created by the author.

Let’s set up our analogy of alchemy. For every word, we have three things (a minimal code sketch follows this list):

  • A potion (aka “value”): The potion contains rich information about the word. For illustrative purposes, imagine the potion of the word “lies” contains information like “tired; dishonesty; can have a positive connotation if it’s a white lie; …”. The word “lies” can take on multiple meanings, e.g. “tell lies” (related to dishonesty) or “lies down” (related to resting). You can only tell the true meaning in the context of a text. Right now, the potion contains information for both meanings, since it doesn’t have the context of a text.
  • An alchemist’s recipe (aka “query”): The alchemist of a given word, e.g. “lies”, goes over all the nearby words. He finds a few of those words relevant to his own word “lies”, and he is tasked with filling an empty flask with the potions of those words. The alchemist has a recipe, listing specific criteria that identify which potions he should pay attention to.
  • A tag (aka “key”): each potion (value) comes with a tag (key). If the tag (key) matches well with the alchemist’s recipe (query), the alchemist will pay attention to this potion.
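If it helps to see the analogy in code, here is a minimal NumPy sketch (not the actual GPT implementation) of where the query, key, and value of each word come from: every word’s embedding is multiplied by three learned weight matrices. The matrix names, toy dimensions, and random numbers below are stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: each word embedding has d_model numbers; queries and keys
# share d_k numbers so they can be compared; values have d_v numbers.
d_model, d_k, d_v = 16, 8, 8

# Hypothetical learned weights (random stand-ins here; in a real GPT
# they are learned during training on large amounts of text).
W_query = rng.normal(size=(d_model, d_k))   # writes the recipe
W_key   = rng.normal(size=(d_model, d_k))   # writes the tag
W_value = rng.normal(size=(d_model, d_v))   # brews the potion

# One embedding per word of "Sarah lies still on the bed, feeling"
# (also random stand-ins; a real model looks these up from a learned table).
x = rng.normal(size=(7, d_model))

Q = x @ W_query   # one recipe (query) per word
K = x @ W_key     # one tag (key) per word
V = x @ W_value   # one potion (value) per word
```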

Attention: the Alchemist’s Potion Mixology

The potions with their tags. Source: created by the author.

In the first step (attention), the alchemists of all the words each go out on their own quests to fill their flasks with potions from relevant words.

Let’s take the alchemist of the word “lies” as an example. He knows from previous experience (after being pre-trained on the entire web of text) that words that help interpret “lies” in a sentence are often of the form: “some flat surfaces, words related to dishonesty, words related to resting”. He writes down these criteria in his recipe (query) and looks for tags (keys) on the potions of other words. If a tag is very similar to the criteria, he’ll pour a lot of that potion into his flask; if the tag is not similar, he’ll pour little or none of that potion.

So he finds that the tag for “bed” says “a flat piece of furniture”. That’s similar to “some flat surfaces” in his recipe! He pours the potion for “bed” into his flask. The potion (value) for “bed” contains information like “tired, restful, sleepy, sick”.

The alchemist for the word “lies” continues the search. He finds that the tag for the word “still” says “related to resting” (among other connotations of the word “still”). That’s related to his criterion “words related to resting”, so he pours in part of the potion from “still”, which contains information like “restful, silent, stationary”.

He looks at the tags of “on”, “Sarah”, “the”, and “feeling” and doesn’t find them relevant. So he doesn’t pour any of their potions into his flask.

Remember, he needs to check his own potion too. The tag of his own potion “lies” says “a verb related to resting”, which matches his recipe. So he pours some of his own potion into the flask as well, which contains information like “tired; dishonest; can have a positive connotation if it’s a white lie; …”.

By the end of his quest to check the words in the text, his flask is full.

Source: created by the author.

Unlike the original potion for “lies”, this mixed potion now takes into account the context of this very specific sentence. Namely, it has a lot of elements of “tired, exhausted” and only a pinch of “dishonest”.

On this quest, the alchemist knows to pay attention to the right words and combines the values of those relevant words. This is a metaphoric step for “attention”. We’ve just explained the most important equation in the Transformer, the underlying architecture of GPT:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V, where Q is Query, K is Key, and V is Value. Source: Attention Is All You Need
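For readers who want to see this equation in code, here is a minimal NumPy sketch of scaled dot-product attention (a single head, no masking yet); the function names are mine, not from any library. Fed with the Q, K, V from the earlier sketch, `attention(Q, K, V)` returns one mixed flask per word.

```python
import numpy as np

def softmax(scores):
    """Turn similarity scores into pouring proportions that sum to 1."""
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V: each row of the result is one
    alchemist's filled flask, a weighted mix of the value potions."""
    d_k = Q.shape[-1]
    similarity = Q @ K.T / np.sqrt(d_k)   # recipe-vs-tag match for every pair of words
    pour = softmax(similarity)            # how much of each potion to pour
    return pour @ V                       # the mixed potions, one per word
```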

Advanced notes:

1. Each alchemist looks at every bottle, including his own [Q·K.transpose()].

2. The alchemist can quickly match his recipe (query) with a tag (key) and make a fast decision. [The similarity between query and key is determined by a dot product, which is a fast operation.] Moreover, all the alchemists do their quests in parallel, which also helps speed things up. [Q·K.transpose() is a matrix multiplication, which is parallelizable. Speed is a winning feature of the Transformer, compared to its predecessor, the Recurrent Neural Network, which computes sequentially.]

3. The alchemist is picky. He only selects the top few potions, instead of blending in a bit of everything. [We use softmax to collapse Q·K.transpose(). Softmax pulls the inputs toward more extreme values and collapses many inputs to near zero.]

4. At this stage, the alchemist doesn’t take into account the ordering of words. Whether it’s “Sarah lies still on the bed, feeling” or “still bed the Sarah feeling on lies”, the filled flask (output of attention) will be the same. [In the absence of “positional encoding”, Attention(Q, K, V) is independent of word positions.]

5. The flask always comes back 100% filled, no more, no less. [The softmax is normalized to 1.]

6. The alchemist’s recipe and the potions’ tags must speak the same language. [The Query and Key must have the same dimension so that they can be dotted together to communicate. The Value can take on a different dimension if you wish.]

7. Technically astute readers may point out that we haven’t done masking. I don’t want to clutter the analogy with too many details, but I’ll explain it here. In (masked) self-attention, each word can only see the previous words. So in the sentence “Sarah lies still on the bed, feeling”, “lies” only sees “Sarah”; “still” only sees “Sarah” and “lies”. The alchemist of “still” can’t reach into the potions of “on”, “the”, “bed”, and “feeling”.
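Here is how the masking in note 7 might look in code: a toy sketch of my own (not GPT’s actual implementation) that blocks each word from pouring the potions of words that come after it, by setting those similarity scores to negative infinity before the softmax.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Masked self-attention: word i may only attend to words 0..i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" words; block them so the
    # softmax gives them exactly zero weight (exp(-inf) == 0).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V
```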

Feed Forward: Chemistry on the Mixed Potions

Up until this point, the alchemist has simply poured potions from other bottles. In other words, he pours the potion of “lies” (“tired; dishonest; …”) as a uniform mixture into the flask; he can’t distill out the “tired” part and discard the “dishonest” part just yet. [Attention is simply summing the different V’s together, weighted by the softmax.]

Source: created by the author.

Now comes the real chemistry (feed forward). The alchemist mixes everything together and does some synthesis. He notices interactions between words like “sleepy” and “restful”, etc. He also notices that “dishonesty” is only mentioned in a single potion. He knows from past experience how to make some ingredients interact with each other and how to discard the one-off ones. [The feed-forward layer is a linear (and then non-linear) transformation of the Value. The feed-forward layer is the building block of neural networks. You can think of it as the “thinking” step in the Transformer, while the earlier mixology step is simply “collecting”.]
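In code, this “chemistry” is just two linear transformations with a non-linearity in between, applied to each word’s mixed potion independently. This is a minimal sketch; the weight names and the choice of ReLU are placeholders (GPT models typically use GELU and a hidden size a few times larger than d_model).

```python
import numpy as np

def feed_forward(flasks, W1, b1, W2, b2):
    """Position-wise feed-forward block: expand, transform, project back.

    flasks: (n_words, d_model) mixed potions from the attention step.
    W1, b1: project up into a larger hidden space (the "synthesis" space).
    W2, b2: project back down to d_model, one refined potion per word.
    """
    hidden = np.maximum(0.0, flasks @ W1 + b1)  # linear + ReLU non-linearity
    return hidden @ W2 + b2                     # linear projection back
```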

The resulting potion after his processing becomes much more useful for the task of predicting the next word. Intuitively, it represents richer properties of this word in the context of its sentence, in contrast with the starting potion (value), which is out of context.

The Final Linear and Softmax Layer: the Assembly of Alchemists

How do we get from here to the final output, which is to predict that the next word after “Sarah lies still on the bed, feeling ___” is “tired”?

So far, each alchemist has been working independently, only tending to his own word. Now the alchemists of the different words assemble, stack their flasks in the original word order, and present them to the final linear and softmax layer of the Transformer. What do I mean by this? Here, we must depart from the metaphor.

This final linear layer synthesizes information across the different words. Based on pre-trained data, one plausible learning is that the immediately preceding word is important for predicting the next word. For instance, the linear layer might focus heavily on the last flask (“feeling”’s flask).

Then, combined with the softmax layer, this step assigns every word in our vocabulary a probability of being the next word after “Sarah lies still on the bed, feeling…”. For instance, non-English words will receive probabilities near 0. Words like “tired”, “sleepy”, and “exhausted” will receive high probabilities. We then pick the top winner as the final answer.
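As a sketch of this last step (with a hypothetical weight matrix and a toy vocabulary; real GPT vocabularies contain tens of thousands of sub-word tokens), the final linear layer turns the last position’s processed flask into one score per vocabulary word, and softmax turns those scores into probabilities:

```python
import numpy as np

def predict_next_word(h_last, W_vocab, vocab):
    """h_last: processed flask of the last position ("feeling"), shape (d_model,).
    W_vocab: hypothetical learned matrix of shape (d_model, len(vocab))."""
    logits = h_last @ W_vocab                      # one score per vocabulary word
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]            # pick the top winner

# Tiny illustrative usage with random stand-in weights (so the output here
# is arbitrary; with trained weights we'd hope for something like "tired").
rng = np.random.default_rng(0)
vocab = ["tired", "sleepy", "exhausted", "the", "banana"]
print(predict_next_word(rng.normal(size=16), rng.normal(size=(16, len(vocab))), vocab))
```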

Source: created by the author.

Recap

Now you’ve built a minimalist GPT!

To recap: in the attention step, you determine which words (including itself) each word should pay attention to, based on how well that word’s query (recipe) matches the other words’ keys (tags). You mix together those words’ values (potions) in proportion to the attention that word pays to them. You process this mixture to do some “thinking” (feed forward). Once each word is processed, you then combine the mixtures from all the words to do more “thinking” (linear layer) and make the final prediction of what the next word should be.
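To tie the recap together, here is one self-contained NumPy sketch of the whole forward pass, with random stand-ins for every learned weight and with the real model’s details (multiple heads, many stacked layers, residual connections, layer norm, positional encodings) deliberately left out:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy sizes and random stand-ins for learned weights.
n_words, d_model, d_k, d_ff, vocab_size = 7, 16, 8, 32, 1000
x = rng.normal(size=(n_words, d_model))       # embeddings of the prompt words

W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_model))
W_1 = rng.normal(size=(d_model, d_ff))
W_2 = rng.normal(size=(d_ff, d_model))
W_vocab = rng.normal(size=(d_model, vocab_size))

# 1. Masked self-attention: every alchemist fills his flask.
Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_k)
future = np.triu(np.ones((n_words, n_words), dtype=bool), k=1)
flasks = softmax(np.where(future, -np.inf, scores)) @ V

# 2. Feed forward: the "chemistry" on each flask, word by word.
h = np.maximum(0.0, flasks @ W_1) @ W_2

# 3. Final linear + softmax: score every word in the vocabulary using
#    the last position's flask, and pick the most likely next word.
probs = softmax(h[-1] @ W_vocab)
print("predicted token id:", int(np.argmax(probs)))
```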

Source: created by the author.

Side note: the term “decoder” is a vestige from the original paper, as the Transformer was first used for machine translation tasks. You “encode” the source language into embeddings, and “decode” from the embeddings into the target language.
