We don’t want to completely replace the value of v_Riley with v_dog, so let’s say that we take a linear combination of v_Riley and v_dog as the new value for v_Riley:
v_Riley = get_value('Riley')
v_dog = get_value('dog')
ratio = .75
v_Riley = (ratio * v_Riley) + ((1-ratio) * v_dog)
This seems to work alright: we’ve embedded a little bit of the meaning of the word “dog” into the word “Riley”.
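To make the blend concrete, here is a minimal runnable sketch using NumPy. The three-dimensional vectors are made-up stand-ins for whatever get_value would return, and the helper name blend is hypothetical, so treat the numbers as purely illustrative.

```python
import numpy as np

# Hypothetical toy value vectors; real ones would come from get_value().
v_riley = np.array([1.0, 0.0, 0.0])
v_dog = np.array([0.0, 1.0, 0.0])


def blend(v_target, v_source, ratio):
    """Keep `ratio` of the target vector and mix in (1 - ratio) of the source."""
    return ratio * v_target + (1 - ratio) * v_source


# With ratio = 0.75, the result is 75% "Riley" and 25% "dog".
new_v_riley = blend(v_riley, v_dog, ratio=0.75)
print(new_v_riley)
```

A ratio of 1 leaves the target untouched, and a ratio of 0 replaces it entirely with the source.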
Now we would like to apply this form of attention to the whole sentence, updating the vector representation of each word using the vector representations of every other word.
What goes wrong here?
The core problem is that we don’t know which words should take on the meanings of other words. We would also like some measure of how much the value of each word should contribute to each other word.
Alright. So we need to know how much two words should be related.
Time for attempt number 2.
I’ve redesigned our vector database so that each word actually has two associated vectors. The first is the same value vector that we had before, still denoted by v. In addition, we now have unit vectors denoted by k that store some notion of word relations. Specifically, if two k vectors are close together, it means that the values associated with these words are likely to influence each other’s meanings.
With our new k and v vectors, how can we modify our previous scheme to update v_Riley’s value with v_dog in a way that respects how much two words are related?
Let’s continue with the same linear combination business as before, but only if the k vectors of both words are close in embedding space. Even better, we can use the dot product of the two k vectors (which we’ll take to range from 0 to 1; for unit vectors in general it can go as low as -1) to tell us how much we should update v_Riley with v_dog.
v_Riley, v_dog = get_value('Riley'), get_value('dog')
k_Riley, k_dog = get_key('Riley'), get_key('dog')
relevance = k_Riley · k_dog # dot product
# weight v_dog by relevance, so more closely related words share more meaning
v_Riley = (1 - relevance) * v_Riley + (relevance) * v_dog
This is a little bit strange, since if relevance is 1, v_Riley gets completely replaced by v_dog, but let’s ignore that for a minute.
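Here is a small runnable sketch of that relevance-weighted update. It follows the prose above in using relevance as the weight on v_dog. The two-dimensional vectors and the unit helper are assumptions made for illustration, and the keys are chosen with non-negative entries so that the dot product lands between 0 and 1.

```python
import numpy as np


def unit(x):
    """Scale a vector to length 1."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)


# Hypothetical toy vectors standing in for get_value() and get_key().
v_riley, v_dog = np.array([1.0, 0.0]), np.array([0.0, 1.0])
k_riley, k_dog = unit([1.0, 1.0]), unit([1.0, 0.0])

# For unit vectors, the dot product is the cosine of the angle between them.
relevance = float(k_riley @ k_dog)

# Higher relevance mixes more of v_dog into v_riley.
new_v_riley = (1 - relevance) * v_riley + relevance * v_dog
print(relevance, new_v_riley)
```

Identical keys give a relevance of 1 and a full replacement, which is the odd edge case mentioned above. Perpendicular keys give 0 and leave v_riley unchanged.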
I want to instead think about what happens when we apply this kind of idea to the whole sequence. The word “Riley” will have a relevance value with each other word via the dot product of their k vectors. So, maybe we can instead update the value of each word in proportion to the value of the dot product. For simplicity, let’s also include its dot product with itself as a way to preserve its own value.
sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
# obtain a list of values
values = get_values(words)
# oh yeah, that's what k stands for, by the way
keys = get_keys(words)
# get riley's relevance key
riley_index = words.index('Riley')
riley_key = keys[riley_index]
# generate relevance of "Riley" to each other word
relevances = [riley_key · key for key in keys] #still pretending python has ·
# normalize relevances to sum to 1
relevances /= sum(relevances)
# takes a linear combination of values, weighted by relevances
v_Riley = relevances · values
Okay, that’s good enough for now.
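Since Python does not actually have a · operator, here is one way the same idea could run with NumPy. The random embeddings and the dimension of 4 are assumptions standing in for get_values and get_keys. The entries are non-negative so that every dot product is non-negative and dividing by the sum behaves sensibly.

```python
import numpy as np

rng = np.random.default_rng(0)
words = "Evan's dog Riley is so hyper, she never stops moving".split()
dim = 4  # hypothetical embedding size

# Made-up embeddings; non-negative entries keep every dot product non-negative.
values = rng.random((len(words), dim))
keys = rng.random((len(words), dim))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # make each key a unit vector

riley_index = words.index("Riley")

# Dot product of Riley's key with every key, including its own.
relevances = keys @ keys[riley_index]
# Normalize so the weights sum to 1.
relevances = relevances / relevances.sum()

# Weighted average of all value vectors.
new_v_riley = relevances @ values
print(relevances.round(3))
```

Because the keys are unit vectors, the dot product of Riley's key with itself is 1, the largest possible score, so the word's own value always gets the biggest weight.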
But once again, I claim that there’s something wrong with this approach. It’s not that any of our ideas have been implemented incorrectly, but rather that there’s something fundamentally different between this approach and how we actually think about relationships between words.
If there’s any point in this article where I think you should stop and think, it’s here. That goes even for those of you who think you fully understand attention. What’s wrong with our approach?
Relationships between words are inherently asymmetric! The way that “Riley” attends to “dog” is different from the way that “dog” attends to “Riley”. It’s a much bigger deal that “Riley” refers to a dog, not a human, than what the name of the dog is.
In contrast, the dot product is a symmetric operation, which means that in our current setup, if a attends to b, then b attends equally strongly to a! Actually, this is somewhat false because we’re normalizing the relevance scores, but the point is that the words should have the option of attending in an asymmetric way, even when the other tokens are held constant.
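The symmetry is easy to check numerically. The sketch below uses made-up random keys for five tokens (an assumption for illustration) and compares the raw key-to-key scores with their row-normalized version.

```python
import numpy as np

rng = np.random.default_rng(0)
keys = rng.random((5, 4))  # hypothetical unit keys for 5 tokens
keys /= np.linalg.norm(keys, axis=1, keepdims=True)

# raw[i, j] is the dot product of key i with key j.
raw = keys @ keys.T
raw_is_symmetric = np.allclose(raw, raw.T)

# Each row is divided by its own total, so exact symmetry is lost.
normalized = raw / raw.sum(axis=1, keepdims=True)
normalized_is_symmetric = np.allclose(normalized, normalized.T)

print(raw_is_symmetric, normalized_is_symmetric)
```

The raw scores are symmetric, and the normalized ones differ only because each row is divided by a different total. That difference depends on the other tokens in the sentence, not on the pair of words itself, which is why normalization alone does not give real asymmetry.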
Part 3
We’re almost there! Finally, the question becomes:
How can we most naturally extend our current setup to allow for asymmetric relationships?
Well, what can we do with one more vector type? We still have our value vectors v and our relation vectors k. Now we have yet another vector q for each token.
How can we modify our setup and use q to achieve the asymmetric relationship that we want?
Those of you who are familiar with how self-attention works will hopefully be smirking at this point.
Instead of computing relevance k_dog · k_Riley when “dog” attends to “Riley”, we can instead query q_dog against the key k_Riley by taking their dot product. When computing the other way around, we will have q_Riley · k_dog instead: asymmetric relevance!
Here’s the whole thing together, computing the update for every value at once!
from numpy import array  # assuming array here is NumPy's

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
seq_len = len(words)
# obtain arrays of queries, keys, and values, each of shape (seq_len, n)
Q = array(get_queries(words))
K = array(get_keys(words))
V = array(get_values(words))
relevances = Q @ K.T
# keepdims=True so that each row is divided by its own sum
normalized_relevances = relevances / relevances.sum(axis=1, keepdims=True)
new_V = normalized_relevances @ V
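For anyone who wants to run this end to end, here is a self-contained sketch. The random arrays are assumptions standing in for get_queries, get_keys, and get_values, and n = 4 is an arbitrary embedding size. Entries are non-negative so that dividing by the row sum is well behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
words = "Evan's dog Riley is so hyper, she never stops moving".split()
seq_len, n = len(words), 4  # n is a hypothetical embedding size

# Made-up non-negative embeddings in place of the real lookups.
Q = rng.random((seq_len, n))
K = rng.random((seq_len, n))
V = rng.random((seq_len, n))

# relevances[i, j] is query i dotted with key j, shape (seq_len, seq_len).
relevances = Q @ K.T
# keepdims=True keeps the row sums as a column, so each row is divided by its own total.
weights = relevances / relevances.sum(axis=1, keepdims=True)
# Row i of new_V is the updated value for word i.
new_V = weights @ V

dog, riley = words.index("dog"), words.index("Riley")
# The two directions use different query and key pairs, so they generally differ.
print(relevances[dog, riley], relevances[riley, dog])
```

Dividing by the row sum only works as a weighting when the scores are non-negative. With embeddings that can produce negative dot products, this step would need a different normalization.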


