We don’t want to completely replace the value of v_Riley with v_dog, so let’s say that we take a linear combination of v_Riley and v_dog as the new value for v_Riley:
v_Riley = get_value('Riley')
v_dog = get_value('dog')
ratio = .75
v_Riley = (ratio * v_Riley) + ((1-ratio) * v_dog)
This seems to work alright: we’ve embedded a little bit of the meaning of the word “dog” into the word “Riley”.
Now we’d like to apply this form of attention to the whole sentence, updating the vector representation of each word using the vector representations of every other word.
What goes wrong here?
The core problem is that we don’t know which words should take on the meanings of other words. We’d also like some measure of how much the value of each word should contribute to each other word.
Alright. So we need to know how much two words should be related.
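To make the blend concrete, here is a minimal runnable sketch. The 2-dimensional vectors are made-up toy numbers standing in for whatever get_value would return (that lookup is never defined in this article), and blend is a hypothetical helper name.

```python
import numpy as np

def blend(v_target, v_source, ratio=0.75):
    """Keep `ratio` of v_target and mix in (1 - ratio) of v_source."""
    return ratio * v_target + (1 - ratio) * v_source

# Toy stand-ins for get_value('Riley') and get_value('dog'); the numbers are arbitrary.
v_Riley = np.array([1.0, 0.0])
v_dog = np.array([0.0, 1.0])

# Riley keeps 75% of its own value and picks up 25% of dog's.
v_Riley_new = blend(v_Riley, v_dog)
print(v_Riley_new)
```

With ratio set to 1 the word keeps its original value, and with ratio set to 0 it is replaced outright.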
Time for attempt number 2.
I’ve redesigned our vector database so that each word actually has two associated vectors. The first is the same value vector that we had before, still denoted by v. In addition, we now have unit vectors denoted by k that store some notion of word relations. Specifically, if two k vectors are close together, it means that the values associated with these words are likely to influence each other’s meanings.
With our new k and v vectors, how can we modify our previous scheme to update v_Riley’s value with v_dog in a way that respects how much two words are related?
Let’s continue with the same linear combination business as before, but only if the k vectors of both words are close in embedding space. Even better, we can use the dot product of the two k vectors (which ranges from -1 to 1 for unit vectors; let’s assume our k vectors never point away from each other, so it stays within 0–1) to tell us how much we should update v_Riley with v_dog.
v_Riley, v_dog = get_value('Riley'), get_value('dog')
k_Riley, k_dog = get_key('Riley'), get_key('dog')
relevance = k_Riley · k_dog # dot product
# relevance is the weight on v_dog, so a relevance of 1 replaces v_Riley entirely
v_Riley = (1 - relevance) * v_Riley + (relevance) * v_dog
This is a little bit strange, since if relevance is 1, v_Riley gets completely replaced by v_dog, but let’s ignore that for a minute.
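Here is a small runnable sketch of that relevance-weighted update. The vectors are made-up toy numbers, unit and relevance_update are hypothetical helper names, and the keys are assumed to be non-negative so that the dot product lands between 0 and 1. The relevance is used as the weight on v_dog, so a relevance of 1 replaces v_Riley entirely, as noted above.

```python
import numpy as np

def unit(x):
    """Scale a vector to length 1."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x)

def relevance_update(v_target, v_source, k_target, k_source):
    """Blend v_source into v_target, weighted by the dot product of the keys."""
    relevance = float(np.dot(k_target, k_source))
    return (1 - relevance) * v_target + relevance * v_source

# Toy stand-ins for get_value / get_key; the numbers are arbitrary.
v_Riley, v_dog = np.array([1.0, 0.0]), np.array([0.0, 1.0])
k_Riley, k_dog = unit([1.0, 1.0]), unit([1.0, 0.0])

# The keys are 45 degrees apart, so the relevance is cos(45°), roughly 0.707.
v_Riley_new = relevance_update(v_Riley, v_dog, k_Riley, k_dog)
print(v_Riley_new)
```

Orthogonal keys give a relevance of 0 and leave v_Riley untouched, while identical keys swap it out for v_dog completely.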
I want to instead think about what happens when we apply this kind of idea to the whole sequence. The word “Riley” will have a relevance value with each other word via the dot product of their k vectors. So, maybe we can instead update the value of each word proportionally to the value of the dot product. For simplicity, let’s also include its dot product with itself as a way to preserve its own value.
sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
# obtain a list of values
values = get_values(words)
# oh yeah, this is what k stands for, by the way
keys = get_keys(words)
# get riley's relevance key
riley_index = words.index('Riley')
riley_key = keys[riley_index]
# generate relevance of "Riley" to each other word
relevances = [riley_key · key for key in keys] # still pretending python has ·
# normalize relevances to sum to 1
relevances /= sum(relevances)
# takes a linear combination of values, weighted by relevances
v_Riley = relevances · values
Okay, that’s good enough for now.
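Since Python doesn’t actually have ·, here is a runnable sketch of the same idea using NumPy’s @ operator. The values and keys are random non-negative stand-ins for get_values and get_keys (which this article never defines), and the embedding size n is picked arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
n = 4  # embedding size, chosen arbitrarily for this sketch

# Random non-negative stand-ins for get_values / get_keys.
values = rng.random((len(words), n))
keys = rng.random((len(words), n))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # make each key a unit vector

riley_index = words.index("Riley")
relevances = keys @ keys[riley_index]        # dot product of Riley's key with every key
relevances = relevances / relevances.sum()   # normalize relevances to sum to 1
v_Riley = relevances @ values                # linear combination of all values
```

Because the keys are unit vectors, the largest relevance is always the one between “Riley” and itself, which is what preserves some of its own value.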
But once again, I claim that there’s something wrong with this approach. It’s not that any of our ideas have been implemented incorrectly, but rather that there’s something fundamentally different between this approach and how we actually think about relationships between words.
If there’s any point in this article where I think you should stop and think, it’s here. Even those of you who think you fully understand attention. What’s wrong with our approach?
Relationships between words are inherently asymmetric! The way that “Riley” attends to “dog” is different from the way that “dog” attends to “Riley”. It’s a much bigger deal that “Riley” refers to a dog, not a human, than what the name of the dog is.
In contrast, the dot product is a symmetric operation, which means that in our current setup, if a attends to b, then b attends equally strongly to a! Actually, this is somewhat false because we’re normalizing the relevance scores, but the point is that the words should have the option of attending in an asymmetric way, even if the other tokens are held constant.
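A quick check of both claims, as a sketch with three hand-picked toy keys (the numbers are arbitrary): the raw key-key dot products form a symmetric matrix, and normalizing each row breaks that symmetry only as a side effect of the differing row sums.

```python
import numpy as np

# Three toy key vectors, scaled to unit length.
K = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
K /= np.linalg.norm(K, axis=1, keepdims=True)

raw = K @ K.T                                        # raw[i, j] = k_i · k_j
normalized = raw / raw.sum(axis=1, keepdims=True)    # each row sums to 1

raw_is_symmetric = np.allclose(raw, raw.T)
normalized_is_symmetric = np.allclose(normalized, normalized.T)
print(raw_is_symmetric, normalized_is_symmetric)
```

The asymmetry after normalization comes entirely from the other tokens in the sequence, not from anything the two words themselves get to decide.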
Part 3
We’re almost there! Finally, the question becomes:
How can we most naturally extend our current setup to allow for asymmetric relationships?
Well, what can we do with one more vector type? We still have our value vectors v and our relation vectors k. Now we have yet another vector q for each token.
How can we modify our setup and use q to achieve the asymmetric relationship that we want?
Those of you who are familiar with how self-attention works will hopefully be smirking at this point.
Instead of computing relevance as k_dog · k_Riley when “dog” attends to “Riley”, we can instead query q_dog against the key k_Riley by taking their dot product. When computing the other way around, we will have q_Riley · k_dog instead: asymmetric relevance!
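As a tiny illustration, here are two tokens with hand-picked toy queries and keys. The numbers are arbitrary and chosen only so that the two directions come out different; nothing here comes from a real embedding.

```python
import numpy as np

# Hypothetical toy queries and keys for two tokens; the numbers are arbitrary.
q = {"Riley": np.array([1.0, 0.0]), "dog": np.array([0.2, 0.1])}
k = {"Riley": np.array([0.5, 0.5]), "dog": np.array([0.9, 0.1])}

riley_attends_to_dog = float(q["Riley"] @ k["dog"])   # q_Riley · k_dog
dog_attends_to_riley = float(q["dog"] @ k["Riley"])   # q_dog · k_Riley
print(riley_attends_to_dog, dog_attends_to_riley)
```

Each token now gets to say what it is looking for (its query) separately from what it offers to others (its key), so the two relevances no longer have to match.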
Here’s the whole thing together, computing the update for every value at once!
sentence = "Evan's dog Riley is so hyper, she never stops moving"
words = sentence.split()
seq_len = len(words)
# obtain arrays of queries, keys, and values, each of shape (seq_len, n)
Q = array(get_queries(words))
K = array(get_keys(words))
V = array(get_values(words))
relevances = Q @ K.T
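# keepdims=True keeps the row sums as a column, so each row is divided by its own sum
normalized_relevances = relevances / relevances.sum(axis=1, keepdims=True)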
new_V = normalized_relevances @ V
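For anyone who wants to run this, here is a self-contained sketch of the same computation. The function name simple_attention is hypothetical, the embedding size n is arbitrary, and Q, K, and V are random non-negative stand-ins for get_queries, get_keys, and get_values, none of which are defined in this article.

```python
import numpy as np

def simple_attention(Q, K, V):
    """Update every value at once using sum-normalized query-key dot products."""
    relevances = Q @ K.T                                          # (seq_len, seq_len)
    weights = relevances / relevances.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ V                                            # (seq_len, n)

rng = np.random.default_rng(0)
words = "Evan's dog Riley is so hyper, she never stops moving".split()
seq_len, n = len(words), 4  # n is an arbitrary embedding size

# Random non-negative stand-ins for get_queries / get_keys / get_values.
Q = rng.random((seq_len, n))
K = rng.random((seq_len, n))
V = rng.random((seq_len, n))

new_V = simple_attention(Q, K, V)
print(new_V.shape)
```

Dividing by the row sum only behaves like a weighted average when the relevances are non-negative, which the toy data here guarantees.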


