Diving into Word Embeddings with EDA


Visualizing unexpected insights in text data

When starting work with a brand new dataset, it's always a good idea to begin with some exploratory data analysis (EDA). Taking the time to understand your data before training any fancy models can help you learn the structure of the dataset, discover any obvious issues, and apply domain-specific knowledge.

You see EDA in various forms applied to everything from house prices to advanced applications in the data science industry. But I still haven't seen it applied to the hottest new kind of dataset: word embeddings, the basis of our best large language models. So why not try it?

In this article, we'll apply EDA to GloVe word embeddings, using techniques like covariance matrices, clustering, PCA, and vector math. This will help us understand the structure of word embeddings, giving us a useful starting point for building more powerful models with this data. As we uncover this structure, we'll find that it's not always what it seems, and that some surprising biases are hidden in the corpus.

You will need:

  • Basic understanding of linear algebra, statistics, and vector mathematics
  • Python packages: numpy, sklearn, and matplotlib
  • About 3 GB of spare disk space

To start, download the dataset at huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip [1]. This contains four text files, each containing a list of words along with their vector representations. We will use the 300-dimensional representations (glove.6B.300d.txt).

A quick note on where this dataset comes from: essentially, this is a list of word embeddings derived from 6 billion tokens' worth of co-occurrence data from Wikipedia and various news sources. A useful side effect of using co-occurrence is that words with similar meanings tend to be close together. For example, since "the red bird" and "the blue bird" are both valid sentences, we would expect the vectors for "red" and "blue" to be close to each other. For more technical information, you can check the original GloVe paper [1].

To be clear, these are not word embeddings trained for the purpose of large language models. They come from a completely unsupervised technique based on a large corpus. But they display many properties similar to language model embeddings, and they are interesting in their own right.

Each line of this text file consists of a word, followed by all 300 vector components of the associated embedding, separated by spaces. We can load that in with Python. (To reduce noise and speed things up, I'm using the top 10% of the full dataset here with the //10, but you can change that if you'd like.)

import numpy as np

embeddings = {}

with open("glove.6B/glove.6B.300d.txt", "r") as f:
    glove_content = f.read().split('\n')

for i in range(len(glove_content)//10):
    line = glove_content[i].strip().split(' ')
    if line[0] == '':
        continue  # skip empty lines
    word = line[0]
    embedding = np.array(list(map(float, line[1:])))
    embeddings[word] = embedding

print(len(embeddings))

That leaves us with 40,000 embeddings loaded in.

One natural question we might ask is: are vectors generally close to other vectors with similar meaning? And as a follow-up question, how do we quantify this?

There are two main ways we will quantify similarity between vectors. One is Euclidean distance, which is simply the natural Pythagorean-theorem distance we're familiar with. The other is cosine similarity, which measures the cosine of the angle between two vectors. A vector has a cosine similarity of 1 with itself, -1 with an opposite vector, and 0 with an orthogonal vector.

Let’s implement these in NumPy:

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euc_dist(a, b):
    return np.sum(np.square(a - b))  # no need for the square root since we are only ranking distances

Now we can find all the closest vectors to a given word or embedding vector! We'll do this in increasing order of similarity.

def get_sims(to_word=None, to_e=None, metric=cos_sim):
    # list all similarities to the word to_word, OR the embedding vector to_e
    assert (to_word is not None) ^ (to_e is not None)  # find similarity to a word or a vector, not both
    sims = []
    if to_e is None:
        to_e = embeddings[to_word]  # get the embedding for the word we're comparing against
    for word in embeddings:
        if word == to_word:
            continue  # don't compare a word to itself
        word_e = embeddings[word]
        sim = metric(word_e, to_e)
        sims.append((sim, word))
    sims.sort()
    return sims

Now we can write a function to display the ten most similar words. It will be useful to include a reverse option as well, so we can display the least similar words.

def display_sims(to_word=None, to_e=None, n=10, metric=cos_sim, reverse=False, label=None):
    assert (to_word is not None) ^ (to_e is not None)
    sims = get_sims(to_word=to_word, to_e=to_e, metric=metric)
    display = lambda sim: f'{sim[1]}: {sim[0]:.5f}'
    if label is None:
        label = to_word.upper() if to_word is not None else ''
    print(label)  # a heading so we know what these similarities are for
    if reverse:
        sims.reverse()
    for i, sim in enumerate(reversed(sims[-n:])):
        print(i+1, display(sim))
    return sims

Finally, we can test it!

display_sims(to_word='red')
# yellow, blue, pink, green, white, purple, black, coloured, sox, shiny

Looks like the Boston Red Sox made an unexpected appearance here. But apart from that, this is about what we would expect.

Maybe we can try some verbs, and not just nouns and adjectives? How about a nice and kind verb like "share"?

display_sims(to_word='share')
# shares, stock, profit, percent, shared, earnings, profits, price, gain, cents

I suppose "share" isn't often used as a verb in this dataset. Oh well.

We can try some more conventional examples as well:

display_sims(to_word='cat')
# dog, cats, pet, dogs, feline, monkey, horse, pets, rabbit, leopard
display_sims(to_word='frog')
# toad, frogs, snake, monkey, squirrel, species, rodent, parrot, spider, rat
display_sims(to_word='queen')
# elizabeth, princess, king, monarch, royal, majesty, victoria, throne, lady, crown

One of the fascinating properties of word embeddings is that analogy is built in using vector math. The example from the GloVe paper is king – queen = man – woman. In other words, rearranging the equation, we expect king = man – woman + queen. Is this true?

display_sims(to_e=embeddings['man'] - embeddings['woman'] + embeddings['queen'], label='king-queen analogy')
# queen, king, ii, majesty, monarch, prince...

Not quite: the closest vector to man – woman + queen appears to be queen (cosine similarity 0.78), followed somewhat distantly by king (cosine similarity 0.66). Inspired by this excellent 3Blue1Brown video, we might try aunt and uncle instead:

display_sims(to_e=embeddings['aunt'] - embeddings['woman'] + embeddings['man'], label='aunt-uncle analogy')
# aunt, uncle, brother, grandfather, grandmother, cousin, uncles, grandpa, dad, father

This is better (cosine similarity 0.7348 vs. 0.7344), but still doesn't work perfectly. We can try switching to Euclidean distance instead. Now we need to set reverse=True, because a higher Euclidean distance is actually a lower similarity.

display_sims(to_e=embeddings['aunt'] - embeddings['woman'] + embeddings['man'], metric=euc_dist, reverse=True, label='aunt-uncle analogy')
# uncle, aunt, grandfather, brother, cousin, grandmother, newphew, dad, grandpa, cousins

Now we got it. But it seems the analogy math may not be as perfect as we hoped, at least in the naïve way we're doing it here.
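One common trick from the word-analogy literature (which the code above doesn't use) is to exclude the input words themselves from the candidate list before ranking, so trivial answers like queen can't win by default. Here's a minimal sketch of that idea built on our get_sims function; I won't promise it rescues every analogy, but it removes the trivial answers:

def display_analogy(a, b, c, n=10, metric=cos_sim):
    # rank candidates for "a is to b as c is to ?", excluding the three input words themselves
    target = embeddings[b] - embeddings[a] + embeddings[c]
    sims = [s for s in get_sims(to_e=target, metric=metric) if s[1] not in (a, b, c)]
    for i, sim in enumerate(reversed(sims[-n:])):
        print(i+1, f'{sim[1]}: {sim[0]:.5f}')

display_analogy('woman', 'man', 'queen')  # "woman is to man as queen is to ?"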

Cosine similarity is all about the angles between vectors. But is the magnitude of a vector also important?

We can reuse our existing code by expressing magnitude as the Euclidean distance from the zero vector. Let's see which words have the largest and smallest magnitudes:

zero_vec = np.zeros_like(embeddings['the'])
display_sims(to_e=zero_vec, metric=euc_dist, label='largest magnitude')
# republish, nonsubscribers, hushen, tael, www.star, stoxx, 202-383-7824, resend, non-families, 225-issue
display_sims(to_e=zero_vec, metric=euc_dist, reverse=True, label='smallest magnitude')
# likewise, lastly, interestingly, sarcastically, incidentally, furthermore, conversely, moreover, aforementioned, wherein

It doesn't look like there's much of a pattern to the meanings of the large-magnitude vectors, but they all appear to have very specific (and sometimes confusing) meanings. On the other hand, the smallest-magnitude vectors tend to be very common words that can be found in a wide variety of contexts.

There's a huge range of magnitudes: from about 2.6 for the smallest vector all the way to about 17 for the largest. What does this distribution look like? We can plot a histogram to get a better picture.

import matplotlib.pyplot as plt

def plot_magnitudes():
    words = [w for w in embeddings]
    magnitude = lambda word: np.linalg.norm(embeddings[word])
    magnitudes = list(map(magnitude, words))
    plt.hist(magnitudes, bins=40)
    plt.show()

plot_magnitudes()

A histogram of magnitudes of our word embeddings

This distribution looks roughly normal. If we wanted to test this further, we could use a Q-Q plot. But for our purposes right now, this is fine.
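If you do want to check normality, here's a minimal sketch of that Q-Q plot using scipy.stats (an extra dependency not used elsewhere in this article). Roughly speaking, the closer the points hug the reference line, the closer the distribution is to normal.

from scipy import stats

def plot_magnitude_qq():
    magnitudes = [np.linalg.norm(embeddings[w]) for w in embeddings]
    stats.probplot(magnitudes, dist="norm", plot=plt)  # Q-Q plot against a fitted normal distribution
    plt.show()

plot_magnitude_qq()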

It turns out that directions and subspaces in vector embeddings can encode various kinds of concepts, often in biased ways. This paper [2] studied how this works for gender bias.

We can replicate this idea in our GloVe embeddings, too. First, let's find the direction of the concept of "masculinity". We can accomplish this by taking the average of differences between vectors like he and she, man and woman, and so on:

gender_pairs = [('man', 'woman'), ('men', 'women'), ('brother', 'sister'), ('he', 'she'),
                ('uncle', 'aunt'), ('grandfather', 'grandmother'), ('boy', 'girl'),
                ('son', 'daughter')]

masc_v = np.zeros_like(zero_vec)  # start from a fresh array so we don't modify zero_vec in place
for pair in gender_pairs:
    masc_v += embeddings[pair[0]]
    masc_v -= embeddings[pair[1]]
# we skip dividing by the number of pairs: only the direction matters for cosine similarity

Now we can find the "most masculine" and "most feminine" vectors, as judged by the embedding space.

display_sims(to_e=masc_v, metric=cos_sim, label='masculine vecs')
# brother, colonel, himself, uncle, gen., nephew, brig., brothers, son, sir
display_sims(to_e=masc_v, metric=cos_sim, reverse=True, label='feminine vecs')
# actress, herself, businesswoman, chairwoman, pregnant, she, her, sister, actresses, woman

Now we can run a simple test to detect bias in the dataset: compute the similarity between nurse and each of man and woman. Theoretically, these should be about equal: nurse is not a gendered word. Is this true?

print("nurse - man", cos_sim(embeddings['nurse'], embeddings['man'])) # 0.24
print("nurse - woman", cos_sim(embeddings['nurse'], embeddings['woman'])) # 0.45

That's a fairly large difference! (Remember, cosine similarity runs from -1 to 1, with positive associations in the range 0 to 1.) For reference, 0.45 is also close to the cosine similarity between cat and leopard.
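We can push on this a little more by projecting a handful of occupation words onto the masc_v direction we built earlier. This is just a quick sketch in the spirit of the debiasing paper, not a rigorous audit, and the word list below is my own arbitrary choice (any word missing from our 40,000-word subset is simply skipped):

occupations = ['nurse', 'doctor', 'engineer', 'teacher', 'secretary', 'programmer']
for word in occupations:
    if word not in embeddings:
        continue  # skip anything outside the subset we loaded
    # positive values lean toward the masculine direction, negative toward the feminine
    print(f"{word}: {cos_sim(embeddings[word], masc_v):.3f}")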

Let's see if we can cluster words with similar meanings using k-means clustering. This is easy to do with the package scikit-learn. We're going to use 300 clusters, which sounds like a lot, but trust me: almost all of the clusters are so interesting, you could write a whole article just interpreting them!

from sklearn.cluster import KMeans

def get_kmeans(n=300):
    kmeans = KMeans(n_clusters=n, n_init=1)
    X = np.array([embeddings[w] for w in embeddings])
    kmeans.fit(X)
    return kmeans

def display_kmeans(kmeans):
    # print all clusters and 5 associated words for each
    words = np.array([w for w in embeddings])
    X = np.array([embeddings[w] for w in embeddings])
    y = kmeans.predict(X)  # get the cluster index for each word
    for cluster in range(kmeans.cluster_centers_.shape[0]):
        print(f'KMeans {cluster}')
        cluster_words = words[y == cluster]  # get all words in this cluster
        for i, w in enumerate(cluster_words[:5]):
            print(i+1, w)

kmeans = get_kmeans()
display_kmeans(kmeans)

There's a lot to look at here. We have clusters for things as diverse as New York City (manhattan, n.y., brooklyn, hudson, borough), molecular biology (protein, proteins, enzyme, beta, molecules), and Indian names (singh, ram, gandhi, kumar, rao).

But sometimes these clusters are not what they seem. Let's write code to display all words of the cluster containing a given word, along with the nearest and farthest clusters.

def get_kmeans_cluster(kmeans, word=None, cluster=None):
    # given a word, find the cluster of that word. (or start with a cluster index.)
    # then, get all words of that cluster.
    assert (word is None) ^ (cluster is None)
    if cluster is None:
        cluster = kmeans.predict([embeddings[word]])[0]
    words = np.array([w for w in embeddings])
    X = np.array([embeddings[w] for w in embeddings])
    y = kmeans.predict(X)
    cluster_words = words[y == cluster]
    return cluster, cluster_words

def display_cluster(kmeans, word):
    cluster, cluster_words = get_kmeans_cluster(kmeans, word=word)
    # print all words in the cluster
    print(f"Full KMeans ({word}, cluster {cluster})")
    for i, w in enumerate(cluster_words):
        print(i+1, w)
    # rank all other clusters by Euclidean distance of their centers from this cluster's center
    distances = np.sum(np.square(kmeans.cluster_centers_ - kmeans.cluster_centers_[cluster]), axis=1)
    distances[cluster] = np.inf  # exclude this cluster itself when looking for the nearest
    nearest = np.argmin(distances, axis=0)
    _, nearest_words = get_kmeans_cluster(kmeans, cluster=nearest)
    print(f"Nearest cluster: {nearest}")
    for i, w in enumerate(nearest_words[:5]):
        print(i+1, w)
    distances[cluster] = -np.inf  # and exclude it again when looking for the farthest
    farthest = np.argmax(distances, axis=0)
    print(f"Farthest cluster: {farthest}")
    _, farthest_words = get_kmeans_cluster(kmeans, cluster=farthest)
    for i, w in enumerate(farthest_words[:5]):
        print(i+1, w)

Now let's try out this code.

display_cluster(kmeans, 'animal')
# species, fish, wild, dog, bear, males, birds...
display_cluster(kmeans, 'dog')
# same as 'animal'
display_cluster(kmeans, 'birds')
# same again
display_cluster(kmeans, 'bird')
# spread, bird, flu, virus, tested, humans, outbreak, infected, sars....?

You might not get exactly this result every time: the clustering algorithm is non-deterministic. But much of the time, "bird" is associated with disease words rather than animal words. It seems the original dataset tends to use the word "bird" in the context of disease vectors.
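If you want your clusters to come out the same on every run, you can pass a fixed random_state to KMeans (the seed value here is arbitrary):

# same as get_kmeans(), but with a fixed seed so the clustering is reproducible
kmeans = KMeans(n_clusters=300, n_init=1, random_state=42)
kmeans.fit(np.array([embeddings[w] for w in embeddings]))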

There are hundreds more clusters for you to explore. Some other clusters I found interesting are "Illinois" and "Genghis".

Principal Component Analysis (PCA) is a tool we can use to find the directions in vector space associated with the most variance in our dataset. Let's try it. Like clustering, sklearn makes this easy.

from sklearn.decomposition import PCA

def get_pca_vecs(n=10):  # get the first 10 principal components
    pca = PCA()
    X = np.array([embeddings[w] for w in embeddings])
    pca.fit(X)
    principal_components = list(pca.components_[:n, :])
    return pca, principal_components

pca, pca_vecs = get_pca_vecs()
for i, vec in enumerate(pca_vecs):
    # display the words with the highest and lowest values for each principal component
    display_sims(to_e=vec, metric=cos_sim, label=f'PCA {i+1}')
    display_sims(to_e=vec, metric=cos_sim, label=f'PCA {i+1} negative', reverse=True)

Like our k-means experiment, a lot of these PCA vectors are really interesting. For example, let's take a look at principal component 9:

    PCA 9
1 featuring: 0.38193
2 hindi: 0.37217
3 arabic: 0.36029
4 sung: 0.35130
5 che: 0.34819
6 malaysian: 0.34474
7 ka: 0.33820
8 video: 0.33549
9 bollywood: 0.33347
10 counterpart: 0.33343
PCA 9 negative
1 suffolk: -0.31999
2 cumberland: -0.31697
3 northumberland: -0.31449
4 hampshire: -0.30857
5 missouri: -0.30771
6 calhoun: -0.30749
7 erie: -0.30345
8 massachusetts: -0.30133
9 counties: -0.29710
10 wyoming: -0.29613

It seems that positive values for component 9 are associated with Middle Eastern, South Asian, and Southeast Asian terms, while negative values are associated with North American and British terms.

Another interesting one is component 3. All of the positive terms are decimal numbers, which is apparently quite a salient feature for this model. Component 8 shows a similar pattern.

    PCA 3
1 1.8: 0.57993
2 1.6: 0.57851
3 1.2: 0.57841
4 1.4: 0.57294
5 2.3: 0.57019
6 2.6: 0.56993
7 2.8: 0.56966
8 3.7: 0.56660
9 1.9: 0.56424
10 2.2: 0.56063

Dimensionality Reduction

One of the biggest advantages of PCA is that it allows us to take a very high-dimensional dataset (300-dimensional in this case) and plot it in just two or three dimensions by projecting onto the first components. Let's try a two-dimensional plot and see if there is any information we can gather from it. We'll also include color-coding by cluster using k-means.

def plot_pca(pca_vecs, kmeans):
    words = [w for w in embeddings]
    x_vec = pca_vecs[0]
    y_vec = pca_vecs[1]
    X = np.array([np.dot(x_vec, embeddings[w]) for w in words])
    Y = np.array([np.dot(y_vec, embeddings[w]) for w in words])
    colors = kmeans.predict([embeddings[w] for w in words])
    plt.scatter(X, Y, c=colors, cmap='spring')  # color by cluster
    for i in np.random.choice(len(words), size=100, replace=False):
        # annotate 100 randomly chosen words on the graph
        plt.annotate(words[i], (X[i], Y[i]), weight='bold')
    plt.show()

plot_pca(pca_vecs, kmeans)

A plot of the first (X) and second (Y) principal components for our embeddings dataset

Unfortunately, this plot is a complete mess! It's difficult to learn much from it. It seems that two dimensions in isolation are just not very easy to interpret out of 300 total dimensions, at least in the case of this dataset.

There are two exceptions. First, we see that names tend to cluster near the top of this graph. Second, there is a little section that sticks out like a sore thumb at the bottom left. This area appears to be associated with numbers, particularly decimal numbers.

It is often helpful to get an idea of the covariance between input features. In this case, our input features are just abstract vector directions that are difficult to interpret. Still, a covariance matrix can tell us how much of this information is actually being used. If we see high covariance, it means some dimensions are strongly correlated, and maybe we could get away with reducing the dimensionality a little bit.

def display_covariance():
    X = np.array([embeddings[w] for w in embeddings]).T  # rows are variables (components), columns are observations (words)
    cov = np.cov(X)
    cov_range = np.maximum(np.max(cov), np.abs(np.min(cov)))  # make sure the colorbar is balanced, with 0 in the middle
    plt.imshow(cov, cmap='bwr', interpolation='nearest', vmin=-cov_range, vmax=cov_range)
    plt.colorbar()
    plt.show()

display_covariance()

A covariance matrix for all 300 vector components in our dataset

Of course, there's a big line down the main diagonal, representing that each component is strongly correlated with itself. Apart from that, this isn't a very interesting graph. Everything looks mostly blank, which is a good sign.
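Relatedly, if we wanted to follow up on the earlier point about reducing dimensionality, the PCA model we already fit can tell us how much of the total variance the first few principal components capture. A minimal sketch (the exact numbers will depend on which slice of the embeddings you loaded):

explained = np.cumsum(pca.explained_variance_ratio_)  # cumulative fraction of variance explained
for k in [10, 50, 100, 200, 300]:
    print(f"first {k} components: {explained[k-1]:.3f}")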

If you look closely at the covariance matrix, there's one exception: components 9 and 276 seem somewhat strongly correlated (covariance of 0.308).

The covariance matrix zoomed in on components 9 and 276. Note the somewhat bright red dot here, along with strange behavior along its row and column.

Let's investigate this further by printing the vectors that are most similar to components 9 and 276. This is equivalent to computing cosine similarity with a basis vector that is all zeros, except for a one in the relevant component.

e9 = np.zeros_like(zero_vec)
e9[9] = 1.0
e276 = np.zeros_like(zero_vec)
e276[276] = 1.0
display_sims(to_e=e9, metric=cos_sim, label='e9')
# grizzlies, supersonics, notables, posey, bobcats, wannabe, hoosiers...
display_sims(to_e=e276, metric=cos_sim, label='e276')
# pehr, zetsche, steadied, 202-887-8307, bernice, goldie, edelman, kr...

These results are strange, and not very informative.

But wait: we can also have a positive covariance in these components if words with a very negative value in one tend to also be very negative in the other. Let's try reversing the direction of similarity.

display_sims(to_e=e9, metric=cos_sim, label='e9', reverse=True)
# subsequently, that, it, which, government, because, furthermore, fact, thus, very
display_sims(to_e=e276, metric=cos_sim, label='e276', reverse=True)
# they, instead, those, hundreds, addition, dozens, others, dozen, only, outside

It seems that both of these components are associated with basic function words and numbers that can be found in many different contexts. This helps explain the covariance between them, at least more so than the positive case did.
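As a last sanity check, we can quantify the relationship between these two components with a correlation coefficient, which normalizes the covariance by each component's variance (your exact value will depend on the subset of embeddings loaded):

X = np.array([embeddings[w] for w in embeddings])
corr = np.corrcoef(X[:, 9], X[:, 276])[0, 1]  # Pearson correlation between components 9 and 276
print(f"correlation between components 9 and 276: {corr:.3f}")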

In this article we applied a variety of exploratory data analysis (EDA) techniques to a 300-dimensional dataset of GloVe word embeddings. We used cosine similarity to measure similarity between the meanings of words, clustering to group words into related sets, and principal component analysis (PCA) to identify the directions in vector space that are most important to the embedding model.

Using a covariance matrix, we visually observed overall minimal covariance between the input features. We tried using PCA to plot all of our 300-dimensional data in just two dimensions, but this was still a bit messy.

We also tested assumptions and biases in our dataset. We identified gender bias in the dataset by comparing the cosine similarity of nurse with each of man and woman. We tried using vector math to represent analogies (like "king" is to "queen" as "man" is to "woman"), with some success. By subtracting various examples of vectors referring to men and women, we were able to identify a vector direction associated with gender, and display the "most masculine" and "most feminine" vectors in the dataset.

There's a lot more EDA you can try on a dataset of word embeddings, but I hope this was a good starting point for understanding both some techniques of EDA in general and the structure of word embeddings in particular. If you want to see the full code associated with this article, plus some additional examples, you can check out my GitHub at crackalamoo/glove-embeddings-eda. Thanks for reading!

References

[1] J. Pennington, R. Socher and C. Manning, GloVe: Global Vectors for Word Representation (2014), Stanford NLP (public domain dataset)

[2] T. Bolukbasi, K. Chang, J. Zou, V. Saligrama and A. Kalai, Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings (2016), Microsoft Research New England

All images created by the author using Matplotlib.
