Cosine similarity is a commonly used metric for operationalizing tasks such as semantic search and document comparison in the field of natural language processing (NLP). Introductory NLP courses often provide only a high-level justification for using cosine similarity in such tasks (versus, say, Euclidean distance) without explaining the underlying mathematics, leaving many data scientists with a relatively vague understanding of the subject. To address this gap, this article lays out the mathematical intuition behind the cosine similarity metric and shows how it can help us interpret results in practice, with hands-on examples in Python.
Note: All figures and formulas in the following sections have been created by the author of this article.
Mathematical Intuition
The cosine similarity metric is based on the cosine function that readers may recall from high school math. The cosine function exhibits a repeating wavelike pattern, a full cycle of which is depicted in Figure 1 below for the range 0 ≤ x ≤ 2π. The Python code used to produce the figure is also included for reference.
import numpy as np
import matplotlib.pyplot as plt
# Define the x range from 0 to 2*pi
x = np.linspace(0, 2 * np.pi, 500)
y = np.cos(x)
# Create the plot
plt.figure(figsize=(8, 4))
plt.plot(x, y, label='cos(x)', color='blue')
# Add notches on the x-axis at multiples of pi/2
notch_positions = [0, np.pi/2, np.pi, 3*np.pi/2, 2*np.pi]
notch_labels = ['0', 'pi/2', 'pi', '3*pi/2', '2*pi']
plt.xticks(ticks=notch_positions, labels=notch_labels)
# Add custom horizontal gridlines only at y = -1, 0, 1
for y_val in [-1, 0, 1]:
    plt.axhline(y=y_val, color='gray', linestyle='--', linewidth=0.5)
# Add vertical gridlines at specified x-values
for x_val in notch_positions:
    plt.axvline(x=x_val, color='gray', linestyle='--', linewidth=0.5)
# Customize the plot
plt.xlabel("x")
plt.ylabel("cos(x)")
# Final layout and display
plt.tight_layout()
plt.show()
The function parameter x denotes an angle in radians (e.g., the angle between two vectors in an embedding space), where π/2, π, 3π/2, and 2π correspond to 90, 180, 270, and 360 degrees, respectively.
To see why the cosine function can serve as a useful basis for designing a vector similarity metric, notice that the basic cosine function, without any functional transformations as shown in Figure 1, has maxima at x = 2jπ, minima at x = (2k + 1)π, and roots at x = (l + 1/2)π for some integers j, k, and l. In other words, if θ denotes the angle between two vectors, cos(θ) returns the largest value (1) when the vectors point in the same direction, the smallest value (-1) when the vectors point in opposite directions, and 0 when the vectors are orthogonal to one another.
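As a quick numerical check of these properties, we can evaluate the cosine function at a few representative angles:
import math
# cos(0) = 1: a maximum, corresponding to vectors pointing in the same direction
print(math.cos(0))            # 1.0
# cos(pi) = -1: a minimum, corresponding to vectors pointing in opposite directions
print(math.cos(math.pi))      # -1.0
# cos(pi/2) = 0: a root, corresponding to orthogonal vectors
print(math.cos(math.pi / 2))  # ~6.1e-17 (zero up to floating-point error)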
This behavior of the cosine function neatly captures the interplay between two key concepts in NLP: semantic overlap (conveying how much meaning is shared between two texts) and semantic polarity (capturing the oppositeness of meaning in texts). For instance, the texts “I liked this movie” and “I enjoyed this film” would have high semantic overlap (they express essentially the same meaning despite using different words) and low semantic polarity (they do not express opposite meanings). Now, if the embedding vectors for two words happen to encode both semantic overlap and polarity, then we would expect synonyms to have cosine similarity approaching 1, antonyms to have cosine similarity approaching -1, and unrelated words to have cosine similarity approaching 0.
In practice, we will typically not know the angle θ directly. Instead, we must derive the cosine value from the vectors themselves. Given two vectors U and V, each with n elements, the cosine of the angle θ between these vectors (corresponding to the cosine similarity metric) is computed as the dot product of the vectors divided by the product of the vector magnitudes:
\cos(\theta) = \frac{U \cdot V}{\|U\| \, \|V\|} = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2} \; \sqrt{\sum_{i=1}^{n} v_i^2}}
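As a brief worked example, take the (arbitrarily chosen) vectors U = (1, 2, 2) and V = (2, 4, 4) = 2U, which point in the same direction:
\cos(\theta) = \frac{1 \cdot 2 + 2 \cdot 4 + 2 \cdot 4}{\sqrt{1^2 + 2^2 + 2^2} \; \sqrt{2^2 + 4^2 + 4^2}} = \frac{18}{3 \cdot 6} = 1
The result of 1 confirms that vectors differing only in magnitude, not direction, are treated as maximally similar.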
The above formula for the cosine of the angle between two vectors can be derived from the so-called Cosine Rule, as demonstrated in the segment between minutes 12 and 18 of this video:
A neat proof of the Cosine Rule itself is presented in this video:
The following Python implementation of cosine similarity explicitly operationalizes the formula presented above, without relying on any black-box, third-party packages:
import math
def cosine_similarity(U, V):
    if len(U) != len(V):
        raise ValueError("Vectors must be of the same length.")
    # Compute dot product and magnitudes
    dot_product = sum(u * v for u, v in zip(U, V))
    magnitude_U = math.sqrt(sum(u ** 2 for u in U))
    magnitude_V = math.sqrt(sum(v ** 2 for v in V))
    # Zero vector handling to avoid division by zero
    if magnitude_U == 0 or magnitude_V == 0:
        raise ValueError("Cannot compute cosine similarity for zero-magnitude vectors.")
    return dot_product / (magnitude_U * magnitude_V)
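Applying this function to a few hand-picked example vectors behaves exactly as the theory predicts:
# Parallel vectors (same direction): cosine similarity of 1
print(cosine_similarity([1, 2, 2], [2, 4, 4]))   # 1.0
# Orthogonal vectors: cosine similarity of 0
print(cosine_similarity([1, 0], [0, 1]))         # 0.0
# Opposite vectors: cosine similarity of -1
print(cosine_similarity([1, 2], [-1, -2]))       # -1.0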
Interested readers can refer to this article for a more efficient Python implementation of the cosine distance metric (defined as 1 minus cosine similarity) using the NumPy and SciPy packages.
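A minimal sketch of such an approach, using the existing scipy.spatial.distance.cosine function (which returns the cosine distance directly), might look as follows:
import numpy as np
from scipy.spatial.distance import cosine as cosine_distance
U = np.array([1.0, 2.0, 2.0])
V = np.array([2.0, 4.0, 4.0])
# SciPy returns the cosine distance; subtract it from 1 to recover cosine similarity
print(cosine_distance(U, V))      # 0.0
print(1 - cosine_distance(U, V))  # 1.0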
Finally, it is worth comparing the mathematical intuition of cosine similarity (or distance) with that of Euclidean distance, which measures the linear distance between two vectors and can also serve as a vector similarity metric. Specifically, the lower the Euclidean distance between two vectors, the higher their semantic similarity is likely to be. The Euclidean distance between two vectors U and V (each of length n) can be computed using the following formula:
d(U, V) = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2}
Below is the corresponding Python implementation:
import math
def euclidean_distance(U, V):
    if len(U) != len(V):
        raise ValueError("Vectors must be of the same length.")
    # Compute sum of squared differences
    sum_squared_diff = sum((u - v) ** 2 for u, v in zip(U, V))
    # Take the square root of the sum
    return math.sqrt(sum_squared_diff)
Notice that, because the elementwise differences in the Euclidean distance formula are squared, the resulting metric will always be a non-negative number: zero if the vectors are identical, positive otherwise. In the NLP context, this means that Euclidean distance will not reflect semantic polarity in quite the same way as cosine distance does. Furthermore, as long as two vectors point in the same direction, the cosine of the angle between them will remain the same regardless of the vector magnitudes. In contrast, the Euclidean distance metric is affected by differences in vector magnitude, which can lead to misleading interpretations in practice (e.g., two texts of different lengths may yield a high Euclidean distance despite being semantically similar). As such, cosine similarity is the preferred metric in many NLP scenarios, where determining vector (or semantic) directionality is the primary concern.
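This contrast can be seen by passing two arbitrary vectors that share a direction but differ in magnitude to the functions defined above:
U = [1, 1]
V = [10, 10]  # same direction as U, much larger magnitude
# Cosine similarity ignores magnitude: the vectors count as maximally similar
print(cosine_similarity(U, V))   # 1.0
# Euclidean distance is sensitive to magnitude: the vectors appear far apart
print(euclidean_distance(U, V))  # ~12.73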
Theory versus Practice
In a practical NLP scenario, the interpretation of cosine similarity hinges on the extent to which the vector embedding encodes polarity in addition to semantic overlap. In the following hands-on example, we will investigate the similarity between pairs of words using a pretrained embedding model that does not encode polarity (all-MiniLM-L6-v2) and one that does (distilbert-base-uncased-finetuned-sst-2-english). We will also use more efficient implementations of cosine similarity and Euclidean distance by leveraging functions provided by the SciPy package.
from scipy.spatial.distance import cosine as cosine_distance
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
import torch
# Words to embed
words = ["movie", "film", "good", "bad", "spoon", "car"]
# Load pre-trained embedding models from Hugging Face
model_1 = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
model_2_name = "distilbert-base-uncased-finetuned-sst-2-english"
model_2_tokenizer = AutoTokenizer.from_pretrained(model_2_name)
model_2 = AutoModel.from_pretrained(model_2_name)
# Generate embeddings for model 1
embeddings_1 = dict(zip(words, model_1.encode(words)))
# Generate embeddings for model 2
inputs = model_2_tokenizer(words, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model_2(**inputs)
embedding_vectors_model_2 = outputs.last_hidden_state.mean(dim=1)
embeddings_2 = {word: vector for word, vector in zip(words, embedding_vectors_model_2)}
# Compute and print cosine similarity (1 - cosine distance) for both embedding models
print("Cosine similarity for embedding model 1:")
print("movie", "\t", "film", "\t", 1 - cosine_distance(embeddings_1["movie"], embeddings_1["film"]))
print("good", "\t", "bad", "\t", 1 - cosine_distance(embeddings_1["good"], embeddings_1["bad"]))
print("spoon", "\t", "car", "\t", 1 - cosine_distance(embeddings_1["spoon"], embeddings_1["car"]))
print()
print("Cosine similarity for embedding model 2:")
print("movie", "t", "film", "t", 1 - cosine_distance(embeddings_2["movie"], embeddings_2["film"]))
print("good", "t", "bad", "t", 1 - cosine_distance(embeddings_2["good"], embeddings_2["bad"]))
print("spoon", "t", "automobile", "t", 1 - cosine_distance(embeddings_2["spoon"], embeddings_2["car"]))
print()
Output:
Cosine similarity for embedding model 1:
movie film 0.8426464702276286
good bad 0.5871497042685934
spoon car 0.22919675707817078
Cosine similarity for embedding model 2:
movie film 0.9638281550070811
good bad -0.3416433451550165
spoon car 0.5418748837234599
The words “movie” and “film”, which are typically used as synonyms, have cosine similarity close to 1, suggesting high semantic overlap as expected. The words “good” and “bad” are antonyms, and we see this reflected in the negative cosine similarity result when using the second embedding model, which is known to encode semantic polarity. Finally, the words “spoon” and “car” are semantically unrelated, and the corresponding orthogonality of their vector embeddings is indicated by their cosine similarity results being closer to zero than those for “movie” and “film”.
The Wrap
The cosine similarity between two vectors is based on the cosine of the angle they form and, unlike metrics such as Euclidean distance, is not sensitive to differences in vector magnitudes. In theory, cosine similarity should be close to 1 if the vectors point in the same direction (indicating high similarity), close to -1 if the vectors point in opposite directions (indicating high dissimilarity), and close to 0 if the vectors are orthogonal (indicating unrelatedness). However, the precise interpretation of cosine similarity in a given NLP scenario depends on the nature of the embedding model used to vectorize the textual data (e.g., whether the embedding model encodes polarity in addition to semantic overlap).