t-SNE from Scratch (ft. NumPy)

Gain a deep understanding of the inner workings of t-SNE via an implementation from scratch in Python

I have found that one of the best ways to truly understand any statistical algorithm or methodology is to implement it yourself. On the flip side, coding these algorithms can sometimes be time consuming and a real pain, and when somebody else has already done it, why would I want to spend my time doing it — seems inefficient, no? Both are fair points, and I am not here to make an argument for one over the other.

This article is designed for readers who are interested in understanding t-SNE by translating the mathematics in the original paper — by Laurens van der Maaten & Geoffrey Hinton — into Python code.[1] I find these kinds of exercises quite illuminating with respect to the inner workings of statistical algorithms/models, and they really test your underlying understanding of and assumptions about those algorithms/models. You are almost guaranteed to walk away with a better understanding than you had before. At the very minimum, a successful implementation is always very satisfying!

This article will be accessible to individuals with any level of exposure to t-SNE. However, note a few things this post definitely is not:

  1. A strictly conceptual introduction and exploration of t-SNE, as there are plenty of other great resources out there that do this; nevertheless, I will be doing my best to connect the mathematical equations to their intuitive/conceptual counterparts at each stage of implementation.
  2. A comprehensive discussion of the applications & pros/cons of t-SNE, or of direct comparisons of t-SNE to other dimensionality reduction techniques. I will, however, briefly touch on these topics throughout this article, but will by no means cover them in depth.

Without further ado, let's start with a brief introduction to t-SNE.

A Brief Introduction to t-SNE

t-distributed stochastic neighbor embedding (t-SNE) is a dimensionality reduction tool that is primarily used on datasets with a high-dimensional feature space and enables one to visualize the data, or project it, into a lower-dimensional space (usually 2-D). It is especially useful for visualizing non-linearly separable data where linear methods such as Principal Component Analysis (PCA) would fail. Generalizing linear frameworks of dimensionality reduction (such as PCA) into non-linear approaches (such as t-SNE) is also known as Manifold Learning. These methods can be extremely useful for visualizing and understanding the underlying structure of a high-dimensional, non-linear dataset, and can be great for disentangling and grouping together observations that are similar in the high-dimensional space. For more information on t-SNE and other manifold learning techniques, the scikit-learn documentation is a great resource. Additionally, to read about some cool areas where t-SNE has seen applications, the Wikipedia page highlights some of them with references to the work.

Let's start by breaking down the name t-distributed stochastic neighbor embedding into its components. t-SNE is an extension of stochastic neighbor embedding (SNE), presented 6 years earlier in this paper by Geoffrey Hinton & Sam Roweis. So let's start there. The stochastic part of the name comes from the fact that the objective function is not convex, and thus different results can arise from different initializations. The neighbor embedding highlights the nature of the algorithm — optimally mapping the points in the original high-dimensional space into the corresponding low-dimensional space while best preserving the "neighborhood" structure of the points. SNE is comprised of the following (simplified) steps:

  1. Obtain the similarity matrix between points in the original space: Compute the conditional probabilities for each datapoint j relative to each datapoint i. These conditional probabilities are calculated in the original high-dimensional space using a Gaussian centered at i and take on the following interpretation: the probability that i would pick j as its neighbor in the original space. This creates a matrix that represents the similarities between the points.
  2. Initialization: Choose random starting points in the lower-dimensional space (say, 2-D) for each datapoint in the original space and compute new conditional probabilities, analogously to the above, in this new space.
  3. Mapping: Iteratively improve upon the points in the lower-dimensional space until the Kullback-Leibler divergences between all the conditional probabilities are minimized. Essentially we are minimizing the differences in the probabilities between the similarity matrices of the two spaces, so as to ensure the similarities are best preserved in the mapping of the original high-dimensional dataset to the low-dimensional dataset.

t-SNE improves upon SNE in two primary ways:

  1. It minimizes the Kullback-Leibler divergences between the joint probabilities rather than the conditional probabilities. The authors refer to this as "symmetric SNE" because their approach ensures that the joint probabilities satisfy p_ij = p_ji.
  2. It computes the similarities between points using a Student-t distribution with one degree of freedom (also a Cauchy distribution) rather than a Gaussian in the low-dimensional space (step 2 above). Here we can see where the "t" in t-SNE comes from. This change alleviates what the authors call the "crowding problem", which can be envisioned as follows: imagine we have a 10-D space; the amount of space available in 2-D will not be sufficient to accurately capture the moderately dissimilar points relative to the amount of space available for nearby points in the 10-D space. More simply, envision taking a 3-D space and projecting it down to 2-D: the 3-D space has much more overall room in which to model the similarities than its 2-D projection. The Student-t distribution helps alleviate this problem by having heavier tails than the normal distribution. See the original paper for a much more in-depth treatment of this problem.

If this didn't all track immediately, that's okay! I'm hoping that when we implement this in code, the pieces will all fall into place. The main takeaway is this: t-SNE models similarities between points in both the high-dimensional and low-dimensional spaces, then adjusts the low-dimensional points so that the two sets of similarities match as closely as possible.

Implementation from Scratch

Let's now move on to understanding t-SNE by implementing the original version of the algorithm as presented in the paper by Laurens van der Maaten & Geoffrey Hinton. We will first implement Algorithm 1 below step-by-step, which covers 95% of the main algorithm. There are two additional enhancements the authors note: 1) early exaggeration & 2) adaptive learning rates. We will only discuss adding in the early exaggeration, as it is most conducive to aiding the interpretation of the algorithm's inner workings; the adaptive learning rate is focused on improving the rate of convergence.

Algorithm 1 (from paper)

Following the original paper, we will be using the publicly available MNIST dataset from OpenML, with images of handwritten digits from 0–9.[2] We will also randomly sample 1000 images from the dataset & reduce the dimensionality of the dataset using Principal Component Analysis (PCA), keeping 30 components. Both steps are meant to improve the computational time of the algorithm, as the code here is not optimized for speed, but rather for interpretability & learning.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# Fetch MNIST data
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

X_total = pd.DataFrame(mnist["data"])
y_total = pd.DataFrame(mnist["target"])

# Randomly sample 1000 images
X_reduced = X_total.sample(n=1000)
y_reduced = y_total.loc[X_reduced.index]

# PCA to keep 30 components
X = PCA(n_components=30).fit_transform(X_reduced)

This will be our X dataset, with each row being an image and each column being a feature, or principal component in this case (i.e., a linear combination of the original pixels):

Sample of 1000 from MNIST Dataset with first 30 principal components

We also need to specify the cost function parameter — perplexity — and the optimization parameters — iterations, learning rate, & momentum. We will hold off on these for now and address them as they appear at each stage.

In terms of output, recall that we seek the low-dimensional mapping of the original dataset X. We will be mapping the original space into a 2-dimensional space throughout this example. Thus, our new output will be the 1000 images now represented in a 2-dimensional space rather than the original 30-dimensional space:

Desired Output in 2-D Space

Now that we have our inputs, the first step is to compute the pairwise similarities in the original high-dimensional space. That is, for each image i we compute the probability that i would pick image j as its neighbor in the original space, for each j. These probabilities are calculated via a normal distribution centered around each point and are then normalized to sum up to 1. Mathematically, we have:

Eq. (1) — High Dimensional Affinity
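
For reference, eq. (1) from the paper can be written as:

p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}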

Note that, in our case with n = 1000, these computations will result in a 1000 x 1000 matrix of similarity scores. Also note that we set p = 0 whenever i = j, because we are modeling pairwise similarities. However, you may notice that we have not mentioned how σ is determined. This value is determined for each observation i via a grid search based on the user-specified desired perplexity of the distributions. We will talk about this immediately below, but let's first look at how we would code eq. (1) above:

def get_original_pairwise_affinities(X: np.ndarray, perplexity: int = 10):

    '''
    Function to obtain the affinities matrix in the original high-dimensional space.
    '''

    n = len(X)

    print("Computing Pairwise Affinities....")

    p_ij = np.zeros(shape=(n, n))
    for i in range(0, n):

        # Equation 1 numerator
        diff = X[i] - X
        σ_i = grid_search(diff, i, perplexity)  # Grid search for σ_i
        norm = np.linalg.norm(diff, axis=1)
        p_ij[i, :] = np.exp(-norm**2 / (2 * σ_i**2))

        # Set p = 0 when j = i
        p_ij[i, i] = 0

        # Equation 1 (normalize each row to sum to 1)
        p_ij[i, :] = p_ij[i, :] / np.sum(p_ij[i, :])

    # Set 0 values to minimum numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    p_ij = np.maximum(p_ij, ε)

    print("Completed Pairwise Affinities Matrix. \n")

    return p_ij

Now, before we look at the results of this code, let's discuss how we determine the values of σ via the grid_search() function. Given a specified value of perplexity (which, in this context, can loosely be thought of as the number of nearest neighbors for each point), we do a grid search over a range of values of σ such that the following equation is as close to equality as possible for each i:

Perplexity

where H(P) is the Shannon entropy of P.

Shannon Entropy of P
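
Concretely, these two relationships are:

\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}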

In our case, we will set perplexity = 10 and define the search space as [0.01 * standard deviation of the norms of the differences between image i and the other images, 5 * that standard deviation], divided into 200 equal steps. Knowing this, we can define our grid_search() function as follows:

def grid_search(diff_i, i, perplexity):

    '''
    Helper function to obtain σ's based on user-specified perplexity.
    '''

    result = np.inf  # Set first result to be infinity

    norm = np.linalg.norm(diff_i, axis=1)
    std_norm = np.std(norm)  # Use standard deviation of norms to define search space

    for σ_search in np.linspace(0.01 * std_norm, 5 * std_norm, 200):

        # Equation 1 numerator
        p = np.exp(-norm**2 / (2 * σ_search**2))

        # Set p = 0 when i = j
        p[i] = 0

        # Equation 1 (ε -> 0)
        ε = np.nextafter(0, 1)
        p_new = np.maximum(p / np.sum(p), ε)

        # Shannon entropy
        H = -np.sum(p_new * np.log2(p_new))

        # Keep the σ whose log(perplexity) is closest to equality with H * log(2)
        if np.abs(np.log(perplexity) - H * np.log(2)) < np.abs(result):
            result = np.log(perplexity) - H * np.log(2)
            σ = σ_search

    return σ

Given these functions, we can compute the affinity matrix via p_ij = get_original_pairwise_affinities(X), where we obtain the following matrix:

Affinity Matrix of Conditional Probabilities in Original High Dimensional Space
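
As an optional sanity check (not part of the algorithm itself), each row of p_ij is a conditional probability distribution over neighbors, so every row should sum to approximately 1:

# Optional sanity check: each row of p_ij should sum to ~1
assert p_ij.shape == (1000, 1000)
assert np.allclose(p_ij.sum(axis=1), 1.0)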

Note that the diagonal elements are set to ε ≈ 0 by construction (whenever i = j). Recall that a key extension of the t-SNE algorithm is to compute the joint probabilities rather than the conditional probabilities. This is computed simply as follows:

Converting Conditional Probabilities to Joint Probabilities
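
Explicitly, the symmetrized joint probabilities from the paper are:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}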

Thus, we can define a new function:

def get_symmetric_p_ij(p_ij: np.ndarray):

    '''
    Function to obtain the symmetric affinities matrix used in t-SNE.
    '''

    print("Computing Symmetric p_ij matrix....")

    n = len(p_ij)
    p_ij_symmetric = np.zeros(shape=(n, n))
    for i in range(0, n):
        for j in range(0, n):
            p_ij_symmetric[i, j] = (p_ij[i, j] + p_ij[j, i]) / (2 * n)

    # Set 0 values to minimum numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    p_ij_symmetric = np.maximum(p_ij_symmetric, ε)

    print("Completed Symmetric p_ij Matrix. \n")

    return p_ij_symmetric

Feeding in p_ij from above, we have p_ij_symmetric = get_symmetric_p_ij(p_ij), where we obtain the following symmetric affinities matrix:

Symmetric Affinity Matrix of Joint Probabilities in Original High Dimensional Space
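
As an aside (not part of the implementation above), the double for loop can be replaced by a single vectorized NumPy expression that produces the same matrix:

# Vectorized equivalent of the double loop in get_symmetric_p_ij
p_ij_symmetric_vec = (p_ij + p_ij.T) / (2 * len(p_ij))
p_ij_symmetric_vec = np.maximum(p_ij_symmetric_vec, np.nextafter(0, 1))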

With that, we have completed the first main step of t-SNE! We computed the symmetric affinity matrix in the original high-dimensional space. Before we dive right into the optimization stage, we will discuss the main components of the optimization problem in the next two steps and then combine them into our for loop.

Next, we want to sample a random initial solution in the lower-dimensional space as follows:

def initialization(X: np.ndarray, n_dimensions: int = 2):

    '''
    Sample a random initial solution in the low-dimensional space.
    '''

    return np.random.normal(loc=0, scale=1e-4, size=(len(X), n_dimensions))

where, calling y0 = initialization(X), we obtain a random starting solution:

Initial Random Solution in 2-D

Now, we want to compute the affinity matrix in this lower-dimensional space. However, recall that we do this using a Student-t distribution with 1 degree of freedom:

Eq. (4) — Low Dimensional Affinity
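
For reference, eq. (4) from the paper can be written as:

q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}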

Again, we set q = 0 whenever i = j. Note that this equation differs from eq. (1) in that the denominator sums over all pairs and is thus symmetric by construction. Putting this into code, we obtain:

def get_low_dimensional_affinities(Y: np.ndarray):

    '''
    Obtain low-dimensional affinities (Student-t with 1 degree of freedom).
    '''

    n = len(Y)
    q_ij = np.zeros(shape=(n, n))

    for i in range(0, n):

        # Equation 4 numerator
        diff = Y[i] - Y
        norm = np.linalg.norm(diff, axis=1)
        q_ij[i, :] = (1 + norm**2) ** (-1)

    # Set q = 0 when j = i
    np.fill_diagonal(q_ij, 0)

    # Equation 4 (normalize over all pairs)
    q_ij = q_ij / q_ij.sum()

    # Set 0 values to minimum numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    q_ij = np.maximum(q_ij, ε)

    return q_ij

Here we are looking for a 1000 x 1000 affinity matrix, but now in the lower-dimensional space. Calling q_ij = get_low_dimensional_affinities(y0), we obtain:

Symmetric Affinity Matrix of Joint Probabilities in New Low Dimensional Space
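
Again, as an optional sanity check (not part of the algorithm), q_ij defines a single joint distribution over all pairs, so the entire matrix should sum to approximately 1:

# Optional sanity check: q_ij should sum to ~1 over all pairs
assert q_ij.shape == (1000, 1000)
assert np.isclose(q_ij.sum(), 1.0)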

Recall, our cost function is the Kullback-Leibler divergence between the joint probability distributions in the high-dimensional space and the low-dimensional space:

Kullback-Leibler divergence between joint probability distributions
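
Written out, the cost function is:

C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}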

Intuitively, we want to minimize the difference between the affinity matrices p_ij and q_ij, thereby best preserving the "neighborhood" structure of the original space. The optimization problem is solved using gradient descent, but first let's look at computing the gradient of the cost function above. The authors derive (see appendix A of the paper) the gradient of the cost function as follows:

Gradient of Cost Function (Eq. 5, but from appendix)
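
For reference, the gradient (eq. 5 in the paper) is:

\frac{\partial C}{\partial y_i} = 4 \sum_j \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}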

In Python, we have:

def get_gradient(p_ij: np.ndarray, q_ij: np.ndarray, Y: np.ndarray):

    '''
    Obtain gradient of cost function at current point Y.
    '''

    n = len(p_ij)

    # Compute gradient
    gradient = np.zeros(shape=(n, Y.shape[1]))
    for i in range(0, n):

        # Equation 5
        diff = Y[i] - Y
        A = np.array([(p_ij[i, :] - q_ij[i, :])])
        B = np.array([(1 + np.linalg.norm(diff, axis=1) ** 2) ** (-1)])
        C = diff
        gradient[i] = 4 * np.sum((A * B).T * C, axis=0)

    return gradient

Feeding in the relevant arguments, we obtain the gradient at y0 via gradient = get_gradient(p_ij_symmetric, q_ij, y0), with the corresponding output:

Gradient of Cost Function at Initial Solution (y0)

Now we have all the pieces we need to solve the optimization problem!

In order to update our low-dimensional mapping, we use gradient descent with momentum as outlined by the authors:

Update Rule (Gradient Descent w/ Momentum)
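
The update rule can be written as follows (the paper states it with a plus sign in front of the gradient term; since we are minimizing the cost, the code below steps in the direction of the negative gradient):

\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} - \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)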

where η is our learning rate and α(t) is our momentum term as a function of time. The learning rate controls the step size at each iteration, and the momentum term allows the optimization algorithm to gain inertia in the smooth direction of the search space while not being bogged down by the noisy parts of the gradient. We will set η = 200 for our example and fix α(t) = 0.5 if t < 250 and α(t) = 0.8 otherwise. We now have all the components needed to compute the update rule, so we can run our optimization over a set number of iterations T (we will set T = 1000).

Before we set up the iteration scheme, let's first introduce the enhancement the authors refer to as "early exaggeration". This term is a constant that scales the original matrix of affinities p_ij. This places more emphasis on modeling the very similar points (high values in p_ij from the original space) in the new space early on, thus forming "clusters" of highly similar points. Early exaggeration is turned on at the beginning of the iteration scheme (t < 250) and then turned off afterwards. Early exaggeration will be set to 4 in our case. We will see this in action in the visual below, following the implementation.

Now, putting all of the pieces of the algorithm together, we have the following:

def tSNE(X: np.ndarray,
         perplexity: int = 10,
         T: int = 1000,
         η: float = 200,
         early_exaggeration: int = 4,
         n_dimensions: int = 2):

    n = len(X)

    # Get original affinities matrix
    p_ij = get_original_pairwise_affinities(X, perplexity)
    p_ij_symmetric = get_symmetric_p_ij(p_ij)

    # Initialization
    Y = np.zeros(shape=(T, n, n_dimensions))
    Y_minus1 = np.zeros(shape=(n, n_dimensions))
    Y[0] = Y_minus1
    Y1 = initialization(X, n_dimensions)
    Y[1] = np.array(Y1)

    print("Optimizing Low Dimensional Embedding....")
    # Optimization
    for t in range(1, T - 1):

        # Momentum & Early Exaggeration
        if t < 250:
            α = 0.5
            early_exaggeration = early_exaggeration
        else:
            α = 0.8
            early_exaggeration = 1

        # Get Low Dimensional Affinities
        q_ij = get_low_dimensional_affinities(Y[t])

        # Get Gradient of Cost Function
        gradient = get_gradient(early_exaggeration * p_ij_symmetric, q_ij, Y[t])

        # Update Rule
        Y[t + 1] = Y[t] - η * gradient + α * (Y[t] - Y[t - 1])  # Use negative gradient

        # Compute current value of cost function
        if t % 50 == 0 or t == 1:
            cost = np.sum(p_ij_symmetric * np.log(p_ij_symmetric / q_ij))
            print(f"Iteration {t}: Value of Cost Function is {cost}")

    print(f"Completed Embedding: Final Value of Cost Function is {np.sum(p_ij_symmetric * np.log(p_ij_symmetric / q_ij))}")
    solution = Y[-1]

    return solution, Y

Calling solution, Y = tSNE(X), we obtain our output, where solution is the final 2-D mapping and Y holds the mapped 2-D values at each step of the iteration. Plotting the evolution of Y, where Y[-1] is the final 2-D mapping, we obtain (note how the algorithm behaves with early exaggeration on and off):

Evolution of 2-D Mapping in t-SNE Algorithm

I recommend playing around with different values of the parameters (i.e., perplexity, learning rate, early exaggeration, etc.) to see how the solution differs (see the original paper and the scikit-learn documentation for guides on using these parameters).
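
If you would like to visualize the final embedding yourself (this is not part of the implementation above), a minimal matplotlib sketch — assuming solution and the sampled labels y_reduced from the data-loading step are in scope — might look like this:

import matplotlib.pyplot as plt

# Scatter plot of the final 2-D embedding, colored by digit label
labels = y_reduced.to_numpy().ravel().astype(int)
plt.figure(figsize=(8, 8))
scatter = plt.scatter(solution[:, 0], solution[:, 1], c=labels, cmap='tab10', s=10)
plt.legend(*scatter.legend_elements(), title="Digit")
plt.title("t-SNE embedding of 1000 MNIST digits")
plt.show()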
