## Acquire a deep understanding of the inner workings of t-SNE via implementation from scratch in Python

I have found that one of the best ways to *truly* understand any statistical algorithm or methodology is to manually implement it yourself. On the flip side, coding these algorithms can sometimes be time consuming and a real pain, and when someone else has already done it, why would I need to spend my time doing it? It seems inefficient, no? Both are fair points, and I'm not here to make an argument for one over the other.

This article is designed for readers who are interested in understanding t-SNE by translating the mathematics in the original paper by Laurens van der Maaten & Geoffrey Hinton into Python code.[1] I find these kinds of exercises to be quite illuminating into the inner workings of statistical algorithms/models, and they really test your underlying understanding of, and assumptions about, these algorithms/models. You are almost guaranteed to walk away with a better understanding than you had before. At the very minimum, a successful implementation is always very satisfying!

This article will be accessible to individuals with any level of exposure to t-SNE. However, note a few things this post definitely is *not*:

- A *strictly* conceptual introduction and exploration of t-SNE, as there are many other great resources out there that do that; nevertheless, I will be doing my best to connect the mathematical equations to their intuitive/conceptual counterparts at each stage of implementation.
- A *comprehensive* discussion of the applications and pros/cons of t-SNE, or of direct comparisons of t-SNE to other dimensionality reduction techniques. I will, however, briefly touch on these topics throughout this article, but will not cover them in depth.

Without further ado, let's start with a *brief* introduction to t-SNE.

## A Brief Introduction to t-SNE

*t-distributed stochastic neighbor embedding* (t-SNE) is a dimensionality reduction tool primarily used on datasets with a large-dimensional feature space; it enables one to visualize the data, or project it, into a lower-dimensional space (usually 2-D). It is especially useful for visualizing non-linearly separable data on which linear methods such as Principal Component Analysis (PCA) would fail. Generalizing linear frameworks of dimensionality reduction (such as PCA) into non-linear approaches (such as t-SNE) is also known as Manifold Learning. These methods can be extremely useful for visualizing and understanding the underlying structure of a high-dimensional, non-linear dataset, and are great for disentangling and grouping together observations that are similar in the high-dimensional space. For more information on t-SNE and other manifold learning techniques, the scikit-learn documentation is a great resource. Additionally, to read about some cool areas where t-SNE has seen applications, the Wikipedia page highlights some of them with references to the work.

Let's start by breaking the name *t-distributed stochastic neighbor embedding* down into its components. t-SNE is an extension of stochastic neighbor embedding (SNE), presented 6 years earlier in this paper by Geoffrey Hinton & Sam Roweis, so let's start there. The *stochastic* part of the name comes from the fact that the objective function is not convex, and thus different results can arise from different initializations. The *neighbor embedding* part highlights the nature of the algorithm: optimally mapping the points in the original high-dimensional space into a corresponding low-dimensional space while best preserving the "neighborhood" structure of the points. SNE comprises the following (simplified) steps:

1. **Obtain the similarity matrix between points in the original space:** Compute the conditional probabilities for each datapoint *j* relative to each datapoint *i*. These conditional probabilities are calculated in the original high-dimensional space using a Gaussian centered at *i*, and take on the following interpretation: the probability that *i* would pick *j* as its neighbor in the original space.
2. **Initialization:** Choose random starting points in the lower-dimensional space (say, 2-D) for each datapoint in the original space, and compute new conditional probabilities in this new space in the same manner as above.
3. **Mapping:** Iteratively improve the points in the lower-dimensional space until the Kullback-Leibler divergences between all the conditional probabilities are minimized. Essentially, we minimize the differences in probabilities between the similarity matrices of the two spaces, so as to ensure the similarities are best preserved in the mapping from the original high-dimensional dataset to the low-dimensional dataset.

t-SNE improves upon SNE in two primary ways:

- It minimizes the Kullback-Leibler divergences between the *joint probabilities* rather than the conditional probabilities. The authors refer to this as "symmetric SNE" because their approach ensures that the joint probabilities satisfy *p_ij* = *p_ji*.
- It computes the similarities between points using a Student-t distribution with one degree of freedom (also a Cauchy distribution) rather than a Gaussian in the *low-dimensional* space (step 2 above). Here we can see where the "t" in t-SNE comes from. This change alleviates what the authors call the "crowding problem", which can be envisioned as such: imagine we have a 10-D space; the amount of room available in 2-D will not be sufficient to accurately capture the moderately dissimilar points, relative to the amount of room for nearby points, compared with the room available in a 10-D space. More simply, just envision taking a 3-D space and projecting it down to 2-D: the 3-D space has much more overall room to model the similarities than the 2-D projection. The Student-t distribution helps alleviate this problem by having heavier tails than the normal distribution. See the original paper for a much more in-depth treatment of this problem.
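To see those heavier tails concretely, here is a quick numerical sketch (not from the paper's code) comparing the unnormalized Gaussian kernel used in the high-dimensional space with the Student-t/Cauchy kernel t-SNE uses in the low-dimensional space:

```python
import numpy as np

# Unnormalized similarity kernels as a function of pairwise distance d:
# Gaussian (with σ = 1): exp(-d²/2);  Student-t with 1 dof (Cauchy): 1/(1 + d²)
d = np.array([0.0, 1.0, 3.0, 5.0])

gaussian = np.exp(-d**2 / 2)
student_t = 1 / (1 + d**2)

# At d = 5 the Gaussian has decayed to ~4e-6 while the Cauchy kernel
# retains ~0.04, so moderately dissimilar points keep meaningful
# similarity and can spread out in the 2-D map.
for row in zip(d, gaussian, student_t):
    print("d = %.0f: gaussian = %.2e, student-t = %.2e" % row)
```

This is exactly the mechanism that lets moderately dissimilar clusters push apart in the low-dimensional embedding.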

If this didn't all track immediately, that's okay! I'm hoping that once we implement this in code, the pieces will all fall into place. The main takeaway is this:

## Implementation from Scratch

Let's now move on to understanding t-SNE by implementing the original version of the algorithm as presented in the paper by Laurens van der Maaten & Geoffrey Hinton. We'll first implement Algorithm 1 below step by step, which covers 95% of the main algorithm. The authors note two additional enhancements: 1) early exaggeration and 2) adaptive learning rates. We'll only add early exaggeration, as it is the most conducive to interpreting the algorithm's inner workings; the adaptive learning rate is focused on improving the rate of convergence.

Following the original paper, we will use the publicly available MNIST dataset from OpenML, with images of handwritten digits from 0–9.[2] We will also randomly sample 1000 images from the dataset and reduce the dimensionality of the dataset using Principal Component Analysis (PCA), keeping 30 components. Both choices improve the computational time of the algorithm, as the code here is optimized not for speed but for interpretability and learning.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA

# Fetch MNIST data
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(np.uint8)

X_total = pd.DataFrame(mnist["data"])
y_total = pd.DataFrame(mnist["target"])

# Randomly sample 1000 images
X_reduced = X_total.sample(n=1000)
y_reduced = y_total.loc[X_reduced.index]

# PCA to keep 30 components
X = PCA(n_components=30).fit_transform(X_reduced)
```

This will be our *X* dataset, with each *row* being an image and each *column* being a feature, or principal component in this case (i.e., linear combinations of the original pixels):

We also need to specify the cost function parameter (perplexity) and the optimization parameters (iterations, learning rate, and momentum). We'll hold off on these for now and address them as they appear at each stage.

In terms of output, recall that we seek the low-dimensional mapping of the original dataset *X*. We will map the original space into a 2-dimensional space throughout this example. Thus, our new output will be the 1000 images represented in a 2-dimensional space rather than the original 30-dimensional space:

Now that we have our inputs, the first step is to compute the pairwise similarities in the original high-dimensional space. That is, for each image *i* we compute the probability that *i* would pick image *j* as its neighbor in the original space, for each *j*. These probabilities are calculated via a normal distribution centered around each point and then normalized to sum to 1. Mathematically, we have:
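This is eq. (1) in the paper:

```latex
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}
```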

Note that, in our case with *n = 1000*, these computations result in a *1000* x *1000* matrix of similarity scores. Note we set *p = 0* whenever *i = j*, because we are modeling pairwise similarities. However, you may notice that we have not mentioned how σ is determined. This value is determined for each observation *i* via a grid search based on the user-specified desired perplexity of the distributions. We'll talk about this immediately below, but let's first look at how we would code eq. (1) above:

```python
def get_original_pairwise_affinities(X: np.ndarray, perplexity: int = 10) -> np.ndarray:
    '''
    Function to obtain affinities matrix.
    '''
    n = len(X)

    print("Computing Pairwise Affinities....")

    p_ij = np.zeros(shape=(n, n))
    for i in range(n):
        # Equation 1 numerator
        diff = X[i] - X
        σ_i = grid_search(diff, i, perplexity)  # Grid search for σ_i
        norm = np.linalg.norm(diff, axis=1)
        p_ij[i, :] = np.exp(-norm**2 / (2 * σ_i**2))

        # Set p = 0 when j = i
        p_ij[i, i] = 0

        # Equation 1 (normalize each row to sum to 1)
        p_ij[i, :] = p_ij[i, :] / np.sum(p_ij[i, :])

    # Set 0 values to the minimum positive numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    p_ij = np.maximum(p_ij, ε)

    print("Completed Pairwise Affinities Matrix. \n")

    return p_ij
```

Now, before we look at the results of this code, let's discuss how we determine the values of σ via the grid_search() function. Given a specified value of perplexity (which in this context can be loosely thought of as the number of nearest neighbors for each point), we do a grid search over a range of values of σ such that the following equation is as close to equality as possible for each *i*:
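That condition, taken from the paper, is:

```latex
\mathrm{Perp}(P_i) = 2^{H(P_i)}, \qquad H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}
```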

where H(P) is the Shannon entropy of P.

In our case, we'll set perplexity = 10 and define the search space as [0.01 × standard deviation of the norms of the differences between images *i* and *j*, 5 × that same standard deviation], divided into 200 equal steps. Knowing this, we can define our grid_search() function as follows:

```python
def grid_search(diff_i: np.ndarray, i: int, perplexity: int) -> float:
    '''
    Helper function to obtain σ's based on user-specified perplexity.
    '''
    result = np.inf  # Set first result to be infinity

    norm = np.linalg.norm(diff_i, axis=1)
    std_norm = np.std(norm)  # Use standard deviation of norms to define search space

    for σ_search in np.linspace(0.01 * std_norm, 5 * std_norm, 200):
        # Equation 1 numerator
        p = np.exp(-norm**2 / (2 * σ_search**2))

        # Set p = 0 when i = j
        p[i] = 0

        # Equation 1 (ε -> 0)
        ε = np.nextafter(0, 1)
        p_new = np.maximum(p / np.sum(p), ε)

        # Shannon entropy (in bits)
        H = -np.sum(p_new * np.log2(p_new))

        # Keep the σ that gets log(perplexity) as close to equality as possible
        if np.abs(np.log(perplexity) - H * np.log(2)) < np.abs(result):
            result = np.log(perplexity) - H * np.log(2)
            σ = σ_search

    return σ
```

Given these functions, we can compute the affinity matrix via `p_ij = get_original_pairwise_affinities(X)`, where we obtain the following matrix:

Note that the diagonal elements are set to ε ≈ 0 by construction (whenever *i = j*). Recall that a key extension of the t-SNE algorithm is to compute the joint probabilities rather than the conditional probabilities. This is computed simply as follows:
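From the paper, the symmetrized joint probability is:

```latex
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}
```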

Thus, we can define a new function:

```python
def get_symmetric_p_ij(p_ij: np.ndarray) -> np.ndarray:
    '''
    Function to obtain the symmetric affinities matrix used in t-SNE.
    '''
    print("Computing Symmetric p_ij matrix....")

    n = len(p_ij)
    p_ij_symmetric = np.zeros(shape=(n, n))
    for i in range(n):
        for j in range(n):
            p_ij_symmetric[i, j] = (p_ij[i, j] + p_ij[j, i]) / (2 * n)

    # Set 0 values to the minimum positive numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    p_ij_symmetric = np.maximum(p_ij_symmetric, ε)

    print("Completed Symmetric p_ij Matrix. \n")

    return p_ij_symmetric
```

Feeding in `p_ij` from above, we have `p_ij_symmetric = get_symmetric_p_ij(p_ij)`, where we obtain the following *symmetric* affinities matrix:

Now we have completed the first main step of t-SNE! We computed the symmetric affinity matrix in the original high-dimensional space. Before we dive into the optimization stage, we'll discuss the main components of the optimization problem in the next two steps and then combine them into our for loop.

Next, we want to sample a random initial solution in the lower-dimensional space as follows:

```python
def initialization(X: np.ndarray, n_dimensions: int = 2) -> np.ndarray:
    return np.random.normal(loc=0, scale=1e-4, size=(len(X), n_dimensions))
```

Calling `y0 = initialization(X)`, we obtain a random starting solution:

Now, we want to compute the affinity matrix in this lower-dimensional space. However, recall that we do this using a Student-t distribution with 1 degree of freedom:
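This is eq. (4) in the paper:

```latex
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
```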

Again, we set *q = 0* whenever *i = j*. Note this equation differs from eq. (1) in that the denominator sums over all pairs, and the matrix is thus symmetric by construction. Putting this into code, we obtain:

```python
def get_low_dimensional_affinities(Y: np.ndarray) -> np.ndarray:
    '''
    Obtain low-dimensional affinities.
    '''
    n = len(Y)
    q_ij = np.zeros(shape=(n, n))

    for i in range(n):
        # Equation 4 numerator
        diff = Y[i] - Y
        norm = np.linalg.norm(diff, axis=1)
        q_ij[i, :] = (1 + norm**2)**(-1)

    # Set q = 0 when j = i
    np.fill_diagonal(q_ij, 0)

    # Equation 4 (normalize over the entire matrix)
    q_ij = q_ij / q_ij.sum()

    # Set 0 values to the minimum positive numpy value (ε approx. = 0)
    ε = np.nextafter(0, 1)
    q_ij = np.maximum(q_ij, ε)

    return q_ij
```

Here we are looking for a *1000* x *1000* affinity matrix, but now in the lower-dimensional space. Calling `q_ij = get_low_dimensional_affinities(y0)`, we obtain:

Recall, our cost function is the Kullback-Leibler divergence between the joint probability distributions in the high-dimensional and low-dimensional spaces:
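That is:

```latex
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
```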

Intuitively, we want to minimize the difference between the affinity matrices `p_ij` and `q_ij`, thereby best preserving the "neighborhood" structure of the original space. The optimization problem is solved using gradient descent, but first let's look at computing the gradient of the cost function above. The authors derive it (see Appendix A of the paper) as follows:
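The gradient, eq. (5) in the paper, is:

```latex
\frac{\partial C}{\partial y_i}
  = 4 \sum_j (p_{ij} - q_{ij})\,(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}
```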

In Python, we have:

```python
def get_gradient(p_ij: np.ndarray,
                 q_ij: np.ndarray,
                 Y: np.ndarray) -> np.ndarray:
    '''
    Obtain gradient of cost function at current point Y.
    '''
    n = len(p_ij)

    # Compute gradient row by row
    gradient = np.zeros(shape=(n, Y.shape[1]))
    for i in range(n):
        # Equation 5
        diff = Y[i] - Y
        A = np.array([(p_ij[i, :] - q_ij[i, :])])
        B = np.array([(1 + np.linalg.norm(diff, axis=1)**2)**(-1)])  # note the squared norm
        C = diff
        gradient[i] = 4 * np.sum((A * B).T * C, axis=0)

    return gradient
```

Feeding in the relevant arguments, we obtain the gradient at `y0` via `gradient = get_gradient(p_ij_symmetric, q_ij, y0)`, with the corresponding output:

Now we have all the pieces needed to solve the optimization problem!

In order to update our low-dimensional mapping, we use gradient descent with momentum, as outlined by the authors:
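The update rule from the paper reads:

```latex
\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)}
  + \eta \frac{\partial C}{\partial \mathcal{Y}}
  + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right)
```

(In the code we step along the negative gradient, so that the cost C decreases.)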

where *η* is our learning rate and *α(t)* is our momentum term as a function of time. The learning rate controls the step size at each iteration, and the momentum term allows the optimization algorithm to gain inertia in the smooth directions of the search space while not getting bogged down in the noisy parts of the gradient. We'll set *η = 200* for our example, and will fix *α(t) = 0.5* if *t < 250* and *α(t) = 0.8* otherwise. We now have all the components needed to compute the update rule, so we can run our optimization over a set number of iterations *T* (we'll set *T = 1000*).

Before we set up the iteration scheme, let's first introduce the enhancement the authors refer to as "early exaggeration". This is a constant that scales the original matrix of affinities `p_ij`. It places more emphasis on modeling the very similar points (high values in `p_ij` from the original space) in the new space early on, thus forming "clusters" of highly similar points. Early exaggeration is applied at the beginning of the iteration scheme (*t < 250*) and then turned off. It will be set to 4 in our case. We'll see this in action in the visual below following implementation.

Now, putting all of the pieces together, we have the following algorithm:

```python
def tSNE(X: np.ndarray,
         perplexity: int = 10,
         T: int = 1000,
         η: float = 200,
         early_exaggeration: float = 4,
         n_dimensions: int = 2):

    n = len(X)

    # Get original affinities matrix
    p_ij = get_original_pairwise_affinities(X, perplexity)
    p_ij_symmetric = get_symmetric_p_ij(p_ij)

    # Initialization
    Y = np.zeros(shape=(T, n, n_dimensions))
    Y_minus1 = np.zeros(shape=(n, n_dimensions))
    Y[0] = Y_minus1
    Y1 = initialization(X, n_dimensions)
    Y[1] = np.array(Y1)

    print("Optimizing Low Dimensional Embedding....")

    # Optimization
    for t in range(1, T - 1):

        # Momentum & Early Exaggeration
        if t < 250:
            α = 0.5
        else:
            α = 0.8
            early_exaggeration = 1

        # Get Low Dimensional Affinities
        q_ij = get_low_dimensional_affinities(Y[t])

        # Get Gradient of Cost Function
        gradient = get_gradient(early_exaggeration * p_ij_symmetric, q_ij, Y[t])

        # Update Rule (step along the negative gradient to minimize C)
        Y[t + 1] = Y[t] - η * gradient + α * (Y[t] - Y[t - 1])

        # Compute current value of cost function
        if t % 50 == 0 or t == 1:
            cost = np.sum(p_ij_symmetric * np.log(p_ij_symmetric / q_ij))
            print(f"Iteration {t}: Value of Cost Function is {cost}")

    print(f"Completed Embedding: Final Value of Cost Function is {np.sum(p_ij_symmetric * np.log(p_ij_symmetric / q_ij))}")

    solution = Y[-1]

    return solution, Y
```

Calling `solution, Y = tSNE(X)`, we obtain the following output, where `solution` is the final 2-D mapping and `Y` contains our mapped 2-D values at each step of the iteration. Plotting the evolution of `Y`, where `Y[-1]` is our final 2-D mapping, we obtain (note how the algorithm behaves with early exaggeration on and off):

I recommend playing around with different values of the parameters (i.e., perplexity, learning rate, early exaggeration, etc.) to see how the solution differs (see the original paper and the scikit-learn documentation for guidance on using these parameters).
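As a sanity check on the from-scratch implementation, it can also be instructive to compare against scikit-learn's optimized t-SNE. A rough sketch (using the small built-in `load_digits` dataset rather than the full MNIST download, to keep runtime short):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Small, locally bundled digits dataset as a stand-in for MNIST
X_digits, y_digits = load_digits(return_X_y=True)
X_sample = X_digits[:500]

# Same preprocessing idea as above: PCA down to 30 components
X_30 = PCA(n_components=30).fit_transform(X_sample)

# scikit-learn's t-SNE (Barnes-Hut approximation by default),
# with settings roughly matching our example
embedding = TSNE(n_components=2, perplexity=10, learning_rate=200.0,
                 early_exaggeration=4.0, init='random').fit_transform(X_30)

print(embedding.shape)  # (500, 2)
```

The cluster shapes will differ run to run (recall the *stochastic* in the name), but the grouping of digits should look qualitatively similar to our implementation's output.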