This is day 4 of the Machine Learning Advent Calendar.
Through the first three days, we explored distance-based models for supervised learning: k-NN, the Nearest Centroid classifier, and the Gaussian models (LDA, QDA, GNB).
In all these models, the idea was the same: we measure distances, and we decide the output based on the closest points or the nearest centers.
Today, we stay within this same family of ideas, but we use distances in an unsupervised way: k-means.
Now, one question for those who already know this algorithm: which model does k-means resemble more, the k-NN classifier or the Nearest Centroid classifier?
And if you remember, for all the models we have seen so far, there was not really a “training” phase or any hyperparameter tuning.
- For k-NN, there is no training at all.
- For LDA, QDA, or GNB, training is just computing means and variances. And there are also no real hyperparameters.
Now, with k-means, we’re going to implement a training algorithm that finally looks like “real” machine learning.
We start with a tiny 1D example. Then we move to 2D.
Goal of k-means
In the training dataset, there are no initial labels.
The goal of k-means is to create meaningful labels by grouping points that are close to one another.
Let us look at the illustration below. You can clearly see two groups of points. Each centroid (the red square and the green square) is in the middle of its cluster, and each point is assigned to the closest one.
This gives a very intuitive picture of how k-means discovers structure using only distances.
And here, k is the number of centers we are trying to find.
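For readers who like to see this rule in code, here is a minimal Python sketch of the assignment step (the point coordinates and centroid positions are made up for illustration, not taken from the figure):

```python
# Minimal sketch: assign each point to its closest centroid.
points = [(1.0, 1.5), (1.2, 0.8), (8.0, 9.0), (9.1, 8.5)]   # made-up points
centroids = [(1.0, 1.0), (9.0, 9.0)]                         # think red square, green square

def nearest(point, centroids):
    # squared Euclidean distance is enough to decide which centroid is closest
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))

labels = [nearest(p, centroids) for p in points]
print(labels)  # [0, 0, 1, 1]: the first two points go to one centroid, the last two to the other
```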
Now, let us answer the question: which algorithm is k-means closer to, the k-NN classifier or the Nearest Centroid classifier?
Don’t be fooled by the k in k-NN and k-means.
They do not mean the same thing:
- in k-NN, k is the number of neighbors, not the number of classes;
- in k-means, k is the number of centroids.
K-means is much closer to the Nearest Centroid classifier.
Both models are represented by centroids, and for a new observation we simply compute the distance to each centroid to decide which one it belongs to.
The difference, of course, is that in the Nearest Centroid classifier, we already know the centroids because they come from labeled classes.
In k-means, we do not know the centroids. The whole goal of the algorithm is to find suitable ones directly from the data.
The business problem is completely different: instead of predicting labels, we try to create them.
And in k-means, the value of k (the number of centroids) is unknown, so it becomes a hyperparameter that we can tune.
k-means with Only One Feature
We start with a tiny 1D example so that everything is visible on one axis. And we will choose the values in such a trivial way that we can immediately spot the two centroids:
1, 2, 3, 11, 12, 13
Yes: 2 and 12.
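A quick check: (1 + 2 + 3) / 3 = 2 and (11 + 12 + 13) / 3 = 12.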
But how would the computer know? The machine will “learn” by guessing, step by step.
Here comes Lloyd’s algorithm.
We will implement it in Excel with the following loop:
- select initial centroids
- compute the distance from each point to each centroid
- assign each point to the closest centroid
- recompute the centroids as the average of the points in each cluster
- repeat steps 2 to 4 until the centroids no longer move
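Before going through these steps one by one, here is a minimal Python sketch of the same loop on the 1D data above. It is only a sketch under my own assumptions: the starting centroids 5 and 9 are arbitrary, and the empty-cluster edge case is ignored because it does not occur with these values.

```python
# Lloyd's algorithm on the tiny 1D dataset, mirroring the Excel loop above.
data = [1, 2, 3, 11, 12, 13]
centroids = [5.0, 9.0]  # step 1: arbitrary starting values inside the data range

while True:
    # steps 2-3: assign each point to its nearest centroid (absolute distance in 1D)
    clusters = [min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
                for x in data]
    # step 4: recompute each centroid as the average of the points assigned to it
    new_centroids = [sum(x for x, c in zip(data, clusters) if c == j) / clusters.count(j)
                     for j in range(len(centroids))]
    # step 5: stop when the centroids no longer move
    if new_centroids == centroids:
        break
    centroids = new_centroids

print(centroids)  # [2.0, 12.0]
```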
1. Select initial centroids
Pick two initial centers. They should be somewhere inside the data range (between 1 and 13).
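For the worked numbers in the next steps, let us keep the same arbitrary starting values as in the sketch above: c_1 = 5 and c_2 = 9 (my own choice; any pair inside the range would do).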

2. Compute distances
For every data point x:
- compute the distance to c_1,
- compute the distance to c_2.
Typically, we use the absolute distance in 1D.
We now have two distance values for each point.
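For example, with the starting values above, the point x = 1 gives |1 - 5| = 4 for c_1 and |1 - 9| = 8 for c_2, while x = 11 gives 6 and 2.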

3. Assign clusters
For every point:
- compare the two distances,
- assign the point to the cluster with the smaller distance (1 or 2).
In Excel, this is a simple IF or MIN formula.
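Continuing the example: x = 1 is closer to c_1 (4 < 8), so it goes to cluster 1, while x = 11 is closer to c_2 (2 < 6), so it goes to cluster 2.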

4. Compute the new centroids
For every cluster:
- take the points assigned to that cluster,
- compute their average,
- this average becomes the new centroid.
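With the starting values above, the points 1, 2, 3 end up in cluster 1 and the points 11, 12, 13 in cluster 2, so the new centroids are their averages: exactly the 2 and 12 we guessed at the beginning.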

5. Iterate until convergence
Now in Excel, thanks to the formulas, we can simply paste the new centroid values into the cells of the initial centroids.
The update is immediate, and after doing this a few times, you will see that the values stop changing. That is when the algorithm has converged.
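In the worked example, the centroids jump from (5, 9) to (2, 12) after the first update, and the next update leaves them unchanged, so that is where we stop.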

We can also record each step in Excel, so we can see how the centroids and clusters evolve over time.

k-means with Two Features
Now let us use two features. The process is exactly the same; we simply use the Euclidean distance in 2D.
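Concretely, the distance between a point (x1, x2) and a centroid located at (m1, m2) is sqrt((x1 - m1)^2 + (x2 - m2)^2).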
You can either copy-paste the new centroids as values (with just a few cells to update),

or you can display all the intermediate steps to see the full evolution of the algorithm.
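For the code-minded reader, the only thing that changes compared with the 1D sketch is the distance. Here is a minimal sketch of the assignment step with made-up 2D values (not the values from the sheet):

```python
import math

# Only the distance changes compared with the 1D version: Euclidean distance in 2D.
def distance(point, centroid):
    return math.sqrt((point[0] - centroid[0]) ** 2 + (point[1] - centroid[1]) ** 2)

# Assignment step, exactly as before, just with 2D points (made-up values).
data = [(1, 2), (2, 1), (3, 2), (11, 12), (12, 11), (13, 12)]
centroids = [(5.0, 5.0), (9.0, 9.0)]
clusters = [min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
            for p in data]
print(clusters)  # [0, 0, 0, 1, 1, 1]
```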

Visualizing the Moving Centroids in Excel
To make the process more intuitive, it is useful to create plots that show how the centroids move.
Unfortunately, Excel and Google Sheets are not ideal for this kind of visualization, and the data tables quickly become a bit complex to organize.
If you want to see a full example with detailed plots, you can read this article I wrote almost three years ago, where each step of the centroid movement is shown clearly.

As you can see in this picture, the worksheet became quite disorganized, especially compared to the earlier table, which was very straightforward.

Selecting the optimal k: The Elbow Method
So now, it is possible to try k = 2 and k = 3 in our case and compute the inertia for each. Then we simply compare the values.
We can even begin with k = 1.
For every value of k:
- we run k-means until convergence,
- we compute the inertia, which is the sum of squared distances between each point and its assigned centroid.
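In symbols, inertia = Σ d(x, c(x))^2, where the sum runs over all points x and c(x) is the centroid of the cluster that x is assigned to.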
In Excel:
- For each point, take the distance to its assigned centroid and square it.
- Sum all these squared distances.
- This gives the inertia for this value of k.
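As a cross-check outside the spreadsheet, here is the same computation as a minimal Python sketch. The 2D values are made up, and the k = 2 centroids and assignments are the converged ones for those made-up values:

```python
import math

def distance(point, centroid):
    return math.sqrt((point[0] - centroid[0]) ** 2 + (point[1] - centroid[1]) ** 2)

# Inertia for k = 2: sum of squared distances to each point's assigned centroid.
data = [(1, 2), (2, 1), (3, 2), (11, 12), (12, 11), (13, 12)]    # made-up 2D values
centroids = [(2.0, 5.0 / 3.0), (12.0, 35.0 / 3.0)]               # converged centroids
clusters = [0, 0, 0, 1, 1, 1]                                    # converged assignments
inertia = sum(distance(p, centroids[c]) ** 2 for p, c in zip(data, clusters))
print(inertia)  # about 5.33
```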
For example:
- for k = 1, the centroid is simply the overall mean of x1 and x2,
- for k = 2 and k = 3, we take the converged centroids from the sheets where we ran the algorithm.
Then we can plot the inertia as a function of k, for example for k = 1, 2, 3.
For this dataset:
- from k = 1 to k = 2, the inertia drops a lot,
- from k = 2 to k = 3, the improvement is much smaller.
The “elbow” is the value of k after which the decrease in inertia becomes marginal. In this example, it suggests that k = 2 is enough.
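If you want to double-check the numbers outside the spreadsheet, scikit-learn reports the same quantity through the inertia_ attribute of its KMeans estimator. The library choice is mine, and the data below is still the made-up 2D example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Inertia for k = 1, 2, 3 on the made-up 2D points.
X = np.array([(1, 2), (2, 1), (3, 2), (11, 12), (12, 11), (13, 12)], dtype=float)
for k in (1, 2, 3):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 2))
# The drop from k = 1 to k = 2 is huge, the drop from k = 2 to k = 3 is small:
# the elbow is at k = 2.
```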

Conclusion
K-means is a very intuitive algorithm once you see it step by step in Excel.
We start with simple centroids, compute distances, assign points, update the centroids, and repeat. Now we can see how “machines learn”, right?
Well, this is just the start: we will see that different models “learn” in very different ways.
And here is the transition to tomorrow’s article: the unsupervised version of the Nearest Centroid classifier is indeed k-means.
So what would be the unsupervised version of LDA or QDA? We will answer that in the next article.

