After the article about SVM, the next natural step is Kernel SVM.
At first sight, it looks like a very different model. The training happens in the dual form, we stop talking about a slope and an intercept, and suddenly everything is about a “kernel”.
In today’s article, I’ll make the word kernel concrete by visualizing what it really does.
There are many good ways to introduce Kernel SVM. If you have read my previous articles, you know that I like to start from something simple that you already know.
A classic way to introduce Kernel SVM is this: SVM is a linear model. If the relationship between the features and the target is non-linear, a straight line won't separate the classes well. So we create new features. Polynomial regression is still a linear model; we simply add polynomial features (x, x², x³, …). From this standpoint, a polynomial kernel performs polynomial regression implicitly, and an RBF kernel can be seen as using an infinite series of polynomial features…
Perhaps another day we will follow this path, but today we will take a different one: we start with KDE.
Yes, Kernel Density Estimation.
Let’s start.
And you can use this link to get the Google Sheet.
1. KDE as a sum of individual densities
I introduced KDE in the article about LDA and QDA, and at the time I said we would reuse it later. This is the moment.
We see the word kernel in KDE, and we also see it in Kernel SVM. This is not a coincidence; there is a real link.
The idea of KDE is simple:
around each data point, we place a small distribution (a kernel).
Then, we add all these individual densities together to obtain a global distribution.
Keep this idea in mind. It will be the key to understanding Kernel SVM.

We can also adjust one parameter, the bandwidth, to control how smooth the global density is, from very local to very smooth, as illustrated in the GIF below.

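If you prefer code to a spreadsheet, here is a minimal NumPy sketch of this idea; the data points and the bandwidth value are made up for illustration, not taken from the article's dataset.

```python
import numpy as np

# Minimal KDE sketch: one Gaussian bump per data point, summed into a global density.
x_points = np.array([1.0, 1.5, 2.2, 3.0, 4.1, 4.5, 5.0, 6.3, 7.0, 8.2])  # made-up data
bandwidth = 0.5                          # the smoothing parameter mentioned above
grid = np.linspace(0, 10, 500)           # positions where we evaluate the density

# One Gaussian bump per data point (rows = grid positions, columns = points)...
bumps = np.exp(-0.5 * ((grid[:, None] - x_points[None, :]) / bandwidth) ** 2)
bumps /= bandwidth * np.sqrt(2 * np.pi)

# ...and the global density is simply their average.
density = bumps.mean(axis=1)             # plot density vs. grid to reproduce the GIF idea
```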
As you know, KDE is a distance- or density-based model, so here we are going to create a link between two models from two different families.
2. Turning KDE into a model
Now we reuse the exact same idea to build a function around each point, and then this function can be used for classification.
Do you remember that the classification task with weight-based models is first a regression task, since the value y is always treated as continuous? We only do the classification part once we have the decision function f(x).
2.1. (Still) using a simple dataset
Someone once asked me why I always use around 10 data points to explain machine learning, saying it is meaningless.
I strongly disagree.
If someone cannot explain how a Machine Learning model works with 10 points (or fewer) and one single feature, then they do not really understand how the model works.
So this will not be a surprise: yes, I will still use this very simple dataset, the one I already used for logistic regression and SVM. I know it is linearly separable, but it is still interesting to compare the results of the models.
I also generated another dataset with data points that are not linearly separable and visualized how the kernelized model works on it.

2.2. RBF kernel centered on points
Let us now apply the KDE idea to our dataset.
For each data point, we place a bell-shaped curve centered on its x value. At this stage, we don't care about classification yet. We are only doing one simple thing: creating one local bell around each point.
This bell has a Gaussian shape, but here it has a specific name: RBF, for Radial Basis Function.
In this figure, we can see the RBF (Gaussian) kernel centered on the point x₇.

The name sounds technical, but the idea is actually quite simple.
Once you see RBFs as “distance-based bells”, the name stops being mysterious.
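For reference, the usual parameterization of this bell, and the one I assume in the rest of the article, is K(x, x₇) = exp(−γ · (x − x₇)²).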
How to read this intuitively
- x is any position on the x-axis
- x₇ is the center of the bell (the seventh point)
- γ (gamma) controls the width of the bell
So the bell reaches its maximum exactly at the point x₇.
As x moves away from x₇, the value decreases smoothly toward 0.
Role of γ (gamma)
- Small γ means wide bell (smooth, global influence)
- Large γ means narrow bell (very local influence)
So γ plays the same role as the bandwidth in KDE.
At this stage, nothing is combined yet. We are only building the elementary blocks.
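Here is the same construction as a small NumPy sketch; the data points and the value of γ are placeholders for illustration, not the ones from the Excel file.

```python
import numpy as np

def rbf(x, center, gamma):
    # One bell: maximal at the center, decreasing smoothly toward 0 away from it.
    return np.exp(-gamma * (x - center) ** 2)

x_points = np.array([1.0, 1.5, 2.2, 3.0, 4.1, 4.5, 5.0, 6.3, 7.0, 8.2])  # made-up data
gamma = 1.0                              # try 0.1 (wide bells) vs. 10 (narrow bells)
grid = np.linspace(0, 10, 500)

# One column per data point: the elementary blocks, nothing combined yet.
bells = np.stack([rbf(grid, xi, gamma) for xi in x_points], axis=1)
```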
2.3. Combining bells with class labels
In the figures below, you first see the individual bells, each centered on a data point.
Once this is clear, we move to the next step: combining the bells.
This time, each bell is multiplied by its label yᵢ.
As a result, some bells are added and others are subtracted, creating influences in two opposite directions.
This is the first step toward a classification function.
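Written out (with the labels encoded as −1 and +1, which is the convention I assume here), this intermediate score is simply the signed sum of bells: g(x) = Σᵢ yᵢ · K(x, xᵢ).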

And we can see in Excel all the components from each data point that are added together to get the final score.

This already looks extremely similar to KDE.
But we are not done yet.
2.4. From equal bells to weighted bells
We said earlier that SVM belongs to the weight-based family of models. So the next natural step is to introduce weights.
In distance-based models, one major limitation is that all features are treated as equally important when computing distances. Of course, we can rescale features, but this is often a manual and imperfect fix.
Here, we take a different approach.
Instead of simply summing all the bells, we assign a weight to each data point and multiply each bell by this weight.

At this point, the model is still linear, but linear in the space of kernels, not in the original input space.
To make this concrete, we can assume that the coefficients αᵢ are already known and directly plot the resulting function in Excel. Each data point contributes its own weighted bell, and the final score is simply the sum of all these contributions.

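If you want to reproduce the Excel computation in code, here is a small sketch; the αᵢ values, the labels, and the intercept b below are invented for illustration, not trained coefficients.

```python
import numpy as np

# Weighted sum of bells, with made-up coefficients (in a real SVM they come from training).
x_points = np.array([1.0, 1.5, 2.2, 3.0, 4.1, 4.5, 5.0, 6.3, 7.0, 8.2])
y_labels = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])
alphas   = np.array([0.0, 0.8, 0.0, 1.2, 1.0, 0.9, 0.0, 0.5, 0.0, 0.3])
b, gamma = -0.2, 1.0

def decision_function(x):
    # f(x) = sum_i alpha_i * y_i * K(x, x_i) + b, with the RBF kernel from above
    bells = np.exp(-gamma * (x[:, None] - x_points[None, :]) ** 2)
    return bells @ (alphas * y_labels) + b

grid = np.linspace(0, 10, 500)
scores = decision_function(grid)      # the curve we would plot in Excel
predicted_class = np.sign(scores)     # the classification step: sign of the score
```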
If we apply this to a dataset with a non-linearly separable boundary, we clearly see what Kernel SVM is doing: it fits the data by combining local influences, instead of trying to draw a single straight line.

3. Loss function: where SVM really starts
So far, we have only talked about the kernel part of the model. We have built bells, weighted them, and combined them.
But our model is called Kernel SVM, not just “kernel model”.
The SVM part comes from the loss function.
And as you may already know, SVM is defined by the hinge loss.
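To be explicit (with labels encoded as ±1, the convention used above), the hinge loss of a point with label y and score f(x) is max(0, 1 − y·f(x)).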
3.1 Hinge loss and support vectors
The hinge loss has an important property.
If a point is:
- correctly classified, and
- far enough from the decision boundary,
then its loss is zero.
As a direct consequence, its coefficient αᵢ becomes zero.
Only a few data points remain active.
These points are called support vectors.
So even though we started with one bell per data point, in the final model only a few bells survive.
In the example below, you can see that for some points (for instance points 5 and 8), the coefficient αᵢ is zero. These points are not support vectors and do not contribute to the decision function.
Depending on how strongly we penalize violations (through the parameter C), the number of support vectors can increase or decrease.
This is an important practical advantage of SVM.
When the dataset is large, storing one parameter per data point can be expensive. Thanks to the hinge loss, SVM produces a sparse model, where only a small subset of points is kept.
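If you want to see this sparsity without building anything by hand, here is a quick scikit-learn check on a toy version of our dataset (the values of C and γ are arbitrary):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 1-feature data, for illustration only.
X = np.array([[1.0], [1.5], [2.2], [3.0], [4.1], [4.5], [5.0], [6.3], [7.0], [8.2]])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

model = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

print(model.support_)     # indices of the support vectors: often only a few points
print(model.dual_coef_)   # their alpha_i * y_i values; every other point has alpha_i = 0
print(model.n_support_)   # try changing C and watch this number vary
```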

3.2 Kernel ridge regression: same kernels, different loss
If we keep the same kernels but replace the hinge loss with a squared loss, we obtain kernel ridge regression:
Same kernels.
Same bells.
Different loss.
This results in an important conclusion:
Kernels define the representation.
The loss function defines the model.
With kernel ridge regression, the model must store all training data points.
Since squared loss doesn’t force any coefficient to zero, every data point keeps a non-zero weight and contributes to the prediction.
In contrast, Kernel SVM produces a sparse solution: only the support vectors are stored; all other points disappear from the model.
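The same comparison can be done in scikit-learn with KernelRidge; again, the data and hyperparameters below are only illustrative:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Same toy data as before, same RBF kernel, but squared loss this time.
X = np.array([[1.0], [1.5], [2.2], [3.0], [4.1], [4.5], [5.0], [6.3], [7.0], [8.2]])
y = np.array([-1, -1, -1, -1, -1, 1, 1, 1, 1, 1])

kr = KernelRidge(kernel="rbf", gamma=1.0, alpha=1.0).fit(X, y)

print(kr.dual_coef_.shape)          # one dual coefficient per training point
print((kr.dual_coef_ != 0).sum())   # typically all of them are non-zero: no sparsity
```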

3.3 A quick link with LASSO
There’s an interesting parallel with LASSO.
In linear regression, LASSO uses an L1 penalty on the primal coefficients. This penalty encourages sparsity, and some coefficients become exactly zero.
In SVM, the hinge loss plays a similar role, but in a different space.
- LASSO creates sparsity in the primal coefficients
- SVM creates sparsity in the dual coefficients αᵢ
Different mechanisms, same effect: only the important parameters survive.
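For completeness, here is a tiny illustration of that primal sparsity with scikit-learn's Lasso on random data (the data and the penalty strength are invented for the example):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Random data where only the first two features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)                  # most primal coefficients are exactly zero
print((lasso.coef_ == 0).sum())     # number of features dropped by the L1 penalty
```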
Conclusion
Kernel SVM is not only about kernels.
- Kernels build a rich, non-linear representation.
- The hinge loss selects only the essential data points.
The result is a model that is both flexible and sparse, which is why SVM remains a powerful and elegant tool.
Tomorrow, we will look at another model that deals with non-linearity. Stay tuned.
