This is the model that motivated me, from the very beginning, to use Excel to better understand Machine Learning.
And today, you’ll see a different explanation of SVM from the one you usually get, which starts with:
- margin separators,
- distances to a hyperplane,
- geometric constructions first.
Instead, we’ll build the model step by step, starting from things we already know.
So maybe this will be the day you finally say, “oh, now I understand better.”
Building a New Model on What We Already Know
One of my main learning principles is simple:
always start from what we already know.
Before SVM, we already studied:
- logistic regression,
- penalization and regularization.
We’ll use these models and ideas today.
The idea is not to introduce a brand-new model, but to transform an existing one.
Training datasets and label convention
We’ll use a one-feature dataset to explain the SVM.
Yes, I know, this might be the first time you see someone explain SVM using just one feature.
Why not?
In fact, it’s essential, for several reasons.
For other models, such as linear regression or logistic regression, we usually start with a single feature. We should do the same with SVM, so that we can compare the models properly.
If you build a model with many features and think you understand how it works, but you can’t explain it with only one feature, then you don’t really understand it yet.
Using a single feature makes the model:
- simpler to implement,
- easier to visualise,
- and much easier to debug.
So, we use two datasets that I generated to illustrate the two possible situations a linear classifier can face:
- one dataset is perfectly separable,
- the other is not perfectly separable.
You may already guess why we need these two datasets here, whereas we usually use only one, right?
We also use the label convention -1 and 1 instead of 0 and 1.
Why? We’ll see later; it’s actually an interesting bit of history about how these models are viewed from the GLM and the Machine Learning perspectives.
In logistic regression, before applying the sigmoid, we compute a logit. We can call it f; it is a linear score.
This linear score can take any real value, from −∞ to +∞.
- positive values correspond to one class,
- negative values correspond to the other,
- zero is the choice boundary.
Using labels -1 and 1 matches this interpretation naturally.
It emphasizes the sign of the logit, without going through probabilities.
So, we’re working with a pure linear model, not within the GLM framework.
There is no sigmoid, no probability, just a linear decision score.
A compact way to express this idea is to look at the quantity:
y(ax + b) = y f(x)
- If this value is positive, the point is correctly classified.
- If it is large, the classification is confident.
- If it is negative, the point is misclassified.
At this point, we’re still not talking about SVMs.
We’re only making explicit what good classification means in a linear setting.
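To make this concrete, here is a minimal Python sketch of the same computation (the article itself works in Excel; the coefficients and data here are made up for illustration):

```python
import numpy as np

# Hypothetical 1-feature data with labels in {-1, +1}
x = np.array([2.0, 6.0, 10.0, 14.0])
y = np.array([-1, -1, 1, 1])

# Hypothetical coefficients of the linear score f(x) = a*x + b
a, b = 0.5, -4.0

f = a * x + b   # linear score
m = y * f       # the quantity y * f(x)

for xi, yi, mi in zip(x, y, m):
    status = "correct" if mi > 0 else "misclassified"
    print(f"x={xi:5.1f}  y={yi:+d}  y*f(x)={mi:+.2f}  -> {status}")
```

The larger y f(x) is, the more confident the classification; a negative value flags a misclassified point.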
From log-loss to a new loss function
With this convention, we can write the log-loss of logistic regression directly as a function of the quantity:
y f(x) = y (ax+b)
We can plot this loss, log(1 + exp(−yf(x))), as a function of yf(x).
Now, let us introduce a new loss function called the hinge loss: max(0, 1 − yf(x)).
When we plot the two losses on the same graph, we can see that they are quite similar in shape.
Do you remember Gini vs. Entropy in Decision Tree Classifiers?
The comparison is very similar here.

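To reproduce that comparison outside Excel, here is a minimal Python sketch (matplotlib is assumed to be available) that plots both losses as functions of the margin m = yf(x):

```python
import numpy as np
import matplotlib.pyplot as plt

# Margin values m = y * f(x)
m = np.linspace(-3, 3, 300)

log_loss = np.log(1 + np.exp(-m))   # logistic regression loss
hinge = np.maximum(0, 1 - m)        # hinge loss (SVM)

plt.plot(m, log_loss, label="log-loss: log(1 + exp(-m))")
plt.plot(m, hinge, label="hinge: max(0, 1 - m)")
plt.axvline(0, color="gray", lw=0.5)  # decision boundary m = 0
plt.xlabel("m = y f(x)")
plt.ylabel("loss")
plt.legend()
plt.show()
```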
In both cases, the idea is to penalize:
- points that are misclassified, i.e. yf(x) < 0,
- points that are too close to the decision boundary.
The difference is in how this penalty is applied.
- The log-loss penalizes errors in a smooth and progressive way. Even well-classified points are still slightly penalized.
- The hinge loss is more direct and abrupt. Once a point is correctly classified with a sufficient margin, it is no longer penalized at all.
So the goal is not to change what we consider a good or bad classification,
but to simplify the way we penalize it.
One question naturally follows.
Could we also use a squared loss?
After all, linear regression can also be used as a classifier.
But when we do this, we immediately see the problem:
the squared loss keeps penalizing points that are already very well classified. With labels in {-1, 1}, the squared loss can be written as (1 − yf(x))², so a point with margin yf(x) = 3, classified very confidently, still pays a penalty of (1 − 3)² = 4, while its hinge loss is zero.
Instead of focusing on the decision boundary, the model tries to fit exact numeric targets.
This is why linear regression is generally a poor classifier, and why the choice of the loss function matters so much.

Description of the new model
Let us now assume that the model is already trained and look directly at the results.
For both models, we compute the exact same quantities:
- the linear score (called the logit in Logistic Regression),
- the probability (we can simply apply the sigmoid function in both cases),
- and the loss value.
This enables a direct, point-by-point comparison between the two approaches.
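As a rough Python sketch of that point-by-point table (trained coefficients and data are hypothetical, reusing the values from the earlier snippet):

```python
import numpy as np

# Hypothetical trained coefficients and 1-feature data
a, b = 0.5, -4.0
x = np.array([2.0, 6.0, 10.0, 14.0])
y = np.array([-1, -1, 1, 1])

score = a * x + b                  # linear score (logit)
prob = 1 / (1 + np.exp(-score))    # sigmoid, applicable in both cases
m = y * score                      # margin y * f(x)

log_loss = np.log(1 + np.exp(-m))  # logistic regression loss
hinge = np.maximum(0, 1 - m)       # SVM loss

print("score:   ", score)
print("prob:    ", prob.round(3))
print("log-loss:", log_loss.round(3))
print("hinge:   ", hinge)
```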
Although the loss functions are different, the linear scores and the resulting classifications are very similar on this dataset.
For the perfectly separable dataset, the result is immediate: all points are correctly classified and lie sufficiently far from the decision boundary. As a consequence, the hinge loss is equal to zero for every observation.
This leads to an important conclusion.
When the data is perfectly separable, there is no unique solution. In fact, there are infinitely many linear decision functions that achieve the exact same result. We can shift the line, rotate it slightly, or rescale the coefficients, and the classification stays perfect, with zero loss everywhere.
So what can we do next?
We introduce regularization.
Just as in ridge regression, we add a penalty on the size of the coefficients. This extra term doesn’t improve classification accuracy, but it allows us to select one solution among all the possible ones.
So in our dataset, we get the one with the smallest slope a.

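Here is a small Python sketch of that idea, using hypothetical separable data and keeping the boundary fixed midway between the classes: every slope a beyond some threshold gives zero hinge loss, and the penalty then picks the smallest such slope.

```python
import numpy as np

# Hypothetical perfectly separable 1-feature dataset
x = np.array([1.0, 3.0, 6.0, 10.0, 12.0, 14.0])
y = np.array([-1, -1, -1, 1, 1, 1])

lam = 0.1  # assumed regularization strength

for a in [0.25, 0.5, 1.0, 2.0]:
    b = -8.0 * a                        # decision boundary kept at x = 8
    m = y * (a * x + b)                 # margins
    hinge = np.maximum(0, 1 - m).sum()  # total hinge loss
    total = hinge + lam * a**2          # regularized objective
    print(f"a={a:4.2f}  hinge={hinge:.2f}  total={total:.3f}")
```

The hinge term vanishes for every slope a ≥ 0.5 here, so the regularized objective is minimized by the smallest of those slopes.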
And congratulations, we have just built the SVM model.
We can now write down the cost functions of the two models: Logistic Regression and SVM.
Do you remember that Logistic Regression can also be regularized, and it is still called Logistic Regression, right?

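For reference, here is a plausible reconstruction of the two cost functions in our one-feature setting, written with the λ convention used so far:

```latex
% Regularized logistic regression
J_{\text{LogReg}}(a, b) = \sum_i \log\left(1 + e^{-y_i (a x_i + b)}\right) + \lambda a^2

% SVM: hinge loss with the same penalty
J_{\text{SVM}}(a, b) = \sum_i \max\left(0,\, 1 - y_i (a x_i + b)\right) + \lambda a^2
```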
Now, why does the model’s name include the term “Support Vectors”?
If you look at the dataset, you can see that only a few points, for instance those with values 6 and 10, are enough to determine the decision boundary. These points are called support vectors.
At this stage, with the perspective we are using, we cannot identify them directly.
We’ll see later that another viewpoint makes them appear naturally.
And we can do the same exercise for the other, non-separable dataset; the principle is the same. Nothing changes.
But now, we can see that for certain points, the hinge loss is not zero. In our case below, we can see visually that there are 4 points that we need as support vectors.

SVM Model Training with Gradient Descent
We now train the SVM model explicitly, using gradient descent.
Nothing new is introduced here. We reuse the same optimization logic we already applied to linear and logistic regression.
New convention: Lambda (λ) or C
In many models we studied previously, such as ridge or logistic regression, the objective function is written as:
data-fit loss + λ ∥w∥²
Here, the regularization parameter λ controls the penalty on the size of the coefficients.
For SVMs, the usual convention is slightly different: we instead put a parameter C in front of the data-fit term.

Both formulations are equivalent.
They only differ by a rescaling of the objective function.
We keep the parameter C since it is the standard notation used for SVMs. We’ll see why we have this convention later.
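In symbols, and glossing over constant factors, the two ways of writing the objective are (dividing the first by λ and setting C = 1/λ gives the second):

```latex
% Lambda convention (ridge-style)
\min_{w, b}\; \sum_i \ell\big(y_i f(x_i)\big) + \lambda \lVert w \rVert^2

% C convention (SVM-style)
\min_{w, b}\; C \sum_i \ell\big(y_i f(x_i)\big) + \lVert w \rVert^2
```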
Gradient (subgradient)
We work with a linear decision function, and we can define the margin for each point as: mi = yi (axi + b).
Only observations such that mi < 1 contribute to the hinge loss.
The subgradients of the objective follow directly: ∂J/∂a = a − C Σ yi xi and ∂J/∂b = −C Σ yi, where both sums run only over the points with mi < 1. We can implement this in Excel using logical masks and SUMPRODUCT.

Parameter update
With a learning rate (step size) η, the gradient descent updates follow the usual formula: a ← a − η ∂J/∂a and b ← b − η ∂J/∂b.

We iterate these updates until convergence.
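A minimal NumPy sketch of this training loop, mirroring the Excel logical-mask approach (the dataset, C, learning rate, and iteration count are all hypothetical):

```python
import numpy as np

# Hypothetical non-separable 1-feature dataset
x = np.array([1.0, 3.0, 6.0, 9.0, 8.0, 10.0, 12.0, 14.0])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

C = 1.0      # weight of the data-fit (hinge) term
eta = 0.01   # learning rate
a, b = 0.0, 0.0

for _ in range(2000):
    m = y * (a * x + b)   # margins m_i = y_i (a x_i + b)
    mask = m < 1          # only these points contribute (the Excel logical mask)
    grad_a = a - C * np.sum(y[mask] * x[mask])
    grad_b = -C * np.sum(y[mask])
    a -= eta * grad_a
    b -= eta * grad_b

print(f"a = {a:.3f}, b = {b:.3f}")
```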
And, by the way, this training procedure also gives us something very nice to visualize. At each iteration, as the coefficients are updated, the size of the margin changes.
So we can visualize, step by step, how the margin evolves throughout the learning process.
Optimization vs. geometric formulation of SVM
The figure below shows the same SVM objective function written in two different languages.

On the left, the model is expressed as an optimization problem.
We minimize a mix of two things:
- a term that keeps the model simple, by penalizing large coefficients,
- and a term that penalizes classification errors or margin violations.
This is the view we have been using so far. It is natural when we think in terms of loss functions, regularization, and gradient descent. It is the most convenient form for implementation and optimization.
On the right, the same model is expressed in a geometric way.
Instead of talking about losses, we talk about:
- margins,
- constraints,
- and distances to the separating boundary.
When the data is perfectly separable, the model looks for the separating line with the largest possible margin, without allowing any violation. This is the hard-margin case.
When perfect separation is impossible, violations are allowed, but they are penalized. This leads to the soft-margin case.
What is important to understand is that these two views are strictly equivalent.
The optimization formulation automatically enforces the geometric constraints:
- penalizing large coefficients corresponds to maximizing the margin,
- penalizing hinge violations corresponds to allowing, but controlling, margin violations.
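Written out, with slack variables ξi for the geometric view (notation details may differ slightly from the figure), the two formulations are:

```latex
% Optimization (loss) view
\min_{w, b}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \max\big(0,\, 1 - y_i (w \cdot x_i + b)\big)

% Geometric (constrained) view, soft margin
\min_{w, b, \xi}\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0
```

At the optimum, each ξi equals the hinge loss of point i, which is exactly why the two problems coincide.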
So these are not two different models, nor two different ideas.
It’s the same SVM, seen from two complementary perspectives.
Once this equivalence is clear, the SVM becomes much less mysterious: it is simply a linear model with a particular way of measuring errors and controlling complexity, which naturally leads to the maximum-margin interpretation we all know.
Unified Linear Classifier
From the optimization viewpoint, we can now take a step back and look at the bigger picture.
What we have built is not just “the SVM”, but a general linear classification framework.
A linear classifier is defined by three independent choices:
- a linear decision function,
- a loss function,
- a regularization term.
Once this is clear, many models appear as simple combinations of these elements.
In practice, this is exactly what we can do with SGDClassifier in scikit-learn.

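For instance, here is a minimal scikit-learn sketch on synthetic data (loss names follow recent scikit-learn versions, where the log-loss is spelled "log_loss"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical 1-feature dataset, labels in {-1, +1}
X = np.array([[1.0], [3.0], [6.0], [9.0], [8.0], [10.0], [12.0], [14.0]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# Hinge loss + L2 penalty: a linear SVM trained by SGD
svm = SGDClassifier(loss="hinge", penalty="l2", alpha=0.01, max_iter=1000)
svm.fit(X, y)

# Log-loss + L1 penalty: a sparse logistic regression, same framework
logreg = SGDClassifier(loss="log_loss", penalty="l1", alpha=0.01, max_iter=1000)
logreg.fit(X, y)

print("SVM:   ", svm.coef_, svm.intercept_)
print("LogReg:", logreg.coef_, logreg.intercept_)
```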
From the same viewpoint, we can:
- combine the hinge loss with L1 regularization,
- replace the hinge loss with the squared hinge loss,
- use log-loss, hinge loss, or other margin-based losses,
- choose L2 or L1 penalties depending on the desired behavior.
Each choice changes how errors are penalized or how coefficients are controlled, but the underlying model stays the same: a linear decision function trained by optimization.
Primal vs Dual Formulation
You may already have heard about the dual form of SVM.
So far, we have worked entirely in the primal form:
- we optimized the model coefficients directly,
- using loss functions and regularization.
The dual form is another way to write the same optimization problem.
Instead of assigning weights to features, the dual form assigns a coefficient, usually called alpha, to each data point.
We will not derive or implement the dual form in Excel, but we can still observe its result.
Using scikit-learn, we can compute the alpha values and confirm that:
- the primal and dual forms result in the same model,
- same decision boundary, same predictions.

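A short sketch of that check, using scikit-learn’s SVC (which solves the dual) on synthetic data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 1-feature dataset
X = np.array([[1.0], [3.0], [6.0], [9.0], [8.0], [10.0], [12.0], [14.0]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

model = SVC(kernel="linear", C=1.0)
model.fit(X, y)

# Points with non-zero alpha: the support vectors
print("support vector indices:", model.support_)
print("support vectors:       ", model.support_vectors_.ravel())
# dual_coef_ holds y_i * alpha_i for the support vectors only
print("y_i * alpha_i:         ", model.dual_coef_.ravel())
# The primal slope can be recovered from the dual coefficients
print("w from dual:", model.dual_coef_ @ model.support_vectors_, "vs coef_:", model.coef_)
```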
What makes the dual form particularly interesting for SVM is that:
- most alpha values are exactly zero,
- only a few data points have non-zero alpha.
These points are the support vectors.
This behavior is specific to margin-based losses like the hinge loss.
Finally, the dual form also explains why SVMs can use the kernel trick.
By working with similarities between data points, we can build non-linear classifiers without changing the optimization framework.
We’ll see this tomorrow.
Conclusion
In this article, we didn’t approach SVM as a geometric object with complicated formulas. Instead, we built it step by step, starting from models we already know.
By changing only the loss function, then adding regularization, we naturally arrived at the SVM. The model didn’t change. Only the way we penalize errors did.
Seen this way, SVM is not a new family of models. It’s a natural extension of linear and logistic regression, viewed through a different loss.
We also showed that:
- the optimization view and the geometric view are equivalent,
- the maximum-margin interpretation comes directly from regularization,
- and the notion of support vectors emerges naturally from the dual perspective.
Once these links are clear, SVM becomes much easier to understand and to place among other linear classifiers.
In the next step, we’ll use this new perspective to go further, and see how kernels extend this idea beyond linear models.
