After the k-NN Regressor and the idea of prediction based on distance, we now take a look at the k-NN Classifier.
The principle is identical, but classification allows us to introduce several useful variants, such as Radius Nearest Neighbors, Nearest Centroid, multi-class prediction, and probabilistic distance models.
So we will first implement the k-NN classifier, then discuss how it can be improved.
You can use this Excel/Google Sheet while reading this article to better follow all the explanations.
Titanic survival dataset
We’ll use the Titanic survival dataset, a classic example where each row describes a passenger with features such as class, sex, age, and fare, and the goal is to predict whether the passenger survived.

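If you prefer to follow along in Python rather than the spreadsheet, here is a minimal way to load a copy of the dataset; I am assuming the convenience version bundled with seaborn, whose column names (pclass, sex, age, fare, survived) may differ from other copies.

```python
import seaborn as sns

# Load a convenience copy of the Titanic dataset (assumed source: seaborn's bundled datasets).
titanic = sns.load_dataset("titanic")

# Keep the features discussed in this article, plus the target "survived".
df = titanic[["pclass", "sex", "age", "fare", "survived"]]
print(df.head())
```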
Principle of k-NN for Classification
The k-NN classifier is so similar to the k-NN regressor that I could almost have written a single article to explain them both.
In fact, when we search for the nearest neighbors, we do not use the target value at all, let alone its nature.
BUT, there are still some interesting facts about how classifiers (binary or multi-class) are built, and how the features can be handled differently.
We start with the binary classification task, and then move on to multi-class classification.
One Continuous Feature for Binary Classification
So, very quickly, we can do the same exercise for one continuous feature with this dataset.
For the value of y, we usually use 0 and 1 to distinguish the two classes. But you may notice, or you will notice, that this can be a source of confusion.

Now, think about it: 0 and 1 are also numbers, right? So we can do exactly the same process as if we were doing a regression.
That’s right. Nothing changes in the computation, as you can see in the following screenshot. And you can of course try to change the value of the new observation yourself.

The only difference is how we interpret the result. When we take the “average” of the neighbors’ values, this number is interpreted as the probability that the new observation belongs to class 1.
So in reality, the “average” value is not the right interpretation; it is rather the proportion of class 1 among the neighbors.
We can even manually create this plot, to show how the predicted probability changes over a range of values.
Traditionally, to avoid ending up with a 50 percent probability, we choose an odd value for k, so that majority voting can always decide.

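As a minimal sketch of that computation (with made-up ages and labels, purely for illustration), the predicted probability of class 1 is just the average of the k nearest neighbors’ 0/1 labels:

```python
import numpy as np

# Toy data: one continuous feature (age) and a 0/1 target (survived).
ages = np.array([4.0, 18.0, 25.0, 32.0, 40.0, 58.0, 63.0])
survived = np.array([1, 1, 0, 1, 0, 0, 0])

def knn_predict_proba(x_new, k=3):
    # Distance from the new observation to every training point (1D: absolute difference).
    distances = np.abs(ages - x_new)
    # Indices of the k nearest neighbors.
    nearest = np.argsort(distances)[:k]
    # The "average" of the 0/1 labels is read as the probability of class 1.
    return survived[nearest].mean()

print(knn_predict_proba(30.0, k=3))  # proportion of survivors among the 3 closest ages
```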
Two Features for Binary Classification
If we have two features, the operation is also almost the same as for the k-NN regressor.

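Here is a sketch with scikit-learn (assuming the seaborn column names age and fare); scaling the two features first keeps the larger one, fare, from dominating the distance:

```python
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

titanic = sns.load_dataset("titanic").dropna(subset=["age", "fare"])
X = titanic[["age", "fare"]].to_numpy()
y = titanic["survived"].to_numpy()

# Scale both features so that fare (large values) does not dominate age in the distance.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)

# Predicted class and class probabilities for a hypothetical 30-year-old who paid a fare of 40.
print(model.predict([[30.0, 40.0]]), model.predict_proba([[30.0, 40.0]]))
```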
One feature for multi-class classification
Now, let’s take an example with three classes for the target variable y.
Then we can see that we cannot use the notion of “average” anymore, since the number that represents the class is not really a number. We should rather call them “class 0”, “class 1”, and “class 2”, and classify a new observation by majority voting among its neighbors.

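A minimal sketch of this majority vote (toy fares and a made-up three-class target, just for illustration): we count each class among the k nearest neighbors, which gives both per-class proportions and a winning class.

```python
import numpy as np
from collections import Counter

# Toy data: one feature (fare) and a three-class target (class 0, 1, or 2).
fares = np.array([7.0, 8.0, 13.0, 15.0, 30.0, 70.0, 90.0])
labels = np.array([2, 2, 1, 1, 1, 0, 0])

def knn_multiclass(x_new, k=3):
    nearest = np.argsort(np.abs(fares - x_new))[:k]
    counts = Counter(labels[nearest])
    # Per-class proportions play the role of the "average" from the binary case.
    probas = {c: counts.get(c, 0) / k for c in np.unique(labels)}
    winner = max(probas, key=probas.get)
    return winner, probas

print(knn_multiclass(20.0, k=3))
```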
From k-NN to Nearest Centroids
When k Becomes too Large
Now, let’s make k large. How large? As large as possible.
Remember, we also did this exercise with the k-NN regressor, and the conclusion was that if k equals the total number of observations in the training dataset, then the k-NN regressor is just the simple average-value estimator.
For the k-NN classifier, it is almost the same. If k equals the total number of observations, then for each class we get its overall proportion within the whole training dataset.
Some people, from a Bayesian standpoint, call these proportions the priors!
But this does not help us much to classify a new observation, because these priors are the same for every point.
The Creation of Centroids
So let us take one more step.
For each class, we can also group together all the feature values that belong to that class, and compute their average.
These averaged feature vectors are what we call centroids.
What can we do with these centroids?
We can use them to classify a new observation.
Instead of recalculating distances to the whole dataset for every new point, we simply measure the distance to each class centroid and assign the class of the closest one.
With the Titanic survival dataset, we can start with a single feature, age, and compute the centroids for the two classes: passengers who survived and passengers who did not.

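Here is a small sketch of these per-class age centroids (seaborn column names assumed), together with the rule that assigns a new observation to the class with the closest centroid:

```python
import seaborn as sns

titanic = sns.load_dataset("titanic")
age = titanic["age"]
survived = titanic["survived"]

# One centroid per class: the mean age of the non-survivors (0) and of the survivors (1).
centroids = {c: age[survived == c].mean() for c in (0, 1)}
print(centroids)

def nearest_centroid_predict(x_new):
    # Assign the class whose centroid (mean age) is closest to the new observation.
    return min(centroids, key=lambda c: abs(x_new - centroids[c]))

print(nearest_centroid_predict(30.0))
```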
Now, it is also possible to use multiple continuous features.
For instance, we can use the two features age and fare.

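With two features, scikit-learn’s NearestCentroid does the same computation; a sketch (seaborn column names assumed, with scaling so that fare does not dominate age):

```python
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestCentroid

titanic = sns.load_dataset("titanic").dropna(subset=["age", "fare"])
X = titanic[["age", "fare"]].to_numpy()
y = titanic["survived"].to_numpy()

model = make_pipeline(StandardScaler(), NearestCentroid())
model.fit(X, y)

# The whole "model" is just one centroid per class, expressed in the scaled feature space.
print(model.named_steps["nearestcentroid"].centroids_)
print(model.predict([[30.0, 40.0]]))
```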
And we can discuss some important characteristics of this model:
- The scale of the features matters, as we discussed before for the k-NN regressor.
- Missing values are not a problem here: when we compute the centroids per class, each one is calculated from the available (non-empty) values (see the sketch after this list).
- We went from the most “complex” and “large” model (in the sense that the actual model is the whole training dataset, so we have to store all of it) to the simplest model (we only use one value per feature and per class, and we only store these values as our model).
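To illustrate the last two points, here is a minimal sketch with toy values (one age deliberately missing): each centroid is computed from the available values only, and the whole stored model shrinks to one number per feature and per class.

```python
import numpy as np

# Toy data: ages with one missing value, and a 0/1 target.
ages = np.array([4.0, 18.0, np.nan, 32.0, 40.0, 58.0])
survived = np.array([1, 1, 1, 0, 0, 0])

# np.nanmean ignores the missing value: each centroid uses only the available ages.
centroids = {c: np.nanmean(ages[survived == c]) for c in (0, 1)}

# The entire stored model is just these two numbers.
print(centroids)
```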
From highly nonlinear to naively linear
But now, can you think of one major drawback?
Whereas the basic k-NN classifier is highly nonlinear, the Nearest Centroid method is extremely linear.
In this 1D example, the two centroids are simply the average x values of class 0 and class 1. The decision boundary is then just the midpoint between them.
So instead of a piecewise, jagged boundary that depends on the exact location of many training points (as in k-NN), we obtain a straight cutoff that depends on only two numbers.
This illustrates how Nearest Centroids compresses the whole dataset into a simple and very linear rule.

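In one dimension, the whole rule fits in two lines (a sketch with hypothetical centroid values): the cutoff is the midpoint between the two class centroids.

```python
# Hypothetical 1D centroids: mean age of class 0 (did not survive) and class 1 (survived).
centroid_0, centroid_1 = 30.6, 28.3

# In 1D, "closest centroid" reduces to a single cutoff: the midpoint between the two centroids.
threshold = (centroid_0 + centroid_1) / 2  # 29.45

def predict(x_new):
    # Below the midpoint -> closer to the smaller centroid (class 1 here); above -> class 0.
    return 1 if x_new < threshold else 0

print(threshold, predict(25.0), predict(40.0))
```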
A note on regression: why centroids don’t apply
Now, this kind of improvement is not possible for the k-NN regressor. Why?
In classification, each class forms a group of observations, so computing the average feature vector for each class makes sense, and this gives us the class centroids.
But in regression, the target is continuous. There are no discrete groups, no class boundaries, and therefore no meaningful way to compute “the centroid of a class”.
A continuous target has infinitely many possible values, so we cannot group observations by their value to form centroids.
The only possible “centroid” in regression would be the global mean, which corresponds to the case k = N in the k-NN regressor.
And this estimator is far too simple to be useful.
In short, the Nearest Centroids Classifier is a natural improvement for classification, but it has no direct equivalent in regression.
Further statistical improvements
What else can we do with the basic k-NN classifier?
Average and variance
With the Nearest Centroids Classifier, we used the simplest statistic, the average. A natural reflex in statistics is to add the variance as well.
So now the distance is no longer Euclidean but the Mahalanobis distance. Using this distance, we get a probability based on the distribution characterised by the mean and variance of each class.
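Here is a one-dimensional sketch of this idea (toy per-class means and standard deviations): the Mahalanobis distance reduces to the number of standard deviations from the class mean, and assuming a Gaussian shape we can turn it into a per-class likelihood, much as Gaussian Naive Bayes does.

```python
import numpy as np

# Toy per-class statistics for one feature (age): (mean, standard deviation), hypothetical values.
stats = {0: (30.6, 14.0), 1: (28.3, 15.0)}

def mahalanobis_1d(x, mean, std):
    # In one dimension, the Mahalanobis distance is the distance in units of standard deviations.
    return abs(x - mean) / std

def gaussian_likelihood(x, mean, std):
    # Gaussian density built from the class mean and variance.
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

x_new = 22.0
print({c: mahalanobis_1d(x_new, m, s) for c, (m, s) in stats.items()})
print({c: gaussian_likelihood(x_new, m, s) for c, (m, s) in stats.items()})
```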
Categorical Feature Handling
For categorical features, we cannot compute averages or variances. For the k-NN regressor, we saw that it was possible to use one-hot encoding or ordinal/label encoding, but the scale matters and is challenging to determine.
Here, we can do something equally meaningful in terms of probabilities: we can count the proportion of each category within each class.
These proportions act exactly like probabilities, describing how likely each category is within each class.
This idea is directly linked to models such as Categorical Naive Bayes, where classes are characterised by frequency distributions over the categories.
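A minimal sketch of these per-class proportions (seaborn column names assumed): for each value of survived, we compute how often each category of sex appears, which gives exactly the frequency distributions that Categorical Naive Bayes relies on.

```python
import seaborn as sns

titanic = sns.load_dataset("titanic")

# For each class of "survived", the proportion of each category of "sex".
proportions = (
    titanic.groupby("survived")["sex"]
    .value_counts(normalize=True)
    .unstack()
)
print(proportions)
```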
Weighted Distance
Another direction is to introduce weights, so that closer neighbors count more than distant ones. In scikit-learn, the “weights” argument allows us to do so.
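A sketch of this one-argument change (seaborn column names assumed):

```python
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

titanic = sns.load_dataset("titanic").dropna(subset=["age", "fare"])
X = titanic[["age", "fare"]].to_numpy()
y = titanic["survived"].to_numpy()

# weights="distance": each neighbor votes with weight 1 / distance,
# so closer neighbors count more than distant ones.
model = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=7, weights="distance"),
)
model.fit(X, y)
print(model.predict_proba([[30.0, 40.0]]))
```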
We can also switch from “k neighbors” to a fixed radius around the new observation, which leads to radius-based classifiers.
Radius Nearest Neighbors
Sometimes, we find the following graphic used to explain the k-NN classifier. But in fact, with a radius like this, it reflects more the idea of Radius Nearest Neighbors.
One advantage is the control over the neighborhood. It is particularly interesting when we know the concrete meaning of the distance, such as geographical distance.

But the disadvantage is that you have to know the radius in advance.
By the way, this notion of radius nearest neighbors is also suitable for regression.
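A sketch with scikit-learn’s RadiusNeighborsClassifier (the radius is chosen arbitrarily here, in the scaled space; outlier_label handles new points with no neighbor inside the radius):

```python
import seaborn as sns
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import RadiusNeighborsClassifier

titanic = sns.load_dataset("titanic").dropna(subset=["age", "fare"])
X = titanic[["age", "fare"]].to_numpy()
y = titanic["survived"].to_numpy()

# All neighbors within the given radius vote; outlier_label is returned
# when no training point falls inside the radius around the new observation.
model = make_pipeline(
    StandardScaler(),
    RadiusNeighborsClassifier(radius=0.5, outlier_label=0),
)
model.fit(X, y)
print(model.predict([[30.0, 40.0]]))
```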
Recap of various variants
All these small changes give different models, each one trying to improve the basic idea of comparing neighbors: a more complex definition of distance, or a control parameter that lets us choose between local neighbors and a more global characterization of the neighborhood.
We won’t explore all these models here. I simply cannot help going a bit too far when a small variation naturally leads to another idea.
For now, consider this an announcement of the models we will implement later this month.

Conclusion
In this article, we explored the k-NN classifier from its most basic form to several extensions.
The central idea does not really change: a new observation is classified by how similar it is to the training data.
But this simple idea can take many different shapes.
With continuous features, similarity is based on geometric distance.
With categorical features, we look instead at how often each category appears among the neighbors.
When k becomes very large, the whole dataset collapses into just a few summary statistics, which leads naturally to the Nearest Centroids Classifier.
Understanding this family of distance-based and probability-based ideas helps us see that many machine-learning models are simply different ways of answering the same question:
Which class does this new observation most resemble?
In the next articles, we will continue exploring density-based models, which can be understood as global measures of similarity between observations and classes.
