K Nearest Neighbor Classifier, Explained: A Visual Guide with Code Examples for Beginners


The friendly neighbor approach to machine learning

All illustrations in this article were created by the author, incorporating licensed design elements from Canva Pro.

Imagine a method that makes predictions by looking at the most similar examples it has seen before. This is the essence of the Nearest Neighbor Classifier: a simple yet intuitive algorithm that brings a touch of real-world logic to machine learning.

While the dummy classifier sets the bare minimum performance standard, the Nearest Neighbor approach mimics how we often make decisions in daily life: by recalling similar past experiences. It's like asking your neighbors how they dressed for today's weather to decide what you should wear. In the realm of data science, this classifier examines the closest data points to make its predictions.

A K Nearest Neighbor classifier is a machine learning model that makes predictions based on the majority class of the K nearest data points in the feature space. The KNN algorithm assumes that similar things exist in close proximity, making it intuitive and easy to understand.

Nearest Neighbor methods are among the simplest algorithms in machine learning.

Throughout this article, we'll use this simple artificial golf dataset (inspired by [1]) as an example. The dataset predicts whether a person will play golf based on weather conditions. It includes features like outlook, temperature, humidity, and wind, with the target variable being whether or not to play golf.

Columns: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’ and ‘Play’ (target feature)
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Make the dataset
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
original_df = pd.DataFrame(dataset_dict)

print(original_df)

The KNN algorithm requires the data to be scaled first. Convert categorical columns into 0 & 1 and also scale the numerical features so that no single feature dominates the distance metric.

The categorical columns are converted to numbers: Outlook is one-hot encoded and Wind is mapped to 0/1, while the numerical columns are scaled using standard scaling (z-normalization). The scaler is fit on the training set and then applied to the test set.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocess data
df = pd.get_dummies(original_df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny','rainy','overcast','Temperature','Humidity','Wind','Play']]

# Split data and standardize features
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Print results
print(pd.concat([X_train, y_train], axis=1).round(2), '\n')
print(pd.concat([X_test, y_test], axis=1).round(2), '\n')

The KNN classifier operates by finding the K nearest neighbors to a new data point and then voting on the most common class among those neighbors. Here's how it works:

  1. Calculate the distance between the new data point and all points in the training set.
  2. Select the K nearest neighbors based on these distances.
  3. Take a majority vote of the classes of these K neighbors.
  4. Assign the majority class to the new data point.
For our golf dataset, a KNN classifier might look at the 5 most similar weather conditions in the past to predict whether someone will play golf today. A minimal from-scratch sketch of these four steps is shown below.
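
To make these four steps concrete, here is a minimal from-scratch sketch (knn_predict is our own illustrative helper, not part of the pipeline below; scikit-learn's KNeighborsClassifier, used later, implements the same idea more efficiently):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # 1. Calculate the distance from the new point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # 2. Select the k nearest neighbors
    nearest_idx = np.argsort(distances)[:k]
    # 3. Take a majority vote of their classes
    votes = Counter(y_train[nearest_idx])
    # 4. Assign the majority class to the new point
    return votes.most_common(1)[0][0]

# Example usage (after the preprocessing shown above):
# knn_predict(X_train.values, y_train.values, X_test.values[0], k=5)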

Unlike many other algorithms, KNN doesn't have a distinct training phase. Instead, it memorizes the entire training dataset. Here's the process:

  1. Choose a value for K (the number of neighbors to consider).
In a 2D setting, it's like taking a majority vote among the nearest colored points.
from sklearn.neighbors import KNeighborsClassifier

# Choose the Number of Neighbors ('k')
k = 5

2. Choose a distance metric (e.g., Euclidean distance, Manhattan distance).

The most common distance metric is Euclidean distance. This is just like finding the straight-line distance between two points in the real world.
import numpy as np

# Select a Distance Metric
distance_metric = 'euclidean'

# Calculate the distance between ID 0 and ID 1
print(np.linalg.norm(X_train.loc[0].values - X_train.loc[1].values))

3. Store/memorize all the training data points and their corresponding labels.

# Initialize the k-NN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=k, metric=distance_metric)

# "Train" the kNN (although no real training happens)
knn_clf.fit(X_train, y_train)

Once the Nearest Neighbor Classifier has been “trained” (i.e., the training data has been stored), here's how it makes predictions for new instances:

  1. Distance Calculation: For the new instance, calculate its distance from all stored training instances using the chosen distance metric.
For ID 14, we calculate the distance to every member of the training set (ID 0 to ID 13).
from scipy.spatial import distance

# Compute the distances from the first row of X_test to all rows in X_train
distances = distance.cdist(X_test.iloc[0:1], X_train, metric='euclidean')

# Create a DataFrame to display the distances
distance_df = pd.DataFrame({
'Train_ID': X_train.index,
'Distance': distances[0].round(2),
'Label': y_train
}).set_index('Train_ID')

print(distance_df.sort_values(by='Distance'))

2. Neighbor Selection and Prediction: Identify the K nearest neighbors based on the calculated distances, then assign the most common class among these neighbors as the predicted class for the new instance.

After calculating its distance to all stored data points and sorting from lowest to highest, we identify the 5 nearest neighbors (top 5). If the majority (3 or more) of these neighbors are labeled "No", we predict "No" for ID 14.
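
To make the vote explicit, here is a short sketch that reproduces this step manually from the distance_df computed above (the scikit-learn call below does the same thing internally):
# Take the 5 nearest neighbors from the sorted distance table
nearest_5 = distance_df.sort_values(by='Distance').head(5)

# Majority vote among their labels (1 = 'Yes', 0 = 'No' after preprocessing)
print("Manual prediction for ID 14:", nearest_5['Label'].mode()[0])
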
# Use the k-NN Classifier to make predictions
y_pred = knn_clf.predict(X_test)
print("Label :",list(y_test))
print("Prediction:",list(y_pred))
With this simple model, we manage to get decent accuracy, much better than guessing randomly!
from sklearn.metrics import accuracy_score

# Evaluation Phase
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(accuracy, 4) * 100}%')

While KNN is conceptually simple, it does have a few important parameters:

  1. K: The number of neighbors to consider. A smaller K can lead to noise-sensitive results, while a larger K may smooth out the decision boundary.
The higher the value of k, the more likely it is to select the overall majority class ("Yes").
labels, predictions, accuracies = list(y_test), [], []

k_list = [3, 5, 7]
for k in k_list:
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    y_pred = knn_clf.predict(X_test)
    predictions.append(list(y_pred))
    accuracies.append(round(accuracy_score(y_test, y_pred), 4) * 100)

df_predictions = pd.DataFrame({'Label': labels})
for k, pred in zip(k_list, predictions):
    df_predictions[f'k = {k}'] = pred

df_accuracies = pd.DataFrame({'Accuracy': accuracies}, index=[f'k = {k}' for k in k_list]).T

print(df_predictions)
print(df_accuracies)

2. Distance Metric: This determines how similarity between points is calculated (a short comparison sketch follows this list). Common options include:

  • Euclidean distance (straight-line distance)
  • Manhattan distance (sum of absolute differences)
  • Minkowski distance (a generalization of Euclidean and Manhattan distances)
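
To see how these metrics differ in practice, here is a small sketch using scipy.spatial.distance on the two training rows compared earlier; it also shows how Minkowski reduces to the other two:
from scipy.spatial import distance

# Distance between training rows ID 0 and ID 1 (already scaled above)
a, b = X_train.loc[0].values, X_train.loc[1].values

print("Euclidean:", round(distance.euclidean(a, b), 2))
print("Manhattan:", round(distance.cityblock(a, b), 2))
print("Minkowski p=2 (equals Euclidean):", round(distance.minkowski(a, b, p=2), 2))
print("Minkowski p=1 (equals Manhattan):", round(distance.minkowski(a, b, p=1), 2))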

3. Weight Function: This decides how to weight the contribution of each neighbor (a short sketch follows this list). Options include:

  • ‘uniform’: All neighbors are weighted equally.
  • ‘distance’: Closer neighbors have a greater influence than those farther away.
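
Here is a short sketch of how this parameter is passed to scikit-learn, re-using the split from above (the exact accuracies will depend on the data and split):
# Compare uniform vs. distance weighting on the same split
for weights in ['uniform', 'distance']:
    clf = KNeighborsClassifier(n_neighbors=5, weights=weights)
    clf.fit(X_train, y_train)
    print(f"weights='{weights}':", round(accuracy_score(y_test, clf.predict(X_test)), 4))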

Like any algorithm in machine learning, KNN has its strengths and limitations.

Pros:

  1. Simplicity: Easy to understand and implement.
  2. No Assumptions: Doesn't assume anything about the data distribution.
  3. Versatility: Can be used for both classification and regression tasks.
  4. No Training Phase: Can quickly incorporate new data without retraining.

Cons:

  1. Computationally Expensive: Must compute distances to all training samples for each prediction.
  2. Memory Intensive: Requires storing all training data.
  3. Sensitive to Irrelevant Features: Can be thrown off by features that aren't relevant to the classification.
  4. Curse of Dimensionality: Performance degrades in high-dimensional spaces.

The K-Nearest Neighbors (KNN) classifier stands out as a fundamental algorithm in machine learning, offering an intuitive and effective approach to classification tasks. Its simplicity makes it an ideal starting point for beginners, while its versatility keeps it valuable for experienced data scientists. KNN's power lies in its ability to make predictions based on the proximity of data points, without requiring a complex training process.

However, it's important to remember that KNN is just one tool in the vast machine learning toolkit. As you progress in your data science journey, use KNN as a stepping stone to understanding more complex algorithms, always considering your specific data characteristics and problem requirements when selecting a model. By mastering KNN, you'll gain useful insight into classification techniques and set a strong foundation for tackling more advanced machine learning challenges.

# Import libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
dataset_dict = {
'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Preprocess data
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data
X, y = df.drop(columns='Play'), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Standardize features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train model
knn_clf = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Further Reading

For a detailed explanation of the KNeighborsClassifier and its implementation in scikit-learn, readers can refer to the official documentation [2], which provides comprehensive information on its usage and parameters.

Technical Environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly between versions.

About the Illustrations

Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.

References

[1] T. M. Mitchell, Machine Learning (1997), McGraw-Hill Science/Engineering/Math, p. 59

[2] F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
