
Basic Understanding of K-Means Clustering

K-Means clustering is an unsupervised machine learning algorithm used to group similar data points in a dataset. It is a partitioning algorithm: it divides the data into non-overlapping clusters, where each data point belongs to exactly one cluster. K-Means aims to minimize the sum of squared distances between each data point and its assigned centroid.
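The objective described above (often called inertia) is easy to compute directly with NumPy. In this sketch, `X` is the data matrix, `centroids` holds the cluster centers, and `labels` holds each point's cluster index; these names are illustrative, not part of the original post:

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared distances between each point and its assigned centroid."""
    diffs = X - centroids[labels]  # (n_samples, n_features) differences
    return np.sum(diffs ** 2)

# Tiny example: two points sitting exactly on their centroids give inertia 0
X = np.array([[0.0, 0.0], [2.0, 2.0]])
centroids = np.array([[0.0, 0.0], [2.0, 2.0]])
labels = np.array([0, 1])
print(inertia(X, centroids, labels))  # 0.0
```

K-Means searches for the centroids and assignments that make this quantity as small as possible.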

Theory — How does it work?

Step 1. First, we decide the value of K, the number of clusters we want to create. K can be chosen arbitrarily or determined with a method such as the Elbow method or Silhouette analysis.

Step 2. Next, we randomly select K points from the dataset to serve as the initial centroids of the clusters.

Step 3. We then calculate the Euclidean distance between each data point and the centroids and assign each point to the nearest centroid, forming K clusters.

Step 4. After assigning all data points to their nearest centroid, we update each centroid's location by computing the mean of all the data points assigned to it.

Step 5. We repeat steps 3 and 4 until the algorithm converges, which means the centroids no longer move or the improvement in the sum of squared distances between data points and their assigned centroids becomes negligible.
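Step 1 mentions the Elbow method. One common way to apply it is to run K-Means for a range of K values, record the resulting inertia, and look for the "elbow" where the curve flattens out. A sketch using scikit-learn's `KMeans` (an assumption; the hand-written loop later in this post could be reused instead):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(100, 2) * 2

# Run K-Means for K = 1..10 and record the inertia (sum of squared distances)
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()
```

Inertia always decreases as K grows, so we pick the K after which further increases buy little improvement.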

How does the code work?

Import the necessary libraries:

import numpy as np
import matplotlib.pyplot as plt

Generate or load your data into a NumPy array:

X = np.random.rand(100, 2) * 2
plt.scatter(X[:, 0], X[:, 1])
plt.show()

Select the number of clusters, K, and initialize the centroids by picking K random data points:

K = 7
centroids = X[np.random.choice(len(X), K, replace=False)]
plt.scatter(X[:, 0], X[:, 1])
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, linewidths=3, color='r')
plt.show()
Run the assignment/update loop until the centroids stop moving:

while True:
    # Assign each data point to the nearest centroid
    distances = np.sqrt(((X - centroids[:, np.newaxis]) ** 2).sum(axis=2))
    labels = np.argmin(distances, axis=0)

    # Calculate new centroids as the mean of the points in each cluster
    new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])

    # Check for convergence: the centroids no longer move
    if np.all(centroids == new_centroids):
        break

    # Update centroids
    centroids = new_centroids

plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=200, linewidths=3, color='r')
plt.show()

Here we can see that the data have been clustered successfully.
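The loop above can be packaged into a reusable function with an iteration cap, a safeguard the inline version omits. This is a sketch under the same assumptions as the post's code (in particular, it assumes no cluster ever becomes empty, which the simple mean update does not handle):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Basic K-Means: returns (centroids, labels). Assumes no cluster empties out."""
    rng = np.random.default_rng(seed)
    # Initialize centroids as K distinct random data points
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(max_iters):
        # Assign each point to its nearest centroid
        distances = np.sqrt(((X - centroids[:, np.newaxis]) ** 2).sum(axis=2))
        labels = np.argmin(distances, axis=0)
        # Recompute centroids as cluster means
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(K)])
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return centroids, labels

centroids, labels = kmeans(np.random.rand(200, 2), K=3)
```

The `max_iters` cap guarantees termination even on data where exact convergence is slow.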

Limitations of the K-Means algorithm

  1. K-Means clustering is sensitive to the initial centroid selection. The algorithm may converge to a suboptimal solution if the initial centroids are not chosen well.
  2. K-Means clustering is sensitive to outliers.
  3. K-Means clustering is sensitive to the value of K. If K is not chosen appropriately, it can yield suboptimal clusters.
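The first limitation is commonly mitigated by smarter seeding (k-means++) combined with multiple restarts, keeping the run with the lowest inertia. scikit-learn's `KMeans` supports both; a sketch (scikit-learn is an assumption, not used elsewhere in this post):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2) * 2

# k-means++ seeding plus 10 restarts; the run with the lowest inertia wins
km = KMeans(n_clusters=7, init='k-means++', n_init=10, random_state=0).fit(X)
print(km.inertia_)                # final sum of squared distances
print(km.cluster_centers_.shape)  # (7, 2)
```

k-means++ spreads the initial centroids apart probabilistically, which makes a bad random start much less likely than plain random initialization.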

This is a very basic explanation of K-Means clustering; production implementations are considerably more detailed. This post is meant to help readers get started with K-Means in a simple way.
