DBSCAN, Explained in 5 Minutes


A fast implementation in Python 🐍

Image by creator.

What’s DBSCAN [1]? How do you build it in Python? There are many articles covering this topic, but I think the algorithm itself is so simple and intuitive that it’s possible to explain its idea in just 5 minutes, so let’s try to do that.

DBSCAN = Density-Based Spatial Clustering of Applications with Noise

What does it mean?

  1. The algorithm searches for clusters in the data based on the spatial distance between objects.
  2. The algorithm can identify outliers (noise).

Why would you need DBSCAN at all?

  • Extract a new feature. If the dataset you’re dealing with is large, it might be helpful to find obvious clusters in the data and work with each cluster separately (train different models for different clusters).
  • Compress the data. We often have to deal with millions of rows, which is computationally expensive and time consuming. Clustering the data and then keeping only X% of each cluster might save your wicked data science soul. That way you preserve the balance inside the dataset, but reduce its size.
  • Novelty detection. As mentioned, DBSCAN detects noise, but that noise might be a previously unknown feature of the dataset, which you can preserve and use in modeling.
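As a quick sketch of the compression idea (assuming cluster labels are already available, e.g. from DBSCAN; `downsample_by_cluster` is a hypothetical helper, not code from this article), keeping a fixed fraction of each cluster shrinks the dataset while preserving its balance:

```python
import numpy as np

def downsample_by_cluster(data, labels, fraction=0.1, seed=0):
    """Keep `fraction` of the rows of each cluster (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    kept_rows = []
    for label in np.unique(labels):
        idx = np.where(labels == label)[0]       # rows belonging to this cluster
        n_keep = max(1, int(len(idx) * fraction))  # keep at least one row
        kept_rows.append(rng.choice(idx, size=n_keep, replace=False))
    return data[np.concatenate(kept_rows)]

data = np.random.rand(1000, 2)
labels = np.repeat([0, 1], 500)  # two equal clusters
small = downsample_by_cluster(data, labels, fraction=0.1)
print(small.shape)  # (100, 2)
```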

Then you might say: but there is the super-reliable and effective k-means algorithm.

Yes, but the sweetest part about DBSCAN is that it overcomes one of the main drawbacks of k-means: you don’t have to specify the number of clusters. DBSCAN detects clusters for you!

DBSCAN has two parameters defined by the user: the vicinity, or radius (𝜀), and the number of neighbors (N).

For a dataset consisting of some objects, the algorithm relies on the following concepts:

  1. Core objects. An object is called a core object if within distance 𝜀 it has at least N other objects.
  2. A non-core object lying within the 𝜀-vicinity of a core object is called a border object.
  3. A core object forms a cluster with all the core and border objects within its 𝜀-vicinity.
  4. If an object is neither core nor border, it’s called noise (an outlier). It doesn’t belong to any cluster.
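These three definitions can be sketched directly in code (a toy illustration of the concepts, not the full algorithm; `classify_point` is a helper I made up for this example):

```python
import numpy as np

def classify_point(i, data, eps=1.0, N=2):
    """Label point i as 'core', 'border' or 'noise' (illustrative helper)."""
    dist = np.linalg.norm(data - data[i], axis=1)
    neighbors = np.where((dist <= eps) & (dist > 0))[0]
    if len(neighbors) >= N:
        return 'core'
    # border: not core itself, but within eps of some core point
    for j in neighbors:
        d_j = np.linalg.norm(data - data[j], axis=1)
        if len(np.where((d_j <= eps) & (d_j > 0))[0]) >= N:
            return 'border'
    return 'noise'

points = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [1.4, 0.0], [5.0, 5.0]])
print([classify_point(i, points) for i in range(len(points))])
# ['core', 'core', 'core', 'border', 'noise']
```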

To implement DBSCAN we need a distance function. In this article we’ll use the Euclidean distance:

Image by creator.

The pseudo-code for our algorithm looks like this:

Image by [2].

As always, you can find the code for this article on my GitHub.

Let’s begin with the distance function:

import numpy as np

def distances(obj, data):
    euclidean = []
    for row in data: # iterate over all the objects in the dataset
        d = 0
        for i in range(data.shape[1]): # sum of squared residuals over all the coordinates
            d += (row[i] - obj[i])**2
        euclidean.append(d**0.5) # take the square root
    return np.array(euclidean)
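The loop above works, but for larger datasets a vectorized NumPy version (an equivalent alternative I’m adding for illustration, not the article’s code) computes the same distances via broadcasting and is much faster:

```python
import numpy as np

def distances_vectorized(obj, data):
    # same Euclidean distances, computed with broadcasting instead of loops
    return np.sqrt(((data - obj) ** 2).sum(axis=1))

data = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(distances_vectorized(data[0], data))  # [ 0.  5. 10.]
```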

Now let’s build the body of the algorithm:

def DBSCAN(data, epsilon=0.5, N=3):
    visited, noise = [], [] # lists to collect visited points and outliers
    clusters = [] # list to collect clusters
    for i in range(data.shape[0]): # iterate over all the points
        if i not in visited: # proceed only if the point hasn't been visited yet
            visited.append(i)
            d = distances(data[i], data) # distances to all the other points
            neighbors = list(np.where((d <= epsilon) & (d != 0))[0]) # neighbors within the epsilon vicinity, excluding distance = 0 (the point itself)
            if len(neighbors) < N: # fewer than N neighbors: the point is noise
                noise.append(i)
            else:
                cluster = [i] # otherwise it starts a new cluster
                for neighbor in neighbors: # iterate over all the neighbors of point i
                    if neighbor not in visited: # if the neighbor hasn't been visited yet
                        visited.append(neighbor)
                        d = distances(data[neighbor], data) # distances from the neighbor to the other objects
                        neighbors_idx = list(np.where((d <= epsilon) & (d != 0))[0]) # neighbors of the neighbor
                        if len(neighbors_idx) >= N: # if the neighbor has N or more neighbors, it's a core point
                            neighbors += neighbors_idx # so its neighbors join the neighborhood of point i
                    if neighbor not in cluster and not any(neighbor in c for c in clusters):
                        cluster.append(neighbor) # add the neighbor to the current cluster if it isn't assigned anywhere yet
                clusters.append(cluster) # store the finished cluster

    return clusters, noise
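A quick sanity check before moving on: the snippet below repeats the two functions (with a compact, equivalent vectorized distance) so it runs on its own, then clusters two tight blobs of four points plus one far-away outlier:

```python
import numpy as np

def distances(obj, data):
    # Euclidean distance from obj to every row of data (equivalent to the loop version)
    return np.sqrt(((data - obj) ** 2).sum(axis=1))

def DBSCAN(data, epsilon=0.5, N=3):
    visited, noise, clusters = [], [], []
    for i in range(data.shape[0]):
        if i not in visited:
            visited.append(i)
            d = distances(data[i], data)
            neighbors = list(np.where((d <= epsilon) & (d != 0))[0])
            if len(neighbors) < N:
                noise.append(i)
            else:
                cluster = [i]
                for neighbor in neighbors:
                    if neighbor not in visited:
                        visited.append(neighbor)
                        d = distances(data[neighbor], data)
                        neighbors_idx = list(np.where((d <= epsilon) & (d != 0))[0])
                        if len(neighbors_idx) >= N:
                            neighbors += neighbors_idx
                    if neighbor not in cluster and not any(neighbor in c for c in clusters):
                        cluster.append(neighbor)
                clusters.append(cluster)
    return clusters, noise

# two dense blobs and one far-away point
data = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
                 [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
                 [20.0, 20.0]])
clusters, noise = DBSCAN(data, epsilon=0.5, N=3)
print(len(clusters), noise)  # 2 [8]
```

Two clusters of four points each, and index 8 flagged as noise, exactly as the definitions predict.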

Done!

Let’s check the correctness of our implementation and compare it against sklearn.

Let’s generate some synthetic data:

import numpy as np
import matplotlib.pyplot as plt

X1 = [[x,y] for x, y in zip(np.random.normal(6,1, 2000), np.random.normal(0,0.5, 2000))]
X2 = [[x,y] for x, y in zip(np.random.normal(10,2, 2000), np.random.normal(6,1, 2000))]
X3 = [[x,y] for x, y in zip(np.random.normal(-2,1, 2000), np.random.normal(4,2.5, 2000))]

fig, ax = plt.subplots()
ax.scatter([x[0] for x in X1], [y[1] for y in X1], s=40, c='#00b8ff', edgecolors='#133e7c', linewidth=0.5, alpha=0.8)
ax.scatter([x[0] for x in X2], [y[1] for y in X2], s=40, c='#00ff9f', edgecolors='#0abdc6', linewidth=0.5, alpha=0.8)
ax.scatter([x[0] for x in X3], [y[1] for y in X3], s=40, c='#d600ff', edgecolors='#ea00d9', linewidth=0.5, alpha=0.8)
ax.spines[['right', 'top', 'bottom', 'left']].set_visible(False)
ax.set_xticks([])
ax.set_yticks([])
ax.set_facecolor('black')
ax.patch.set_alpha(0.7)

Image by creator.

Let’s apply our implementation and visualize the results:

Image by creator.

With the sklearn implementation we get the same clusters:

Image by creator.
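For reference, the sklearn call behind such a comparison might look roughly like this (a sketch with toy data, not the exact code above; sklearn’s `eps` and `min_samples` play the roles of our 𝜀 and N, and noise points get the label -1):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two well-separated blobs plus one far-away outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               [[20.0, 20.0]]])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
print(np.unique(labels))  # noise points are labelled -1
```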

That’s it, the results are identical. 5 minutes and we’re done! When you try DBSCANning yourself, don’t forget to tune epsilon and the number of neighbors, since they highly influence the final results.

===========================================

Reference:

[1] Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96).

[2] Yang, Yang, et al. “An efficient DBSCAN optimized by arithmetic optimization algorithm with opposition-based learning.” The Journal of Supercomputing 78.18 (2022): 19566–19604.

===========================================

All my publications on Medium are free and open-access, that’s why I’d really appreciate it if you followed me here!

P.s. I’m extremely passionate about (Geo)Data Science, ML/AI and Climate Change. So if you want to work together on some project, please contact me on LinkedIn.

🛰️Follow for more🛰️
