The Machine Learning “Advent Calendar” Day 10: DBSCAN in Excel


This is Day 10 of my Machine Learning “Advent Calendar”. I would really like to thank you for your support.

I have been building these Google Sheets files for years. They evolved little by little. But when it is time to publish them, I always need hours to reorganize everything, clean the layout, and make them nice to read.

Today, we move to DBSCAN.

DBSCAN Does Not Learn a Parametric Model

Like LOF, DBSCAN is not a parametric model. There is no formula to store, no rules, no centroids, and nothing compact to reuse later.

We must keep the whole dataset since the density structure is determined by all points.

Its full name is Density-Based Spatial Clustering of Applications with Noise.

But careful: this “density” is not a Gaussian density.

It is an intuitive notion of density. Just “how many neighbors live near me”.

Why DBSCAN Is Special

As its name indicates, DBSCAN does two things at the same time:

  • it finds clusters
  • it marks anomalies (the points that don’t belong to any cluster)

This is exactly why I present the algorithms in this order:

  • k-means and GMM are clustering models. They output a compact object: centroids for k-means, means and variances for GMM.
  • Isolation Forest and LOF are pure anomaly detection models. Their only goal is to find unusual points.
  • DBSCAN sits in between. It does both clustering and anomaly detection, based only on the notion of neighborhood density.

A Tiny Dataset to Keep Things Intuitive

We stick with the same tiny dataset that we used for LOF: 1, 2, 3, 7, 8, 12

If you look at these numbers, you already see two compact groups:
one around 1–2–3, another around 7–8, and 12 living alone.

DBSCAN captures exactly this intuition.

Summary in 3 Steps

DBSCAN asks three simple questions for each point:

  1. How many neighbors do you have within a small radius (eps)?
  2. Do you have enough neighbors to become a Core point (minPts)?
  3. Once we know the Core points, which connected group do you belong to?

Here is the summary of the DBSCAN algorithm in 3 steps:

DBSCAN in Excel – all images by author

Let us go through it step by step.

DBSCAN in 3 steps

Now that we understand the concept of density and neighborhoods, DBSCAN becomes very easy to explain.
Everything the algorithm does fits into three simple steps.

Step 1 – Count the neighbors

The goal is to count how many neighbors each point has.

We take a small radius called eps.

For each point, we look at all other points and mark those whose distance is less than eps.
These are the neighbors.

This gives us the first idea of density:
a point with many neighbors is in a dense region,
a point with few neighbors lives in a sparse region.

For a 1-dimensional toy example like ours, a typical choice is:
eps = 2

We draw a small interval of radius 2 around each point.

Why is it called eps?

The name eps comes from the Greek letter ε (epsilon), which is traditionally used in mathematics to represent a small quantity or a small radius around a point.
So in DBSCAN, eps is literally “the small neighborhood radius”.

It answers the question:
how far do we look around each point?

So in Excel, the first step is to compute the pairwise distance matrix, then count how many neighbors each point has within eps.
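If you prefer to check the numbers outside the spreadsheet, the same Step 1 logic can be sketched in a few lines of Python. This is only an illustration alongside the Excel version; it assumes a strict distance < eps and a neighborhood that includes the point itself (the standard DBSCAN convention).

```python
# Toy 1-D dataset from the article
points = [1, 2, 3, 7, 8, 12]
eps = 2  # the small neighborhood radius

# Pairwise distance matrix: exactly what the Excel sheet computes first
dist = [[abs(p - q) for q in points] for p in points]

# Neighbor count per point: how many points lie strictly within eps
# (each point counts itself, since its distance to itself is 0)
counts = [sum(1 for d in row if d < eps) for row in dist]

print(dict(zip(points, counts)))
# {1: 2, 2: 3, 3: 2, 7: 2, 8: 2, 12: 1}
```

You can already read the density structure in these counts: 12 has no neighbor other than itself, while 1, 2, 3 and 7, 8 all live in denser regions.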

Step 2 – Core Points and Density Connectivity

Now that we know the neighbors from Step 1, we apply minPts to decide which points are Core.

minPts stands for minimum number of points.

It is the smallest number of neighbors a point must have (inside the eps radius) to be considered a Core point.

A point is Core if it has at least minPts neighbors within eps.
Otherwise, it becomes Border or Noise.

With eps = 2 and minPts = 2, 12 is the only point that is not Core.

Once the Core points are known, we simply check which points are reachable from them. If a point can be reached by moving from one Core point to another within eps, it belongs to the same group.

In Excel, we can represent this as a simple connectivity table that shows which points are linked through Core neighbors.

This connectivity is what DBSCAN uses to form clusters in Step 3.
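The Core test itself is a one-liner on top of the Step 1 counts. Here is a small Python sketch, under the same assumptions as before (strict distance < eps, neighborhood including the point itself):

```python
points = [1, 2, 3, 7, 8, 12]
eps, min_pts = 2, 2

# A point's eps-neighborhood: all points strictly closer than eps,
# including the point itself
def neighborhood(p):
    return [q for q in points if abs(p - q) < eps]

# Core test: the neighborhood must contain at least min_pts points
core_points = [p for p in points if len(neighborhood(p)) >= min_pts]

print(core_points)  # every point except 12
```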

Step 3 – Assign cluster labels

The goal is to show connectivity into actual clusters.

Once the connectivity matrix is prepared, the clusters appear naturally.
DBSCAN simply groups all connected points together.

To give each group a simple and reproducible name, we use a very intuitive rule:

The cluster label is the smallest point in the connected group.

For instance:

  • Group {1, 2, 3} becomes cluster 1
  • Group {7, 8} becomes cluster 7
  • A point like 12 with no Core neighbors becomes Noise

This is exactly what we will display in Excel using formulas.
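For readers who prefer code to spreadsheet formulas, all three steps can be put together in one compact Python sketch. It uses the same eps, minPts, and labeling rule as the article, again assuming strict distance < eps and self-inclusive neighborhoods:

```python
points = [1, 2, 3, 7, 8, 12]
eps, min_pts = 2, 2
n = len(points)

# Step 1: eps-neighborhood of point i (indices, self included)
def neighbors(i):
    return [j for j in range(n) if abs(points[i] - points[j]) < eps]

# Step 2: Core test
core = [len(neighbors(i)) >= min_pts for i in range(n)]

# Step 3: grow each cluster from an unlabeled Core point;
# only Core points are allowed to expand the cluster
labels = {}
for i in range(n):
    if not core[i] or i in labels:
        continue
    stack, members = [i], set()
    while stack:
        j = stack.pop()
        if j in members:
            continue
        members.add(j)
        if core[j]:
            stack.extend(neighbors(j))
    # the smallest point in the group names the cluster
    label = min(points[j] for j in members)
    for j in members:
        labels[j] = label

result = {p: labels.get(i, "Noise") for i, p in enumerate(points)}
print(result)  # {1: 1, 2: 1, 3: 1, 7: 7, 8: 7, 12: 'Noise'}
```

Note that 1 and 3 end up in the same cluster even though their distance is exactly 2: they are connected through the Core point 2, which is the density-connectivity idea at work.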

Final thoughts

DBSCAN is perfect for teaching the concept of local density.

There is no probability, no Gaussian formula, no estimation step.
Just distances, neighbors, and a small radius.

But this simplicity also limits it.
Because DBSCAN uses one fixed radius for every point, it cannot adapt when the dataset contains clusters of different scales.

HDBSCAN keeps the same intuition, but looks at a whole range of radii and keeps what stays stable.
It is much more robust, and much closer to how humans naturally see clusters.


What are your thoughts on this topic?
Let us know in the comments below.
