The Machine Learning “Advent Calendar” Day 9: LOF in Excel


Yesterday, we worked with Isolation Forest, which is an Anomaly Detection method.

Today, we look at another algorithm with the same objective. But unlike Isolation Forest, it does not construct trees.

It is called LOF, or Local Outlier Factor.

People often summarize LOF with one sentence: Does this point live in a region with a lower density than its neighbors?

This sentence is actually tricky to understand. I struggled with it for a long time.

Nevertheless, there is one part that is immediately easy to understand,
and we will see that it becomes the key point:
there is a notion of neighbors.

And as soon as we speak about neighbors,
we naturally return to distance-based models.

We are going to explain this algorithm in 3 steps.

To keep things very simple, we will use this dataset again:

1, 2, 3, 9

Do you remember that I have the copyright on this dataset? We did Isolation Forest with it, and we will do LOF with it again. And we can compare the two results.

LOF in Excel with 3 steps – all images by author

All the Excel files are available through this Ko-fi link. Your support means a lot to me. The price will increase through the month, so early supporters get the best price.

All Excel/Google sheet files for ML and DL

Step 1 – k Neighbors and k-distance

LOF begins with something very simple:

Take a look at the distances between points.
Then find the k nearest neighbors of every point.

Let us take k = 2, just to keep things minimal.

Nearest neighbors for every point

  • Point 1 → neighbors: 2 and 3
  • Point 2 → neighbors: 1 and 3
  • Point 3 → neighbors: 2 and 1
  • Point 9 → neighbors: 3 and 2

Already, we see a clear structure emerging:

  • 1, 2, and 3 form a tight cluster
  • 9 lives alone, far from the others

The k-distance: a neighborhood radius

The k-distance is simply the largest distance among the k nearest neighbors.

And this is actually the key point.

Because this single number tells you something very concrete:

If the k-distance is small, the point is in a dense area.
If the k-distance is large, the point is in a sparse area.

With just this one measure, you already have a first signal of "isolation".

Here, we use the concept of "k nearest neighbors", which of course reminds us of k-NN (the classifier or regressor).
The context here is different, but the calculation is exactly the same.

And if you are thinking of k-means, don't mix them up:
the "k" in k-means has nothing to do with the "k" here.

The k-distance calculation

For point 1, the two nearest neighbors are 2 and 3 (distances 1 and 2), so k-distance(1) = 2.

For point 2, neighbors are 1 and 3 (each at distance 1), so k-distance(2) = 1.

For point 3, the two nearest neighbors are 1 and 2 (distances 2 and 1), so k-distance(3) = 2.

For point 9, neighbors are 3 and 2 (distances 6 and 7), so k-distance(9) = 7. This is huge compared with all the others.

In Excel, we can build a pairwise distance matrix to get the k-distance for each point (a small Python equivalent follows the image below).

LOF in Excel – image by author
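If you want to check these numbers outside Excel, here is a minimal Python sketch of Step 1. It mirrors the pairwise matrix above; the variable names are mine, not part of the Excel files.

```python
# Step 1 sketch: pairwise distances and k-distance for the article's dataset.
points = [1, 2, 3, 9]
k = 2

# Pairwise distance matrix: dist[i][j] = |points[i] - points[j]|
dist = [[abs(p - q) for q in points] for p in points]

# k-distance of each point = distance to its k-th nearest neighbor
k_distance = {}
for i, p in enumerate(points):
    others = sorted(d for j, d in enumerate(dist[i]) if j != i)
    k_distance[p] = others[k - 1]

print(k_distance)  # {1: 2, 2: 1, 3: 2, 9: 7}
```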

Step 2 – Reachability Distances

For this step, I will just define the calculations here, and apply the formulas in Excel. Because, to be honest, I never succeeded in finding a truly intuitive way to explain the results.

So, what’s “reachability distance”?

For a point p and a neighbor o, we define the reachability distance as:

reach-dist(p, o) = max(k-dist(o), distance(p, o))

Why take the maximum?

The goal of the reachability distance is to stabilize the density comparison.

If the neighbor o lives in a very dense region (small k-dist), then we don't want to allow an unrealistically small distance.

In particular, for point 2:

  • Distance to 1 = 1, but k-distance(1) = 2 → reach-dist(2, 1) = 2
  • Distance to 3 = 1, but k-distance(3) = 2 → reach-dist(2, 3) = 2

Both neighbors force the reachability distance upward.

In Excel, we keep a matrix format to display the reachability distances: one point compared with all the others. A Python version of this matrix appears after the image.

LOF in Excel – image by author
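Continuing the same sketch, the reachability matrix follows directly from the definition above (again, just an illustrative translation of the Excel formulas):

```python
# Step 2 sketch: reach-dist(p, o) = max(k-dist(o), distance(p, o)),
# reusing points, dist and k_distance from the Step 1 sketch.
n = len(points)
reach_dist = [
    [max(k_distance[points[j]], dist[i][j]) if i != j else 0.0
     for j in range(n)]
    for i in range(n)
]
for p, row in zip(points, reach_dist):
    print(p, row)
# The row for point 2 is [2, 0.0, 2, 7]: both neighbors were pushed up to 2.
```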

Average reachability distance

For each point, we can now compute the average value, which tells us: on average, how far do I need to travel to reach my local neighborhood?

And now, do you notice something? Point 2 has a bigger average reachability distance than 1 and 3.

This is not that intuitive to me!
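To see where this number comes from, we can extend the sketch and average the reachability distances over each point's k neighbors (the neighbors helper is just illustrative):

```python
# Average reachability distance over the k nearest neighbors,
# continuing the previous sketch.
def neighbors(i):
    """Indices of the k nearest neighbors of point i."""
    order = sorted((d, j) for j, d in enumerate(dist[i]) if j != i)
    return [j for _, j in order[:k]]

avg_reach = {
    points[i]: sum(reach_dist[i][j] for j in neighbors(i)) / k
    for i in range(len(points))
}
print(avg_reach)  # {1: 1.5, 2: 2.0, 3: 1.5, 9: 6.5}
```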

Step 3 – LRD and the LOF Score

The final step is a kind of "normalization" that produces an anomaly score.

First, we define the LRD, or Local Reachability Density, which is simply the inverse of the average reachability distance.

And the final LOF score is calculated as:

LOF(p) = average of LRD(o) over the k nearest neighbors o of p, divided by LRD(p)

So, LOF compares the density of a point to the density of its neighbors.

Interpretation:

  • If LRD(p) ≈ LRD(neighbors), then LOF ≈ 1
  • If LRD(p) is much smaller, then LOF >> 1: p is in a sparse region
  • If LRD(p) is much larger, then LOF < 1: p is in a very dense pocket
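Completing the sketch, here are the LRD values and the final LOF scores for 1, 2, 3, 9 (an illustration, not a reference implementation):

```python
# Step 3 sketch: LRD = 1 / average reachability distance,
# LOF(p) = mean of the neighbors' LRD divided by LRD(p).
lrd = {p: 1.0 / avg_reach[p] for p in points}

lof = {}
for i, p in enumerate(points):
    nbr_lrd = [lrd[points[j]] for j in neighbors(i)]
    lof[p] = (sum(nbr_lrd) / k) / lrd[p]

for p in points:
    print(p, round(lof[p], 2))
# roughly: 1 -> 0.88, 2 -> 1.33, 3 -> 0.88, 9 -> 3.79
```

Point 9 clearly stands out, and point 2 gets a score above 1, which matches the observation we made about its average reachability distance.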

I also made a version with more detailed steps, and shorter formulas.

Understanding What “Anomaly” Means in Unsupervised Models

In unsupervised learning, there is no ground truth. And this is exactly where things can become tricky.

We don't have labels.
We don't have the "correct answer".
We only have the structure of the data.

Take this tiny sample:

1, 2, 3, 7, 8, 12
(I have the copyright on it.)

If you look at it intuitively, which one looks like an anomaly?

Personally, I would say 12.

Now let us look at the results. LOF says the outlier is 7.

(And you can notice that with k-distance, we would say it is 12.)

LOF in Excel – image by author
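As a quick cross-check outside Excel, scikit-learn's LocalOutlierFactor should give the same ranking (assuming the library is installed; its negative_outlier_factor_ attribute is minus the LOF score):

```python
# Cross-check with scikit-learn on the second dataset.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.array([[1], [2], [3], [7], [8], [12]])
detector = LocalOutlierFactor(n_neighbors=2)
detector.fit(X)

# Higher value = stronger outlier; 7 should come out on top here,
# while 12 stays much closer to 1.
print(-detector.negative_outlier_factor_)
```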

Now, we can compare Isolation Forest and LOF side by side.

On the left, with the dataset 1, 2, 3, 9, both methods agree:
9 is the clear outlier.
Isolation Forest gives it the lowest score,
and LOF gives it the highest LOF value.

If we look closer, for Isolation Forest: 1, 2 and 3 show no difference in score. And LOF gives a higher score for 2. That is what we already noticed.

With the dataset 1, 2, 3, 7, 8, 12, the story changes.

  • Isolation Forest points to 12 as the most isolated point.
    This matches the intuition: 12 is far from everyone.
  • LOF, however, highlights 7 instead.

LOF in Excel – image by author

So who is correct?

It’s difficult to say.

In practice, we first have to agree with business teams on what "anomaly" actually means in the context of our data.

Because in unsupervised learning, there is no single truth.

There is only the definition of "anomaly" that each algorithm uses.

This is why it is extremely important to understand
how the algorithm works, and what kind of anomalies it is designed to detect.

Only then can you decide whether LOF, or k-distance, or Isolation Forest is the right choice for your specific situation.

And that is the whole message of unsupervised learning:

understanding how the algorithm works
is more important than the final score it produces.

Conclusion

LOF and Isolation Forest both detect anomalies, but they look at the data through completely different lenses.

  • k-distance captures how far a point must travel to find its neighbors.
  • LOF compares local densities.
  • Isolation Forest isolates points using random splits.

And even on very simple datasets, these methods can disagree.
One algorithm may flag a point as an outlier, while another highlights a completely different one.

And that is the key message:

In unsupervised learning, there is no "true" outlier.
Each algorithm defines anomalies according to its own logic.

This is why understanding how a method works is more important than the number it produces.
Only then can you choose the right algorithm for the right situation, and interpret the results with confidence.

