The Machine Learning “Advent Calendar” Day 3: GNB, LDA and QDA in Excel


After working with k-NN (the k-NN regressor and the k-NN classifier), we know that the k-NN approach can be very naive. It keeps the entire training dataset in memory, relies on raw distances, and does not learn any structure from the data.

We already began to improve the k-NN classifier, and in today's article, we are going to implement these models:

  • GNB: Gaussian Naive Bayes
  • LDA: Linear Discriminant Analysis
  • QDA: Quadratic Discriminant Analysis

For all these models, the distribution is assumed to be Gaussian. At the end, we will also see an approach to get a more customized distribution.

If you read my previous article, here are some questions for you:

  • What’s the connection between LDA and QDA?
  • What’s the relation between GNB and QDA?
  • What happens if the data is not Gaussian at all?
  • What is the method to get a customized distribution?
  • What’s linear in LDA? What’s quadratic in QDA?

While reading through the article, you can use this Excel/Google Sheet.

GNB, LDA and QDA in Excel – image by author

Nearest Centroids: What This Model Really Is

Let's do a quick recap of what we already started yesterday.

We introduced a simple idea: when we compute the average of each continuous feature within a class, that class collapses into one single representative point.

This gives us the Nearest Centroids model.

Each class is summarized by its centroid, the average of all its feature values.

Now, let us think about this from a Machine Learning perspective.
We usually separate the process into two parts: the training step and the inference step.

For Nearest Centroids, we can draw a small “model card” to understand what this model really is (a small code sketch follows the card):

  • How is the model trained? By computing one average vector per class. Nothing more.
  • Can it handle missing values? Yes. A centroid can be computed using all available (non-empty) values.
  • Does feature scaling matter? Yes, absolutely, because the distance to a centroid depends on the units of each feature.
  • What are the hyperparameters? None.
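
If you prefer code to spreadsheet formulas, here is a minimal sketch of Nearest Centroids in Python (NumPy only). The data and the function names are made up for illustration, they are not taken from the Excel file.

```python
import numpy as np

def fit_nearest_centroids(X, y):
    """Training step: one average vector per class, nothing more."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict_nearest_centroids(X_new, centroids):
    """Inference step: assign each row to the class with the closest centroid."""
    classes = list(centroids)
    dists = np.stack([np.linalg.norm(X_new - centroids[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

# Tiny made-up dataset: two features, two classes
X = np.array([[1.0, 2.0], [1.2, 1.8], [4.0, 4.2], [3.8, 4.0]])
y = np.array(["A", "A", "B", "B"])
centroids = fit_nearest_centroids(X, y)
print(predict_nearest_centroids(np.array([[1.1, 2.1], [4.1, 4.1]]), centroids))  # ['A' 'B']
```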

We said that the k-NN classifier is not a real machine learning model, since it does not actually learn a model.

For Nearest Centroids, we can say that it is not really a machine learning model either, since it cannot be tuned. So what about overfitting and underfitting?

Well, the model is so simple that it cannot memorize noise the way k-NN does.

So, Nearest Centroids will only tend to underfit when classes are complex or not well separated, because one single centroid cannot capture their full structure.

Understanding Class Shape with One Feature: Adding Variance

Now, in this section, we will use only one continuous feature and two classes.

So far, we used only one statistic per class: the average value.
Let us now add a second piece of information: the variance (or equivalently, the standard deviation).

This tells us how “spread out” each class is around its average.

A natural question appears immediately: which variance should we use?

The most intuitive answer is to compute one variance per class, because each class might have a different spread.

But there is another possibility: we could compute one common variance for both classes, usually as a weighted average of the class variances.

This feels a bit unnatural at first, but we will see later that this idea leads directly to LDA.

So the table below gives us everything we need for this model, in fact, for both versions (LDA and QDA) of the model:

  • the number of observations in each class (to weight the classes)
  • the mean of each class
  • the standard deviation of each class
  • and the average standard deviation across both classes

With these values, the entire model is completely defined.

GNB, LDA and QDA in Excel – image by author
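
If you want to cross-check the table outside the spreadsheet, here is a small Python sketch computing the same four quantities; the numbers are invented, and weighting the pooled variance by n − 1 per class is my assumption, the sheet may use a slightly different weighting.

```python
import numpy as np

def class_statistics(x, y):
    """Count, mean and standard deviation per class, plus a pooled standard deviation."""
    stats = {}
    for c in np.unique(y):
        xc = x[y == c]
        stats[c] = {"n": len(xc), "mean": xc.mean(), "std": xc.std(ddof=1)}
    # Pooled variance = weighted average of the class variances (weights n_c - 1)
    dof = sum(s["n"] - 1 for s in stats.values())
    pooled_var = sum((s["n"] - 1) * s["std"] ** 2 for s in stats.values()) / dof
    return stats, float(np.sqrt(pooled_var))

# Made-up one-feature example with two classes
x = np.array([4.9, 5.1, 5.3, 6.8, 7.0, 7.4])
y = np.array(["A", "A", "A", "B", "B", "B"])
per_class, pooled_std = class_statistics(x, y)
print(per_class, pooled_std)
```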

Now, once we have a standard deviation, we can construct a more refined distance: the distance to the centroid divided by the standard deviation.

Why do we do this?

Because this gives a distance that is scaled by how variable the class is.

If a class has a large standard deviation, being far from its centroid is not surprising.

If a class has a very small standard deviation, even a small deviation becomes significant.

This simple normalization turns our Euclidean distance into something a little more meaningful, something that reflects the shape of each class.

This distance was introduced by Mahalanobis, so we call it the Mahalanobis distance.
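
In symbols (my own notation, with the mean and standard deviation of class k written as mu and sigma), this scaled distance is:

$$ d_k(x) = \frac{|x - \mu_k|}{\sigma_k} $$

A quick made-up example: if class A has mean 10 and standard deviation 2, and class B has mean 20 and standard deviation 8, the point x = 14 is at raw distance 4 from A and 6 from B, but at scaled distance 4/2 = 2 from A and only 6/8 = 0.75 from B. The raw distance prefers A, the scaled distance prefers B.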

Now we can do all these calculations directly in the Excel file.

GNB, LDA and QDA in Excel – image by author

The formulas are straightforward, and with conditional formatting, we can clearly see how the distance to each center changes and how the scaling affects the results.

GNB, LDA and QDA in Excel – image by author

Now, let's do some plots, still in Excel.

The diagram below shows the full progression: how we start from the Mahalanobis distance, move to the likelihood under each class distribution, and finally obtain the probability prediction.

GNB, LDA and QDA in Excel – image by author
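
The same progression, distance, then likelihood, then probability, fits in a few lines of Python. This is a sketch under my own assumptions: one feature, two classes with made-up parameters, and equal class weights.

```python
import numpy as np

def gaussian_likelihood(x, mean, std):
    """Gaussian density: exp(-0.5 * scaled_distance**2) / (std * sqrt(2*pi))."""
    z = (x - mean) / std  # the scaled (1D Mahalanobis) distance, with its sign
    return np.exp(-0.5 * z ** 2) / (std * np.sqrt(2 * np.pi))

# Made-up (mean, std) per class, as estimated in the "training" step
params = {"A": (10.0, 2.0), "B": (20.0, 8.0)}

x = 14.0
likelihoods = {c: gaussian_likelihood(x, m, s) for c, (m, s) in params.items()}
total = sum(likelihoods.values())
probabilities = {c: lik / total for c, lik in likelihoods.items()}
print(likelihoods, probabilities)
# To weight the classes by their sizes, multiply each likelihood by the class
# proportion before normalising.
```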

LDA vs. QDA: what do we see?

With only one feature, the difference becomes very easy to visualise.

For LDA, the x-axis is always cut into two parts by a single point, so the separation is linear. This is why the method is called Linear Discriminant Analysis.

For QDA, even with only one feature, the model can produce two cut points on the x-axis. In higher dimensions, this becomes a curved boundary, described by a quadratic function. Hence the name Quadratic Discriminant Analysis.
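
A short calculation makes the one-cut versus two-cuts picture precise (my own notation, with the class proportion used as a prior). Each class gets a score equal to the log of its weighted Gaussian density:

$$ \delta_k(x) = \ln \pi_k - \ln \sigma_k - \frac{(x - \mu_k)^2}{2\sigma_k^2} $$

The boundary is where the two scores are equal. If both classes share the same standard deviation (LDA), the x² terms cancel and the equation is linear in x, so there is a single cut point. If each class keeps its own standard deviation (QDA), the x² terms survive and the equation is quadratic in x, so there can be up to two cut points, and a curved boundary in higher dimensions.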

GNB, LDA and QDA in Excel – image by author

And you can directly modify the parameters to see how they impact the decision boundary.

Changes in the means or variances will move the frontier, and Excel makes these effects very easy to visualise.

By the way, does the shape of the LDA probability curve remind you of a model that you surely know? Yes, it looks exactly the same.

You may already guess which one, right?

But now the real question is: are they the same model? And if not, how do they differ?

GNB, LDA and QDA in Excel – image by author

We can also study the case with three classes. You can try this yourself as an exercise in Excel.

Here are the results. For each class, we repeat the exact same procedure. And for the final probability prediction, we simply sum all the likelihoods and take the proportion of each one.

GNB, LDA and QDA in Excel – image by author

Again, this approach is also used in another well-known model.
Do you know which one? It is far more familiar to most people, and this shows how closely connected these models really are.

If you understand one of them, you automatically understand the others much better.

Class Shape in 2D: Variance Only or Covariance as Well?

With one feature, we do not talk about dependency, since there is none. So in this case, QDA behaves exactly like Gaussian Naive Bayes, because we usually allow each class to have its own variance, which is perfectly natural.

The difference will appear once we move to two or more features. At that point, we will distinguish the models by how they treat the covariance between the features.

Gaussian Naive Bayes makes one very strong simplifying assumption:
the features are independent. This is the reason for the word “Naive” in its name.

LDA and QDA, however, do not make this assumption. They allow interactions between features, and that is what generates linear or quadratic boundaries in higher dimensions.

Let's do the exercise in Excel!

Gaussian Naive Bayes: no covariance

Let us begin with the simplest case: Gaussian Naive Bayes.

Here, we do not need to compute any covariance at all, because the model assumes that the features are independent.

To illustrate this, we can look at a small example with three classes.

GNB, LDA and QDA in Excel – image by author
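
Here is a minimal Python sketch of the Gaussian Naive Bayes likelihood under the independence assumption: the 2D likelihood of a class is just the product of two 1D Gaussians, one per feature. The per-class parameters below are invented for illustration.

```python
import numpy as np

def gaussian_1d(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

def gnb_likelihood(point, means, stds):
    """Naive Bayes: product of independent 1D Gaussian densities, one per feature."""
    return float(np.prod([gaussian_1d(v, m, s) for v, m, s in zip(point, means, stds)]))

# Invented parameters for three classes and two features
params = {
    "A": {"means": [1.0, 2.0], "stds": [0.5, 0.4]},
    "B": {"means": [4.0, 4.0], "stds": [0.8, 0.6]},
    "C": {"means": [2.5, 5.0], "stds": [0.6, 0.9]},
}

point = [2.0, 3.0]
likelihoods = {c: gnb_likelihood(point, p["means"], p["stds"]) for c, p in params.items()}
total = sum(likelihoods.values())
print({c: lik / total for c, lik in likelihoods.items()})
```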

QDA: each class has its own covariance

For QDA, we now have to calculate the covariance matrix for each class.

And once we have it, we also need to compute its inverse, because it is used directly in the formula for the distance and the likelihood.

So there are a few more parameters to compute compared to Gaussian Naive Bayes.

GNB, LDA and QDA in Excel – image by author
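
Here is a hedged Python sketch of the QDA computation: one covariance matrix per class, its inverse inside the squared Mahalanobis distance, and a multivariate Gaussian likelihood built from it. The data are randomly generated for illustration only.

```python
import numpy as np

def qda_fit(X, y):
    """Training: one mean vector, one covariance matrix (and its inverse) per class."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc, rowvar=False)
        model[c] = {"mean": Xc.mean(axis=0), "inv": np.linalg.inv(cov),
                    "det": np.linalg.det(cov), "prior": len(Xc) / len(X)}
    return model

def qda_likelihood(x, p):
    """Weighted multivariate Gaussian density using the class's own covariance."""
    d = x - p["mean"]
    maha2 = d @ p["inv"] @ d  # squared Mahalanobis distance
    norm = np.sqrt((2 * np.pi) ** len(x) * p["det"])
    return p["prior"] * np.exp(-0.5 * maha2) / norm

# Invented 2D data with two classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([1, 2], [0.5, 0.4], size=(20, 2)),
               rng.normal([4, 4], [1.0, 0.6], size=(20, 2))])
y = np.array(["A"] * 20 + ["B"] * 20)

model = qda_fit(X, y)
x_new = np.array([2.5, 3.0])
likelihoods = {c: qda_likelihood(x_new, p) for c, p in model.items()}
total = sum(likelihoods.values())
print({c: lik / total for c, lik in likelihoods.items()})
```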

LDA: all classes share the identical covariance

For LDA, all classes share the same covariance matrix, which reduces the number of parameters and forces the decision boundary to be linear.

Although the model is simpler, it remains very effective in many situations, especially when the amount of data is limited.

GNB, LDA and QDA in Excel – image by author
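
The only change compared to the QDA sketch above is in the training step: we estimate a single pooled covariance, shared by all classes. A sketch, assuming the same (hypothetical) qda_likelihood function is then reused with this shared matrix:

```python
import numpy as np

def lda_fit(X, y):
    """Training: class means plus ONE pooled covariance matrix shared by all classes."""
    classes = np.unique(y)
    # Pooled covariance = weighted average of the per-class covariances
    pooled = sum((np.sum(y == c) - 1) * np.cov(X[y == c], rowvar=False) for c in classes)
    pooled = pooled / (len(X) - len(classes))
    inv, det = np.linalg.inv(pooled), np.linalg.det(pooled)
    return {c: {"mean": X[y == c].mean(axis=0), "inv": inv, "det": det,
                "prior": float(np.mean(y == c))} for c in classes}
```

Because the shared covariance makes the quadratic term identical for every class, the differences between class scores are linear in the features, which is exactly why the LDA boundary is linear.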

Customized Class Distributions: Beyond the Gaussian Assumption

So far, we have only talked about Gaussian distributions, mainly for their simplicity. But we can also use other distributions, and even in Excel, it is very easy to change.

In reality, data usually do not follow a perfect Gaussian curve.

When exploring a dataset, we use empirical density plots almost every time. They give an immediate visual feeling of how the data is distributed.

And the kernel density estimator (KDE), a non-parametric method, is commonly used for this.

BUT, in practice, KDE is rarely used as a full classification model. It is not very convenient, and its predictions are often sensitive to the choice of bandwidth.

And what is interesting is that this idea of kernels will come back again when we discuss other models.

So although we show it here mainly for exploration, it is an essential building block in machine learning.

KDE (Kernel Density Estimator) in Excel – image by author
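
For the curious, a Gaussian KDE also fits in a few lines of Python; the bandwidth value below is arbitrary, which is exactly the sensitivity mentioned above.

```python
import numpy as np

def kde_density(x_grid, samples, bandwidth=0.5):
    """KDE: average of one Gaussian bump of width `bandwidth` centred on each sample."""
    z = (x_grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Invented sample and evaluation grid; try other bandwidths to see the sensitivity
samples = np.array([4.9, 5.1, 5.3, 6.8, 7.0, 7.4])
x_grid = np.linspace(4.0, 8.0, 9)
print(kde_density(x_grid, samples, bandwidth=0.4))
```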

Conclusion

Today, we followed a natural path that begins with simple averages and progressively leads to full probabilistic models.

  • Nearest Centroids compresses each class into one point.
  • Gaussian Naive Bayes adds the notion of variance and assumes the independence of the features.
  • QDA gives each class its own variance or covariance.
  • LDA simplifies the shape by sharing the covariance.

We even saw that we can step outside the Gaussian world and explore customized distributions.

All these models are connected by the same idea: a new observation belongs to the class it most resembles.

The difference is how we define resemblance: by distance, by variance, by covariance, or by a full probability distribution.

For all these models, we can do the two steps easily in Excel:

  • the first step is to estimate the parameters, which can be regarded as the model training
  • the inference step is to calculate the distance and the probability for each class

GNB, LDA and QDA – image by author

One more thing

Before closing this article, let us draw a small map of distance-based supervised models.

We have two major families:

  • local distance models
  • global distance models

For local distance, we already know the two classical ones:

  • k-NN regressor
  • k-NN classifier

Both predict by looking at neighbors and using the local geometry of the data.

For global distance, all the models we studied today belong to the classification world.

Why?

Because global distance requires centers defined by classes.
We measure how close a new observation is to each class prototype.

But what about regression?

It seems that this notion of global distance does not exist for regression… or does it?

The answer is yes, it does exist…

Mindmap – Distance-based machine learning supervised models – image by author

What are your thoughts on this topic?
Let us know in the comments below.
