Linear Discriminant Analysis (LDA) Can Be So Easy
How it really works — the maths behind it
Exploring the plot
Final remarks

Image created by Arus Nazaryan using Midjourney. Prompt: “Drone footage of two flocks of sheep, vibrant blue and deep red, divided by a fence which separates the flocks in the center, clean, realistic sheep, on green grass, photorealistic”

Classification is a central topic in machine learning. However, it can be difficult to understand how different algorithms work. In this article, we’ll make linear discriminant analysis come alive with an interactive plot that you can experiment with. Get ready to dive into the world of data classification!

👇🏽 Click to add and remove data points, drag to move them. Change the population parameters and generate new data samples.

If you are applying or studying classification methods, you may have come across various methods such as Naive Bayes, K-nearest neighbours (KNN), quadratic discriminant analysis (QDA) and linear discriminant analysis (LDA). However, it is not always intuitive to understand what the different algorithms are doing. LDA is one of the first approaches to learn and consider, and this article demonstrates how the technique works.

Let’s start from the beginning. All classification methods are approaches to answer the question: Which class does this observation belong to?

In the plot above, there are two independent variables, x_1 on the horizontal axis and x_2 on the vertical axis. Think of the independent variables as scores in two subjects, e.g. physics and literature (ranging from 0–20). The dependent value y is the class, represented in the plot as red or blue. Think of it as a binary variable we want to predict, such as whether an applicant is admitted to university (“yes” or “no”). A set of given observations is displayed as circles. Each is characterised by an x_1 value, an x_2 value and a class.

Now, if a new data point is added, to which class should it be assigned?

LDA allows us to draw a boundary that divides the space into two parts (or more, but two in this case with two classes). For example, in figure 1 below I marked a new data point at (x_1 = 4, x_2 = 14) with a cross. Because it falls on the “red side” of the space, this observation is assigned to the red class.

Datapoints scattered on a plot, divided by the LDA boundary. The cross marks a new datapoint to be classified.
Figure 1: The new observation at the cross (x_1 = 4, x_2 = 14) is assigned to the red class.

So given the LDA boundary, we can make classifications. But how do we draw the boundary line using the LDA algorithm?

The line divides the plot where the probability of the red class and the blue class is 50% each. Going to one side, the probability of red is higher; going to the other side, blue is more probable. We need to come up with a way to calculate how probable it is for the new observation to belong to each of the classes.

The probability that the new observation belongs to the red class can be written this way: P(Y = red | X = (x_1 = 4, x_2 = 14)). To make it easier to read, instead of x_1 = 4, x_2 = 14, I’ll in the following just write x, which is a vector containing the two values.

According to Bayes’ theorem, we can express conditional probabilities as P(Y = red | X = x) = P(Y = red) * P(X = x | Y = red) / P(X = x). So to calculate the probability that the new observation is “red” given its x values, we need to know three other probabilities.

Let’s go step by step. P(Y = red) and P(X = x) are called “prior probabilities”. In the plot, P(Y = red) can be calculated simply by taking the share of “red” observations in the total number of observations (27.27% in figure 1).
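To make the class prior concrete, here is a minimal sketch in Python. The label list is hypothetical: I simply picked 3 red and 8 blue observations because that split reproduces the 27.27% share mentioned above. The helper name `class_priors` is my own, not part of any library.

```python
from collections import Counter

# Hypothetical labels: 3 red and 8 blue observations (11 in total),
# chosen only so that the red share equals 3/11, i.e. about 27.27%.
labels = ["red"] * 3 + ["blue"] * 8


def class_priors(labels):
    """Estimate P(Y = k) as the share of observations carrying label k."""
    counts = Counter(labels)
    total = len(labels)
    return {k: n / total for k, n in counts.items()}


priors = class_priors(labels)
print({k: round(p, 4) for k, p in priors.items()})
# prints {'red': 0.2727, 'blue': 0.7273}
```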

Calculating the prior probability of x, P(X = x), is harder. We don’t have any observation with x_1 = 4 and x_2 = 14 in the initial dataset, so what’s its probability? Zero? Not quite.

P(X = x) can be understood as the sum of the joint probabilities P(Y = red, X = x) and P(Y = blue, X = x):

P(X = x) = P(Y = red, X = x) + P(Y = blue, X = x)

Each of the joint probabilities can be expressed with the product rule of probability (the identity underlying Bayes’ theorem) as:

P(Y = red, X = x) = P(X = x | Y=red) * P(red)
P(Y = blue, X = x) = P(X = x | Y=blue) * P(blue)

You see the priors P(red) and P(blue) appear again, which we already know how to determine. What’s left for us to do is to find a way to calculate the conditional probabilities P(X = x | Y = red) and P(X = x | Y = blue).
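Before filling in the conditional terms, it may help to see how the pieces combine. The sketch below uses made-up numbers for P(X = x | Y = red) and P(X = x | Y = blue); they are placeholders for illustration, not values taken from the plot. The function name `posterior` is my own.

```python
def posterior(priors, likelihoods):
    """Combine P(Y = k) and P(X = x | Y = k) into P(Y = k | X = x)."""
    # Joint terms: P(Y = k, X = x) = P(X = x | Y = k) * P(Y = k)
    joint = {k: likelihoods[k] * priors[k] for k in priors}
    # P(X = x) is the sum of the joint terms over all classes
    p_x = sum(joint.values())
    # Bayes' theorem: divide each joint term by P(X = x)
    return {k: joint[k] / p_x for k in joint}


priors = {"red": 3 / 11, "blue": 8 / 11}      # hypothetical class shares
likelihoods = {"red": 0.02, "blue": 0.005}    # made-up placeholder values
post = posterior(priors, likelihoods)
print({k: round(p, 3) for k, p in post.items()})
# prints {'red': 0.6, 'blue': 0.4}
```

Note that red wins here even though its prior is smaller, because the (assumed) conditional term for red is four times larger.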

Even though we never saw any red data point at position x in our original data set, we can find a probability if we regard x as drawn from a population of a certain form. Usually when using real-world data, the population is not directly observable, so we don’t know the true form of the population distribution. What LDA does is assume that the x values are normally distributed along the x_1 and x_2 axes. With real-world data, this assumption can be more or less reasonable. As we are dealing with a generated data set here though, we don’t need to worry about this problem. The data originate from populations which follow a normal (i.e. Gaussian) distribution, characterised by two parameters (per independent variable): ~N(mean, variance).

Using the formula for the multivariate normal distribution, we can describe the distribution of the data belonging to class “red” as

P(X = x | Y = red) = \frac{1}{(2\pi)|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu_{red})^T \Sigma^{-1} (x-\mu_{red})\right)

where Σ denotes the covariance matrix. LDA uses just one common covariance matrix for all classes, meaning it assumes that all classes have the same variance(x_1), variance(x_2) and covariance(x_1, x_2).
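The density formula and the shared covariance matrix can be sketched with NumPy as follows. Everything about the data is an assumption for illustration: the population means, the covariance and the sample sizes are made up. Dividing the within-class scatter by n - K is one common way to estimate the common covariance, not necessarily the one used for the plot. Strictly speaking, the formula returns a density value rather than a probability, since x is continuous.

```python
import numpy as np


def pooled_covariance(X, y):
    """Common covariance: within-class scatter summed over classes, divided by n - K."""
    classes = np.unique(y)
    n, p = X.shape
    scatter = np.zeros((p, p))
    for k in classes:
        centered = X[y == k] - X[y == k].mean(axis=0)
        scatter += centered.T @ centered
    return scatter / (n - len(classes))


def gaussian_density(x, mu, sigma):
    """Multivariate normal density at x; for two variables the constant is 2*pi*|Sigma|^(1/2)."""
    p = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff)) / norm


# Hypothetical generated data: two Gaussian populations sharing one covariance
rng = np.random.default_rng(0)
cov_true = np.array([[4.0, 1.0], [1.0, 4.0]])
X = np.vstack([
    rng.multivariate_normal([6.0, 14.0], cov_true, 30),   # "red" sample
    rng.multivariate_normal([12.0, 8.0], cov_true, 80),   # "blue" sample
])
y = np.array(["red"] * 30 + ["blue"] * 80)

sigma = pooled_covariance(X, y)
x_new = np.array([4.0, 14.0])
densities = {k: gaussian_density(x_new, X[y == k].mean(axis=0), sigma)
             for k in ("red", "blue")}
print(densities)
```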

Now that we have the formula for P(X = x | Y = red), we can plug it into the equation above to get P(Y = red | X = x). This gives us a monstrous equation, which I’ll skip here. Luckily, it gets simpler from here. Our goal now is to find the dividing line that separates the red and blue zones, i.e. we want to find those points where the probabilities of being class red or blue are equal:

P(Y = red|X = x) = P(Y = blue|X = x)

After some transformations (taking logarithms and cancelling the terms that are identical for both classes), it can be shown that this is equivalent to the following equation:

x^T \Sigma^{-1} \mu_{red} - \frac{1}{2} \mu_{red}^T \Sigma^{-1} \mu_{red} + \log(P(red)) = x^T \Sigma^{-1} \mu_{blue} - \frac{1}{2} \mu_{blue}^T \Sigma^{-1} \mu_{blue} + \log(P(blue))
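Each side of this equation can be read as a score for one class: whichever side is larger at a given x tells us on which side of the boundary x lies. Here is a minimal sketch, assuming hypothetical means, priors and a common covariance matrix (none of these numbers come from the plot); the function name `discriminant_score` is my own.

```python
import numpy as np


def discriminant_score(x, mu, sigma_inv, prior):
    """x^T Sigma^-1 mu - 0.5 * mu^T Sigma^-1 mu + log(prior) for one class."""
    return x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + np.log(prior)


# Hypothetical parameters, chosen only for illustration
mu = {"red": np.array([6.0, 14.0]), "blue": np.array([12.0, 8.0])}
prior = {"red": 0.3, "blue": 0.7}
sigma = np.array([[4.0, 1.0], [1.0, 4.0]])   # common covariance matrix
sigma_inv = np.linalg.inv(sigma)

x_new = np.array([4.0, 14.0])
scores = {k: discriminant_score(x_new, mu[k], sigma_inv, prior[k]) for k in mu}
predicted = max(scores, key=scores.get)
print(scores, predicted)
```

With these assumed parameters the red score is the larger one, so the point (4, 14) would be assigned to the red class.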

Solving this for x_2, we get a line like

x_2 = \beta_0 + \beta_1 x_1

where β_0 and β_1 are parameters that depend on the means µ_red and µ_blue, the common covariance matrix Σ and the prior probabilities of red and blue, P(“red”) and P(“blue”).
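To see where the two parameters come from, collect the x terms of the equality on one side: w^T x = c, with w = Σ^{-1}(µ_red - µ_blue) and c = ½(µ_red^T Σ^{-1} µ_red - µ_blue^T Σ^{-1} µ_blue) + log(P(blue) / P(red)). Solving w_1 x_1 + w_2 x_2 = c for x_2 gives β_0 = c / w_2 and β_1 = -w_1 / w_2. The sketch below implements exactly that; the input numbers are hypothetical and the function name `lda_boundary` is my own.

```python
import numpy as np


def lda_boundary(mu_red, mu_blue, sigma, p_red, p_blue):
    """Return (beta_0, beta_1) of the boundary line x_2 = beta_0 + beta_1 * x_1."""
    sigma_inv = np.linalg.inv(sigma)
    # Rearranging the equality of both class scores gives w . x = c
    w = sigma_inv @ (mu_red - mu_blue)
    c = (0.5 * (mu_red @ sigma_inv @ mu_red - mu_blue @ sigma_inv @ mu_blue)
         + np.log(p_blue / p_red))
    if np.isclose(w[1], 0.0):
        raise ValueError("boundary is vertical and cannot be written as x_2 = f(x_1)")
    return c / w[1], -w[0] / w[1]


# Hypothetical parameters, chosen only for illustration
beta_0, beta_1 = lda_boundary(
    mu_red=np.array([6.0, 14.0]),
    mu_blue=np.array([12.0, 8.0]),
    sigma=np.array([[4.0, 1.0], [1.0, 4.0]]),
    p_red=0.3,
    p_blue=0.7,
)
print(f"x_2 = {beta_0:.3f} + {beta_1:.3f} * x_1")
```

The vertical-line check is a design choice of this sketch: when w_2 is zero, the boundary still exists but has no slope-intercept form.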

