
An Accessible Derivation of Linear Regression


The maths behind the model, from additive assumptions to pseudoinverse matrices

Photo by Saad Ahmad on Unsplash

Technical disclaimer: it is possible to derive this model without any normality assumptions. We'll take the normality route because it's simple enough to grasp and because, by assuming normality of the model's output, we'll be able to reason about the uncertainty of our predictions.

This post is meant for people who are already familiar with what linear regression is (and perhaps have used it a few times) and want a more principled understanding of the maths behind it.

Some background in basic probability (probability distributions, joint probability, mutually exclusive events), linear algebra, and statistics will help you get the most out of what follows. Without further ado, here we go:

The machine learning world is full of amazing connections: the exponential family, regularization and prior beliefs, KNN and SVMs, Maximum Likelihood and Information Theory; it's all connected! (I like Dark.) This time we'll discuss how to derive another member of the exponential family: the Linear Regression model, and in the process we'll see that the Mean Squared Error loss is theoretically well motivated. As with any regression model, we'll be able to use it to predict numerical, continuous targets. It's a simple yet powerful model that happens to be one of the workhorses of statistical inference and experimental design. However, we will be concerned only with its use as a predictive tool. No pesky inference (and God forbid, causal) stuff here.

Alright, let us begin. We want to predict something based on something else. We'll call the predicted thing y and the something else x. As a concrete example, I offer the following toy situation: you're a credit analyst working at a bank and you're interested in automatically figuring out the right credit limit for a customer. You also happen to have a dataset of past clients and the credit limit (the predicted thing) that was approved for them, along with some of their features such as demographic info, past credit performance, income, etc. (the something else).

So we have a great idea and write down a model that explains the credit limit in terms of the features available to you, the model's fundamental assumption being that each feature contributes something to the observed output in an additive manner. Since the credit stuff was just a motivating (and contrived) example, let's return to our pure math world of spherical cows, with our model becoming something like this:
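
(one way to write it, using θ for the feature weights, x for the feature vector, and ε for a noise term; an intercept can be folded in by adding a constant feature equal to 1)

y = \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_M x_M + \varepsilon = \theta^\top x + \varepsilon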

We still have the predicted stuff (y) and the something else we use to predict it (x). We concede that some kind of noise is unavoidable (be it due to imperfect measuring or our own blindness), and the best we can do is assume that the model behind the data we observe is stochastic. The consequence is that we might see slightly different outputs for the same input, so instead of neat point estimates we're "stuck" with a probability distribution over the outputs (y) conditioned on the inputs (x):
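
(schematically, with θ standing for whatever parameters that distribution needs)

y \mid x \sim p(y \mid x;\, \theta)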

Every data point in y is replaced by a little bell curve whose mean lies at the observed value of y, with some variance we don't care about for the moment. Then our little model will take the place of the distribution's mean.
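
In symbols, each observation then looks roughly like this (with σ² standing for that variance we're ignoring):

y_i \mid x_i \sim \mathcal{N}\left(\theta^\top x_i,\ \sigma^2\right)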

Assuming all those bell curves are actually normal distributions and their means (the data points in y) are independent from one another, the (joint) probability of observing the dataset is
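
roughly this, under those assumptions:

p(y_1, \dots, y_N \mid x_1, \dots, x_N;\, \theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right)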

Logarithms and some algebra to the rescue:
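
Taking the log of the product above turns it into a sum, something along these lines:

\log p(y \mid X;\, \theta) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \theta^\top x_i\right)^2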

Logarithms are cool, aren't they? Logs turn multiplication into addition, division into subtraction, and powers into multiplication. Quite handy from both algebraic and numerical standpoints. Eliminating the constant terms, which are irrelevant in this case, we arrive at the following maximum likelihood problem:
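
(up to a positive factor that doesn't affect where the maximum sits)

\hat{\theta} = \arg\max_{\theta}\ \left[-\sum_{i=1}^{N}\left(y_i - \theta^\top x_i\right)^2\right]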

Well, that's equivalent to
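
(maximizing the negative of a sum of squares is the same as minimizing the sum of squares itself):

\hat{\theta} = \arg\min_{\theta}\ \sum_{i=1}^{N}\left(y_i - \theta^\top x_i\right)^2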

The expression we're about to minimize is something very close to the famous Mean Squared Error loss. In fact, for optimization purposes the two are equivalent.
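
For reference, the Mean Squared Error is just the same sum scaled by 1/N, which doesn't move the minimum:

\mathrm{MSE}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \theta^\top x_i\right)^2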

So what now? This minimization problem can be solved exactly using derivatives. We'll take advantage of the fact that the loss is quadratic, which means it's convex, which means it has a single global minimum; that lets us take its derivative, set it to zero, and solve for theta. Doing this, we'll find the value of the parameters theta that makes the derivative of the loss zero. And why? Because it is precisely at the point where the derivative is zero that the loss is at its minimum.

To make everything a little simpler, let's express the loss in vector notation:
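
With X and y as defined in the next paragraph, one compact way to write it is:

L(\theta) = (y - X\theta)^\top(y - X\theta) = \lVert y - X\theta \rVert^2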

Here, X is an N×M matrix representing our whole dataset of N examples and M features, and y is a vector containing the observed response for each training example. Taking the derivative and setting it to zero, we get
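
(assuming XᵀX is invertible)

\nabla_\theta L(\theta) = -2X^\top(y - X\theta) = 0 \;\Longrightarrow\; X^\top X\,\hat{\theta} = X^\top y \;\Longrightarrow\; \hat{\theta} = \left(X^\top X\right)^{-1}X^\top y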

There you have it: the solution to the optimization problem we cast our original machine learning problem into. If you go ahead and plug those parameter values into your model, you'll have a trained ML model ready to be evaluated on some holdout dataset (or perhaps through cross-validation).
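
As a minimal sketch of what that looks like in practice (assuming NumPy, a feature matrix X of shape (N, M), and a target vector y of length N), the closed-form fit and a prediction might be:

import numpy as np

def fit_linear_regression(X, y):
    # Solve the normal equations X^T X theta = X^T y for theta
    # (more stable than forming the inverse of X^T X explicitly).
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(X, theta):
    # Predictions are just the linear combination X @ theta.
    return X @ theta

# Hypothetical toy data, only to show the shapes involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                          # N=100 examples, M=3 features
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=100)   # noisy targets

theta_hat = fit_linear_regression(X, y)
y_pred = predict(X, theta_hat)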

If you think that final expression looks an awful lot like the solution of a linear system,

it's because it does (for a vanilla linear system we would solve Xθ = y directly as θ = X⁻¹y). The extra stuff comes from the fact that for our problem to be equivalent to such a system, we'd need an equal number of features and training examples so that we could invert X. Since that's seldom the case, we can only hope for a "best fit" solution, in some sense of best, by resorting to the Moore-Penrose pseudoinverse of X, which is a generalization of the good ol' inverse matrix. The associated Wikipedia entry makes for fun reading.
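
Concretely, when XᵀX is invertible the pseudoinverse reduces to X⁺ = (XᵀX)⁻¹Xᵀ, so the fitted parameters can also be written as θ̂ = X⁺y. In NumPy terms (a sketch; the built-in least-squares solver is another common route), that would be:

theta_hat = np.linalg.pinv(X) @ y

# Or, equivalently for our purposes, via the least-squares solver:
theta_hat, residuals, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)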
