
The Matrix Algebra of Linear Regression


Looking under the hood on the matrix operations behind linear regression


Introduction

For many, simple linear regression is a standard starting point for understanding model-based estimation and model comparison. Whether you’re taking an introductory statistics or data science course, you can bet your bottom dollar that linear regression will crop up at some point. And there’s a good reason for it.

Simple linear regression provides a natural extension of simple descriptive statistics by modeling the response variable as a linear function of a single predictor. This simplicity not only facilitates the interpretation of model parameters but also makes estimation via ordinary least squares (OLS) easier to understand.

While most textbook introductions provide a detailed mathematical treatment alongside the broader conceptual elements, when it comes to actually implementing the model it’s almost never done from first principles. Regardless of which language you’re using, there will almost certainly be a convenience function that fits the model for you. And why wouldn’t you use it? You don’t have to do all those tedious calculations by hand! That’s certainly a plus; though I’m a firm believer that taking some time to familiarise yourself with a model’s statistical machinery is an essential part of becoming an effective analyst and data scientist.

In an earlier post, I provided a primer on linear algebra that touched on some fundamental principles and operations. In this post, we’re going to build on those concepts by looking under the hood and getting our hands dirty with the matrix operations that underpin linear regression.

Simple Linear Regression in Matrix Form

Most will be familiar with the standard regression formula that models a response variable Y as a linear combination of a single predictor X:

$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2)$$

where I’ve adopted the convention of assuming the errors are normally distributed. From here we’ll build out the matrix representation by assigning these elements to vectors and matrices.

First, we’ll put all of the responses in an n-dimensional vector called the response vector:

$$\mathbf{Y}_{n \times 1} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}$$

I’ve been quite explicit here by including the dimensions of the vector — which is effectively a column matrix — just so we can keep track of things. However, it’s perfectly reasonable to use a lowercase boldface y if you wanted to.

Next, the predictor X is placed into an n × p matrix called the design matrix:

$$\mathbf{X}_{n \times p} = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_n \end{bmatrix}$$

where p denotes the number of columns and corresponds to the number of coefficients in the model. Note that the first column contains only ones — we’ll discuss this shortly — which is there to accommodate the intercept, a constant. Therefore, the number of columns in the design matrix is always one greater than the number of predictors you have. In the example above, we only have a single predictor, meaning we need to estimate an intercept and a slope; so p = 2 in this case.
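To make this concrete, here is a minimal NumPy sketch — the toy data and variable names are my own, purely for illustration — showing how a design matrix with a leading column of ones might be assembled from a single predictor:

```python
import numpy as np

# Toy predictor values; any single column of observations will do.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Prepend a column of ones so the intercept is carried through the matrix algebra.
X = np.column_stack([np.ones_like(x), x])

print(X.shape)  # (5, 2): n = 5 observations, p = 2 coefficients
print(X)
```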

The regression coefficients are likewise placed into a p × 1 vector called the parameter vector:

$$\boldsymbol{\beta}_{p \times 1} = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}$$

Again, p denotes the number of parameters, though here it gives the number of rows, unlike the design matrix where p is the column dimension. This arrangement matters because we’ll need to do some matrix multiplication with these two objects to compute the linear predictor.

Before we do that, however, there is one last thing to do — place all the error terms into an n × 1 vector:

$$\boldsymbol{\epsilon}_{n \times 1} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$

With all of these in place, we can now express the simple linear regression model using matrix notation like so:

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

The Linear Predictor

In words, the matrix formulation of the linear regression model is the product of two matrices, X and β, plus an error vector. The product of X and β is an n × 1 matrix called the linear predictor, which I’ll denote η here:

$$\boldsymbol{\eta} = \mathbf{X}\boldsymbol{\beta}$$

Now, matrix multiplication works a little differently than you might expect. I covered this in my primer on linear algebra — so if you haven’t checked it out you can find it here — but I’ll quickly run through the details now.

If we have some m × q matrix A and another q × r matrix B, then the product is an m × r matrix C (note how the q dimension drops out of the result). The reason for this change in size is that the element positioned at row i and column j in C is the dot product of the ith row in A and the jth column in B:

$$C_{ij} = \sum_{k=1}^{q} A_{ik} B_{kj}$$

So, because the dot product sums across the q dimension, that dimension drops out of the resulting matrix.
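A quick NumPy check of the dimension rule — the shapes here are hypothetical, chosen only to illustrate how the shared dimension disappears:

```python
import numpy as np

# An (m x q) matrix times a (q x r) matrix yields an (m x r) matrix.
A = np.ones((3, 4))   # m = 3, q = 4
B = np.ones((4, 2))   # q = 4, r = 2
C = A @ B

print(C.shape)  # (3, 2): the shared q dimension drops out
```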

For the simple linear regression case, the product is between the n × p matrix X and the p × 1 matrix β, which therefore results in an n × 1 matrix η. Following on from above, the (i, j) element in η is computed using the following dot product:

$$\eta_{ij} = \sum_{k=1}^{p} X_{ik} \beta_{kj}$$

Here the sum is taken over p, which we know is the number of coefficients in the model, so p = 2.

If we then substitute the dot product into the linear predictor vector and plug in the values from the first column of the design matrix, we get the following (because j = 1, I’m going to drop that subscript to declutter the notation a little):

$$\boldsymbol{\eta} = \begin{bmatrix} \eta_1 \\ \eta_2 \\ \vdots \\ \eta_n \end{bmatrix} = \begin{bmatrix} (1)\beta_0 + X_1\beta_1 \\ (1)\beta_0 + X_2\beta_1 \\ \vdots \\ (1)\beta_0 + X_n\beta_1 \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_1 \\ \beta_0 + \beta_1 X_2 \\ \vdots \\ \beta_0 + \beta_1 X_n \end{bmatrix}$$

All of this reduces to something very familiar indeed: each element in η is just our linear regression equation applied to each value of X! Hopefully, you can now see why the column of ones is included in the design matrix. It ensures that the intercept is added to every observation.
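As a sanity check, a short sketch (with a made-up coefficient vector, purely for illustration) shows that the matrix product Xβ reproduces the elementwise β₀ + β₁Xᵢ calculation:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])   # n x p design matrix, p = 2
beta = np.array([0.5, 2.0])                 # hypothetical intercept and slope

# The linear predictor via matrix multiplication...
eta = X @ beta

# ...matches beta_0 + beta_1 * x applied to every observation.
print(np.allclose(eta, 0.5 + 2.0 * x))  # True
```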

Model Errors

There are three critical assumptions relating to the error terms — or perturbations — around the linear predictor that fall out of the Gauss-Markov theorem. The first is that the conditional mean of the errors is zero, implying that the average of the errors should not depend on any particular value of X. This is known as the zero conditional mean assumption:

$$E[\boldsymbol{\epsilon} \mid \mathbf{X}] = \mathbf{0}$$

Related to this is the homoscedasticity assumption, which states that the variance of the errors should not be affected by the values of the independent variables. That is, the spread of the errors should be entirely independent of any information contained within the design matrix:

$$\operatorname{Var}(\epsilon_i \mid \mathbf{X}) = \sigma^2$$

The final assumption is the no autocorrelation assumption, which requires that the error terms are uncorrelated with one another. Knowing one error term therefore conveys no information about another, and the errors do not co-vary:

$$\operatorname{Cov}(\epsilon_i, \epsilon_j \mid \mathbf{X}) = 0 \quad \text{for all } i \neq j$$

Under these assumptions, the covariance matrix for the error terms is a scalar matrix and the errors are considered to be spherical:

$$\operatorname{Cov}(\boldsymbol{\epsilon} \mid \mathbf{X}) = \sigma^2 \mathbf{I}_n = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix}$$
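If it helps to see this numerically, here is a small simulation sketch (the sample size, variance, and number of replications are arbitrary choices of mine): drawing many independent, homoscedastic error vectors yields a sample covariance matrix that is approximately σ² times the identity.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n, reps = 2.0, 4, 200_000

# Each row is one draw of an n x 1 error vector whose entries are independent
# and homoscedastic (every element has variance sigma^2).
eps = rng.normal(loc=0.0, scale=sigma, size=(reps, n))

# The sample covariance across draws is approximately sigma^2 * I:
# a constant variance on the diagonal and near-zero covariances elsewhere.
print(np.round(np.cov(eps, rowvar=False), 2))
```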

Just a note before moving on: while I adopted the convention of normally distributed errors, the Gauss-Markov theorem doesn’t require the error terms to be normal, nor does it require that errors are independent and identically distributed; only that the error terms are homoscedastic and uncorrelated. This means the errors can have dependencies, but as long as those dependencies aren’t linear — which is what correlation measures — parameter estimation can proceed safely.

Parameter Estimation via Ordinary Least Squares

When fitting a linear regression model to data, the goal is to estimate the unknown model coefficients contained in β. The standard approach to finding the best values for these unknowns is to satisfy the least squares criterion, which aims to minimize the total squared error between the linear predictor and the response. I’m going to step through how this is implemented for the simple linear regression model, though I’ll move through this section fairly quickly.

In matrix form, the vector of errors, or residuals, is defined as follows:

$$\hat{\boldsymbol{\epsilon}} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}$$

where the hat on top of β denotes the estimated coefficients. The sum of the squared residuals can then be written as the dot product of the error vector with itself:

$$SSE = \hat{\boldsymbol{\epsilon}}^{T}\hat{\boldsymbol{\epsilon}} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{T}(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})$$

where T denotes the transpose operator¹. To derive the least squares criterion it’s convenient to expand this product in full. Noting that $\hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{Y}$ is a scalar and therefore equal to its own transpose $\mathbf{Y}^{T}\mathbf{X}\hat{\boldsymbol{\beta}}$, the two cross terms combine, giving the following:

$$SSE = \mathbf{Y}^{T}\mathbf{Y} - 2\hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{Y} + \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{\beta}}$$

The idea, then, is to find the parameters that minimize this value. To do so, we take the derivative of the above with respect to the vector β and set it to zero:

$$\frac{\partial\, SSE}{\partial \hat{\boldsymbol{\beta}}} = -2\mathbf{X}^{T}\mathbf{Y} + 2\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}$$

from which the normal equation can be derived:

$$\mathbf{X}^{T}\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^{T}\mathbf{Y}$$

From here, we can find a solution for our unknown parameters by multiplying both sides of the normal equation by the inverse of XᵀX. I cover matrix inversion in this post, though if you haven’t read that, a matrix A is invertible if there exists a matrix B such that their product returns the identity matrix, I.

For a simple linear regression model, XᵀX is square with a size of 2 × 2, though more generally it will be a p × p matrix. We then need to find another 2 × 2 matrix that is the multiplicative inverse of XᵀX. If such a matrix doesn’t exist then the equation can’t be solved, though if XᵀX is indeed invertible, then we can obtain the parameter vector b as follows:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$$

Here you can see why it’s crucial that the design matrix is not rank deficient. If there are linear dependencies between the columns, then XᵀX can’t be inverted, meaning a solution doesn’t exist. This is exactly what happens when you have perfect multicollinearity.
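To see the estimator in action, here is a minimal sketch on simulated data (the true coefficients and sample size are my own choices). It applies the textbook formula directly and cross-checks it against NumPy’s least-squares routines:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 1.5 + 0.8 * x + rng.normal(scale=1.0, size=100)  # true (intercept, slope) = (1.5, 0.8)

X = np.column_stack([np.ones_like(x), x])

# Textbook form of the estimator: beta_hat = (X'X)^{-1} X'y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically, solving the normal equations (or using lstsq) is usually preferred
# to forming an explicit inverse; all three agree here.
beta_solve = np.linalg.solve(X.T @ X, X.T @ y)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat, beta_solve, beta_lstsq)
```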

A Closer Look at Parameter Estimation

It’s interesting to look more closely at the elements contained in each matrix. First, let’s take a look at the cross product of the design matrix, XᵀX:

$$\mathbf{X}^{T}\mathbf{X} = \begin{bmatrix} n & \sum_{i=1}^{n} X_i \\ \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^{2} \end{bmatrix}$$

From here we can see that the matrix contains sums of products of the columns in the design matrix. But what we need is the inverse of this matrix. I won’t go through how to derive the inverse, but it looks like this:

$$(\mathbf{X}^{T}\mathbf{X})^{-1} = \frac{1}{n\sum X_i^{2} - \left(\sum X_i\right)^{2}}\begin{bmatrix} \sum X_i^{2} & -\sum X_i \\ -\sum X_i & n \end{bmatrix}$$

Finally, we also need the cross product of the design matrix and the response vector Y, which produces the following:

$$\mathbf{X}^{T}\mathbf{Y} = \begin{bmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_i Y_i \end{bmatrix}$$

With the matrices written out in full, we can substitute them into the estimation equation and work through the calculation like so:

$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y} \\[4pt]
&= \frac{1}{n\sum X_i^{2} - \left(\sum X_i\right)^{2}}
\begin{bmatrix} \sum X_i^{2} & -\sum X_i \\ -\sum X_i & n \end{bmatrix}
\begin{bmatrix} \sum Y_i \\ \sum X_i Y_i \end{bmatrix} \\[4pt]
&= \frac{1}{n\sum X_i^{2} - \left(\sum X_i\right)^{2}}
\begin{bmatrix} \sum X_i^{2}\sum Y_i - \sum X_i \sum X_i Y_i \\ n\sum X_i Y_i - \sum X_i \sum Y_i \end{bmatrix} \\[4pt]
&= \begin{bmatrix} \bar{Y} - \hat{\beta}_1 \bar{X} \\[4pt] \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^{2}} \end{bmatrix}
\end{aligned}$$

To be fair, there is a bit going on in that derivation, but it’s really the last line that’s most interesting. What this all boils down to is something quite convenient: we can estimate the slope coefficient using the sample covariance and variance like so:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^{2}} = \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}$$

Once we have that, we can use this estimate, together with the means of Y and X, to derive the estimate for the intercept:

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
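A quick numerical sketch (again on simulated data of my own making) confirms that the covariance-and-variance shortcut agrees with the full matrix solution:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 - 0.5 * x + rng.normal(scale=0.3, size=50)

# Matrix solution from the normal equations.
X = np.column_stack([np.ones_like(x), x])
b0_mat, b1_mat = np.linalg.solve(X.T @ X, X.T @ y)

# Closed-form simple regression estimates:
# slope = Cov(x, y) / Var(x), intercept = mean(y) - slope * mean(x).
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()

print(np.allclose([b0, b1], [b0_mat, b1_mat]))  # True
```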

Fitted Values

We have now used some linear algebra to find the best-fitting parameters for a simple linear regression model. With these in hand, the next step is to see how well the fitted values correspond with the values contained in the response vector.

All we need to obtain the fitted values is the design matrix together with the parameter vector b. Multiplying these together, the fitted values are computed like so:

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}$$

Note that our fitted values have a hat placed on top of them to indicate that they are estimated values. Relatedly, another way to express the fitted values is as a combination of the response vector and the hat matrix, which gets its name from the fact that it puts a hat on Y.

To see this, let’s write out the equation for the fitted values by substituting b with the full equation for the parameter vector:

$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{Y}$$

Essentially, every term that involves the design matrix X is lumped together to form the definition of the hat matrix:

$$\mathbf{H} = \mathbf{X}(\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}, \qquad \hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y}$$
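Here is a final sketch, once more on simulated data of my own, forming the hat matrix explicitly and checking that “putting the hat on Y” reproduces the fitted values from Xβ̂:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=30)
y = 1.0 + 0.4 * x + rng.normal(scale=0.5, size=30)
X = np.column_stack([np.ones_like(x), x])

# The hat matrix lumps every X term together: H = X (X'X)^{-1} X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T

# "Putting the hat on Y" gives the same fitted values as X @ beta_hat.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(H @ y, X @ beta_hat))  # True
```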

Final Remarks

If you’ve made it this far, thanks for sticking with it. There was a lot to unpack, obviously. Nonetheless, I hope you can see how the principles of matrix algebra are used to build simple linear regression models. While I have focused on simple linear regression throughout to keep the examples concise, everything discussed here works the same way for multiple regression, too. All that changes is that the dimensionality of the matrices and vectors increases.

Footnotes

[1] The notation I have used here is different from what I used in my primer article, but they mean exactly the same thing. The reason for the change is to keep in line with how matrix formulations are typically described.
