You can use other prior distributions on your parameters to create more interesting regularizations. You could even say that your parameters w are normally distributed, but with some covariance matrix Σ.
Let us assume that Σ is positive-definite, i.e. we are in the non-degenerate case. Otherwise, there is no density p(w).
If you do the math, you will find that we then have to minimize

|Xw − y|² + |Γw|²

for some matrix Γ. This is also called Tikhonov regularization.
As a hint, start with the fact that

p(w) ∝ exp(−½ · wᵀΣ⁻¹w)

and keep in mind that positive-definite matrices can be decomposed into a product of some invertible matrix and its transpose, i.e. Σ⁻¹ = ΓᵀΓ.
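To make that decomposition concrete, here is a minimal numpy sketch (the covariance matrix Σ below is just made up for illustration): it builds Γ from the Cholesky factor of Σ⁻¹ and checks that the quadratic form wᵀΣ⁻¹w in the Gaussian prior equals |Γw|².

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up positive-definite covariance matrix Σ for the prior on w.
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 3 * np.eye(3)

# Decompose Σ⁻¹ = ΓᵀΓ via its Cholesky factor L (Σ⁻¹ = L·Lᵀ, so Γ = Lᵀ).
Sigma_inv = np.linalg.inv(Sigma)
L = np.linalg.cholesky(Sigma_inv)
Gamma = L.T

# The exponent of the Gaussian prior equals the squared norm |Γw|².
w = rng.normal(size=3)
print(w @ Sigma_inv @ w)               # wᵀΣ⁻¹w
print(np.linalg.norm(Gamma @ w) ** 2)  # |Γw|², the same value up to rounding
```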
Great, so we have defined our model and know what we want to optimize. But how do we optimize it, i.e. learn the best parameters that minimize the loss function? And when is there a unique solution? Let's find out.
Ordinary Least Squares
Let us assume that we do not regularize and do not use sample weights. Then, the MSE can be written as

MSE(w) = (1/n) · Σᵢ₌₁ⁿ (yᵢ − wᵀxᵢ)²

This is quite abstract, so let us write it differently as

MSE(w) = (1/n) · |y − Xw|²

Using matrix calculus, you can take the derivative of this function with respect to w (we assume that the bias term b is included in w):

∇MSE(w) = (2/n) · Xᵀ(Xw − y)

If you set this gradient to zero, you end up with

XᵀXw = Xᵀy
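As a quick sanity check of the gradient formula (on made-up random data), one can compare it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
w = rng.normal(size=k)

def mse(w):
    return np.mean((X @ w - y) ** 2)

# Analytic gradient of the MSE: (2/n)·Xᵀ(Xw − y).
grad = 2 / n * X.T @ (X @ w - y)

# Finite-difference approximation for comparison.
eps = 1e-6
grad_fd = np.array([
    (mse(w + eps * e) - mse(w - eps * e)) / (2 * eps)
    for e in np.eye(k)
])
print(np.max(np.abs(grad - grad_fd)))  # tiny, the two gradients agree
```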
If the (n × k)-matrix X has a rank of k, so does the (k × k)-matrix XᵀX, i.e. it’s invertible. Why? It follows from rank(X) = rank(XᵀX).
In this case, we get the unique solution

w = (XᵀX)⁻¹Xᵀy
Software packages usually do not optimize like this but instead use gradient descent or other iterative techniques because it is faster. Still, the formula is nice and gives us some high-level insights into the problem.
But is this really a minimum? We can find out by computing the Hessian, which is (up to a positive factor) XᵀX. This matrix is positive-semidefinite since wᵀXᵀXw = |Xw|² ≥ 0 for any w. It is even positive-definite since XᵀX is invertible, i.e. 0 is not an eigenvalue, so our optimal w indeed minimizes our problem.
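As a small sketch (with made-up data where X has full column rank, so the assumptions above hold), the closed-form solution agrees with numpy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 100, 5
X = rng.normal(size=(n, k))                       # full column rank (almost surely)
y = X @ rng.normal(size=k) + 0.1 * rng.normal(size=n)

# Closed-form solution of the normal equations: w = (XᵀX)⁻¹Xᵀy.
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# numpy's least-squares routine finds the same minimizer.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))             # True
```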
Perfect Multicollinearity
That was the friendly case. But what happens if X has a rank smaller than k? This can occur if we have two features in our dataset where one is a multiple of the other, e.g. if we use the features height (in m) and height (in cm). Then we have height (in cm) = 100 · height (in m).
It can also occur if we one-hot encode categorical data and do not drop one of the columns. For example, if we have a feature color in our dataset that can be red, green, or blue, then we can one-hot encode it and end up with three columns color_red, color_green, and color_blue. For these features, we have color_red + color_green + color_blue = 1, which (together with the constant column for the bias term) induces perfect multicollinearity as well.
In these cases, the rank of XᵀX is also smaller than k, so this matrix is not invertible.
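Here is a minimal illustration of the height example with made-up numbers: the duplicated feature makes both X and XᵀX rank-deficient.

```python
import numpy as np

rng = np.random.default_rng(3)
height_m = rng.uniform(1.5, 2.0, size=10)
# Two perfectly collinear features: height (in m) and height (in cm).
X = np.column_stack([height_m, 100 * height_m])

print(np.linalg.matrix_rank(X))          # 1 instead of 2
print(np.linalg.matrix_rank(X.T @ X))    # also 1, so XᵀX is singular
```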
End of story.
Or not? Actually, no, because a singular XᵀX can mean two things: the system (XᵀX)w = Xᵀy has
- no solution or
- infinitely many solutions.
It turns out that in our case, we can obtain one solution using the Moore-Penrose inverse. This means that we are in the case of infinitely many solutions, all of them giving us the same (training) mean squared error loss.
If we denote the Moore-Penrose inverse of a matrix A by A⁺, we can solve the linear system of equations as

w = (XᵀX)⁺Xᵀy

To get the other infinitely many solutions, just add any element of the null space of XᵀX to this particular solution.
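A small sketch with the collinear height features from before (made-up data again): the pseudoinverse solution attains the same minimal training MSE as the solution returned by numpy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(4)
height_m = rng.uniform(1.5, 2.0, size=10)
X = np.column_stack([height_m, 100 * height_m])   # perfectly collinear features
y = 3 * height_m + 0.05 * rng.normal(size=10)

# w = (XᵀX)⁺Xᵀy via the Moore-Penrose pseudoinverse.
w_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y

# Any least-squares solution, e.g. from np.linalg.lstsq, has the same training MSE.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.mean((X @ w_pinv - y) ** 2))
print(np.mean((X @ w_lstsq - y) ** 2))            # identical up to rounding
```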
Minimization With Tikhonov Regularization
Recall that we could put a prior distribution on our weights. We then had to minimize

|Xw − y|² + |Γw|²

for some invertible matrix Γ. Following the same steps as in ordinary least squares, i.e. taking the derivative with respect to w and setting the result to zero, the solution is

w = (XᵀX + ΓᵀΓ)⁻¹Xᵀy
The neat part:
XᵀX + ΓᵀΓ is always invertible!
Let us find out why. It suffices to show that the null space of XᵀX + ΓᵀΓ is just {0}. So, let us take a w with (XᵀX + ΓᵀΓ)w = 0. Now, our goal is to show that w = 0.
From (XᵀX + ΓᵀΓ)w = 0 it follows that

0 = wᵀ(XᵀX + ΓᵀΓ)w = |Xw|² + |Γw|²

which in turn implies |Γw| = 0, i.e. Γw = 0. Since Γ is invertible, w has to be 0. Using the same calculation, we can see that the Hessian XᵀX + ΓᵀΓ is positive-definite as well.
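A quick numerical sketch of this claim (again with the made-up collinear height features): XᵀX alone is singular, but XᵀX + ΓᵀΓ is invertible, and for the special case Γ = √λ·I the closed-form solution matches scikit-learn's Ridge.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(5)
height_m = rng.uniform(1.5, 2.0, size=20)
X = np.column_stack([height_m, 100 * height_m])   # perfectly collinear features
y = 3 * height_m + 0.05 * rng.normal(size=20)

lam = 1.0
Gamma = np.sqrt(lam) * np.eye(2)                  # Γ = √λ·I, the ridge special case

print(np.linalg.matrix_rank(X.T @ X))                     # 1: singular
print(np.linalg.matrix_rank(X.T @ X + Gamma.T @ Gamma))   # 2: invertible

# Closed-form Tikhonov solution: w = (XᵀX + ΓᵀΓ)⁻¹Xᵀy.
w_tikhonov = np.linalg.solve(X.T @ X + Gamma.T @ Gamma, X.T @ y)

# scikit-learn's Ridge (without intercept) solves the same problem.
ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(w_tikhonov, ridge.coef_)                    # essentially identical
```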