Most people think that linear regression is about fitting a line to data.
But mathematically, that's not what it's doing.
It's finding the closest possible vector to your target within the
space spanned by your features.
To see this, we need to change how we look at our data.
In Part 1, we built a basic idea of what a vector is and explored the concepts of dot products and projections.
Now, let's apply those concepts to solve a linear regression problem.
We have this data.
The Usual Way: Feature Space
When we try to understand linear regression, we usually start with a scatter plot of the independent variable against the dependent variable.
Each point on this plot represents a single row of data. We then try to fit a line through these points, with the goal of minimizing the sum of squared residuals.
To solve this mathematically, we write down the cost function and apply differentiation to find the exact formulas for the slope and intercept.
As we discussed in my earlier multiple linear regression (MLR) blog, this is the usual way to understand the problem.
This is what we call the feature space.

After all that work, we get values for the slope and the intercept. Here, let's pause and notice something.
Say ŷᵢ is the predicted value at a certain point. We have the slope and intercept, and now, given our data, we want to predict the price.
If ŷ₁ is the predicted price for House 1, we calculate it using

\[
\hat{y}_1 = \beta_0 + \beta_1 \cdot \text{size}_1
\]

What have we done here? We took the size value and scaled it by a certain number, which we call the slope (β₁), to get as close to the actual price as possible.
We also added an intercept (β₀) as a base value.
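This scale-and-shift reading can be sketched in a couple of lines. The coefficient values below are made up purely for illustration; we haven't solved for the real β₀ and β₁ yet:

```python
# Hypothetical coefficients for illustration only (the real ones come later)
beta_0 = 3.0  # intercept: the base value
beta_1 = 2.0  # slope: how much each unit of size adds to the price

def predict(size):
    # scale the size by the slope, then add the base value
    return beta_0 + beta_1 * size

print(predict(2))  # 7.0
```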
Keep this point in mind; now let's move to a different perspective.
A Shift in Perspective
Let's look at our data again.
Now, instead of treating Price and Size as axes, let's treat each house as an axis.
We have three houses, which means we can treat House A as the X-axis, House B as the Y-axis, and House C as the Z-axis.
Then, we simply plot our points.

When we treat the size and price columns as axes, we get three points, where each point represents the size and price of a single house.
However, when we treat each house as an axis, we get two points in a three-dimensional space.
One point represents the sizes of all three houses, and the other represents the prices of all three houses.
This is what we call the column space, and this is where linear regression actually happens.
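To make the two views concrete, here is a minimal sketch using the toy numbers from this post (sizes 1, 2, 3 and prices 4, 8, 9): read row-wise, the table gives three points in feature space; read column-wise, it gives two points in a 3-D "house space".

```python
# Each row is one house: (size, price)
data = [(1, 4), (2, 8), (3, 9)]

# Feature space: one 2-D point per house (size on one axis, price on the other)
feature_space_points = data

# Column space: one axis per house, one point per column of the table
size_vector  = tuple(size for size, price in data)   # (1, 2, 3)
price_vector = tuple(price for size, price in data)  # (4, 8, 9)

print(size_vector)   # (1, 2, 3)
print(price_vector)  # (4, 8, 9)
```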
From Points to Directions
Now let's connect our two points to the origin and call them vectors.

Okay, let's slow down and look at what we have done and why.
Instead of a standard scatter plot where size and price are the axes (Feature Space), we treated each house as an axis and plotted the points (Column Space).
We are now saying that linear regression happens in this Column Space.
You might be thinking: wait, we learn and understand linear regression using the standard scatter plot, where we minimize the residuals to find a best-fit line.
Yes, that's correct! But in Feature Space, linear regression is solved using calculus. We get the formulas for the slope and intercept using partial differentiation.
If you remember my previous blog on MLR, we derived the formulas for the slopes and intercept when we had two features and a target variable.
You saw how messy it was to derive those formulas using calculus. Now imagine having 50 or 100 features; it quickly becomes unmanageable.
By switching to Column Space, we change the lens through which we view regression.
We look at our data as vectors and use the concept of projections. The geometry stays exactly the same whether we have 2 features or 2,000.
So, if calculus gets that messy, what is the real benefit of this unchanging geometry? Let's discuss exactly what happens in Column Space.
Why This Perspective Matters
Now that we have an idea of what Feature Space and Column Space are, let's focus on the plot.
We have two points, where one represents the sizes and the other represents the prices of the houses.
Why did we connect them to the origin and call them vectors?
Because, as we already discussed, in linear regression we are finding a number (which we call the slope or weight) to scale our independent variable.
We want to scale the Size so it gets as close to the Price as possible, minimizing the residual.
You cannot visually scale a lone point floating in space; you can only scale something that has a length and a direction.
By connecting the points to the origin, they become vectors. Now they have both magnitude and direction, and we already know that we can scale vectors.

Okay, we established that we treat these columns as vectors because we can scale them, but there is something even more important to learn here.
Look at our two vectors: the Size vector and the Price vector.
First, the Size vector (1, 2, 3) points in a very specific direction, determined by the pattern of its numbers.
From this vector, we can read that House 2 is twice as large as House 1, and House 3 is three times as large.
That specific 1:2:3 ratio forces the Size vector to point in one exact direction.
Now, the Price vector points in a slightly different direction than the Size vector, based on its own numbers.
The direction of an arrow simply shows us the pure, underlying pattern of a feature across all our houses.
If our prices were exactly (2, 4, 6), the Price vector would lie in exactly the same direction as the Size vector. That would mean size is a perfect, direct predictor of price.

But in real life, this almost never happens. The price of a house does not depend on size alone; many other factors affect it, which is why the Price vector points slightly away.
The angle between the two vectors (1, 2, 3) and (4, 8, 9) represents the real-world noise.
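We can put a number on that noise. Here is a small sketch, using only the toy vectors from this post, that measures the angle between the two directions via the dot-product formula from Part 1:

```python
import math

size  = [1, 2, 3]
price = [4, 8, 9]

def angle_degrees(u, v):
    # angle between two vectors: cos(theta) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    cos = dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    return math.degrees(math.acos(min(1.0, cos)))  # clamp guards against float round-off

print(round(angle_degrees(size, price), 1))      # ~8 degrees of real-world noise
print(round(angle_degrees(size, [2, 4, 6]), 1))  # 0.0 — a perfectly aligned price
```

A zero angle would mean size predicts price perfectly; the small but nonzero angle here is exactly the "points slightly away" from the text.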
The Geometry Behind Regression

Now, we use the concept of projections that we learned in Part 1.
Think of our Price vector (4, 8, 9) as a destination we want to reach. However, we have only one direction we can travel: the path of our Size vector (1, 2, 3).
If we travel along the direction of the Size vector, we can't perfectly reach our destination, since it points in a different direction.
But we can travel to the specific point on our path that gets us as close to the destination as possible.
The shortest path dropping from our destination down to that exact point makes a perfect 90-degree angle with our path.
In Part 1, we discussed this idea using the 'highway and home' analogy.
We are applying the very same concept here. The only difference is that in Part 1 we were in a 2D space, and here we are in a 3D space.
I referred to the feature as a 'way' or a 'highway' because we only have one direction to travel.
This distinction between a 'way' and a 'direction' will become much clearer later, when we add multiple directions!
A Simple Way to See This
We can already see that this is the very same concept as vector projection.
We derived a formula for this in Part 1. So, why wait?
Let's just apply the formula, right?
No. Not yet.
There is something crucial we need to understand first.
In Part 1, we were dealing with a 2D space, so we used the highway and home analogy. But here, we are in a 3D space.
To understand it better, let's use a new analogy.
Think of this 3D space as a physical room. There is a lightbulb hovering in the room at the coordinates (4, 8, 9).
The path from the origin to that bulb is our Price vector, which we call the target vector.
We want to reach that bulb, but our movements are restricted.
We can only walk along the direction of our Size vector (1, 2, 3), moving either forward or backward.
Based on what we learned in Part 1, you might say, 'Let's just apply the projection formula to find the closest point on our path to the bulb.'
And you would be right. That is absolutely the closest we can get to the bulb in that direction.
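As a sketch, here is that single-direction projection applied to our toy numbers — the projection formula from Part 1, with no intercept yet:

```python
size  = [1, 2, 3]
price = [4, 8, 9]

# Projection of price onto the line spanned by size:
# scale = (size . price) / (size . size)
scale = sum(s * p for s, p in zip(size, price)) / sum(s * s for s in size)
closest = [scale * s for s in size]  # the nearest reachable point on our path

print(round(scale, 3))                 # 3.357 — the best slope with no intercept
print([round(c, 3) for c in closest])  # [3.357, 6.714, 10.071]
```

This is as close as we can get to the bulb while walking along a single direction; the next sections explain why that is not yet good enough.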
Why We Need a Base Value
But before we move forward, we should notice one more thing.
We already discussed that we are finding a single number (a slope) to scale our Size vector so we can get as close to the Price vector as possible. We can capture this with a simple equation:
Price = β₁ × Size
But what if the size is zero? Whatever the value of β₁, we get a predicted price of zero.
Is this right? We are saying that if the size of a house is 0 square feet, its price is 0 dollars.
That is not correct, because every house should have a base value. Why?
Because even when there is no physical building, the empty plot of land it sits on still has a price. The price of the final home depends heavily on this base plot price.
We call this base value β₀. In traditional algebra, we already know it as the intercept, the term that shifts a line up and down.
So, how do we add a base value in our 3D room? By adding a Base vector.
Combining Directions

We have now added a Base vector (1, 1, 1), but what does this base vector actually do?
From the plot above, we can see that by adding a base vector, we have one more direction to move in within the space.
We can move in both the direction of the Size vector and the direction of the Base vector.
Don't confuse them with "ways"; they are directions, and this will become clear once we reach a point by moving along both of them.
Without the Base vector, our base value was zero. We started with a base value of zero for every house. Now that we have a Base vector, let's first move along it.
For example, let's move 3 steps in the direction of the Base vector. By doing so, we reach the point (3, 3, 3). This means the base value of every house is 3 dollars, and our new starting point is (3, 3, 3), from which we want to get as close as possible to our Price vector.
Next, let's move 2 steps in the direction of our Size vector (1, 2, 3). That means adding 2 × (1, 2, 3) = (2, 4, 6).
Therefore, from (3, 3, 3), we move 2 units along the House A axis, 4 units along the House B axis, and 6 units along the House C axis.
Basically, we are adding vectors here, and the order doesn't matter.
Whether we move along the Base vector or the Size vector first, we end up at the very same point. We just moved along the Base vector first to make the idea easier to follow!
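The walk above is plain vector addition, so the order really doesn't matter. A quick sketch with the same numbers (3 steps along the Base vector, 2 along the Size vector):

```python
base = (1, 1, 1)
size = (1, 2, 3)

def step(start, direction, amount):
    # move `amount` steps along `direction`, starting from `start`
    return tuple(s + amount * d for s, d in zip(start, direction))

origin = (0, 0, 0)
base_first = step(step(origin, base, 3), size, 2)  # base vector first, then size
size_first = step(step(origin, size, 2), base, 3)  # size vector first, then base

print(base_first)                # (5, 7, 9)
print(base_first == size_first)  # True: vector addition commutes
```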

The Space of All Possible Predictions
In this way, we use both directions to get as close as possible to our Price vector. In the earlier example, we scaled the Base vector by 3, which means β₀ = 3, and we scaled the Size vector by 2, which means β₁ = 2.
From this, we can see that we need the best combination of β₀ and β₁: how many steps to travel along the Base vector and how many along the Size vector to reach the point closest to our Price vector.
If we try all the different combinations of β₀ and β₁, we get infinitely many points. Let's see what that looks like.

We can see that all the points formed by the different combinations of β₀ and β₁ along the Base and Size directions form a flat 2D plane in our 3D space.
Now, we have to find the point on that plane that is nearest to our Price vector.
We already know how to get to that point. As we discussed in Part 1, we find the shortest path using geometric projection.
We need the exact point on the plane that is nearest to the Price vector.
We covered this in Part 1 with our 'home and highway' analogy, where the shortest path from the highway to the home formed a 90-degree angle with the highway.
There, we moved along a single direction, but here we are moving on a 2D plane. The rule, however, stays the same.
The shortest distance between the tip of our Price vector and a point on the plane is where the path between them forms a perfect 90-degree angle with the plane.

From a Point to a Vector
Before we dive into the maths, allow us to make clear exactly what is going on in order that it feels easy to follow.
Until now, we’ve got been talking about finding the particular point on our plane that’s closest to the tip of our goal price vector. But what will we actually mean by this?
To succeed in that time, we’ve got to travel across our plane.
We do that by moving along our two available directions, that are our Base and Size vectors, and scaling them.
If you scale and add two vectors together, the result’s all the time a vector!
If we draw a straight line from the middle on the origin on to that exact point on the plane, we create what known as the Prediction Vector.
Moving along this single Prediction Vector gets us to the very same destination as taking those scaled steps along the Base and Size directions.
Vector Subtraction
Now we have two vectors.
We want to know the exact difference between them. In linear algebra, we find this difference using vector subtraction.
When we subtract our Prediction vector from our Target vector, the result is our Residual vector, also known as the Error vector.
This is why that dotted red line is not just a measurement of distance. It is a vector itself!
In feature space, we try to minimize the sum of squared residuals. Here, by finding the point on the plane closest to the Price vector, we are indirectly looking for where the length of the residual vector is smallest!
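The two perspectives measure the same quantity: the squared length of the residual vector in column space equals the sum of squared residuals from feature space. A sketch using the candidate prediction from the earlier walk-through (β₀ = 3, β₁ = 2):

```python
price      = [4, 8, 9]
prediction = [5, 7, 9]  # from beta_0 = 3, beta_1 = 2 in the earlier example

# The residual (error) vector: target minus prediction
residual = [p - q for p, q in zip(price, prediction)]

# Column-space view: squared length of the residual vector
squared_length = sum(r * r for r in residual)

# Feature-space view: sum of squared residuals over the three houses
ssr = sum((p - q) ** 2 for p, q in zip(price, prediction))

print(residual)             # [-1, 1, 0]
print(squared_length, ssr)  # 2 2 — the same number, seen from two perspectives
```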
Linear Regression Is a Projection
Now let's do the math.
Let's start by representing everything in matrix form.

\[
X =
\begin{bmatrix}
1 & 1 \\
1 & 2 \\
1 & 3
\end{bmatrix}
\quad
y =
\begin{bmatrix}
4 \\ 8 \\ 9
\end{bmatrix}
\quad
\beta =
\begin{bmatrix}
b_0 \\ b_1
\end{bmatrix}
\]

Here, the columns of \(X\) represent the Base and Size directions, and we are trying to combine them to reach \(y\):

\[
\hat{y} = X\beta
= b_0
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+ b_1
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
\]

Every prediction is just a combination of these two directions.

\[
e = y - X\beta
\]

This error vector is the gap between where we want to be and where we actually reach. For this gap to be the shortest possible, it must be perfectly perpendicular to the plane formed by the columns of \(X\):

\[
X^T e = 0
\]

Now we substitute \(e\) into this condition:

\[
X^T (y - X\beta) = 0
\]
\[
X^T y - X^T X \beta = 0
\]
\[
X^T X \beta = X^T y
\]

Solving for \(\beta\), we get the equation:

\[
\beta = (X^T X)^{-1} X^T y
\]

Now we compute each part step by step.

\[
X^T =
\begin{bmatrix}
1 & 1 & 1 \\
1 & 2 & 3
\end{bmatrix}
\quad
X^T X =
\begin{bmatrix}
3 & 6 \\
6 & 14
\end{bmatrix}
\quad
X^T y =
\begin{bmatrix}
21 \\ 47
\end{bmatrix}
\]

Computing the inverse of \(X^T X\):

\[
(X^T X)^{-1}
= \frac{1}{3 \times 14 - 6 \times 6}
\begin{bmatrix}
14 & -6 \\
-6 & 3
\end{bmatrix}
= \frac{1}{6}
\begin{bmatrix}
14 & -6 \\
-6 & 3
\end{bmatrix}
\]

Now multiply this by \(X^T y\):

\[
\beta =
\frac{1}{6}
\begin{bmatrix}
14 & -6 \\
-6 & 3
\end{bmatrix}
\begin{bmatrix}
21 \\ 47
\end{bmatrix}
= \frac{1}{6}
\begin{bmatrix}
14 \cdot 21 - 6 \cdot 47 \\
-6 \cdot 21 + 3 \cdot 47
\end{bmatrix}
= \frac{1}{6}
\begin{bmatrix}
12 \\ 15
\end{bmatrix}
=
\begin{bmatrix}
2 \\ 2.5
\end{bmatrix}
\]

With these values, we can finally compute the exact point on the plane:

\[
\hat{y} =
2
\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
+ 2.5
\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}
=
\begin{bmatrix} 4.5 \\ 7.0 \\ 9.5 \end{bmatrix}
\]

And this point is the closest possible point on the plane to our target.
We got the point (4.5, 7.0, 9.5). That is our prediction.
This point is the closest to the tip of the Price vector, and to reach it, we move 2 steps along the Base vector, which is our intercept, and 2.5 steps along the Size vector, which is our slope.
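The whole derivation can be checked in a few lines of plain Python. This is a minimal sketch that hand-rolls the 2×2 algebra (rather than calling a library) so every step of the Normal Equation stays visible:

```python
# Design matrix: a Base column and a Size column
X = [[1, 1],
     [1, 2],
     [1, 3]]
y = [4, 8, 9]

# X^T X (2x2) and X^T y (2-vector)
XtX = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(2)]

# Solve (X^T X) beta = X^T y using the 2x2 inverse
det  = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
beta = [(XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
        (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]
print(beta)  # [2.0, 2.5]

# Prediction = the projection of y onto the column space of X
y_hat = [r[0] * beta[0] + r[1] * beta[1] for r in X]
print(y_hat)  # [4.5, 7.0, 9.5]

# The residual is perpendicular to both columns: X^T e = 0
e = [yi - yh for yi, yh in zip(y, y_hat)]
print([round(sum(r[i] * ei for r, ei in zip(X, e)), 10) for i in range(2)])  # [0.0, 0.0]
```

Note how the last line confirms the geometry: the residual is perpendicular to both columns of X, which is exactly the condition Xᵀe = 0 we started from.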
What Changed Was the Perspective
Let's recap what we did in this blog. We did not follow the usual method of solving the linear regression problem, the calculus method, where we differentiate the loss function to get the equations for the slope and intercept.
Instead, we chose another route: the method of vectors and projections.
We started with a Price vector, and we needed to build a model that predicts the price of a house based on its size.
In terms of vectors, that meant we initially had only one direction to move in to predict the price of the house.
Then, we added the Base vector, realizing there should be a baseline starting value.
Now we had two directions, and the question became: how close can we get to the tip of the Price vector by moving in those two directions?
We are not just fitting a line; we are working inside a space.
In feature space: we minimize error.
In column space: we drop perpendiculars.
By taking different combinations of the slope and intercept, we got infinitely many points that form a plane.
The closest point, which we needed to find, lies somewhere on that plane, and we found it using projections and the dot product.
Through that geometry, we found the right point and derived the Normal Equation!
You may ask, "Don't we get this normal equation using calculus as well?" You are exactly right! That is the calculus view, but here we took the geometric linear algebra view to truly understand the geometry behind the math.
Linear regression is not just optimization.
It is projection.
I hope you learned something from this blog!
If you think something is missing or could be improved, feel free to leave a comment.
If you haven't read Part 1 yet, you can read it here. It covers the basic geometric intuition behind vectors and projections.
Thanks for reading!
