Regression, finally!
For Day 11, I have waited many days to present this model. It marks the start of a new journey in this “Advent Calendar”.
Until now, we have mostly looked at models based on distances, neighbors, or local density. As you may know, for tabular data, decision trees, and especially ensembles of decision trees, are very performant.
But starting today, we switch to a different perspective: the weighted approach.
Linear Regression is our first step into this world.
It looks simple, but it introduces the core ingredients of modern ML: loss functions, gradients, optimization, scaling, collinearity, and interpretation of coefficients.
Now, when I say Linear Regression, I mean Ordinary Least Squares Linear Regression. As we progress through this “Advent Calendar” and explore related models, you will see why it is important to specify this, since the name “linear regression” can be confusing.
Some people say that Linear Regression is not machine learning.
Their argument is that machine learning is a “recent” field, while Linear Regression existed long before, so it cannot be considered ML.
This is misleading.
Linear Regression fits perfectly inside machine learning because:
- it learns parameters from data,
- it minimizes a loss function,
- it makes predictions on new data.
In other words, Linear Regression is one of the oldest models, but also one of the most fundamental in machine learning.
This weight-learning, loss-minimizing approach is the one used in:
- Linear Regression,
- Logistic Regression,
- and, later, Neural Networks and LLMs.
For deep learning, this weighted, gradient-based approach is the one that is used everywhere.
And in modern LLMs, we are no longer talking about a handful of parameters. We are talking about billions of weights.
In this article, our Linear Regression model has exactly 2 weights.
A slope and an intercept.
That’s all.
But we have to start somewhere, right?
And here are a few questions you can keep in mind as we progress through this article, and in the ones to come.
- We will try to interpret the model. With one feature, y = ax + b, we all know that a is the slope and b is the intercept. But how do we interpret the coefficients when there are 10, 100, or more features?
- Why is collinearity between features such a problem for Linear Regression? And how can we solve this issue?
- Is scaling necessary for Linear Regression?
- Can Linear Regression overfit?
- And how are the other models of this weighted family (Logistic Regression, SVM, Neural Networks, Ridge, Lasso, etc.) all connected to the same underlying ideas?
These questions form the thread of this article and will naturally lead us toward future topics in the “Advent Calendar”.
Understanding the Trend line in Excel
Starting with a Simple Dataset
Let us begin with a very simple dataset that I generated with one feature.
In the graph below, you can see the feature variable x on the horizontal axis and the target variable y on the vertical axis.
The goal of Linear Regression is to find two numbers, a and b, such that we can write the relationship:
y = a·x + b
Once we know a and b, this equation becomes our model.
Creating the Trend Line in Excel
In Google Sheets or Excel, you can simply add a trend line to visualize the best linear fit.
That already gives you the result of Linear Regression.

But the purpose of this article is to compute these coefficients ourselves.
If we want to use the model to make predictions, we need to implement it directly.
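Before diving into the math, here is a minimal sketch of what that looks like in Python, assuming a small synthetic dataset (the scikit-learn call simply reproduces what the spreadsheet trend line computes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Small synthetic one-feature dataset (a stand-in for the spreadsheet data)
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=30)
y = 1.2 * x + 3.0 + rng.normal(0, 1.0, size=30)  # "true" slope 1.2, intercept 3.0

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

print("slope a     ≈", model.coef_[0])
print("intercept b ≈", model.intercept_)
print("prediction for x = 5:", model.predict([[5.0]])[0])
```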

Introducing Weights and the Cost Function
A Note on Weight-Based Models
This is the first time in the Advent Calendar that we introduce weights.
Models that learn weights are sometimes called parametric discriminant models.
Why discriminant?
Because they learn a rule that directly separates or predicts, without modeling how the data was generated.
Before this chapter, we already saw models that had parameters, but they weren’t discriminant, they were generative.
Let us recap quickly.
- Decision Trees use splits, or rules, so there are no weights to learn. They are non-parametric models.
- k-NN is not really a model: it keeps the whole dataset and uses distances at prediction time.
However, when we move from Euclidean distance to Mahalanobis distance, something interesting happens…
LDA and QDA estimate parameters:
- the means of each class
- covariance matrices
- priors
These are real parameters, but they are not weights.
These models are generative because they model the density of each class, and then use it to make predictions.
So even though they are parametric, they do not belong to the weight-based family.
And as you can see, these are all classifiers, and they estimate parameters for each class.

Linear Regression is our first example of a model that learns weights to build a prediction.
This is the start of a new family in the Advent Calendar:
models that rely on weights + a loss function to make predictions.
The Cost Function
How do we obtain the parameters a and b?
Well, the optimal values of a and b are the ones that minimize the cost function, which is based on the Squared Error of the model.
So for each data point, we can calculate the Squared Error.
Squared Error = (prediction − real value)² = (a·x + b − real value)²
Then we can calculate the MSE, or Mean Squared Error, by averaging the squared errors over all data points.
As we can see in Excel, the trend line gives us the optimal coefficients. If you manually change these values, even slightly, the MSE will increase.
This is exactly what “optimal” means here: any other combination of a and b makes the error worse.
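To make this concrete, here is a small sketch in Python (the dataset is made up for illustration, and the optimal coefficients are computed with the closed-form formulas presented in the next section): nudging the optimal a or b always increases the MSE.

```python
import numpy as np

def mse(a, b, x, y):
    """Mean Squared Error of the line y_hat = a*x + b."""
    return np.mean((a * x + b - y) ** 2)

# Tiny illustrative dataset
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Optimal coefficients (closed-form solution, detailed in the next section)
a_opt = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b_opt = y.mean() - a_opt * x.mean()

print("optimal MSE:           ", mse(a_opt, b_opt, x, y))
print("slope nudged by +0.05: ", mse(a_opt + 0.05, b_opt, x, y))
print("intercept nudged +0.05:", mse(a_opt, b_opt + 0.05, x, y))
```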

The classic closed-form solution
Now that we know what the model is, and what it means to minimize the squared error, we can finally answer the key question:
How do we compute the two coefficients of Linear Regression, the slope and the intercept?
There are two ways to do it:
- the exact algebraic solution, known as the closed-form solution,
- or gradient descent, which we will explore just after.
If we take the definition of the MSE and differentiate it with respect to a and b, something beautiful happens: everything simplifies into two very compact formulas.

These formulas only use:
- the average of x and y,
- how x varies (its variance),
- and how x and y vary together (their covariance).
So even without knowing any calculus, and with only basic spreadsheet functions, we can reproduce the exact solution used in statistics textbooks.
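As a sketch, here is the same closed-form computation in Python on a made-up dataset; only means, a variance, and a covariance are needed:

```python
import numpy as np

# Made-up one-feature dataset for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form OLS with one feature:
#   a = cov(x, y) / var(x)
#   b = mean(y) - a * mean(x)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

print(f"slope a = {a:.4f}, intercept b = {b:.4f}")
```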
How to interpret the coefficients
For one feature, interpretation is easy and intuitive:
The slope a
It tells us how much y changes when x increases by one unit.
If the slope is 1.2, it means: when x increases by 1, y increases on average by 1.2.
The intercept b
It is the predicted value of y when x = 0.
Often, x = 0 does not exist in the real context of the data, so the intercept is not always meaningful by itself.
Its role is mostly to match the center of the data.
This is often how Linear Regression is taught:
a slope, an intercept, and a straight line.
With one feature, interpretation is simple.
With two, still manageable.
But as soon as we start adding many features, it becomes harder.
Tomorrow, we will discuss interpretation further.
Today, we will do gradient descent.
Gradient Descent, Step by Step
After seeing the classic algebraic solution for Linear Regression, we can now explore the other essential tool behind modern machine learning: optimization.
The workhorse of optimization is Gradient Descent.
Understanding it on a very simple example makes the logic much clearer when we apply it to Linear Regression.
A Gentle Warm-Up: Gradient Descent on a Single Variable
Before implementing gradient descent for Linear Regression, we can first do it for a simple function: f(x) = (x − 2)².
We all know the minimum is at x = 2.
But let us pretend we do not know that, and let the algorithm discover it by itself.
The idea is to find the minimum of this function using the following process:
- First, we randomly choose an initial value.
- Then, at each step, we calculate the value of the derivative df at the current x: df(x).
- And the next value of x is obtained by subtracting the derivative multiplied by a step size: x = x − step_size · df(x).
You can modify the two parameters of the gradient descent: the initial value of x and the step size.
Yes, even with an initial value of 100, or 1000, it converges. It is quite surprising to see how well it works.
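Here is a minimal sketch of that loop in Python (the initial value and step size are arbitrary choices you can play with):

```python
def f(x):
    return (x - 2) ** 2

def df(x):
    # Derivative of (x - 2)^2
    return 2 * (x - 2)

x = 100.0        # initial value (try 1000 as well)
step_size = 0.1  # learning rate

for _ in range(50):
    x = x - step_size * df(x)

print("x after 50 steps:", x)  # converges towards 2
```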

But in some cases, gradient descent will not work. For instance, if the step size is too big, the value of x can explode.

Gradient descent for linear regression
The principle of the gradient descent algorithm is the same for Linear Regression: we have to calculate the partial derivatives of the cost function with respect to the parameters a and b. Let us note them da and db.
Squared Error = (prediction − real value)² = (a·x + b − real value)²
da = 2 · (a·x + b − real value) · x
db = 2 · (a·x + b − real value)

And then, we can update the coefficients exactly as before: a = a − step_size · da and b = b − step_size · db.

With this tiny update, repeated step by step, the optimal values will be found after a few iterations.
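A compact Python version of this loop could look like the following sketch (the synthetic data, learning rate, and number of steps are my own choices, not the ones from the spreadsheet):

```python
import numpy as np

# Small synthetic dataset
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=10)
y = 1.2 * x + 3.0 + rng.normal(0, 0.5, size=10)

a, b = 0.0, 0.0    # initial coefficients
step_size = 0.01   # learning rate
n_steps = 5000

for _ in range(n_steps):
    residuals = a * x + b - y        # prediction - real value, for every point
    da = np.mean(2 * residuals * x)  # dMSE/da
    db = np.mean(2 * residuals)      # dMSE/db
    a -= step_size * da
    b -= step_size * db

print(f"a ≈ {a:.3f}, b ≈ {b:.3f}")   # should approach 1.2 and 3.0
```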
In the following graph, you can see how a and b converge towards the target values.

We can also see all the details: y hat, the residuals, and the partial derivatives.
We can fully appreciate the beauty of gradient descent, visualized in Excel.
For these two coefficients, we can observe how quick the convergence is.

Now, in practice, we have many observations, and this has to be done for every data point. That is where things get crazy in Google Sheets. So we use only 10 data points.
You will see that I first created a sheet with long formulas to calculate da and db, which contain the sum of the derivatives over all the observations. Then I created another sheet to show all the details.
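For reference, here is roughly what those per-observation details correspond to in code (the data and the current values of a and b are placeholders, not the spreadsheet's):

```python
import numpy as np

x = np.arange(1.0, 11.0)   # 10 placeholder data points
y = 1.2 * x + 3.0          # placeholder targets

a, b = 1.0, 2.0            # current coefficients at some step of gradient descent

print("   x    y_hat  residual     da_i     db_i")
for xi, yi in zip(x, y):
    y_hat = a * xi + b
    r = y_hat - yi         # residual for this observation
    da_i = 2 * r * xi      # this observation's contribution to da
    db_i = 2 * r           # this observation's contribution to db
    print(f"{xi:5.1f} {y_hat:8.2f} {r:9.2f} {da_i:8.2f} {db_i:8.2f}")

residuals = a * x + b - y
print("da =", np.mean(2 * residuals * x))  # averaged over all observations
print("db =", np.mean(2 * residuals))
```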
Conclusion
Linear Regression may look simple, but it introduces almost everything that modern machine learning relies on.
With just two parameters, a slope and an intercept, it teaches us:
- how to define a cost function,
- how to find optimal parameters numerically,
- and how optimization behaves when we adjust learning rates or initial values.
The closed-form solution shows the elegance of the mathematics.
Gradient Descent shows the mechanics behind the scenes.
Together, they form the foundation of the “weights + loss function” family that includes Logistic Regression, SVM, Neural Networks, and even today’s LLMs.
New Paths Ahead
You might think Linear Regression is simple, but with its foundations now clear, you can extend it, refine it, and reinterpret it through many different perspectives:
- Change the loss function: replace squared error with logistic loss, hinge loss, or other functions, and new models appear.
- Move to classification: Linear Regression itself can separate two classes (0 and 1), but more robust versions lead to Logistic Regression and SVM. And what about multiclass classification?
- Model nonlinearity: through polynomial features or kernels, linear models suddenly become nonlinear in the original space.
- Scale to many features: interpretation becomes harder, regularization becomes essential, and new numerical challenges appear.
- Primal vs dual: linear models can be written in two ways. The primal view learns the weights directly. The dual view rewrites everything using dot products between data points.
- Understand modern ML: Gradient Descent and its variants are the core of neural networks and large language models.
What we learned here with two parameters generalizes to billions.
Everything in this article stays within the boundaries of Linear Regression, yet it prepares the ground for a whole family of future models.
Day after day, the Advent Calendar will show how all these ideas connect.
