Mastering the Basics: How Linear Regression Unlocks the Secrets of Complex Models

-

A full explanation of linear regression and how it learns

The Crane Stance. Public Domain image from Openverse

Just as Mr. Miyagi taught young Daniel LaRusso karate through repetitive simple chores, which ultimately transformed him into the Karate Kid, mastering foundational algorithms like linear regression lays the groundwork for understanding the most complex of AI architectures, such as Deep Neural Networks and LLMs.

Through this deep dive into the simple yet powerful linear regression, you will learn many of the basic parts that make up the most advanced models built today by billion-dollar companies.

Linear regression is a simple mathematical method used to understand the relationship between two variables and make predictions. Given some data points, such as the ones below, linear regression attempts to draw the line of best fit through these points. It’s the “wax on, wax off” of data science.

Example of a linear regression model on a graph, tracing the line of best fit through the data points. Image captured by Author

Once this line is drawn, we have a model that we can use to predict new values. In the example above, given a new house size, we could try to predict its price with the linear regression model.

The Linear Regression Formula

Labelled linear regression formula. Image captured by Author

Y is the dependent variable — the value you want to calculate, the house price in the previous example. Its value depends on other variables, hence its name.

X are the independent variables. These are the factors that influence the value of Y. When modelling, the independent variables are the input to the model, and what the model outputs is the prediction, Ŷ.

β are the parameters. We give the name parameter to those values that the model adjusts (or learns) to capture the relationship between the independent variables X and the dependent variable Y. So, as the model is trained, the input of the model will remain the same, but the parameters will be adjusted to better predict the desired output.
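To make the formula concrete, here is a minimal sketch of how a linear regression model produces a prediction. The parameter values and the house features below are made up purely for illustration.

```python
# Sketch of the linear regression formula: Y_hat = b0 + b1*X1 + b2*X2 + ...
def predict(x, params):
    """params[0] is the intercept (beta_0); the remaining parameters
    each multiply one independent variable."""
    return params[0] + sum(p * xi for p, xi in zip(params[1:], x))

params = [50.0, 0.8, 1.2]   # [beta_0, beta_1, beta_2] — hypothetical values
house = [120, 3]            # e.g. size in square metres and number of rooms

print(predict(house, params))  # 50 + 0.8*120 + 1.2*3 = 149.6
```

During training, `params` is what changes; the inputs stay the same.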

Parameter Learning

We require a few things to be able to adjust the parameters and achieve accurate predictions.

  1. Training data — this data consists of input and output pairs. The inputs are fed into the model and, during training, the parameters are adjusted in an attempt to output the target value.
  2. Cost function — also known as the loss function, a mathematical function that measures how well a model’s prediction matches the target value.
  3. Training algorithm — a method used to adjust the parameters of the model to minimise the error as measured by the cost function.

Let’s go over a cost function and a training algorithm that can be used in linear regression.

Mean Squared Error (MSE)

MSE is a commonly used cost function in regression problems, where the goal is to predict a continuous value. This is different from classification tasks, such as predicting the next token in a vocabulary, as in Large Language Models. MSE focuses on numerical differences and is used in a wide variety of regression and neural network problems. This is how you calculate it:

Mean Squared Error (MSE) formula. Image captured by Author
  1. Calculate the difference between the predicted value, Ŷ, and the target value, Y.
  2. Square this difference — ensuring all errors are positive and also penalising large errors more heavily.
  3. Sum the squared differences for all data samples.
  4. Divide the sum by the number of samples, n, to get the average squared error.
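The four steps above can be sketched in a few lines of Python. The target values and predictions below are made up for illustration.

```python
# A small sketch of the four MSE steps (sample values are made up).
y_true = [10.0, 12.0, 14.0]   # target values Y
y_pred = [11.0, 11.0, 16.0]   # model predictions Y_hat

# Steps 1–2: squared differences; steps 3–4: sum and average.
squared_errors = [(yp - yt) ** 2 for yp, yt in zip(y_pred, y_true)]
mse = sum(squared_errors) / len(squared_errors)

print(mse)  # (1 + 1 + 4) / 3 = 2.0
```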

You’ll notice that as our prediction gets closer to the target value, the MSE gets lower, and the further away they are, the larger it grows. Both ways, it changes quadratically because the difference is squared.

Gradient Descent

The idea of gradient descent is that we can travel through the “cost space” in small steps, with the objective of arriving at the global minimum — the lowest value in the space. The cost function evaluates how well the current model parameters predict the target by giving us the loss value. Randomly modifying the parameters doesn’t guarantee any improvement. But, if we examine the gradient of the loss function with respect to each parameter, i.e. the direction of the loss after an update of the parameter, we can adjust the parameters to move towards a lower loss, indicating that our predictions are getting closer to the target values.

Labelled graph showing the key concepts of the gradient descent algorithm: the local and global minimum, and how the learning rate advances the position towards a lower cost. Image captured by Author

The steps in gradient descent must be carefully sized to balance progress and precision. If the steps are too large, we risk overshooting the global minimum and missing it entirely. On the other hand, if the steps are too small, the updates become inefficient and time-consuming, increasing the likelihood of getting stuck in a local minimum instead of reaching the desired global minimum.
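The effect of the step size can be demonstrated on a toy cost function. The sketch below uses J(x) = x², whose minimum is at x = 0, and three made-up learning rates; the function and values are chosen purely for illustration.

```python
# Gradient descent on J(x) = x^2, whose gradient is 2x and whose minimum is x = 0.
def run(lr, steps=20, x=5.0):
    for _ in range(steps):
        x -= lr * 2 * x   # update rule: x = x - learning_rate * gradient
    return x

print(run(0.001))  # too small: after 20 steps we have barely moved from 5.0
print(run(0.4))    # well sized: converges very close to the minimum at 0
print(run(1.1))    # too large: each step overshoots and the value diverges
```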

Gradient Descent Formula

Labelled gradient descent formula. Image captured by Author

In the context of linear regression, θ could be β0 or β1. The gradient is the partial derivative of the cost function with respect to θ, or in simpler terms, it’s a measure of how much the cost function changes when the parameter θ is slightly adjusted.

A large gradient indicates that the parameter has a significant effect on the cost function, while a small gradient suggests a minor effect. The sign of the gradient indicates the direction of change of the cost function. A negative gradient means the cost function will decrease as the parameter increases, while a positive gradient means it will increase.

So, in the case of a large negative gradient, what happens to the parameter? Well, the negative sign in front of the learning rate cancels with the negative sign of the gradient, resulting in an addition to the parameter. And since the gradient is large, we will be adding a large number to it. So, the parameter is adjusted substantially, reflecting its greater influence on reducing the cost function.
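That sign-cancellation can be shown in a one-line update. The starting value, learning rate, and gradient below are all made up for illustration.

```python
# Sketch of the update rule: theta = theta - learning_rate * gradient.
theta = 0.5
learning_rate = 0.1
gradient = -8.0   # a large negative gradient (hypothetical value)

# The two negative signs cancel, so the parameter increases substantially.
theta = theta - learning_rate * gradient

print(theta)  # 0.5 - 0.1 * (-8.0) = 1.3
```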

Let’s take a look at the prices of the sponges the Karate Kid used to wash Mr. Miyagi’s car. If we wanted to predict their price (dependent variable) based on their height and width (independent variables), we could model it using linear regression.

We can start with these three training data samples.

Training data for the linear regression example modelling sponge prices. Image captured by Author

Now, let’s use the Mean Squared Error (MSE) as our cost function J, and linear regression as our model.

Formula for the cost function derived from MSE and linear regression. Image captured by Author

The linear regression formula uses X1 and X2 for width and height respectively; notice there are no further independent variables, since our training data doesn’t include any. That is the assumption we make in this example: that the width and height of the sponge are enough to predict its price.

Now, the first step is to initialise the parameters, in this case to 0. We can then feed the independent variables into the model to get our predictions, Ŷ, and check how far these are from our target Y.

Step 0 of the gradient descent algorithm and the calculation of the mean squared error. Image captured by Author

Right now, as you can imagine, the parameters are not very helpful. But we are now ready to use the gradient descent algorithm to update them into more useful ones. First, we need to calculate the partial derivative with respect to each parameter, which requires some calculus, but luckily we only need to do this once in the whole process.

Working out of the partial derivatives of the linear regression parameters. Image captured by Author

With the partial derivatives, we can substitute in the values from our errors to calculate the gradient of each parameter.

Calculation of parameter gradients. Image captured by Author

Notice there wasn’t any need to calculate the MSE itself, since it isn’t directly used in the process of updating the parameters; only its derivative is. It’s also immediately apparent that all gradients are negative, meaning that all parameters can be increased to reduce the cost function. The next step is to update the parameters with a learning rate, which is a hyper-parameter, i.e. a configuration setting in a machine learning model that is specified before the training process begins. Unlike model parameters, which are learned during training, hyper-parameters are set manually and control aspects of the training process. Here we arbitrarily use 0.01.

Parameter updates in the first iteration of gradient descent. Image captured by Author

This has been the final step of our first iteration of gradient descent. We can use these new parameter values to make new predictions and recalculate the MSE of our model.

Last step of the first iteration of gradient descent, and recalculation of the MSE after the parameter updates. Image captured by Author

The new parameters are getting closer to the true sponge prices and have yielded a much lower MSE, but there’s a lot more training left to do. If we iterate through the gradient descent algorithm 50 times, this time using Python instead of doing it by hand — since Mr. Miyagi never said anything about coding — we’ll reach the following values.

Results of several iterations of the gradient descent algorithm, and a graph showing the MSE over the gradient descent steps. Image captured by Author
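A loop like the one behind those results can be sketched as follows. The three (width, height, price) samples below are made up, but generated from the true parameters [1, 2, 3] so that price = 1 + 2·width + 3·height; the gradients use the partial derivatives of the MSE worked out earlier.

```python
# Sketch of 50 iterations of gradient descent for price = b0 + b1*width + b2*height.
# Samples are hypothetical, generated from the true parameters [1, 2, 3].
samples = [((1.0, 1.0), 6.0), ((2.0, 1.0), 8.0), ((3.0, 2.0), 13.0)]
b0, b1, b2 = 0.0, 0.0, 0.0   # initialise all parameters to 0
lr = 0.01                    # learning rate (hyper-parameter)
n = len(samples)

for step in range(50):
    g0 = g1 = g2 = 0.0
    for (x1, x2), y in samples:
        error = (b0 + b1 * x1 + b2 * x2) - y   # Y_hat - Y
        g0 += (2 / n) * error                  # partial derivative wrt b0
        g1 += (2 / n) * error * x1             # partial derivative wrt b1
        g2 += (2 / n) * error * x2             # partial derivative wrt b2
    b0 -= lr * g0                              # update rule for each parameter
    b1 -= lr * g1
    b2 -= lr * g2

print(round(b0, 3), round(b1, 3), round(b2, 3))
```

Running more iterations, or tuning the learning rate, moves the parameters ever closer to the true values.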

Eventually we arrive at a pretty good model. The true values I used to generate those numbers were [1, 2, 3], and after only 50 iterations, the model’s parameters came impressively close. Extending the training to 200 steps (the number of iterations is another hyper-parameter) with the same learning rate allowed the linear regression model to converge almost perfectly to the true parameters, demonstrating the power of gradient descent.

Many of the basic concepts that make up the complicated martial art of artificial intelligence, like cost functions and gradient descent, can be thoroughly understood just by studying the simple “wax on, wax off” tool that linear regression is.

Artificial intelligence is a vast and complex field, built upon many ideas and methods. While there is much more to explore, mastering these fundamentals is a significant first step. Hopefully, this article has brought you closer to that goal, one “wax on, wax off” at a time.
