Understanding Gradient Descent for Machine Learning
What’s Loss Function?
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-Batch Gradient Descent
Conclusion

A deep dive into Batch, Stochastic, and Mini-Batch Gradient Descent algorithms using Python

Photo by Lucas Clara on Unsplash

Gradient descent is one of the most widely used optimization algorithms in machine learning and deep learning models such as linear regression, logistic regression, and neural networks. It uses first-order derivatives iteratively to minimize the cost function by updating the model coefficients (for regression) and weights (for neural networks).

In this article, we will delve into the mathematical theory of gradient descent and explore how to perform the calculations using Python. We will examine several implementations, including Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, and assess their effectiveness on a range of test cases.

While following the article, you can check out the Jupyter Notebook on my GitHub for the complete analysis and code.

Before diving into gradient descent, let's first go through the loss function.

What's Loss Function?

The terms loss function and cost function are used interchangeably to describe the error in a prediction. A loss value indicates how different a prediction is from the actual value, and the loss function aggregates all of the loss values from multiple data points into a single number.

As you can see in the image below, the model on the left has high loss, whereas the model on the right has low loss and fits the data better.

High loss vs. low loss (blue lines) relative to the corresponding regression line in yellow.

The loss function (J) is used as a performance measure for prediction algorithms, and the main goal of a predictive model is to minimize its loss function, which is determined by the values of the model parameters (i.e., θ0 and θ1).

For example, linear regression models frequently use squared loss to compute the loss value, and mean squared error is the loss function that averages all of the squared losses.

Squared Loss value (L2 Loss) and Mean Squared Error (MSE)
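As a quick illustration, here is a minimal NumPy sketch of both quantities; the data values are made up purely for this example and are not taken from the article's notebook.

```python
import numpy as np

# Toy data, chosen only for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])   # actual target values
y_pred = np.array([2.5, 5.5, 6.0, 9.5])   # model predictions

squared_loss = (y_true - y_pred) ** 2      # L2 loss for each data point
mse = squared_loss.mean()                  # MSE averages the squared losses

print(squared_loss)  # [0.25 0.25 1.   0.25]
print(mse)           # 0.4375
```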

The linear regression model works behind the scenes by going through several iterations to optimize its coefficients, eventually reaching the lowest possible mean squared error.

What’s Gradient Descent?

The gradient descent algorithm is often described with a mountain analogy:

⛰ Imagine yourself standing atop a mountain with limited visibility, and you want to reach the ground. While descending, you will encounter slopes and pass over them using larger or smaller steps. Once you reach a slope that is almost flat, you know you have arrived at the lowest point. ⛰

In technical terms, the gradient refers to these slopes. When the slope is zero, it may indicate that you have reached a function's minimum or maximum value.

As in the mountain analogy, gradient descent minimizes the starting loss value by taking repeated steps in the opposite direction of the gradient in order to reduce the loss function.

At any given point on a curve, the steepness of the slope can be determined by a tangent line — a straight line that touches the point (red lines in the image above). Similar to the tangent line, the gradient of a point on the loss function is calculated with respect to the parameters, and a small step is taken in the opposite direction to reduce the loss.

To summarize, the process of gradient descent can be broken down into the following steps (a minimal sketch follows the list):

  1. Select a starting point for the model parameters.
  2. Compute the gradient of the cost function with respect to the parameters and iteratively adjust the parameter values to minimize the cost function.
  3. Repeat step 2 until the cost function no longer decreases or the maximum number of iterations is reached.
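Below is a minimal, generic sketch of this loop; the function name, stopping tolerance, and test function are illustrative choices, not code from the notebook.

```python
import numpy as np

def gradient_descent(gradient_fn, start, learning_rate=0.1, max_iter=1000, tol=1e-6):
    """Generic gradient descent following the three steps above."""
    params = np.asarray(start, dtype=float)         # step 1: starting point
    for _ in range(max_iter):                       # step 3: iteration budget
        step = learning_rate * gradient_fn(params)  # step 2: gradient of the cost
        if np.all(np.abs(step) < tol):              # stop once updates become negligible
            break
        params = params - step                      # move against the gradient
    return params

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
minimum = gradient_descent(lambda x: 2 * (x - 3), start=[0.0])
print(minimum)  # close to [3.]
```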

We can examine the gradient calculation for the previously defined cost (loss) function. Although we are using linear regression with an intercept and a coefficient, this reasoning can be extended to regression models incorporating several variables.

Linear regression function with 2 parameters, cost function, and objective function
Partial derivatives calculated wrt model parameters
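As a sketch of how these partial derivatives could be computed in NumPy for the simple model ŷ = θ0 + θ1·x (the variable names and toy data below are illustrative, not taken from the notebook):

```python
import numpy as np

def mse_gradients(theta0, theta1, x, y):
    """Partial derivatives of the MSE cost wrt theta0 (intercept) and theta1 (slope)."""
    y_pred = theta0 + theta1 * x        # model prediction
    error = y_pred - y                  # residuals
    d_theta0 = 2 * error.mean()         # dJ/d(theta0)
    d_theta1 = 2 * (error * x).mean()   # dJ/d(theta1)
    return d_theta0, d_theta1

# One gradient-descent update on toy data generated from y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
theta0, theta1, lr = 0.0, 0.0, 0.05

g0, g1 = mse_gradients(theta0, theta1, x, y)
theta0, theta1 = theta0 - lr * g0, theta1 - lr * g1
print(theta0, theta1)  # 0.6 1.75 after the first step
```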

💡 Sometimes, the point that has been reached may only be a local minimum or a plateau. In such cases, the model must continue iterating until it reaches the global minimum. Reaching the global minimum is unfortunately not guaranteed, but with a proper number of iterations and a suitable learning rate we can increase the chances.

When using gradient descent, it is important to be aware of the potential challenge of stopping at a local minimum or on a plateau. To avoid this, it is essential to choose the appropriate number of iterations and learning rate. We will discuss this further in the following sections.

The learning_rate is the hyperparameter of gradient descent that defines the size of the learning step. It can be tuned using hyperparameter tuning techniques.

  • If the learning_rate is set too high, it can lead to a jump that produces a loss value greater than the starting point. A high learning_rate can cause gradient descent to diverge, leading it to repeatedly obtain higher loss values and preventing it from finding the minimum.
Example case: A high learning rate causes GD to diverge
  • If the learning_rate is set too low, it can result in a lengthy computation process in which gradient descent iterates through numerous rounds of gradient calculations to reach the minimum loss value.
Example case: A low learning rate causes GD to take an excessive amount of time to converge

The size of the learning step is determined by the slope of the curve, which means that as we approach the minimum point, the learning steps become smaller.

When using low learning rates, the progress made will be steady, whereas high learning rates may yield rapid early progress but leave the model stuck at a suboptimal loss value.

Image adapted from https://cs231n.github.io/neural-networks-3/
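The effect is easy to reproduce with a small experiment on f(x) = x²; the learning-rate values below are arbitrary choices for demonstration only.

```python
def run_gd(learning_rate, start=5.0, n_iter=20):
    """Run gradient descent on f(x) = x**2 (gradient 2x) and return the final position."""
    x = start
    for _ in range(n_iter):
        x = x - learning_rate * 2 * x   # gradient step
    return x

for lr in (1.1, 0.01, 0.4):             # too high, too low, reasonable
    print(f"learning_rate={lr}: final x = {run_gd(lr):.4f}")
```

With the high rate the iterates diverge (the final value ends up far from the minimum), the very low rate makes only slow progress after 20 steps, and the moderate rate lands essentially at zero.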
