The Machine Learning “Advent Calendar” Bonus 2: Gradient Descent Variants in Excel


Machine learning models use gradient descent to find the optimal values of their weights. Linear regression, logistic regression, neural networks, and large language models all depend on this principle. In the previous articles, we used basic gradient descent because it is simpler to show and easier to understand.

The same principle also appears at scale in modern large language models, where training requires adjusting millions or billions of parameters.

However, real training rarely uses the basic version. It is often too slow or too unstable. Modern systems use variants of gradient descent that improve speed, stability, or convergence.

In this bonus article, we focus on these variants. We look at why they exist, what problem they solve, and how they change the update rule. We don’t use a dataset here. We use one variable and one function, only to make the behavior visible. The goal is to show the movement, not to train a model.

1. Gradient Descent and the Update Mechanism

1.1 Problem setup

To make these ideas visible, we won’t use a dataset here, because datasets introduce noise and make it harder to observe the behavior directly. Instead, we’ll use a single function:
f(x) = (x – 2)²

We start at x = 4, and the gradient is:
gradient = 2*(x – 2)

This simple setup removes distractions. The objective is not to train a model, but to understand how the different optimisation rules change the movement toward the minimum.

1.2 The structure behind every update

Every optimisation method that follows in this article is built on the same loop, even when the inner logic becomes more sophisticated.

  • First, we read the current value of x.
  • Then, we compute the gradient with the expression 2*(x – 2).
  • Finally, we update x according to the specific rule defined by the chosen variant.

The destination stays the same and the gradient always points in the right direction, but the way we move along this direction changes from one method to another. This change in movement is the essence of each variant.
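
To make the shared structure concrete, here is a minimal Python sketch of that loop (in Python rather than Excel; the name update_rule and the step count of 50 are illustrative choices, not values from the spreadsheet):

# generic optimisation loop: only the update rule changes between variants
def optimise(update_rule, x=4.0, steps=50):
    state = {}                      # per-variant memory (velocity, cache, ...)
    for _ in range(steps):
        grad = 2 * (x - 2)          # gradient of f(x) = (x - 2)^2
        x = update_rule(x, grad, state)
    return x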

1.3 Basic gradient descent because the baseline

Basic gradient descent applies a direct update based on the current gradient and a fixed learning rate:

x = x – lr * 2*(x – 2)

This is the most intuitive form of learning because the update rule is easy to understand and simple to implement. The method moves steadily toward the minimum, but it often does so slowly, and it can struggle when the learning rate is not chosen carefully. It represents the foundation on which all other variants are built.

Gradient descent in Excel – all images by author
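
A minimal Python version of this baseline, for readers who want to reproduce the behaviour outside Excel (lr = 0.1 and the 50 iterations are illustrative values, not necessarily the spreadsheet’s):

lr = 0.1
x = 4.0                      # starting point from the setup above
for _ in range(50):
    grad = 2 * (x - 2)       # gradient of f(x) = (x - 2)^2
    x = x - lr * grad        # basic gradient descent update
print(x)                     # moves steadily toward the minimum at x = 2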

2. Learning Rate Decay

Learning Rate Decay doesn’t change the update rule itself. It changes the size of the learning rate across iterations so that the optimisation becomes more stable near the minimum. Large steps help when x is far from the target, but smaller steps are safer when x gets close to the minimum. Decay reduces the risk of overshooting and produces a smoother landing.

There isn’t a single decay formula. Several schedules exist in practice:

  • exponential decay
  • inverse decay (the one shown in the spreadsheet)
  • step-based decay
  • linear decay
  • cosine or cyclical schedules

All of these follow the same idea: the learning rate becomes smaller over time, but the pattern depends on the chosen schedule.

In the spreadsheet example, the decay formula is the inverse form:
lr_t = lr / (1 + decay * iteration)

With the update rule:
x = x – lr_t * 2*(x – 2)

This schedule starts with the full learning rate at the first iteration, then gradually reduces it. At the start of the optimisation, the step size is large enough to move quickly. As x approaches the minimum, the learning rate shrinks, stabilising the update and avoiding oscillation.

On the chart, both curves start at x = 4. The fixed learning rate version moves faster at first but approaches the minimum with less stability. The decay version moves more slowly but stays controlled. This confirms that decay doesn’t change the direction of the update. It only changes the step size, and that change affects the behavior.
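
The same loop with the inverse decay schedule, as a small Python sketch (lr = 0.4 and decay = 0.1 are illustrative values, not the spreadsheet’s):

lr, decay = 0.4, 0.1
x = 4.0
for t in range(50):
    lr_t = lr / (1 + decay * t)   # inverse decay: full lr at t = 0, then shrinking
    grad = 2 * (x - 2)
    x = x - lr_t * grad           # same direction, smaller and smaller steps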

3. Momentum Methods

Gradient Descent moves in the right direction but can be slow in flat regions. Momentum methods address this by adding inertia to the update.

They accumulate direction over time, which creates faster progress when the gradient stays consistent. This family includes standard Momentum, which builds speed, and Nesterov Momentum, which anticipates the next position to reduce overshooting.

3.1 Standard momentum

Standard momentum introduces the concept of inertia into the learning process. Instead of reacting only to the current gradient, the update keeps a memory of previous gradients in the form of a velocity variable:

velocity = 0.9*velocity + 2*(x – 2)
x = x – lr * velocity

This approach accelerates learning when the gradient stays consistent for multiple iterations, which is very useful in flat or shallow regions.

However, the same inertia that generates speed can also lead to overshooting the minimum, which creates oscillations around the target.
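
A minimal Python sketch of this rule (0.9 is the momentum coefficient from the formula above; lr = 0.1 is illustrative):

lr, beta = 0.1, 0.9
x, velocity = 4.0, 0.0
for _ in range(50):
    grad = 2 * (x - 2)
    velocity = beta * velocity + grad   # inertia: a decaying memory of past gradients
    x = x - lr * velocity               # can overshoot and oscillate around x = 2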

3.2 Nesterov Momentum

Nesterov Momentum is a refinement of the previous method. Instead of updating the velocity with the gradient at the current position alone, the method first estimates where the next position will be, and then evaluates the gradient at that anticipated location:

velocity = 0.9*velocity + 2*((x – 0.9*velocity) – 2)
x = x – lr * velocity

This look-ahead behaviour reduces the overshooting effect that can appear in regular Momentum, which results in a smoother approach to the minimum and fewer oscillations. It keeps the advantage of speed while introducing a more careful sense of direction.
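
The same sketch with the look-ahead step, in Python. Note this uses a common formulation that scales the look-ahead by the learning rate (x - lr * beta * velocity); the spreadsheet folds the factors slightly differently, but the idea is identical:

lr, beta = 0.1, 0.9
x, velocity = 4.0, 0.0
for _ in range(50):
    x_ahead = x - lr * beta * velocity   # anticipated next position
    grad = 2 * (x_ahead - 2)             # gradient at the look-ahead point
    velocity = beta * velocity + grad
    x = x - lr * velocity                # smoother approach, fewer oscillations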

4. Adaptive Gradient Methods

Adaptive Gradient Methods adjust the update based on information gathered during training. Instead of using a fixed learning rate or relying only on the current gradient, these methods adapt to the scale and behavior of recent gradients.

The goal is to reduce the step size when gradients become unstable and to allow normal progress when the surface is more predictable. This approach is helpful in deep networks or irregular loss surfaces, where the gradient can change in magnitude from one step to the next.

4.1 RMSProp (Root Mean Square Propagation)

RMSProp stands for Root Mean Square Propagation. It keeps a running average of squared gradients in a cache, and this value influences how aggressively the update is applied:

cache = 0.9*cache + 0.1*(2*(x – 2))²
x = x – lr / sqrt(cache) * 2*(x – 2)

The cache becomes larger when gradients are unstable, which reduces the update size. When gradients are small, the cache grows more slowly, and the update stays close to the normal step. This makes RMSProp effective in situations where the gradient scale is not consistent, which is common in deep learning models.
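
As a Python sketch (eps is a small constant, standard in practice but not shown in the spreadsheet formula, added to avoid dividing by zero; lr = 0.1 is illustrative):

lr, eps = 0.1, 1e-8
x, cache = 4.0, 0.0
for _ in range(50):
    grad = 2 * (x - 2)
    cache = 0.9 * cache + 0.1 * grad**2      # running average of squared gradients
    x = x - lr / (cache**0.5 + eps) * grad   # larger cache -> smaller step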

4.2 Adam (Adaptive Moment Estimation)

Adam stands for Adaptive Moment Estimation. It combines the concept of Momentum with the adaptive behaviour of RMSProp. It keeps a moving average of gradients to capture direction, and a moving average of squared gradients to capture scale:

m = 0.9*m + 0.1*(2*(x – 2))
v = 0.999*v + 0.001*(2*(x – 2))²
x = x – lr * m / sqrt(v)

The variable m behaves like the velocity in Momentum, and the variable v behaves like the cache in RMSProp. Adam updates both values at every iteration, which allows it to accelerate when progress is clear and shrink the step when the gradient becomes unstable. This balance between speed and control is what makes Adam a standard choice in neural network training.
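
The same loop for Adam, as a simplified Python sketch (this omits the bias-correction terms of the full algorithm; lr and eps are illustrative values):

lr, eps = 0.1, 1e-8
x, m, v = 4.0, 0.0, 0.0
for _ in range(50):
    grad = 2 * (x - 2)
    m = 0.9 * m + 0.1 * grad            # direction, like the velocity in Momentum
    v = 0.999 * v + 0.001 * grad**2     # scale, like the cache in RMSProp
    x = x - lr * m / (v**0.5 + eps)     # fast when consistent, careful when not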

4.3 Other Adaptive Methods

Adam and RMSProp are the most common adaptive methods, but they are not the only ones. Several related methods exist, each with a specific objective:

  • AdaGrad adjusts the learning rate based on the full history of squared gradients, but the rate can shrink too quickly.
  • AdaDelta modifies AdaGrad by limiting how much the historical gradient affects the update.
  • Adamax uses the infinity norm and can be more stable for very large gradients.
  • Nadam adds Nesterov-style look-ahead behaviour to Adam.
  • RAdam attempts to stabilise Adam in the early phase of training.
  • AdamW separates weight decay from the gradient update and is recommended in many modern frameworks.

These methods follow the same idea as RMSProp and Adam: adapting the update to the behavior of the gradients. They represent refinements or extensions of the concepts introduced above, and they are part of the same broader family of adaptive optimisation algorithms.

Conclusion

All methods in this article aim for the same goal: moving x toward the minimum. The difference is the path. Gradient Descent provides the basic rule. Momentum adds speed, and Nesterov improves control. RMSProp adapts the step to gradient scale. Adam combines these ideas, and Learning Rate Decay adjusts the step size over time.

Each method solves a specific limitation of the previous one. None of them replaces the baseline. They extend it. In practice, optimisation is not one rule, but a set of mechanisms that work together.

The goal stays the same. The movement becomes more efficient.
