Improve your Gradient Descent: The Epic Quest for the Optimal Stride Contents Introduction Method 1: Fixed Step Size Method 2: Exact Line Search Method 3: Backtracking Line Search Conclusion

Artificial Intelligence

Improve your Gradient Descent: The Epic Quest for the Optimal Stride Contents Introduction Method 1: Fixed Step Size Method 2: Exact Line Search Method 3: Backtracking Line Search Conclusion

admin

May 25, 2023

Improve your Gradient Descent: The Epic Quest for the Optimal Stride
Contents
Introduction
Method 1: Fixed Step Size
Method 2: Exact Line Search
Method 3: Backtracking Line Search
Conclusion

Techniques for Optimizing Step-Size/Learning Rate in Gradient Descent for Machine Learning

Introduction
Method 1: Fixed Step Size
Method 2: Exact Line Search
Method 3: Backtracking Line Search
Conclusion

When training any machine learning model, Gradient Descent is one of the commonly used techniques to optimize for the parameters. Gradient descent offers an efficient way of minimizing the loss function for the incoming data, especially in cases, where they might not be a closed-form solution to the issue. Typically, consider a machine learning problem defined by a convex and differentiable function f: Rᵈ → R (most loss functions follow these properties). The goal is to search out x* ∈ Rᵈ that minimizes the loss function:

Gradient Descent provides an iterative approach to solving this problem. The update rule is given as:

Where x⁽ᵏ⁾ refers back to the value of x within the kth iteration of the algorithm, and tₖ refers back to the step size or the training rate of the model within the kth iteration. The overall workflow of the algorithm is given as follows:

Determine the loss function f and compute its gradient ∇f.
Start with a random alternative of x ∈ Rᵈ, call it x⁽⁰⁾(the starting iterate).
Until you reach the stopping criteria (e.g., the error falls below a certain threshold), do the next:
A) Determine the direction along which x have to be reduced or increased. Under gradient descent, that is given by the direction opposite to the gradient of the loss function evaluated at the present iterate. vₖ = ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)
B) Determine the step size or the magnitude of the change: tₖ.
C) Update the iterate: x⁽ᵏ⁾= x⁽ᵏ ⁻ ¹⁾ − tₖ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)

That’s your complete workflow in a nutshell: Take the present iterate, search for the direction by which it must be updated (vₖ), determine the magnitude of the update (tₖ), and update it.

Illustration of Gradient Descent [Image by Author]

So, what’s this text about? In this text, our focus will probably be on step 3B: finding the optimal step size or the magnitude of tₖ. In relation to gradient descent, that is one of the neglected features of optimizing your model. The dimensions of the step can greatly affect how briskly your algorithm converges to the answer in addition to the accuracy of the answer it converges to. most frequently, data scientists simply set a set value for the step size during your complete learning process or may occasionally use validation techniques to coach it. But, there are various more efficient ways to go about solving this problem. In this text, we are going to discuss 3 other ways to find out the step size tₖ:

Fixed Step Size
Exact Line Search
Backtracking Line Search (Armijo’s Rule)

For every of those methods, we are going to discuss the speculation and implement it to calculate the primary few iterates for an example. Specifically, we are going to consider the next loss function to guage the model:

The 3D-Plot of the function is shown below:

Loss Function (3D Plot) [Image by Author generated using LibreTexts]

From the figure, it’s evident that the worldwide minima is x* = [0; 0]. Throughout this text, we are going to manually calculate the primary few iterates and computationally determine the variety of steps for convergence for every of those methods. We will even trace the pattern of the descent (aka the iterate trajectory) to grasp how each of those techniques affects the [process of convergence. Usually, it’s easier to refer to the contour plot of the function (instead of its 3D plot) to better evaluate the different trajectories that follow. The contour plot of the function can be easily generated via the following code:

# Load Packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
sns.set(style="darkgrid")
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
from mpl_toolkits.mplot3d import Axes3D

# Define Function
f = lambda x,y:  2*x**2 + 3*y**2 - 2*x*y - 1# Plot contour
X = np.arange(-1, 1, 0.005)
Y = np.arange(-1, 1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
cmap = plt.cm.get_cmap('viridis')
plt.contour(X,Y,Z,250, cmap=cmap)

Contour Plot of f [Image by Author generated using Python]

Let’s start!

This method is the best to make use of, and the one mostly used for training the ML model. This involves setting:

One must be very careful while selecting the appropriate t under this method. While a small value of t can result in very accurate solutions, the convergence can turn out to be quite slow. Alternatively, a big t makes the algorithm faster, but at the associated fee of accuracy. Using this method requires the implementer to fastidiously balance the trade-off between the speed of convergence and the accuracy of the answer yielded.

In practice, most data scientists use validation techniques similar to hold-out validation or k-fold cross-validation to optimize for t. This system involves making a partition of the training data (called the validation data), which is used to optimize for the performance by running the algorithm on a discrete set of values that t can take. Let’s take a look at our example:

Step one is to compute it’s gradient:

For all subsequent calculations, we are going to take the initialization to be x⁽⁰⁾= [1; 1]. Under this strategy, we set:

The primary two iterates are calculated as follows:

We compute the remaining iterates programmatically via the next Python Code:

# Define the function f(x, y)
f = lambda x, y: 2*x**2 + 3*y**2 - 2*x*y - 1# Define the derivative of f(x, y)
def df(x, y):
return np.array([4*x - 2*y, 6*y - 2*x])
# Perform gradient descent optimization
def grad_desc(f, df, x0, y0, t=0.1, tol=0.001):
x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
num_steps = 0  # Initialize the variety of steps taken
# Proceed until the norm of the gradient is below the tolerance
while np.linalg.norm(df(x0, y0)) > tol:  
v = -df(x0, y0)  # Compute the direction of descent
x0 = x0 + t*v[0]  # Update x coordinate
y0 = y0 + t*v[1]  # Update y coordinate
x.append(x0)  # Append updated x coordinate to the list
y.append(y0)  # Append updated y coordinate to the list
num_steps += 1  # Increment the variety of steps taken
return x, y, num_steps
# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)
# Print the variety of steps taken for convergence
print(f"Variety of Steps to Convergence: {n}")

Within the above code, we’ve defined the next convergence criteria (which will probably be consistently utilized for all methods):

On running the above code, we discover that it takes around 26 steps to converge. The next plot shows the trajectory of the iterates through the gradient descent:

# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
plt.contour(X,Y,Z,250, cmap=cmap, alpha = 0.6)
n = len(a)
for i in range(n - 1):
plt.plot([a[i]],[b[i]],marker='o',markersize=7, color ='r')
plt.plot([a[i + 1]],[b[i + 1]],marker='o',markersize=7, color ='r')
plt.arrow(a[i],b[i],a[i + 1] - a[i],b[i + 1] - b[i], 
head_width=0, head_length=0, fc='r', ec='r', linewidth=2.0)

Contour Plot: Fixed Step Size = 0.1 [Image by Author generated using Python]

To have a greater understanding of how critical it’s to decide on the appropriate t on this method, let’s see gauge the effect of accelerating or decreasing t. If we decrease the worth of t from 0.1 to 0.01, the variety of steps to converge increases drastically from 26 to 295. The iterate trajectory for this case is shown below:

Contour Plot: Fixed Step Size = 0.01 [Image by Author generated using Python]

Nevertheless, on increasing the t from 0.1 to 0.2, the variety of steps to converge decreases from 26 to a mere 11, as shown by the next trajectory:

Contour Plot: Fixed Step Size = 0.2 [Image by Author generated using Python]

Nevertheless, it is necessary to notice that this doesn’t all the time be the case. If the worth of the step size is simply too large, it is feasible that the iterates simply bounce away from the optimal solution and never converge. In actual fact, increasing t from 0.2 to simply around 0.3 causes the iterate values to shoot up, making it unimaginable to converge. That is seen from the next contour plot (with t = 0.3) for just the primary 8 steps:

Contour Plot: Fixed Step Size = 0.3 [Image by Author generated using Python]

Thus, it is obvious that finding the appropriate value of t is amazingly vital on this method and even a small increase or decrease can greatly affect the algorithm’s ability to converge. Now, let’s talk concerning the next method to find out t.

On this method, we don’t assign a straightforward pre-fixed value of t at every iteration. As an alternative, we treat the issue of finding the optimal t itself as a 1D optimization problem. In other words, we’re concerned about finding the perfect update t, that minimizes the worth of the function:

Notice how cool that is! We’ve a multi-dimensional optimization problem (minimizing f) that we try to solve using gradient descent. We all know the perfect direction to update our iterate (vₖ = − ∇ₓ f(x⁽ᵏ ⁻ ¹⁾)), but we want to search out the optimal step size tₖ. In other words, the worth of the function for the subsequent iterate only relies on the worth of tₖ that we selected to make use of. So, we treat this as one other (but simpler!) optimization problem:

So, we update x⁽ᵏ⁾ to be the iterate that best minimizes the loss function f. This really helps increase the speed of convergence. Nevertheless, it also adds an extra time requirement: To compute the minimizer of g(t). Often, this isn’t an issue because it’s a 1D function, but sometimes it could take longer than expected. Thus, while using this method, it’s essential to balance the trade-off between the variety of steps reduced to converge and the extra time requirement to compute the argmin. Let’s take a look at our example:

The primary two iterates are calculated as follows:

We compute the remaining iterates programmatically via the next Python Code

# Import package for 1D Optimization
from scipy.optimize import minimize_scalardef grad_desc(f, df, x0, y0, tol=0.001):
x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
num_steps = 0  # Initialize the variety of steps taken
# Proceed until the norm of the gradient is below the tolerance
while np.linalg.norm(df(x0, y0)) > tol:  
v = -df(x0, y0)  # Compute the direction of descent
# Define optimizer function for searching t
g = lambda t: f(x0 + t*v[0], y0 + t*v[1]) 
t = minimize_scalar(g).x # Minimize t
x0 = x0 + t*v[0]  # Update x coordinate
y0 = y0 + t*v[1]  # Update y coordinate
x.append(x0)  # Append updated x coordinate to the list
y.append(y0)  # Append updated y coordinate to the list
num_steps += 1  # Increment the variety of steps taken
return x, y, num_steps
# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)
# Print the variety of steps taken for convergence
print(f"Variety of Steps to Convergence: {n}")

Just as before, within the above code, we’ve defined the next convergence criteria (which will probably be consistently utilized for all methods):

On running the above code, we discover that it takes only 10 steps to converge ( a significant improvement from the fixed step size). The next plot shows the trajectory of the iterates through the gradient descent:

# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
plt.contour(X,Y,Z,250, cmap=cmap, alpha = 0.6)
n = len(a)
for i in range(n - 1):
plt.plot([a[i]],[b[i]],marker='o',markersize=7, color ='r')
plt.plot([a[i + 1]],[b[i + 1]],marker='o',markersize=7, color ='r')
plt.arrow(a[i],b[i],a[i + 1] - a[i],b[i + 1] - b[i], head_width=0, 
head_length=0, fc='r', ec='r', linewidth=2.0)

Contour Plot: Exact Line Search [Image by Author generated using Python]

Now, let’s talk concerning the next method to find out t.

Backtracking is an adaptive method of selecting the optimal step size. In my experience, I even have found this to be one of the useful strategies for optimizing the step size. The convergence will likely be much faster than fixed step size without the complications of maximizing a 1D function g(t) in a precise line search. This method involves starting out with a quite large step size (t¯ = 1) and continuing to diminish it until a required decrease in f(x) is observed. Allow us to first take a take a look at the algorithm and subsequently, we will probably be discussing the specifics:

Algorithm 1: Backtracking (Armijo–Goldstein condition) [Image by Author]

In other words, we start with a big step size (which is significant normally within the initial stages of the algorithm) and check if it helps us improve the present iterate by a given threshold. If the step size is found to be too large, we reduce it by multiplying it with a scalar constant β ∈ (0, 1). We repeat this process until a desired decrease in f is obtained. Specifically, we elect the biggest t such that:

i.e., the decrease is not less than σt || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||². But, why this value? It may be mathematically shown (via Taylor’s first order expansion) that t || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||² is the minimum decrease in f that we will expect through an improvement made through the current iteration. There’s an extra factor of σ within the condition. That is to account for the very fact, that even when we cannot achieve the theoretically guaranteed decrease t || ∇ₓ f(x⁽ᵏ ⁻ ¹⁾) ||², we not less than hope to realize a fraction of it scaled by σ. That’s to say, we require that the achieved reduction if f be not less than a set fraction σ of the reduction promised by the first-order Taylor approximation of f at x⁽ᵏ⁾. If the condition will not be fulfilled, we scale down t to a smaller value via β. Let’s take a look at our example (setting t¯= 1, σ = β = 0.5):

The primary two iterates are calculated as follows:

Likewise,

We compute the remaining iterates programmatically via the next Python Code:

# Perform gradient descent optimization
def grad_desc(f, df, x0, y0, tol=0.001):
x, y = [x0], [y0]  # Initialize lists to store x and y coordinates
num_steps = 0  # Initialize the variety of steps taken
# Proceed until the norm of the gradient is below the tolerance
while np.linalg.norm(df(x0, y0)) > tol:  
v = -df(x0, y0)  # Compute the direction of descent
# Compute the step size using Armijo line search
t = armijo(f, df, x0, y0, v[0], v[1]) 
x0 = x0 + t*v[0]  # Update x coordinate
y0 = y0 + t*v[1]  # Update y coordinate
x.append(x0)  # Append updated x coordinate to the list
y.append(y0)  # Append updated y coordinate to the list
num_steps += 1  # Increment the variety of steps taken
return x, y, num_stepsdef armijo(f, df, x1, x2, v1, v2, s = 0.5, b = 0.5):
t = 1
# Perform Armijo line search until the Armijo condition is satisfied
while (f(x1 + t*v1, x2 + t*v2) > f(x1, x2) + 
t*s*np.matmul(df(x1, x2).T, np.array([v1, v2]))):
t = t*b # Reduce the step size by an element of b
return t
# Run the gradient descent algorithm with initial point (1, 1)
a, b, n = grad_desc(f, df, 1, 1)
# Print the variety of steps taken for convergence
print(f"Variety of Steps to Convergence: {n}")

Just as before, within the above code, we’ve defined the next convergence criteria (which will probably be consistently utilized for all methods):

On running the above code, we discover that it takes only 10 steps to converge. The next plot shows the trajectory of the iterates through the gradient descent:

# Plot the contours
X = np.arange(-1.1, 1.1, 0.005)
Y = np.arange(-1.1, 1.1, 0.005)
X, Y = np.meshgrid(X, Y)
Z = f(X,Y)
plt.figure(figsize=(12, 7))
plt.contour(X,Y,Z,250, cmap=cmap, alpha = 0.6)
n = len(a)
for i in range(n - 1):
plt.plot([a[i]],[b[i]],marker='o',markersize=7, color ='r')
plt.plot([a[i + 1]],[b[i + 1]],marker='o',markersize=7, color ='r')
plt.arrow(a[i],b[i],a[i + 1] - a[i],b[i + 1] - b[i], head_width=0, 
head_length=0, fc='r', ec='r', linewidth=2.0)

Contour Plot: Backtracking [Image by Author generated using Python]

In this text, we familiarised ourselves with a few of the useful techniques to optimize for the step size within the gradient descent algorithm. Specifically, we covered 3 important techniques: Fixed Step Size, which involves maintaining the identical step size or learning rate throughout the training process, Exact Line Search, which involves minimizing the loss as a function of t, and Armijo Backtracking involving a gradual reduction within the step size until a threshold is met. While these are a few of the most fundamental techniques which you could use to tune your optimization, there exist an unlimited array of other methods (eg. setting t as a function of the variety of iterations). These tools are generally used in additional complex settings, similar to Stochastic Gradient Descent. The aim of this text was not only to introduce you to those techniques but additionally to make you aware of the intricacies that may affect your optimization algorithm. While most of those techniques are utilized in the context of Gradient descent, they can be applied to other optimization algorithms (e.g., Newton-Raphson Method). Each of those techniques has its own merits and will be preferred over the others for specific applications and algorithms.

Hope you enjoyed reading this text! In case you will have any doubts or suggestions, do reply within the comment box. Please be at liberty to contact me via mail.

For those who liked my article and wish to read more of them, please follow me.

Note: All images have been made by the writer.

Techniques for Optimizing Step-Size/Learning Rate in Gradient Descent for Machine Learning

LEAVE A REPLY Cancel reply