Gradient Descent From Scratch: Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent
Loss and cost functions
Gradient descent
Implementation from scratch with Python
Conclusion

In this article, I’ll take you through the implementation of Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, coded from scratch in Python. It is beginner friendly. Understanding the gradient descent method will help you optimise your loss during ML model training. I’ll also explain the impact of momentum on the learning process.

Gradient Descent

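What you’ll learn in this article: loss and cost functions, gradient descent and its variants, and an implementation from scratch with Python.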

Loss and cost functions are used in machine learning to quantify the discrepancy between the actual values and the predicted values during model training. Both are vital components in machine learning.

The loss function is used to judge the performance of a model on a single training example. It takes the predicted output from the model and compares it to the true target output, quantifying the discrepancy between the two.

The cost function is often computed as the average or sum of the individual loss function values across the entire training dataset. It is a global measure of how well the model performs and is used by optimization methods such as gradient descent to iteratively adjust the model’s parameters during training.

Cost Function
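To make the distinction concrete, here is a minimal sketch using squared error as the loss. The toy targets and predictions are made-up values for illustration, not data from this article.

```python
import numpy as np

# Hypothetical toy data: true targets and model predictions for 4 examples
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

# Loss: one value per training example (squared error here)
per_example_loss = (y_true - y_pred) ** 2

# Cost: a single number for the whole dataset (mean of the individual losses)
cost = per_example_loss.mean()

print(per_example_loss)
print(cost)  # 0.375
```

The loss tells you how wrong the model is on each example, while the cost is the one number that gradient descent tries to reduce.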

The gradient of a function is the vector of its partial derivatives; it gives the best linear approximation of the function around a point. Gradient descent is an iterative method for finding a local minimum of a function (its counterpart, gradient ascent, finds a local maximum). As defined by IBM, “Gradient descent is an optimization algorithm which is commonly-used to train machine learning models and neural networks.”

The gradient of the cost function represents the direction and magnitude of the steepest increase in the cost. By moving in the opposite direction of the gradient, that is, along the negative gradient, the algorithm aims to converge towards the set of parameters that provides the best fit to the training data. Gradient descent is a flexible optimization technique that can be used in various machine learning algorithms, such as linear regression, logistic regression, neural networks, and support vector machines.

As said earlier, gradient descent is a popular optimization approach for minimising a model’s cost function. The cost function quantifies the difference between the predicted and actual values on the training data in order to evaluate the model’s performance.

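The gradient descent update rule is given below:

$$\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)$$

Here $\theta$ denotes the model parameters (weights), $\alpha$ the step size (learning rate), and $J(\theta)$ the cost function.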

First, you initialise a weight randomly and use it to evaluate the model and obtain the value of the cost function. Then you find the derivative of the cost function with respect to the model’s parameter; this gives the gradient of the cost function. The gradient points in the direction of steepest ascent, so the weight is moved a small step in the opposite direction. Gradient descent repeats this until it reaches a point where the loss can no longer be reduced, ideally the point with the smallest possible loss value.
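The procedure can be sketched on a one-parameter problem. The cost (w - 3)^2, the starting point, and the step size below are illustrative assumptions; a fixed start is used instead of a random one so that the run is reproducible.

```python
def cost(w):
    # Toy cost function with its minimum at w = 3
    return (w - 3.0) ** 2

def cost_gradient(w):
    # Derivative of the toy cost with respect to w
    return 2.0 * (w - 3.0)

w = 0.0            # initial weight (fixed here; normally random)
step_size = 0.1    # learning rate
history = []       # cost recorded before each update

for _ in range(100):
    history.append(cost(w))
    # Step against the gradient, i.e. in the direction of steepest descent
    w = w - step_size * cost_gradient(w)

print(w)                        # close to 3.0
print(history[0], history[-1])  # the cost shrinks towards 0
```

Each update shrinks the distance to the minimum by a constant factor, which is why the cost falls quickly at first and then flattens out.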

In batch gradient descent, the loss for all of the points in the training set is averaged, and the model (weight) is updated only after evaluating all of the training examples in a single training iteration.

The advantage of this method is that it gives a more accurate value for the gradient. One drawback of batch gradient descent is that the stable error gradient can sometimes result in convergence at a suboptimal solution, failing to achieve the best possible model performance.

Stochastic Gradient Descent (SGD) is a simplified version of Gradient Descent (GD) that addresses some of its challenges. In SGD, the gradient is computed for only one randomly chosen sample of the shuffled dataset during each iteration, instead of using the entire dataset. This modification significantly reduces computation time. However, because SGD processes one observation at a time, whereas GD evaluates the whole dataset in each iteration, it can produce noisier results.

Mini-batch gradient descent is a balance between batch gradient descent and stochastic gradient descent. Here the data is split into small batches and the loss is computed for each batch. The weights are updated after each batch.

In order to speed up a model’s learning process, optimization methods like gradient descent frequently use momentum. It adds a new term that affects the direction in which the model parameters are updated, based on the gradient information accumulated from earlier iterations.

The impact of momentum on the learning process can include faster convergence, smoother optimization paths, better escape from local optima, and improved generalisation ability, making it a useful technique for optimising the training process of machine learning models.
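The smoothing effect can be illustrated with a small sketch. It assumes the exponential-moving-average form of momentum, which is one common variant; the function name `momentum_update` and the alternating gradient values are hypothetical and chosen only to mimic a noisy gradient.

```python
def momentum_update(velocity, grad, beta):
    # Exponential moving average of past gradients
    return beta * velocity + (1.0 - beta) * grad

# Hypothetical noisy gradients that jump between 1.5 and -0.5 (mean 0.5)
raw_grads = [1.5, -0.5] * 50

velocity = 0.0
smoothed = []
for g in raw_grads:
    velocity = momentum_update(velocity, g, beta=0.9)
    smoothed.append(velocity)

raw_swing = max(raw_grads) - min(raw_grads)   # 2.0
late = smoothed[-10:]                          # values after the warm-up phase
smoothed_swing = max(late) - min(late)

print(raw_swing, smoothed_swing)
```

The smoothed values stay close to the average gradient direction and swing far less than the raw gradients, which is what gives momentum its smoother optimization path.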

A full training loop using the gradient descent algorithm follows these steps:

– Initialise parameters
– Run some number of epochs:
  – Use parameters to make predictions
  – Compute and store losses
  – Compute gradients of the loss with respect to parameters
  – Use gradients to update the parameters
– Do anything else that is needed, or end!

Let’s write a few lines of code for each of these steps before the main training loop.

import numpy as np

# Linear function
def linear_function(X, theta):
    assert X.ndim > 1
    assert theta.ndim > 1
    return np.dot(X, theta)

# Loss function (MSE)
def mean_squared_error(ytrue, ypred):
    return np.mean((ytrue - ypred)**2)

# Here we initialize the weight (parameter)
def initialize_theta(D):
    return np.zeros([D, 1])

# Here we compute the gradient of the loss, obtained by differentiating the MSE
# with respect to the weight. Dividing by the number of samples matches the mean in the MSE.
def batch_gradient(X, y, theta):
    N = X.shape[0]
    return -2.0 / N * np.dot(X.T, (y - linear_function(X, theta)))

# Update function: gradient descent
def update_function(theta, grads, step_size):
    return theta - step_size * grads

The main loop for batch gradient descent. The functions we created above are used accordingly.

def train_batch_gradient_descent(X, y, num_epochs, step_size=0.1, plot_every=1):

    N, D = X.shape
    theta = initialize_theta(D)
    losses = []
    for epoch in range(num_epochs): # Do some iterations
        ypred = linear_function(X, theta) # make predictions with current parameters
        loss = mean_squared_error(y, ypred) # Compute mean squared error
        grads = batch_gradient(X, y, theta) # compute gradients of loss wrt parameters
        theta = update_function(theta, grads, step_size) # Update your parameters with the gradients

        losses.append(loss)
        print(f"\nEpoch {epoch}, loss {loss}")
    return losses

You can use the matplotlib library to plot the loss.

import matplotlib.pyplot as plt

plt.plot(losses)
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("training curve")
plt.show()
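For a quick end-to-end check, the sketch below runs batch gradient descent on synthetic data. The data, the "true" weights, the seed, and the epoch count are assumptions made up for this illustration. The gradient is that of the mean squared error, so it is divided by the number of samples, which keeps a suitable step size independent of the dataset size.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed for reproducibility

# Hypothetical synthetic regression problem: y = X @ true_theta + small noise
N, D = 200, 2
X = rng.normal(size=(N, D))
true_theta = np.array([[2.0], [-1.0]])
y = X @ true_theta + 0.01 * rng.normal(size=(N, 1))

theta = np.zeros((D, 1))                # initialise parameters
step_size = 0.1
losses = []

for epoch in range(200):
    residual = y - X @ theta                      # prediction error on all samples
    losses.append(float(np.mean(residual ** 2)))  # MSE for this epoch
    grad = -2.0 / N * (X.T @ residual)            # gradient of the MSE wrt theta
    theta = theta - step_size * grad              # gradient descent update

print(theta.ravel())            # should be close to [2, -1]
print(losses[0], losses[-1])    # loss should drop substantially
```

If the loss curve rises instead of falling, the step size is usually too large for the scale of the features.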

Stochastic gradient descent, unlike batch gradient descent, picks a random sample (or a small subset of samples) and updates the parameters with its gradient.

Below, we’ll write code that takes a single sample and computes the gradient for that sample. It’s also important that we shuffle our data.

def per_sample_gradient(xi, yi, theta):
    # xi has shape (1, D); transposing it gives a gradient of shape (D, 1), the same as theta
    return -2.0 * np.dot(xi.T, (yi - linear_function(xi, theta)))

def shuffle_data(X, y):
    N, _ = X.shape
    shuffled_idx = np.random.permutation(N)
    return X[shuffled_idx], y[shuffled_idx]
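Two details are easy to get wrong here: features and labels must be shuffled with the same permutation, and the single-sample gradient must have the same shape as the parameter vector. The following sketch checks both on made-up data; the array values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each label is the sum of its feature row,
# so a mismatch between rows and labels is easy to detect
N, D = 5, 3
X = np.arange(N * D, dtype=float).reshape(N, D)
y = X.sum(axis=1, keepdims=True)
theta = np.zeros((D, 1))

# Shuffle features and labels with one shared permutation
idx = rng.permutation(N)
X_shuf, y_shuf = X[idx], y[idx]
still_aligned = np.allclose(X_shuf.sum(axis=1, keepdims=True), y_shuf)

# Gradient of the squared error for a single sample
xi = X_shuf[0].reshape(1, D)        # shape (1, D)
yi = y_shuf[0].reshape(1, 1)        # shape (1, 1)
residual = yi - xi @ theta          # shape (1, 1)
grad = -2.0 * xi.T @ residual       # shape (D, 1), same as theta

print(still_aligned)                # True
print(grad.shape, theta.shape)      # (3, 1) (3, 1)
```

If the gradient came out with shape (1, D) instead, subtracting it from theta would silently broadcast to a (D, D) matrix, so the shape check is worth keeping in mind.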

def train_with_sgd(X, y, num_epochs, step_size, plot_every=1):
    """Train with stochastic gradient descent"""
    N, D = X.shape
    theta = initialize_theta(D)
    losses = []
    epoch = 0
    loss_tolerance = 0.001
    avg_loss = float("inf")

    while epoch < num_epochs and avg_loss > loss_tolerance:
        running_loss = 0.0
        shuffled_x, shuffled_y = shuffle_data(X, y)

        for idx in range(shuffled_x.shape[0]):
            sample_x = shuffled_x[idx].reshape(-1, D)
            sample_y = shuffled_y[idx].reshape(-1, 1)
            ypred = linear_function(sample_x, theta)
            loss = mean_squared_error(sample_y, ypred)
            running_loss += loss
            grads = per_sample_gradient(sample_x, sample_y, theta)
            theta = update_function(theta, grads, step_size)

        avg_loss = running_loss / X.shape[0]
        losses.append(avg_loss)
        print(f"Epoch {epoch}, loss {avg_loss}")

        epoch += 1

    return losses

I explained momentum in the introduction; the code below shows how you can integrate momentum into SGD.

Momentum formula

def get_momentum(momentum, grad, beta):
    return beta * momentum + (1. - beta) * grad

SGD with momentum

def train_sgd_with_momentum(X, y, num_epochs, step_size, beta, plot_every=1):
    """Train with stochastic gradient descent with momentum"""
    N, D = X.shape
    theta = initialize_theta(D)
    losses = []
    epoch = 0
    loss_tolerance = 0.001
    avg_loss = float("inf")

    while epoch < num_epochs and avg_loss > loss_tolerance:
        momentum = 0.0 # the momentum is reset at the start of every epoch
        running_loss = 0.0
        shuffled_x, shuffled_y = shuffle_data(X, y)

        for idx in range(shuffled_x.shape[0]):
            sample_x = shuffled_x[idx].reshape(-1, D)
            sample_y = shuffled_y[idx].reshape(-1, 1)
            ypred = linear_function(sample_x, theta)
            loss = mean_squared_error(sample_y, ypred)
            running_loss += loss
            grad = per_sample_gradient(sample_x, sample_y, theta)
            momentum = get_momentum(momentum, grad, beta)
            theta = update_function(theta, momentum, step_size) # here the gradient is replaced by the momentum

        avg_loss = running_loss / X.shape[0]
        losses.append(avg_loss)
        print(f"Epoch {epoch}, loss {avg_loss}")

        epoch += 1

    return losses

Let’s now conclude with mini-batch gradient descent.

Instead of computing gradients for the entire dataset, in this case we split the data into batches. Every batch, except possibly the last, has the size “batch_size” that we defined. Do you understand the reason? (The number of samples is not always a multiple of batch_size.)

Several aspects must be noted in this case:

Since we only take one batch at a time, we must compute the loss for each batch before averaging over all sample points.

def minibatch_gradient_descent(X, y, num_epochs, step_size=0.1, batch_size=3, plot_every=1):
    N, D = X.shape
    theta = initialize_theta(D)
    losses = []
    num_batches = N//batch_size
    X, y = shuffle_data(X, y) # shuffle the data

    for epoch in range(num_epochs): # Do some iterations
        running_loss = 0.0

        for batch_idx in range(0, N, batch_size):
            x_batch = X[batch_idx: batch_idx + batch_size] # select a batch of features
            y_batch = y[batch_idx: batch_idx + batch_size] # and a batch of labels

            ypred = linear_function(x_batch, theta) # make predictions with current parameters
            loss = mean_squared_error(y_batch, ypred) # Compute mean squared error
            grads = batch_gradient(x_batch, y_batch, theta) # compute gradients of loss wrt parameters
            theta = update_function(theta, grads, step_size) # Update your parameters with the gradients
            running_loss += (loss * x_batch.shape[0]) # loss is the mean for a batch; multiplying by the batch size gives
                                                      # us a sum for the batch, so we can average later by dividing
                                                      # by the total data size

        avg_loss = running_loss / N
        losses.append(avg_loss)
        print(f"\nEpoch {epoch}, loss {avg_loss}")
    return losses
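The batch slicing and the loss bookkeeping can be checked in isolation. In this sketch the per-sample losses are made-up numbers (0 to 9), chosen so the arithmetic is easy to follow; it shows that the last batch is smaller and why each batch mean is weighted by its batch size before averaging.

```python
import numpy as np

N, batch_size = 10, 3
per_sample_loss = np.arange(N, dtype=float)   # hypothetical losses 0, 1, ..., 9

batch_sizes = []
batch_means = []
weighted_sum = 0.0

for start in range(0, N, batch_size):
    batch = per_sample_loss[start:start + batch_size]   # slicing stops at the end of the array
    batch_sizes.append(len(batch))
    batch_means.append(batch.mean())
    weighted_sum += batch.mean() * len(batch)           # turn the batch mean back into a sum

print(batch_sizes)              # [3, 3, 3, 1]
print(weighted_sum / N)         # 4.5, equal to the mean over all samples
print(np.mean(batch_means))     # 5.25, biased because the small last batch counts too much
```

Averaging the batch means directly would give the single-sample last batch the same weight as a full batch, which is why the running loss accumulates sums instead.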
