Backpropagation: Step-By-Step Derivation

Photo by DeepMind on Unsplash

In the previous article we talked about multi-layer perceptrons (MLPs) as the first neural network model that can solve non-linear and complex problems.

For a long time it was not clear how to train these networks on a given data set. While single-layer perceptrons had a simple learning rule that was guaranteed to converge to a solution, it could not be extended to networks with more than one layer. The AI community struggled with this problem for more than 30 years (in a period known as the "AI winter"), until in 1986 Rumelhart et al. introduced the backpropagation algorithm in their groundbreaking paper [1].

In this article we will discuss the backpropagation algorithm in detail and derive its mathematical formulation step-by-step. Since this is the basic algorithm used to train neural networks of all kinds (including the deep networks we have today), I believe it will be useful to anyone working with neural networks to know the details of this algorithm.

Although you can find descriptions of this algorithm in many textbooks and online sources, in writing this article I have tried to keep the following principles in mind:

  • Use clear and consistent notations.
  • Explain every step of the mathematical derivation.
  • Derive the algorithm for the most general case, i.e., for networks with any number of layers and any activation or loss function.

After deriving the backpropagation equations, complete pseudocode for the algorithm is given and then illustrated on a numerical example.

Before reading the article, I recommend that you refresh your calculus knowledge, specifically in the area of derivatives (including partial derivatives and the chain rule).

Now grab a cup of coffee and let’s dive in 🙂

The Three Phases of the Algorithm

The backpropagation algorithm consists of three phases:

  1. Forward pass. In this phase we feed the inputs through the network, make a prediction, and measure its error with respect to the true label.
  2. Backward pass. We propagate the gradients of the error with respect to each of the weights backward from the output layer to the input layer.
  3. Gradient descent. We slightly tweak the connection weights in the network by taking a step in the opposite direction of the error gradients.

We will now go over each of these phases in more detail.

Forward Pass

In the forward pass, we propagate the inputs forward through the network, layer by layer, until the output is generated. The activation of neuron i in layer l is computed using the following equation:

$$a_i^l = f\left(z_i^l\right) = f\left(\sum_{j} w_{ij}^l\, a_j^{l-1} + b_i^l\right)$$

The forward pass equation

where f is the activation function, zᵢˡ is the net input of neuron i in layer l, wᵢⱼˡ is the connection weight between neuron j in layer l − 1 and neuron i in layer l, and bᵢˡ is the bias of neuron i in layer l. For more details on the notation and the derivation of this equation see my previous article.

To simplify the derivation of the learning algorithm, we will treat the bias as if it were the weight wᵢ₀ of an input neuron x₀ that has a constant value of 1. This allows us to write the above equation as follows:

$$a_i^l = f\left(\sum_{j=0} w_{ij}^l\, a_j^{l-1}\right), \qquad a_0^{l-1} = 1$$
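To make the notation concrete, here is a minimal NumPy sketch of the forward pass for a single layer (the function and variable names are my own, not from the original article):

```python
import numpy as np

def sigmoid(z):
    """An example activation function f."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(a_prev, W, b, f=sigmoid):
    """Compute the activations of layer l from those of layer l - 1.

    a_prev : activations a^(l-1), shape (n_prev,)
    W      : weights w_ij^l, shape (n_curr, n_prev)
    b      : biases b_i^l, shape (n_curr,)

    Returns the net inputs z^l and the activations a^l = f(z^l);
    z^l is kept because the backward pass will need it.
    """
    z = W @ a_prev + b   # net input z_i^l = sum_j w_ij^l a_j^(l-1) + b_i^l
    return z, f(z)
```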

Backward Pass

In the backward pass, we propagate the gradients of the error from the output layer back to the input layer.

Definition of the Error and Loss Functions

We first define the error of the network on the training set with respect to its weights. Let's denote by w the vector that contains all the weights of the network.

Assume that we’ve got n training samples {(xᵢ, yᵢ)}, i = 1,…,n, and the output of the network on sample i is oᵢ. Then the error of the network with respect to is:

The error function

where J(y, o) is the loss function. The particular loss function that we use depends on the task the network is trying to perform:

  1. For regression problems, we use the squared loss function:

$$J(y, o) = \frac{1}{2}\,(y - o)^2$$

  2. For binary classification problems, we use the log loss function (also known as the binary cross-entropy loss):

$$J(y, o) = -\bigl[\,y \log o + (1 - y)\log(1 - o)\,\bigr]$$

  3. For multi-class classification problems, we use the cross-entropy loss function:

$$J(\mathbf{y}, \mathbf{o}) = -\sum_{i=1}^{k} y_i \log o_i$$

where k is the number of classes.
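In code, the three loss functions might look as follows (a sketch; the 1/2 factor in the squared loss is one common convention that makes its derivative come out cleanly):

```python
import numpy as np

def squared_loss(y, o):
    """Regression: J(y, o) = 0.5 * (y - o)^2."""
    return 0.5 * (y - o) ** 2

def log_loss(y, o):
    """Binary classification: J(y, o) = -[y log(o) + (1 - y) log(1 - o)]."""
    return -(y * np.log(o) + (1 - y) * np.log(1 - o))

def cross_entropy_loss(y, o):
    """Multi-class classification: J(y, o) = -sum_i y_i log(o_i),
    where y is a one-hot label vector and o is the softmax output."""
    return -np.sum(y * np.log(o))
```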

The reasons why we use these specific loss functions will be explained in a future article.

Our goal is to find the weights that minimize E(w). Unfortunately, this function is non-convex because of the non-linear activations of the hidden neurons. This means that it may have multiple local minima.

There are various techniques that can be used to prevent gradient descent from getting stuck in a local minimum, such as momentum. These techniques will be covered in future articles.

Finding the Gradients of the Error

In order to use gradient descent, we need to compute the partial derivatives of E(w) with respect to each of the weights in the network:

$$\frac{\partial E}{\partial w_{ij}^l}$$

The partial derivative of E with respect to a given weight

To simplify the mathematical derivation, we will assume that we have only one training example and find the partial derivatives of the error with respect to that example:

$$E(\mathbf{w}) = J(y, o)$$

where y is the label of this example and o is the output of the network for that example. The extension to n training samples is straightforward, since the derivative of a sum of functions is simply the sum of their derivatives.

Computing the partial derivatives of the weights in the hidden layers is not trivial, since those weights do not affect the output (and hence the error) directly. To handle this problem, we will use the chain rule of derivatives to establish a relationship between the gradients of the error in a given layer and the gradients in the next layer.

The Delta Terms

We first note that E depends on the weight wᵢⱼˡ only via the net input zᵢˡ of neuron i in layer l. Therefore, we can apply the chain rule of derivatives to the gradient of E with respect to this weight:

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial z_i^l}\,\frac{\partial z_i^l}{\partial w_{ij}^l}$$

The second derivative on the right side of the equation is easy to compute, since $z_i^l = \sum_j w_{ij}^l a_j^{l-1}$:

$$\frac{\partial z_i^l}{\partial w_{ij}^l} = a_j^{l-1}$$

Therefore, we can write:

$$\frac{\partial E}{\partial w_{ij}^l} = \frac{\partial E}{\partial z_i^l}\, a_j^{l-1} = \delta_i^l\, a_j^{l-1}$$

The variable δᵢˡ is called the delta of neuron i in layer l, or the delta term for short.

The Delta Rule

The delta rule establishes the relationship between the delta terms in layer l and the delta terms in layer l + 1.

To derive the delta rule, we again use the chain rule of derivatives. The loss function depends on the net input of neuron i only via the net inputs of all the neurons it is connected to in layer l + 1. Therefore we can write:

$$\delta_i^l = \frac{\partial E}{\partial z_i^l} = \sum_{j}\left(\frac{\partial E}{\partial z_j^{l+1}}\,\frac{\partial z_j^{l+1}}{\partial z_i^l}\right)$$

where the index j in the sum goes over all the neurons in layer l + 1 that neuron i in layer l is connected to.

Once again we use the chain rule to decompose the second partial derivative inside the brackets:

$$\frac{\partial z_j^{l+1}}{\partial z_i^l} = \frac{\partial z_j^{l+1}}{\partial a_i^l}\,\frac{\partial a_i^l}{\partial z_i^l}$$

The first partial derivative inside the brackets is simply the delta of neuron j in layer l + 1, therefore we can write:

$$\delta_i^l = \sum_{j}\left(\delta_j^{l+1}\,\frac{\partial z_j^{l+1}}{\partial a_i^l}\,\frac{\partial a_i^l}{\partial z_i^l}\right)$$

The second partial derivative is easy to compute. Since $z_j^{l+1} = \sum_{k} w_{jk}^{l+1} a_k^{l}$, we have:

$$\frac{\partial z_j^{l+1}}{\partial a_i^l} = w_{ji}^{l+1}$$

Therefore we get:

$$\delta_i^l = \frac{\partial a_i^l}{\partial z_i^l}\,\sum_{j} w_{ji}^{l+1}\,\delta_j^{l+1}$$

But aᵢˡ = f(zᵢˡ), where f is the activation function. Hence, the partial derivative outside the sum is simply the derivative of the activation function, f′(x), evaluated at x = zᵢˡ.

Therefore we can write:

$$\delta_i^l = f'\left(z_i^l\right)\sum_{j} w_{ji}^{l+1}\,\delta_j^{l+1}$$

The delta rule

This equation, known as the delta rule, shows the relationship between the deltas in layer l and the deltas in layer l + 1. More specifically, each delta in layer l is a linear combination of the deltas in layer l + 1, where the coefficients of the combination are the connection weights between these layers. The delta rule allows us to compute all the delta terms (and thus all the gradients of the error) recursively, starting from the deltas in the output layer and going back layer by layer until we reach the input layer.

The following diagram illustrates the flow of the error information:

The calculation of the delta of neuron i in layer l by backpropagation of the deltas from the neurons in layer l + 1 to which it is connected. The black arrows indicate the direction of flow during forward propagation, and the red arrows indicate the backward propagation of the error.

For specific activation functions, we can derive more explicit equations for the delta rule. For example, if we use the sigmoid function:

$$f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$

The derivative of the sigmoid function has a simple form:

$$\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$$

Hence:

$$f'\left(z_i^l\right) = \sigma\left(z_i^l\right)\left(1 - \sigma\left(z_i^l\right)\right) = a_i^l\left(1 - a_i^l\right)$$

Then the delta rule for the sigmoid function takes the following form:

$$\delta_i^l = a_i^l\left(1 - a_i^l\right)\sum_{j} w_{ji}^{l+1}\,\delta_j^{l+1}$$

The delta rule for the sigmoid function
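As a sketch of how this formula translates to code, a vectorized delta-rule step for a sigmoid layer might look like this (names are mine, consistent with the earlier sketches):

```python
import numpy as np

def backward_layer_sigmoid(a, W_next, delta_next):
    """Delta rule for a sigmoid layer:
    delta_i^l = a_i^l * (1 - a_i^l) * sum_j w_ji^(l+1) * delta_j^(l+1)

    a          : activations a^l, shape (n_l,)
    W_next     : weights of layer l + 1, shape (n_next, n_l)
    delta_next : deltas of layer l + 1, shape (n_next,)
    """
    return a * (1 - a) * (W_next.T @ delta_next)
```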

The Deltas in the Output Layer

The final piece of the puzzle is the delta terms in the output layer, which are the first ones that we need to compute.

The deltas in the output layer depend both on the loss function and on the activation function used in the output neurons:

$$\delta_i = \frac{\partial E}{\partial z_i} = \frac{\partial J}{\partial o_i}\,f'\left(z_i\right)$$

where f is the activation function used to compute the output.

  1. In regression problems, the activation function we use in the output is the identity function f(x) = x, whose derivative is 1, and the loss function is the squared loss. Therefore the delta is:

$$\delta = \frac{\partial J}{\partial o}\cdot 1 = o - y$$

  2. In binary classification problems, the activation function we use is sigmoid and the loss function is log loss, therefore we get:

$$\delta = o - y$$

In other words, the delta is simply the difference between the network's output and the label.
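As a quick sanity check (this step is mine; the article states only the result), differentiating the log loss and multiplying by the sigmoid derivative gives:

$$\frac{\partial J}{\partial o} = -\frac{y}{o} + \frac{1 - y}{1 - o} = \frac{o - y}{o(1 - o)}, \qquad f'(z) = o(1 - o) \;\Rightarrow\; \delta = \frac{o - y}{o(1 - o)}\cdot o(1 - o) = o - y$$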

  3. In multiclass classification problems, we have k output neurons (where k is the number of classes) and we use softmax activation and the cross-entropy loss. Similar to the previous case, the delta term of the i-th output neuron is surprisingly simple:

$$\delta_i = o_i - y_i$$

where oᵢ is the i-th component of the network's prediction and yᵢ is the i-th component of the label. The proof is somewhat longer, and you can find it in this well-written Medium article.
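In code, all three cases therefore collapse to the same one-liner (a hypothetical helper, consistent with the sketches above):

```python
def output_delta(o, y):
    """Output-layer delta for the three pairings above (identity + squared loss,
    sigmoid + log loss, softmax + cross-entropy): delta = o - y in every case."""
    return o - y
```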

Gradient Descent

Once we finish computing all the delta terms, we can use gradient descent to update the weights. In gradient descent, we take small steps in the opposite direction of the gradient (i.e., in the direction of steepest descent) in order to get closer to the minimum error:

$$w_{ij}^l \leftarrow w_{ij}^l - \alpha\,\frac{\partial E}{\partial w_{ij}^l}$$

Gradient descent

Remember that the partial derivative of the error function with respect to each weight is:

$$\frac{\partial E}{\partial w_{ij}^l} = \delta_i^l\, a_j^{l-1}$$

Therefore, we can write the gradient descent update rule as follows:

$$w_{ij}^l \leftarrow w_{ij}^l - \alpha\,\delta_i^l\, a_j^{l-1}$$

where α is a learning rate that controls the step size (0 < α < 1). In other words, we subtract from the weight between neuron j in layer l − 1 and neuron i in layer l the delta of neuron i multiplied by the activation of neuron j (scaled by the learning rate).
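Vectorized over a whole layer, this update is a rank-one outer product of the layer's deltas with the previous layer's activations; a minimal sketch:

```python
import numpy as np

def gradient_step(W, delta, a_prev, alpha=0.5):
    """Gradient descent update w_ij^l <- w_ij^l - alpha * delta_i^l * a_j^(l-1).

    W      : weights of layer l, shape (n_l, n_prev)
    delta  : deltas of layer l, shape (n_l,)
    a_prev : activations of layer l - 1, shape (n_prev,)
    """
    return W - alpha * np.outer(delta, a_prev)
```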

Gradient descent can be applied in one of the following modes:

  1. Batch gradient descent — the weights are updated after we compute the error on the entire training set.
  2. Stochastic gradient descent (SGD) — a gradient descent step is performed after every training example. Typically converges faster than batch gradient descent but is less stable.
  3. Mini-batch gradient descent — a middle way between batch gradient descent and SGD. We use small batches of random training samples (usually between 10 and 1,000 examples) for the gradient updates. This reduces the noise in SGD but is still more efficient than full-batch updates, and it is the most common way to train neural networks (see the sketch after this list).
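A mini-batch loop can be as simple as shuffling the sample indices once per epoch and slicing them into chunks; a minimal sketch (the helper name is mine):

```python
import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    """Yield random mini-batches (X_batch, y_batch) covering one epoch."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]
```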

Backpropagation: The Complete Algorithm

We are now ready to present the complete algorithm in its full glory:
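Since pseudocode conventions vary, here is one possible NumPy rendering of the three phases for a fully connected network with sigmoid activations and log loss (a sketch under those assumptions, not the article's exact pseudocode; all names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, y, weights, biases, alpha=0.5):
    """One backpropagation iteration on a single training example (x, y).

    weights[l] has shape (n_next, n_curr) and biases[l] has shape (n_next,).
    The parameters are updated in place; the network output is returned.
    """
    # Phase 1: forward pass -- store the activations of every layer.
    activations = [x]
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
        activations.append(a)

    # Phase 2: backward pass -- for a sigmoid output with log loss,
    # the output delta is simply o - y; then apply the delta rule.
    delta = activations[-1] - y
    for l in reversed(range(len(weights))):
        grad_W = np.outer(delta, activations[l])  # dE/dw_ij^l = delta_i^l * a_j^(l-1)
        grad_b = delta
        if l > 0:
            a_prev = activations[l]
            # Delta rule with sigmoid: delta^l = a^l (1 - a^l) * (W^(l+1))^T delta^(l+1)
            delta = a_prev * (1 - a_prev) * (weights[l].T @ delta)
        # Phase 3: gradient descent step for this layer.
        weights[l] -= alpha * grad_W
        biases[l] -= alpha * grad_b

    return activations[-1]
```

Calling train_step repeatedly over single examples (or adapting it to the mini-batches from the sketch above) gives stochastic or mini-batch training, respectively.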

As an exercise, try to implement this algorithm in Python (or your favorite programming language).

Imagine that we’ve got a binary classification problem with two binary inputs and a single binary output. Our neural network has two hidden layers with the next weights:

The activation function in the hidden layers and in the output unit is the sigmoid function, and the learning rate is α = 0.5.

The network is presented with a training example with the inputs x₁ = 1 and x₂ = 0, and the target label is y = 1. Let's perform one iteration of the backpropagation algorithm to update the weights.

We start with forward propagation of the inputs:

The forward pass

The output of the network is 0.6718 while the true label is 1, hence we need to update the weights in order to increase the network's output and bring it closer to the label.

We first compute the delta at the output node. Since this is a binary classification problem, we use the log loss function, and the delta at the output is o − y.

$$\delta = o - y = 0.6718 - 1 = -0.3282$$

The delta at the output neuron

We now propagate the deltas from the output neuron back to the input layer using the delta rule:

The backward pass

Note how the deltas become progressively smaller as we go back through the layers, causing the early layers in the network to train very slowly. This phenomenon, known as the vanishing gradient problem, was one of the main reasons why backpropagation was not successful in training deep networks (until deep learning appeared).

Finally, we perform one step of gradient descent:

Gradient descent step

Let’s do one other forward pass to see if the network’s output became closer to the goal:

Another forward pass

Indeed, the output has increased from 0.6718 to 0.7043!

All images unless otherwise noted are by the author.

Thanks for reading!

References

[1] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Nature 323.6088 (1986): 533–536.
