## Complete guide to the algorithm used to coach neural networks

Within the previous article we talked about multi-layer perceptrons (MLPs) as the primary neural network model that might solve non-linear and complicated problems.

For a very long time it was not clear methods to train these networks on a given data set. While single-layer perceptrons had a straightforward learning rule that was guaranteed to converge to an answer, it couldn’t be prolonged to networks with multiple layer. The AI community has struggled with this problem for greater than 30 years (in a period often known as the “AI winter”), when eventually in 1986 Rumelhart et al. introduced the **backpropagation algorithm **of their groundbreaking paper [1].

In this text we are going to discuss the backpropagation algorithm intimately and derive its mathematical formulation step-by-step. Since that is the predominant algorithm used to coach neural networks of every kind (including the deep networks now we have today), I feel it might be useful to anyone working with neural networks to know the main points of this algorithm.

Although you will discover descriptions of this algorithm in lots of textbooks and online sources, in writing this text I actually have tried to maintain the next principles in mind:

- Use clear and consistent notations
- Explain every step of the mathematical derivation
- Derive the algorithm for probably the most general case, i.e., for networks with any variety of layers and any activation or loss functions

After deriving the backpropagation equations, a whole pseudocode for the algorithm is given after which illustrated on a numerical example.

Before reading the article, I like to recommend that you just refresh your calculus knowledge, specifically in the world of derivatives (including partial derivatives and the chain rule of derivatives).

Now grab a cup of coffee and let’s dive in 🙂

The backpropagation algorithm consists of three phases:

**Forward pass**. On this phase we feed the inputs through the network, make a prediction and measure its error with respect to the true label.**Backward pass.**We propagate the gradients of the error with respect to every considered one of the weights backward from the output layer to the input layer.**Gradient descent step**. We barely tweak the connection weights within the network by taking a step in the other way of the error gradients.

We’ll now go over each considered one of these phases in additional detail.

Within the forward pass, we propagate the inputs in a forward direction, layer-by-layer, until the output is generated. The activation of neuron *i* in layer *l* is computed using the next equation:

where *f* is the activation function, *zᵢˡ* is the web input of neuron *i* in layer *l*, *wᵢⱼˡ* is the connection weight between neuron *j* in layer *l *— 1 and neuron *i* in layer *l*, and *bᵢˡ* is the bias of neuron *i* in layer *l*. For more details on the notations and the derivation of this equation see my previous article.

To simplify the derivation of the training algorithm, we are going to treat the bias as if it were the burden *w*₀ of an input neuron *x*₀ that has a relentless value of 1. This allows us to jot down the above equation as follows:

Within the backward pass we propagate the gradients of the error from the output layer back to the input layer.

## Definition of the Error and Loss Functions

We first define the error of the network on the training set with respect to its weights. Let’s denote by **w** the vector that incorporates all of the weights of the network.

Assume that now we have *n* training samples {(*xᵢ, yᵢ*)}, *i* = 1,…,*n*, and the output of the network on sample *i* is *oᵢ*. Then the error of the network with respect to **w** is:

where *J*(*y*, *o*) is the **loss function. **The precise loss function that we use depends upon the duty the network is trying to perform:

- For regression problems, we use the
**squared loss**function:

2. For binary classification problems, we use **log loss** (also often known as the **binary cross-entropy loss**):

3. For multi-class classification problem, we use the **cross-entropy loss** function:

where *k* is the variety of classes.

The the explanation why we use these specific loss functions will probably be explained in a future article.

Our goal is to search out the weights **w** that minimize *E*(**w**). Unfortunately, this function is non-convex due to the non-linear activations of the hidden neurons. Because of this it could have multiple local minima:

There are numerous techniques that could be used to forestall gradient descent from getting stuck in an area minimum, akin to momentum. These techniques will probably be covered in future articles.

## Finding the Gradients of the Error

So as to use gradient descent, we want to compute the partial derivatives of *E*(**w**) with respect to every considered one of the weights within the network:

To simplify the mathematical derivation, we are going to assume that now we have just one training example and find the partial derivatives of the error with respect to that example:

where *y* is the label of this instance and *o* is the output of the network for that example. The extension to *n* training samples is easy, because the derivative of the sum of functions is just the sum of their derivatives.

The computation of the partial derivatives of the weights within the hidden layers will not be trivial, since those weights don’t affect directly the output (and hence the error). To deal with this problem, we are going to use the chain rule of derivatives to determine a relationship between the gradients of the error in a given layer and the gradients in the following layer.

## The Delta Terms

We first note that *E* depends upon the burden *wᵢⱼˡ* only via the web input *zᵢˡ* of neuron *i* in layer *l*. Due to this fact, we will apply the chain rule of derivatives to the gradient of *E* with respect to this weight:

The second derivative on the correct side of the equation is:

Due to this fact, we will write:

The variable *δᵢ* known as the **delta term** of neuron *i* or **delta** for brief.

## The Delta Rule

The **delta rule** establishes the connection between the delta terms in layer *l* and the delta terms in layer *l* + 1.

To derive the delta rule, we again use the chain rule of derivatives. The loss function depends upon the web input of neuron *i* only via the web inputs of all of the neurons it’s connected to in layer *l* + 1. Due to this fact we will write:

where the index *j* within the sum goes over all of the neurons in layer *l* + 1 that neuron *i* in layer *l* is connected to.

Once more we use the chain rule to decompose the second partial derivative contained in the brackets:

The primary partial derivative contained in the brackets is just the delta of neuron *j* in layer *l* + 1, subsequently we will write:

The second partial derivative is straightforward to compute:

Due to this fact we get:

But *aᵢˡ* = *f*(*zᵢˡ*), where *f* is the activation function. Hence, the partial derivative outside the sum is just the derivative of the activation function *f*’(*x*) for *x* = *zᵢˡ*.

Due to this fact we will write:

This equation, often known as the **delta rule**,** **shows the connection between the deltas in layer *l* and the deltas in layer *l* + 1. More specifically, each delta in layer *l* is a linear combination of the deltas in layer *l* + 1, where the coefficients of the mixture are the connection weights between these layers. The delta rule allows us to compute all of the delta terms (and thus all of the gradients of the error) recursively, ranging from the deltas within the output layer and going back layer-by-layer until we reach the input layer.

The next diagram illustrates the flow of the error information:

For specific activation functions, we will derive more explicit equations for the delta rule. For instance, if we use the sigmoid function then:

The derivative of the sigmoid function has a straightforward form:

Hence:

Then the delta rule for the sigmoid function gets the next form:

## The Deltas within the Output Layer

The ultimate piece of the puzzle are the delta terms within the output layer, that are the primary ones that we want to compute.

The deltas within the output layer depend each on the loss function and the activation function utilized in the output neurons:

where *f* is the activation function used to compute the output.

- In regression problems, the activation function we use within the output is the identity function
*f*(*x*) =*x*, whose derivative is 1, and the loss function is the squared loss. Due to this fact the delta is:

2. In binary classification problems, the activation function we use is sigmoid and the loss function is log loss, subsequently we get:

In other words, the delta is just the difference between the network’s output and the label.

3. In multiclass classification problems, now we have *k* output neurons (where *k* is the variety of classes) and we use softmax activation and the cross-entropy log loss. Much like the previous case, the delta term of the *i*th output neuron is surprisingly easy:

where *oᵢ* is the *i*-th component of the network’s prediction and *yᵢ* is the *i*-th component of the label. The proof is somewhat longer, and you will discover it on this well-written Medium article.

Once we finish computing all of the delta terms, we will use gradient descent to update the weights. In gradient descent, we take small steps in the other way of the gradient in an effort to catch up with to the minimum error:

Do not forget that the partial derivative of the error function with respect to every weight is:

Due to this fact, we will write the gradient descent update rule as follows:

where *α* is a learning rate that controls the step size (0 < *α *< 1). In other words, we subtract from the burden between neuron *j* in layer *l *— 1 and neuron *i* in layer *l* the delta of neuron *i* multiplied by the activation of neuron *j *(scaled by the training rate).

Gradient descent could be applied in considered one of the next modes:

**Batch gradient descent**— the weights are updated after we compute the error on the complete training set.**Stochastic gradient descent (SGD)**— a gradient descent step is performed after every training example. Typically converges faster than batch gradient descent but is less stable.**Mini-batch gradient descent**— a middle way between batch gradient descent and SGD. We use small batches of random training samples (normally between 10 to 1,000 examples) for the gradient updates. This reduces the noise in SGD but remains to be more efficient than full-batch updates, and it’s probably the most common form to coach neural networks.

We at the moment are able to present the complete algorithm in its full glory:

As an exercise, attempt to implement this algorithm in Python (or your favorite programming language).

Imagine that now we have a binary classification problem with two binary inputs and a single binary output. Our neural network has two hidden layers with the next weights:

The activation function within the hidden layers and within the output unit is the sigmoid function, and the training rate is *α *= 0.5*.*

The network is presented with a training example with the inputs *x*₁ = 1 and *x*₂ = 0, and the goal label is *y* = 1. Let’s perform one iteration of the backpropagation algorithm to update the weights.

We start with forward propagation of the inputs:

The output of the network is 0.6718 while the true label is 1, hence we want to update the weights in an effort to increase the network’s output and make it closer to the label.

We first compute the delta on the output node. Since it is a binary classification problem we use the log loss function, and the delta on the output is *o* — *y*.

We now propagate the deltas from the output neuron back to the input layer using the delta rule:

Note how the deltas turn out to be increasingly smaller as we return within the layers, causing the early layers within the network to coach very slowly. This phenomenon, often known as the **vanishing gradients**, was considered one of the predominant the explanation why backpropagation was not successful in training deep networks (until deep learning has appeared).

Finally, we perform one step of gradient descent:

Let’s do one other forward pass to see if the network’s output became closer to the goal:

Indeed, the output has increased from 0.6718 to 0.7043!

**Final Notes**

All images unless otherwise noted are by the writer.

Thanks for reading!

## References

[1] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. “Learning representations by back-propagating errors.” *nature* 323.6088 (1986): 533–536.

jazz music