Backpropagation: Step-By-Step Derivation


In the previous article we talked about multi-layer perceptrons (MLPs) as the first neural network model that could solve non-linear and complex problems.

For a long time it was not clear how to train these networks on a given data set. While single-layer perceptrons had a simple learning rule that was guaranteed to converge to a solution, it could not be extended to networks with more than one layer. The AI community struggled with this problem for more than 30 years (in a period known as the "AI winter"), until in 1986 Rumelhart et al. introduced the backpropagation algorithm in their groundbreaking paper [1].

In this article we will discuss the backpropagation algorithm in detail and derive its mathematical formulation step-by-step. Since this is the basic algorithm used to train neural networks of all kinds (including the deep networks we have today), I believe it would be useful to anyone working with neural networks to know the details of this algorithm.

Although you can find descriptions of this algorithm in many textbooks and online sources, in writing this article I have tried to keep the following principles in mind:

  • Use clear and consistent notations.
  • Explain every step of the mathematical derivation.
  • Derive the algorithm for the most general case, i.e., for networks with any number of layers and any activation or loss functions.

After deriving the backpropagation equations, complete pseudocode for the algorithm is given and then illustrated on a numerical example.

Before reading the article, I recommend that you refresh your calculus knowledge, specifically in the area of derivatives (including partial derivatives and the chain rule).

Now grab a cup of coffee and let’s dive in 🙂

The Three Phases of the Algorithm

The backpropagation algorithm consists of three phases:

  1. Forward pass. In this phase we feed the inputs through the network, make a prediction and measure its error with respect to the true label.
  2. Backward pass. We propagate the gradients of the error with respect to each one of the weights backward from the output layer to the input layer.
  3. Gradient descent. We slightly tweak the connection weights in the network by taking a step in the opposite direction of the error gradients.

We will now go over each one of these phases in more detail.

Forward Pass

In the forward pass, we propagate the inputs in a forward direction, layer by layer, until the output is generated. The activation of neuron i in layer l is computed using the following equation:

aᵢˡ = f(zᵢˡ) = f(Σⱼ wᵢⱼˡ aⱼˡ⁻¹ + bᵢˡ)

The forward pass equation

where f is the activation function, zᵢˡ is the net input of neuron i in layer l, wᵢⱼˡ is the connection weight between neuron j in layer l − 1 and neuron i in layer l, and bᵢˡ is the bias of neuron i in layer l. For more details on the notation and the derivation of this equation see my previous article.

To simplify the derivation of the learning algorithm, we will treat the bias as if it were the weight w₀ of an input neuron x₀ that has a constant value of 1. This allows us to write the above equation as follows:

aᵢˡ = f(Σⱼ wᵢⱼˡ aⱼˡ⁻¹),  where a₀ˡ⁻¹ = 1
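As an illustration (a minimal sketch, not part of the original article), the forward pass for a single layer with the bias folded into the weight list might look like this in Python; the function names and example weights are made up:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weights, prev_activations):
    """Compute the activations of one layer.

    weights[i][0] is the bias of neuron i, treated as the weight of a
    constant input x0 = 1; weights[i][j] for j >= 1 connects neuron j-1
    in layer l-1 to neuron i in layer l.
    """
    augmented = [1.0] + list(prev_activations)           # prepend x0 = 1
    activations = []
    for row in weights:
        z = sum(w * a for w, a in zip(row, augmented))   # net input z_i
        activations.append(sigmoid(z))                   # a_i = f(z_i)
    return activations

# A layer with two neurons and two inputs (first column is the bias)
a = layer_forward([[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]], [1.0, 0.0])
# Both neurons receive net input z = 1, so both output sigmoid(1) ≈ 0.731
```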

Backward Pass

In the backward pass we propagate the gradients of the error from the output layer back to the input layer.

Definition of the Error and Loss Functions

We first define the error of the network on the training set with respect to its weights. Let's denote by w the vector that contains all the weights of the network.

Assume that we have n training samples {(xᵢ, yᵢ)}, i = 1,…,n, and the output of the network on sample i is oᵢ. Then the error of the network with respect to w is:

E(w) = Σᵢ₌₁ⁿ J(yᵢ, oᵢ)

The error function

where J(y, o) is the loss function. The particular loss function that we use depends on the task the network is trying to perform:

  1. For regression problems, we use the squared loss function:

J(y, o) = ½(y − o)²

2. For binary classification problems, we use the log loss (also known as the binary cross-entropy loss):

J(y, o) = −[y log(o) + (1 − y) log(1 − o)]

3. For multi-class classification problems, we use the cross-entropy loss function:

J(y, o) = −Σᵢ₌₁ᵏ yᵢ log(oᵢ)

where k is the number of classes.
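As a sketch of how these three loss functions might look in Python (the function names are mine; the ½ factor in the squared loss is a common convention):

```python
import math

def squared_loss(y, o):
    """Squared loss for regression: J = 0.5 * (y - o)^2."""
    return 0.5 * (y - o) ** 2

def log_loss(y, o):
    """Binary cross-entropy: J = -(y log o + (1 - y) log(1 - o))."""
    return -(y * math.log(o) + (1 - y) * math.log(1 - o))

def cross_entropy(y, o):
    """Multi-class cross-entropy: J = -sum_i y_i log o_i,
    where y is a one-hot vector over k classes and o is a probability vector."""
    return -sum(yi * math.log(oi) for yi, oi in zip(y, o))
```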

The reasons why we use these specific loss functions will be explained in a future article.

Our goal is to find the weights w that minimize E(w). Unfortunately, this function is non-convex because of the non-linear activations of the hidden neurons. This means that it may have multiple local minima:

There are various techniques that can be used to prevent gradient descent from getting stuck in a local minimum, such as momentum. These techniques will be covered in future articles.

Finding the Gradients of the Error

In order to use gradient descent, we need to compute the partial derivatives of E(w) with respect to each one of the weights in the network:

∂E/∂wᵢⱼˡ

The partial derivative of E with respect to a given weight

To simplify the mathematical derivation, we will assume that we have only one training example and find the partial derivatives of the error with respect to that example:

E(w) = J(y, o)

where y is the label of this example and o is the output of the network for that example. The extension to n training samples is straightforward, since the derivative of a sum of functions is simply the sum of their derivatives.

The computation of the partial derivatives of the weights in the hidden layers is not trivial, since those weights do not affect the output (and hence the error) directly. To handle this problem, we will use the chain rule of derivatives to establish a relationship between the gradients of the error in a given layer and the gradients in the next layer.

The Delta Terms

We first note that E depends on the weight wᵢⱼˡ only via the net input zᵢˡ of neuron i in layer l. Therefore, we can apply the chain rule of derivatives to the gradient of E with respect to this weight:

∂E/∂wᵢⱼˡ = (∂E/∂zᵢˡ) · (∂zᵢˡ/∂wᵢⱼˡ)

The second derivative on the right side of the equation is:

∂zᵢˡ/∂wᵢⱼˡ = aⱼˡ⁻¹

Therefore, we can write:

∂E/∂wᵢⱼˡ = δᵢˡ aⱼˡ⁻¹,  where δᵢˡ = ∂E/∂zᵢˡ

The variable δᵢˡ is called the error term of neuron i, or the delta for short.

The Delta Rule

The delta rule establishes the connection between the delta terms in layer l and the delta terms in layer l + 1.

To derive the delta rule, we again use the chain rule of derivatives. The loss function depends on the net input of neuron i only via the net inputs of all the neurons it is connected to in layer l + 1. Therefore we can write:

δᵢˡ = ∂E/∂zᵢˡ = Σⱼ (∂E/∂zⱼˡ⁺¹) · (∂zⱼˡ⁺¹/∂zᵢˡ)

where the index j in the sum goes over all the neurons in layer l + 1 that neuron i in layer l is connected to.

Once again we use the chain rule to decompose the second partial derivative inside the brackets:

∂zⱼˡ⁺¹/∂zᵢˡ = (∂zⱼˡ⁺¹/∂aᵢˡ) · (∂aᵢˡ/∂zᵢˡ)

The first partial derivative inside the brackets is simply the delta of neuron j in layer l + 1, therefore we can write:

δᵢˡ = Σⱼ δⱼˡ⁺¹ · (∂zⱼˡ⁺¹/∂aᵢˡ) · (∂aᵢˡ/∂zᵢˡ)

The second partial derivative is straightforward to compute:

∂zⱼˡ⁺¹/∂aᵢˡ = wⱼᵢˡ⁺¹

Therefore we get:

δᵢˡ = (∂aᵢˡ/∂zᵢˡ) Σⱼ wⱼᵢˡ⁺¹ δⱼˡ⁺¹

But aᵢˡ = f(zᵢˡ), where f is the activation function. Hence, the partial derivative outside the sum is simply the derivative of the activation function f′(x) for x = zᵢˡ.

Therefore we can write:

δᵢˡ = f′(zᵢˡ) Σⱼ wⱼᵢˡ⁺¹ δⱼˡ⁺¹

The delta rule

This equation, known as the delta rule, shows the relationship between the deltas in layer l and the deltas in layer l + 1. More specifically, each delta in layer l is a linear combination of the deltas in layer l + 1, where the coefficients of the combination are the connection weights between these layers. The delta rule allows us to compute all the delta terms (and thus all the gradients of the error) recursively, starting from the deltas in the output layer and going back layer by layer until we reach the input layer.
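The recursive computation that the delta rule describes can be sketched in Python as follows (a minimal illustration with made-up names; bias terms are ignored here for simplicity):

```python
def backpropagate_deltas(next_deltas, next_weights, f_prime_z):
    """Delta rule: delta_i^l = f'(z_i^l) * sum_j w_ji^{l+1} * delta_j^{l+1}.

    next_weights[j][i] is the weight from neuron i in layer l to
    neuron j in layer l+1; f_prime_z[i] is f'(z_i^l).
    """
    deltas = []
    for i in range(len(f_prime_z)):
        weighted = sum(next_weights[j][i] * next_deltas[j]
                       for j in range(len(next_deltas)))
        deltas.append(f_prime_z[i] * weighted)
    return deltas

# Two neurons in layer l, one neuron in layer l+1 with delta 0.5:
d = backpropagate_deltas([0.5], [[2.0, -1.0]], [0.25, 0.25])
# d[0] = 0.25 * (2.0 * 0.5) = 0.25,  d[1] = 0.25 * (-1.0 * 0.5) = -0.125
```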

The following diagram illustrates the flow of the error information:

The calculation of the delta of neuron i in layer l by backpropagation of the deltas from those neurons in layer l+1 to which it is connected. The black arrows indicate the direction of flow during forward propagation, and the red arrows indicate the backward propagation of the error.

For specific activation functions, we can derive more explicit equations for the delta rule. For example, if we use the sigmoid function then:

f(z) = 1 / (1 + e⁻ᶻ)

The derivative of the sigmoid function has a simple form:

f′(z) = f(z)(1 − f(z))

Hence:

f′(zᵢˡ) = aᵢˡ(1 − aᵢˡ)

Then the delta rule for the sigmoid function takes the following form:

δᵢˡ = aᵢˡ(1 − aᵢˡ) Σⱼ wⱼᵢˡ⁺¹ δⱼˡ⁺¹

The delta rule for the sigmoid function
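A quick sketch of this shortcut in Python (names are illustrative): the derivative of the sigmoid can be computed from the activation itself, with no extra exponentials:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """f'(z) = f(z) * (1 - f(z)): the derivative is computed directly
    from the activation a = f(z)."""
    a = sigmoid(z)
    return a * (1.0 - a)

# At z = 0 the sigmoid is 0.5, so the derivative is 0.5 * 0.5 = 0.25
```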

The Deltas within the Output Layer

The final piece of the puzzle is the delta terms in the output layer, which are the first ones that we need to compute.

The deltas in the output layer depend both on the loss function and on the activation function used in the output neurons:

δ = ∂J/∂z = (∂J/∂o) · f′(z)

where f is the activation function used to compute the output.

  1. In regression problems, the activation function we use in the output is the identity function f(x) = x, whose derivative is 1, and the loss function is the squared loss. Therefore the delta is:

δ = o − y

2. In binary classification problems, the activation function we use is the sigmoid and the loss function is the log loss, therefore we get:

δ = o − y

In other words, the delta is simply the difference between the network's output and the label.

3. In multiclass classification problems, we have k output neurons (where k is the number of classes) and we use the softmax activation and the cross-entropy loss. Similar to the previous case, the delta term of the i-th output neuron is surprisingly simple:

δᵢ = oᵢ − yᵢ

where oᵢ is the i-th component of the network's prediction and yᵢ is the i-th component of the label. The proof is somewhat longer, and you can find it in this well-written Medium article.

Gradient Descent

Once we finish computing all the delta terms, we can use gradient descent to update the weights. In gradient descent, we take small steps in the opposite direction of the gradient (i.e., in the direction of steepest descent) in order to get closer to the minimum error:

Gradient descent

Recall that the partial derivative of the error function with respect to each weight is:

∂E/∂wᵢⱼˡ = δᵢˡ aⱼˡ⁻¹

Therefore, we can write the gradient descent update rule as follows:

wᵢⱼˡ ← wᵢⱼˡ − α δᵢˡ aⱼˡ⁻¹

where α is a learning rate that controls the step size (0 < α < 1). In other words, we subtract from the weight between neuron j in layer l − 1 and neuron i in layer l the delta of neuron i multiplied by the activation of neuron j (scaled by the learning rate).
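The update rule can be sketched in Python as follows (an illustrative snippet with made-up names, using the bias-as-weight convention, so prev_activations would include the constant 1 as its first entry):

```python
def update_weights(weights, deltas, prev_activations, alpha):
    """Gradient descent step for one layer: w_ij <- w_ij - alpha * delta_i * a_j."""
    return [[w - alpha * d_i * a_j
             for w, a_j in zip(row, prev_activations)]
            for row, d_i in zip(weights, deltas)]

# One neuron with two incoming weights, delta 0.5, learning rate 0.1:
w = update_weights([[1.0, 2.0]], deltas=[0.5],
                   prev_activations=[1.0, 0.0], alpha=0.1)
# w = [[1.0 - 0.1*0.5*1.0, 2.0 - 0.1*0.5*0.0]] = [[0.95, 2.0]]
```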

Gradient descent can be applied in one of the following modes:

  1. Batch gradient descent: the weights are updated after we compute the error on the entire training set.
  2. Stochastic gradient descent (SGD): a gradient descent step is performed after every training example. SGD typically converges faster than batch gradient descent but is less stable.
  3. Mini-batch gradient descent: a middle way between batch gradient descent and SGD. We use small batches of random training samples (usually between 10 and 1,000 examples) for the gradient updates. This reduces the noise in SGD but is still more efficient than full-batch updates, and it is the most common way to train neural networks.
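The three modes differ only in how many samples contribute to each gradient step. A minimal sketch of the mini-batch split (the function name is mine):

```python
import random

def minibatches(samples, batch_size, rng=random):
    """Yield shuffled mini-batches; one gradient step is taken per batch.

    batch_size = len(samples) gives batch gradient descent,
    batch_size = 1 gives stochastic gradient descent (SGD).
    """
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[k] for k in order[start:start + batch_size]]

# 10 samples in batches of 4 -> batches of size 4, 4 and 2
batches = list(minibatches(list(range(10)), batch_size=4))
```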

Backpropagation: The Complete Algorithm

We are now ready to present the complete algorithm in its full glory:

As an exercise, try to implement this algorithm in Python (or your favorite programming language).
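Here is one possible minimal implementation in Python with NumPy (sigmoid activations, log loss, one training example per step). The network shape, names and random weights are all illustrative assumptions, not the article's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(weights, x):
    """Forward pass. Each weight matrix includes the bias as column 0,
    so a constant 1 is prepended to every layer's input."""
    activations = [x]
    for W in weights:
        a = np.concatenate(([1.0], activations[-1]))
        activations.append(sigmoid(W @ a))
    return activations

def backprop_step(weights, x, y, alpha):
    """One iteration: forward pass, backward pass, gradient descent.
    Assumes sigmoid everywhere and log loss, so the output delta is o - y."""
    activations = forward(weights, x)
    delta = activations[-1] - y                  # delta at the output layer
    for l in range(len(weights) - 1, -1, -1):
        a_prev = np.concatenate(([1.0], activations[l]))
        grad = np.outer(delta, a_prev)           # dE/dW_ij = delta_i * a_j
        # Delta rule: propagate backward (dropping the bias column) and
        # multiply by the sigmoid derivative a * (1 - a).
        delta = (weights[l][:, 1:].T @ delta) * activations[l] * (1 - activations[l])
        weights[l] -= alpha * grad               # gradient descent step
    return activations[-1]

rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 3)), rng.normal(size=(1, 4))]  # a 2-3-1 network
x, y = np.array([1.0, 0.0]), np.array([1.0])
before = forward(weights, x)[-1][0]
for _ in range(100):
    backprop_step(weights, x, y, alpha=0.5)
after = forward(weights, x)[-1][0]
# After training, the output should be closer to the target y = 1
```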

Backpropagation Example

Imagine that we have a binary classification problem with two binary inputs and a single binary output. Our neural network has two hidden layers with the following weights:

The activation function in the hidden layers and in the output unit is the sigmoid function, and the learning rate is α = 0.5.

The network is presented with a training example with the inputs x₁ = 1 and x₂ = 0, and the target label is y = 1. Let's perform one iteration of the backpropagation algorithm to update the weights.

We start with forward propagation of the inputs:

The forward pass

The output of the network is 0.6718 while the true label is 1, hence we need to update the weights in order to increase the network's output and make it closer to the label.

We first compute the delta at the output node. Since this is a binary classification problem, we use the log loss function, and the delta at the output is o − y.

δ = o − y = 0.6718 − 1 = −0.3282

The delta at the output neuron

We now propagate the deltas from the output neuron back to the input layer using the delta rule:

The backward pass

Note how the deltas become increasingly smaller as we go back through the layers, causing the early layers in the network to train very slowly. This phenomenon, known as the vanishing gradient problem, was one of the main reasons why backpropagation was not successful in training deep networks (until deep learning appeared).

Finally, we perform one step of gradient descent:

Gradient descent step

Let's do another forward pass to see if the network's output became closer to the target:

Another forward pass

Indeed, the output has increased from 0.6718 to 0.7043!

All images unless otherwise noted are by the author.

Thanks for reading!

References

[1] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Nature 323.6088 (1986): 533–536.
