Advantage Actor Critic (A2C)

by Thomas Simonini

⚠️ A new updated version of this article is available here 👉 https://huggingface.co/deep-rl-course/unit1/introduction

This article is part of the Deep Reinforcement Learning Class. A free course from beginner to expert. Check the syllabus here.


In Unit 5, we learned about our first Policy-Based algorithm called Reinforce.
In Policy-Based methods, we aim to optimize the policy directly without using a value function. More precisely, Reinforce is part of a subclass of Policy-Based Methods called Policy-Gradient methods. This subclass optimizes the policy directly by estimating the weights of the optimal policy using Gradient Ascent.

We saw that Reinforce worked well. However, because we use Monte-Carlo sampling to estimate the return (we use an entire episode to calculate the return), we have significant variance in the policy gradient estimation.

Remember that the policy gradient estimation is the direction of the steepest increase in return. In other words, it tells us how to update our policy weights so that actions that lead to good returns have a higher probability of being taken. The Monte Carlo variance, which we will study further in this unit, leads to slower training since we need a lot of samples to mitigate it.

Today we’ll study Actor-Critic methods, a hybrid architecture combining value-based and policy-based methods that helps stabilize training by reducing the variance:

  • An Actor that controls how our agent behaves (policy-based method)
  • A Critic that measures how good the action taken is (value-based method)

We’ll study one of these hybrid methods, Advantage Actor Critic (A2C), and train our agent using Stable-Baselines3 in robotic environments, where we’ll train two agents to walk:

  • A bipedal walker 🚶
  • A spider 🕷️

Robotics environments

Sounds exciting? Let’s start!



The Problem of Variance in Reinforce

In Reinforce, we want to increase the probability of actions in a trajectory in proportion to how high the return is.

Reinforce

  • If the return is high, we will push up the probabilities of the (state, action) combinations.
  • Otherwise, if the return is low, it will push down the probabilities of the (state, action) combinations.

This return R(τ) is calculated using Monte-Carlo sampling: we collect a trajectory, calculate the discounted return, and use this score to increase or decrease the probability of every action taken in that trajectory. If the return is good, all actions will be “reinforced” by increasing their likelihood of being taken.

R(τ) = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ...
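As a quick illustration, here is a minimal sketch (with variable names of my own) of how this discounted return can be computed from the rewards collected in one episode:

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo return R(τ) = R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... for one episode."""
    g = 0.0
    for k, reward in enumerate(rewards):
        g += (gamma ** k) * reward
    return g

# Example: three rewards collected in a short episode.
print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99² * 2.0 = 2.9602
```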

The advantage of this method is that it’s unbiased. Since we’re not estimating the return, we use only the true return we obtain.

But the problem is that the variance is high, since trajectories can lead to different returns due to the stochasticity of the environment (random events during an episode) and the stochasticity of the policy. Consequently, the same starting state can lead to very different returns, so the return starting from a given state can vary significantly across episodes.

variance

The solution is to mitigate the variance by using a large number of trajectories, hoping that the variance introduced in any one trajectory will be reduced in aggregate and provide a “true” estimation of the return.

However, increasing the batch size significantly reduces sample efficiency, so we need to find additional mechanisms to reduce the variance.




Advantage Actor Critic (A2C)



Reducing variance with Actor-Critic methods

The solution to reducing the variance of the Reinforce algorithm and training our agent faster and better is to use a combination of policy-based and value-based methods: the Actor-Critic method.

To understand the Actor-Critic, imagine you’re playing a video game. You can play with a friend who will provide you with some feedback. You’re the Actor, and your friend is the Critic.

Actor Critic

You don’t know how to play at the beginning, so you try some actions randomly. The Critic observes your actions and provides feedback.

Learning from this feedback, you’ll update your policy and get better at playing that game.

On the other hand, your friend (the Critic) will also update their way of providing feedback so it can be better next time.

This is the idea behind Actor-Critic. We learn two function approximations:

  • A policy that controls how our agent acts: π_θ(s,a)

  • A value function to assist the policy update by measuring how good the action taken is: q̂_w(s,a)



The Actor-Critic Process

Now that we’ve seen the Actor-Critic’s big picture, let’s dive deeper to understand how the Actor and Critic improve together during training.

As we saw, with Actor-Critic methods there are two function approximations (two neural networks):

  • Actor, a policy function parameterized by θ: π_θ(s,a)
  • Critic, a value function parameterized by w: q̂_w(s,a)
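To make this concrete, here is a minimal PyTorch sketch of what these two function approximations could look like. The class names, layer sizes, and the discrete-action assumption are mine, not the architecture used in the hands-on:

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy π_θ(s, a): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Categorical distribution over discrete actions, built from logits.
        return torch.distributions.Categorical(logits=self.net(state))

class QCritic(nn.Module):
    """Action-value function q̂_w(s, a): scores a (state, action) pair."""
    def __init__(self, state_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_actions, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action_onehot):
        # The action is passed in one-hot form and concatenated with the state.
        return self.net(torch.cat([state, action_onehot], dim=-1)).squeeze(-1)
```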

Let’s see the training process to understand how the Actor and Critic are optimized:

  • At each timestep t, we get the current state S_t from the environment and pass it as input through our Actor and Critic.

  • Our Policy takes the state and outputs an action A_t.

Step 1 Actor Critic

  • The Critic takes that action also as input and, using S_t and A_t, computes the value of taking that action at that state: the Q-value.

Step 2 Actor Critic

  • The action A_t performed in the environment outputs a new state S_{t+1} and a reward R_{t+1}.

Step 3 Actor Critic

  • The Actor updates its policy parameters using the Q value.

Step 4 Actor Critic

  • Thanks to its updated parameters, the Actor produces the next action to take, A_{t+1}, given the new state S_{t+1}.

  • The Critic then updates its value parameters.

Step 5 Actor Critic
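Putting these steps together, here is a simplified sketch of a single timestep of this loop. It reuses the hypothetical Actor and QCritic classes from the previous sketch and assumes a discrete-action Gymnasium environment (CartPole-v1 is only a stand-in); it illustrates the flow above, not a full A2C implementation:

```python
import gymnasium as gym
import torch
import torch.nn.functional as F

env = gym.make("CartPole-v1")  # stand-in environment with discrete actions
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

actor, critic = Actor(state_dim, n_actions), QCritic(state_dim, n_actions)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

# Step 1: get the current state S_t.
state, _ = env.reset()
state = torch.as_tensor(state, dtype=torch.float32)

# Step 2: the Actor outputs an action A_t; the Critic scores (S_t, A_t): the Q-value.
dist = actor(state)
action = dist.sample()
action_onehot = F.one_hot(action, n_actions).float()
q_value = critic(state, action_onehot)

# Step 3: performing A_t in the environment gives S_{t+1} and R_{t+1}.
next_state, reward, terminated, truncated, _ = env.step(action.item())
next_state = torch.as_tensor(next_state, dtype=torch.float32)

# Step 4: the Actor updates its policy parameters using the Q-value.
actor_loss = -dist.log_prob(action) * q_value.detach()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()

# Step 5: the updated Actor picks A_{t+1}; the Critic updates its value
# parameters toward the TD target R_{t+1} + γ q̂_w(S_{t+1}, A_{t+1}).
next_action = actor(next_state).sample()
next_onehot = F.one_hot(next_action, n_actions).float()
with torch.no_grad():
    td_target = reward + gamma * critic(next_state, next_onehot) * (1.0 - float(terminated))
critic_loss = F.mse_loss(critic(state, action_onehot), td_target)
critic_opt.zero_grad()
critic_loss.backward()
critic_opt.step()
```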



Adding the Advantage in Actor-Critic (A2C)

We can stabilize learning further by using the Advantage function as the Critic instead of the Action-value function.

The idea is that the Advantage function calculates how much better taking that action at a state is compared to the average value of the state. It subtracts the mean value of the state from the state-action pair value: A(s,a) = Q(s,a) − V(s).

Advantage Function

In other words, this function calculates the additional reward we get if we take this action at that state compared to the mean reward we get at that state.

The additional reward is what’s beyond the expected value of that state.

  • If A(s,a) > 0: our gradient is pushed in that direction.
  • If A(s,a) < 0 (our action does worse than the average value of that state), our gradient is pushed in the opposite direction.

The problem with implementing this advantage function is that it requires two value functions: Q(s,a) and V(s). Fortunately, we can use the TD error as a good estimator of the advantage function.

Advantage Function
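As a minimal sketch (assuming a state-value critic and a single transition), the one-step TD error gives an estimate of the advantage without ever learning Q(s,a):

```python
def advantage_estimate(reward, value_s, value_next_s, gamma=0.99, done=False):
    """One-step TD error as an estimate of A(s, a) = Q(s, a) - V(s).

    Only V is needed: Q(s, a) is approximated by R_{t+1} + γ V(S_{t+1}).
    """
    bootstrap = 0.0 if done else gamma * value_next_s
    return reward + bootstrap - value_s

# The action did better than the critic expected at this state → positive advantage,
# so the gradient pushes the policy toward this action.
print(advantage_estimate(reward=1.0, value_s=0.5, value_next_s=0.6))  # 1.0 + 0.99*0.6 - 0.5 = 1.094
```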



Advantage Actor Critic (A2C) using Robotics Simulations with PyBullet 🤖

Now that you’ve studied the theory behind Advantage Actor Critic (A2C), you’re ready to train your A2C agent using Stable-Baselines3 in robotic environments.

Robotics environments
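To give you a preview of the hands-on, a Stable-Baselines3 A2C training script follows this general pattern (a sketch assuming the gym + pybullet_envs setup used for this unit; the exact environment IDs, hyperparameters, and timesteps in the notebook may differ):

```python
import gym
import pybullet_envs  # registers the PyBullet robotics environments (e.g. AntBulletEnv-v0)
from stable_baselines3 import A2C

# The spider 🕷️ environment; the bipedal walker 🚶 is trained the same way.
env = gym.make("AntBulletEnv-v0")

model = A2C(policy="MlpPolicy", env=env, verbose=1)
model.learn(total_timesteps=2_000_000)  # the training budget here is just a placeholder

model.save("a2c-AntBulletEnv-v0")
```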

Start the tutorial here 👉 https://colab.research.google.com/github/huggingface/deep-rl-class/blob/main/unit7/unit7.ipynb

The leaderboard to compare your results with your classmates 🏆 👉 https://huggingface.co/spaces/chrisjay/Deep-Reinforcement-Learning-Leaderboard



Conclusion

Congrats on finishing this chapter! There was a lot of information. And congrats on finishing the tutorial. 🥳

It’s normal if you still feel confused by all these elements. This was the same for me and for everyone who has studied RL.

Take time to grasp the material before continuing. Also look at the additional reading materials we provided in this article and in the syllabus to go deeper 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit7/README.md

Don’t hesitate to train your agent in other environments. The best way to learn is to try things on your own!

In the next unit, we will learn to improve Actor-Critic methods with Proximal Policy Optimization.

And don’t forget to share with your friends who want to learn 🤗!

Finally, with your feedback, we want to improve and update the course iteratively. If you have some, please fill out this form 👉 https://forms.gle/3HgA7bEHwAmmLfwh9



Keep learning, stay awesome 🤗,


