Reinforcement learning is a domain of machine learning that introduces the concept of an agent learning optimal strategies in complex environments. The agent learns from its actions, which result in rewards, based on the environment's state. Reinforcement learning is a challenging topic and differs significantly from other areas of machine learning.
What is remarkable about reinforcement learning is that the same algorithms can be used to enable the agent to adapt to completely different, unknown, and complex conditions.
Note. To fully understand the concepts included in this article, it is highly recommended to be familiar with dynamic programming and Monte Carlo methods discussed in previous articles.
- In part 2, we explored the dynamic programming (DP) approach, where the agent iteratively updates V- / Q-functions and its policy based on previous calculations, replacing them with new estimates.
- In parts 3 and 4, we introduced Monte Carlo (MC) methods, where the agent learns from experience acquired by sampling episodes.
Temporal-difference (TD) learning algorithms, which we will focus on in this article, combine principles from both of these approaches:
- Similar to DP, TD algorithms update estimates based on the knowledge of previous estimates. As seen in part 2, state updates can be performed without the updated values of other states, a technique referred to as bootstrapping, which is a key feature of DP.
- Similar to MC, TD algorithms do not require knowledge of the environment's dynamics because they also learn from experience.
This article is based on Chapter 6 of the book “Reinforcement Learning” written by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.
As we already know, Monte Carlo algorithms learn from experience by generating an episode and observing rewards for each visited state. State updates are performed only after the episode ends.
Temporal-difference algorithms operate similarly, with the one key difference being that they do not wait until the end of an episode to update states. Instead, each state is updated n time steps after it was visited (n is the algorithm's parameter). During these n observed time steps, the algorithm calculates the received reward and uses that information to update the previously visited state.
A temporal-difference algorithm that performs state updates after n time steps is denoted as TD(n).
The simplest version of TD performs updates at the next time step (n = 1) and is referred to as one-step TD.
At the end of the previous part, we introduced the constant-α MC algorithm. It turns out that the pseudocode for one-step TD is almost identical, except for the state update, as shown below:
Since TD methods do not wait until the end of the episode and make updates using current estimates, they are said to use bootstrapping, like DP algorithms.
The expression in the brackets in the update formula is called the TD error:
In this equation, γ is the discount factor, which takes values between 0 and 1 and defines the importance weight of the current reward compared to future rewards.
The TD error plays an important role. As we will see later, TD algorithms can be adapted based on the form of the TD error.
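To make this concrete, here is a minimal Python sketch of the one-step TD (TD(0)) value update; the environment interface (env.reset / env.step) and the policy function are hypothetical and serve only to illustrate the update rule above.

```python
def td0_prediction(env, policy, alpha=0.1, gamma=0.9, num_episodes=1000):
    # Tabular TD(0) prediction: V is a dictionary of state-value estimates,
    # with unseen states defaulting to 0.
    V = {}
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error: reward + discounted next-state value - current value
            td_error = reward + gamma * V.get(next_state, 0.0) - V.get(state, 0.0)
            # shift the current estimate by a fraction (alpha) of the TD error
            V[state] = V.get(state, 0.0) + alpha * td_error
            state = next_state
    return V
```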
At first sight, it might seem unclear how using only the current transition reward and the state values of the current and next states can be useful for finding an optimal strategy. It will be easier to understand if we take a look at an example.
Let us imagine a simplified version of the famous “Copa America” soccer tournament, which regularly takes place in South America. In our version, in every Copa America tournament, our team faces 6 opponents in the same order. Though the system is not real, we will omit complex details to better understand the example.
We would like to create an algorithm that predicts our team's total goal difference after a sequence of matches. The table below shows the team's results obtained in a recent edition of the Copa America.
To better dive into the data, let us visualize the results. The initial algorithm estimates are shown by the yellow line in the diagram below. The obtained cumulative goal difference (last table column) is depicted in black.
Roughly speaking, our objective is to update the yellow line in a way that better adapts to changes based on the recent match results. For that, we will compare how the constant-α Monte Carlo and one-step TD algorithms handle this task.
Constant-α Monte Carlo
The Monte Carlo method calculates the cumulative reward G of the episode, which in our case is the total goal difference after all matches (+3). Then, every state is updated proportionally to the difference between the total episode's reward and the current state's value.
For instance, let us take the state after the third match against Peru (we will use the learning rate α = 0.5); the arithmetic is also reproduced in a short snippet after this list.
- The initial state’s value is v = 1.2 (yellow point corresponding to Chile).
- The cumulative reward is G = 3 (black dashed line).
- The difference between the two values G – v = 1.8 is then multiplied by α = 0.5, which gives an update step of Δ = 0.9 (red arrow corresponding to Chile).
- The new state's value becomes v = v + Δ = 1.2 + 0.9 = 2.1 (red point corresponding to Chile).
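As a quick sanity check, the constant-α MC update can be reproduced in a few lines of Python using the numbers listed above:

```python
# Constant-alpha MC update for the state from the example above
alpha = 0.5
v, G = 1.2, 3.0              # initial state value and episode return
delta = alpha * (G - v)      # 0.5 * 1.8 = 0.9
v_new = v + delta            # 1.2 + 0.9 = 2.1
print(delta, v_new)
```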
One-step TD
For this demonstration, we will take the total goal difference after the fourth match against Chile (the arithmetic is again reproduced in a short snippet after this list).
- The initial state’s value is v[t] = 1.5 (yellow point corresponding to Chile).
- The next state's value is v[t+1] = 2.1 (yellow point corresponding to Ecuador).
- The difference between consecutive state values is v[t+1]–v[t] = 0.6 (yellow arrow corresponding to Chile).
- Since our team won against Ecuador 5 : 0, the transition reward from state t to t + 1 is R = 5 (black arrow corresponding to Ecuador).
- The TD error measures how much greater the obtained reward is compared to the difference between state values. In our case, TD error = R – (v[t+1] – v[t]) = 5 – 0.6 = 4.4 (red transparent arrow corresponding to Chile).
- The TD error is multiplied by the learning rate α = 0.5, which results in the update step β = 2.2 (red arrow corresponding to Chile).
- The new state's value is v[t] = v[t] + β = 1.5 + 2.2 = 3.7 (red point corresponding to Chile).
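The same arithmetic, written in Python, with the error expression exactly as it is used in this example:

```python
# One-step TD update for the state from the example above
alpha = 0.5
v_t, v_next, R = 1.5, 2.1, 5.0      # current value, next value, transition reward
td_error = R - (v_next - v_t)       # 5 - 0.6 = 4.4
beta = alpha * td_error             # update step = 2.2
v_t_new = v_t + beta                # 1.5 + 2.2 = 3.7
print(td_error, beta, v_t_new)
```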
Comparison
Convergence
We can clearly see that the Monte Carlo algorithm pushes the initial estimates towards the episode's return. At the same time, one-step TD uses bootstrapping and updates every estimate with respect to the next state's value and its immediate reward, which generally makes it quicker to adapt to any changes.
For instance, let us take the state after the first match. We know that in the second match our team lost to Argentina 0 : 3. However, both algorithms react completely differently to this scenario:
- Despite the negative result, Monte Carlo only considers the overall goal difference after all matches and pushes the current state's value up, which is not logical.
- One-step TD, on the other hand, takes the obtained result into account and instantly updates the state's value downwards.
This example demonstrates that, in the long run, one-step TD performs more adaptive updates, leading to a better convergence rate than Monte Carlo.
The theory guarantees convergence to the correct value function in TD methods.
Update
- Monte Carlo requires the episode to end before making state updates.
- One-step TD allows updating the state immediately after receiving the action's reward.
In many cases, this update aspect is a significant advantage of TD methods because, in practice, episodes can be very long. In Monte Carlo methods, the entire learning process is then delayed until the end of an episode. That is why TD algorithms learn faster.
After covering the basics of TD learning, we can now move on to concrete algorithm implementations. In the following sections, we will focus on the three most popular TD variations:
- Sarsa
- Q-learning
- Expected Sarsa
As we learned in the introduction to Monte Carlo methods in part 3, to find an optimal strategy, we need to estimate the state-action function Q rather than the value function V. To accomplish this effectively, we adjust the problem formulation by treating state-action pairs as states themselves. Sarsa is an algorithm that operates on this principle.
To perform state updates, Sarsa uses the same formula as for one-step TD defined above, but this time it replaces the state values with the Q-function values:
The Sarsa name is derived from its update rule, which uses 5 variables in the order: (S[t], A[t], R[t + 1], S[t + 1], A[t + 1]).
Sarsa control operates similarly to Monte Carlo control, updating the current policy greedily with respect to the Q-function using ε-soft or ε-greedy policies.
Sarsa is an on-policy method because it updates Q-values based on the current policy followed by the agent.
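Below is a rough Python sketch of the Sarsa control loop under an ε-greedy policy. The environment interface and helper names are assumptions made for illustration, not part of the pseudocode above.

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    # pick a random action with probability eps, otherwise the greedy one
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

def sarsa(env, actions, alpha=0.1, gamma=0.9, eps=0.1, num_episodes=1000):
    Q = {}  # Q-values for (state, action) pairs; unseen pairs default to 0
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, eps)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, actions, eps)
            # Sarsa target uses the action actually chosen by the current policy
            target = reward + gamma * Q.get((next_state, next_action), 0.0) * (not done)
            q_sa = Q.get((state, action), 0.0)
            Q[(state, action)] = q_sa + alpha * (target - q_sa)
            state, action = next_state, next_action
    return Q
```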
Q-learning is one of the most popular algorithms in reinforcement learning. It is almost identical to Sarsa except for a small change in the update rule:
The only difference is that we replace the next Q-value with the maximum Q-value of the next state, based on the optimal action that leads to that state. In practice, this substitution makes Q-learning more performant than Sarsa in most problems.
At the same time, if we carefully observe the formula, we can notice that the entire expression is derived from the Bellman optimality equation. From this perspective, the Bellman equation guarantees that the iterative updates of Q-values lead to their convergence to the optimal Q-values.
Q-learning is an off-policy algorithm: it updates Q-values based on the best possible decision that can be taken, without considering the behaviour policy used by the agent.
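In code, only the update target changes compared to the Sarsa sketch above; a hypothetical single-step update function could look like this:

```python
def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    # Q-learning target: bootstrap from the best action in the next state,
    # regardless of which action the behaviour policy will actually take.
    best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next
    q_sa = Q.get((state, action), 0.0)
    Q[(state, action)] = q_sa + alpha * (target - q_sa)
```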
Expected Sarsa is an algorithm derived from Q-learning. Instead of using the maximum Q-value, it calculates the expected Q-value of the next state based on the probabilities of taking each action under the current policy.
Compared to normal Sarsa, Expected Sarsa requires more computations, but in return it takes more information into account at every update step. As a result, Expected Sarsa mitigates the impact of transition randomness when choosing the next action, particularly during the initial stages of learning. Therefore, Expected Sarsa offers the advantage of greater stability across a broader range of learning step-sizes α than normal Sarsa.
Expected Sarsa is an on-policy method but can be adapted to an off-policy variant simply by employing separate behaviour and target policies for data generation and learning, respectively.
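A sketch of the corresponding update, where policy_probs(state, action) is an assumed helper returning the probability π(a | s) under the current policy:

```python
def expected_sarsa_update(Q, state, action, reward, next_state, actions,
                          policy_probs, alpha=0.1, gamma=0.9, done=False):
    # Expected Sarsa target: average of next-state Q-values weighted by the
    # probability of taking each action under the current policy.
    expected_q = 0.0 if done else sum(
        policy_probs(next_state, a) * Q.get((next_state, a), 0.0) for a in actions
    )
    target = reward + gamma * expected_q
    q_sa = Q.get((state, action), 0.0)
    Q[(state, action)] = q_sa + alpha * (target - q_sa)
```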
Up until this point, we have been discussing a set of algorithms, all of which utilize the maximization operator during greedy policy updates. However, in practice, taking the max over all values leads to an overestimation of values. This issue particularly arises at the start of the learning process, when Q-values are initialized randomly. Consequently, calculating the maximum over these initial noisy values often results in an upward bias.
For instance, imagine a state S where the true Q-values for every action are equal to Q(S, a) = 0. Due to random initialization, some initial estimates will fall below zero and the other part will be above 0, as the short simulation after this list illustrates.
- The maximum of the true values is 0.
- The maximum of the random estimates is a positive value (which is called maximization bias).
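This effect is easy to reproduce numerically; the snippet below (with arbitrarily chosen numbers of actions and trials) compares the maximum of the true values with the average maximum of noisy estimates:

```python
import random

random.seed(0)
num_actions, num_trials = 10, 10_000

# True Q-values are all 0, so the maximum over the true values is 0.
# Noisy estimates are drawn from a standard normal distribution around 0,
# and the maximum over them is positive on average (maximization bias).
avg_max_estimate = sum(
    max(random.gauss(0.0, 1.0) for _ in range(num_actions))
    for _ in range(num_trials)
) / num_trials
print(avg_max_estimate)  # noticeably above 0 (around 1.5 for 10 actions)
```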
Example
Let us consider an example from the Sutton and Barto book where maximization bias becomes a problem. We are dealing with the environment shown in the diagram below, where C is the initial state, and A and D are terminal states.
The transition reward from C to either B or D is 0. However, transitioning from B to A results in a reward sampled from a normal distribution with a mean of -0.1 and variance of 1. In other words, this reward is negative on average but can occasionally be positive.
Basically, in this environment the agent faces a binary choice: whether to move left or right from C. The expected return is clear in both cases: the left trajectory results in an expected return G = -0.1, while the right path yields G = 0. Clearly, the optimal strategy consists of always going to the right side.
On the other hand, if we fail to address the maximization bias, then the agent is very likely to prioritize the left direction during the learning process. Why? The maximum calculated from the normal distribution will result in positive updates to the Q-values in state B. As a result, when the agent starts from C, it will greedily choose to move to B rather than to D, whose Q-value remains at 0.
To gain a deeper understanding of why this happens, let us perform several calculations using the following parameters:
- learning rate α = 0.1
- discount rate γ = 0.9
- all initial Q-values are set to 0.
Iteration 1
In the first iteration, the Q-values for going to B and D are both equal to 0. Let us break the tie by arbitrarily choosing B. Then, the Q-value for the state (C, ←) is updated. For simplicity, let us assume that the maximum value from the defined distribution is a finite value of 3. In reality, this value is greater than the 99th percentile of our distribution:
The agent then moves to A with the sampled reward R = -0.3.
Iteration 2
The agent reaches the terminal state A and a new episode begins. Starting from C, the agent faces the choice of whether to go to B or D. In our conditions, with an ε-greedy strategy, the agent will almost certainly opt to go to B:
The analogous update is then performed on the state (C, ←). As a result, its Q-value only gets bigger:
Despite sampling a negative reward R = -0.4 and updating B further down, this does not alter the situation because the maximum always remains at 3.
The second iteration terminates, and it has only made the left direction more of a priority for the agent than the right one. As a result, the agent will continue making its initial moves from C to the left, believing it to be the optimal choice, when in fact it is not.
One of the most elegant solutions for eliminating maximization bias consists of using the double learning algorithm, which symmetrically uses two Q-function estimates.
Suppose we need to determine the maximizing action and its corresponding Q-value to perform an update. The double learning approach operates as follows (a short code sketch is given after this description):
- Use the first function Q₁ to find the maximizing action a⁎ = argmaxₐQ₁(a).
- Use the second function Q₂ to estimate the value of the chosen action a⁎.
Both functions Q₁ and Q₂ can also be used in reverse order.
In double learning, only one estimate Q (not both) is updated on every iteration.
While the first Q-function selects the best action, the second Q-function provides its unbiased estimation.
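Here is a minimal Python sketch of a single double Q-learning update step. The dictionary-based tables and the 0.5 coin flip are assumptions for illustration; note that references differ slightly on which table selects the action and which evaluates it, and this sketch follows the convention of the worked example below (the non-updated table selects, the updated table evaluates).

```python
import random

def double_q_update(Q1, Q2, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.9, done=False):
    # With probability 0.5 update Q1, otherwise Q2 (only one table per step).
    if random.random() < 0.5:
        updated, selector = Q1, Q2
    else:
        updated, selector = Q2, Q1
    # The non-updated table selects the maximizing action in the next state,
    # while the updated table supplies the value estimate for that action.
    best_action = max(actions, key=lambda a: selector.get((next_state, a), 0.0))
    bootstrap = 0.0 if done else updated.get((next_state, best_action), 0.0)
    target = reward + gamma * bootstrap
    q_sa = updated.get((state, action), 0.0)
    updated[(state, action)] = q_sa + alpha * (target - q_sa)
```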
Example
We will be looking at an example of how double learning is applied to Q-learning.
Iteration 1
To illustrate how double learning operates, let us consider a maze where the agent can move one step in any of 4 directions during each iteration. Our objective is to update the Q-function using the double Q-learning algorithm. We will use the learning rate α = 0.1 and the discount rate γ = 0.9.
For the first iteration, the agent starts at cell S = A2 and, following the current policy, moves one step right to S' = B2 with a reward of R = 2.
We assume that we have to use the second update equation in the pseudocode shown above. Let us rewrite it:
Since our agent moves to state S' = B2, we need to use its Q-values. Let us look at the current Q-table of state-action pairs including B2:
We need to find an action for S' = B2 that maximizes Q₁ and ultimately use the respective Q₂-value for that same action.
- The maximum Q₁-value is achieved by taking the ← action (q = 1.2, red circle).
- The corresponding Q₂-value for the action ← is q = 0.7 (yellow circle).
Let us rewrite the update equation in a simpler form:
Assuming that the initial estimate is Q₂(A2, →) = 0.5, we can insert the values and perform the update:
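Assuming the simplified equation has the usual TD form Q ← Q + α(R + γ·q − Q), where q is the selected Q₂-value, plugging in the numbers above gives:

```python
# Values from this iteration of the maze example
alpha, gamma = 0.1, 0.9
R = 2.0               # reward for the move from A2 to B2
q_selected = 0.7      # Q2-value of the action maximizing Q1 in B2
q_old = 0.5           # initial estimate Q2(A2, right)
q_new = q_old + alpha * (R + gamma * q_selected - q_old)
print(q_new)          # 0.713
```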
Iteration 2
The agent is now located at B2 and has to choose the next action. Since we are dealing with two Q-functions, we have to find their sum:
Depending on the type of our policy, we have to sample the next action from a distribution. For instance, if we use an ε-greedy policy with ε = 0.08, then the action distribution will have the following form:
We will suppose that, with the 94% probability, we have sampled the ↑ action. That means the agent will move next to the S' = B3 cell. The reward it receives is R = -3.
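The 94% figure follows directly from the ε-greedy definition: the greedy action receives probability 1 - ε + ε/|A|, while every other action receives ε/|A|.

```python
# epsilon-greedy action probabilities with epsilon = 0.08 and 4 actions
eps, n_actions = 0.08, 4
p_greedy = 1 - eps + eps / n_actions   # 0.94
p_other = eps / n_actions              # 0.02
print(p_greedy, p_other)
```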
For this iteration, we assume that we have sampled the first update equation for the Q-function. Let us break it down:
We need to know the Q-values for all actions corresponding to B3. Here they are:
Since this time we use the first update equation, we take the maximum Q₂-value (red circle) and use the respective Q₁-value (yellow circle). Then we can rewrite the equation in a simplified form:
After making all value substitutions, we can calculate the new Q-value:
We have looked at an example of double Q-learning, which mitigates the maximization bias in the Q-learning algorithm. This double learning approach can also be extended to the Sarsa and Expected Sarsa algorithms.
Instead of choosing which update equation to use with probability p = 0.5 on each iteration, double learning can be adapted to alternate between both equations iteratively.
Despite their simplicity, temporal-difference methods are among the most widely used techniques in reinforcement learning today. What is also interesting is that they are extensively applied in other prediction problems such as time series analysis, stock prediction, or weather forecasting.
So far, we have been discussing only a particular case of TD learning where n = 1. As we will see in the next article, it can be beneficial to set n to higher values in certain situations.
We have not covered it yet, but it turns out that control for TD algorithms can be implemented via actor-critic methods, which will be discussed in this series in the future. For now, we have only reused the idea of GPI introduced in dynamic programming algorithms.
All images unless otherwise noted are by the author.