Reinforcement learning is a field of machine learning that introduces the concept of an agent learning optimal strategies in complex environments. The agent learns from its actions, which yield rewards depending on the environment's state. Reinforcement learning is a challenging topic and differs significantly from other areas of machine learning.
What is remarkable about reinforcement learning is that the same algorithms can be used to enable the agent to adapt to completely different, unknown, and complex conditions.
Note. To fully understand the concepts covered in this article, it is highly recommended to be familiar with the concepts discussed in the previous articles.

Reinforcement Learning
Up until now, we have only been discussing tabular reinforcement learning methods. In this context, the word "tabular" indicates that all possible actions and states can be enumerated. Consequently, the value function V or Q is represented in the form of a table, and the ultimate goal of our algorithms was to find that value function and use it to derive an optimal policy.
However, there are two major problems with tabular methods that we need to address. We will first look at them and then introduce a novel approach to overcome these obstacles.
This article is based on Chapter 9 of the book "Reinforcement Learning: An Introduction" by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.
1. Computation
The first thing to make clear is that tabular methods are only applicable to problems with a small number of states and actions. Let us recall the blackjack example where we applied the Monte Carlo method in part 3. Despite the fact that there were only 200 states and 2 actions, we obtained good approximations only after executing several million episodes!
Imagine what colossal computations we would need to perform for a more complex problem. For example, if we were dealing with RGB images of size 128 × 128, then every state consists of 3 ⋅ 128 ⋅ 128 pixel channels, each taking one of 256 values, so the total number of possible states is 256^(3 ⋅ 128 ⋅ 128) — a number with over a hundred thousand digits. Even with modern technological advancements, it would be absolutely impossible to perform the necessary computations to find the value function!
In reality, most environments in reinforcement learning problems have an enormous number of states and possible actions. Consequently, value function estimation with tabular methods is no longer applicable.
2. Generalization
Even if we imagine that there are no computational problems, we are still likely to encounter states that are never visited by the agent. How can standard tabular methods evaluate v- or q-values for such states?
This article proposes a novel approach based on supervised learning that can efficiently approximate value functions regardless of the number of states and actions.
The idea of value-function approximation lies in using a parameterized vector w to approximate the value function. From now on, we will therefore write the value function v̂ as a function of two arguments: the state s and the vector w:
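$$
\hat{v}(s, \mathbf{w}) \approx v_\pi(s),
$$

where v_π(s) denotes the true value of state s under the policy π.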
Our objective is to find v̂ and w. The function v̂ can take various forms, but the most common approach is to use a supervised learning algorithm. As it turns out, v̂ can be a linear regression, a decision tree, or even a neural network. At the same time, any state s can be represented as a set of features describing that state. These features serve as input for the algorithm v̂.
Why are supervised learning algorithms used for v̂?
It is well known that supervised learning algorithms are very good at generalization. In other words, if a subset (X₁, y₁) of a given dataset D is used for training, then the model is expected to also perform well on unseen examples X₂.
At the same time, we highlighted the generalization problem for reinforcement learning algorithms above. In this scenario, if we apply a supervised learning algorithm, then we no longer need to worry about generalization: even if the model has never seen a state, it can still produce an approximate value for it using the available features of that state.
Example
Let us return to the maze and show an example of what the value function can look like. We will represent the current state of the agent by a vector consisting of two components:
- x₁(s) is the distance between the agent and the terminal state;
- x₂(s) is the number of traps located around the agent.
For v̂, we can use the scalar product of x(s) and w. Assuming that the agent is currently located at cell B1, the value function v̂ will take the form shown in the image below:
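In its general form (the concrete feature values for cell B1 are those shown in the figure), this scalar product is simply a weighted sum of the two features:

$$
\hat{v}(s, \mathbf{w}) = w_1 x_1(s) + w_2 x_2(s).
$$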
Difficulties
With the presented idea of supervised learning, there are two principal difficulties we have to address:
1. Learned state values are no longer decoupled. In all previous algorithms we discussed, an update of a single state did not affect any other states. However, now state values depend on the vector w. If w is updated during the learning process, then the values of all other states change as well. Therefore, if w is adjusted to improve the estimate of the current state, it is likely that the estimates of some other states will become worse.
2. Supervised learning algorithms require training targets that are not available. We would like a supervised algorithm to learn the mapping between states and true value functions. The problem is that we do not have any true state values. In this case, it is not even clear how to calculate a loss function.
State distribution
We cannot completely get rid of the first problem, but what we can do is specify how important each state is to us. This can be done by creating a state distribution that maps every state to its importance weight.
This information can then be taken into account in the loss function.
Most of the time, μ(s) is chosen proportionally to how often state s is visited by the agent.
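Since μ is a probability distribution over the states, its weights are non-negative and sum to one:

$$
\mu(s) \ge 0, \qquad \sum_{s} \mu(s) = 1.
$$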
Loss function
Assuming that v̂(s, w) is differentiable, we are free to choose any loss function we like. Throughout this article, we will be looking at the example of the MSE (mean squared error). In addition, to account for the state distribution μ(s), every error term is scaled by its corresponding weight:
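$$
\overline{VE}(\mathbf{w}) = \sum_{s} \mu(s)\,\bigl[v_\pi(s) - \hat{v}(s, \mathbf{w})\bigr]^2
$$

This weighted objective is the mean squared value error, denoted VE(w), following the notation of Sutton and Barto's Chapter 9.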
In the formula above, we do not know the true state values v(s). Nevertheless, we will be able to overcome this issue in the next section.
Objective
Having defined the loss function, our ultimate goal is to find the best vector w that minimizes the objective VE(w). Ideally, we would like to converge to the global optimum, but in reality, even the most sophisticated algorithms can guarantee convergence only to a local optimum. In other words, they can find a vector w* that is optimal only within some neighbourhood of w*.
Despite this fact, in many practical cases, convergence to a local optimum is often sufficient.
Stochastic-gradient methods are among the most popular approaches to performing function approximation in reinforcement learning.
Let us assume that on iteration t, we run the algorithm through a single state example. If we denote by wₜ the weight vector at step t, then using the MSE loss function defined above, we can derive the update rule:
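$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[v_\pi(S_t) - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla \hat{v}(S_t, \mathbf{w}_t),
$$

where α is the learning rate and ∇v̂ is the gradient of v̂ with respect to w.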
We know how to update the weight vector w, but what can we use as a target in the formula above? First of all, we will change the notation a little bit. Since we cannot obtain exact true values, instead of v(S), we are going to use another letter, U, which will indicate that the true state values are approximated.
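With this notation, the update rule becomes:

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[U_t - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla \hat{v}(S_t, \mathbf{w}_t).
$$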
The ways in which the state values can be approximated are discussed in the following sections.
Gradient Monte Carlo
Monte Carlo is the simplest method that can be used to approximate true values. What makes it great is that the state values computed by Monte Carlo are unbiased! In other words, if we run the Monte Carlo algorithm for a given environment an infinite number of times, then the averaged computed state values will converge to the true state values:
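$$
\mathbb{E}\bigl[G_t \mid S_t = s\bigr] = v_\pi(s),
$$

where Gₜ is the return observed after visiting state s.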
Why do we care about unbiased estimates? According to theory, if the target values are unbiased, then SGD is guaranteed to converge to a local optimum (under appropriate learning rate conditions).
In this way, we can derive the Gradient Monte Carlo algorithm, which uses the episode returns Gₜ as values for Uₜ:
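$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[G_t - \hat{v}(S_t, \mathbf{w}_t)\bigr]\,\nabla \hat{v}(S_t, \mathbf{w}_t).
$$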
Once an entire episode has been generated, the returns are computed for every state included in the episode. The respective returns are then used during the weight vector w updates. For the next episode, new returns are calculated and used for the updates.
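To make the procedure concrete, here is a minimal sketch of Gradient Monte Carlo with a linear value function. The environment interface (env.reset(), env.step()), the feature extractor features(state), and the policy function are illustrative assumptions, not part of the original algorithm description:

```python
import numpy as np

def gradient_monte_carlo(env, features, policy, num_episodes=1000,
                         alpha=0.01, gamma=1.0):
    """Minimal sketch: Gradient Monte Carlo prediction with a linear v̂.

    Assumed (hypothetical) interface:
      - env.reset() -> state, env.step(a) -> (state, reward, done, info)
      - features(state) -> 1-D NumPy feature vector x(s)
      - policy(state) -> action
    """
    w = np.zeros(len(features(env.reset())))           # weight vector w

    for _ in range(num_episodes):
        # Generate one full episode following the policy
        episode = []                                    # list of (state, reward)
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            episode.append((state, reward))
            state = next_state

        # Walk backwards through the episode, accumulating returns G_t
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            x = features(state)
            v_hat = w @ x                               # linear estimate v̂(s, w)
            w += alpha * (G - v_hat) * x                # SGD update: ∇v̂ = x(s)

    return w
```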
As in the original Monte Carlo method, to perform an update we have to wait until the end of the episode, which can be a problem in some situations. To overcome this drawback, we have to explore other methods.
Bootstrapping
At first sight, bootstrapping seems like a natural alternative to Gradient Monte Carlo. In this version, every target is calculated using the transition reward R and the target value of the next state (or of the state n steps later in the general case):
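In the one-step case:

$$
U_t = R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w}_t).
$$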
However, there are still several difficulties that need to be addressed:
- Bootstrapped values are biased. At the beginning of the learning process, the state values v̂ and weights w are initialized randomly, so it is clear that, on average, the expected value of Uₜ will not approximate the true state values. As a consequence, we lose the guarantee of convergence to a local optimum.
- Target values depend on the weight vector. This is not typical in supervised learning algorithms and can create complications when performing SGD updates. As a result, we can no longer compute gradient values that lead to the minimization of the loss function, according to classical SGD theory.
The good news is that both of these problems can be overcome with semi-gradient methods.
Semi-gradient methods
Despite the loss of important convergence guarantees, it turns out that using bootstrapping under certain constraints on the value function (discussed in the next section) can still lead to good results.
As we have already seen in part 5, compared to Monte Carlo methods, bootstrapping offers faster learning and can be performed online, so it is usually preferred in practice. Logically, these benefits also hold for gradient methods.
Let us look at the particular case where the value function is a scalar product of the weight vector w and the feature vector x(s):
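$$
\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top \mathbf{x}(s) = \sum_{i} w_i\, x_i(s)
$$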
This is the simplest form the value function can take. Moreover, the gradient of the scalar product with respect to w is simply the feature vector itself:
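$$
\nabla_{\mathbf{w}}\, \hat{v}(s, \mathbf{w}) = \mathbf{x}(s)
$$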
As a result, the update rule for this case is extremely simple:
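$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha\,\bigl[U_t - \mathbf{w}_t^\top \mathbf{x}(S_t)\bigr]\,\mathbf{x}(S_t)
$$

Combining this update with the one-step bootstrapped target above yields semi-gradient TD(0). The sketch below uses the same assumed interface as the Gradient Monte Carlo example (a hypothetical env with reset()/step(), a user-supplied features function, and a fixed policy):

```python
import numpy as np

def semi_gradient_td0(env, features, policy, num_episodes=1000,
                      alpha=0.01, gamma=1.0):
    """Minimal sketch: semi-gradient TD(0) with a linear v̂(s, w) = wᵀx(s).

    Same assumed interface as the Gradient Monte Carlo sketch:
    env.reset() -> state, env.step(a) -> (state, reward, done, info),
    features(state) -> 1-D NumPy array x(s), policy(state) -> action.
    """
    w = np.zeros(len(features(env.reset())))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            x = features(state)
            # Bootstrapped target U_t = R + γ·v̂(S_{t+1}, w); terminal value is 0
            target = reward if done else reward + gamma * (w @ features(next_state))
            # Semi-gradient update: the target is treated as a constant,
            # so the gradient is just x(S_t)
            w += alpha * (target - w @ x) * x

            state = next_state

    return w
```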
The choice of a linear function is particularly attractive because, from a mathematical perspective, value approximation problems become much easier to analyse.
Instead of the SGD algorithm, it is also possible to use the method of least squares.
Linear function in gradient Monte Carlo
The choice of a linear function makes the optimization problem convex. Therefore, there is only one optimum.
In this case, for Gradient Monte Carlo (if its learning rate α is adjusted appropriately), an important conclusion can be made:
Since the Gradient Monte Carlo method is guaranteed to converge to a local optimum, it is automatically guaranteed that the local optimum it finds is also the global one when a linear value approximation function is used.
Linear function in semi-gradient methods
According to theory, with a linear value function, semi-gradient one-step TD algorithms also converge. The only subtlety is that the convergence point (which is called the TD fixed point) is located near the global optimum rather than exactly at it. Despite this, the approximation quality at the TD fixed point is often sufficient for most tasks.
In this article, we have seen the scalability limitations of standard tabular algorithms. This led us to the exploration of value-function approximation methods. They allow us to view the problem from a slightly different angle, elegantly transforming the reinforcement learning problem into a supervised machine learning task.
Our previous knowledge of Monte Carlo and bootstrapping methods helped us develop their respective gradient versions. While Gradient Monte Carlo comes with stronger theoretical guarantees, bootstrapping (especially the one-step TD algorithm) remains a preferred method due to its faster convergence.
All images unless otherwise noted are by the author.