
Solving Reinforcement Learning Racetrack Exercise with Off-policy Monte Carlo Control


Image generated by Midjourney with a paid subscription, which complies with their general commercial terms [1].

In the section Off-policy Monte Carlo Control of the book Reinforcement Learning: An Introduction, 2nd Edition (page 112), the author left us with an interesting exercise: using the weighted importance sampling off-policy Monte Carlo method to find the fastest way to drive on both tracks. This exercise is comprehensive in that it asks us to consider and build almost every component of a reinforcement learning task, such as the environment, agent, reward, actions, conditions of termination, and the algorithm itself. Solving this exercise is fun and helps us build a solid understanding of the interaction between algorithm and environment, the importance of a correct episodic task definition, and how the value initialization affects the training outcome. Through this post, I hope to share my understanding of and solution to this exercise with everyone interested in reinforcement learning.

As mentioned above, this exercise asks us to find a policy that makes a race car drive from the starting line to the finish line as fast as possible without running into gravel or off the track. After carefully reading the exercise description, I listed some key points that are essential to completing this task:

  • Map representation: maps in this context are actually 2D matrices with (row_index, column_index) as coordinates. The value of each cell represents the state of that cell; for instance, we can use 0 to denote gravel, 1 for the track surface, 0.4 for the starting region, and 0.8 for the finish line. Any row and column index outside the matrix can be regarded as out-of-boundary.
  • Car representation: we can directly use the matrix's coordinates to represent the car's position;
  • Velocity and control: the velocity space is discrete and consists of horizontal and vertical speeds that can be represented as a tuple (row_speed, col_speed). The speed on both axes is bounded within (-5, 5) and is incremented by +1, 0, or -1 on each axis at each step; therefore, there are a total of 9 possible actions at each step (see the small sketch after this list). The speed cannot be zero on both axes except at the starting line, and the vertical velocity, or row speed, cannot point back toward the starting line, as we don't want our car to drive backwards.
  • Reward and episode: the reward for every step before crossing the finish line is -1. When the car runs off the track, it will be reset to one of the starting cells. The episode ends ONLY when the car successfully crosses the finish line.
  • Starting states: we randomly select a starting cell for the car from the starting line; the car's initial speed is (0, 0) according to the exercise's description.
  • Zero-acceleration challenge: the author proposes a small zero-acceleration challenge: at each time step, with probability 0.1, the action will not take effect and the car will remain at its previous speed. We will implement this challenge in training instead of adding the feature to the environment.
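Concretely, the nine actions in the velocity-and-control point above can be encoded as a small mapping from an integer action index to a (row_acceleration, col_acceleration) increment. The snippet below is only an illustrative sketch; the constant name is mine and not taken from the exercise code:

from itertools import product

# Map each of the 9 discrete actions to a (row_acceleration, col_acceleration)
# increment: each axis changes by -1, 0, or +1 per step, giving 3 x 3 = 9 combinations.
ACTION_TO_ACCELERATION = {
    idx: acc for idx, acc in enumerate(product((-1, 0, 1), repeat=2))
}

print(len(ACTION_TO_ACCELERATION))   # 9
print(ACTION_TO_ACCELERATION[4])     # (0, 0) -- the zero-acceleration action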

The solution to the exercise is separated into two parts; in this first part, we will focus on building the racetrack environment. The file structure of this exercise is as follows:

|-- race_track_env
| |-- maps
| | |-- build_tracks.py // this file is used to generate track maps
| | |-- track_a.npy // track a data
| | |-- track_b.npy // track b data
| |-- race_track.py // race track environment
|-- exercise_5_12_racetrack.py // the solution to this exercise

The libraries used in this implementation are as follows:

python==3.9.16
numpy==1.24.3
matplotlib==3.7.1
pygame==2.5.0

We represent track maps as 2D matrices with different values indicating track states. I want to be loyal to the exercise, so I'm trying to build the same maps shown in the book by assigning matrix values manually. The maps will be saved as separate .npy files so that the environment can read them during training instead of generating them at runtime.
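To give a rough idea of how such a map can be built and saved, here is a minimal sketch with a tiny made-up layout; the real build_tracks.py assigns values cell by cell to reproduce the two tracks from the book, using the encoding listed earlier (0 for gravel, 1 for track, 0.4 for the starting line, 0.8 for the finish line):

import numpy as np

# Cell encoding used throughout this post:
# 0.0 = gravel, 1.0 = track surface, 0.4 = starting line, 0.8 = finish line.
GRAVEL, TRACK, START, FINISH = 0.0, 1.0, 0.4, 0.8

# A tiny illustrative layout; the real track_a.npy / track_b.npy matrices are
# larger and follow the figures in the book.
toy_track = np.full((6, 5), GRAVEL)
toy_track[1:5, 1:4] = TRACK      # carve out the track surface
toy_track[5, 1:4] = START        # bottom row holds the starting line
toy_track[1, 4] = FINISH         # one finish cell on the right edge

np.save("toy_track.npy", toy_track)   # saved as .npy so the environment can load it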

The two maps look as follows; the light cells are gravel, the dark cells are track surfaces, the green-ish cells represent the finish line, and the light-blue-ish cells represent the starting line.

Figure 1 Track maps with the 2D matrix representation. Source: Image by the author

With the track maps ready, we now proceed to create a gym-like environment with which the algorithm can interact. gymnasium, the successor of OpenAI Gym now maintained by the Farama Foundation, provides a simple interface for testing RL algorithms; we will use the gymnasium package to create the racetrack environment. Our environment will include the following components/features:

  • env.nS: the shape of the observation space, which is (num_rows, num_cols, row_speed, col_speed) in this example. The number of rows and columns varies between maps, but the speed space is consistent across tracks. For the row speed, as we don't want the car to go back toward the starting line, the row speed observations consist of [-4, -3, -2, -1, 0] (negative values because we expect the car to go upwards in the map), five values in total; the column speed has no such limit, so its observations range from -4 to 4, nine values in total. Therefore, the shape of the observation space in the racetrack example is (num_rows, num_cols, 5, 9).
  • env.nA: the number of possible actions. There are 9 possible actions in our implementation; we will also create a dictionary in the environment class to map the integer action to the (row_speed, col_speed) tuple representation to control the agent.
  • env.reset(): the function that takes the car back to one of the starting cells when the episode finishes or the car runs off the track;
  • env.step(action): the step function enables the algorithm to interact with the environment by taking an action and returning a tuple of (next_state, reward, is_terminated, is_truncated).
  • State-checking functions: two private functions that help to check whether the car has left the track or crossed the finish line;
  • Rendering functions: a rendering function is, from my point of view, crucial for a customized environment; I always check whether the environment has been built properly by rendering the game space and the agent's behavior, which is more straightforward than just inspecting logging information.

With these features in mind, we can start building the racetrack environment.
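The full implementation lives in race_track.py; the condensed sketch below only shows how the pieces above fit together. The method bodies are simplified, rendering is omitted, the class keeps a plain gymnasium-style reset/step interface instead of the full gymnasium subclass, and helper names such as _check_finish are my own assumptions rather than the exact ones in the repo:

import numpy as np

class RaceTrackEnv:
    """Condensed sketch of a gym-like racetrack environment (not the full code)."""

    def __init__(self, map_path):
        self.track = np.load(map_path)
        num_rows, num_cols = self.track.shape
        # Observation space: (row, col, row_speed, col_speed); 5 row speeds, 9 column speeds.
        self.nS = (num_rows, num_cols, 5, 9)
        self.nA = 9
        # Map integer actions to (row_acceleration, col_acceleration) increments.
        self.action_map = {i: (r, c) for i, (r, c) in
                           enumerate((r, c) for r in (-1, 0, 1) for c in (-1, 0, 1))}
        self.start_cells = list(zip(*np.where(self.track == 0.4)))
        self.state = None  # (row, col, row_speed, col_speed)

    def reset(self):
        # Put the car on a random starting cell with zero speed.
        row, col = self.start_cells[np.random.randint(len(self.start_cells))]
        self.state = (int(row), int(col), 0, 0)
        return self.state, {}

    def step(self, action):
        row, col, row_speed, col_speed = self.state
        acc_row, acc_col = self.action_map[action]
        # Row speed stays in [-4, 0] (the car drives "up" the map); column speed in [-4, 4].
        row_speed = int(np.clip(row_speed + acc_row, -4, 0))
        col_speed = int(np.clip(col_speed + acc_col, -4, 4))
        if row_speed == 0 and col_speed == 0 and self.track[row, col] != 0.4:
            row_speed = -1  # both speeds may only be zero on the starting line
        row, col = row + row_speed, col + col_speed

        if self._check_finish(row, col):
            # The episode terminates only when the car crosses the finish line.
            return (row, col, row_speed, col_speed), -1, True, False, {}
        if self._check_out_of_track(row, col):
            # Running off the track resets the car but does NOT end the episode.
            state, _ = self.reset()
            return state, -1, False, False, {}
        self.state = (row, col, row_speed, col_speed)
        return self.state, -1, False, False, {}

    def _check_finish(self, row, col):
        # Simplified: only the landing cell is checked, not the full path of the move.
        inside = 0 <= row < self.track.shape[0] and 0 <= col < self.track.shape[1]
        return inside and self.track[row, col] == 0.8

    def _check_out_of_track(self, row, col):
        outside = (row < 0 or row >= self.track.shape[0]
                   or col < 0 or col >= self.track.shape[1])
        return outside or self.track[row, col] == 0.0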

So far, with everything needed for the racetrack environment ready, we can test/verify our environment setup.

First, we render the gameplay to check whether every component is working smoothly:

Figure 2 Agents driving on both tracks with a random policy. Source: Gif by the author

Then we turn off the render option and let the environment run in the background to check whether the trajectory terminates properly; a minimal version of this test loop is sketched below.
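Assuming the RaceTrackEnv sketch above, the test can look like this:

import numpy as np

# Roll out a few episodes with a uniformly random policy and log the terminal step.
env = RaceTrackEnv("race_track_env/maps/track_a.npy")
for episode in range(4):
    state, _ = env.reset()
    total_reward, terminated = 0, False
    while not terminated:
        action = np.random.randint(env.nA)
        state, reward, terminated, truncated, _ = env.step(action)
        total_reward += reward
    print(state, reward, terminated, total_reward)

The terminal observations and total rewards of a few such random rollouts look like this: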

// Track a
| Observation   | Reward | Terminated | Total reward |
| (1, 16, 0, 3) | -1     | True       | -4984        |
| (2, 16, 0, 2) | -1     | True       | -23376       |
| (3, 16, 0, 3) | -1     | True       | -14101       |
| (1, 16, 0, 3) | -1     | True       | -46467       |

// Track b
| Observation    | Reward | Terminated | Total reward |
| (1, 31, -2, 2) | -1     | True       | -3567        |
| (0, 31, -4, 4) | -1     | True       | -682         |
| (2, 31, -2, 1) | -1     | True       | -1387        |
| (3, 31, -1, 3) | -1     | True       | -2336        |

With the environment ready, we can proceed to implement the off-policy MC control algorithm with weighted importance sampling, as shown below:

Figure 3 Off-policy Monte Carlo Control Algorithm. Source: Algorithm written in LaTeX by the author.

Monte Carlo methods solve the reinforcement learning problem by averaging sample returns. In training, the method generates a trajectory based on a given policy and learns from the tail of each episode. The difference between on-policy and off-policy learning is that on-policy methods use one policy for both decision making and improvement, whereas off-policy methods use separate policies for generating data and for policy improvement. According to the author of the book, almost all off-policy methods utilize importance sampling to estimate expected values under one distribution given samples from another. [2]

The next several sections explain the tricks of the soft policy and weighted importance sampling that appear in the algorithm above, and how essential a proper value initialization is to this task.

This off-policy algorithm uses two policies:

  • Target policy: the policy being learned about. In this algorithm, the target policy is greedy and deterministic, which means the probability of the greedy action A at time t is 1.0, with all other actions' probabilities equal to 0.0. The target policy follows Generalized Policy Iteration (GPI), improving after every step.
  • Behavior policy: the policy used to generate behavior. The behavior policy in this algorithm is defined as a soft policy, meaning that all actions in a given state have a non-zero probability. We'll use the ε-greedy policy as our behavior policy here.

In a soft policy, the probability of an action is:
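For the ε-greedy behavior policy used here, this probability takes the standard form, where the first case covers the greedy action and |A(s)| is the number of available actions (9 in this task):

b(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & \text{if } a = \arg\max_{a'} Q(s, a') \\
\dfrac{\varepsilon}{|\mathcal{A}(s)|}, & \text{otherwise}
\end{cases}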

We should also return this probability when generating actions, for importance sampling purposes. A sketch of the behavior policy code follows.
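This is a minimal sketch, assuming the action values are stored in a NumPy array indexed by the state tuple (the function and variable names are illustrative, not necessarily those in the repo):

import numpy as np

def behavior_policy(state, q_values, num_actions, epsilon=0.1):
    """Epsilon-greedy behavior policy; returns the action and its probability."""
    # Indexing q_values with the state tuple works even with negative speed values:
    # a negative index selects from the end of the axis, still one unique slot per speed.
    greedy_action = int(np.argmax(q_values[state]))
    if np.random.random() < epsilon:
        action = int(np.random.randint(num_actions))   # explore uniformly
    else:
        action = greedy_action                         # exploit
    # Probability of the sampled action under the epsilon-greedy distribution.
    if action == greedy_action:
        prob = 1.0 - epsilon + epsilon / num_actions
    else:
        prob = epsilon / num_actions
    return action, prob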

We estimate the relative probability of the trajectory under the target policy given the trajectory generated by the behavior policy, and the equation for this probability, the importance sampling ratio, is:
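In the book's notation, with π as the target policy, b as the behavior policy, and T the termination time of the episode, the ratio for the steps from time t onward is:

\rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}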

The value function for weighted importance sampling is:
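In the book's notation, where \mathcal{T}(s) is the set of time steps at which state s is visited and G_t is the return starting from t:

V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}\, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}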

We can reframe the equation to fit the episodic task in the form of incremental updates:
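For every state-action pair (S_t, A_t) visited in the backward pass, with return G, cumulative weight C, and importance weight W, the incremental form used in the algorithm is:

\begin{aligned}
C(S_t, A_t) &\leftarrow C(S_t, A_t) + W \\
Q(S_t, A_t) &\leftarrow Q(S_t, A_t) + \frac{W}{C(S_t, A_t)} \big[ G - Q(S_t, A_t) \big] \\
W &\leftarrow W \cdot \frac{\pi(A_t \mid S_t)}{b(A_t \mid S_t)}
\end{aligned}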

There are a lot of excellent examples of how to derive the above equation, so we won't spend time deriving it here again. But I do want to explain the interesting trick in the last several lines of the algorithm:

Figure 4 The trick in the off-policy MC control algorithm. Source: Image by the author

The trick relies on the fact that the target policy here is deterministic. If the action taken at a time step differs from the one the target policy would choose, then the probability of that action w.r.t. the target policy is 0.0; thus, since the loop scans the episode backwards, none of the earlier steps of the trajectory can update the state-action value function anymore, and the loop exits. On the contrary, if the action at that time step is the same as the target policy's, then the probability is 1.0, and the action-value update continues.

Proper value initialization plays a vital role in successfully solving this exercise. Let's first take a look at the rewards on both tracks under a random policy:

// Track a
| Observation   | Reward | Terminated | Total reward |
| (1, 16, 0, 3) | -1     | True       | -4984        |
| (2, 16, 0, 2) | -1     | True       | -23376       |
| (3, 16, 0, 3) | -1     | True       | -14101       |
| (1, 16, 0, 3) | -1     | True       | -46467       |

// Track b
| Observation    | Reward | Terminated | Total reward |
| (1, 31, -2, 2) | -1     | True       | -3567        |
| (0, 31, -4, 4) | -1     | True       | -682         |
| (2, 31, -2, 1) | -1     | True       | -1387        |
| (3, 31, -1, 3) | -1     | True       | -2336        |

The reward at every time step before crossing the finish line is -1. At the early stage of training, the total reward will be a large negative number; if we initialize the state-action values to 0 or to random values from a normal distribution with, say, a mean of 0 and a variance of 1, those values will be far too optimistic, and it may take a very long time for the program to find an optimal policy.

After several tests, I found that a normal distribution with a mean of -500 and a variance of 1 provides effective values for faster convergence; you are encouraged to try other values and can surely find a better initial value than this.
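A minimal sketch of this initialization (the array shape follows the observation and action spaces described earlier; env refers to the environment instance from the sketch above):

import numpy as np

# State-action values: shape = observation space + action dimension,
# e.g. (num_rows, num_cols, 5, 9, 9). A variance of 1 corresponds to scale (std) 1.
q_shape = env.nS + (env.nA,)
q_values = np.random.normal(loc=-500.0, scale=1.0, size=q_shape)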

With all of the thoughts above in mind, we can program the off-policy Monte Carlo control function and use it to solve the racetrack exercise. We'll also add the zero-acceleration challenge mentioned in the exercise description.
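The full training code is in exercise_5_12_racetrack.py; what follows is only a condensed sketch of the control loop under the assumptions of the earlier sketches (the RaceTrackEnv class, the behavior_policy function, and the action indexing where action 4 is the (0, 0) increment):

import numpy as np

def off_policy_mc_control(env, num_episodes=1_000_000, gamma=1.0,
                          epsilon=0.1, zero_acc_prob=0.0):
    """Sketch of off-policy MC control with weighted importance sampling."""
    q_shape = env.nS + (env.nA,)
    q_values = np.random.normal(-500.0, 1.0, size=q_shape)   # pessimistic initialization
    c_values = np.zeros(q_shape)                              # cumulative weights
    target_policy = np.argmax(q_values, axis=-1)              # greedy, deterministic

    for _ in range(num_episodes):
        # Generate an episode with the epsilon-greedy behavior policy.
        episode = []
        state, _ = env.reset()
        terminated = False
        while not terminated:
            action, prob = behavior_policy(state, q_values, env.nA, epsilon)
            executed = action
            if zero_acc_prob > 0.0 and np.random.random() < zero_acc_prob:
                executed = 4   # zero-acceleration challenge: force the (0, 0) increment
            next_state, reward, terminated, _, _ = env.step(executed)
            episode.append((state, action, reward, prob))     # learn on the intended action
            state = next_state

        # Learn from the tail of the episode (backward pass).
        g, w = 0.0, 1.0
        for state, action, reward, prob in reversed(episode):
            g = gamma * g + reward
            sa = state + (action,)
            c_values[sa] += w
            q_values[sa] += (w / c_values[sa]) * (g - q_values[sa])
            target_policy[state] = np.argmax(q_values[state])
            if action != target_policy[state]:
                break                  # pi(A_t|S_t) = 0: earlier steps receive no update
            w *= 1.0 / prob            # pi(A_t|S_t) = 1 for the greedy action
    return q_values, target_policy

In this sketch the zero-acceleration noise is treated as environment stochasticity: the executed action may be overridden, but the importance sampling weight is still computed from the intended action drawn from the behavior policy.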

Then we train the policies for 1,000,000 episodes, with and without the zero-acceleration challenge, on both tracks and evaluate them; a sketch of the evaluation loop is shown below.
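This is a minimal sketch of that evaluation, again using the hypothetical names from the earlier sketches:

def evaluate_policy(env, target_policy, num_episodes=10, max_steps=1_000):
    """Run the greedy target policy and record the total reward of each episode."""
    episode_returns = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        total_reward, terminated, steps = 0, False, 0
        while not terminated and steps < max_steps:
            action = int(target_policy[state])
            state, reward, terminated, _, _ = env.step(action)
            total_reward += reward
            steps += 1
        episode_returns.append(total_reward)
    return episode_returns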

By plotting the training record, we find that the policy converges around the 10,000th episode on both tracks when the zero-acceleration challenge is ignored; adding zero acceleration makes convergence slower and unstable before the 100,000th episode, but the policy still converges afterwards.

Figure 5 Training reward history of track A. Source: Image by the author
Figure 6 Training reward history of track B. Source: Image by the author

Finally, we can ask the agents to play the game with the trained policies, and we also plot the resulting trajectories to further check the correctness of the policies.

Figure 7 Agents driving on both tracks based on trained policies. Source: Gif by the author

Sample trajectories for track a:

Figure 8 Sample optimal trajectories for track A. Source: Image by the author

Sample trajectories for track b:

Figure 9 Sample optimal trajectories for track B. Source: Image by the author

To date, we’ve solved the racetrack exercise. This implementation could still have some problems, and also you’re very welcome to point them out and discuss a greater solution within the comment. Thanks for reading!

[1] Midjourney Terms of Service: https://docs.midjourney.com/docs/terms-of-service

[2] Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd Edition. MIT Press, 2018.

My GitHub repo for this exercise: [Link]; please check the “exercise” section.
