Dissecting “Reinforcement Learning” by Richard S. Sutton with custom Python implementations, Episode V
In our previous post, we wrapped up the introductory series on fundamental reinforcement learning (RL) techniques by exploring Temporal-Difference (TD) learning. TD methods merge the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods, and underpin some of the most important RL algorithms, such as Q-learning.
Building on that foundation, this post delves into n-step TD learning, a flexible approach introduced in Chapter 7 of Sutton’s book [1]. This method bridges the gap between classical TD and MC techniques. Like TD, n-step methods use bootstrapping (leveraging prior estimates), but they also incorporate the next n rewards, offering a blend of short-term and long-term learning. In a future post, we’ll generalize this idea even further with eligibility traces.
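To make that idea concrete, here is a minimal sketch of the n-step return used in prediction: sum the next n discounted rewards, then bootstrap with the current value estimate of the state reached n steps later. The array layout (`rewards[k]` holding R_{k+1}, `values[k]` holding V(S_k)) and the function name are illustrative assumptions, not code from the accompanying repository.

```python
def n_step_return(rewards, values, t, n, gamma=0.99):
    """Sketch of the n-step return G_{t:t+n} (hypothetical helper).

    rewards[k] is assumed to hold R_{k+1}, values[k] the estimate V(S_k).
    """
    T = len(rewards)                  # number of steps in the episode
    horizon = min(t + n, T)           # truncate at the end of the episode
    # Discounted sum of the next n rewards (fewer if the episode ends sooner).
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, horizon))
    # Bootstrap with the value estimate of S_{t+n}, unless the episode terminated first.
    if horizon < T:
        g += gamma ** n * values[horizon]
    return g
```

With n = 1 this reduces to the one-step TD target, and with n at least the episode length it becomes the full Monte Carlo return, which is exactly the spectrum this post explores.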
We’ll follow a structured approach, starting with the prediction problem before moving to control. Along the way, we’ll:
- Introduce n-step Sarsa,
- Extend it to off-policy learning,
- Explore the n-step tree backup algorithm, and
- Present a unifying perspective with n-step Q(σ).
As always, you can find all accompanying code on GitHub. Let’s dive in!