Introducing n-Step Temporal-Difference Methods

-

Dissecting “Reinforcement Learning” by Richard S. Sutton with custom Python implementations, Episode V

In our previous post, we wrapped up the introductory series on fundamental reinforcement learning (RL) techniques by exploring Temporal-Difference (TD) learning. TD methods merge the strengths of Dynamic Programming (DP) and Monte Carlo (MC) methods, combining their best features to form some of the most important RL algorithms, such as Q-learning.

Building on that foundation, this post delves into n-step TD learning, a flexible approach introduced in Chapter 7 of Sutton’s book [1]. This method bridges the gap between classical TD and MC techniques. Like TD, n-step methods use bootstrapping (leveraging prior estimates), but they additionally incorporate the next n rewards, offering a unique blend of short-term and long-term learning. In a future post, we’ll generalize this idea even further with eligibility traces.
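To make the idea concrete, here is a minimal sketch (not the book’s pseudocode) of how such an n-step return could be computed: the first n rewards are summed with discounting, and the tail of the trajectory is replaced by a bootstrapped value estimate. The helper `n_step_return` and its arguments are illustrative names chosen for this post, not part of any library.

```python
def n_step_return(rewards, v_bootstrap, gamma, n):
    """Illustrative n-step return G_{t:t+n}:
    the n discounted rewards R_{t+1}, ..., R_{t+n}
    plus a bootstrapped estimate gamma^n * V(S_{t+n}).

    rewards      -- the n observed rewards after time t
    v_bootstrap  -- current value estimate V(S_{t+n})
    gamma        -- discount factor in [0, 1]
    n            -- number of real rewards to include
    """
    g = sum(gamma**k * r for k, r in enumerate(rewards[:n]))
    return g + gamma**n * v_bootstrap


# Example: a 3-step return with gamma = 0.9
# 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*0.5 = 2.9845
print(n_step_return([1.0, 0.0, 2.0], v_bootstrap=0.5, gamma=0.9, n=3))
```

With n = 1 this reduces to the familiar TD(0) target, while letting n grow toward the episode length recovers the full Monte Carlo return.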

We’ll follow a structured approach, starting with the prediction problem before moving to control. Along the way, we’ll:

  • Introduce n-step Sarsa,
  • Extend it to off-policy learning,
  • Explore the n-step tree backup algorithm, and
  • Present a unifying perspective with n-step Q(σ).

As always, you can find all accompanying code on GitHub. Let’s dive in!
