Benchmarking Tabular Reinforcement Learning Algorithms


In previous posts, we explored Part I of the seminal book by Sutton and Barto [1] (*). In that part, we delved into the three fundamental techniques underlying nearly every modern Reinforcement Learning (RL) algorithm: Dynamic Programming (DP), Monte Carlo methods (MC), and Temporal Difference Learning (TD). We not only discussed algorithms from each field in depth, but also implemented every one of them in Python.

Part I of the book focuses on tabular solution methods — approaches suited to problems small enough to be represented in table form. With Q-learning, for example, we can compute and store a full Q-table containing every possible state-action pair. In contrast, Part II of Sutton’s book tackles approximate solution methods. When state and action spaces become too large — or even infinite — we must generalize. Consider the challenge of playing Atari games: the state space is simply too vast to model exhaustively. Instead, deep neural networks are used to compress the state into a latent vector, which then serves as the basis for an approximated value function [2].

While we’ll venture into Part II in upcoming posts, I’m excited to announce a new series: we’ll benchmark all the algorithms introduced in Part I against each other. This post serves both as a summary and as an introduction to our benchmarking framework. We’ll evaluate each algorithm based on how quickly and effectively it can solve increasingly large Gridworld environments. In future posts, we plan to extend our experiments to more challenging scenarios, such as two-player games, where the differences between these methods will be even more apparent.

To summarize, in this post we will:

  • Introduce the benchmark task and discuss our comparison criteria.
  • Provide a chapter-by-chapter summary of the methods introduced in Sutton’s book, along with initial benchmarking results.
  • Identify the best-performing methods from each group and deploy them in a larger-scale benchmarking experiment.

Introducing the Benchmark Task and Experiment Planning

In this post we’ll work with the Gymnasium [3] environment “Gridworld”. It’s essentially a maze-navigation task, in which the agent has to get from the top-left corner of the maze to the bottom-right tile (the present) — without falling into any of the icy lakes:

Image by author

The state space is a number between 0 and N − 1, where N is the total number of tiles (16 in our case). There are four actions the agent can execute in each step: going left, right, up, or down. Reaching the goal yields a reward of 1; falling into the lake ends the episode without any reward.

The beauty of this environment is that one can generate random worlds of arbitrary size. Thus, for every method we will plot the number of steps / updates needed to solve the environment versus the environment size. In fact, Sutton does this in some parts of the book too, so we can refer to that.
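
For readers who want to follow along: the setup described here matches Gymnasium’s FrozenLake, which can serve as a stand-in for the Gridworld environment. A minimal sketch (the helper name make_gridworld is mine, and I assume the deterministic, non-slippery variant):

```python
import gymnasium as gym
from gymnasium.envs.toy_text.frozen_lake import generate_random_map

def make_gridworld(n: int = 4):
    """Random n x n maze: start at the top-left, goal ("present") at the bottom-right."""
    return gym.make(
        "FrozenLake-v1",
        desc=generate_random_map(size=n),  # random layout of holes ("icy lakes")
        is_slippery=False,                 # deterministic transitions
    )

env = make_gridworld(4)
obs, info = env.reset(seed=0)
print(env.observation_space.n, env.action_space.n)  # 16 states, 4 actions
```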

Preliminaries

I’d like to begin with some general notes, hoping these will guide you through my thought process.

It isn’t easy to compare algorithms in a “fair” way. When implementing the algorithms I mainly aimed for correctness, but also for simplicity — that is, I wanted readers to easily be able to connect the Python code with the pseudocode from Sutton’s book. For “real” use cases, one would surely optimize the code further, and also use a large array of tricks common in RL, such as decaying exploration, optimistic initialization, learning rate tuning, and more. Further, one would take great care to tune the hyperparameters of the deployed algorithm.

Applied RL Tricks

Due to the large number of algorithms under investigation, I cannot do that here. Instead, I identified two important mechanisms and measured their effectiveness on an algorithm known to work quite well: Q-learning. These are:

  • Intermediate rewards: instead of only rewarding the agent for reaching the goal, we reward it for progress along the maze. We quantify this by the (normalized) difference in x and y coordinates between the current and the previous state. This makes use of the fact that the goal in each Gridworld environment is always at the bottom right, and thus higher x / y coordinates tend to be better (although one could still get “stuck” if an icy lake lies between the agent and the goal). Since this difference is normalized by the number of states, its contribution is small, so that it doesn’t overshadow the reward of reaching the goal.
  • Decaying exploration: throughout this post series, the exploration-exploitation dilemma came up frequently. It describes the trade-off between exploiting states / actions already known to be good and exploring less-visited states — potentially finding even better solutions, at the risk of wasting time in less promising areas. One common technique addressing this is to decay exploration: starting with a high amount of exploration early on and slowly decaying it to lower levels. We do this by linearly decaying ε from 1 (100% random exploration) to 0.05 over 1,000 steps. (A sketch of both tricks follows after this list.)
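
Below is a minimal sketch of both tricks, assuming a Gymnasium-style environment whose integer states index an n × n grid; the wrapper and function names are mine, and the exact shaping term in the benchmark code may differ in detail:

```python
import gymnasium as gym

class ProgressReward(gym.Wrapper):
    """Adds a small intermediate reward for moving towards the bottom-right goal."""

    def __init__(self, env, n):
        super().__init__(env)
        self.n = n          # grid side length
        self.prev_state = 0

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.prev_state = obs
        return obs, info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # progress = gain in (row + column), normalized by the number of states
        prev_r, prev_c = divmod(self.prev_state, self.n)
        cur_r, cur_c = divmod(obs, self.n)
        progress = ((cur_r + cur_c) - (prev_r + prev_c)) / (self.n * self.n)
        self.prev_state = obs
        return obs, reward + progress, terminated, truncated, info

def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly decay epsilon from eps_start to eps_end over decay_steps."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```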

Let’s have a look at how Q-learning performs with these techniques:

Image by author

As we can see, in the baseline setup the number of steps needed grows quickly and reaches the maximal number of allowed steps (100,000) — meaning the algorithm didn’t solve the environment within the allotted number of steps. Decaying ε alone didn’t contribute much either. However, adding intermediate rewards proved to be extremely effective — and the combination of both performed best.

Thus, for most of the methods to come we start with the “naïve” environment, the baseline implementation. Later we show results for the “improved” environment, which adds intermediate rewards and decaying exploration.

Comparison Criteria

As seen in the previous section, I chose the number of steps needed until the found policy solves the Gridworld environment as the default way of comparing methods. This seems a bit fairer than simply measuring elapsed time, since time depends on the concrete implementation (much like the idea behind Big O notation) — and, as mentioned above, I didn’t optimize for speed. However, it is important to note that steps can also be misleading: one step in DP methods, for example, comprises a loop over all states, whereas one step in MC and TD methods is the generation of a single episode (in fact, for TD methods one normally counts one step as one value update, i.e. generating an episode consists of multiple steps — however, I made this more comparable to MC methods on purpose). Because of this, we also show elapsed time in some places.

Experiment Structure

To reduce variance, for each Gridworld size we run every method three times and then record the lowest number of steps needed.

The code required to run all of the following benchmarks can be found on GitHub.

Recap and Benchmarking of All Algorithms

With the fundamentals out of the way, let’s properly start. In this section we’ll recap all the algorithms introduced in Part I of Sutton’s book and benchmark them against one another on the previously introduced Gridworld task.

Dynamic Programming

We start with Chapter 4 of Sutton’s book, which describes methods from DP. These can be applied to a wide variety of problems, always building on the principle of iteratively constructing larger solutions from smaller subproblems. In the context of RL, DP methods maintain a Q-table that is filled out incrementally. For this we require a model of the environment, and then, using this model, we update the expected value of states or state-action pairs depending on the possible successor states. Sutton introduces two methods that we picked up in our corresponding post: policy iteration and value iteration.

Let’s start with policy iteration. It consists of two interleaved steps, namely policy evaluation and policy improvement. Policy evaluation uses DP to — as the name says — evaluate the current policy: we incrementally update the state estimates using the model and the policy. Next comes policy improvement, which employs a fundamental concept of RL: according to the policy improvement theorem, any policy gets better when the chosen action in a single state is changed to a better action. Following this, we construct the new policy from the Q-table in greedy fashion. This is repeated until the policy has converged.

The corresponding pseudocode looks as follows:

Image from [1]
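
As a rough Python counterpart to this pseudocode (not the exact benchmark code), here is a sketch of tabular policy iteration, assuming the model is available in the Gymnasium toy-text format P[s][a] = [(prob, next_state, reward, done), ...]:

```python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular policy iteration; P[s][a] = [(prob, next_state, reward, done), ...]."""
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # policy evaluation: sweep until the value function stops changing
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2] * (not done))
                            for p, s2, r, done in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # policy improvement: act greedily w.r.t. the evaluated value function
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a]) for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:
            return policy, V
```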

Let’s come to value iteration. This is very similar to policy iteration, but with a simple yet crucial modification: in every loop, only one sweep of policy evaluation is run. It can be shown that this still converges to the optimal policy, and overall does so faster than policy iteration:

Image from [1]
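
A corresponding sketch of value iteration, under the same model-format assumption as above:

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.99, theta=1e-8):
    """Tabular value iteration: one evaluation sweep per loop, greedy over actions."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a]) for a in range(n_actions)]
            v_new = max(q)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # derive the greedy policy from the converged value function
    policy = np.array([
        int(np.argmax([sum(p * (r + gamma * V[s2] * (not done))
                           for p, s2, r, done in P[s][a]) for a in range(n_actions)]))
        for s in range(n_states)
    ])
    return policy, V
```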

For more details, here’s my corresponding post about DP methods.

Results

Now it’s time to see what these first two algorithms are really made of. We run the experiment sketched above and get the following plot:

Image by author

Both methods are able to solve all created Gridworld sizes in the minimal number of steps, 100. Surprising? Well, this actually shows both a strength and a weakness of DP methods, and at the same time of our methodology: DP methods are “thorough” — they require a complete model of the world and then iterate over all states, yielding a good solution within just a few passes over all states. However, this means that all states have to be estimated until convergence — even though some of them might be much less interesting — and this scales quite badly with the environment size. In fact, one measured step here comprises a full sweep over all states, indicating that for these methods time is a better measure.

For this, we get the following graph:

Image by author

Now we can see the increase in compute needed with respect to the number of states. We can also see that, as claimed, value iteration converges much faster and scales much better. Note that the x-axis labels denote n, with the Gridworld having a size of n x n.

Monte Carlo Methods

Next in our post series on RL we covered MC methods. These can learn from experience alone, i.e. one can run them in any kind of environment without having a model of it — which is a surprising realization, and a very useful one: often we don’t have such a model, and at other times it would be too complex and impractical to use. Consider the game of Blackjack: while we can certainly model all possible outcomes and their probabilities, it is a very tedious task — and learning to play by simply playing is a very tempting thought. Because they don’t use a model, MC methods are unbiased — but on the downside, their estimates have high variance.

One issue when implementing these methods is ensuring that all state-action pairs are visited repeatedly, and thus updated. Since we don’t have a model, we cannot simply iterate over all possible combinations (compare, e.g., DP methods), but (in a way) randomly explore the environment. If, as a result, we missed some states entirely, we would lose the theoretical convergence guarantees, which would translate into practice.

One way of satisfying this is the exploring starts assumption (ES): we start each episode in a random state and select the first action randomly, too. Apart from that, MC methods can be implemented rather simply: we play out full episodes and set the value estimate of each state-action pair to the average obtained return.

MC with ES looks as follows:

Image from [1]

To remove the assumption of ES, we can resort to two classes of algorithms: on- and off-policy methods. Let’s start with the on-policy one.

This is actually not too different from the ES algorithm: we simply use an ε-greedy policy for generating episodes. That is, we remove the assumption of ES and use a “soft” instead of a “hard” policy for generating episodes: the policy used in every iteration isn’t fully greedy, but ε-greedy — which guarantees that in the limit we see all possible state-action pairs:

Image from [1]
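
A possible Python sketch of on-policy first-visit MC control with an ε-greedy policy (simplified relative to the benchmark implementation; hyperparameters are illustrative):

```python
import numpy as np
from collections import defaultdict

def on_policy_mc_control(env, n_episodes=10_000, gamma=0.99, eps=0.1, seed=0):
    """First-visit on-policy MC control with an epsilon-greedy policy (sketch)."""
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    returns_count = defaultdict(int)

    for _ in range(n_episodes):
        episode, state, done = [], env.reset()[0], False
        while not done:
            # epsilon-greedy action selection w.r.t. the current Q-table
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # walk the episode backwards, accumulating the return G
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # first-visit check: only update at the earliest occurrence of (state, action)
            if (state, action) not in [(s, a) for s, a, _ in episode[:t]]:
                returns_count[(state, action)] += 1
                Q[state][action] += (G - Q[state][action]) / returns_count[(state, action)]
    return Q
```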

Off-policy methods follow the idea of splitting exploration and learning into two policies: we maintain a target policy π, which we want to optimize, and a behavior policy b.

However, we can’t simply use b everywhere in our algorithm. When generating an episode and computing returns, we obtain:

Image from [1]

I.e., the resulting value is the expected value under b, not π.

This is where importance sampling comes in. We can fix this expectation with the appropriate ratio:

Image from [1]

This ratio is defined by:

Image from [1]

In our case, we obtain the following formula:

Image from [1]

(Note that this uses weighted importance sampling instead of “naïve” importance sampling.)

We could of course compute these ratios naïvely in every step. However, Sutton introduces a clever scheme that updates these values (denoted by W) incrementally, which is much more efficient. In fact, in my original post I showed the naïve version as well — I believe it helps with understanding. However, since here we mainly care about benchmarking, and the “naïve” and the “incremental” versions are equivalent apart from performance, we only list the slightly more complex incremental version here.

In pseudocode the corresponding algorithm looks as follows:

Image from [1]
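
In Python, the incremental weighted-importance-sampling update could look roughly like this (a sketch; the ε-greedy behavior policy and hyperparameters are illustrative):

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_control(env, n_episodes=10_000, gamma=0.99, eps=0.1, seed=0):
    """Off-policy MC control with weighted importance sampling (incremental sketch)."""
    rng = np.random.default_rng(seed)
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    C = defaultdict(lambda: np.zeros(n_actions))  # cumulative importance weights

    for _ in range(n_episodes):
        # behavior policy b: epsilon-greedy w.r.t. the current Q-table
        episode, state, done = [], env.reset()[0], False
        while not done:
            if rng.random() < eps:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        G, W = 0.0, 1.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            C[state][action] += W
            Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
            greedy = int(np.argmax(Q[state]))
            if action != greedy:          # the greedy target policy would never take this action
                break
            # probability of the greedy action under the epsilon-greedy behavior policy
            b_prob = 1 - eps + eps / n_actions
            W /= b_prob
    return Q
```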

Note that, in contrast to our initial post introducing these methods, where the behavior policy was simply random, here we pick a better one — namely an ε-greedy policy w.r.t. the current Q-table.

For more details, here’s my corresponding post on MC methods.

Results

With that, let’s compare these three algorithms on small Gridworld environments. Note that one step here denotes one full episode generated:

We observe off-policy MC already failing at a Gridworld size of 5×5, and even though MC with ES and on-policy MC perform better, they also start to struggle with larger sizes.

This might be somewhat surprising, and disappointing for MC fans. Don’t worry, we’ll manage to improve this — but it shows a weakness of these algorithms: in “large” environments with sparse rewards, MC methods basically have to hope to stumble upon the goal by chance — the probability of which decreases exponentially with the size of the environment.

Thus, let’s try to make the task easier for the model and use the previously introduced tricks that we empirically found to help the performance of TD learning: adding intermediate rewards, and ε-decay — our “improved” setup.

Indeed, with this all methods fare much better:

However, now MC ES is causing problems. Thus, let’s put it aside and proceed without it: ES was anyhow a theoretical concept on the way to developing MC methods, and clunky to use / implement (some might remember how I implemented having the environment start in random states …):

Here, at least, we get close to the results of DP. Note that I capped the maximal number of steps at 100,000, so whenever this number shows up in the graph it means that the algorithm couldn’t solve the environment within the given step limit. On-policy MC actually seems to perform very well — the number of steps needed barely increases — but off-policy MC seems to perform worse.

Discussion

To me, MC methods perform surprisingly well — considering that they essentially stumble around the environment randomly at first, hoping to find the goal by exploration alone. Of course, this isn’t fully true — their performance (speaking of on-policy MC) becomes really good only after enabling intermediate rewards, which guide the model towards the goal. In this setup MC methods seem to perform very well — one potential reason being that they are unbiased, and less sensitive to hyperparameter tuning and the like.

Temporal-Difference Learning

Let’s come to TD methods. These can be seen as combining the strengths of both approaches previously introduced: similar to MC, they don’t need a model of the environment — but they still build upon previous estimates, i.e. they bootstrap — as in DP.

Let’s recap the DP and MC update rules:

DP methods turn the Bellman equation into an update rule and compute the value of a state based on the estimated values of its successor states:

Image from [1]

MC methods, on the other hand, play out full episodes and then update their value estimates based on the observed return:

Image from [1]

TD methods combine these two ideas. They play out episodes, but after every step they update the value estimates using the observed reward and the previous estimate:

Image from [1]

Some of the most fundamental RL algorithms stem from this field — and we’ll discuss them in the following.

Let’s begin with Sarsa. First, we modify the update rule introduced above to work with state-action pairs:

Image from [1]

With this, Sarsa is introduced rather quickly: we play episodes and update values following our current policy. The name comes from the tuples used in the updates:

Image from [1]

In pseudocode this looks as follows:

Image from [1]
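
A compact Python sketch of tabular Sarsa (illustrative hyperparameters, Gymnasium-style step API assumed):

```python
import numpy as np

def sarsa(env, n_episodes=5_000, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Sarsa (sketch): on-policy TD control."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))

    def act(s):
        # epsilon-greedy w.r.t. the current Q-table
        return int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        state = env.reset()[0]
        action = act(state)
        done = False
        while not done:
            next_state, reward, terminated, truncated, _ = env.step(action)
            next_action = act(next_state)
            # update towards the value of the action actually taken next (S, A, R, S', A')
            target = reward + gamma * Q[next_state, next_action] * (not terminated)
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
            done = terminated or truncated
    return Q
```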

Next up we have Q-learning. This is very similar to Sarsa, with one key difference: it’s an off-policy algorithm. Instead of simply following the executed transition during the update, we take the maximum Q-value over all actions in the successor state:

Image from [1]

You can picture this as creating a behavior policy b that is equal to π, except for being greedy in the transitions in question.

The pseudocode looks like this:

Image from [1]
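
Compared to Sarsa, only the update target changes; as a sketch, a single Q-learning update could be written as:

```python
import numpy as np

def q_learning_update(Q, state, action, reward, next_state, terminated,
                      alpha=0.1, gamma=0.99):
    """Single Q-learning update: bootstrap from the greedy action in the next state."""
    target = reward + gamma * np.max(Q[next_state]) * (not terminated)
    Q[state, action] += alpha * (target - Q[state, action])
```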

Another algorithm is Expected Sarsa, which (you guessed it) is an extension of Sarsa. Instead of following the single transition executed by the policy, we account for all possible actions in the successor state and weigh them by how likely they are under the current policy:

Image from [1]
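
A sketch of the corresponding update, assuming an ε-greedy target policy (the function name and defaults are illustrative):

```python
import numpy as np

def expected_sarsa_update(Q, state, action, reward, next_state, terminated,
                          alpha=0.1, gamma=0.99, eps=0.1):
    """Single Expected Sarsa update under an epsilon-greedy target policy."""
    n_actions = Q.shape[1]
    # probabilities of each next action under the epsilon-greedy policy
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[next_state])] += 1 - eps
    expected_value = np.dot(probs, Q[next_state])
    target = reward + gamma * expected_value * (not terminated)
    Q[state, action] += alpha * (target - Q[state, action])
```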

The last algorithm in this chapter is an extension of Q-learning. Q-learning suffers from a problem known as maximization bias: since it uses a maximum over estimated values, the resulting estimate can have positive bias. We can address this by using two Q-tables: in each update we use one for selecting a value-maximizing action, and the other for computing the update target. Which one is used where is determined by a coin flip. The algorithm is known as Double Q-learning:

Image from [1]
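
A sketch of a single Double Q-learning update with two tables Q1 and Q2 (the function name and defaults are mine):

```python
import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state, terminated,
                    alpha=0.1, gamma=0.99, rng=None):
    """Single Double Q-learning update: one table selects the action, the other evaluates it."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < 0.5:
        Q1, Q2 = Q2, Q1                      # coin flip: which table gets updated
    best = int(np.argmax(Q1[next_state]))    # action selection with the table being updated
    target = reward + gamma * Q2[next_state, best] * (not terminated)  # evaluation with the other
    Q1[state, action] += alpha * (target - Q1[state, action])
```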

Results

Let’s have a look at the results, starting with the naïve environment:

Image by author

We can see that both Q-learning methods start to run into problems at Gridworld sizes of 11 x 11.

Thus, let’s apply our known tricks, yielding the “improved” setup:

Image by author

All methods can now find solutions significantly quicker — only Expected Sarsa falls behind. This might very well be the case — it is used significantly less than Q-learning or Sarsa, and is possibly more of a theoretical concept.

Thus, let’s proceed without this method and see how large a world we can solve:

Image by author

Q-learning can now also solve grid sizes of 25 x 25 effortlessly — but Sarsa and Double Q-learning start to degrade.

More details can be found in my introductory post about TD methods.

Discussion

In the improved setup, TD methods generally perform well. We only eliminated Expected Sarsa early on, which anyhow isn’t such a common algorithm.

“Simple” Sarsa and Double Q-learning struggle with larger environment sizes, while Q-learning performs well overall. The latter is somewhat surprising, since Double Q-learning should address some of the shortcomings of standard Q-learning, in particular the maximization bias. Potentially, we already reduce the variance by running each experiment n times. Another hypothesis could be that Double Q-learning takes longer to converge, since the number of parameters has also doubled — which would indicate that the power of Double Q-learning shows better for more complex problems given more time.

As mentioned, Q-learning performs better than Sarsa. This mirrors what one can see in research / literature, namely that Q-learning is significantly more popular. This can probably be explained by it being off-policy, which usually yields more powerful solution methods. Sarsa, on the other hand, performs better for stochastic or “dangerous” tasks: since in Sarsa the actually chosen action is taken into account in the value update, it better understands the consequences of its actions, which is useful for stochastic environments and / or environments where one can, e.g., fall off a cliff. Despite the latter being the case here, the environment is probably not complex or large enough for this effect to come into play.

TD-n

TD-n methods in a way marry classical TD learning and MC methods. As Sutton so nicely puts it, they “free us from the tyranny of the timestep” [1]. In MC methods, we are forced to wait for a full episode before making any updates. In TD methods, we update estimates in every step — but are also forced to look only one step into the future.

Thus, it makes sense to introduce n-step returns:

Image from [1]

With that, we can simply introduce Sarsa-n:

Image from [1]

We play episodes following the current policy and then update the value estimates with the n-step return.
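
As a small illustration, the n-step (Sarsa) return can be computed from a stored slice of the trajectory roughly like this (a sketch; the benchmark code may organize this differently):

```python
def n_step_return(rewards, Q, bootstrap_state, bootstrap_action, gamma=0.99):
    """n-step Sarsa return: discounted rewards plus a bootstrapped tail value.

    rewards: the n rewards R_{t+1}, ..., R_{t+n} observed after (S_t, A_t);
    bootstrap_state/action: (S_{t+n}, A_{t+n}), or None if the episode ended earlier.
    """
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    if bootstrap_state is not None:
        G += (gamma ** len(rewards)) * Q[bootstrap_state, bootstrap_action]
    return G
```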

In my corresponding post, we also introduce an off-policy version of this. However, to not blow up this post too much, and given the negative experience with off-policy MC methods, we focus on the “classics” — Sarsa-n — and on n-step tree backup, which we introduce next.

n-step tree backup is an extension of the previously seen Expected Sarsa. When computing the n-step return, the corresponding transition tree looks as follows:

Image from [1]

I.e., there is a single path down the tree corresponding to the actual actions taken. Just as in Expected Sarsa, we now want to weigh actions according to their probability under the policy. But since we now have a tree of depth > 1, the cumulative value of later levels is weighted by the probability of the action taken to reach those levels:

Image from [1]

The pseudocode looks as follows:

Image from [1]
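
A recursive sketch of this tree-backup return (my own formulation of the recursion; pi is assumed to hold the target policy’s action probabilities per state):

```python
def tree_backup_return(transitions, Q, pi, gamma=0.99):
    """Recursive n-step tree-backup return (sketch).

    transitions: list of (reward, next_state, next_action) tuples for the n steps,
                 with next_state/next_action set to None at a terminal step;
    pi: array of action probabilities per state, pi[s][a].
    """
    reward, next_state, next_action = transitions[0]
    if next_state is None:                     # terminal transition: no bootstrapping
        return reward
    # expected value over all actions NOT taken, weighted by the target policy
    not_taken = sum(pi[next_state][a] * Q[next_state, a]
                    for a in range(Q.shape[1]) if a != next_action)
    if len(transitions) == 1:                  # leaf of the backup: bootstrap fully
        tail = Q[next_state, next_action]
    else:                                      # recurse down the branch actually taken
        tail = tree_backup_return(transitions[1:], Q, pi, gamma)
    return reward + gamma * (not_taken + pi[next_state][next_action] * tail)
```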

Here’s my corresponding post on n-step TD methods.

Results

As usual, we start with the “naïve” setting and obtain the following results:

Image by author

Sarsa-n already starts to struggle with smaller grid world sizes. Let’s see if the improved setup changes this:

Image by author

Now Sarsa-n indeed performs much better, but n-step tree backup doesn’t.

Discussion

I found this result unexpected and somewhat hard to explain. I’d love to hear your thoughts on this — but in the meantime I was chatting with my chat agent of choice and came up with this hypothesis: intermediate rewards potentially confuse the tree algorithm, since it has to learn a similar return distribution over all possible actions. Further, the more ε decays, the more the expected distribution might differ from the behavior policy.

Model-Based Reinforcement Learning / Planning

In the previous chapter we discussed the topic of “planning” — in the RL context this mainly refers to model-based methods. That is: we have (or construct) a model of the environment and use this model to explore further “virtually”, in particular using these explorations for more and better updates / learning of the value function. The following image displays the integration of planning into learning very well:

Image from [1]

In the top-right corner we see the “classical” RL training loop (also dubbed “direct” RL): starting with some value function / policy, we act in the (real) environment and use this experience to update our value function (or policy, in the case of policy-gradient methods). When incorporating planning, we additionally learn a model of the world from this experience, and then use this model to generate further (virtual) experience and update our value or policy function from it.

This is in fact exactly the Dyna-Q algorithm, which looks as follows in pseudocode:

Image from [1]

Steps (a) to (d) are our classical Q-learning, whereas the rest of the algorithm adds the new planning functionality, in particular the learning of the world model.
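
A possible Python sketch of Dyna-Q with a deterministic learned model (illustrative hyperparameters, Gymnasium-style step API assumed; the lettered steps from the pseudocode are marked in the comments):

```python
import numpy as np

def dyna_q(env, n_episodes=500, n_planning=10, alpha=0.1, gamma=0.99, eps=0.1, seed=0):
    """Tabular Dyna-Q (sketch): Q-learning plus planning on a learned deterministic model."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))
    model = {}  # (state, action) -> (reward, next_state, terminated)

    for _ in range(n_episodes):
        state, done = env.reset()[0], False
        while not done:
            # (b) epsilon-greedy action selection
            action = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            # (d) direct RL: ordinary Q-learning update from real experience
            Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) * (not terminated)
                                         - Q[state, action])
            # (e) model learning: remember the observed transition
            model[(state, action)] = (reward, next_state, terminated)
            # (f) planning: replay randomly chosen remembered transitions
            keys = list(model.keys())
            for _ in range(n_planning):
                s, a = keys[rng.integers(len(keys))]
                r, s2, term = model[(s, a)]
                Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not term) - Q[s, a])
            state, done = next_state, terminated or truncated
    return Q
```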

Another related algorithm is Prioritized Sweeping, which changes how we sample states for the “planning loop”: we explore and play in the real environment while learning the model, and save state-action pairs with large expected value changes in a queue. Only from this queue do we run the “planning loop”, i.e. something akin to steps (e) and (f) above:

Image from [1]
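
And a rough sketch of Prioritized Sweeping using a priority queue over state-action pairs (my own simplified variant, not the exact benchmark code):

```python
import heapq
import numpy as np
from collections import defaultdict

def prioritized_sweeping(env, n_episodes=500, n_planning=10, alpha=0.1,
                         gamma=0.99, eps=0.1, theta=1e-4, seed=0):
    """Prioritized Sweeping (sketch): plan from a queue of 'surprising' state-action pairs."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_states, n_actions))
    model = {}                               # (s, a) -> (r, s', terminated)
    predecessors = defaultdict(set)          # s' -> set of (s, a) leading into it
    pqueue = []                              # max-heap via negated priorities

    def priority(s, a, r, s2, term):
        return abs(r + gamma * np.max(Q[s2]) * (not term) - Q[s, a])

    for _ in range(n_episodes):
        state, done = env.reset()[0], False
        while not done:
            action = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            model[(state, action)] = (reward, next_state, terminated)
            predecessors[next_state].add((state, action))
            p = priority(state, action, reward, next_state, terminated)
            if p > theta:
                heapq.heappush(pqueue, (-p, state, action))
            for _ in range(n_planning):
                if not pqueue:
                    break
                _, s, a = heapq.heappop(pqueue)
                r, s2, term = model[(s, a)]
                Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not term) - Q[s, a])
                # push predecessors whose estimates are now likely outdated
                for ps, pa in predecessors[s]:
                    pr, _, pterm = model[(ps, pa)]
                    pp = priority(ps, pa, pr, s, pterm)
                    if pp > theta:
                        heapq.heappush(pqueue, (-pp, ps, pa))
            state, done = next_state, terminated or truncated
    return Q
```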

More details can be found in my previous post on model-based RL methods.

Results

Let’s start with the naïve setting:

Image by author

Dyna-Q performs reasonably well, while Prioritized Sweeping struggles early on.

In the improved setting we see a similar picture:

Image by author

Discussion

Prioritized Sweeping already performed poorly in the corresponding introductory post — I suspect there either is some issue, or, more likely, this simply is a “tuning” thing, i.e. using a wrong sampling distribution.

Dyna-Q yields solid results.

Benchmarking the Best Algorithms

We have now seen the performance of all algorithms from Part I of Sutton’s book by benchmarking them per chapter on Gridworlds of up to size 25 x 25. Already here we saw better and worse performing algorithms, and in particular already discarded a few candidates not suited to larger environments.

Now we want to benchmark the remaining ones — the best ones from each chapter — against each other on Gridworlds of up to size 50 x 50.

Image by author

These algorithms are:

  • value iteration
  • on-policy MC
  • Q-learning
  • Sarsa-n
  • Dyna-Q

Results

Here’s how they perform on Gridworld, this time with a maximal step limit of 200,000:

Image by author

Let’s also plot the corresponding time needed (note that I plot unsuccessful runs — runs reaching the maximal number of steps without generating a feasible policy — at 500 s):

Image by author

We can observe several interesting facts from these figures:

  1. The number of steps and the time needed are highly correlated.
  2. Value iteration performs exceptionally well, solving even Gridworlds of size 50 x 50 with ease, and doing so orders of magnitude faster than the next-best algorithm.
  3. The ranking of the remaining algorithms is (better to worse): on-policy MC, Dyna-Q, Q-learning, Sarsa-n.

In the next section we discuss these in more detail.

Discussion

1. Steps vs. Time

We began this post with a discussion of which metric to use — in particular, whether to use the number of steps or the time needed to solve the problem. Looking back, we can say that this discussion was not so relevant after all, since — somewhat surprisingly — these two numbers are highly correlated. That is, despite the fact that, as initially described, one “step” can differ depending on the algorithm.

2. Value Iteration Dominates

Value Iteration performed remarkably well, solving even large Gridworlds (up to 50×50) with ease — outpacing all other algorithms by a wide margin. This might be surprising, considering that DP methods are often regarded as theoretical tools, rarely used in practice. Real-world applications tend to favor methods like Q-learning [2], PPO [4], or MCTS [5].

So why does such a “textbook” method dominate here? Because this environment is tailor-made for it:

  • The model is fully known.
  • The dynamics are easy and deterministic.
  • The state space is comparatively small.

These are precisely the conditions under which DP thrives. In contrast, model-free methods like Q-learning are designed for settings where such information is not available. Their strength lies in generality and scalability, not in exploiting small, well-defined problems. Q-learning incurs high variance and requires many episodes to converge — disadvantages that are magnified in small-scale environments. In short, there is a clear trade-off between efficiency and generality. We’ll revisit this point in a future post when we introduce function approximation, where Q-learning has more room to shine.

3. A Ranking Emerges

Beyond Value Iteration, we observed the following performance ranking: On-policy MC > Dyna-Q > Q-learning > Sarsa-n

On-policy Monte Carlo emerged as the best-performing model-free algorithm. This fits with our earlier reasoning: MC methods are simple, unbiased, and well-suited to problems with deterministic goals — especially when episodes are relatively short. While not scalable to large or continuous problems, MC methods appear to be quite effective in small to medium-sized tasks like Gridworld.

Dyna-Q comes next. This result reinforces our expectations: Dyna-Q blends model-based planning with model-free learning. Although the model is learned (not given, as in Value Iteration), it is still simple and deterministic here — making the learned model useful. This boosts performance substantially over pure model-free approaches.

Q-learning, while still powerful, underperforms in this context for the reasons discussed above: it’s a general-purpose algorithm that isn’t able to fully leverage the structure of simple environments.

Sarsa-n landed in last place. A probable explanation is the added bias introduced through bootstrapping in its multi-step updates. Unlike Monte Carlo methods, which estimate returns from full trajectories (unbiased), Sarsa-n uses bootstrapped estimates of future rewards. In small environments, this bias can outweigh the advantages of reduced variance.

Lastly, let’s compare our results with those from Sutton:

Image from [1]

Note that Sutton lists the total number of states on the x-axis, whereas we list n, with the total number of states being n x n. For 376 states, Sutton reports ~100k steps before the optimal solution is found, whereas we report 75k steps for 400 states (20 x 20), considering Dyna-Q. The numbers are highly comparable and provide a reassuring validation of our setup and implementation.

Conclusion

This post served both as a recap of our series on Part I of Sutton and Barto’s book [1] and as an extension beyond the book’s scope — benchmarking all introduced algorithms on increasingly larger Gridworld environments.

We began by outlining our benchmarking setup, then revisited the core chapters of Part I: Dynamic Programming, Monte Carlo methods, Temporal-Difference learning, and Model-Based RL / Planning. In each section, we introduced key algorithms such as Q-learning, provided complete Python implementations, and evaluated their performance on Gridworlds of up to size 25×25. The goal of this initial round was to identify the top performers from each algorithmic family. Based on our experiments, the standouts were:
Value Iteration, On-policy MC, Q-learning, Sarsa-n, and Dyna-Q. Python code to reproduce these results, and in particular implementations of all discussed methods, is available on GitHub.

Next, we stress-tested these high performers on larger environments (up to 50×50) and observed the following ranking:
Value Iteration > On-policy MC > Dyna-Q > Q-learning > Sarsa-n

While this result may be surprising — given the widespread use of Q-learning and the relatively rare application of Value Iteration and MC methods — it makes sense in context. Simple, fully known, deterministic environments are ideal for Value Iteration and MC methods. In contrast, Q-learning is designed for more complex, unknown, and high-variance environments where function approximation becomes necessary. As we discussed, there is a trade-off between efficiency in structured tasks and generality in complex ones.

That brings us to what’s next. In upcoming posts, we’ll push the boundaries further:

  • First, by benchmarking these methods in more challenging environments such as two-player games, where direct competition will expose their differences more starkly.
  • Then, we’ll dive into Part II of Sutton’s book, where function approximation is introduced. This unlocks the ability to scale reinforcement learning to environments far beyond what tabular methods can handle.

If you’ve made it this far — thanks for reading! I hope you enjoyed this deep dive, and I’d love to have you back for the next installment in the series.

Other Posts in this Series

References

[1] http://incompleteideas.net/book/RLbook2020.pdf

[2] https://arxiv.org/abs/1312.5602

[3] https://gymnasium.farama.org/index.html

[4] https://arxiv.org/abs/1707.06347

[5] https://arxiv.org/abs/1911.08265

(*) Images from [1] used with permission from the authors.
