This article covers the basic concepts you need to know to understand Reinforcement Learning!
We’ll progress from the absolute basics of agents and environments to more advanced topics, including agent exploration, values and policies, and the distinctions between popular training approaches. Along the way, we’ll also learn about the various challenges in RL and how researchers have tackled them.
At the end of the article, I’ll also share a YouTube video I made that explains all the concepts in this article in a visually engaging way. If you are not much of a reader, you can try that companion video instead!
Reinforcement Learning Basics
Suppose you want to train an AI model to learn how to navigate an obstacle course. RL is a branch of Machine Learning where our models learn by collecting experiences – taking actions and observing what happens. More formally, RL consists of two components – the agent and the environment.
The Agent
The training process involves two key activities that occur over and over again: exploration and training. During exploration, the agent collects experiences in the environment by taking actions and seeing what happens. Then, during the training activity, the agent uses these collected experiences to improve itself.
The Environment
Once the agent selects an action, the environment updates. It also returns a reward depending on how well the agent is doing. The environment designer programs how the reward is structured.
For instance, suppose you’re working on an environment that teaches an AI to avoid obstacles and reach a goal. You can program your environment to return a positive reward when the agent moves closer to the goal. But when the agent collides with an obstacle, you can program it to receive a large negative reward.
In other words, the environment provides a carrot (a high positive reward, for instance) when the agent does something right and a stick (a negative reward, for instance) when it does something wrong.
Although the agent is oblivious to how the environment actually operates, it can still figure out from these reward patterns how to pick optimal actions that lead to maximum rewards.
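To make this concrete, here is a minimal sketch of how such a reward function might be programmed. The argument names and the specific numbers are hypothetical choices for illustration, not part of any standard library:

```python
def compute_reward(prev_dist_to_goal, dist_to_goal, collided, reached_goal):
    """Sketch of a shaped reward for an obstacle-navigation environment."""
    reward = 0.0
    # Positive reward for making progress toward the goal.
    reward += 1.0 * (prev_dist_to_goal - dist_to_goal)
    # Large negative reward for colliding with an obstacle.
    if collided:
        reward -= 10.0
    # Bonus for actually reaching the goal.
    if reached_goal:
        reward += 100.0
    return reward
```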

Policy
At each step, the agent observes the current state of the environment and selects an action. The goal of RL is to learn a mapping from observations to actions, i.e. “given the state I’m observing, what action should I select?”
In RL terms, this mapping from state to action is called a policy.
This policy defines how the agent behaves in different states, and in reinforcement learning we learn this function by training some kind of neural network.
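As a rough illustration, here is what a small policy network might look like in PyTorch, assuming a vector observation and a discrete set of actions. The sizes and architecture are placeholder choices:

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps an observation (state) to a probability distribution over actions."""

    def __init__(self, obs_dim: int, n_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        logits = self.net(obs)
        return torch.softmax(logits, dim=-1)  # probability of each action
```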
Reinforcement Learning

Understanding the distinctions and interplay between the agent, the policy, and the environment is essential to understanding Reinforcement Learning.
- The Agent is the learner that explores and takes actions within the environment.
- The Policy is the strategy (often a neural network) that the agent uses to determine which action to take given a state. In RL, our ultimate goal is to train this strategy.
- The Environment is the external system that the agent interacts with, which provides feedback in the form of rewards and new states.
Here’s a quick one-liner definition you should remember:
In Reinforcement Learning, the agent follows a policy to select actions within the environment.
Observations and Actions
The agent explores the environment by taking a sequence of “steps”. Each step is one decision: the agent observes the environment’s state, decides on an action, receives a reward, and observes the next state. In this section, let’s understand what observations and actions are.
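If you have never seen this loop in code, here is roughly what it looks like with a Gymnasium-style environment. The random-action agent is just a stand-in for a real policy:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")           # any environment with the standard step/reset API
obs, info = env.reset()

for step in range(1_000):
    action = env.action_space.sample()  # placeholder: a real agent would query its policy here
    obs, reward, terminated, truncated, info = env.step(action)
    # obs    -> the next observation the agent sees
    # reward -> the environment's feedback for this single step
    if terminated or truncated:
        obs, info = env.reset()         # start a new episode
env.close()
```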
Observation
The observation is what the agent sees from the environment – the information it receives about the environment’s current state. In an obstacle navigation environment, the observation could be LiDAR projections to detect the obstacles. For Atari games, it could be a history of the previous few pixel frames. For text generation, it could be the context of the tokens generated so far. In chess, it’s the position of all the pieces, whose move it is, etc.
The observation ideally contains all the information the agent needs to take an action.
The action space is all the available decisions the agent can take. Actions can be discrete or continuous. A discrete action space is when the agent has to choose from a specific set of categorical decisions. For instance, in Atari games, the actions could be the buttons of an Atari controller. For text generation, it’s choosing from all the tokens present in the model’s vocabulary. In chess, it could be the list of available moves.

The environment designer may instead choose a continuous action space – where the agent generates continuous values to take a “step” in the environment. For instance, in our obstacle navigation example, the agent can select the x and y velocities to get fine-grained control of the movement. In a humanoid character control task, the action is often to output the torque or target angle for every joint in the character’s skeleton.
An important lesson
But here is something very important to understand: to the agent and the policy, the environment and its specifics can be a complete black box. The agent receives state information as an observation, generates an action, receives a reward, and later learns from it.
So in your mind, you can think of the agent and the environment as two separate entities. The environment defines the state space, the action space, the reward structure, and the rules.
These rules are decoupled from how the agent explores and how the policy is trained on the collected experiences.
When studying a research paper, it is important to clarify in our mind which aspect of RL we’re reading about. Is it about a new environment? Is it about a new policy training method? Is it about an exploration strategy? Depending on the answer, you can treat everything else as a black box.
Exploration
How does the agent explore and collect experiences?
Every RL algorithm must solve one of the biggest dilemmas in training RL agents – exploration vs exploitation.
Exploration means trying out new actions to collect information about the environment. Imagine you’re learning to fight a boss in a difficult video game. At first, you’ll try different approaches, different weapons, spells, random things just to see what sticks and what doesn’t.
However, once you begin seeing some rewards, like consistently dealing damage to the boss, you’ll stop exploring and begin exploiting the strategy you have already acquired. Exploitation means greedily picking actions you think will get the best rewards.
A good RL exploration strategy must balance exploration and exploitation.
A popular exploration strategy is Epsilon-Greedy, where the agent explores with a random action a fraction of the time (defined by a parameter epsilon), and exploits its best-known action the rest of the time. This epsilon value is usually high at the start and is gradually decreased to favor exploitation as the agent learns.
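Here is a minimal sketch of epsilon-greedy action selection over a list of Q-values. The decay schedule in the comment is one common choice, not the only one:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon pick a random action (explore), otherwise the best-known one (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# A common decay schedule: start high, decay toward a small floor after every episode.
# epsilon = max(0.05, epsilon * 0.995)
```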

Epsilon-greedy only works in discrete action spaces. In continuous spaces, exploration is commonly handled in two popular ways. One way is to add a bit of random noise to the action the agent decides to take. Another popular technique is to add an entropy bonus to the loss function, which encourages the policy to be less certain about its choices, naturally leading to more varied actions and exploration.
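For the continuous case, the noise-based approach can be as simple as the sketch below; the noise scale and action bounds are arbitrary placeholders:

```python
import numpy as np

def noisy_action(policy_action, noise_std=0.1, low=-1.0, high=1.0):
    """Add Gaussian noise to a continuous action for exploration, then clip it to the valid range."""
    noise = np.random.normal(0.0, noise_std, size=np.shape(policy_action))
    return np.clip(np.asarray(policy_action) + noise, low, high)
```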
Other ways to encourage exploration are:
- Design the environment to use random initialization of states at the beginning of episodes.
- Intrinsic exploration methods where the agent acts out of its own “curiosity.” Algorithms like Curiosity and RND reward the agent for visiting novel states or taking actions whose outcome is difficult to predict.
I cover these fascinating methods in my Agentic Curiosity video, so be sure to check that out!
Training Algorithms
A majority of research papers and academic topics in Reinforcement Learning are about optimizing the agent’s strategy for selecting actions. The goal of these optimization algorithms is to learn actions that maximize the long-term expected rewards.
Let’s take a look at the different algorithmic choices one by one.
Model-Based vs Model-Free
Alright, so our agent has explored the environment and picked up a ton of experience. Now what?
Does the agent learn to act directly from these experiences? Or does it first attempt to model the environment’s dynamics and physics?
One approach is model-based learning. Here, the agent first uses its experience to build its own internal simulation, or a world model. This model learns to predict the consequences of its actions, i.e., given a state and action, what is the resulting next state and reward? Once it has this model, it can practice and plan entirely inside its own imagination, running thousands of simulations to find the best strategy without ever taking a dangerous step in the real world.

This is especially useful in environments where collecting real-world experience can be expensive – like robotics or self-driving cars. Examples of Model-Based RL are Dyna-Q, World Models, Dreamer, etc. I’ll write a separate article someday to cover these models in more detail.
The second approach is called model-free learning. This is what the rest of the article is going to cover. Here, the agent treats the environment as a black box and learns a policy directly from the collected experiences. Let’s talk more about model-free RL in the next section.
Value-Based Learning
There are two primary approaches to model-free RL algorithms.
Value-based algorithms learn to evaluate how good each state is. Policy-based algorithms directly learn how to act in each state.

In value-based methods, the RL agent learns the value of being in a particular state. The value of a state literally means how good the state is. The intuition is that if the agent knows which states are good, it can pick actions that lead to those states more often.
And thankfully, there’s a mathematical way of doing this – the Bellman Equation.
V(s) = r + γ * max V(s’)
This recurrence equation basically says that the value V(s) of a state s is equal to the immediate reward r of being in that state plus the discounted value of the best next state s’ the agent can reach from s. Gamma (γ) is a discount factor (between 0 and 1) that dampens the value of the next state. It essentially decides how much the agent cares about rewards in the distant future versus immediate rewards. A γ near 1 makes the agent “far-sighted,” whereas a γ near 0 makes the agent “short-sighted,” greedily caring almost only about the very next reward.
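To see the Bellman equation in action, here is a tiny sketch that repeatedly applies it over a small, fully known, deterministic environment. The transition and reward functions are hypothetical stand-ins you would supply yourself:

```python
def value_iteration(states, actions, transition, reward, gamma=0.99, iterations=100):
    """Repeatedly apply V(s) = max_a [ reward(s, a) + gamma * V(transition(s, a)) ]."""
    # Assumes transition(s, a) always returns a state that is in `states`.
    V = {s: 0.0 for s in states}
    for _ in range(iterations):
        V = {
            s: max(reward(s, a) + gamma * V[transition(s, a)] for a in actions)
            for s in states
        }
    return V
```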
Q-Learning
We learned the intuition behind state values, but how do we use that information to learn actions? The Q-Learning equation answers this.
Q(s, a) = r + γ * max_a’ Q(s’, a’)
The Q-value Q(s, a) is the quality of the action a in state s. The above equation basically states: the quality of an action a in state s is the immediate reward r you get from taking that action in state s, plus the discounted quality value of the next best action.
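As a sketch, one tabular Q-Learning update might look like this. The learning rate alpha is a detail not discussed above; it controls how far the estimate moves toward the new target:

```python
def q_learning_update(Q, s, a, r, s_next, n_actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the 1-step target r + gamma * max_a' Q(s', a')."""
    # Q is a dict mapping (state, action) pairs to estimated quality values.
    best_next = max(Q.get((s_next, a_next), 0.0) for a_next in range(n_actions))
    target = r + gamma * best_next
    current = Q.get((s, a), 0.0)
    Q[(s, a)] = current + alpha * (target - current)
```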
So in summary:
- Q-values are the quality values of each action in each state.
- V-values are the values of specific states; the value of a state equals the maximum Q-value over all actions in that state.
- The policy π at a particular state is the action that has the highest Q-value in that state.

To learn more about Q-Learning, you can research Deep Q Networks and their descendants, like Double Deep Q Networks and Dueling Deep Q Networks.
Value-based learning trains RL agents by learning the value of being in specific states. However, is there a direct way to learn optimal actions without having to learn state values? Yes.
Policy learning methods directly learn optimal action strategies without explicitly learning state values. Before we find out how, we must learn another important concept first: Temporal Difference Learning vs Monte Carlo Sampling.
TD Learning vs MC Sampling
How does the agent consolidate future rewards into its learning updates?
In Temporal Difference (TD) Learning, the agent updates its value estimates after each step using the Bellman equation. It does so by bootstrapping from its own estimate of the Q-value at the next state. This strategy is called 1-step TD Learning, or one-step Temporal Difference Learning: you take one step and update your estimates based on your existing estimates.

The second option is called Monte Carlo sampling. Here, the agent waits for the entire episode to finish before updating anything. It then uses the full return from the episode:
Q(s, a) = r₁ + γr₂ + γ²r₃ + … + γⁿ⁻¹rₙ
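Here is a minimal sketch of computing these Monte Carlo returns from a finished episode, working backwards so each step’s return includes everything that came after it:

```python
def monte_carlo_returns(rewards, gamma=0.99):
    """Compute the discounted return G_t = r_t + gamma * r_{t+1} + ... for every step of a finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# In contrast, a 1-step TD target needs only the very next step:
# td_target = r + gamma * max(Q[s_next])   # available immediately, but biased by the current estimate
```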

Trade-offs between TD Learning and MC Sampling
TD Learning is pretty cool because the agent can learn something from each step, even before it completes an episode. This means you can save your collected experiences for a long time and keep training even on old experiences, but with new Q-values. However, TD learning is heavily biased by the agent’s current estimate of the state. So if the agent’s estimates are wrong, it’ll keep reinforcing those wrong estimates. This is known as the “bias problem.”
On the other hand, Monte Carlo learning is always accurate because it uses the true returns from actual episodes. But in most RL environments, rewards and state transitions can be random. Also, as the agent explores the environment, its own actions can be random, so the states it visits during a rollout are also random. This leaves the pure Monte Carlo method suffering from high variance issues, as returns can vary dramatically between episodes.
Policy Gradients
Alright, now that we have understood the concept of TD Learning vs MC Sampling, it’s time to get back to policy-based learning methods.
Recall that value-based methods like DQN first need to explicitly calculate the value, or Q-value, of every single possible action, and then they pick the best one. But it is possible to skip this step, and Policy Gradient methods like REINFORCE do exactly that.

In REINFORCE, the policy network outputs probabilities for each action, and we train it to increase the probability of actions that lead to good outcomes. For discrete spaces, PG methods output the probability of each action as a categorical distribution. For continuous spaces, PG methods output Gaussian distributions, predicting the mean and standard deviation of each element in the action vector.
So the question is: how exactly do you train such a model that directly predicts action probabilities from states?
Here is where the Policy Gradient Theorem comes in. In this article, I’ll explain the core idea intuitively.
- Our policy gradient model is commonly denoted in the literature as pi_theta(a|s). Here, theta denotes the weights of the neural network, and pi_theta(a|s) is the probability that the network assigns to action a in state s.
- From a newly initialized policy network, we let the agent play out a full episode and collect all the rewards.
- For each action it took, we compute the total discounted return that came after it. This is done using the Monte Carlo approach.
- Finally, to actually train the model, the policy gradient theorem asks us to maximize the objective shown in the figure below.
- If the return was high, this update will make that action more probable in the future by increasing pi(a|s). If the return was negative, this update will make the action less probable by reducing pi(a|s). (A minimal code sketch of this update follows the list.)
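Here is a minimal PyTorch-style sketch of that update. It assumes you have already collected the per-step log probabilities and the Monte Carlo returns for one episode:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """REINFORCE objective: scale each log pi_theta(a_t | s_t) by the return that followed it.

    log_probs: log pi_theta(a_t | s_t) for every step of the episode
    returns:   the discounted Monte Carlo returns G_t for the same steps
    """
    # Maximizing sum(log_prob * return) is the same as minimizing its negative.
    return -(log_probs * returns).sum()

# Typical usage after one episode (sketch):
#   loss = reinforce_loss(log_probs, returns)
#   loss.backward()
#   optimizer.step()
```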

The distinction between Q-Learning and REINFORCE
One of the core differences between Q-Learning and REINFORCE is that Q-Learning uses 1-step TD Learning, while REINFORCE uses Monte Carlo sampling.
By using 1-step TD, Q-Learning must estimate the quality value Q of every state-action possibility. Recall that in 1-step TD, the agent takes only one step in the environment and must then determine a quality score of the state from its own estimates.
On the other hand, with Monte Carlo sampling, the agent doesn’t have to rely on an estimator to learn. Instead, it uses actual returns observed during exploration. This makes REINFORCE “unbiased,” with the caveat that it requires many samples to properly estimate the value of a trajectory. Moreover, the agent cannot train until it fully finishes a trajectory (that is, reaches a terminal state), and it cannot reuse trajectories after the policy network updates.
In practice, REINFORCE often leads to stability issues and sample inefficiency. Let’s talk about how Actor Critic methods address these limitations.
Advantage Actor Critic
If you try to use vanilla REINFORCE on most complex problems, it’ll struggle, and the reason is twofold.
The first is that it suffers from high variance because it’s a Monte Carlo sampling method. Second, it has no sense of a baseline. Imagine an environment that always gives you a positive reward: the returns will never be negative, so REINFORCE will increase the probabilities of all actions, albeit in a disproportionate way.
We don’t want to reward actions just for getting a positive score. We want to reward them for being better than average.
And that’s where the concept of Advantage becomes essential. Instead of just using the raw return to update our policy, we subtract the expected return for that state. So our new update signal becomes:
Advantage = The return you got – The return you expected
While the Advantage gives us a baseline for our observed returns, let’s also discuss the concept of Actor Critic methods.
Actor Critic combines the best of value-based methods (like DQN) and the best of policy-based methods (like REINFORCE). Actor Critic methods train a separate “critic” neural network that is trained solely to evaluate states, much like the Q-Network from earlier.
The actor network, on the other hand, learns the policy.

Combining Advantage with Actor Critic, we can understand how the popular A2C algorithm works (a minimal code sketch follows this list):
- Initialize two neural networks: the policy or actor network, and the value or critic network. The actor network takes in a state and outputs action probabilities. The critic network takes in a state and outputs a single float representing the state’s value.
- We generate some rollouts in the environment by querying the actor.
- We update the critic network using either TD Learning or Monte Carlo learning. There are more advanced approaches as well, like Generalized Advantage Estimation, that blend the two for more stable learning.
- We compute the advantage by subtracting the expected return generated by the critic network from the observed return.
- Finally, we update the policy network using the advantage and the policy gradient equation.
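Putting those steps together, here is a rough sketch of the A2C loss computation for a batch of collected steps. The coefficients and the entropy bonus are conventional choices rather than fixed parts of the algorithm:

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    """Combine the actor (policy) and critic (value) losses for one A2C update.

    log_probs: log pi_theta(a_t | s_t) from the actor network
    values:    V(s_t) predicted by the critic network
    returns:   observed (or bootstrapped) returns for the same steps
    entropy:   mean entropy of the action distribution, encouraging exploration
    """
    advantages = returns - values.detach()         # how much better than the critic expected
    actor_loss = -(log_probs * advantages).mean()  # policy gradient with a baseline
    critic_loss = F.mse_loss(values, returns)      # regress the critic toward observed returns
    return actor_loss + value_coef * critic_loss - entropy_coef * entropy
```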
Actor Critic methods reduce the variance problem in policy gradients by using a value function as a baseline. PPO (Proximal Policy Optimization) extends A2C by adding the concept of “trust regions” to the training algorithm, which prevents excessive changes to the network weights during learning. We won’t get into the details of PPO in this article; perhaps someday we’ll open that Pandora’s box.
Conclusion
This article is a companion piece to the YouTube video I made below. Feel free to check it out if you enjoyed this read.
Every algorithm makes specific choices for each of these questions, and those choices cascade through the entire system, affecting everything from sample efficiency to stability to real-world performance.
In the end, creating an RL algorithm is about answering these questions by making your own choices. DQNs choose to learn values. Policy gradient methods directly learn a **policy**. Monte Carlo methods update after a full episode using actual returns – this makes them unbiased, but they have high variance due to the stochastic nature of RL exploration. TD Learning instead chooses to learn at every step based on the agent’s own estimates. Actor Critic methods combine DQNs and Policy Gradients by learning an actor and a critic network separately.
Note that there’s a lot we didn’t cover today. But this is a good base to get you started with Reinforcement Learning.
That’s the end of this article, see you in the next one! You can use the links below to discover more of my work.
My Patreon:
https://www.patreon.com/NeuralBreakdownwithAVB
My YouTube channel:
https://www.youtube.com/@avb_fj
Follow me on Twitter:
https://x.com/neural_avb
Read my articles:
https://towardsdatascience.com/writer/neural-avb/
