⚠️ An updated version of this article is available here 👉 https://huggingface.co/deep-rl-course/unit1/introduction
This article is part of the Deep Reinforcement Learning Class, a free course from beginner to expert. Check the syllabus here.
Welcome to probably the most fascinating topic in Artificial Intelligence: Deep Reinforcement Learning.
Deep RL is a type of Machine Learning where an agent learns how to behave in an environment by performing actions and seeing the results.
Since 2013 and the Deep Q-Learning paper, we've seen plenty of breakthroughs. From OpenAI Five, which beat some of the best Dota 2 players in the world, to the Dexterity project, we live in an exciting moment in Deep RL research.

Furthermore, since 2018, you now have access to so many amazing environments and libraries to build your agents.
That's why this is the best moment to start learning, and with this course, you're in the right place.
Yes, because this article is the first unit of the Deep Reinforcement Learning Class, a free class from beginner to expert where you'll learn the theory and practice using famous Deep RL libraries such as Stable Baselines3, RL Baselines3 Zoo and RLlib.
On this free course, you’ll:
- 📖 Study Deep Reinforcement Learning in theory and practice.
- 🧑💻 Learn to use famous Deep RL libraries such as Stable Baselines3, RL Baselines3 Zoo, and RLlib.
- 🤖 Train agents in unique environments such as SnowballFight, Huggy the Doggo 🐶, and classical ones such as Space Invaders and PyBullet.
- 💾 Publish your trained agents to the Hub in one line of code, but also download powerful agents from the community.
- 🏆 Take part in challenges where you’ll evaluate your agents against other teams.
- 🖌️🎨 Learn to share your environments made with Unity and Godot.
So in this first unit, you'll learn the foundations of Deep Reinforcement Learning. Then you'll train your first lander agent to land correctly on the Moon 🌕 and upload it to the Hugging Face Hub, a free, open platform where people can share ML models, datasets and demos.
It's essential to master these elements before diving into implementing Deep Reinforcement Learning agents. The goal of this chapter is to give you solid foundations.
If you prefer, you can watch the 📹 video version of this chapter:
So let’s start! 🚀
What is Reinforcement Learning?
To understand Reinforcement Learning, let's start with the big picture.
The big picture
The concept behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.
Learning from interaction with the environment comes from our natural experiences.
For instance, imagine putting your little brother in front of a video game he has never played, a controller in his hands, and leaving him alone.

Your brother will interact with the environment (the video game) by pressing the right button (action). He got a coin, that's a +1 reward. It's positive, he just understood that in this game he must get the coins.

But then, he presses right again and touches an enemy: he just died, -1 reward.

By interacting with his environment through trial and error, your little brother understood that he needed to get coins in this environment but avoid the enemies.
Without any supervision, the child will get better and better at playing the game.
That's how humans and animals learn, through interaction. Reinforcement Learning is just a computational approach of learning from actions.
A formal definition
Now let's take a formal definition:
Reinforcement learning is a framework for solving control tasks (also called decision problems) by constructing agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
⇒ But how does Reinforcement Learning work?
The Reinforcement Learning Framework
The RL Process

To understand the RL process, let's imagine an agent learning to play a platform game:

- Our Agent receives a state from the Environment — we receive the first frame of our game (Environment).
- Based on that state, the Agent takes an action — our Agent will move to the right.
- The Environment goes to a new state — a new frame.
- The Environment gives some reward to the Agent — we're not dead (Positive Reward +1).
This RL loop outputs a sequence of state, action, reward and next state.

The agent's goal is to maximize its cumulative reward, called the expected return.
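Here's what this loop looks like in code — a minimal sketch assuming the Gymnasium library is installed (we use CartPole because it needs no extra dependencies, and the agent simply takes random actions instead of using a learned policy):

```python
import gymnasium as gym

# Create the environment. CartPole is used here because it needs no extra
# dependencies; the hands-on tutorial uses a lander environment instead.
env = gym.make("CartPole-v1")

# Our Agent receives the first state from the Environment.
state, info = env.reset()

episode_return = 0.0
done = False
while not done:
    # Based on that state, the Agent takes an action.
    # Here it is a random action; a trained agent would use its policy instead.
    action = env.action_space.sample()

    # The Environment goes to a new state and gives some reward to the Agent.
    state, reward, terminated, truncated, info = env.step(action)

    episode_return += reward
    done = terminated or truncated

env.close()
print(f"Cumulative reward for this episode: {episode_return}")
```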
The reward hypothesis: the central idea of Reinforcement Learning
⇒ Why is the goal of the agent to maximize the expected return?
Because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected return (expected cumulative reward).
That's why in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.
Markov Property
In papers, you'll see that the RL process is called a Markov Decision Process (MDP).
We'll talk again about the Markov Property in the following units. But if you need to remember one thing about it today, it's that the Markov Property implies that our agent needs only the current state to decide what action to take, and not the history of all the states and actions it took before.
Observations/States Space
Observations/States are the information our agent gets from the environment. In the case of a video game, it can be a frame (a screenshot). In the case of a trading agent, it can be the value of a certain stock, etc.
There is a differentiation to make between observation and state:
- State s: a complete description of the state of the world (there is no hidden information). In a fully observed environment.

In a chess game, we receive a state from the environment since we have access to the whole chessboard information.
With a chess game, we are in a fully observed environment, since we have access to the whole chessboard information.
- Observation o: a partial description of the state. In a partially observed environment.

In Super Mario Bros, we only see a part of the level close to the player, so we receive an observation.
In Super Mario Bros, we are in a partially observed environment. We receive an observation since we only see a part of the level.
In this course, we use the term state, but we'll make the distinction in implementations.
To recap:

Action Space
The Action space is the set of all possible actions in an environment.
The actions can come from a discrete or continuous space:
- Discrete space: the number of possible actions is finite.

In Super Mario Bros, we have a finite set of actions since we have only 4 directions and jump.
- Continuous space: the number of possible actions is infinite.

A Self-Driving Car agent has an infinite number of possible actions since it can turn left 20°, 21.1°, 21.2°, honk, turn right 20°…
To recap:
Taking this information into account is crucial because it will matter when choosing the RL algorithm in the future.
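To make this concrete, here's a small sketch, assuming the Gymnasium library, that prints the observation and action spaces of an environment with discrete actions and one with continuous actions:

```python
import gymnasium as gym

# Discrete action space: a finite number of possible actions.
cartpole = gym.make("CartPole-v1")
print(cartpole.action_space)       # Discrete(2): push the cart left or right
print(cartpole.observation_space)  # a Box of 4 floats describing the state

# Continuous action space: actions are real-valued, so infinitely many are possible.
pendulum = gym.make("Pendulum-v1")
print(pendulum.action_space)       # a Box with one float: the torque to apply
```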
Rewards and the discounting
The reward is fundamental in RL because it's the only feedback for the agent. Thanks to it, our agent knows if the action taken was good or not.
The cumulative reward at each time step t can be written as:
R(τ) = r_{t+1} + r_{t+2} + r_{t+3} + r_{t+4} + ...
Which is equivalent to:
R(τ) = Σ_{k=0}^{∞} r_{t+k+1}
However, in reality, we can't just add the rewards like that. The rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.
Let's say your agent is this tiny mouse that can move one tile each time step, and your opponent is the cat (that can move too). Your goal is to eat the maximum amount of cheese before being eaten by the cat.

As we can see in the diagram, it's more probable to eat the cheese near us than the cheese close to the cat (the closer we are to the cat, the more dangerous it is).
Consequently, the reward near the cat, even if it is bigger (more cheese), will be more discounted since we're not really sure we'll be able to eat it.
To discount the rewards, we proceed like this:
1. We define a discount rate called gamma. It must be between 0 and 1. Most of the time between 0.95 and 0.99.
- The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
- On the other hand, the smaller the gamma, the larger the discount. This means our agent cares more about the short-term reward (the nearest cheese).
2. Then, each reward will be discounted by gamma to the exponent of the time step. As the time step increases, the cat gets closer to us, so the future reward is less and less likely to happen.
Our discounted expected cumulative reward is:
R(τ) = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ... = Σ_{k=0}^{∞} γ^k r_{t+k+1}
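As a quick illustration, here's how you could compute that discounted return in Python for a hypothetical list of rewards:

```python
# Hypothetical rewards collected at successive time steps.
rewards = [1, 0, 3, 2, 5]
gamma = 0.95  # discount rate

# Each reward is discounted by gamma to the exponent of its time step.
discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
print(discounted_return)  # 1 + 0.95**2 * 3 + 0.95**3 * 2 + 0.95**4 * 5 ≈ 9.49
```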
Types of tasks
A task is an instance of a Reinforcement Learning problem. We can have two types of tasks: episodic and continuing.
Episodic task
In this case, we have a starting point and an ending point (a terminal state). This creates an episode: a list of States, Actions, Rewards, and new States.
For instance, think about Super Mario Bros: an episode begins at the launch of a new Mario level and ends when you're killed or you reach the end of the level.

Continuing tasks
These are tasks that continue forever (no terminal state). In this case, the agent must learn how to choose the best actions and simultaneously interact with the environment.
For instance, an agent that does automated stock trading. For this task, there is no starting point and terminal state. The agent keeps running until we decide to stop it.


Exploration/Exploitation trade-off
Finally, before looking at the different methods to solve Reinforcement Learning problems, we must cover one more very important topic: the exploration/exploitation trade-off.
- Exploration is exploring the environment by trying random actions in order to find more information about the environment.
- Exploitation is exploiting known information to maximize the reward.
Remember, the goal of our RL agent is to maximize the expected cumulative reward. However, we can fall into a common trap.
Let’s take an example:

In this game, our mouse can have an infinite amount of small cheese (+1 each). But at the top of the maze, there is a gigantic sum of cheese (+1000).
However, if we only focus on exploitation, our agent will never reach the gigantic sum of cheese. Instead, it will only exploit the nearest source of rewards, even if this source is small (exploitation).
But if our agent does a little bit of exploration, it can discover the big reward (the pile of big cheese).
This is what we call the exploration/exploitation trade-off. We need to balance how much we explore the environment and how much we exploit what we know about the environment.
Therefore, we must define a rule that helps handle this trade-off. We'll see in future chapters different ways to handle it.
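To give you a flavor of such a rule before we get there, here's a hypothetical sketch of epsilon-greedy, one common way to handle the trade-off (the `q_values` list stands in for the agent's current estimates of how good each action is):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon we explore: pick a random action.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    # Otherwise we exploit: pick the action with the highest estimated value.
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Hypothetical value estimates for 4 actions.
action = epsilon_greedy([0.2, 0.5, 0.1, 0.4], epsilon=0.1)
print(action)  # 1 most of the time, occasionally a random action
```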
If it's still confusing, think of a real problem: the choice of a restaurant:

- Exploitation: You go every day to the same one that you know is good and take the risk of missing another, better restaurant.
- Exploration: Try restaurants you never went to before, with the risk of having a bad experience but the probable opportunity of a fantastic experience.
To recap:

The two main approaches for solving RL problems
⇒ Now that we've learned the RL framework, how do we solve the RL problem?
In other terms, how do we build an RL agent that can select the actions that maximize its expected cumulative reward?
The Policy π: the agent’s brain
The Policy π is the brain of our Agent: it's the function that tells us what action to take given the state we're in. So it defines the agent's behavior at a given time.

Think of the policy as the brain of our agent, the function that tells us the action to take given a state.
This Policy is the function we want to learn. Our goal is to find the optimal policy π*, the policy that maximizes expected return when the agent acts according to it. We find this π* through training.
There are two approaches to train our agent to find this optimal policy π*:
- Directly, by teaching the agent to learn which action to take given the state it is in: Policy-Based Methods.
- Indirectly, by teaching the agent to learn which state is more valuable and then take the action that leads to the more valuable states: Value-Based Methods.
Policy-Based Methods
In Policy-Based Methods, we learn a policy function directly.
This function will map each state to the best corresponding action at that state, or to a probability distribution over the set of possible actions at that state.

We have two types of policy:
- Deterministic: a policy at a given state will always return the same action.


- Stochastic: outputs a probability distribution over actions.


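Here's a tiny sketch of the difference, with made-up states, actions, and probabilities purely for illustration:

```python
import random

# Deterministic policy: at a given state, it always returns the same action.
def deterministic_policy(state):
    actions = {"enemy_ahead": "jump", "coin_ahead": "move_right"}
    return actions[state]

# Stochastic policy: it outputs a probability distribution over actions,
# and the action is sampled from that distribution.
def stochastic_policy(state):
    action_probs = {"move_right": 0.7, "jump": 0.2, "move_left": 0.1}  # hypothetical
    actions, probs = zip(*action_probs.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy("coin_ahead"))  # always "move_right"
print(stochastic_policy("coin_ahead"))     # usually "move_right", but not always
```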
If we recap:


Value-based methods
In Value-based methods, instead of training a policy function, we train a value function that maps a state to the expected value of being at that state.
The value of a state is the expected discounted return the agent can get if it starts in that state and then acts according to our policy.
"Act according to our policy" just means that our policy is "going to the state with the highest value".

Here we see that our value function defined a value for each possible state.

Thanks to our value function, at each step our policy will select the state with the biggest value defined by the value function: -7, then -6, then -5 (and so on) to achieve the goal.
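As a small illustration, here's a sketch with a hand-written value table over a row of states mirroring the -7, -6, -5 example above (the values are made up, not learned):

```python
# Hypothetical values for 8 states in a row; the goal is state 7 (value 0).
state_values = {0: -7, 1: -6, 2: -5, 3: -4, 4: -3, 5: -2, 6: -1, 7: 0}

def policy(state):
    # "Acting according to our policy": move to the neighboring state
    # with the highest value.
    neighbors = [s for s in (state - 1, state + 1) if s in state_values]
    return max(neighbors, key=lambda s: state_values[s])

state = 0
while state != 7:
    state = policy(state)
    print(state, state_values[state])  # walks through values -6, -5, -4, ... 0
```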
If we recap:


The “Deep” in Reinforcement Learning
⇒ What we've talked about so far is Reinforcement Learning. But where does the "Deep" come into play?
Deep Reinforcement Learning introduces deep neural networks to solve Reinforcement Learning problems — hence the name "deep".
For instance, in the next article, we'll work on Q-Learning (classic Reinforcement Learning) and then Deep Q-Learning; both are value-based RL algorithms.
You'll see the difference is that in the first approach, we use a traditional algorithm to create a Q-table that helps us find what action to take for each state.
In the second approach, we'll use a Neural Network (to approximate the Q-value).
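To preview that difference, here's a rough sketch; the table size, network shape, and use of PyTorch are illustrative assumptions, not the exact setup we'll use later:

```python
import numpy as np
import torch
import torch.nn as nn

# Classic Q-Learning: a Q-table with one cell per (state, action) pair.
n_states, n_actions = 16, 4
q_table = np.zeros((n_states, n_actions))
best_action = int(q_table[3].argmax())  # look up the best action for state 3

# Deep Q-Learning: a neural network approximates the Q-values, which is needed
# when there are far too many states for a table (e.g. raw game frames).
q_network = nn.Sequential(
    nn.Linear(8, 64),   # 8 = size of the state vector in this toy example
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
state = torch.rand(1, 8)                      # a dummy state
best_action = int(q_network(state).argmax())  # action with the highest predicted Q-value
```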

If you are not familiar with Deep Learning, you should definitely watch fastai's Practical Deep Learning for Coders (free).
That was a lot of information. If we summarize:
- Reinforcement Learning is a computational approach of learning from actions. We build an agent that learns from the environment by interacting with it through trial and error and receiving rewards (negative or positive) as feedback.
- The goal of any RL agent is to maximize its expected cumulative reward (also called expected return) because RL is based on the reward hypothesis, which is that all goals can be described as the maximization of the expected cumulative reward.
- The RL process is a loop that outputs a sequence of state, action, reward and next state.
- To calculate the expected cumulative reward (expected return), we discount the rewards: the rewards that come sooner (at the beginning of the game) are more likely to happen since they are more predictable than the long-term future reward.
- To solve an RL problem, you want to find an optimal policy. The policy is the "brain" of your AI that tells us what action to take given a state. The optimal one is the one that gives you the actions that maximize the expected return.
- There are two ways to find your optimal policy:
  - By training your policy directly: policy-based methods.
  - By training a value function that tells us the expected return the agent will get at each state and using this function to define our policy: value-based methods.
- Finally, we speak about Deep RL because we introduce deep neural networks to estimate the action to take (policy-based) or to estimate the value of a state (value-based), hence the name "deep."
Now that you've studied the basics of Reinforcement Learning, you're ready to train your first lander agent to land correctly on the Moon 🌕 and share it with the community through the Hub 🔥
Start the tutorial here 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/unit1.ipynb
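If you just want a taste of what the hands-on looks like, here's a minimal sketch, assuming recent versions of gymnasium (with the Box2D extra) and stable-baselines3 are installed; the tutorial above walks through it properly, including pushing the agent to the Hub:

```python
import gymnasium as gym
from stable_baselines3 import PPO

# Create the lander environment and a PPO agent with a simple MLP policy.
# Depending on your gymnasium version, the id may be "LunarLander-v3" instead.
env = gym.make("LunarLander-v2")
model = PPO("MlpPolicy", env, verbose=1)

# Train the agent, then save it so it can be shared on the Hub.
model.learn(total_timesteps=100_000)
model.save("ppo-LunarLander-v2")
```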
And since the best way to learn and avoid the illusion of competence is to test yourself, we wrote a quiz to help you find where you need to reinforce your study.
Check your knowledge here 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/quiz.md
Congrats on finishing this chapter! That was the biggest one, and there was a lot of information. And congrats on finishing the tutorial. You've just trained your first Deep RL agent and shared it on the Hub 🥳.
It's normal if you still feel confused by all these elements. This was the same for me and for everyone who studied RL.
Take time to really grasp the material before continuing. It's essential to master these elements and have solid foundations before entering the fun part.
We published additional readings in the syllabus if you want to go deeper 👉 https://github.com/huggingface/deep-rl-class/blob/main/unit1/README.md
Naturally, during the course, we're going to use and explain these terms again, but it's better to understand them before diving into the next chapters.
In the next chapter, we're going to study Q-Learning and dive deeper into value-based methods.
And don't forget to share with your friends who want to learn 🤗!
Finally, we want to improve and update the course iteratively with your feedback. If you have some, please fill out this form 👉 https://forms.gle/3HgA7bEHwAmmLfwh9
