There are four types of Machine Learning:
- Supervised — when all of the observations in the dataset are labeled with a target variable, and you can perform regression/classification to learn how to predict them.
- Unsupervised — when there is no target variable, so you can perform clustering to segment and group the data.
- Semi-Supervised — when the target variable isn't complete, so the model has to learn how to predict unlabeled data as well. In this case, a mix of supervised and unsupervised models is used.
- Reinforcement — when there is a reward instead of a target variable and you don't know what the best solution is, so it's more of a process of trial and error to reach a specific goal.
More precisely, Reinforcement Learning studies how an AI takes action in an interactive environment in order to maximize the reward. During supervised training, you already know the correct answer (the target variable), and you fit a model to replicate it. On the contrary, in an RL problem you don't know a priori what the correct answer is: the only way to find out is by taking action and getting feedback (the reward), so the model learns by exploring and making mistakes.
RL is widely used for training robots. A good example is the autonomous vacuum: when it passes over a dusty part of the floor, it receives a reward (+1), but it gets punished (-1) when it bumps into the wall. So the robot learns which actions are good and which to avoid.
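To make the idea concrete, the vacuum's feedback could be encoded as a tiny reward function. This is just a hypothetical sketch (the sensor flags are made-up names), not part of any real robot API:

def vacuum_reward(cleaned_dust, hit_wall):
    """Hypothetical reward signal for the autonomous vacuum example."""
    reward = 0
    if cleaned_dust:  #passed over a dusty part of the floor
        reward += 1
    if hit_wall:      #bumped into the wall
        reward -= 1
    return reward

print(vacuum_reward(cleaned_dust=True, hit_wall=False))  #+1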
In this article, I'm going to show how to build custom 3D environments for training a robot with different Reinforcement Learning algorithms. I'll present some useful Python code that can be easily applied to other similar cases (just copy, paste, run), and I'll walk through every line of code with comments so that you can replicate this example.
Setup
While a supervised use case requires a target variable and a training set, an RL problem needs:
- Environment — the world surrounding the agent; it assigns rewards for actions and provides the new state as the result of the decision made. Basically, it's the space the AI can interact with (in the autonomous vacuum example, it would be the room to clean).
- Action — the set of moves the AI can make in the environment. The action space can be "discrete" (when there is a fixed number of moves, like in the game of chess) or "continuous" (infinite possible states, like driving a car or trading).
- Reward — the consequence of the action (+1/-1).
- Agent — the AI learning the best course of action in the environment to maximize the reward.
Regarding the environment, the most used 3D physics simulators range from beginner-friendly tools to professional-grade engines. You can use any of them as standalone software or through Gymnasium, the maintained fork of OpenAI's Gym library for developing Reinforcement Learning algorithms, built on top of different physics engines.
I'll use Gymnasium (pip install gymnasium) to load one of the default environments made with MuJoCo (Multi-Joint dynamics with Contact, pip install mujoco).
import gymnasium as gym
env = gym.make("Ant-v4")
obs, info = env.reset()
print(f"--- INFO: {len(info)} ---")
print(info, "\n")
print(f"--- OBS: {obs.shape} ---")
print(obs, "\n")
print(f"--- ACTIONS: {env.action_space} ---")
print(env.action_space.sample(), "\n")
print(f"--- REWARD ---")
obs, reward, terminated, truncated, info = env.step( env.action_space.sample() )
print(reward, "\n")

The robot Ant is a 3D quadruped agent consisting of a torso with four legs attached to it. Each leg has two body parts, so in total there are 8 joints (flexible body parts) and 9 links (solid body parts). The goal of this environment is to apply force (push/pull) and torque (twist/turn) to move the robot in a certain direction.
Let's try out the environment by running a single episode with the robot doing random actions (an episode is a complete run of the agent interacting with the environment, from start to termination).
import time
env = gym.make("Ant-v4", render_mode="human")
obs, info = env.reset()
reset = False #reset if the episode ends
episode = 1
total_reward, step = 0, 0
for _ in range(240):
    ## action
    step += 1
    action = env.action_space.sample() #random action
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    ## render
    env.render() #render physics step (CPU speed = 0.1 seconds)
    time.sleep(1/240) #slow down to real-time (240 steps × 1/240 second sleep = 1 second)
    if (step == 1) or (step % 100 == 0): #print the first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated: #print the last step
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")
env.close()

Custom Environment
Often, environments share the same properties (see the skeleton sketch after this list):
- Reset — to restart to an initial state or to a random point within the data.
- Render — to visualize what's happening.
- Step — to execute the action chosen by the agent and update the state.
- Calculate Reward — to give the appropriate reward/penalty after an action.
- Get Info — to collect information about the game after an action.
- Terminated or Truncated — to decide whether the episode is finished after an action (fail or success).
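Here is a minimal skeleton of what a custom Gymnasium environment typically looks like. The spaces, state update, and reward logic are placeholders just to show where each property lives, not the ones we will use for the Ant:

import gymnasium as gym
import numpy as np

class SkeletonEnv(gym.Env):
    def __init__(self):
        super().__init__()
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(4,)) #placeholder
        self.action_space = gym.spaces.Discrete(2) #placeholder
        self.state = np.zeros(4, dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = np.zeros(4, dtype=np.float32) #restart to an initial state
        return self.state, {} #obs, info

    def step(self, action):
        self.state = self.state + self.np_random.normal(size=4).astype(np.float32) #execute the action
        reward = 1.0 if action == 1 else 0.0 #calculate the reward
        terminated = bool(self.state[0] > 10) #decide whether the episode is finished
        truncated = False
        info = {"state_sum": float(self.state.sum())} #get info
        return self.state, reward, terminated, truncated, info

    def render(self):
        print(self.state) #visualize what's happening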
Having default environments ready to load in Gymnasium is convenient, but it's not always what you need. Sometimes you have to build a custom environment that meets your project requirements. This is the most delicate step of a Reinforcement Learning use case: the quality of the model strongly depends on how well the environment is designed.
There are several ways to make your own environment:
- Create from scratch: you design everything (i.e. the physics, the body, the environment). You have total control, but it's the most complicated way because you start with an empty world.
- Modify the existing XML file: every simulated agent is defined by an XML file. You can edit the physical properties (i.e. make the robot taller or heavier), but the logic stays the same.
- Modify the existing Python class: keep the agent and the physics as they are, but change the rules of the game (i.e. new rewards and termination rules). One could even turn a continuous env into a discrete action space.
I'm going to customize the default Ant environment to make the robot jump. I shall change both the physical properties in the XML file and the reward function of the Python class. Basically, I just need to give the robot stronger legs and a reward for jumping.
First of all, let's locate the XML file, make a copy, and edit it.
import os
print(os.path.join(os.path.dirname(gym.__file__), "envs/mujoco/assets/ant.xml"))
Since my objective is to have a more "jumpy" Ant, I can reduce the density of the body to make it lighter…

…and add force to the legs so it can jump higher (the gravity in the simulator stays the same).

You can find the full edited XML file on my GitHub.
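If you'd rather apply the tweaks programmatically than edit the copy by hand, a minimal sketch with xml.etree could look like this. The attribute values here are illustrative, not the exact ones I used in my file:

import xml.etree.ElementTree as ET

tree = ET.parse("assets/custom_ant.xml") #the copied XML file
root = tree.getroot()

## lower the geom density to make the body lighter (illustrative value)
for geom in root.iter("geom"):
    if "density" in geom.attrib:
        geom.set("density", "2.5")

## raise the actuator gear to give the legs more force (illustrative value)
for motor in root.iter("motor"):
    motor.set("gear", "300")

tree.write("assets/custom_ant.xml")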
Then, I need to modify the reward function of the environment. To create a custom env, you have to build a new class that overrides the original one where needed (in my case, how the reward is calculated). After the new env is registered, it can be used like any other env.
from gymnasium.envs.mujoco.ant_v4 import AntEnv
from gymnasium.envs.registration import register
import numpy as np

## modify the class
class CustomAntEnv(AntEnv):
    def __init__(self, **kwargs):
        super().__init__(xml_file=os.getcwd()+"/assets/custom_ant.xml", **kwargs) #specify xml_file only if modified

    def CUSTOM_REWARD(self, action, info):
        torso_height = float(self.data.qpos[2]) #torso z-coordinate = how high it is
        reward = np.clip(a=torso_height-0.6, a_min=0, a_max=1) *10 #reward when the torso is high
        terminated = bool(torso_height < 0.2) #terminate if the torso is close to the ground
        info["torso_height"] = torso_height #add info for logging
        return reward, terminated, info

    def step(self, action):
        obs, reward, terminated, truncated, info = super().step(action) #override the original step()
        new_reward, new_terminated, new_info = self.CUSTOM_REWARD(action, info)
        return obs, new_reward, new_terminated, truncated, new_info #must return the same objects

    def reset_model(self):
        return super().reset_model() #keeping the reset as it is

## register the new env
register(id="CustomAntEnv-v1", entry_point="__main__:CustomAntEnv")

## test
env = gym.make("CustomAntEnv-v1", render_mode="human")
obs, info = env.reset()
for _ in range(1000):
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()

If the 3D world and its rules are well designed, you only need an RL model, and the robot will do anything to maximize the reward. There are two families of models that dominate the RL scene: Q-Learning models (best for discrete action spaces) and Actor-Critic models (best for continuous action spaces). Besides those, there are some newer and more experimental approaches emerging, like Evolutionary algorithms and Imitation learning.
Q Learning
Q-Learning is the most basic form of Reinforcement Learning and uses Q-values (the "Q" stands for "quality") to represent how useful an action is in gaining some future reward. To put it in simple terms, if at the end of the game the agent gets a certain reward after a set of actions, the initial Q-value is the discounted future reward.

As the agent explores and receives feedback, it updates the Q-values stored in the Q-matrix (Bellman equation). The goal of the agent is to learn the optimal Q-value for each state/action pair, so that it can make the best decisions and maximize the expected future reward for a specific action in a specific state.
During the learning process, the agent uses an exploration-exploitation trade-off. Initially, it explores the environment by taking random actions, allowing it to gather experience (information about the rewards associated with different actions and states). As it learns and the level of exploration decays, it starts exploiting its knowledge by selecting the actions with the highest Q-values for each state.
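As a toy illustration of these two ideas (the Bellman update and epsilon-greedy exploration), here is a minimal tabular sketch on a made-up problem with a handful of states and actions; the hyperparameters are arbitrary:

import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  #Q-matrix: one value per state/action pair
alpha, gamma, eps = 0.1, 0.99, 1.0   #learning rate, discount factor, exploration rate

def update(state, action, reward, next_state):
    ## Bellman update: move Q towards the reward plus the discounted best future value
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

def choose_action(state):
    ## epsilon-greedy: random action with probability eps, else the best known action
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))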
Please note that the Q-matrix can be multidimensional and much more complicated. For instance, let's consider a trading algorithm:

In 2013, there was a breakthrough in the field of Reinforcement Learning when DeepMind introduced the Deep Q-Network (DQN), designed to learn to play Atari games from raw pixels, combining the two concepts of Deep Learning and Q-Learning. To put it in simple terms, Deep Learning is used to approximate the Q-values instead of explicitly storing them in a table. This is done with a Neural Network trained to predict the Q-values for each possible action, using the current state of the environment as input.
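Conceptually, the network can be as simple as a small MLP that maps an observation to one Q-value per action. A minimal PyTorch sketch (layer sizes are arbitrary; 27 and 5 match the Ant observation and the discrete wrapper defined later):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, n_actions)) #one Q-value per action

    def forward(self, obs):
        return self.net(obs)

q_net = QNetwork(obs_dim=27, n_actions=5)
q_values = q_net(torch.randn(1, 27)) #predicted Q-values for a fake observation
best_action = int(q_values.argmax(dim=1)) #greedy action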

The Q-Learning family was mainly designed for discrete environments, so it doesn't really work on the robot Ant. An alternative solution is to discretize the environment (even if it's not the most efficient way to approach a continuous problem). We just need to create a wrapper for the Python class that expects a discrete action (i.e. "move forward") and consequently applies force to the joints based on that command.
class DiscreteEnvWrapper(gym.Env):
    def __init__(self, render_mode=None):
        super().__init__()
        self.env = gym.make("CustomAntEnv-v1", render_mode=render_mode)
        self.action_space = gym.spaces.Discrete(5) #there will be 5 actions
        self.observation_space = self.env.observation_space #same observation space
        n_joints = self.env.action_space.shape[0]
        self.action_map = [
            ## action 0 = stand still
            np.zeros(n_joints),
            ## action 1 = push all forward
            0.5*np.ones(n_joints),
            ## action 2 = push all backward
            -0.5*np.ones(n_joints),
            ## action 3 = front legs forward + back legs backward
            0.5*np.concatenate([np.ones(n_joints//2), -np.ones(n_joints//2)]),
            ## action 4 = front legs backward + back legs forward
            0.5*np.concatenate([-np.ones(n_joints//2), np.ones(n_joints//2)])
        ]

    def step(self, discrete_action):
        assert self.action_space.contains(discrete_action)
        continuous_action = self.action_map[discrete_action]
        obs, reward, terminated, truncated, info = self.env.step(continuous_action)
        return obs, reward, terminated, truncated, info

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return obs, info

    def render(self):
        return self.env.render()

    def close(self):
        self.env.close()
## test
env = DiscreteEnvWrapper()
obs, info = env.reset()
print(f"--- INFO: {len(info)} ---")
print(info, "\n")
print(f"--- OBS: {obs.shape} ---")
print(obs, "\n")
print(f"--- ACTIONS: {env.action_space} ---")
discrete_action = env.action_space.sample()
continuous_action = env.action_map[discrete_action]
print("discrete:", discrete_action, "-> continuous:", continuous_action, "\n")
print(f"--- REWARD ---")
obs, reward, terminated, truncated, info = env.step( discrete_action )
print(reward, "\n")

Now this environment, with just 5 possible actions, will definitely work with DQN. In Python, the easiest way to use Deep RL algorithms is through Stable Baselines3 (pip install stable-baselines3), a collection of the most famous models, already pre-implemented and ready to go, all written in PyTorch (pip install torch). Moreover, I find it very useful to keep an eye on the training progress with TensorBoard (pip install tensorboard). I created a folder named "logs", and I can just run tensorboard --logdir=logs/ in the terminal to serve the dashboard locally (http://localhost:6006/).
import stable_baselines3 as sb
from stable_baselines3.common.vec_env import DummyVecEnv
# TRAIN
env = DiscreteEnvWrapper(render_mode=None) #no rendering to speed up training
env = DummyVecEnv([lambda:env])
model_name = "ant_dqn"
print("Training START")
model = sb.DQN(policy="MlpPolicy", env=env, verbose=0, learning_rate=0.005,
               exploration_fraction=0.2, exploration_final_eps=0.05, #eps decays linearly from 1 to 0.05
               tensorboard_log="logs/") #>tensorboard --logdir=logs/
model.learn(total_timesteps=1_000_000, #20min
            tb_log_name=model_name, log_interval=10)
print("Training DONE")
model.save(model_name)
After the training is complete, we can load the new model and test it in the rendered environment. Now the agent won't be updating its preferred actions anymore; instead, it will use the trained model to predict the next best action given the current state.
# TEST
env = DiscreteEnvWrapper(render_mode="human")
model = sb.DQN.load(path=model_name, env=env)
obs, info = env.reset()
reset = False #reset if the episode ends
episode = 1
total_reward, step = 0, 0
for _ in range(1000):
    ## action
    step += 1
    action, _ = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    ## render
    env.render()
    time.sleep(1/240)
    if (step == 1) or (step % 100 == 0): #print the first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated: #print the last step
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")
env.close()

As you can see, the robot learned that the best policy is to jump, but the movements aren't fluid because we didn't use a model designed for continuous actions.
Actor Critic
In practice, the Actor-Critic algorithms are the most used, as they are well suited to continuous environments. The basic idea is to have two systems working together: a policy function ("Actor") for choosing actions, and a value function ("Critic") to estimate the expected reward. The model learns how to adjust its decision-making by comparing the actual rewards it receives with the Critic's predictions.
One of the first stable Deep Learning algorithms of this family was popularized by OpenAI in 2016: Advantage Actor-Critic (A2C). It aims to minimize the loss between the actual reward received after the Actor takes an action and the reward estimated by the Critic. The Neural Network is made of an input layer shared by both the Actor and the Critic, but it returns two separate outputs: the Actor's action scores (similar in spirit to DQN) and the Critic's predicted value (which is the addition of A2C).
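To make the architecture concrete, here is a minimal PyTorch sketch of a shared-backbone actor-critic network for a continuous action space (layer sizes are arbitrary):

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU()) #shared input layers
        self.actor = nn.Linear(64, act_dim) #policy head: mean of the action distribution
        self.critic = nn.Linear(64, 1)      #value head: expected return of the state

    def forward(self, obs):
        hidden = self.shared(obs)
        return torch.tanh(self.actor(hidden)), self.critic(hidden)

net = ActorCritic(obs_dim=27, act_dim=8) #e.g. the Ant: 27 observations, 8 joint torques
action, value = net(torch.randn(1, 27))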

Over time, the AC algorithms have been improved with more stable and efficient variants, like Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC). The latter uses not one but two Critic networks to get a "second opinion". Remember that we can use these models directly on the continuous environment.
# TRAIN
env_name, model_name = "CustomAntEnv-v1", "ant_sac"
env = gym.make(env_name) #no rendering to speed up training
env = DummyVecEnv([lambda:env])
print("Training START")
model = sb.SAC(policy="MlpPolicy", env=env, verbose=0, learning_rate=0.005,
               ent_coef=0.005, #exploration
               tensorboard_log="logs/") #>tensorboard --logdir=logs/
model.learn(total_timesteps=100_000, #3h
            tb_log_name=model_name, log_interval=10)
print("Training DONE")
## save
model.save(model_name)
Training the SAC requires more time, but the results are much better.
# TEST
env = gym.make(env_name, render_mode="human")
model = sb.SAC.load(path=model_name, env=env)
obs, info = env.reset()
reset = False #reset if the episode ends
episode = 1
total_reward, step = 0, 0
for _ in range(1000):
    ## action
    step += 1
    action, _ = model.predict(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    ## reward
    total_reward += reward
    ## render
    env.render()
    time.sleep(1/240)
    if (step == 1) or (step % 100 == 0): #print the first step and every 100 steps
        print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
    ## reset
    if reset:
        if terminated or truncated: #print the last step
            print(f"EPISODE {episode} - Step:{step}, Reward:{reward:.1f}, Total:{total_reward:.1f}")
            obs, info = env.reset()
            episode += 1
            total_reward, step = 0, 0
            print("------------------------------------------")
env.close()

Given the popularity of Q-Learning and Actor-Critic, there have been newer hybrid adaptations combining the two approaches. In this way, they also extend DQN-style learning to continuous action spaces; for instance, Deep Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3). But beware that the more complex the model, the harder the training.
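For reference, a TD3 run with Stable Baselines3 would follow the same pattern as the SAC block above. This is just a sketch with untuned settings, not something I benchmarked for this article:

## sketch only: same pattern as the SAC training block
env = DummyVecEnv([lambda: gym.make("CustomAntEnv-v1")])
model = sb.TD3(policy="MlpPolicy", env=env, verbose=0, learning_rate=0.005,
               tensorboard_log="logs/")
model.learn(total_timesteps=100_000, tb_log_name="ant_td3", log_interval=10)
model.save("ant_td3")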
Experimental Models
Besides the main families (Q and AC), you can find other models that are less used in practice but no less interesting. In particular, they can be powerful alternatives for tasks where rewards are sparse and hard to design. For example:
- Evolutionary Algorithms evolve policies through mutation and selection instead of gradients. Inspired by Darwinian evolution, they are robust but computationally heavy.
- Imitation Learning skips exploration and trains agents to mimic expert demonstrations. It's based on the concept of "behavioral cloning", mixing supervised learning with RL ideas (a minimal sketch of the idea follows this list).
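Just to give an idea of behavioral cloning, here is a minimal sketch that fits a policy network on a hypothetical dataset of expert observation/action pairs. The tensors here are random placeholders; in a real project they would come from recorded demonstrations:

import torch
import torch.nn as nn

## hypothetical expert demonstrations: observations and the actions the expert took
expert_obs = torch.randn(1000, 27) #placeholder for recorded observations
expert_act = torch.randn(1000, 8)  #placeholder for recorded joint torques

policy = nn.Sequential(nn.Linear(27, 64), nn.ReLU(), nn.Linear(64, 8), nn.Tanh())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for epoch in range(100):
    pred_act = policy(expert_obs)                       #actions predicted by the policy
    loss = nn.functional.mse_loss(pred_act, expert_act) #supervised regression on the expert actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()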
For experimental purposes, let's try the first one with EvoTorch, an open-source toolkit for neuroevolution. I'm choosing it because it works well with PyTorch and Gymnasium (pip install evotorch).
One of the most effective Evolutionary Algorithms for RL is Policy Gradients with Parameter-based Exploration (PGPE). Essentially, it doesn't train one Neural Network directly; instead, it builds a probability distribution (Gaussian) over all possible weights (μ = average set of weights, σ = exploration around the center). In every generation, PGPE samples from the population of weights, starting from a random policy. Then, the model adjusts the mean and variance based on the reward (the evolution of the population). PGPE is considered Parallelized RL because, unlike classic methods like Q and AC, which update one policy using batches of samples, PGPE evaluates many policy variations in parallel.
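Here is a deliberately simplified NumPy sketch of the idea. This is not the actual PGPE update rule, just the sample-evaluate-shift loop it is built on, with a toy fitness function standing in for the episode reward:

import numpy as np

def evaluate(weights):
    ## placeholder fitness: in RL this would be the total episode reward of the policy
    return -np.sum((weights - 3.0)**2)

mu, sigma = np.zeros(5), np.ones(5) #center and spread of the Gaussian over the weights
for generation in range(100):
    population = mu + sigma * np.random.randn(20, 5)      #sample 20 candidate policies
    rewards = np.array([evaluate(w) for w in population]) #evaluate them (in parallel in PGPE)
    best = population[np.argsort(rewards)[-5:]]           #keep the best candidates
    mu = best.mean(axis=0)                                #shift the center toward good regions
    sigma = 0.9*sigma + 0.1*best.std(axis=0)              #adapt the exploration
print(mu) #approaches the optimum of the toy fitness (3.0)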
Before running the training, we have to define the "problem", which is the task to optimize (basically our environment).
from evotorch.neuroevolution import GymNE
from evotorch.algorithms import PGPE
from evotorch.logging import StdOutLogger
## problem
train = GymNE(env=CustomAntEnv, #pass the class directly since it's a custom env
              env_config={"render_mode":None}, #no rendering to speed up training
              network="Linear(obs_length, act_length)", #linear policy
              observation_normalization=True,
              decrease_rewards_by=1, #normalization trick to stabilize evolution
              episode_length=200, #steps per episode
              num_actors="max") #use all available CPU cores
## model
model = PGPE(problem=train, popsize=20, stdev_init=0.1, #keep it small
             center_learning_rate=0.005, stdev_learning_rate=0.1,
             optimizer_config={"max_speed":0.015})
## train
StdOutLogger(searcher=model, interval=20)
model.run(num_generations=100)

To test the model, we need another "problem" that renders the simulation. Then, we just extract the best-performing set of weights from the center of the distribution (because during training the Gaussian shifted toward better regions of the policy space).
## visualization problem
test = GymNE(env=CustomAntEnv, env_config={"render_mode":"human"},
             network="Linear(obs_length, act_length)",
             observation_normalization=True,
             decrease_rewards_by=1,
             num_actors=1) #only need 1 for visualization
## test best policy
population_center = model.status["center"]
policy = test.to_policy(population_center)
## render
test.visualize(policy)

Conclusion
This article has been a tutorial on how to use Reinforcement Learning for Robotics. I showed how to build 3D simulations with Gymnasium and MuJoCo, how to customize an environment, and which RL algorithms are better suited to different use cases. New tutorials with more advanced robots will come.
Full code for this text: GitHub
I hope you enjoyed it! Feel free to contact me with questions and feedback, or just to share your interesting projects.
👉 Let’s Connect 👈

(All images are by the author unless otherwise noted)
