Ever wondered how you'd teach a robot to land a drone without programming each move? That's exactly what I set out to explore. I spent weeks building a game where a virtual drone has to work out how to land on a platform, not by following pre-programmed instructions, but by learning from trial and error, just like the way you learned to ride a bike.
That is Reinforcement Learning (RL), and it's fundamentally different from other machine learning approaches. Instead of showing the AI hundreds of examples of "correct" landings, you give it feedback: "Hey, that was pretty good, but maybe try being more gentle next time?" or "Yikes, you crashed, probably don't try that again." Through countless attempts, the AI figures out what works and what doesn't.
In this post, I'm documenting my journey from RL basics to building a working system that (mostly!) teaches a drone to land. You'll see the successes, the failures, and all the weird behaviors I had to debug along the way.
1. Reinforcement learning: Overview
A lot of the idea can be traced back to Pavlov's dog and Skinner's rat experiments. The idea is that you give the subject a 'reward' when it does something you want it to do (positive reinforcement) and a 'penalty' when it does something bad (negative reinforcement). Through many repeated attempts, your subject learns from this feedback, gradually discovering which actions lead to success, just like how Skinner's rat learned which lever presses produced food rewards.
In the same fashion, we want a system that learns to do things (or tasks) in a way that maximizes the reward and minimizes the penalty. Note this point about maximizing reward; it will come in later.
1.1 Core Concepts
When talking about systems that can be implemented programmatically on computers, it is best practice to write down clear definitions for the ideas we are abstracting. In the study of AI (and more specifically, reinforcement learning), the core ideas boil down to the following:
- Agent (or Actor): This is our 'subject' from the previous section. It could be the dog, a robot trying to navigate a huge factory, a video game NPC, etc.
- Environment (or the world): This can be a place, a simulation with restrictions, a video game's virtual world, etc. I think of it like this: "A box, real or virtual, where the agent's entire life is confined; it only knows what happens inside the box. We, as the overlords, can alter this box, while the agent will think that god is exacting his will on its world."
- Policy: Just like in governments, corporations, and many similar entities, 'policies' dictate what actions should be taken in a given situation.
- State: This is what the agent "sees" or "knows" about its current situation. Think of it as the agent's snapshot of reality at any given moment, like how you see the traffic light color, your speed, and the distance to the intersection when driving.
- Action: Now that our agent can 'see' things in its environment, it will want to do something about its state. Maybe it just woke up from a long night's slumber and now it wants a cup of coffee. In this case, the first thing it will do is get out of bed. That is an action the agent takes to achieve its goal, i.e., GET SOME COFFEE!
- Reward: Every time the actor executes an action (of its own volition), something may change in the world. For example, our agent got out of bed and started walking towards the kitchen, but, because it is so bad at walking, it tripped and fell. In this case, the god (us) rewards it with a punishment for being bad at walking (negative reward). But then the agent makes it to the kitchen and gets the coffee, so the god (us) rewards it with a cookie (positive reward).

As you can imagine, most of these key components have to be tailored to the specific task or problem we want the agent to solve.
2. The Gym
Now that we understand the fundamentals, you might be wondering: how do we actually build one of these systems? Let me show you the game I built.
For this post, I have written a bespoke video game that anyone can access and use to train their own machine learning agent to play.
The full code repository can be found on GitHub (please star it). I intend to use this repository for more games and simulation code, along with the more advanced techniques that I'll implement in the next installments of this RL series.
Delivery Drone
Delivery Drone is a game where the objective is to fly a drone (presumably carrying deliveries) onto a platform. To win the game, we have to land. To land, we have to meet the following criteria:
- Be in landing proximity to the platform
- Be slow enough
- Be upright (landing upside down is more like crashing than landing)
All information on how to run the game can be found in the GitHub repository.
Here's what the game looks like:

If the drone flies off the screen or touches the ground, it is considered a 'crash' case and thus results in a failure.
State description
The drone observes 15 continuous values that completely describe its situation:
(position, velocity, tilt angle and angular velocity, fuel, platform position, distance and offsets to the platform, speed, and the landed/crashed flags; the full list appears with the state vector below)
Landing Success Criteria: The drone must simultaneously achieve:
- Horizontal alignment: within platform bounds (|dx| < 0.0625)
- Safe approach speed: less than 0.3
- Level orientation: tilt less than 20° (|angle| < 0.111)
- Correct altitude: bottom of the drone touching the platform top
It's like parallel parking: you need the right position, the right angle, and to be moving slowly enough not to crash!
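To make those thresholds concrete, here is a minimal sketch of the check, assuming the normalized DroneState fields (dx_to_platform, speed, drone_angle) introduced later in this post; the altitude/contact condition is handled by the game's own physics, so it is omitted here.

```python
def meets_landing_criteria(state) -> bool:
    """Sketch of the landing checks listed above; the contact check is left to the game."""
    aligned = abs(state.dx_to_platform) < 0.0625   # within platform bounds
    slow = state.speed < 0.3                       # safe approach speed
    upright = abs(state.drone_angle) < 0.111       # roughly less than 20 degrees of tilt
    return aligned and slow and upright
```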
How can someone design a policy?
There are many ways to design a policy. It can be Bayesian (maintaining probability distributions over beliefs), it can be a simple lookup table for discrete states, a hand-coded rule system ("if distance < 10, then brake"), a decision tree, or, as we'll explore, a neural network that learns the mapping from states to actions through gradient descent.
Effectively, we want something that takes in the aforementioned state, performs some computation using that state, and returns which action should be performed.
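To make that concrete before we reach for neural networks, here is a hypothetical hand-coded policy for the drone. It is purely illustrative: the thresholds and sign conventions are assumptions, not how the game or my trained agent actually behaves.

```python
def rule_based_policy(state):
    """A hypothetical hand-coded policy: thrust when falling too fast, and
    counter-rotate when tilted. Not the policy used in this post."""
    action = {"main_thrust": 0, "left_thrust": 0, "right_thrust": 0}
    if state.drone_vy > 0.15:        # assumed convention: positive vy = descending too quickly
        action["main_thrust"] = 1
    if state.drone_angle > 0.05:     # tilted one way -> fire the opposing thruster
        action["right_thrust"] = 1
    elif state.drone_angle < -0.05:
        action["left_thrust"] = 1
    return action
```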
Deep learning to build a policy?
So how do we design a policy that can handle continuous states (like exact drone positions) and learn complex behaviors? That is where neural networks come in.
In the case of neural networks (or deep learning more broadly), it is often best to work with action probabilities, i.e., "Which action is likely the best given the current state?". So we can define a neural network that takes in the state as a 'vector' or 'collection of vectors'. This vector (or collection of vectors) has to be constructed from the observed state. For our delivery drone game, the state vector is:
State vector (from our 2D drone game)
The drone observes its absolute position, velocities, orientation, fuel, platform position, and derived metrics. Our continuous state is:
s = [x, y, vx, vy, θ, ω, fuel, px, py, d, dx, dy, v, landed, crashed] ∈ R^15
Where each component represents:
- x, y: drone position; vx, vy: drone velocity
- θ, ω: drone tilt angle and angular velocity
- fuel: remaining fuel
- px, py: platform position
- d, dx, dy: distance and horizontal/vertical offsets to the platform
- v: speed (magnitude of the velocity)
- landed, crashed: terminal flags (0 or 1)
All components are normalized to roughly [0,1] or [-1,1] ranges for stable neural network training.
Action space (three independent binary thrusters)
Instead of discrete action combinations, we treat each thruster independently:
- Main thruster (upward thrust)
- Left thruster (clockwise rotation)
- Right thruster (counter-clockwise rotation)
Each action is sampled from a Bernoulli distribution, giving us 3 independent binary decisions per timestep.
Neural-network policy (probabilistic with Bernoulli sampling)
Let fθ(s) be the network outputs after sigmoid activation. The policy uses independent Bernoulli distributions:
π_θ(a | s) = Π_{i=1..3} p_i^{a_i} * (1 - p_i)^{1 - a_i},   where p = f_θ(s) and each a_i ∈ {0, 1}
Minimal Python sketch (from our implementation)
# construct the state vector from a DroneState
s = np.array([
    state.drone_x, state.drone_y,
    state.drone_vx, state.drone_vy,
    state.drone_angle, state.drone_angular_vel,
    state.drone_fuel,
    state.platform_x, state.platform_y,
    state.distance_to_platform,
    state.dx_to_platform, state.dy_to_platform,
    state.speed,
    float(state.landed), float(state.crashed)
])
# network outputs probabilities for each thruster (after sigmoid)
action_probs = policy(torch.tensor(s, dtype=torch.float32))  # shape: (3,)
# sample each thruster independently from a Bernoulli distribution
dist = Bernoulli(probs=action_probs)
action = dist.sample()  # shape: (3,), e.g., [1, 0, 1] means main + right thrusters
This shows how we map the game's physical observations into a 15-dimensional normalized state vector and produce independent binary decisions for each thruster.
Code setup (part 1): Imports and game socket setup
We first need the game's socket listener to be running. For this, navigate to the delivery_drone directory in my repository and run the following commands:
pip install -r requirements.txt  # run this once to install the required modules
python socket_server.py --render human --port 5555 --num-games 1  # run this whenever you need the game in socket mode
NOTE: You will need PyTorch to run the code. Please make sure you have set it up beforehand.
import os
import torch
import torch.nn as nn
import math
import numpy as np
from torch.distributions import Bernoulli
# Import the game's socket client
from delivery_drone.game.socket_client import DroneGameClient, DroneState
# set up the client and connect to the server
client = DroneGameClient()
client.connect()
How do we design a reward function?
So what makes a good reward function? This is arguably the hardest part of RL (and where I spent a LOT of my debugging time 🫠).
The reward function is the soul of any RL implementation (and trust me, get this wrong and your agent will do the weirdest things). In theory, it defines what 'good' behaviour should be learnt and what 'bad' behaviour should not. Every action taken by our agent is scored by the total accumulated reward across the behaviour traits it exhibits. For example, if you want the drone to land gently, you might give positive rewards for being near the platform and moving slowly, while penalizing crashes or running out of fuel; the agent then learns to maximize the sum of all these rewards over time.
Advantage: A greater approach to measure effective reward
When training our policy, we don't just want to know whether an action rewarded us; we want to know whether it was better than expected. That is the intuition behind the advantage.
The advantage tells us: "Was this action better or worse than what we typically expect?"
A_t = G_t - b(s_t), i.e., the return from timestep t minus a baseline.
In our implementation, we:
- Collect multiple episodes and calculate their returns (total discounted rewards)
- Compute the baseline as the mean return across all episodes
- Calculate advantage = return - baseline for each timestep
- Normalize advantages to have mean=0 and std=1 (for stable training)
Why this helps:
- Actions with positive advantage → better than average → increase their probability
- Actions with negative advantage → worse than average → decrease their probability
- Reduces variance in gradient updates (more stable learning)
This simple baseline already gives us significantly better training than raw returns! It effectively weighs the full sequence of actions against the outcome (crashed or landed), so the policy learns to take the actions that lead to a better advantage. A tiny numeric sketch of the computation follows.
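Here is that computation on made-up numbers (one return per episode for brevity; in the real code each timestep of each episode gets its own return):

```python
import numpy as np

# Made-up returns from 6 episodes
returns = np.array([-420.0, -380.0, 510.0, -450.0, 620.0, -400.0])

baseline = returns.mean()                  # mean return across episodes (~ -86.7)
advantages = returns - baseline            # better (+) or worse (-) than typical
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize

print(advantages)  # the two successful episodes (510, 620) get positive advantage
```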
After plenty of trial and error, I have designed the following reward function. The key insight was to condition rewards on both proximity AND vertical position: the drone must be above the platform to receive positive rewards, which prevents exploitation strategies like hovering below the platform.

Short note on inversely (and non-linearly) scaling rewards
Often, we want to reward behaviors inversely proportional to certain state values. For example, the distance to the platform ranges from 0 to ~1.41 (normalized by window width). We want a high reward when the distance ≈ 0 and a low reward when far away. I use various scaling functions for this:

Examples of other useful scaling functions
Helper functions:
def inverse_quadratic(x, decay=20, scaler=10, shifter=0):
    """Reward decreases quadratically with distance"""
    return scaler / (1 + decay * (x - shifter)**2)

def scaled_shifted_negative_sigmoid(x, scaler=10, shift=0, steepness=10):
    """Sigmoid function scaled and shifted"""
    return scaler / (1 + np.exp(steepness * (x - shift)))

def calc_velocity_alignment(state: DroneState):
    """
    Calculate how well the drone's velocity is aligned with the optimal direction to the platform.
    Returns cosine similarity: 1.0 = perfect alignment, -1.0 = opposite direction
    """
    # Optimal direction: from drone to platform
    optimal_dx = -state.dx_to_platform
    optimal_dy = -state.dy_to_platform
    optimal_norm = math.sqrt(optimal_dx**2 + optimal_dy**2)
    if optimal_norm < 1e-6:  # Already at the platform
        return 1.0
    optimal_dx /= optimal_norm
    optimal_dy /= optimal_norm
    # Current velocity direction
    velocity_norm = state.speed
    if velocity_norm < 1e-6:  # Not moving
        return 0.0
    velocity_dx = state.drone_vx / velocity_norm
    velocity_dy = state.drone_vy / velocity_norm
    # Cosine similarity
    return velocity_dx * optimal_dx + velocity_dy * optimal_dy
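As a quick sanity check, here is how inverse_quadratic (with its default decay=20, scaler=10) behaves at the extremes of the normalized distance:

```python
print(inverse_quadratic(0.0))    # 10 / (1 + 20 * 0.00) = 10.0   (on the platform)
print(inverse_quadratic(0.5))    # 10 / (1 + 20 * 0.25) ≈ 1.67   (mid-range)
print(inverse_quadratic(1.41))   # 10 / (1 + 20 * 1.99) ≈ 0.25   (far corner of the screen)
```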
Code for the current reward function:
def calc_reward(state: DroneState):
    rewards = {}
    total_reward = 0

    # 1. Time penalty - distance-based (penalize more when far)
    minimum_time_penalty = 0.3
    maximum_time_penalty = 1.0
    rewards['time_penalty'] = -inverse_quadratic(
        state.distance_to_platform,
        decay=50,
        scaler=maximum_time_penalty - minimum_time_penalty
    ) - minimum_time_penalty
    total_reward += rewards['time_penalty']

    # 2. Distance & velocity alignment - ONLY when above the platform
    velocity_alignment = calc_velocity_alignment(state)
    dist = state.distance_to_platform
    rewards['distance'] = 0
    rewards['velocity_alignment'] = 0
    # Key condition: the drone must be above the platform (dy > 0) to get positive rewards
    if dist > 0.065 and state.dy_to_platform > 0:
        # Reward movement toward the platform when velocity is aligned
        if velocity_alignment > 0:
            rewards['distance'] = state.speed * scaled_shifted_negative_sigmoid(dist, scaler=4.5)
            rewards['velocity_alignment'] = 0.5
    total_reward += rewards['distance']
    total_reward += rewards['velocity_alignment']

    # 3. Angle penalty - distance-based threshold
    abs_angle = abs(state.drone_angle)
    max_angle = 0.20
    max_permissible_angle = ((max_angle - 0.111) * dist) + 0.111
    excess = abs_angle - max_permissible_angle
    rewards['angle'] = -max(excess, 0)
    total_reward += rewards['angle']

    # 4. Speed penalty - penalize excessive speed
    rewards['speed'] = 0
    speed = state.speed
    max_speed = 0.4
    if dist < 1:
        rewards['speed'] = -2 * max(speed - 0.1, 0)
    else:
        rewards['speed'] = -1 * max(speed - max_speed, 0)
    total_reward += rewards['speed']

    # 5. Vertical position penalty - penalize being below the platform
    rewards['vertical_position'] = 0
    if state.dy_to_platform > 0:  # Drone is above the platform (GOOD)
        rewards['vertical_position'] = 0
    else:  # Drone is below the platform (BAD!)
        rewards['vertical_position'] = state.dy_to_platform * 4.0  # dy is negative here, so this is a penalty
    total_reward += rewards['vertical_position']

    # 6. Terminal rewards
    rewards['terminal'] = 0
    if state.landed:
        rewards['terminal'] = 500.0 + state.drone_fuel * 100.0
    elif state.crashed:
        rewards['terminal'] = -200.0
        # Extra penalty for crashing far from the goal
        if state.distance_to_platform > 0.3:
            rewards['terminal'] -= 100.0
    total_reward += rewards['terminal']

    rewards['total'] = total_reward
    return rewards
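While debugging, I found it handy to look at the per-component breakdown that calc_reward returns. A minimal sketch, assuming the client set up earlier is connected:

```python
# Quick sanity check: inspect the per-component reward breakdown for a fresh state
state = client.reset(0)
rewards = calc_reward(state)
for name, value in rewards.items():
    print(f"{name:>20s}: {value:+.3f}")
```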
And yes, those magic numbers like 4.5, 0.065, and 4.0? They came from plenty of trial and error. Welcome to RL, where hyperparameter tuning is half art, half science, and half luck (yes, I know that's three halves). Once we have per-step rewards, we turn them into discounted returns:
def compute_returns(rewards, gamma=0.99):
    """
    Compute discounted returns (G_t) for each timestep based on the Bellman equation:
    G_t = r_t + γ*r_{t+1} + γ²*r_{t+2} + ...
    """
    returns = []
    G = 0
    # Compute backwards (more efficient)
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    return returns
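A tiny worked example with made-up rewards, so the discounting is easy to follow:

```python
print(compute_returns([1.0, 0.0, 2.0], gamma=0.99))
# G_2 = 2.0
# G_1 = 0.0 + 0.99 * 2.0  = 1.98
# G_0 = 1.0 + 0.99 * 1.98 = 2.9602
# -> [2.9602, 1.98, 2.0]
```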
The important thing to note is that reward functions require careful trial and error. One mistake or over-reward here, and the agent goes off optimizing behaviour that exploits the mistake. This leads us to reward hacking.
Reward hacking
Reward hacking occurs when an agent finds an unintended way to maximize reward without actually solving the task you wanted it to solve. The agent isn't "cheating" on purpose; it's doing exactly what you told it to do, just not what you meant for it to do.
Classic example: If you reward a cleaning robot for "no visible dirt," it might learn to turn off its camera instead of cleaning!
My painful learning experience: I found this out the hard way. In an early version of my drone landing reward function, I gave the drone points for being "stable and slow" anywhere near the platform. Sounds reasonable, right? Wrong! Within 50 training episodes, my drone learned to just hover in place forever, racking up free points. It was technically optimal for my badly designed reward function, but actually landing? Nope! I watched it hover for five minutes straight before I realized what was happening.
Here’s the problematic code I wrote:
# DO NOT COPY THIS!
# If the drone is above the platform (|dx| < 0.0625) and close (distance < 0.25):
corridor_reward = inverse_quadratic(distance, decay=20, scaler=15)  # Up to 15 points
if stable and slow:
    corridor_reward += 10  # Extra 10 points!
# Total possible: 25 points per step!
An example of reward hacking in action:


Making a policy network
As discussed above, we are going to use a neural network as the policy that powers the brain of our agent. Here's a simple implementation that takes in the state vector and computes a probability distribution over 3 independent actions:
- Activate the main thruster
- Activate the left thruster
- Activate the right thruster
def state_to_array(state):
    """Helper function to convert the DroneState dataclass to a torch tensor"""
    data = np.array([
        state.drone_x,
        state.drone_y,
        state.drone_vx,
        state.drone_vy,
        state.drone_angle,
        state.drone_angular_vel,
        state.drone_fuel,
        state.platform_x,
        state.platform_y,
        state.distance_to_platform,
        state.dx_to_platform,
        state.dy_to_platform,
        state.speed,
        float(state.landed),
        float(state.crashed)
    ])
    return torch.tensor(data, dtype=torch.float32)

class DroneGamerBoi(nn.Module):
    def __init__(self, state_dim=15):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.LayerNorm(128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.LayerNorm(128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.LayerNorm(64),
            nn.ReLU(),
            nn.Linear(64, 3),
            nn.Sigmoid()
        )

    def forward(self, state):
        if isinstance(state, DroneState):
            state = state_to_array(state)
        return self.network(state)
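A quick smoke test of the untrained policy on a fresh state (the printed numbers are just an example; an untrained network outputs probabilities near 0.5), assuming the client from the setup section is connected:

```python
policy = DroneGamerBoi(state_dim=15)

state = client.reset(0)                # fresh DroneState from game 0
action_probs = policy(state)           # forward() converts the DroneState for us
print(action_probs)                    # e.g. tensor([0.52, 0.48, 0.50], grad_fn=...)

dist = Bernoulli(probs=action_probs)
print(dist.sample())                   # e.g. tensor([1., 0., 1.])
```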
Effectively, instead of the action space being a 2³ = 8 space, I reduced it to decisions over the three independent thrusters using Bernoulli sampling. This reduction makes optimization easier by treating each thruster independently rather than as one big categorical choice (at least that is what I think; I may be wrong, but it worked for me!).
Training a policy with policy gradients
Learning Strategies: When Should We Update?
Here's a question that tripped me up early on: should we update the policy after each action, or wait and see how the whole episode plays out? Turns out, this choice matters a lot.
When you try to optimize based purely on the reward received for a single action, you run into a high variance problem (basically, the training signal is super noisy and the gradients point in random directions!). What I mean by "high variance" is that the optimization algorithm receives extremely mixed signals in the gradient used to update the parameters of our policy network. For a given action, the system may emit one gradient direction, but a slightly different state (with the same action) might yield something completely opposite. This leads to slow, and potentially no, training.
There are three ways we could update our policy:
Learning after every motion (Per-Step Updates)
The drone fires its thruster once, gets a small reward, and immediately updates its entire strategy. That is like adjusting your basketball form after every single shot: way too reactive! One lucky action that increases the reward doesn't necessarily mean the agent did well, and one unlucky action doesn't mean it did badly. The training signal is just too noisy.
My first attempt: I tried this approach early on. The drone would wiggle around randomly, make one lucky move that got a tiny bit more reward, immediately overfit to that exact move, and then crash repeatedly trying to reproduce it. It was painful to watch, like watching someone learn the wrong lesson from pure chance.
Learning after one complete attempt (Per-Episode Updates)
Better! Now we let the drone try to land (or crash), see how the whole attempt went, and then update. That is like finishing an episode and then thinking about what to improve. At least now we see the full consequences of our actions. But here's the issue: what if that one landing was just lucky? Or unlucky? We're still basing our learning on a single data point.
Learning from multiple attempts (Multi-Episode Batch Updates)
This is the sweet spot. We run multiple (6 in my case) drone landing attempts concurrently, see how all of them went, and then update our policy based on the average performance. Some attempts might get lucky, some unlucky, but averaged together, they give a much clearer picture of what actually works. Although this is quite heavy on the computer, if you can run it, it works far better than either of the previous methods. Of course, this method is by no means the best, but it is quite easy to understand and implement; there are other (and better) methods.
Here's the code to collect multiple episodes in the drone game:
def collect_episodes(client: DroneGameClient, policy: nn.Module, max_steps=300):
    """
    Collect episodes with early stopping
    Args:
        client: The game's socket client
        policy: PyTorch module
        max_steps: Maximum steps per episode (default: 300)
    """
    num_games = client.num_games
    # Initialize storage
    all_episodes = [{'states': [], 'actions': [], 'log_probs': [], 'rewards': [], 'done': False}
                    for _ in range(num_games)]
    # Reset all games
    game_states = [client.reset(game_id) for game_id in range(num_games)]
    step_counts = [0] * num_games  # Track steps per game
    while not all(ep['done'] for ep in all_episodes):
        # Batch active games
        batch_states = []
        active_game_ids = []
        for game_id in range(num_games):
            if not all_episodes[game_id]['done']:
                batch_states.append(state_to_array(game_states[game_id]))
                active_game_ids.append(game_id)
        if len(batch_states) == 0:
            break
        # Batched inference
        batch_states_tensor = torch.stack(batch_states)
        batch_action_probs = policy(batch_states_tensor)
        batch_dist = Bernoulli(probs=batch_action_probs)
        batch_actions = batch_dist.sample()
        batch_log_probs = batch_dist.log_prob(batch_actions).sum(dim=1)
        # Execute actions
        for i, game_id in enumerate(active_game_ids):
            action = batch_actions[i]
            log_prob = batch_log_probs[i]
            next_state, _, done, _ = client.step({
                "main_thrust": int(action[0]),
                "left_thrust": int(action[1]),
                "right_thrust": int(action[2])
            }, game_id)
            reward = calc_reward(next_state)
            # Store data
            all_episodes[game_id]['states'].append(batch_states[i])
            all_episodes[game_id]['actions'].append(action)
            all_episodes[game_id]['log_probs'].append(log_prob)
            all_episodes[game_id]['rewards'].append(reward['total'])
            # Update state and step count
            game_states[game_id] = next_state
            step_counts[game_id] += 1
            # Check done conditions
            if done or step_counts[game_id] >= max_steps:
                # Apply timeout penalty if we hit max steps without landing
                if step_counts[game_id] >= max_steps and not next_state.landed:
                    all_episodes[game_id]['rewards'][-1] -= 500  # Timeout penalty
                all_episodes[game_id]['done'] = True
    # Return episodes
    return [(ep['states'], ep['actions'], ep['log_probs'], ep['rewards'])
            for ep in all_episodes]
The Maximization-Minimization Puzzle
In typical deep learning (supervised learning), we minimize a loss function:
L(θ) = (1/N) Σ_i loss(f_θ(x_i), y_i)   (e.g., mean squared error or cross-entropy over the training examples)
We want to go "downhill" toward lower loss (better predictions).
But in reinforcement learning, we want to maximize total reward! Our objective is:
J(θ) = E_{τ ~ π_θ} [ Σ_t γ^t * r_t ]   (the expected discounted total reward of trajectories sampled from our policy)
The problem: deep learning frameworks are built for minimization, not maximization. How do we turn "maximize reward" into "minimize loss"?
The simple trick: maximizing J(θ) is the same as minimizing -J(θ).
So our loss function becomes:
L(θ) = -J(θ)
Now, gradient descent on -J(θ) climbs up the reward landscape (it is really gradient ascent in disguise), because going down the negative reward means going up the reward!
The REINFORCE Algorithm (Policy Gradient)
The policy gradient theorem (Williams, 1992) tells us how to compute the gradient of the expected reward:
∇_θ J(θ) = E_{π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) * G_t ]
(I know, I know, this looks intimidating. But stick with me; it's actually quite elegant once you see what's happening!)
Where:
- π_θ(a_t | s_t): the probability our policy assigns to taking action a_t in state s_t
- G_t: the discounted return from timestep t onwards
- ∇_θ log π_θ(a_t | s_t): the direction in parameter space that makes action a_t more likely in state s_t
In plain English (because that formula is dense):
- If action a_t led to a high return G_t, increase its probability
- If action a_t led to a low return G_t, decrease its probability
- The gradient tells us in which direction to adjust the neural network weights
Adding a Baseline (Variance Reduction)
Using raw returns G_t leads to high variance (noisy gradients). We improve this by subtracting a baseline b(s_t):
∇_θ J(θ) = E_{π_θ} [ Σ_t ∇_θ log π_θ(a_t | s_t) * (G_t - b(s_t)) ]
The simplest baseline is the mean return:
b = mean(G_t) over all timesteps in the collected batch of episodes
This gives us the advantage: A_t = G_t - b
- Positive advantage → the action was better than average → increase its probability
- Negative advantage → the action was worse than average → decrease its probability
Why this helps: instead of "this action gave reward 100" (is that good?), we now have "this action gave 100 when the average is 50" (that's great!). Relative performance is clearer than absolute.
Our Implementation
In our drone landing code, we use REINFORCE with baseline:
# 1. Collect episodes and compute returns
returns = compute_returns(rewards, gamma=0.99) # G_t with discounting
# 2. Compute baseline (mean of all returns)
baseline = returns_tensor.mean()
# 3. Compute advantages
advantages = returns_tensor - baseline
# 4. Normalize advantages (extra variance reduction)
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
# 5. Compute the loss (note the negative sign!)
loss = -(log_probs_tensor * advantages).mean()
# 6. Gradient descent
optimizer.zero_grad()
loss.backward()
optimizer.step()
We repeat the above loop as many times as we want, or until the drone learns to land properly. Have a look at this notebook for more code! An end-to-end sketch of the loop follows.
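For completeness, here is a minimal sketch of a full training loop that ties collect_episodes, compute_returns, and the update above together; the learning rate and iteration count are arbitrary choices, not tuned values from my runs:

```python
policy = DroneGamerBoi(state_dim=15)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)   # arbitrary learning rate

for iteration in range(500):                                 # arbitrary number of iterations
    episodes = collect_episodes(client, policy, max_steps=300)

    # Flatten log-probs and discounted returns across all episodes
    all_log_probs, all_returns = [], []
    for states, actions, log_probs, rewards in episodes:
        all_log_probs.extend(log_probs)
        all_returns.extend(compute_returns(rewards, gamma=0.99))

    log_probs_tensor = torch.stack(all_log_probs)
    returns_tensor = torch.tensor(all_returns, dtype=torch.float32)

    # REINFORCE with a mean baseline (same steps as the snippet above)
    advantages = returns_tensor - returns_tensor.mean()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    loss = -(log_probs_tensor * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```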
Current Results (the reward function is still quite flawed)
After countless hours of tweaking rewards, adjusting hyperparameters, and watching my drone crash in creative new ways, I finally got it working (mostly!). Even though my reward function is not perfect, I do think it is able to teach a policy network. Here's a successful landing:

Pretty cool, right? But here’s where things get interesting (and frustrating)…
The persistent hovering problem: A fundamental limitation
Even with the improved reward function that conditions rewards on vertical position (dy_to_platform > 0), the trained policy still exhibits a frustrating behavior: when the drone misses the platform, it learns to descend toward it but then hovers below the platform rather than attempting to land.
I spent over a week staring at reward plots (and changing reward functions), wondering why my "fixed" reward function was still producing this hovering behavior. When I finally plotted the accumulated rewards, the pattern became crystal clear, and honestly, I couldn't even be mad at the agent for finding this strategy.
What’s happening?
By analyzing the accumulated rewards over an episode where the drone hovers below the platform, I discovered something interesting:


The plots reveal that:
- Distance reward (orange): Accumulates to ~+70 early, then plateaus (no more rewards)
- Velocity alignment (green): Accumulates to ~+30 early, then plateaus
- Time penalty (blue): Steadily accumulates to ~-250 (keeps getting worse)
- Vertical position (brown): Steadily accumulates to ~-200 (penalty for being below)
- Total reward: Ends around -400 to -600 (after timeout)
The key insight: the drone descends from above the platform (collecting distance and velocity rewards on the way down), passes through the platform height, and then settles into hovering below it instead of completing the landing. Once below, it stops getting positive rewards (notice how the distance and velocity lines plateau around steps 50-60) but keeps accumulating time penalties and vertical position penalties. Nevertheless, this strategy is still viable because attempting to land risks an immediate -200 crash penalty, whereas hovering below "only" costs ~-400 to -600 over the full episode.
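For reference, this is the kind of bookkeeping behind those plots: roll out one episode, keep the per-component dictionary from calc_reward at every step, and plot the cumulative sums. A minimal sketch, assuming matplotlib is available and plotting only a subset of the components:

```python
import numpy as np
import matplotlib.pyplot as plt

# Roll out one episode with the trained policy, keeping the reward breakdown per step
state = client.reset(0)
history = []
for _ in range(300):
    action = Bernoulli(probs=policy(state)).sample()
    state, _, done, _ = client.step({
        "main_thrust": int(action[0]),
        "left_thrust": int(action[1]),
        "right_thrust": int(action[2]),
    }, 0)
    history.append(calc_reward(state))
    if done:
        break

# Cumulative value of each reward component over the episode
for key in ['time_penalty', 'distance', 'velocity_alignment', 'vertical_position', 'total']:
    plt.plot(np.cumsum([h[key] for h in history]), label=key)
plt.xlabel('step'); plt.ylabel('accumulated reward'); plt.legend(); plt.show()
```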
Why does this occur?
The fundamental issue is that our reward function r(s', a) can only see the current state, not the trajectory. Think about it: at any single timestep, the reward function can't tell the difference between:
- A drone making progress toward landing (approaching from above with controlled descent)
- A drone exploiting the reward structure (oscillating to farm rewards)
Both might have dy_to_platform > 0 at a given moment, so they receive identical rewards! The agent isn't dumb; it's just optimizing exactly what you told it to optimize.
So what would actually fix this?
To actually solve this problem, I personally think that rewards should depend on state transitions: r(s, a, s') instead of just r(s, a). This would let you reward based on the change between s (the current state) and s' (the next state):
- Progress: Only reward if distance(s') < distance(s) (actually getting closer!)
- Vertical improvement: Only reward if the drone is consistently moving upward relative to the platform
- Trajectory consistency: Penalize rapid direction changes that indicate oscillation
This is a more principled solution than trying to patch the current reward function with increasingly harsh penalties (which is basically what I tried for a while, and it didn't really work). The oscillation exploit exists because we're fundamentally missing information about the trajectory. A sketch of what such a transition-based term could look like follows.
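Here is a hypothetical transition-based reward term along those lines; the weights are placeholders, and prev_state would need to be threaded through the collection loop:

```python
def progress_reward(prev_state: DroneState, state: DroneState):
    """Hypothetical r(s, a, s') term: reward actual progress between consecutive
    states instead of favourable-looking snapshots."""
    reward = 0.0
    # Progress: the drone got measurably closer to the platform this step
    delta_dist = prev_state.distance_to_platform - state.distance_to_platform
    if delta_dist > 0:
        reward += 10.0 * delta_dist                # placeholder weight
    # Vertical improvement: while below the platform, reward moving back up toward it
    if state.dy_to_platform <= 0:
        delta_dy = state.dy_to_platform - prev_state.dy_to_platform
        reward += 5.0 * max(delta_dy, 0.0)         # placeholder weight
    return reward
```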
In the next post, I'll explore Actor-Critic methods and techniques that can incorporate temporal information to prevent these exploitation strategies. Stay tuned!
If you find a way to fix this, please reach out to me!
This brings us to the end of this post on "the simplest way to do Deep Reinforcement Learning."
Next on the list
- Actor-Critic systems
- DQL
- PPO & GRPO
- Applying this to systems that require vision 👀
References
Foundational Stuff
- Turing, A. M. (1950). "Computing Machinery and Intelligence." Mind, 59(236), 433-460.
- Original Turing Test paper
- Williams, R. J. (1992). "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning." Machine Learning, 8(3-4), 229-256.
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Classical Conditioning & Behavioral Psychology
- Pavlov, I. P. (1927). Conditioned Reflexes. Oxford University Press.
- Classical conditioning experiments
- Skinner, B. F. (1938). The Behavior of Organisms. Appleton-Century-Crofts.
- Operant conditioning and the Skinner Box
Policy Gradient Methods
- Sutton, R. S., McAllester, D., Singh, S., & Mansour, Y. (1999). "Policy Gradient Methods for Reinforcement Learning with Function Approximation." Advances in Neural Information Processing Systems 12.
- Theoretical foundations of policy gradients
- Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). "High-Dimensional Continuous Control Using Generalized Advantage Estimation." arXiv:1506.02438.
Neural Networks & Deep Learning
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Online Resources
- Karpathy, A. “Deep Reinforcement Learning: Pong from Pixels.”
- Spinning Up in Deep RL by OpenAI
Code Repository
- Jumle, V. (2025). “Reinforcement Learning 101: Delivery Drone Landing.”
Friend
- Singh, Navroop Kaur. (2025): For providing “”. Thanks!
All images in this text are either AI-generated (using Gemini), personally made by me, or screenshots & plots that I made.
