make smart decisions when it starts out knowing nothing and may only learn through trial and error?
This is exactly what one of the simplest but most important models in reinforcement learning is all about:
A multi-armed bandit is a simple model for learning by trial and error.
Just like we do.
We’ll explore why the choice between trying something new (exploration) and sticking to what works (exploitation) is trickier than it seems. And what this has to do with AI, online ads and A/B testing.
Why is it essential to grasp this idea?
The multi-armed bandit introduces one of the core dilemmas of reinforcement learning: how to make good decisions under uncertainty.
It is relevant not only for AI, data science and behavioral models, but also because it reflects how we humans learn through trial and error.
What machines learn by trial and error is not so different from what we humans do intuitively.
The difference?
Machines do it in a mathematically optimized way.
Let’s imagine a straightforward example:
We’re standing in front of a slot machine. This machine has 10 arms and each of these arms has an unknown probability of winning.
Some levers give higher rewards, others lower ones.
We can pull the levers as often as we like, but our goal is to win as much as possible.
This means we have to find out which arm is the best (= yields the most profit) without knowing from the start which one it is.
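To make this concrete, here is a minimal sketch of such a 10-armed bandit in Python. The reward distributions are made up for illustration; in a real problem they would be hidden from us:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Ten arms, each with an unknown true mean reward (hidden from the agent).
true_means = rng.normal(loc=0.0, scale=1.0, size=10)

def pull(arm: int) -> float:
    """Pull one arm and receive a noisy reward centered on its true mean."""
    return rng.normal(loc=true_means[arm], scale=1.0)

# We can pull the levers as often as we like ...
reward = pull(3)
# ... but we never get to look at true_means directly.
```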
The model is very reminiscent of what we often experience in everyday life:
We try out different strategies. At some point, we stick with the one that brings us the most pleasure, enjoyment, money, etc. Whatever it is that we’re aiming for.
In behavioral psychology, we speak of trial-and-error learning.
Or we can also think of reward learning in cognitive psychology: animals in a laboratory experiment discover over time which lever dispenses food because they get the greatest reward at that specific lever.
Now back to the concept of multi-armed bandits:
It serves as an introduction to decision-making under uncertainty and is a cornerstone for understanding reinforcement learning.
I wrote about reinforcement learning (RL) in detail in the last article “Reinforcement Learning Made Easy: Construct a Q-Learning Agent in Python”. But at its core, it’s about an agent learning to make good decisions through trial and error. It’s a subfield of machine learning. The agent finds itself in an environment, decides on certain actions and receives rewards or penalties for them. The goal of the agent is to develop a strategy (policy) that maximizes the long-term overall profit.
So with the multi-armed bandit, we have to find out:
- Which levers are worthwhile in the long run?
- When should we exploit a lever further (exploitation)?
- When should we try out a new lever (exploration)?
These last two questions lead us directly to the central dilemma of reinforcement learning:
Central dilemma in Reinforcement Learning: Exploration vs. Exploitation
Have you ever held on to an option, only to find out later that there was a better one? That’s exploitation winning over exploration.
That is the core problem of learning through experience:
- Exploration: We try something new in order to learn more. Perhaps we discover something better. Or perhaps not.
- Exploitation: We use the best of what we have learned so far, with the aim of gaining as much reward as possible.
The issue with this?
We never know for sure whether we have already found the best option.
Selecting the arm with the highest reward so far means relying on what we already know. This is called exploitation. However, if we commit too early to a seemingly good arm, we may overlook an even better option.
Trying a different or rarely used arm gives us new information. We gain more knowledge. This is exploration. We might find a better option. But it may also turn out to be a worse one.
That’s the dilemma at the center of reinforcement learning.

What we can conclude from this:
If we exploit too early, we may miss out on the better arms (for example, settling for arm 3 instead of the better arm 1). However, too much exploration also leads to less overall yield (if we already know that arm 1 is good).
Let me explain the same thing again in non-techy language (but somewhat simplified):
Let’s imagine we know a restaurant. We’ve gone to the same restaurant for 10 years because we like it. But what if there’s a better, cheaper place just around the corner? And we have never tried it? If we never try something new, we’ll never find out.
Interestingly, this isn’t just an issue in AI. It’s well-known in psychology and economics too:
The exploration vs. exploitation dilemma is a prime example of decision-making under uncertainty.
The psychologist and Nobel Prize winner Daniel Kahneman and his colleague Amos Tversky have shown that people often do not make rational decisions when faced with uncertainty. Instead, we follow heuristics, i.e. mental shortcuts.
These shortcuts often reflect either habit (= exploitation) or curiosity (= exploration). It is precisely this dynamic that is also visible in the multi-armed bandit:
- Do we play it safe (= known arm with high reward)?
- Or do we risk something new (= new arm with unknown reward)?
Why does this matter for reinforcement learning?
We face the dilemma of exploration vs. exploitation everywhere in reinforcement learning (RL).
An RL agent must continuously decide whether to follow what has worked best so far (= exploitation) or to try something new to discover even better strategies (= exploration).
You can see this trade-off in action in recommendation systems: should we keep showing users content they already like, or risk suggesting something new they might love?
And what strategies are there to select the best arm? Action selection strategies
Action selection strategies determine how an agent decides which arm to select in the next step. In other words, how an agent deals with the exploration vs. exploitation dilemma.
Each of the following strategies (also called policies or rules) answers one simple question: how do we choose the next action when we don’t know for sure what is best?
Strategy 1 – Greedy
This is the simplest strategy: We always select the arm with the highest estimated reward (= the highest Q(a)). In other words, always go for what seems best right now.
The advantage of this strategy is that the reward is maximized in the short term and that the strategy is very simple.
The drawback is that there is no exploration. No risk is taken to try something new, because the current best always wins. The agent might miss better options that it simply hasn’t discovered yet.
The formal rule is as follows:
At = argmax Qt(a), where the maximum is taken over all arms a.
Let’s have a look at a simplified example:
Imagine we try two new pizzerias. And the second is quite good. From then on, we only return to that one, even though there are six more we’ve never tried. Perhaps we’re missing out on the best pizza in town. But we’ll never know.
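As a small sketch, the greedy rule could look like this in Python, assuming the estimated values Q(a) are kept in a simple NumPy array:

```python
import numpy as np

def greedy_action(q_estimates: np.ndarray) -> int:
    """Greedy rule: pick the arm with the highest estimated value Q(a).
    Ties are broken randomly so the agent does not always default to the first arm."""
    best_arms = np.flatnonzero(q_estimates == q_estimates.max())
    return int(np.random.choice(best_arms))

# Example: arm 1 currently looks best, so greedy will keep choosing it.
q = np.array([0.2, 1.5, 0.0, 0.7])
print(greedy_action(q))  # -> 1
```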
Strategy 2 – ε-Greedy:
Instead of always picking the best-known option, this strategy allows for some randomness:
- With probability ε, we explore (try something new).
- With probability 1-ε, we exploit (stick with the current best).
This strategy deliberately mixes chance into the decision and is therefore practical and often effective.
- The higher ε is chosen, the more exploration happens.
- The lower ε is chosen, the more we exploit what we already know.
For instance, if ε = 0.1, exploration occurs in 10% of cases, while exploitation occurs in 90% of cases.
The advantage of ε-Greedy is that it is simple to implement and provides good basic performance.
The drawback is that selecting the right ε is difficult: If ε is chosen too large, a lot of exploration takes place and the loss of rewards may be too great. If ε is too small, there is little exploration.
If we stick with the pizza example:
We roll a die before every restaurant visit. If we roll a 6, we try out a new pizzeria. If not, we go to our regular one.
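A minimal ε-greedy sketch in Python might look like this (ε = 0.1 corresponds to the 10% exploration example above):

```python
import numpy as np

def epsilon_greedy_action(q_estimates: np.ndarray, epsilon: float = 0.1) -> int:
    """With probability epsilon: explore (random arm).
    With probability 1 - epsilon: exploit (arm with the highest Q(a))."""
    if np.random.random() < epsilon:
        return int(np.random.randint(len(q_estimates)))        # explore
    best_arms = np.flatnonzero(q_estimates == q_estimates.max())
    return int(np.random.choice(best_arms))                     # exploit

# With epsilon = 0.1, roughly 10% of these choices are random arms,
# while the remaining ~90% pick the current best arm (arm 1 here).
q = np.array([0.2, 1.5, 0.0, 0.7])
choices = [epsilon_greedy_action(q, epsilon=0.1) for _ in range(1000)]
```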
Strategy 3 – Optimistic Initial Values:
The point of this strategy is that all initial estimates Q0(a) start with artificially high values (e.g. 5.0 instead of 0.0). At first, the agent assumes all options are great.
This encourages the agent to try everything (exploration). It wants to disprove the high initial value. As soon as an action has been tried, the agent sees that it is worth less and adjusts the estimate downwards.
The advantage of this strategy is that exploration happens automatically. It is particularly suitable in deterministic environments where rewards don’t change.
The drawback is that the strategy works poorly if the true rewards are themselves already high: the starting values are then hardly optimistic at all and provide little extra push to explore.
If we look at the restaurant example again, we would initially rate each new restaurant 5 stars. As we try them, we adjust the ratings based on real experience.
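Sketched in Python, optimistic initialization might look like this; the starting value of 5.0 is the example value from above, and the estimates are updated with a simple running average:

```python
import numpy as np

n_arms = 10
q_estimates = np.full(n_arms, 5.0)   # start every arm at an optimistic 5.0 ("5 stars")
counts = np.zeros(n_arms)

def update(arm: int, reward: float) -> None:
    """After trying an arm, move its inflated estimate toward the rewards actually observed."""
    counts[arm] += 1
    q_estimates[arm] += (reward - q_estimates[arm]) / counts[arm]

# As long as real rewards stay well below 5.0, even a purely greedy agent on top of these
# estimates will try every arm at least once, because untried arms keep their optimistic value.
```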
To put it simply, Greedy is pure habitual behavior. ε-Greedy is a mix of habit and curiosity. Optimistic Initial Values is comparable to a child who initially thinks every new toy is great – until it has tried it out.
How the agent learns which options are worthwhile: Estimating Q-values
For an agent to make good decisions, it must estimate how good each individual arm is. It needs to find out which arm will bring the highest reward in the long run.
Nevertheless, the agent doesn’t know the true reward distribution.
This means the agent must estimate the average reward of each arm based on experience. The more often an arm is pulled, the more reliable this estimate becomes.
We use an estimated value Q(a) for this:
Q(a) ≈ expected reward if we decide arm a
Our aim here is for our estimated value Qt(a) to get better and better, until it comes as close as possible to the true value q∗(a):
Qt(a) → q∗(a) as the number of observations grows.
The agent wants to learn from its experience in such a way that its estimate Qt(a) eventually corresponds to the true average reward of arm a.
Let’s look again at our simple restaurant example:
Imagine we want to find out how good a particular café is. Each time we go there, we get some feedback by giving it 3, 4 or 5 stars, for example. Our goal is for the perceived average to eventually match the true average we would get if we went infinitely often.
There are two basic ways in which an agent can calculate this Q-value:

Method 1 – Sample average method
This method calculates the average of the observed rewards and is exactly as simple as it sounds.
All previous rewards for this arm are considered and the average is calculated.
Qn(a) = (R1 + R2 + … + Rn) / n
- n: Number of times arm a was chosen
- Ri: Reward on the i-th time
The advantage of this method is that it is simple and intuitive, and it is statistically correct for stable, stationary problems.
The drawback is that it reacts too slowly to changes, especially in non-stationary environments where conditions shift over time.
For example, imagine a music recommendation system: a user might suddenly develop a new taste. They used to prefer rock, but now they listen to jazz. If the system keeps averaging over all past preferences, it reacts very slowly to this change.
Similarly, in the multi-armed bandit setting, if arm 3 suddenly starts giving significantly better rewards from round 100 onwards, the running average will be too sluggish to reflect that. The early data still dominates and hides the improvement.
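A minimal sketch of the sample average method: store all rewards per arm and take their mean (the recorded rewards below are just for illustration):

```python
import numpy as np

n_arms = 10
rewards_per_arm = [[] for _ in range(n_arms)]   # observed rewards, one list per arm

def record(arm: int, reward: float) -> None:
    """Store the reward observed for this arm."""
    rewards_per_arm[arm].append(reward)

def sample_average(arm: int) -> float:
    """Q(a) = (R1 + ... + Rn) / n over all rewards observed for arm a so far."""
    history = rewards_per_arm[arm]
    return float(np.mean(history)) if history else 0.0

record(3, 1.0)
record(3, 0.0)
print(sample_average(3))  # -> 0.5
```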
Method 2 – Incremental Implementation
Here the Q-value is adjusted immediately with each new reward – without saving all previous data:
Qn+1(a) = Qn(a) + α · (Rn − Qn(a))
- α: Learning rate (0 < α ≤ 1)
- Rn: Newly observed reward
- Qn(a): Previous estimated value
- Qn+1(a): Updated estimated value
If the environment is stable and rewards don’t change, the sample average method works best. But when things change over time, the incremental method with a constant learning rate α adapts more quickly.
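A short sketch of the incremental update with a constant learning rate, illustrating how quickly it tracks a shift in rewards (the reward sequence and α = 0.5 are made up for illustration):

```python
def incremental_update(q_old: float, reward: float, alpha: float = 0.1) -> float:
    """Qn+1(a) = Qn(a) + alpha * (Rn - Qn(a)): nudge the estimate toward the new reward."""
    return q_old + alpha * (reward - q_old)

# Suppose an arm's rewards jump from 0 to 1 halfway through.
q = 0.0
for r in [0, 0, 0, 0, 1, 1, 1, 1]:
    q = incremental_update(q, r, alpha=0.5)

# The constant-alpha estimate has almost caught up with the new reward level,
# while a plain sample average over the same data would still sit at 0.5.
print(round(q, 3))  # -> 0.938
```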

Final Thoughts: What do we need it for?
Multi-armed bandits are the basis for many real-world applications such as recommendation engines or online advertising.
At the same time, they are the perfect stepping stone into reinforcement learning. They teach us the mindset: learning through feedback, acting under uncertainty and balancing exploration and exploitation.
Technically, multi-armed bandits are a simplified form of reinforcement learning: there are no states and no future planning, only the immediate rewards. But the logic behind them shows up again and again in advanced methods like Q-learning, policy gradients, and deep reinforcement learning.