
Multi-Armed Bandits Applied to Order Allocation amongst Execution Algorithms


Confused robot observing three one-armed slot machines in Picasso style. Source: DALL-E 2.

Making decisions under uncertainty is a standard challenge faced by professionals in various fields, including data science and asset management. Asset managers face this problem when choosing among multiple execution algorithms to perform their trades. Allocating orders among algorithms resembles the multi-armed bandit problem that gamblers face when deciding which slot machines to play: they have to determine how many times to play each machine, the order in which to play them, and whether to continue with the current machine or switch to a different one. In this article, we describe how an asset manager can best distribute orders among the available algorithms based on realized execution cost.

Dummy example

For every order, we take an action a to allocate it to one of K algorithms

Eq. 1: Set of possible actions to allocate an order to one of K algorithms.

The value of action a is the expected execution cost for that algorithm

Eq. 2: (Unobserved) expected execution cost for action a, i.e. selecting a certain algorithm.

Suppose that K = 3 and the expected execution costs for the algorithms are

Eq. 3: (Unobserved) expected execution cost for 3 algorithms.

If you knew the action values a priori, the problem would be trivial to solve: you would always select the algorithm with the lowest expected execution cost. Suppose now that we start allocating orders among the three algorithms as shown in Figure 1.

Figure 1: Example of order allocation among three algorithms and associated execution cost. Source: Author.

We still do not know the action values with certainty, but after some time t we do have estimates:

Eq. 4: (Observed) expected execution cost for action a conditional on the information up until time t.

We can, for instance, construct the empirical distribution of the execution cost¹ for each algorithm, as shown in Figure 2.

Figure 2: Empirical distribution of execution cost per algorithm after some time t. Source: Author.
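As a sketch of how such estimates can be computed, the snippet below forms the sample-average estimate of Eq. 4 per algorithm. The cost numbers here are made up for illustration and are not the article's actual data:

```python
import numpy as np

# Hypothetical realized execution costs (in bps) per algorithm after some orders
costs = {
    "algo_A": [2.1, 1.8, 2.4, 2.0],
    "algo_B": [1.5, 1.9, 1.6],
    "algo_C": [2.8, 2.5],
}

# Sample-average estimate Q_t(a) of the expected execution cost per algorithm
estimates = {a: float(np.mean(c)) for a, c in costs.items()}
```

Based on these draws, algo_B would currently look cheapest, but with only a handful of observations per algorithm the estimates remain highly uncertain.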

Allocating all orders to the algorithm with the lowest expected execution cost may seem like the best approach. However, doing so would prevent us from gathering information on the performance of the other algorithms. This illustrates the classical multi-armed bandit dilemma:

  • Exploit the knowledge that has already been learned
  • Explore to learn which actions give the best outcomes

The objective is to minimize the average execution cost after allocating N orders:

Eq. 5: Objective function for order allocation problem.

Solving the problem using policies

To solve the problem, we need an action selection policy that tells us how to allocate each order based on the current information S. We can define a policy as a map from S to a:

Eq. 6: Definition of an action selection policy.

We discuss the most well-known policies² for the multi-armed bandit problem, which can be classified into the following categories:

  • Semi-uniform strategies: Greedy & ε-greedy
  • Probability matching strategies: Upper-Confidence-Bound & Thompson sampling


Greedy

The greedy approach allocates all orders to the action with the lowest estimated value. This policy always exploits current knowledge to minimize the immediate execution cost:

Eq. 7: Action selection policy for the greedy approach.
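A minimal sketch of the greedy rule, using hypothetical cost estimates (lower is better, so we take the argmin):

```python
import numpy as np

def greedy(estimates):
    """Allocate the order to the algorithm with the lowest estimated cost."""
    return int(np.argmin(estimates))

# Hypothetical cost estimates Q_t(a) for three algorithms
q = np.array([2.0, 1.6, 2.6])
best = greedy(q)  # index of the algorithm with cost estimate 1.6
```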


ε-greedy

The ε-greedy approach behaves greedily most of the time, but with probability ε selects randomly among the suboptimal actions:

Eq. 8: Action selection policy for the ε-greedy approach.

An advantage of this policy is that it converges to the optimal action in the limit.
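A sketch of the ε-greedy rule as described above (exploration draws from the non-greedy actions; the value of ε is an assumption):

```python
import numpy as np

def epsilon_greedy(estimates, epsilon=0.1, rng=None):
    """Greedy action with prob. 1 - epsilon; otherwise a random
    non-greedy action."""
    rng = rng or np.random.default_rng()
    best = int(np.argmin(estimates))
    if rng.random() < epsilon:
        others = [a for a in range(len(estimates)) if a != best]
        return int(rng.choice(others))
    return best
```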


Upper-Confidence-Bound

The Upper-Confidence-Bound (UCB) approach selects the action with the lowest action value minus a term that is inversely proportional to the number of times the trading algorithm has been used, i.e. Nt(a). The approach thus selects among the non-greedy actions according to their potential for actually being optimal, taking into account the uncertainty in those estimates:

Eq. 9: Action selection policy for the Upper-Confidence-Bound (UCB) approach.
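A minimal sketch of this rule for cost minimization (the mirror image of the usual reward-maximizing UCB rule); the exploration coefficient c and the first-visit handling are assumptions:

```python
import numpy as np

def ucb(estimates, counts, t, c=2.0):
    """Select the action minimizing Q_t(a) - c * sqrt(ln t / N_t(a))."""
    counts = np.asarray(counts, dtype=float)
    estimates = np.asarray(estimates, dtype=float)
    untried = np.flatnonzero(counts == 0)
    if untried.size:                      # try each algorithm at least once
        return int(untried[0])
    bonus = c * np.sqrt(np.log(t) / counts)
    return int(np.argmin(estimates - bonus))
```

An algorithm that has rarely been used gets a large uncertainty bonus and is therefore more likely to be selected, even if its current cost estimate is not the lowest.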

Thompson Sampling

The Thompson sampling approach, as proposed by Thompson (1933), assumes a known prior distribution over the action values and updates this distribution after each order allocation³. The approach selects actions according to their posterior probability of being the best action:

Eq. 10: Action selection policy for the Thompson sampling approach.
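A sketch of the selection step, assuming independent Gaussian posteriors over each algorithm's mean cost (a simplifying assumption, not the article's specification): draw one sample per posterior and allocate to the lowest draw.

```python
import numpy as np

def thompson_sample(post_means, post_stds, rng=None):
    """Draw one cost sample per algorithm from its posterior and
    allocate the order to the algorithm with the lowest draw."""
    rng = rng or np.random.default_rng()
    draws = rng.normal(post_means, post_stds)
    return int(np.argmin(draws))
```

Algorithms with wide posteriors occasionally produce the lowest draw and get explored; as their posteriors tighten, selection concentrates on the genuinely cheapest algorithm.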

Evaluating policies

In practice, policies are commonly evaluated on regret, which is the deviation from the optimal solution:

Eq. 11: Definition of regret as a function of a sequence of actions.

where μ* is the minimal mean execution cost:

Eq. 12: Expected execution cost for selecting the optimal action.

Actions are a direct consequence of the policy, and we can therefore also define regret as a function of the chosen policy:

Eq. 13: Definition of regret as a function of an action selection policy π.

In Figure 3, we simulate the regret of the aforementioned policies in the dummy example. We observe that the Upper-Confidence-Bound and Thompson sampling approaches perform best.

Figure 3: Simulated regret for different action selection policies in the dummy order allocation problem. Source: Author.
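A minimal simulation in the spirit of Figure 3, with made-up Gaussian cost distributions (not the article's actual parameters), comparing ε-greedy against a UCB-style rule by cumulative regret:

```python
import numpy as np

def simulate(policy, mu, sigma, n_orders, rng):
    """Run one bandit episode; return cumulative regret per step."""
    k = len(mu)
    counts, sums = np.zeros(k), np.zeros(k)
    regret = np.zeros(n_orders)
    mu_star, total = min(mu), 0.0
    for t in range(n_orders):
        q = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        a = policy(q, counts, t + 1, rng)
        cost = rng.normal(mu[a], sigma)       # realized execution cost
        counts[a] += 1
        sums[a] += cost
        total += mu[a] - mu_star              # per-order expected regret
        regret[t] = total
    return regret

def eps_greedy_policy(q, counts, t, rng, eps=0.1):
    if (counts == 0).any():
        return int(np.argmin(counts))         # try each algorithm once first
    if rng.random() < eps:
        return int(rng.integers(len(q)))
    return int(np.argmin(q))

def ucb_policy(q, counts, t, rng, c=2.0):
    if (counts == 0).any():
        return int(np.argmin(counts))
    return int(np.argmin(q - c * np.sqrt(np.log(t) / counts)))

rng = np.random.default_rng(42)
mu, sigma = [2.0, 1.6, 2.6], 0.5              # hypothetical mean costs (bps)
r_eps = simulate(eps_greedy_policy, mu, sigma, 1000, rng)
r_ucb = simulate(ucb_policy, mu, sigma, 1000, rng)
```

Plotting `r_eps` and `r_ucb` against the order index reproduces the qualitative picture of Figure 3: ε-greedy keeps paying a constant exploration tax, while UCB's regret flattens as uncertainty shrinks.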

Allocating orders? Embrace uncertainty!

The dummy example simulation results strongly indicate that relying solely on a greedy approach may not yield optimal outcomes. It is therefore crucial to incorporate and measure the uncertainty in the execution cost estimates when developing an order allocation strategy.


¹ To ensure comparability of the empirical distributions of the execution cost, we need to either allocate similar orders or use order-agnostic cost metrics for evaluation.

² In situations where an algorithm's execution costs depend on the order characteristics, contextual bandits are a more suitable option. To learn more about this approach, we recommend Chapter 2.9 of Sutton & Barto (2018) for an introduction.

³ We strongly recommend Russo et al. (2018) as an excellent resource for learning about Thompson sampling.

Additional resources

The following tutorials and lectures were personally very helpful for my understanding of multi-armed bandit problems.


  1. Research Scientist Robert Schapire @ Microsoft
  2. Research Scientist Hado van Hasselt @ DeepMind


  1. Assistant Professor Christina Lee Yu @ Cornell
  2. Assistant Professor Emma Brunskill @ MIT


[1] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

[2] Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends® in Machine Learning, 11(1), 1–96.

[3] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4), 285–294.

[4] Thompson, W. R. (1935). On the theory of apportionment. American Journal of Mathematics, 57(2), 450–456.

[5] Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. arXiv preprint arXiv:1410.4009.



