## Finding the right balance between exploitation and exploration

Making decisions under uncertainty is a common challenge faced by professionals in many fields, including data science and asset management. Asset managers face this problem when choosing among multiple execution algorithms to perform their trades. Allocating orders among algorithms resembles the multi-armed bandit problem that gamblers face when deciding which slot machines to play: they have to determine the number of times to play each machine, the order in which to play them, and whether to continue with the current machine or switch to another. In this article, we describe how an asset manager can best distribute orders among the available algorithms based on realized execution cost.

## Dummy example

For every order, we take an action *a* to allocate it to one of *K* algorithms.

The value of action *a* is the expected execution cost of the corresponding algorithm.
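
The original equation image is not preserved here; in standard bandit notation the action value would presumably read (with *C* the realized execution cost):

```latex
q(a) = \mathbb{E}\left[\, C \mid A = a \,\right], \qquad a \in \{1, \dots, K\}
```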

Suppose that *K* = 3 and that each of the three algorithms has its own expected execution cost.
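
The concrete values shown in the original figure are not preserved; purely as a hypothetical illustration, one could take

```latex
q(1) = 2, \qquad q(2) = 3, \qquad q(3) = 4 \quad \text{(in basis points)}
```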

If we knew the action values a priori, the problem would be trivial to solve: we would always select the algorithm with the lowest expected execution cost. Suppose now that we start allocating orders among the three algorithms, as shown in Figure 1.

We still do not know the action values with certainty, but we do have estimates after some time *t*:
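
The estimate itself is not reproduced in the text; in the standard bandit formulation it would be the sample average of realized costs for each algorithm:

```latex
Q_t(a) = \frac{\sum_{n=1}^{t-1} C_n \,\mathbb{1}\{A_n = a\}}{N_t(a)},
\qquad N_t(a) = \sum_{n=1}^{t-1} \mathbb{1}\{A_n = a\}
```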

We can, for instance, construct the empirical distribution of the execution cost¹ for each algorithm, as shown in Figure 2.

Allocating all orders to the algorithm with the lowest expected execution cost may seem like the best approach. However, doing so would prevent us from gathering information on the performance of the other algorithms. This illustrates the classical multi-armed bandit dilemma:

- Exploit the knowledge that has already been learned
- Explore to learn which actions give the best outcomes

The objective is to **minimize the average execution cost** after allocating *N* orders:
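
The objective's equation image is missing; assuming *Cₙ* denotes the realized cost of the *n*-th order, it can be reconstructed as:

```latex
\min \; \frac{1}{N} \sum_{n=1}^{N} C_n
```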

## Solving the problem using policies

To solve the problem, we need an action selection policy that tells us how to allocate each order based on the current information *S*. We can define a policy as a map from *S* to *a*:
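
The policy's defining equation is not preserved; a straightforward reconstruction of the stated map is:

```latex
\pi : S \to a, \qquad A_t = \pi(S_t)
```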

We discuss the most well-known policies² for the multi-armed bandit problem, which can be classified into the following categories:

- **Semi-uniform strategies:** *Greedy* & *ε-greedy*
- **Probability matching strategies:** *Upper-Confidence-Bound* & *Thompson sampling*

## Greedy

The *greedy approach* allocates all orders to the action with the lowest estimated value. This policy always exploits current knowledge to maximize immediate reward:
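
In the notation above, the greedy action simply minimizes the estimated cost (our reconstruction of the missing formula):

```latex
A_t = \underset{a}{\arg\min}\; Q_t(a)
```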

## ε-Greedy

The *ε-greedy approach* behaves greedily most of the time but, with probability *ε*, selects randomly among the non-greedy actions:
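
As a minimal sketch of this policy, the following simulates ε-greedy order allocation. The cost distributions (normal, with hypothetical means of 2, 3 and 4 basis points) and all parameter values are illustrative assumptions, not taken from the article:

```python
import numpy as np

# Hypothetical setup: three algorithms whose true expected execution
# costs (unknown to the policy) are 2, 3 and 4 basis points.
rng = np.random.default_rng(42)

def epsilon_greedy_allocate(n_orders, true_costs, epsilon=0.1, noise_sd=1.0):
    """Allocate orders one at a time with an epsilon-greedy policy."""
    k = len(true_costs)
    counts = np.zeros(k)      # N_t(a): times each algorithm was used
    estimates = np.zeros(k)   # Q_t(a): running mean execution cost
    total_cost = 0.0
    for _ in range(n_orders):
        greedy = int(np.argmin(estimates))  # lowest estimated cost
        if rng.random() < epsilon:
            # explore: pick uniformly among the non-greedy algorithms
            a = int(rng.choice([i for i in range(k) if i != greedy]))
        else:
            a = greedy  # exploit current knowledge
        cost = rng.normal(true_costs[a], noise_sd)  # realized execution cost
        counts[a] += 1
        estimates[a] += (cost - estimates[a]) / counts[a]  # incremental mean
        total_cost += cost
    return estimates, counts, total_cost / n_orders

est, cnt, avg = epsilon_greedy_allocate(5000, true_costs=[2.0, 3.0, 4.0])
```

With these assumed costs, the policy quickly concentrates on the cheapest algorithm, while *ε* keeps a trickle of orders flowing to the others so their estimates continue to improve.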

An advantage of this policy is that it converges to the optimal action in the limit.

## Upper-Confidence-Bound

The *Upper-Confidence-Bound (UCB) approach* selects the action with the lowest action value *minus* a term that decreases in the number of times the trading algorithm has been used, i.e. *Nₜ(a)*. The approach thus selects among the non-greedy actions according to their potential for actually being optimal, accounting for the uncertainty in those estimates:
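
The selection rule itself is not preserved; for cost minimization it would presumably be the standard UCB rule with the sign flipped (a lower confidence bound), where *c* > 0 is an exploration parameter:

```latex
A_t = \underset{a}{\arg\min} \left[ Q_t(a) - c \sqrt{\frac{\ln t}{N_t(a)}} \right]
```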

## Thompson Sampling

The *Thompson sampling approach*, as proposed by Thompson (1933), assumes a known initial distribution over the action values and updates this distribution after each order allocation³. The approach selects actions according to their posterior probability of being the best action:
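
A minimal sketch of Thompson sampling for this setting, assuming Gaussian costs with known noise standard deviation and a wide Gaussian prior over each algorithm's mean cost (all of these modeling choices and parameter values are illustrative assumptions):

```python
import numpy as np

# Hypothetical setup: Gaussian execution costs with known noise sd and
# a weak Gaussian prior over each algorithm's mean execution cost.
rng = np.random.default_rng(0)

def thompson_allocate(n_orders, true_costs, noise_sd=1.0):
    """Gaussian Thompson sampling for cost minimization."""
    k = len(true_costs)
    mu = np.zeros(k)          # posterior means (prior mean 0)
    var = np.full(k, 100.0)   # wide prior variance = weak prior
    picks = np.zeros(k)
    for _ in range(n_orders):
        samples = rng.normal(mu, np.sqrt(var))  # one draw per posterior
        a = int(np.argmin(samples))             # act greedily on the draw
        cost = rng.normal(true_costs[a], noise_sd)
        # conjugate Gaussian update of the chosen arm's posterior
        precision = 1.0 / var[a] + 1.0 / noise_sd**2
        mu[a] = (mu[a] / var[a] + cost / noise_sd**2) / precision
        var[a] = 1.0 / precision
        picks[a] += 1
    return mu, picks

mu, picks = thompson_allocate(2000, true_costs=[2.0, 3.0, 4.0])
```

Because each arm is chosen with the probability that its posterior draw is lowest, exploration fades away automatically as the posteriors narrow around the true costs.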

## Evaluating policies

In practice, policies are commonly evaluated on *regret*, which is the deviation from the optimal solution:

where μ* is the minimal mean execution cost:

Actions are a direct consequence of the policy, and we can therefore also define regret as a function of the chosen policy:
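
The regret formulas are not preserved; consistent with the definitions above, they can be reconstructed as:

```latex
\mu^{*} = \min_{a}\, q(a), \qquad
\rho(N) = \mathbb{E}\left[ \sum_{n=1}^{N} q(A_n) \right] - N \mu^{*}
```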

In Figure 3, we simulate the regret of the aforementioned policies in the dummy example. We observe that the *Upper-Confidence-Bound* and *Thompson sampling* approaches perform best.

## Allocating orders? Embrace uncertainty!

The dummy example simulation results strongly indicate that relying solely on a *greedy approach* may not yield optimal outcomes. It is therefore crucial to incorporate and measure the uncertainty in the execution cost estimates when developing an order allocation strategy.

## Footnotes

¹ To ensure comparability of the empirical distributions of the execution cost, we need to either allocate similar orders or use order-agnostic cost metrics for evaluation.

² In situations where an algorithm's execution costs depend on the order characteristics, contextual bandits are a more suitable option. To learn more about this approach, we recommend Chapter 2.9 of Sutton & Barto (2018) as an introduction.

³ We strongly recommend Russo et al. (2018) as an excellent resource for learning about Thompson sampling.

## Additional resources

The following tutorials and lectures were personally very helpful for my understanding of multi-armed bandit problems.

**Industry**

**Academia**

## References

[1] Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction*. MIT Press.

[2] Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. *Foundations and Trends® in Machine Learning*, *11*(1), 1–96.

[3] Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. *Biometrika*, *25*(3/4), 285–294.

[4] Thompson, W. R. (1935). On the theory of apportionment. *American Journal of Mathematics*, *57*(2), 450–456.

[5] Eckles, D., & Kaptein, M. (2014). Thompson sampling with the online bootstrap. *arXiv preprint arXiv:1410.4009*.
