A simple example of hill climbing, and of solving a problem that is difficult to solve without optimization techniques
Betting on the World Series is an old, interesting, and challenging puzzle. It's also a nice problem to demonstrate an optimization technique called hill climbing, which I'll cover in this article.
Hill climbing is a well-established and comparatively simple optimization technique. There are numerous other examples of it online, but I thought this problem allowed for an interesting application of the technique and is worth covering here.
One place the puzzle can be seen is on a page hosted by UC Davis. To save you looking it up, I'll repeat it here:
[E. Berlekamp] Betting on the World Series. You are a broker; your job is to accommodate your client's wishes without placing any of your personal capital at risk. Your client wishes to place an even $1,000 bet on the outcome of the World Series, which is a baseball contest decided in favor of whichever of two teams first wins 4 games. That is, the client deposits his $1,000 with you in advance of the series. At the end of the series he must receive from you either $2,000 if his team wins, or nothing if his team loses. No market exists for bets on the entire World Series. However, you can place even bets, in any amount, on each game individually. What is your strategy for placing bets on the individual games in order to achieve the cumulative result demanded by your client?
So, it's necessary to bet on the games one at a time (though it's also possible to abstain from betting on some games, simply betting $0 on those). After each game, we'll either gain or lose exactly what we bet on that game. We start with the $1000 provided by our client. If our team wins the full series, we want to end with $2000; if they lose, we want to end with $0.
If you've not seen this problem before and wish to try to solve it manually, now's your chance before we go into a description of solving it programmatically. It is a nice problem in itself, and can be worthwhile solving directly before proceeding with a hill-climbing solution.
For this problem, I'm going to assume it's okay to temporarily go negative. That is, if we're ever, during the World Series, below zero dollars, that's okay (we're a large brokerage and can ride this out), so long as we can reliably end with either $0 or $2000. We then return to the client the $0 or $2000.
It's relatively easy to come up with solutions for this that work most of the time, but not necessarily for every scenario. In fact, I've seen a few descriptions of this puzzle online that provide sketches of a solution, but appear not to have been completely tested against every possible sequence of wins and losses.
An example of a policy to bet on the (at most) seven games may be: $125, $250, $500, $125, $250, $500, $1000. With this policy, we bet $125 on the first game, $250 on the second game, and so on, for as many games as are played. If the series lasts five games, for example, we bet: $125, $250, $500, $125, $250. This policy will, in fact, work in most cases, though not all.
Consider the following sequence: 1111, where 0 indicates Team 0 wins a single game and 1 indicates Team 1 wins a single game. In this sequence, Team 1 wins all four games, so wins the series. Let's say our team is Team 1, so we need to end with $2000.
Tracking the games, bets, and dollars held after each game, we have:
Game    Bet    Outcome    Money Held
-----   ----   -------    ----------
Start    -        -          1000
1        125      1          1125
2        250      1          1375
3        500      1          1875
4        125      1          2000
That is, we start with $1000. We bet $125 on the first game. Team 1 wins that game, so we win $125, and now have $1125. We then bet $250 on the second game. Team 1 wins this, so we win $250 and now have $1375. And so on for the next two games, betting $500 and $125 on these. Here, we correctly end with $2000.
Testing the sequence 0000 (where Team 0 wins in four games):
Game    Bet    Outcome    Money Held
-----   ----   -------    ----------
Start    -        -          1000
1        125      0           875
2        250      0           625
3        500      0           125
4        125      0             0
Here we correctly (given that Team 0 wins the series) end with $0.
Testing the sequence 0101011 (where Team 1 wins in seven games):
Game    Bet    Outcome    Money Held
-----   ----   -------    ----------
Start    -        -          1000
1        125      0           875
2        250      1          1125
3        500      0           625
4        125      1           750
5        250      0           500
6        500      1          1000
7        1000     1          2000
Here we again correctly end with $2000.
However, with the sequence 1001101, this policy does not work:
Game    Bet    Outcome    Money Held
-----   ----   -------    ----------
Start    -        -          1000
1        125      1          1125
2        250      0           875
3        500      0           375
4        125      1           500
5        250      1           750
6        500      0           250
7        1000     1          1250
Here, though Team 1 wins the series (with 4 wins in 7 games), we end with only $1250, not $2000.
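To make this failure easy to reproduce, here's a minimal sketch (not from the original article; the simulate() helper is hypothetical) that plays a fixed seven-bet policy against a given win/loss sequence:
def simulate(policy, sequence, money=1000):
    wins = losses = 0
    for bet, outcome in zip(policy, sequence):
        # Gain the bet on a win ('1'), lose it on a loss ('0')
        money += bet if outcome == '1' else -bet
        wins += (outcome == '1')
        losses += (outcome == '0')
        if wins == 4 or losses == 4:  # Series ends at 4 wins
            break
    return money

policy = [125, 250, 500, 125, 250, 500, 1000]
print(simulate(policy, '1111'))     # 2000: works
print(simulate(policy, '0000'))     # 0: works
print(simulate(policy, '1001101'))  # 1250: fails, should be 2000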
Since there are many possible sequences of games, this is difficult to test manually (and quite tedious when you're testing many possible policies), so we'll next develop a function to test if a given policy works properly: that is, if it ends with at least $2000 for every possible series where Team 1 wins, and at least $0 for every possible series where Team 0 wins.
This takes a policy in the form of an array of seven numbers, indicating how much to bet on each of the seven games. In series with only four, five, or six games, the values in the last cells of the policy are simply not used. The above policy can be represented as [125, 250, 500, 125, 250, 500, 1000].
import math

def evaluate_policy(policy, verbose=False):
    if verbose: print(policy)
    total_violations = 0
    for i in range(int(math.pow(2, 7))):
        s = str(bin(i))[2:]
        s = '0' * (7 - len(s)) + s  # Pad the string to ensure it covers 7 games
        if verbose:
            print()
            print(s)
        money = 1000
        number_won = 0
        number_lost = 0
        winner = None
        for j in range(7):
            current_bet = policy[j]
            # Update the money
            if s[j] == '0':
                number_lost += 1
                money -= current_bet
            else:
                number_won += 1
                money += current_bet
            if verbose: print(f"Winner: {s[j]}, bet: {current_bet}, now have: {money}")
            # End the series if either team has won 4 games
            if number_won == 4:
                winner = 1
                break
            if number_lost == 4:
                winner = 0
                break
        if verbose: print("winner:", winner)
        # Accumulate how many dollars short we are of the required amount
        if (winner == 0) and (money < 0):
            total_violations += (0 - money)
        if (winner == 1) and (money < 2000):
            total_violations += (2000 - money)
    return total_violations
This starts by creating a string representation of every possible sequence of wins and losses. This creates a set of 2⁷ (128) strings, starting with '0000000', then '0000001', and so on, up to '1111111'. Some of these are redundant, since some series will end before all seven games are played (once one team has won four games). In production, we'd likely clean this up to reduce execution time, but for simplicity, we simply loop through all 2⁷ combinations. This does have some benefit later, as it treats all 2⁷ (equally likely) combinations equally.
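As a side note, the same set of 128 strings could equivalently be generated with itertools instead of the bin()-based padding above (a small sketch):
from itertools import product

# All 128 possible sequences of seven game outcomes, '0000000' to '1111111'
sequences = [''.join(bits) for bits in product('01', repeat=7)]
print(sequences[0], sequences[1], sequences[-1])  # 0000000 0000001 1111111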
For each of these possible sequences, we apply the policy to determine the bet for each game in the sequence, and keep a running count of the money held. That is, we loop through all 2⁷ possible sequences of wins and losses (quitting once one team has won four games), and for each of these sequences, we loop through the individual games, betting on each one at a time.
In the end, if Team 0 won the series, we ideally have $0; and if Team 1 won the series, we ideally have $2000, though there is no penalty (or benefit) if we have more.
If we don't end a sequence of games with the correct amount of money, we determine how many dollars we're short; that is the cost of that sequence of games. We sum these shortfalls over all possible sequences of games, which gives us an evaluation of how well the policy works overall.
To determine if any given policy works properly or not, we can simply call this method with the given policy (in the form of an array) and check if it returns 0 or not. Anything higher indicates that there are one or more sequences where the broker ends with too little money.
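For instance, checking the fixed policy from the tables above (a hypothetical usage example):
policy = [125, 250, 500, 125, 250, 500, 1000]
print(evaluate_policy(policy))  # Positive: this policy fails on some sequences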
I won't go into too much detail about hill climbing, as it's fairly well understood and well documented in many places, but will describe the basic idea very quickly. Hill climbing is an optimization technique. We typically start by generating a candidate solution to a problem, then modify it in small steps, with each step moving to better and better solutions, until we eventually reach the optimal point (or get stuck in a local optimum).
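In skeleton form, the idea looks something like this (a minimal sketch; the score() and neighbor() functions are hypothetical placeholders, and lower scores are assumed better, as in this problem):
def hill_climb(initial, score, neighbor, max_iters=10_000):
    best = initial
    best_score = score(best)
    for _ in range(max_iters):
        candidate = neighbor(best)            # Small random modification
        candidate_score = score(candidate)
        if candidate_score <= best_score:     # Keep any non-worse step
            best, best_score = candidate, candidate_score
        if best_score == 0:                   # Perfect solution (for this problem)
            break
    return best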
To solve this problem, we can start with any possible policy. For example, we can start with: [-1000, -1000, -1000, -1000, -1000, -1000, -1000]. This particular policy is certain to work poorly; we'd effectively bet heavily against Team 1 in all seven games. But that's okay. Hill climbing works by starting anywhere and then progressively moving towards better solutions, so even starting with a poor solution, we'll ideally eventually reach a strong one. Though, in some cases, we may not, and it's sometimes necessary (or at least useful) to re-run hill climbing algorithms from different starting points. In this case, starting with a very poor initial policy works fine.
Playing with this puzzle manually before coding it, we may conclude that a policy needs to be a bit more complex than a single array of seven values. That form of policy determines the size of each bet entirely based on which game it is, ignoring the numbers of wins and losses so far. What we need to represent the policy is actually a 2D array, such as:
[[-1000, -1000, -1000, -1000, -1000, -1000, -1000],
[-1000, -1000, -1000, -1000, -1000, -1000, -1000],
[-1000, -1000, -1000, -1000, -1000, -1000, -1000],
[-1000, -1000, -1000, -1000, -1000, -1000, -1000]]
There are other ways to do this, but, as we'll show below, this method works quite well.
Here, the rows represent the number of wins so far for Team 1: either 0, 1, 2, or 3. The columns, as before, indicate the current game number: either 1, 2, 3, 4, 5, 6, or 7.
Again, with the policy shown, we'd bet $1000 against Team 1 every game no matter what, so almost any random policy is certain to be at least slightly better.
This policy has 4×7, or 28, values. Though some are unnecessary and this could be optimized somewhat, I've opted for simplicity over efficiency here; generally we'd optimize this a bit more in a production environment. In this case, we could remove some impossible cases, like having 0 wins by games 5, 6, or 7 (with no wins for Team 1 by game 5, Team 0 must have 4 wins, thus ending the series). Twelve of the 28 cells are effectively unreachable, with the remaining 16 relevant.
For simplicity, this isn't used in this example, but the cells that are actually relevant are the following, where I've placed a -1000:
[[-1000, -1000, -1000, -1000, n/a, n/a, n/a ],
[ n/a, -1000, -1000, -1000, -1000, n/a, n/a ],
[ n/a, n/a, -1000, -1000, -1000, -1000, n/a ],
[ n/a, n/a, n/a, -1000, -1000, -1000, -1000]]
The cells marked 'n/a' are not relevant. For example, in the first game, it's impossible to have already had 1, 2, or 3 wins; only 0 wins is possible at that point. On the other hand, by game 4, it is possible to have 0, 1, 2, or 3 previous wins.
Also playing with this manually before coding anything, it's possible to see that each bet is likely a multiple of either halves of $1000, quarters of $1000, eighths, sixteenths, and so on. Though this isn't necessarily the optimal solution, I'm going to assume that all bets are multiples of $500, $250, $125, $62.50, or $31.25, and that they may be $0.
I will, though, assume that there is never a case to bet against Team 1; while the initial policy starts out with negative bets, the process to generate new candidate policies uses only bets between $0 and $1000, inclusive.
There are, then, 33 possible values for each bet (each multiple of $31.25 from $0 to $1000). Given the full 28 cells, and assuming bets are multiples of $31.25, there are 33²⁸ possible combinations for the policy, so testing them all is infeasible. Limiting this to the 16 reachable cells, there are still 33¹⁶ possible combinations. There may be further optimizations possible, but there would nevertheless be an extremely large number of combinations to check exhaustively, far more than would be feasible. That is, directly solving this problem may be possible programmatically, but a brute-force approach, using only the assumptions stated here, would be intractable.
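A quick check of those search-space sizes (just arithmetic, to make the scale concrete):
print(f"{33**28:.2e}")  # ~3.30e+42 combinations for all 28 cells
print(f"{33**16:.2e}")  # ~1.98e+24 combinations even for just the 16 reachable cells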
So, an optimization technique such as hill climbing can be quite appropriate here. By starting at a random location on the solution landscape (a random policy, in the form of a 4×7 matrix), and constantly (metaphorically) moving uphill (with each step we move to a solution that's better, even if only slightly, than the previous one), we eventually reach the highest point, in this case a workable policy for the World Series Betting Problem.
Given that the policies will be represented as 2D matrices and not 1D arrays, the code above to determine the current bet is modified from:
current_bet = policy[j]
to:
current_bet = policy[number_won][j]
That is, we determine the current bet based on both the number of games won so far and the number of the current game. Otherwise, the evaluate_policy() method is as above. The code to evaluate a policy is actually the bulk of the code.
We next show the main code, which starts with a random policy, and then loops (up to 10,000 times), each time modifying and (hopefully) improving this policy. In each iteration of the loop, it generates 10 random variations of the current-best solution and takes the best of these as the new current solution (or keeps the current solution if none are better, simply continuing to loop until we do have a better solution).
import numpy as np
import math
import copy

policy = [[-1000, -1000, -1000, -1000, -1000, -1000, -1000],
          [-1000, -1000, -1000, -1000, -1000, -1000, -1000],
          [-1000, -1000, -1000, -1000, -1000, -1000, -1000],
          [-1000, -1000, -1000, -1000, -1000, -1000, -1000]]

best_policy = copy.deepcopy(policy)
best_policy_score = evaluate_policy(policy)
print("starting score:", best_policy_score)

for i in range(10_000):
    if i % 100 == 0: print(i)
    # Each iteration, generate 10 candidate solutions similar to the
    # current best solution and take the best of these (if any are better
    # than the current best).
    for j in range(10):
        policy_candidate = vary_policy(policy)
        policy_score = evaluate_policy(policy_candidate)
        if policy_score <= best_policy_score:
            best_policy_score = policy_score
            best_policy = policy_candidate
    policy = copy.deepcopy(best_policy)
    print(best_policy_score)
    display(policy)  # display() assumes a Jupyter environment
    if best_policy_score == 0:
        print(f"Breaking after {i} iterations")
        break

print()
print("FINAL")
print(best_policy_score)
display(policy)
Running this, the main loop executed 1,541 times before finding a solution. In each iteration, it calls vary_policy() (described below) ten times to generate ten variations of the current policy. It then calls evaluate_policy() to evaluate each. This was defined above, and gives a score (in dollars) of how short the broker can come up using this policy across the full set of 128 instances of the World Series (we can divide this by 128 to get the expected loss for any single series). The lower the score, the better.
The initial solution had a score of 153,656.25 (an expected shortfall of roughly $1,200 per series), so quite poor, as expected. It rapidly improves from there, quickly dropping to around 100,000, then 70,000, then 50,000, and so on. Printing the best policies found so far as the code executes also shows progressively more sensible policies.
The following code generates a single variation on the current policy:
def vary_policy(policy):
    new_policy = copy.deepcopy(policy)
    num_change = np.random.randint(1, 10)  # Number of cells to modify
    for _ in range(num_change):
        win_num = np.random.choice(4)   # Row: wins so far for Team 1 (0-3)
        game_num = np.random.choice(7)  # Column: game number (0-6)
        new_val = np.random.choice([x * 31.25 for x in range(33)])
        new_policy[win_num][game_num] = new_val
    return new_policy
Here we first select the number of cells in the 4×7 policy to change, between 1 and 9 (note that np.random.randint(1, 10) excludes the upper bound). It's possible to modify fewer cells, and this can improve performance when the scores are getting close to zero. That is, once we have a strong policy, we likely want to change it less than we would near the beginning of the process, where the solutions tend to be weak and there is more emphasis on exploring the search space.
However, consistently modifying a small, fixed number of cells does allow getting stuck in local optima (sometimes there is no modification of exactly, say, 1 or 2 cells that works better, and it's necessary to change more cells to see an improvement), and doesn't always work well. Randomly selecting the number of cells to modify avoids this. Though, the maximum used here is just for demonstration, and isn't the result of any tuning.
If we were to limit ourselves to the 16 reachable cells of the 4×7 matrix for changes, this code would need only minor changes, simply skipping updates to the unreachable cells, and marking them with a special symbol (such as np.nan) for clarity when displaying the matrices.
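A sketch of what that restriction could look like (this variant isn't part of the original solution; with w prior wins for Team 1 at 0-indexed game g, a cell is reachable when w is at most g and the opponent's g - w wins are at most 3):
def vary_policy_reachable(policy):
    new_policy = copy.deepcopy(policy)
    # The 16 (wins, game) pairs that can actually occur mid-series
    reachable = [(w, g) for w in range(4) for g in range(7)
                 if w <= g and g - w <= 3]
    for _ in range(np.random.randint(1, 10)):
        win_num, game_num = reachable[np.random.choice(len(reachable))]
        new_policy[win_num][game_num] = np.random.choice([x * 31.25 for x in range(33)])
    return new_policy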
In the end, the algorithm was able to find the following policy. That is, in the first game, we will have no wins, so will bet $312.50. In the second game, we will have either zero wins or one win, but in either case will bet $312.50. In the third game, we will have either zero, one, or two wins, so will bet $250, $375, or $250 respectively, and so on, up to, at most, seven games. If we reach game 7, we must have 3 wins, and will bet $1000 on that game.
[[312.5, 312.5, 250.0, 125.0, 718.75, 31.25, 281.25],
[375.0, 312.5, 375.0, 375.0, 250.0, 312.5, 343.75],
[437.5, 156.25, 250.0, 375.0, 500.0, 500.0, 781.25],
[750.0, 718.75, 343.75, 125.0, 250.0, 500.0, 1000.0]]
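We can confirm this result by scoring the final policy with the 2D version of evaluate_policy() (per the run described above, it scores a perfect 0):
final_policy = [[312.5, 312.5, 250.0, 125.0, 718.75, 31.25, 281.25],
                [375.0, 312.5, 375.0, 375.0, 250.0, 312.5, 343.75],
                [437.5, 156.25, 250.0, 375.0, 500.0, 500.0, 781.25],
                [750.0, 718.75, 343.75, 125.0, 250.0, 500.0, 1000.0]]
print(evaluate_policy(final_policy))  # 0: no shortfall on any of the 128 sequences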
I've also created a plot of how the score for the best policy found so far drops (that is, improves; smaller is better) over the 1,541 iterations:
This is a bit hard to see since the score is initially quite large, so we plot it again, skipping the first 15 steps:
We can see the score initially continuing to drop quickly, even after the first 15 steps, then going into a long period of little improvement until it eventually finds a small modification to the current policy that improves it, followed by more drops until we eventually reach a perfect score of 0 (being $0 short for any possible sequence of wins and losses).
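Plots like these can be produced with matplotlib, assuming we appended best_policy_score to a score_history list in each iteration of the main loop (the score_history name is an assumption; it isn't in the code above):
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(score_history)  # Full history, dominated by the early large scores
ax2.plot(range(15, len(score_history)), score_history[15:])  # Skip the first 15 steps
for ax in (ax1, ax2):
    ax.set_xlabel("Iteration")
    ax.set_ylabel("Best score (dollars short)")
plt.tight_layout()
plt.show()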
The problem we worked on here is an example of what's known as a constraint satisfaction problem, where we simply wish to find a solution that satisfies all the given constraints (in this case, we take the constraints as hard constraints: we must end correctly with either $0 or $2000 for any possible valid sequence of games).
Given two or more full solutions to the problem, there is no sense of one being better than the other; any that works is good, and we can stop once we have a workable policy. The N Queens problem and Sudoku are two other examples of problems of this type.
Other types of problems may have a sense of optimality. For example, with the Travelling Salesperson Problem, any solution that visits every city exactly once is a valid solution, but each solution has a different score, and some are strictly better than others. In that type of problem, it's never clear when we've reached the best possible solution, and we usually simply try for a fixed number of iterations (or amount of time), or until we've reached a solution with at least some minimal level of quality. Hill climbing can also be used with these types of problems.
It's also possible to formulate a problem where it's necessary to find, not just one, but all workable solutions. In the case of the Betting on World Series problem, it was easy to find a single workable solution, but finding all solutions would be much harder, requiring an exhaustive search (though one optimized to quickly remove cases that are equivalent, or to quit evaluation early where policies have a clear outcome).
Similarly, we could re-formulate the Betting on World Series problem to simply require a good, but not perfect, solution. For example, we could accept solutions where the broker comes out even most of the time, and only slightly behind in other cases. In that case, hill climbing can still be used, but something like a random search or grid search is also possible; taking the best policy found after a fixed number of trials may work sufficiently well in that case.
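For comparison, a random search over full policies could look like this (a minimal sketch, not from the article; it assumes the 2D version of evaluate_policy()):
def random_search(n_trials=10_000):
    best, best_score = None, float('inf')
    for _ in range(n_trials):
        # Sample a fully random 4x7 policy, each bet a multiple of $31.25
        candidate = [[np.random.choice([x * 31.25 for x in range(33)])
                      for _ in range(7)] for _ in range(4)]
        score = evaluate_policy(candidate)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score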
In problems harder than the Betting on World Series problem, simple hill climbing as we've used here may not be sufficient. It may be necessary, for example, to maintain a memory of previous policies, or to incorporate a process called simulated annealing (where we take, from time to time, a sub-optimal next step, one that may even have lower quality than the current solution, in order to help break away from local optima).
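The core of simulated annealing is its acceptance rule, sketched below (a hypothetical helper, not part of the solution above; lower scores are better, and the temperature parameter would typically be decreased over time):
def accept(candidate_score, current_score, temperature):
    if candidate_score <= current_score:
        return True  # Always accept an improvement
    # Occasionally accept a worse candidate to escape local optima
    return np.random.random() < np.exp((current_score - candidate_score) / temperature)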
For more complex problems, it may be better to use Bayesian Optimization, Evolutionary Algorithms, Particle Swarm Intelligence, or other more advanced methods. I'll hopefully cover these in future articles, but this was a relatively simple problem, and straightforward hill climbing worked quite well (though, as indicated, it could easily be optimized to work better).
This article provided a simple example of hill climbing. The problem was relatively straightforward, so hopefully easy enough to follow for anyone not previously familiar with hill climbing, and a nice example even if you are familiar with the technique.
What's interesting, I think, is that despite this problem being solvable in other ways, optimization techniques such as the one used here are likely the simplest and most effective way to approach it. While tricky to solve otherwise, this problem was quite easy to solve using hill climbing.
All images by author