
Dynamic Pricing with Multi-Armed Bandit: Learning by Doing


Applying Reinforcement Learning strategies to real-world use cases, especially in dynamic pricing, can reveal many surprises


Within the vast world of decision-making problems, one dilemma belongs especially to Reinforcement Learning strategies: exploration versus exploitation. Imagine walking into a casino with rows of slot machines (also often called "one-armed bandits"), where each machine pays out a different, unknown reward. Do you explore and play each machine to find out which one has the best payout, or do you stick with one machine, hoping it's the jackpot? This metaphorical scenario underpins the concept of the Multi-armed Bandit (MAB) problem. The goal is to find a strategy that maximizes the rewards over a series of plays. While exploration offers new insights, exploitation leverages the knowledge you already possess.

Now, transpose this principle to dynamic pricing in a retail scenario. Suppose you are an e-commerce store owner with a new product. You aren't certain about its optimal selling price. How do you set a price that maximizes your revenue? Should you explore different prices to understand customer willingness to pay, or should you exploit a price that has been performing well historically? Dynamic pricing is essentially a MAB problem in disguise. At every time step, each candidate price point can be seen as an "arm" of a slot machine, and the revenue generated from that price is its "reward." Another way to see this is that the objective of dynamic pricing is to swiftly and accurately measure how a customer base's demand reacts to various price points. In simpler terms, the aim is to pinpoint the demand curve that best mirrors customer behavior.

In this article, we'll explore four Multi-armed Bandit algorithms and evaluate their efficacy against a well-defined (though not straightforward) demand curve. We'll then dissect the main strengths and limitations of each algorithm and look at the key metrics that are instrumental in gauging their performance.

Traditionally, demand curves in economics describe the relationship between the price of a product and the quantity of that product consumers are willing to buy. They generally slope downwards, reflecting the common observation that as price rises, demand typically falls, and vice versa. Consider popular products such as smartphones or concert tickets. If prices are lowered, more people tend to buy, but if prices skyrocket, even the most ardent fans might think twice.

Yet in our context, we'll model the demand curve slightly differently: we're plotting price against probability. Why? Because in dynamic pricing scenarios, especially for digital goods or services, it's often more meaningful to think in terms of the likelihood of a sale at a given price than to speculate about exact quantities. In such environments, each pricing attempt can be seen as an exploration of the probability of success (or purchase), which can easily be modeled as a Bernoulli random variable with a probability p that depends on the test price.

Here's where it gets particularly interesting: while intuitively one might think the task of our Multi-armed Bandit algorithms is to unearth the ideal price where the probability of purchase is highest, it's not quite so straightforward. In reality, our ultimate goal is to maximize the revenue (or the margin). This means we're not looking for the price that gets the most people to click 'buy'; we're looking for the price that, when multiplied by its associated purchase probability, gives the highest expected return. Imagine setting a high price at which fewer people buy, but where each sale generates significant revenue. On the flip side, a very low price might attract more buyers, but the total revenue might still be lower than in the high-price scenario. So, in our context, talking about the 'demand curve' is somewhat unconventional, as our target curve will primarily represent the probability of purchase rather than the demand directly.

Now, getting to the math, let's start by saying that consumer behavior, especially when it comes to price sensitivity, isn't always linear. A linear model would suggest that for every incremental increase in price, there's a constant decrement in demand. In reality, this relationship is often more complex and nonlinear. One way to model this behavior is by using logistic functions, which can capture this nuanced relationship more effectively. Our chosen model for the demand curve is then:
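$$p(x) = \frac{a}{1 + e^{b\,x}}$$

where x is the price and p(x) is the resulting purchase probability (a logistic form; written this way, it reproduces the numerical values used later in the article).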

Here, a denotes the maximum achievable probability of purchase, while b modulates the sensitivity of the demand curve to price changes. A higher value of b means a steeper curve, approaching low purchase probabilities more rapidly as the price increases.

Four examples of demand curves with different combinations of the parameters a and b

For any given price point, we'll then be able to obtain an associated purchase probability, p. We can then feed p into a Bernoulli random variable generator to simulate the response of a customer to a particular price proposal. In other words, given a price, we can easily emulate our reward function.

Next, we can multiply this function by the price in order to get the expected revenue for a given price point:
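$$\mathbb{E}[\text{revenue}](x) = x \cdot p(x) = \frac{a\,x}{1 + e^{b\,x}}$$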

Unsurprisingly, this function doesn't reach its maximum at the highest probability. Also, the price associated with the maximum doesn't depend on the value of the parameter a, while the maximum expected return does.

Expected revenue curves with their associated maxima

With some recollection of calculus, we can also derive the formula for the derivative (you'll need to use a combination of both the product and the chain rule). It's not exactly a relaxing exercise, but it's nothing too difficult. Here is the analytical expression of the derivative of the expected revenue:
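$$\frac{d}{dx}\left(\frac{a\,x}{1 + e^{b\,x}}\right) = \frac{a\left(1 + e^{b\,x} - b\,x\,e^{b\,x}\right)}{\left(1 + e^{b\,x}\right)^{2}}$$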

This derivative allows us to find the exact price that maximizes our expected revenue curve. In other words, by using this formula in tandem with a numerical root-finding algorithm, we can easily determine the price that sets it to 0. This, in turn, is the price that maximizes the expected revenue.

And this is precisely what we need, since by fixing the values of a and b, we immediately know the target price that our bandits will have to find. Coding this in Python is a matter of a few lines of code:
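Here is a minimal sketch of that computation (function names such as purchase_probability and expected_revenue_derivative are mine, and SciPy's brentq is just one convenient root finder):

```python
import numpy as np
from scipy.optimize import brentq

a, b = 2, 0.042

def purchase_probability(price):
    # Logistic demand curve: probability of purchase at a given price
    return a / (1 + np.exp(b * price))

def expected_revenue_derivative(price):
    # Derivative of price * purchase_probability(price), as derived above
    e = np.exp(b * price)
    return a * (1 + e - b * price * e) / (1 + e) ** 2

# The derivative is positive at low prices and negative at high prices,
# so any bracket containing the sign change works for these parameters.
optimal_price = brentq(expected_revenue_derivative, 1, 100)
optimal_probability = purchase_probability(optimal_price)
print(optimal_price, optimal_probability, optimal_price * optimal_probability)
# ~30.44, ~0.436, ~13.26
```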

For our use case, we'll set a = 2 and b = 0.042, which gives us a target price of about 30.44, associated with an optimal purchase probability of 0.436 (so the optimal average reward is 30.44 * 0.436 = 13.26). This price is of course unknown in general, and it is precisely the price that our Multi-armed Bandit algorithms will seek.

Now that we've identified our objective, it's time to explore various strategies for testing them and analyzing their performance, strengths, and weaknesses. While several algorithms exist in the MAB literature, when it comes to real-world scenarios, four primary strategies (together with their variations) predominantly form the backbone. In this section, we'll provide a brief overview of these strategies. We assume the reader has a foundational understanding of them; however, for those interested in a more in-depth exploration, references are provided at the end of the article. After introducing each algorithm, we'll also present its Python implementation. Although each algorithm has its own set of parameters, they all commonly use one key input: the arm_avg_reward vector. This vector holds the average reward obtained from each arm (or action/price) up to the current time step t. This critical input guides all the algorithms in making informed decisions about the next price to set.

The algorithms I'm going to apply to our dynamic pricing problem are the following:

Greedy: This strategy is like always going back to the machine that gave you the most coins the first few times you played. After trying out each machine a bit, it sticks with the one that seemed the best. But there can be a problem. What if that machine was just lucky at the start? The Greedy strategy might miss out on better options. On the bright side, the code implementation is really simple:
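A minimal sketch of the idea, assuming the arm_avg_reward vector described earlier (the details may differ from the original implementation):

```python
import numpy as np

def greedy(arm_avg_reward):
    # Cold start: with an all-zero reward vector, argmax would always return
    # index 0 and bias the search toward the first arm, so pick at random instead.
    if np.all(arm_avg_reward == 0):
        return np.random.randint(len(arm_avg_reward))
    return int(np.argmax(arm_avg_reward))
```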

It's important to distinguish the initial scenario (when all rewards are 0) from the regular one. Often, you'll find only the 'else' part implemented, which indeed works even when all rewards are at 0. Yet, this approach can lead to a bias toward the first element. If you make this oversight, you may end up paying for that bias, particularly if the optimal reward happens to be tied to the first arm (yes, I've been there). The Greedy approach is usually the worst-performing one, and we'll primarily use it as our performance baseline.

ϵ-greedy: The ε-greedy (epsilon-greedy) algorithm is a modification that tackles the main drawback of the greedy approach. It introduces a probability ε (epsilon), typically a small value, of selecting a random arm, promoting exploration. With probability 1−ε, it chooses the arm with the highest estimated reward, favoring exploitation. By balancing random exploration against exploitation of known rewards, the ε-greedy strategy aims to achieve better long-term returns compared to purely greedy methods. Again, the implementation is immediate: it's simply an extra 'if' on top of the Greedy code.
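A minimal sketch, building on the greedy function sketched above (epsilon = 0.1 is an assumed default, not a value prescribed here):

```python
def epsilon_greedy(arm_avg_reward, epsilon=0.1):
    # With probability epsilon explore a random arm, otherwise exploit greedily
    if np.random.random() < epsilon:
        return np.random.randint(len(arm_avg_reward))
    return greedy(arm_avg_reward)
```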

UCB1 (Upper Confidence Bound): The UCB1 strategy is like a curious explorer trying to find the best restaurant in a new city. While there's a favorite spot they've enjoyed, the allure of potentially discovering an even better place grows with each passing day. In our context, UCB1 combines the rewards of known price points with the uncertainty of those explored less. Mathematically, this balance is achieved through a formula: the average reward of a price point plus an "uncertainty bonus" based on how rarely that price has been tried so far.
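Assuming the standard UCB1 exploration term, scaled by an exploration weight C, the bonus for arm i at time step t is

$$C \sqrt{\frac{2 \ln t}{n_i}}$$

where n_i is the number of times arm i has been played so far.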

This term represents the "growing curiosity" about the prices tried less often. The hyperparameter C controls the balance between exploitation and exploration, with higher values of C encouraging more exploration of less-sampled arms. By always choosing the price with the highest combined value of known reward and curiosity bonus, UCB1 ensures a mix of sticking with what's known and venturing into the unknown, aiming to uncover the optimal price point for maximum revenue. I'll start with the by-the-book implementation of this approach, but we'll soon see that we need to tweak it a bit.
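A minimal by-the-book sketch; besides arm_avg_reward, it assumes the caller also tracks the per-arm pull counts (arm_counts) and the current step t:

```python
def ucb1(arm_avg_reward, arm_counts, t, C=1.0):
    # Play every arm at least once before applying the formula
    # (a zero count would otherwise cause a division by zero)
    if np.any(arm_counts == 0):
        return int(np.argmin(arm_counts))
    bonus = C * np.sqrt(2 * np.log(t) / arm_counts)
    return int(np.argmax(arm_avg_reward + bonus))
```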

Thompson Sampling: This Bayesian approach addresses the exploration-exploitation dilemma by probabilistically selecting arms based on their posterior reward distributions. When the rewards follow a Bernoulli distribution, representing binary outcomes like success/failure, Thompson Sampling (TS) employs the Beta distribution as a conjugate prior (see the standard conjugate prior tables for reference). Starting with a non-informative Beta(1,1) prior for each arm, the algorithm updates the distribution's parameters upon observing rewards: a success increases the alpha parameter, while a failure increments the beta. During each play, TS draws from the current Beta distribution of each arm and picks the one with the highest sampled value. This method allows TS to adjust dynamically based on the gathered rewards, adeptly balancing the exploration of uncertain arms with the exploitation of those known to be rewarding. In our specific scenario, although the underlying reward function follows a Bernoulli distribution (1 for a purchase and 0 for a missed purchase), the actual reward of interest is the product of this basic reward and the price under test. Hence, our implementation of TS will need a slight modification (which will also introduce some surprises).
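A minimal sketch of the price-weighted selection step described next (alpha and beta hold the per-arm Beta posterior parameters, updated elsewhere by the simulation loop):

```python
def thompson_sampling(alpha, beta, prices):
    # Draw one sample per arm from its current Beta posterior
    samples = np.random.beta(alpha, beta)
    # Key change for dynamic pricing: weight each sampled purchase probability
    # by its price, so the choice targets expected revenue rather than raw probability
    return int(np.argmax(samples * np.asarray(prices)))
```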

The change is actually pretty simple: to determine the most promising next arm, the samples drawn from the posterior estimates are multiplied by their respective price points (the final argmax line in the sketch above). This modification ensures that decisions are anchored to the expected average revenue, shifting the focus away from the highest purchase probability.

At this point, having gathered all the key ingredients to build a simulation comparing the performance of the four algorithms in our dynamic pricing context, we must ask ourselves: what exactly will we be measuring? The metrics we choose are pivotal, as they will guide us in the process of both comparing and improving the algorithm implementations. In this endeavor, I'm zeroing in on three key indicators (a minimal sketch of how they can be computed follows the list):

  1. Regret: This metric measures the difference between the reward obtained by the chosen action and the reward that would have been obtained by taking the best possible action. Mathematically, regret at time t is given by: Regret(t) = Optimal Reward(t) − Actual Reward(t). Regret, when accumulated over time, provides insight into how much we've "lost" by not always selecting the best action. It's preferred over cumulative reward because it provides a clearer indication of the algorithm's performance relative to the optimal scenario. Ideally, a regret value close to 0 indicates proximity to optimal decision-making.
  2. Reactivity: This metric gauges the speed at which an algorithm approaches a target average reward. Essentially, it's a measure of the algorithm's adaptability and learning efficiency. The quicker an algorithm achieves the desired average reward, the more reactive it is, implying a swifter adjustment to the optimal price point. In our case the target is set at 95% of the optimal average reward of 13.26. However, the initial steps can exhibit high variability. For instance, a lucky early choice might result in a success from a low-probability arm associated with a high price, reaching the threshold quickly. Because of such fluctuations, I've opted for a stricter definition of reactivity: the number of steps required to reach 95% of the optimal average reward ten times, excluding the initial 100 steps.
  3. Arms Allocation: This indicates the frequency with which each algorithm uses the available arms. Presented as a percentage, it reveals the algorithm's propensity to select each arm over time. Ideally, for the most efficient pricing strategy, we'd want an algorithm to allocate 100% of its selections to the best-performing arm and 0% to the rest. Such an allocation would inherently result in a regret value of 0, denoting optimal performance.
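One possible reading of these definitions in code (compute_metrics is a hypothetical helper, and the fallback to the total number of steps when the reactivity threshold is never hit ten times is an assumption):

```python
def compute_metrics(rewards, selected_arms, n_arms=5, optimal_avg_reward=13.26):
    rewards = np.asarray(rewards, dtype=float)
    # Cumulative regret: gap between the optimal expected reward and what was earned
    cumulative_regret = np.cumsum(optimal_avg_reward - rewards)
    # Running average reward after each step
    running_avg = np.cumsum(rewards) / np.arange(1, len(rewards) + 1)
    # Reactivity: step at which the running average crosses 95% of the optimal
    # average reward for the tenth time, ignoring the first 100 steps
    hits = np.where(running_avg[100:] >= 0.95 * optimal_avg_reward)[0]
    reactivity = int(hits[9]) + 100 if len(hits) >= 10 else len(rewards)
    # Arm allocation: share of plays assigned to each arm
    allocation = np.bincount(selected_arms, minlength=n_arms) / len(selected_arms)
    return cumulative_regret, reactivity, allocation
```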

Evaluating MAB algorithms poses challenges due to the highly stochastic nature of their outcomes. Because of the inherent randomness, results can vary greatly from one run to another. For a robust evaluation, the most effective approach is to execute the target simulation multiple times, accumulate the results and metrics from each simulation, and then compute their average.

The initial step involves creating a function to simulate the decision-making process. This function will implement the feedback loop represented in the image below.

Feedback loop implemented in the simulation function

This is the implementation of the simulation loop:
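Below is a minimal sketch of such a loop, reusing the demand curve and the strategy functions sketched earlier; the strategy names and the bookkeeping details are assumptions rather than the original code:

```python
def run_simulation(prices, nstep, strategy):
    n_arms = len(prices)
    arm_counts = np.zeros(n_arms)
    arm_total_reward = np.zeros(n_arms)
    arm_avg_reward = np.zeros(n_arms)
    alpha, beta = np.ones(n_arms), np.ones(n_arms)  # Thompson Sampling posteriors
    rewards, selected_arms = [], []

    for t in range(1, nstep + 1):
        # Ask the chosen strategy for the next price to propose
        if strategy == 'thompson':
            arm = thompson_sampling(alpha, beta, prices)
        elif strategy == 'ucb1':
            arm = ucb1(arm_avg_reward, arm_counts, t)
        elif strategy == 'eps_greedy':
            arm = epsilon_greedy(arm_avg_reward)
        else:
            arm = greedy(arm_avg_reward)

        # Simulate the customer's response as a Bernoulli draw from the demand curve
        purchased = np.random.random() < purchase_probability(prices[arm])
        reward = prices[arm] * purchased

        # Update the statistics fed back to the strategies
        arm_counts[arm] += 1
        arm_total_reward[arm] += reward
        arm_avg_reward[arm] = arm_total_reward[arm] / arm_counts[arm]
        alpha[arm] += purchased
        beta[arm] += 1 - purchased

        rewards.append(reward)
        selected_arms.append(arm)

    return np.array(rewards), np.array(selected_arms)
```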

The inputs to this function are:

  • prices: A list of candidate prices we want to test (essentially our "arms").
  • nstep: The total number of steps in the simulation.
  • strategy: The algorithm we want to test for deciding on the next price.

Lastly, we need to write the code for the outer loop. For each target strategy, this loop will call run_simulation multiple times, collect and aggregate the results from each execution, and then display the results.
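A minimal sketch of this outer loop, assuming the run_simulation and compute_metrics sketches above (run_experiment is a hypothetical name, and plotting of the aggregated metrics is omitted):

```python
def run_experiment(prices, nstep, nepoch, strategies):
    # Repeat each simulation nepoch times and collect the per-run metrics,
    # so the reported figures are averages over many stochastic runs
    results = {s: [] for s in strategies}
    for s in strategies:
        for _ in range(nepoch):
            rewards, selected_arms = run_simulation(prices, nstep, s)
            results[s].append(compute_metrics(rewards, selected_arms, n_arms=len(prices)))
    return results
```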

For our evaluation, we'll use the following configuration parameters:

  • prices: Our price candidates → [20, 30, 40, 50, 60]
  • nstep: Number of time steps for each simulation → 10000
  • nepoch: Number of simulation executions → 1000

Moreover, once our price candidates are set, we can immediately obtain the associated purchase probabilities, which are (roughly) [0.60, 0.44, 0.31, 0.22, 0.15].
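As a quick check, feeding these candidates into the demand curve sketched earlier reproduces those values:

```python
prices = [20, 30, 40, 50, 60]
print([round(purchase_probability(p), 2) for p in prices])
# [0.6, 0.44, 0.31, 0.22, 0.15]
```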

After running the simulation, we're finally able to see some results. Let's start with the plot of the cumulative regret:

From the graph, we can see that TS is the winner in terms of mean cumulative regret, but it takes around 7,500 steps to surpass ε-greedy. On the other hand, we have a clear loser: UCB1. In its basic configuration, it essentially performs on par with the greedy approach (we'll get back to this later). Let's try to understand the results better by exploring the other available metrics. In all four cases, the reactivity shows very large standard deviations, so we'll focus on the median values instead of the means, as they're more resistant to outliers.

The first observation from the plots is that while TS surpasses ε-greedy in terms of the mean, it slightly lags behind in terms of the median. However, its standard deviation is smaller. Particularly interesting is the reactivity bar plot, which shows how TS struggles to rapidly achieve a good average reward. At first, this was counterintuitive to me, but the mechanism behind TS in this scenario clarified matters. We previously mentioned that TS estimates purchase probabilities. Yet, decisions are made based on the product of those probabilities and the prices. Knowing the actual probabilities (which, as mentioned, are [0.60, 0.44, 0.31, 0.22, 0.15]) allows us to calculate the expected rewards TS is actually navigating: [12.06, 13.25, 12.56, 10.90, 8.93]. In essence, although the underlying probabilities differ considerably, the expected revenue values are relatively close from its perspective, especially near the optimal price. This means TS requires more time to discern the optimal arm. While TS remains the top-performing algorithm (and its median eventually drops below that of ε-greedy if the simulation is prolonged), it demands a longer period to identify the best strategy in this context. Below, the arm allocation pies show how TS and ε-greedy do pretty well in identifying the best arm (price = 30) and using it most of the time during the simulation.

Now let's get back to UCB1. Regret and reactivity confirm that it's basically acting as a fully exploitative algorithm: quick to reach a good level of average reward, but with large regret and high variability in the outcome. If we look at the arm allocations, this is even clearer. UCB1 is only slightly smarter than the Greedy approach, because it focuses more on the three arms with higher expected rewards (prices 20, 30, and 40). However, it essentially doesn't explore at all.

Enter hyperparameter tuning. It's clear that we need to determine the optimal value of the weight C that balances exploration and exploitation. The first step is to modify the UCB1 code.
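One way to sketch the change described below (the exact normalization scheme in the original code may differ):

```python
def ucb1(arm_avg_reward, arm_counts, t, C=1.0):
    if np.any(arm_counts == 0):
        return int(np.argmin(arm_counts))
    # Normalize the average rewards so the exploration weight C can be searched
    # in a consistent range (e.g. 0.5-1.5) regardless of the price scale
    max_avg = np.max(arm_avg_reward)
    norm_avg_reward = arm_avg_reward / max_avg if max_avg > 0 else arm_avg_reward
    bonus = C * np.sqrt(2 * np.log(t) / arm_counts)
    return int(np.argmax(norm_avg_reward + bonus))
```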

In this updated code, I've incorporated the option to normalize the average reward before adding the "uncertainty bonus", which is weighted by the hyperparameter C. The rationale for this is to allow a consistent search range for the best hyperparameter (say 0.5–1.5). Without this normalization, we could achieve similar results, but the search interval would need adjusting based on the range of values we're dealing with each time. I'll spare you the boredom of finding the best C value; it can easily be determined through a grid search. It turns out that the optimal value is 0.7. Now, let's rerun the simulation and examine the results.

That's quite the plot twist, isn't it? Now, UCB1 is clearly the best algorithm. Even in terms of reactivity, it has only slightly deteriorated compared to the previous ranking.

Moreover, from the perspective of arm allocation, UCB1 is now the undisputed leader.

  • Theory vs. Experience: Starting with book-based learning is an essential first step when delving into new topics. However, the earlier you immerse yourself in hands-on experience, the faster you'll transform information into knowledge. The nuances, subtleties, and corner cases you encounter when applying algorithms to real-world use cases will offer insights far beyond any data science book you might read.
  • Know Your Metrics and Benchmarks: If you can't measure what you're doing, you can't improve it. Never begin any implementation without understanding the metrics you intend to use. Had I only considered the regret curves, I might have concluded that "UCB1 doesn't work." By evaluating other metrics, especially arm allocation, it became evident that the algorithm simply wasn't exploring sufficiently.
  • No One-Size-Fits-All Solutions: While UCB1 emerged as the top choice in our analysis, that doesn't mean it's the universal solution for your dynamic pricing challenge. In this scenario, tuning was straightforward because we knew the optimal value we were seeking. In real life, situations are never so clear-cut. Do you possess enough domain knowledge, or the means to test and adjust the exploration factor of the UCB1 algorithm? Perhaps you'd lean toward a reliably effective option like ε-greedy that promises immediate results. Or maybe you're managing a bustling e-commerce platform, showing a product 10,000 times per hour, and you're willing to be patient, confident that Thompson Sampling will attain the maximum cumulative reward eventually. Yeah, life ain't easy.

Finally, let me say that if this analysis seemed daunting, unfortunately, it already represents a very simplified situation. In real-world dynamic pricing, prices and purchase probabilities don't exist in a vacuum: they live in ever-changing environments and are influenced by many factors. For instance, it's highly improbable that the purchase probability stays constant throughout the year, across all customer demographics and regions. In other words, to optimize pricing decisions, we must consider our customers' contexts. This consideration will be the focus of my next article, where I'll delve deeper into the problem by integrating customer information and discussing Contextual Bandits. So, stay tuned!
