Estimating Product-Level Price Elasticities Using Hierarchical Bayesian Modelling


In this article, I’ll introduce you to hierarchical Bayesian (HB) modelling, a versatile approach for automatically combining the results of multiple sub-models. This method enables estimation of individual-level effects by optimally pooling information across different groupings of data through Bayesian updating. It is especially valuable when individual units have limited observations but share common characteristics or behaviors with other units.

The following sections introduce the concept, the implementation, and alternative use cases for this method.

The Problem with Traditional Approaches

As an application, imagine that we’re a large food retailer trying to maximize product-level revenue by setting prices. We would want to estimate the demand curve (elasticity) for each product, then optimize some profit maximization function. As the first step of this workstream, we would want to estimate the price elasticity of demand (how responsive demand is to a 1% change in price) given some longitudinal data with $i \in N$ products over $t \in T$ periods. Recall that the price elasticity of demand is defined as:

$$\beta=\frac{\partial \log \textrm{Units}_{it}}{\partial \log \textrm{Price}_{it}}$$

Assuming no confounders, we can use a log-linear fixed-effects regression model to estimate our parameter of interest:

$$\log(\textrm{Units}_{it})= \beta \log(\textrm{Price})_{it} +\gamma_{c(i),t}+ \delta_i+ \epsilon_{it}$$

$\gamma_{c(i),t}$ is a set of category-by-time dummy variables that capture the average demand in each category-time period, and $\delta_i$ is a set of product dummies that capture the time-invariant demand shifter for each product. This “fixed-effects” formulation is standard in many regression-based models to control for unobserved confounders. This (pooled) regression model lets us recover the average elasticity $\beta$ across all $N$ units. It would mean that the store could target an average price level across all products in the store to maximize revenue:

$$\underset{\textrm{Price}_t}{\max} \;\;\; \textrm{Price}_{t}\cdot\mathbb{E}(\textrm{Quantity}_{t} \mid \textrm{Price}_{t}, \beta)$$
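In practice, this pooled regression can be estimated with any standard OLS routine. Below is a minimal sketch using statsmodels, assuming a long dataframe df with columns units_sold, price, product, and cat_by_time (the naming used for the synthetic data generated later in this article); with thousands of product dummies you would typically absorb the fixed effects with a specialized routine or fit on a subsample, as done here.

import numpy as np
import statsmodels.formula.api as smf

# Minimal sketch of the pooled fixed-effects regression (assumed column names; fit on a
# subsample of products because patsy builds dense dummy matrices).
sample = df[df['product'].isin(df['product'].unique()[:200])]
sample = sample[sample['units_sold'] > 0]  # guard against log(0)
pooled = smf.ols(
    "np.log(units_sold) ~ np.log(price) + C(cat_by_time) + C(product)",
    data=sample,
).fit()
print(pooled.params["np.log(price)"])  # pooled elasticity estimate (beta)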

If these units have a natural grouping (product categories), we could identify the average elasticity of each category by running separate regressions for each category using only units from that category (or by interacting the price elasticity with the product category). This would mean that the store could target average prices in each category to maximize category-specific revenue, such that:

$$\underset{\textrm{Price}_{c(i),t}}{\max} \;\;\; \textrm{Price}_{c(i),t}\cdot\mathbb{E}(\textrm{Quantity}_{c(i),t} \mid \textrm{Price}_{c(i),t}, \beta_{c(i)})$$

With sufficient data, we could even run separate regressions for each individual product to obtain more granular elasticities.
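For completeness, here is a minimal sketch of the per-product version (same assumed dataframe and column names as above); in practice many products have too little price variation for this to be well identified.

import numpy as np
import statsmodels.formula.api as smf

# Minimal sketch: one log-log regression per product (assumed column names).
def single_product_elasticity(group):
    group = group[group['units_sold'] > 0]  # guard against log(0)
    fit = smf.ols("np.log(units_sold) ~ np.log(price)", data=group).fit()
    return fit.params["np.log(price)"]

per_product_beta = df.groupby("product").apply(single_product_elasticity)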

However, real-world data often presents challenges: some products have minimal price variation, short sales histories, or class imbalance across categories. Under these restrictions, running separate regressions to identify product elasticities would likely result in large standard errors and weak identification of $\beta$. HB models address these issues by letting us obtain granular estimates of the coefficient of interest, sharing statistical strength across different groupings while preserving heterogeneity. With the HB formulation, it is possible to run one single regression (like the pooled case) while still recovering elasticities at the product level, allowing for granular optimizations.

Understanding Hierarchical Bayesian Models

At its core, HB is about recognizing the natural structure in our data. Rather than treating all observations as completely independent (many separate regressions) or forcing them to follow identical patterns (one pooled regression), we acknowledge that observations cluster into groups, with products within each group sharing similar patterns. The “hierarchical” aspect refers to how we organize our parameters at different levels. In its most basic form, we might have:

  • A global parameter that applies to all data.
  • Group-level parameters that apply to observations within each group.
  • Individual-level parameters that apply to each specific individual.

This setup is flexible enough to add or remove hierarchies as needed, depending on the desired level of pooling. For instance, if we think there are no similarities across categories, we can remove the global parameter. If we think these products have no natural groupings, we can remove the group-level parameters. If we only care about the group-level effect, we can remove the individual-level parameters and keep the group-level coefficients as our most granular parameters. If there are subgroups nested within the groups, we can add another hierarchical layer. The possibilities are limitless!

The “Bayesian” aspect refers to how we update our beliefs about these parameters based on observed data. We start with a prior distribution that represents our initial belief about the parameters, then update it with the data to recover a posterior distribution that incorporates the information in the observations. In practice, this means we use the global-level estimate to inform our group-level estimates, and the group-level parameters to inform the unit-level parameters. Units with a larger number of observations are allowed to deviate more from the group-level means, while units with a limited number of observations are pulled closer to those means.
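For intuition, consider a simplified setting (a sketch, not the exact model estimated below) in which unit $i$’s elasticity is estimated from $n_i$ observations with noise variance $\sigma_\epsilon^2$ and has a $\textrm{Normal}(\beta_{c(i)}, \sigma_i^2)$ prior. The posterior mean is then a precision-weighted average of the unit’s own estimate and its group mean:

$$\hat{\beta}_i^{\textrm{post}} = w_i\,\hat{\beta}_i^{\textrm{OLS}} + (1-w_i)\,\beta_{c(i)}, \qquad w_i=\frac{n_i/\sigma_\epsilon^2}{n_i/\sigma_\epsilon^2 + 1/\sigma_i^2}$$

As $n_i$ grows, $w_i \rightarrow 1$ and the unit keeps its own estimate; with sparse data, $w_i \rightarrow 0$ and the estimate shrinks toward the group mean.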

Let’s formalize this with our price elasticity example, where we (ideally) want to recover the unit-level price elasticity. We estimate:

$$\log(\textrm{Units}_{it})= \beta_i \log(\textrm{Price})_{it} +\gamma_{c(i),t} + \delta_i + \epsilon_{it}$$

Where:

  • $\beta_i \sim \textrm{Normal}(\beta_{c(i)},\sigma_i)$
  • $\beta_{c(i)}\sim \textrm{Normal}(\beta_g,\sigma_{c(i)})$
  • $\beta_g\sim \textrm{Normal}(\mu,\sigma)$

The only difference from the first equation is that we replace the global $\beta$ term with product-level betas $\beta_i$. We specify that the unit-level elasticity $\beta_i$ is drawn from a normal distribution centered around the category-level elasticity $\beta_{c(i)}$, which in turn is drawn from a shared global elasticity $\beta_g$ for all groups. For the spread parameters $\sigma$, we could assume a hierarchical structure as well, but in this example we just set basic priors for them to keep things simple. For this application, we assume a prior belief of $\{\mu= -2,\ \sigma= 1,\ \sigma_{c(i)}=1,\ \sigma_i=1\}$. This formulation of the prior assumes that the global elasticity is elastic, that 95% of the global elasticities fall between -4 and 0, and that the standard deviation is 1 at each hierarchical level. To check whether these priors are sensibly specified, we would run prior predictive checks (not covered in this article) to see whether our prior beliefs can recover the data that we observe.
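As a quick sanity check on these priors (a lightweight stand-in for a full prior predictive check, using plain NumPy rather than the model defined below), we can draw product-level elasticities implied by the hierarchy and inspect their spread:

import numpy as np

# Draw elasticities implied by the priors stated above: mu=-2, sigma=1, sigma_c=1, sigma_i=1.
rng = np.random.default_rng(0)
beta_g = rng.normal(-2.0, 1.0, size=10_000)    # global elasticity draws
beta_c = rng.normal(beta_g, 1.0)               # one category draw per global draw
beta_i = rng.normal(beta_c, 1.0)               # one product draw per category draw
print(np.percentile(beta_i, [2.5, 50, 97.5]))  # implied 95% interval for product-level elasticities

Stacking three unit-variance levels gives product-level draws with variance 3, so the implied 95% interval is roughly $-2 \pm 1.96\sqrt{3} \approx (-5.4, 1.4)$, noticeably wider than the range quoted for the global level; this is exactly the kind of implication a prior predictive check is meant to surface.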

This hierarchical structure allows information to flow between products in the same category and across categories. If a particular product has limited price variation, its elasticity will be pulled toward the category elasticity $\beta_{c(i)}$. Similarly, categories with fewer products will be influenced more by the global elasticity, which derives its mean from all category elasticities. The beauty of this approach is that the degree of “pooling” happens automatically based on the data. Products with a lot of price variation will keep estimates closer to their individual data patterns, while those with sparse data will borrow more strength from their group.

Implementation

In this section, we implement the above model using the NumPyro package in Python, a lightweight probabilistic programming library powered by JAX for autograd and JIT compilation to GPU/TPU/CPU. We start by generating our synthetic data, defining the model, and fitting the model to the data. We close out with some visualizations of the results.

Data Generating Process

We simulate sales data where demand follows a log-linear relationship with price and the product-level elasticity is generated from a Gaussian distribution $\beta_i \sim \textrm{Normal}(-2, 0.7)$. We add a random price change in each time period with a 50% probability, category-specific time trends, and random noise. These components combine multiplicatively into a base demand, whose log (together with the price term) gives the log expected demand. We exponentiate the log expected demand to get the rate parameter, and draw realized units sold from a Poisson distribution. We then filter to keep only products with more than 100 total units sold (this helps the accuracy of the estimates but is not a necessary step), and are left with $N=11,798$ products over $T = 156$ periods (weekly data for 3 years). In this dataset, the true global elasticity is $\beta_g = -1.6$, with category-level elasticities ranging over $\beta_{c(i)} \in [-1.68, -1.48]$.

Keep in mind that this DGP ignores many real-world intricacies. We don’t model any factors that could jointly affect both prices and demand (such as promotions), and we don’t model any confounders. This example is only meant to show that we can recover the product-specific elasticities under a well-specified model, and doesn’t aim to cover how to properly establish that price is exogenous. I suggest that interested readers refer to Causal Inference for the Brave and True for an introduction to causal inference.


import numpy as np
import pandas as pd

def generate_price_elasticity_data(N: int = 1000,
                                   C: int = 10,
                                   T: int = 50,
                                   price_change_prob: float = 0.2,
                                   seed = 42) -> pd.DataFrame:
    """
    Generate synthetic data for price elasticity of demand evaluation.
    Data is generated by
    """
    if seed just isn't None:
        np.random.seed(seed)
    
    # Category demand and trends
    category_base_demand = np.random.uniform(1000, 10000, C)
    category_time_trends = np.random.uniform(0, 0.01, C)
    category_volatility = np.random.uniform(0.01, 0.05, C)  # Random volatility for each category
    category_demand_paths = np.zeros((C, T))
    category_demand_paths[:, 0] = 1.0
    shocks = np.random.normal(0, 1, (C, T-1)) * category_volatility[:, np.newaxis]
    trends = category_time_trends[:, np.newaxis] * np.ones((C, T-1))
    cumulative_effects = np.cumsum(trends + shocks, axis=1)
    category_demand_paths[:, 1:] = category_demand_paths[:, 0:1] + cumulative_effects

    # product effects
    product_categories = np.random.randint(0, C, N)
    product_a = np.random.normal(-2, .7, size=N)
    product_a = np.clip(product_a, -5, -.1)
    
    # Initial prices for every product
    initial_prices = np.random.uniform(100, 1000, N)
    prices = np.zeros((N, T))
    prices[:, 0] = initial_prices
    
    # Generate random values and whether prices modified
    random_values = np.random.rand(N, T-1)
    change_mask = random_values < price_change_prob
    
    # Generate change factors (-20% to +20%)
    change_factors = 1 + np.random.uniform(-0.2, 0.2, size=(N, T-1))
    
    # Create a matrix to hold multipliers
    multipliers = np.ones((N, T-1))
    
    # Apply change factors only where changes should occur
    multipliers[change_mask] = change_factors[change_mask]
    
    # Apply the changes cumulatively to propagate prices
    for t in range(1, T):
        prices[:, t] = prices[:, t-1] * multipliers[:, t-1]
    
    # Generate product-specific multipliers
    product_multipliers = np.random.lognormal(3, 0.5, size=N)
    # Get time effects for each product's category (shape: N x T)
    time_effects = category_demand_paths[product_categories][:, np.newaxis, :].squeeze(1)
    
    # Ensure time effects don't go negative
    time_effects = np.maximum(0.1, time_effects)
    
    # Generate period noise for all products and time periods
    period_noise = 1 + np.random.uniform(-0.05, 0.05, size=(N, T))
    
    # Get category base demand for each product
    category_base = category_base_demand[product_categories]
    
    # Calculate base demand
    base_demand = (category_base[:, np.newaxis] *
                   product_multipliers[:, np.newaxis] *
                   time_effects *
                   period_noise)

    # log demand
    alpha_ijt = np.log(base_demand)

    # log price
    log_prices = np.log(prices)

    # log expected demand
    log_lambda = alpha_ijt + product_a[:, np.newaxis] * log_prices

    # Convert back from log space to get rate parameters
    lambda_vals = np.exp(log_lambda)

    # Generate units sold
    units_sold = np.random.poisson(lambda_vals)  # Shape: (N, T)
    
    # Create index arrays for all combinations of products and time periods
    product_indices, time_indices = np.meshgrid(np.arange(N), np.arange(T), indexing='ij')
    product_indices = product_indices.flatten()
    time_indices = time_indices.flatten()
    
    # Get categories for all products
    categories = product_categories[product_indices]
    
    # Get all prices and units sold
    all_prices = np.round(prices[product_indices, time_indices], 2)
    all_units_sold = units_sold[product_indices, time_indices]
    
    # Calculate elasticities
    product_elasticity = product_a[product_indices]

    df = pd.DataFrame({
        'product': product_indices,
        'category': categories,
        'time_period': time_indices,
        'price': all_prices,
        'units_sold': all_units_sold,
        'product_elasticity': product_elasticity
    })
    
    return df

# Keep only units with >X sales
def filter_dataframe(df, min_units = 100):
    temp = df[['product','units_sold']].groupby('product').sum().reset_index()
    unit_filter = temp[temp.units_sold>min_units]['product'].unique()
    filtered_df = df[df['product'].isin(unit_filter)].copy()

    # Summary
    original_product_count = df['product'].nunique()
    remaining_product_count = filtered_df['product'].nunique()
    filtered_out = original_product_count - remaining_product_count
    
    print(f"Filtering summary:")
    print(f"- Original variety of products: {original_product_count}")
    print(f"- Products with > {min_units} units: {remaining_product_count}")
    print(f"- Products filtered out: {filtered_out} ({filtered_out/original_product_count:.1%})")

    # Global and category elasticity
    global_elasticity = filtered_df['product_elasticity'].mean()
    filtered_df['global_elasticity'] = global_elasticity

    # Category elasticity
    category_elasticities = filtered_df.groupby('category')['product_elasticity'].mean().reset_index()
    category_elasticities.columns = ['category', 'category_elasticity']
    filtered_df = filtered_df.merge(category_elasticities, on='category', how='left')

    # Summary
    print(f"nElasticity Information:")
    print(f"- Global elasticity: {global_elasticity:.3f}")
    print(f"- Category elasticities range: {category_elasticities['category_elasticity'].min():.3f} to {category_elasticities['category_elasticity'].max():.3f}")
    
    return filtered_df

df = generate_price_elasticity_data(N = 20000, T = 156, price_change_prob=.5, seed=42)
df = filter_dataframe(df)
df.loc[:,'cat_by_time'] = df['category'].astype(str) + '-' + df['time_period'].astype(str)
df.head()

Filtering summary:

  • Original number of products: 20000
  • Products with > 100 units: 11798
  • Products filtered out: 8202 (41.0%)

Elasticity Information:

  • Global elasticity: -1.598
  • Category elasticities range: -1.681 to -1.482
product category time_period price units_sold product_elasticity category_elasticity global_elasticity cat_by_time
0 8 0 125.95 550 -1.185907 -1.63475 -1.597683 8-0
0 8 1 125.95 504 -1.185907 -1.63475 -1.597683 8-1
0 8 2 149.59 388 -1.185907 -1.63475 -1.597683 8-2
0 8 3 149.59 349 -1.185907 -1.63475 -1.597683 8-3
0 8 4 176.56 287 -1.185907 -1.63475 -1.597683 8-4

Model

We start by creating indices for products, categories, and category-time combinations using pd.factorize(). This allows us to select the right parameter for each observation. We then convert the price (logged) and units series into JAX arrays, and create plates that correspond to each of our parameter groups. These plates store the parameter values for each level of the hierarchy, together with the parameters representing the fixed effects.

The model uses NumPyro’s plate to define the parameter groups:

  • global_a: 1 global price elasticity parameter with a $\textrm{Normal}(-2, 1)$ prior.
  • category_a: $C=10$ category-level elasticities with priors centered on the global parameter and a standard deviation of 1.
  • product_a: $N=11,798$ product-specific elasticities with priors centered on their respective category parameters and a standard deviation of 1.
  • product_effect: $N=11,798$ product-specific baseline demand effects with a standard deviation of 3.
  • time_cat_effects: $(T=156)\cdot(C=10)$ time-varying effects specific to each category-time combination with a standard deviation of 3.

We then reparameterize the parameters using LocScaleReparam() to improve sampling efficiency and avoid funnel geometries. After creating the parameters, we calculate the log expected demand, then convert it back to a rate parameter, clipping for numerical stability. Finally, within the data plate, we sample from a Poisson distribution with the calculated rate parameter. The optimization algorithm will then find the values of the parameters that best fit the data using stochastic gradient descent. Below is a graphical representation of the model to show the relationship between the parameters.


import jax
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer.reparam import LocScaleReparam

def model(df: pd.DataFrame, outcome: pd.Series = None):
    # Define indexes
    product_idx, unique_product = pd.factorize(df['product'])
    cat_idx, unique_category = pd.factorize(df['category'])
    time_cat_idx, unique_time_cat = pd.factorize(df['cat_by_time'])

    # Convert the price and units series to JAX arrays
    log_price = jnp.log(df.price.values)
    outcome = jnp.array(outcome) if outcome is not None else None

    # Generate mapping
    product_to_category = jnp.array(pd.DataFrame({'product': product_idx, 'category': cat_idx}).drop_duplicates().category.values, dtype=np.int16)

    # Create the plates to store parameters
    category_plate = numpyro.plate("category", unique_category.shape[0])
    time_cat_plate = numpyro.plate("time_cat", unique_time_cat.shape[0])
    product_plate = numpyro.plate("product", unique_product.shape[0])
    data_plate = numpyro.plate("data", size=len(df))  # use the dataframe length so the model also runs without an outcome

    # DEFINING MODEL PARAMETERS
    global_a = numpyro.sample("global_a", dist.Normal(-2, 1), infer={"reparam": LocScaleReparam()})

    with category_plate:
        category_a = numpyro.sample("category_a", dist.Normal(global_a, 1), infer={"reparam": LocScaleReparam()})

    with product_plate:
        product_a = numpyro.sample("product_a", dist.Normal(category_a[product_to_category], 2), infer={"reparam": LocScaleReparam()})
        product_effect = numpyro.sample("product_effect", dist.Normal(0, 3))

    with time_cat_plate:
        time_cat_effects = numpyro.sample("time_cat_effects", dist.Normal(0, 3))

    # Calculating expected demand
    def calculate_demand():
        log_demand = product_a[product_idx]*log_price + time_cat_effects[time_cat_idx] + product_effect[product_idx]
        expected_demand = jnp.exp(jnp.clip(log_demand, -4, 20)) # clip for stability 
        return expected_demand

    demand = calculate_demand()

    with data_plate:
        numpyro.sample(
            "obs",
            dist.Poisson(demand),
            obs=outcome
        )
    
numpyro.render_model(
    model=model,
    model_kwargs={"df": df,"consequence": df['units_sold']},
    render_distributions=True,
    render_params=True,
)

Estimation

While there are multiple ways to estimate this model, we use Stochastic Variational Inference (SVI) for this particular application. As an overview, SVI is a gradient-based optimization method that minimizes the KL divergence between a proposed posterior distribution and the true posterior by maximizing the ELBO (the loss reported by NumPyro is the negative ELBO). This is a different estimation technique from Markov chain Monte Carlo (MCMC), which samples directly from the true posterior distribution. In real-world applications, SVI is more efficient and scales easily to large datasets. For this application, we set a random seed, initialize the guide (the family of posterior distributions, assumed to be a diagonal Normal), define the learning rate schedule and optimizer in Optax, and run the optimization for 1,000,000 iterations (takes ~1 hour). While the model may have converged much earlier, the loss still improves by a small amount even after running the optimization for 1,000,000 iterations. Finally, we plot the (log) losses.
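Concretely, for a variational family $q_\phi(\theta)$ over the latent parameters $\theta$, the ELBO can be written as

$$\textrm{ELBO}(\phi)=\mathbb{E}_{q_\phi(\theta)}\left[\log p(y,\theta)-\log q_\phi(\theta)\right]=\log p(y)-\textrm{KL}\left(q_\phi(\theta)\,\|\,p(\theta \mid y)\right)$$

so maximizing the ELBO over $\phi$ is equivalent to minimizing the KL divergence between the guide and the true posterior.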


from numpyro.infer import SVI, Trace_ELBO, autoguide, init_to_sample
import optax
import matplotlib.pyplot as plt

rng_key = jax.random.PRNGKey(42)
guide = autoguide.AutoNormal(model, init_loc_fn=init_to_sample)
# Define a learning rate schedule
learning_rate_schedule = optax.exponential_decay(
    init_value=0.01,
    transition_steps=1000,
    decay_rate=0.99,
    staircase = False,
    end_value = 1e-5,
)

# Define the optimizer
optimizer = optax.adamw(learning_rate=learning_rate_schedule)
svi = SVI(model, guide, optimizer, loss=Trace_ELBO(num_particles=8, vectorize_particles = True))

# Run SVI
svi_result = svi.run(rng_key, 1_000_000, df, df['units_sold'])
plt.semilogy(svi_result.losses);

Recovering Posterior Samples

Once the model has been trained, we can recover the posterior distribution of the parameters by feeding in the resulting parameters and the initial dataset. We cannot read the parameters off svi_result.params directly, since NumPyro applies an affine transformation on the back end for non-Normal distributions. Therefore, we sample 1,000 times from the posterior distribution and calculate the mean and standard deviation of each parameter in our model. The final part of the following code creates a dataframe with the estimated elasticity for each product at each hierarchical level, which we then join back to our original dataframe to check whether the algorithm recovers the true elasticities.


predictive = numpyro.infer.Predictive(
    autoguide.AutoNormal(model, init_loc_fn=init_to_sample),
    params=svi_result.params,
    num_samples=1000
)

samples = predictive(rng_key, df, df['units_sold'])

# Extract means and std dev
results = {}
excluded_keys = ['product_effect', 'time_cat_effects']
for k, v in samples.items():
    if k not in excluded_keys:
        results[f"{k}"] = np.mean(v, axis=0)
        results[f"{k}_std"] = np.std(v, axis=0)

# product elasticity estimates
prod_elasticity_df = pd.DataFrame({
    'product': df['product'].unique(),
    'product_elasticity_svi': results['product_a'],
    'product_elasticity_svi_std': results['product_a_std'],
})
result_df = df.merge(prod_elasticity_df, on='product', how='left')

# Category elasticity estimates
prod_elasticity_df = pd.DataFrame({
    'category': df['category'].unique(),
    'category_elasticity_svi': results['category_a'],
    'category_elasticity_svi_std': results['category_a_std'],
})
result_df = result_df.merge(prod_elasticity_df, on='category', how='left')

# Global elasticity estimates
result_df['global_a_svi'] = results['global_a']
result_df['global_a_svi_std'] = results['global_a_std']
result_df.head()
product category time_period price units_sold product_elasticity category_elasticity global_elasticity cat_by_time product_elasticity_svi product_elasticity_svi_std category_elasticity_svi category_elasticity_svi_std global_a_svi global_a_svi_std
0 8 0 125.95 550 -1.185907 -1.63475 -1.597683 8-0 -1.180956 0.000809 -1.559872 0.027621 -1.5550271 0.2952548
0 8 1 125.95 504 -1.185907 -1.63475 -1.597683 8-1 -1.180956 0.000809 -1.559872 0.027621 -1.5550271 0.2952548
0 8 2 149.59 388 -1.185907 -1.63475 -1.597683 8-2 -1.180956 0.000809 -1.559872 0.027621 -1.5550271 0.2952548
0 8 3 149.59 349 -1.185907 -1.63475 -1.597683 8-3 -1.180956 0.000809 -1.559872 0.027621 -1.5550271 0.2952548
0 8 4 176.56 287 -1.185907 -1.63475 -1.597683 8-4 -1.180956 0.000809 -1.559872 0.027621 -1.5550271 0.2952548

Results

The following code plots the true and estimated elasticities for each product. Each point is ranked by its true elasticity value (black), and the estimated elasticity from the model is also shown. We can see that the estimated elasticities follow the path of the true elasticities, with a mean absolute error (MAE) of around 0.0724. Points in red represent products whose 95% CI does not contain the true elasticity, while points in blue represent products whose 95% CI contains the true elasticity. Given that the global mean is -1.598, this represents an average error of 4.5% at the product level. The SVI estimates closely follow the pattern of the true elasticities but with some noise, particularly as the elasticities become increasingly negative. In the top right panel, we plot the relationship between the error of the estimated elasticities and the true elasticity values. As true elasticities become increasingly negative, our model becomes less accurate.

For the category-level and global-level elasticities, we can construct the posteriors using two methods. We can either bootstrap all product-level elasticities within the category, or we can get the category-level estimates directly from the posterior parameters. When we look at the category-level elasticity estimates in the bottom left, we see that both the category-level estimates recovered from the model and the bootstrapped samples from the product-level elasticities are slightly biased towards zero, with an MAE of ~0.033. However, the confidence interval given by the category-level parameter covers the true parameter, unlike the bootstrapped product-level estimates. This suggests that when determining group-level elasticities, we should use the group-level parameters directly instead of bootstrapping the more granular estimates. At the global level, both methods contain the true parameter within the 95% confidence bounds, with the global parameter outperforming the product-level bootstrap at the cost of larger standard errors.

Considerations

  1. SVI underestimates posterior variance: One drawback of using SVI for the estimation is that it underestimates the posterior variance. While we’ll cover this topic in detail in a later article, the objective function for SVI only accounts for the difference in expectation between our posited distribution and the true distribution. This means it does not consider the full correlation structure between parameters in the posterior. The mean-field approximation commonly used in SVI assumes conditional (on the previous hierarchy’s draw) independence between parameters, which ignores any covariances between products within the same hierarchy. This means that if there are any spillover effects (such as cannibalization or cross-price elasticity), they will not be accounted for in the confidence bounds. Because of this mean-field assumption, the uncertainty estimates tend to be overly confident, leading to confidence intervals that are too narrow and fail to capture the true parameter values at the expected rate. We can see in the top left figure that only 9.7% of the product-level elasticities cover their true elasticity. In a later post, we’ll cover some solutions to this problem; a sketch of one partial remedy follows this list.
  2. Importance of priors: When using HB, priors matter significantly more than in standard Bayesian approaches. While large datasets typically allow the likelihood to dominate the priors when estimating global parameters, hierarchical structures change this dynamic and reduce the effective sample sizes at each level. In our model, the global parameter only sees 10 category-level observations (not the full dataset), categories only draw from the products they contain, and products rely solely on their own observations. This reduced effective sample size causes shrinkage, where outlier estimates (like very negative elasticities) get pulled toward their category means. This highlights the importance of prior predictive checks, since misspecified priors can have an outsized influence on the results.
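To probe the first consideration, one partial remedy (a sketch, not part of the original analysis) is to replace the mean-field AutoNormal guide with a guide that models correlations between the latent parameters, such as NumPyro’s AutoLowRankMultivariateNormal; with tens of thousands of latent parameters a full-rank guide is impractical, so a low-rank approximation is a common compromise.

from numpyro.infer import SVI, Trace_ELBO, autoguide, init_to_sample

# Sketch: a low-rank multivariate normal guide captures some posterior correlations
# that the mean-field AutoNormal guide ignores (rank=10 is an illustrative choice).
corr_guide = autoguide.AutoLowRankMultivariateNormal(model, init_loc_fn=init_to_sample, rank=10)
svi_corr = SVI(model, corr_guide, optimizer, loss=Trace_ELBO(num_particles=8, vectorize_particles=True))
# svi_corr_result = svi_corr.run(rng_key, 1_000_000, df, df['units_sold'])  # same call pattern as before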

def elasticity_plots(result_df, results=None):
    # Create the figure with 2x2 grid
    fig = plt.figure(figsize=(12, 10))
    gs = fig.add_gridspec(2, 2)
    
    # product elasticity
    ax1 = fig.add_subplot(gs[0, 0])
    
    # Data prep
    df_product = result_df[['product','product_elasticity','product_elasticity_svi','product_elasticity_svi_std']].drop_duplicates()
    df_product['product_elasticity_svi_lb'] = df_product['product_elasticity_svi'] - 1.96*df_product['product_elasticity_svi_std']
    df_product['product_elasticity_svi_ub'] = df_product['product_elasticity_svi'] + 1.96*df_product['product_elasticity_svi_std']
    df_product = df_product.sort_values('product_elasticity')
    mae_product = np.mean(np.abs(df_product.product_elasticity-df_product.product_elasticity_svi))
    colours = []
    for i, row in df_product.iterrows():
        if (row['product_elasticity'] >= row['product_elasticity_svi_lb'] and 
            row['product_elasticity'] <= row['product_elasticity_svi_ub']):
            colours.append('blue')  # Inside CI bounds
        else:
            colours.append('red')   # Outside CI bounds
    
    # Percentage of points inside bounds
    within_bounds_pct = colours.count('blue') / len(colours) * 100
    
    # Plot data
    ax1.scatter(range(len(df_product)), df_product['product_elasticity'], 
                color='black', label='True Elasticity', s=20, zorder=3)
    
    ax1.scatter(range(len(df_product)), df_product['product_elasticity_svi'], 
                color=colours, label=f'SVI Estimate (MAE: {mae_product:.4f}, Coverage: {within_bounds_pct:.1f}%)', 
                s=3, zorder=2)
    ax1.set_xlabel('Product Index (sorted by true elasticity)')
    ax1.set_ylabel('Elasticity Value')
    ax1.set_title('SVI Estimates of True Product Elasticities')
    ax1.legend()
    ax1.grid(alpha=0.3)
    
    # Relationship between MAE and true elasticity
    ax2 = fig.add_subplot(gs[0, 1])
    
    # Calculate the error for each product (reuse df_product so the point order matches the colour list)
    temp = df_product.copy()
    temp['product_error'] = temp['product_elasticity'] - temp['product_elasticity_svi']
    temp['product_mae'] = np.abs(temp['product_error'])
    correlation = temp[['product_mae', 'product_elasticity']].corr()
    
    # Plot data
    ax2.scatter(temp['product_elasticity'], temp['product_error'], alpha=0.5, s=5, color = colours)
    ax2.set_xlabel('True Elasticity')
    ax2.set_ylabel('Error (True - Estimated)')
    ax2.set_title('Relationship Between True Elasticity and Estimation Accuracy')
    ax2.grid(alpha=0.3)
    ax2.text(0.5, 0.95, f"Correlation: {correlation.iloc[0,1]:.3f}", 
                transform=ax2.transAxes, ha='center', va='top',
                bbox=dict(boxstyle='round', facecolor='white', alpha=0.7))
    
    # Category Elasticity
    ax3 = fig.add_subplot(gs[1, 0])
    
    # Unique categories and elasticities
    category_data = result_df[['category', 'category_elasticity', 'category_elasticity_svi', 'category_elasticity_svi_std']].drop_duplicates()
    category_data = category_data.sort_values('category_elasticity')
    
    # Bootstrapped means from product elasticities within each category
    bootstrap_means = []
    bootstrap_ci_lower = []
    bootstrap_ci_upper = []
    
    for cat in category_data['category']:
        # Get product elasticities for this category
        prod_elasticities = result_df[result_df['category'] == cat]['product_elasticity_svi'].unique()
        
        # Bootstrap means
        boot_means = [np.mean(np.random.choice(prod_elasticities, size=len(prod_elasticities), replace=True)) 
                     for _ in range(1000)]
        
        bootstrap_means.append(np.mean(boot_means))
        bootstrap_ci_lower.append(np.percentile(boot_means, 2.5))
        bootstrap_ci_upper.append(np.percentile(boot_means, 97.5))
    
    category_data['bootstrap_mean'] = bootstrap_means
    category_data['bootstrap_ci_lower'] = bootstrap_ci_lower
    category_data['bootstrap_ci_upper'] = bootstrap_ci_upper
    
    # Calculate MAE
    mae_category_svi = np.mean(np.abs(category_data['category_elasticity_svi'] - category_data['category_elasticity']))
    mae_bootstrap = np.mean(np.abs(category_data['bootstrap_mean'] - category_data['category_elasticity']))
        
    # Plot the data
    left_offset = -0.2
    right_offset = 0.2
    x_range = range(len(category_data))
    ax3.scatter(x_range, category_data['category_elasticity'], 
                color='black', label='True Elasticity', s=50, zorder=3)
    
    # Bootstrapped product elasticity
    ax3.scatter([x + left_offset for x in x_range], category_data['bootstrap_mean'], 
                color='green', label=f'Bootstrapped Product Estimate (MAE: {mae_bootstrap:.4f})', s=30, zorder=2)
    for i in x_range:
        ax3.plot([i + left_offset, i + left_offset], 
                [category_data['bootstrap_ci_lower'].iloc[i], category_data['bootstrap_ci_upper'].iloc[i]], 
                color='green', alpha=0.3, zorder=1)
    
    # category-level SVI estimates
    ax3.scatter([x + right_offset for x in x_range], category_data['category_elasticity_svi'], 
                color='blue', label=f'Category SVI Estimate (MAE: {mae_category_svi:.4f})', s=30, zorder=2)
    for i in x_range:
        ci_lower = category_data['category_elasticity_svi'].iloc[i] - 1.96 * category_data['category_elasticity_svi_std'].iloc[i]
        ci_upper = category_data['category_elasticity_svi'].iloc[i] + 1.96 * category_data['category_elasticity_svi_std'].iloc[i]
        ax3.plot([i + right_offset, i + right_offset], [ci_lower, ci_upper], color='blue', alpha=0.3, zorder=1)

    ax3.set_xlabel('Category Index (sorted by true elasticity)')
    ax3.set_ylabel('Elasticity')
    ax3.set_title('Comparison with True Category Elasticity')
    ax3.legend()
    ax3.grid(alpha=0.3)

    # global elasticity
    ax4 = fig.add_subplot(gs[1, 1])
    temp = result_df[['product','product_elasticity_svi','global_elasticity']].drop_duplicates()
    bootstrap_means = [np.mean(np.random.choice(np.array(temp['product_elasticity_svi']), 100)) for i in range(10000)]
    global_means = np.random.normal(result_df['global_a_svi'].iloc[0], result_df['global_a_svi_std'].iloc[0], 10000)
    true_global = np.unique(temp.global_elasticity)[0]
    p_mae = np.abs(np.mean(bootstrap_means) - true_global)
    g_mae = np.abs(np.mean(global_means) - true_global)
    
    # Plot data
    ax4.hist(bootstrap_means, bins=100, alpha=0.3, density=True, 
             label=f'Product Elasticity Bootstrap (MAE: {p_mae:.4f})')
    ax4.hist(global_means, bins=100, alpha=0.3, density=True, 
             label=f'Global Elasticity Distribution (MAE: {g_mae:.4f})')
    ax4.axvline(x=true_global, color='black', linestyle='--', linewidth=2, 
                label=f'Global Elasticity: {true_global:.3f}', zorder=0)
    ax4.set_xlabel('Elasticity')
    ax4.set_ylabel('Frequency')
    ax4.set_title('Global Elasticity Comparison')
    ax4.legend()
    ax4.grid(True, linestyle='--', alpha=0.7)

    # Show
    plt.tight_layout()
    plt.show()

elasticity_plots(result_df)

Conclusion

Alternate Uses: Apart from estimating the price elasticity of demand, HB models have a wide range of other uses in data science. In retail, HB models can forecast demand for existing stores and solve the cold-start problem for new stores by borrowing information from stores/networks that are already established and clustered within the same hierarchy. For recommendation systems, HB can estimate user-level preferences from a combination of user- and item-level characteristics. This structure enables relevant recommendations for new users based on cohort behaviors, gradually transitioning to individualized recommendations as user history accumulates. If no cohort groupings are readily available, K-means can be used to group similar units based on their characteristics, as sketched below.
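As a minimal illustration of that last point (a sketch, with X an assumed array of unit-level characteristics whose row order matches the unit ids), scikit-learn's KMeans can supply the grouping that the hierarchy then pools over:

from sklearn.cluster import KMeans

# Sketch: create a synthetic "category" from unit characteristics when no natural grouping exists.
# X is an assumed (n_units, n_features) array whose row order matches the product ids in df.
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(X)
df["category"] = kmeans.labels_[df["product"].values]  # cluster labels become the group level of the hierarchy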

Finally, these models can also be used to combine results from experimental and observational studies. Scientists can take historical observational uplift estimates (e.g., ads uplift) and complement them with newly developed A/B tests to reduce the required sample size for experiments by incorporating prior knowledge. This approach creates a continuous learning framework where each new experiment builds upon previous findings rather than starting from scratch. For teams facing resource constraints, this means faster time-to-insight (especially when combined with surrogate models) and more efficient experimentation pipelines.
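As a toy illustration of that idea (a sketch under a normal-normal assumption, with purely illustrative numbers, not results from this article), an observational uplift estimate can act as the prior that a new A/B test then updates via precision weighting:

# Sketch: combine an observational prior with A/B test results via a normal-normal update.
prior_mean, prior_var = 0.05, 0.02**2   # historical (observational) uplift estimate and its variance
exp_mean, exp_var = 0.08, 0.03**2       # uplift estimate from a new A/B test and its variance

post_var = 1.0 / (1.0 / prior_var + 1.0 / exp_var)
post_mean = post_var * (prior_mean / prior_var + exp_mean / exp_var)
print(post_mean, post_var**0.5)         # pooled estimate with a smaller standard error than either input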

Final Remarks: While this introduction has highlighted several applications of hierarchical Bayesian models, we’ve only scratched the surface. We haven’t gone deep into granular implementation aspects such as prior and posterior predictive checks, formal goodness-of-fit assessments, computational scaling, distributed training, the performance of estimation strategies (MCMC vs. SVI), and non-nested hierarchical structures, each of which deserves its own post.

However, this overview should provide a practical starting point for incorporating hierarchical Bayesian models into your toolkit. These models offer a framework for handling the (usually) messy, multi-level data structures that are common in real-world business problems. As you start implementing these approaches, I’d love to hear about your experiences, challenges, successes, and new use cases for this class of model, so please reach out with questions, insights, or examples through my email or LinkedIn. If you have any feedback on this article, or would like to request another topic in causal inference/machine learning, please also feel free to reach out. Thanks for reading!

 

Note: All images used in this article were generated by the author.
