Two-Stage Hurdle Models: Predicting Zero-Inflated Outcomes


Introduction

In applied work, we repeatedly encounter prediction problems where the outcome has an unusual distribution: a big mass of zeros combined with a continuous or count distribution for positive values. If you've worked in any customer-facing domain, you've almost certainly run into this. Consider predicting customer spending. In any given week, the overwhelming majority of users on your platform don't purchase anything at all, but the ones who do might spend anywhere from $5 to $5,000. Insurance claims follow a similar pattern: most policyholders don't file anything in a given quarter, but the claims that do come in vary enormously in size. You see the same structure in loan prepayments, employee turnover timing, ad click revenue, and countless other business outcomes.

The instinct for many teams is to reach for a standard regression model and try to make it work. I've seen this play out multiple times. Someone fits an OLS model, gets negative predictions for half the customer base, adds a floor at zero, and calls it a day. Or they try a log-transform, run into the $\log(0)$ problem, tack on a $+1$ offset, and hope for the best. These workarounds might work, but they gloss over a fundamental issue: the zeros and the positive values in your data are often generated by completely different processes. A customer who will never buy your product is fundamentally different from a customer who buys occasionally but happened not to this week. Treating them the same way in a single model forces the algorithm to compromise on both groups, and it usually does a poor job on both.

The two-stage hurdle model provides a more principled solution by decomposing the problem into two distinct questions.
First, will the outcome be zero or positive?
And second, given that it's positive, what will the value be?
By separating the "if" from the "how much," we can use the right tools on each sub-problem independently, with different algorithms, different features, and different assumptions, then combine the results into a single prediction.

In this article, I'll walk through the idea behind hurdle models, provide a working Python implementation, and discuss the practical considerations that matter when deploying these models in production.
Readers who are already familiar with the motivation can skip straight to the implementation section.

The Problem with Standard Approaches

Why Not Just Use Linear Regression? To make this concrete, consider predicting customer spend.
If 80% of customers spend zero and the remaining 20% spend between 10 and 1000 dollars, a linear regression model immediately runs into trouble.
The model can (and will) predict negative spend for some customers, which is nonsensical since you can't spend negative dollars.
It will also struggle at the boundary: the large spike at zero pulls the regression line down, causing the model to underpredict zeros and overpredict small positive values simultaneously.
The variance structure is also wrong.
Customers who spend nothing have zero variance by definition, while customers who do spend have high variance.
While you can use heteroskedasticity-robust standard errors to get valid inference despite non-constant variance, that only fixes the standard errors and doesn't fix the predictions themselves.
The fitted values still come from a linear model that is trying to average over a spike at zero and a right-skewed positive distribution, which is a poor fit no matter how you compute the confidence intervals.

Why Not Log-Transform? The next thing most people try is a log-transform: $\log(y + 1)$ or $\log(y + \epsilon)$.
This compresses the right tail and makes the positive values look more normal, but it introduces its own set of problems.
The choice of offset ($1$ or $\epsilon$) is arbitrary, and your predictions will change depending on what you pick.
When you back-transform via $\exp(\hat{y}) - 1$, you introduce a systematic bias due to Jensen's inequality, since the expected value of the exponentiated prediction is not the same as the exponentiation of the expected prediction.
More fundamentally, the model still doesn't distinguish between a customer who never spends and one who sometimes spends but happened to be zero this period.
Both get mapped to $\log(0 + 1) = 0$, and the model treats them identically even though they represent very different customer behaviors.

What This Means for Forecasting. The deeper issue with forcing a single model onto zero-inflated data goes beyond poor point estimates.
When you ask one model to describe two fundamentally different behaviors (not engaging at all vs. engaging at various intensities), you end up with a model that conflates the drivers of each.
The features that predict whether a customer will purchase at all are often quite different from the features that predict how much they'll spend given a purchase.
Recency and engagement frequency might dominate the "will they buy" question, while income and product category preferences matter more for the "how much" question.
A single regression mixes these signals together, making it difficult to disentangle what's actually driving the forecast.

This also has practical implications for how you act on the model.
If your forecast is low for a particular customer, is it because they're unlikely to buy, or because they're likely to purchase but at a small amount?
The optimal business response to each scenario is different.
You might send a re-engagement campaign in the first case and an upsell offer in the second.
A single model gives you one number, with no way to tell which lever to pull.

The Two-Stage Hurdle Model

Conceptual Framework. The core idea behind hurdle models is surprisingly intuitive.
Zeros and positives often arise from different data-generating processes, so we should model them separately.
Think of it as two sequential questions your model needs to answer.
First, does this customer cross the "hurdle" and engage at all?
And second, given that they've engaged, how much do they spend?
Formally, we can write the distribution of the outcome $Y$ conditional on features $X$ as:

$$ P(Y = y \mid X) = \begin{cases} 1 - \pi(X) & \text{if } y = 0 \\ \pi(X) \cdot f(y \mid X, y > 0) & \text{if } y > 0 \end{cases} $$

Here, $\pi(X)$ is the probability of crossing the hurdle (having a positive outcome), and $f(y \mid X, y > 0)$ is the conditional distribution of $y$ given that it's positive.
The beauty of this formulation is that the two components can be modeled independently.
You can use a gradient boosting classifier for the first stage and a gamma regression for the second, or logistic regression paired with a neural network, or any other combination that suits your data.
Each stage gets its own feature set, its own hyperparameters, and its own evaluation metrics.
This modularity is what makes hurdle models so practical in production settings.

Stage 1: The Classification Model. The first stage is a simple binary classification problem: predict whether $y > 0$.
You're training on the full dataset, with every observation labeled as either zero or positive.
This is a problem the ML community has decades of tooling for.
Logistic regression gives you an interpretable and fast baseline.
Gradient boosting methods like XGBoost or LightGBM handle non-linearities and feature interactions well.
Neural networks work when you have high-dimensional or unstructured features.
The output from this stage is $\hat{\pi}(X) = P(Y > 0 \mid X)$, a calibrated probability that the outcome will be positive.

The important thing to get right here is calibration.
Since we're going to multiply this probability by the conditional amount in the next stage, we need $\hat{\pi}(X)$ to be a true probability, not just a score that ranks well.
If your classifier outputs probabilities that are systematically too high or too low, the combined prediction will inherit that bias.
Platt scaling can help if your base classifier isn't well-calibrated out of the box.
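In scikit-learn, Platt scaling is available through `CalibratedClassifierCV` with `method="sigmoid"`, which fits the scaling via internal cross-validation. Here is a minimal sketch on synthetic data (the data-generating process is made up purely for illustration):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
# Synthetic hurdle labels: a positive outcome is more likely when the
# first feature is large (logistic relationship, for illustration only)
y_binary = (rng.random(2000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Wrap the base classifier with sigmoid (Platt) calibration
base = GradientBoostingClassifier(n_estimators=50, random_state=0)
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv=3)
calibrated.fit(X, y_binary)

probs = calibrated.predict_proba(X)[:, 1]
# Quick sanity check: the average predicted probability should sit close
# to the observed positive rate if the probabilities are well calibrated
print(abs(probs.mean() - y_binary.mean()))
```

The same wrapper accepts `method="isotonic"` when you have enough data for a non-parametric calibration map.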

Stage 2: The Conditional Regression Model. The second stage predicts the value of $y$ conditional on $y > 0$.
This is where the hurdle model shines compared to standard approaches: you're training a regression model exclusively on the positive subset of your data, so the model never has to deal with the spike at zero.
This means you can use the full range of regression techniques without worrying about how they handle zeros.

The choice of model for this stage depends heavily on the shape of your positive outcomes.
If $\log(y \mid y > 0)$ is roughly normal, you can use OLS on the log-transformed target (with appropriate bias correction on back-transformation).
For right-skewed positive continuous outcomes, a GLM with a gamma family is a natural choice.
If you're dealing with overdispersed count data, negative binomial regression works well.
A simple alternative is to use AutoGluon as an ensemble model and not have to worry about the distribution of your data.
The output is $\hat{\mu}(X) = E[Y \mid X, Y > 0]$, the expected value conditional on the outcome being positive.
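For the log-OLS route, one standard bias correction is Duan's smearing estimator: instead of naively exponentiating the fitted values, multiply them by the average of the exponentiated residuals. A sketch on synthetic log-normal data (the coefficients and noise level are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
# Log-normal positive outcomes: log(y) is linear in X plus Gaussian noise
log_y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(scale=0.8, size=1000)
y = np.exp(log_y)

ols = LinearRegression().fit(X, np.log(y))
residuals = np.log(y) - ols.predict(X)

# Duan's smearing factor: the mean of exp(residual) corrects the
# systematic underestimate caused by Jensen's inequality
smearing_factor = np.exp(residuals).mean()
naive = np.exp(ols.predict(X))
corrected = naive * smearing_factor

# The naive back-transform underestimates the mean; the correction fixes it
print(y.mean(), naive.mean(), corrected.mean())
```

With Gaussian residuals the factor is roughly $\exp(\sigma^2 / 2)$, so the larger the residual variance, the worse the naive back-transform.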

Combined Prediction. The final prediction combines both stages multiplicatively:

$$ \hat{E}[Y \mid X] = \hat{\pi}(X) \cdot \hat{\mu}(X) $$

This gives the unconditional expected value of $Y$, accounting for both the probability that the outcome is positive and the expected magnitude given positivity.
If a customer has a 30% chance of buying and their expected spend given a purchase is 100 dollars, then their unconditional expected spend is 30 dollars.
This decomposition also makes business interpretation straightforward.
You can obtain feature importances separately for what drives the probability of engagement and what drives the intensity of engagement, and see which one needs attention.

Implementation

Training Pipeline. The training pipeline is straightforward.
We train Stage 1 on the full dataset with a binary target, then train Stage 2 on only the positive observations with the original continuous target.
At prediction time, we get a probability from Stage 1 and a conditional mean from Stage 2, then multiply them together.

We can implement this in Python using scikit-learn as a starting point.
The following class wraps both stages into a single estimator that follows the scikit-learn API, making it easy to drop into existing pipelines and use with tools like cross-validation and grid search.

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression

class HurdleModel(BaseEstimator, RegressorMixin):
    """
    Two-stage hurdle model for zero-inflated continuous outcomes.

    Stage 1: Binary classifier for P(Y > 0)
    Stage 2: Regressor for E[Y | Y > 0]
    """

    def __init__(self, classifier=None, regressor=None):
        self.classifier = classifier
        self.regressor = regressor

    def fit(self, X, y):
        y = np.asarray(y)

        # Instantiate defaults here rather than in __init__ so the
        # estimator stays compatible with sklearn's clone/get_params
        self.classifier_ = self.classifier or LogisticRegression()
        self.regressor_ = self.regressor or GradientBoostingRegressor()

        # Stage 1: train the classifier on all data
        y_binary = (y > 0).astype(int)
        self.classifier_.fit(X, y_binary)

        # Stage 2: train the regressor on positive outcomes only
        positive_mask = y > 0
        if positive_mask.sum() > 0:
            self.regressor_.fit(X[positive_mask], y[positive_mask])

        return self

    def predict(self, X):
        # E[Y] = P(Y > 0) * E[Y | Y > 0]
        prob_positive = self.classifier_.predict_proba(X)[:, 1]
        conditional_mean = self.regressor_.predict(X)
        return prob_positive * conditional_mean

    def predict_proba_positive(self, X):
        """Return the probability of a positive outcome, P(Y > 0)."""
        return self.classifier_.predict_proba(X)[:, 1]

    def predict_conditional(self, X):
        """Return the expected value given a positive outcome, E[Y | Y > 0]."""
        return self.regressor_.predict(X)
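To see the two stages working together, here is a minimal end-to-end run on synthetic zero-inflated data. The data-generating process and model choices are purely illustrative; the code mirrors inline what the class above does internally:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
n = 3000
X = rng.normal(size=(n, 4))

# Simulate zero-inflated spend: participation driven by the first feature,
# spend amount (given participation) driven by the second feature
participates = rng.random(n) < 1 / (1 + np.exp(-1.5 * X[:, 0]))
amount = np.exp(3.0 + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n))
y = np.where(participates, amount, 0.0)

# Stage 1: classifier on the full data; Stage 2: regressor on positives only
clf = LogisticRegression().fit(X, (y > 0).astype(int))
reg = GradientBoostingRegressor(random_state=0).fit(X[y > 0], y[y > 0])

# Combined prediction: E[Y] = P(Y > 0) * E[Y | Y > 0]
pred = clf.predict_proba(X)[:, 1] * reg.predict(X)
print(f"share of zeros: {(y == 0).mean():.2f}, "
      f"mean(y): {y.mean():.1f}, mean(pred): {pred.mean():.1f}")
```

On in-sample data like this, the mean of the combined predictions should land near the mean of the outcomes; a large gap would point to a calibration problem in one of the stages.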

Practical Considerations

Feature Engineering. One of the nice properties of this framework is that the two stages can use entirely different feature sets.
In my experience, the features that predict whether someone engages at all are often quite different from the features that predict how much they engage.
For Stage 1, behavioral signals tend to dominate: past activity, recency, frequency, whether the customer has ever purchased before.
Demographic indicators and contextual factors like time of year or day of week also help separate the "will engage" group from the "won't engage" group.
For Stage 2, intensity signals matter more: historical purchase amounts, spending velocity, capacity indicators like income or credit limit, and product or category preferences.
These features help distinguish the 50 dollar spender from the 500 dollar spender, conditional on both of them making a purchase.
Additionally, we can use feature boosting by feeding the output of the Stage 1 model into the Stage 2 model as an extra feature.
This allows the Stage 2 model to learn how the probability of engagement interacts with the intensity signals, which can improve performance.
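A sketch of that feature-boosting idea, again on invented synthetic data. Using out-of-fold Stage 1 probabilities (via `cross_val_predict`) avoids feeding Stage 2 a probability from a classifier that has already seen the same rows:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(7)
n = 2000
X = rng.normal(size=(n, 4))
participates = rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))
y = np.where(participates, np.exp(2.0 + 0.4 * X[:, 1]), 0.0)

# Out-of-fold Stage 1 probabilities, so the Stage 2 feature is not leaked
clf = LogisticRegression()
oof_prob = cross_val_predict(
    clf, X, (y > 0).astype(int), cv=5, method="predict_proba"
)[:, 1]

# Append the engagement probability as an extra column for Stage 2
X_boosted = np.column_stack([X, oof_prob])
pos = y > 0
reg = GradientBoostingRegressor(random_state=0).fit(X_boosted[pos], y[pos])
print(X_boosted.shape)
```

At prediction time you would append the fitted Stage 1 model's probabilities in the same column position before calling the Stage 2 regressor.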

Handling Class Imbalance. If zeros dominate your dataset, say 95% of observations are zero, then Stage 1 faces a class imbalance problem.
This is common in applications like ad clicks or insurance claims.
The standard toolkit applies here: you can tune the classification threshold to optimize for your specific business objective rather than using the default 0.5 cutoff, upweight the minority class during training through sample weights, or apply undersampling.
The key is to think carefully about what you're optimizing for.
In many business settings, you care more about precision at the top of the ranked list than about overall accuracy, and tuning your threshold accordingly can make a big difference.
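Both levers are one-liners in scikit-learn. A sketch on an artificial ~5%-positive dataset; note that `class_weight="balanced"` distorts the probability scale, so if the Stage 1 output feeds the hurdle product you should recalibrate afterwards:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 5000
X = rng.normal(size=(n, 3))
# Heavily imbalanced synthetic labels: roughly 5% positives
y_binary = (rng.random(n) < 0.05 * (1 + 0.5 * np.tanh(X[:, 0]))).astype(int)

# class_weight="balanced" upweights the rare positive class during training
clf = LogisticRegression(class_weight="balanced").fit(X, y_binary)
probs = clf.predict_proba(X)[:, 1]

# Tune the decision threshold instead of using the default 0.5 cutoff:
# e.g. flag the 5% of customers with the highest predicted probability
threshold = np.quantile(probs, 0.95)
flagged = probs >= threshold
print(flagged.mean())
```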

Model Calibration. Since the combined prediction $\hat{\pi}(X) \cdot \hat{\mu}(X)$ is a product of two models, each must be well-calibrated for the final output to be reliable.
If Stage 1's probabilities are systematically inflated by 10%, your combined predictions will be inflated by 10% across the board, no matter how good Stage 2 is.
For Stage 1, check calibration curves and apply Platt scaling if the raw probabilities are off.
For Stage 2, confirm that the predictions are unbiased on the positive subset, meaning the mean of your predictions should roughly match the mean of the actuals when evaluated on holdout data where $y > 0$.
I've found that calibration issues in Stage 1 are the more common source of problems in practice, especially when extending the classifier to a discrete-time hazard model.

Evaluation Metrics. Evaluating a two-stage model requires thinking about each stage separately and then looking at the combined output.
For Stage 1, standard classification metrics apply: AUC-ROC and AUC-PR for ranking quality, precision and recall at your chosen threshold for operational performance, and the Brier score for calibration.
For Stage 2, you should evaluate only on the positive subset, since that's what the model was trained on.
RMSE and MAE give you a sense of absolute error, MAPE tells you about percentage errors (which matters when your outcomes span several orders of magnitude), and quantile coverage tells you whether your prediction intervals are honest.

For the combined model, look at overall RMSE and MAE on the full test set, but also break them down by whether the true outcome was zero or positive.
A model that looks great on aggregate can be terrible at one end of the distribution.
Lift charts by predicted decile are also useful for communicating model performance to stakeholders who don't think in terms of RMSE.
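The zero/positive breakdown is simple enough to wrap in a small helper. This is a sketch with a hypothetical `error_breakdown` function and toy numbers, not part of the `HurdleModel` class above:

```python
import numpy as np

def error_breakdown(y_true, y_pred):
    """RMSE and MAE overall and split by whether the true outcome was zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    out = {}
    for name, mask in [("overall", np.ones_like(y_true, dtype=bool)),
                       ("true zeros", y_true == 0),
                       ("true positives", y_true > 0)]:
        err = y_true[mask] - y_pred[mask]
        out[name] = {"rmse": float(np.sqrt(np.mean(err ** 2))),
                     "mae": float(np.mean(np.abs(err)))}
    return out

# Toy illustration with made-up predictions
metrics = error_breakdown([0, 0, 0, 10, 20], [1, 0, 2, 8, 25])
print(metrics["true zeros"]["mae"])  # mean(|0-1|, |0-0|, |0-2|) = 1.0
```

A model that hedges every zero with a small positive prediction will show up clearly in the "true zeros" row even if its aggregate RMSE looks respectable.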

When to Use Hurdle vs. Zero-Inflated Models. This is a distinction worth getting right, because hurdle models and zero-inflated models (like ZIP or ZINB) make different assumptions about where the zeros come from.
Hurdle models assume that all zeros arise from a single process, the "non-participation" process.
Once you cross the hurdle, you're in the positive regime, and the zeros are fully explained by Stage 1.
Zero-inflated models, on the other hand, assume that zeros can come from two sources: some are "structural" zeros (customers who could never be positive, like someone who doesn't own a car being asked about auto insurance claims), and others are "sampling" zeros (customers who could have been positive but just weren't this time).

To make this concrete with a retail example: a hurdle model says a customer either decides to shop or doesn't, and if they shop, they spend some positive amount.
A zero-inflated model says some customers never shop at this store (structural zeros), while others do shop here occasionally but just didn't today (sampling zeros).
If your zeros genuinely come from two distinct populations, a zero-inflated model is more appropriate.
But in many practical settings, the hurdle framing is both simpler and sufficient, and I'd recommend starting there unless you have a clear reason to believe in two kinds of zeros.

Extensions and Variations

Multi-Class Hurdle. Sometimes the binary split between zero and positive isn't granular enough.
If your outcome has multiple meaningful states (say none, small, and large), you can extend the hurdle framework into a multi-class version.
The first stage becomes a multinomial classifier that assigns each observation to one of $K$ buckets, and then separate regression models handle each bucket's conditional distribution.
Formally, this looks like:
$$ P(Y) = \begin{cases} \pi_0 & \text{if } Y = 0 \\ \pi_1 \cdot f_{\text{small}}(Y) & \text{if } 0 < Y \leq \tau \\ \pi_2 \cdot f_{\text{large}}(Y) & \text{if } Y > \tau \end{cases} $$

This is especially useful when the positive outcomes themselves have distinct sub-populations.
For example, in modeling insurance claims, there is often a clear separation between small routine claims and large catastrophic ones, and trying to fit a single distribution to both leads to poor tail estimates.
The threshold $\tau$ can be set based on domain knowledge or estimated from the data using mixture model techniques.
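A minimal sketch of the multi-class variant on synthetic data. The threshold, models, and data-generating process are all illustrative assumptions; the combined expectation sums over the positive buckets:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(11)
n = 3000
X = rng.normal(size=(n, 3))
y = np.where(rng.random(n) < 0.6, 0.0, np.exp(rng.normal(2.0, 1.0, size=n)))

tau = 20.0  # hypothetical small/large threshold, set from domain knowledge
bucket = np.where(y == 0, 0, np.where(y <= tau, 1, 2))

# Stage 1: multinomial classifier over {zero, small, large}
clf = LogisticRegression(max_iter=1000).fit(X, bucket)
pi = clf.predict_proba(X)  # columns ordered by class label: 0, 1, 2

# Stage 2: one conditional regressor per positive bucket
reg_small = LinearRegression().fit(X[bucket == 1], y[bucket == 1])
reg_large = LinearRegression().fit(X[bucket == 2], y[bucket == 2])

# Combined expectation: E[Y] = pi_1 * E[Y | small] + pi_2 * E[Y | large]
expected = pi[:, 1] * reg_small.predict(X) + pi[:, 2] * reg_large.predict(X)
print(expected.shape)
```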

Generalizing the Stages. One thing worth emphasizing is that neither stage needs to be a specific type of model.
Throughout this article, I've presented Stage 1 as a binary classifier, but that's just the simplest version.
If the timing of the event matters, you could replace Stage 1 with a discrete-time survival model that predicts not only whether a customer will purchase, but when.
This is especially useful for subscription or retention contexts where the "hurdle" has a temporal dimension.
Similarly, Stage 2 doesn't have to be a single hand-tuned regression.
You could use an AutoML framework like AutoGluon to ensemble over a large set of candidate models (gradient boosting, neural networks, linear models) and let it find the best combination for predicting the conditional amount.
The hurdle framework is agnostic to what sits inside each stage, so feel free to swap in whatever modeling approach best fits your data and use case.


Common Pitfalls

These are mistakes I've either made myself or seen others make when deploying hurdle models.
None of them are obvious until you've been bitten, so they're worth reading through even if you're already comfortable with the framework.

1. Leaking Stage 2 Information into Stage 1. If you engineer features from the target, something like "average historical spend" or "total lifetime value," you need to be careful about how that information flows into each stage.
A feature that summarizes past spend implicitly contains information about whether the customer has ever spent anything, which means Stage 1 might be getting a free signal that wouldn't be available at prediction time for new customers.
The fix is to think carefully about the temporal structure of your features and make sure both stages only see information that would be available at the time of prediction.

2. Ignoring the Conditional Nature of Stage 2. This one is subtle but important.
Stage 2 is trained only on observations where $y > 0$, so it should be evaluated only on that subset too.
I've seen people compute RMSE across the full test set (including zeros) and conclude that Stage 2 is terrible.
So when you're reporting metrics for Stage 2, always filter to the positive subset first.
Similarly, when diagnosing issues with the combined model, make sure you decompose the error into its Stage 1 and Stage 2 components.
A high overall error might be driven entirely by poor classification in Stage 1, even when Stage 2 is doing fine on the positive observations.

3. Misaligned Train/Test Splits. Both stages need to use the same train/test splits.
This sounds obvious, but it's easy to mess up in practice, especially if you're training the two stages in separate notebooks or pipelines.
If Stage 1 sees a customer in training but Stage 2 sees the same customer in its test set (because you re-split the positive-only data independently), you've introduced data leakage.
The simplest fix is to do your train/test split once at the beginning on the full dataset, and then derive the Stage 2 training data by filtering the training fold to positive observations.
If you're doing cross-validation, the fold assignments need to be consistent across both stages.

4. Assuming Independence Between Stages.
While we model the two stages separately, the underlying features and outcomes are often correlated in ways that matter.
Customers with high $\hat{\pi}(X)$ (likely to engage) often also have high $\hat{\mu}(X)$ (likely to spend a lot when they do).
This means the multiplicative combination $\hat{\pi}(X) \cdot \hat{\mu}(X)$ can amplify errors in ways you wouldn't see if the stages were truly independent.
Keep this in mind when interpreting feature importance.
A feature that shows up as important in both stages is doing double duty, and its total contribution to the combined prediction is larger than either stage's importance score suggests.

Final Remarks

Alternate Uses: Beyond the examples covered in this article, hurdle models show up in a surprising number of business contexts.
In marketing, they're a natural fit for modeling customer lifetime value, where many customers churn before making a second purchase, creating a mass of zeros, while retained customers generate widely varying amounts of revenue.
In healthcare analytics, patient cost modeling follows the same pattern: most patients have zero claims in a given period, but the claims that do come in range from routine office visits to major surgeries.
For demand forecasting with intermittent demand patterns (spare parts, luxury goods, B2B transactions), the two-stage decomposition naturally captures the sporadic nature of purchases and avoids the smoothing artifacts that plague traditional time series methods.
In credit risk, expected loss calculations are inherently a hurdle problem: what's the probability of default (Stage 1), and what's the loss given default (Stage 2)?
If you're working with any outcome where zeros have a fundamentally different meaning than "just a small value," hurdle models are worth considering as a first approach.

Two-stage hurdle models provide a principled approach to predicting zero-inflated outcomes by decomposing the problem into two conceptually distinct parts: whether an event occurs, and what magnitude it takes conditional on occurrence.
This decomposition offers flexibility, since each stage can use different algorithms, features, and tuning strategies.
It offers interpretability, because you can separately analyze and present what drives participation versus what drives intensity, which is often exactly the breakdown that product managers and executives want to see.
And it often delivers better predictive performance than a single model trying to handle both the spike at zero and the continuous positive distribution simultaneously.
The key insight is recognizing that zeros and positive values often arise from different mechanisms, and modeling them separately respects that structure rather than fighting against it.

While this article covers the core framework, we haven't touched on several other important extensions that deserve their own treatment.
Bayesian formulations of hurdle models can incorporate prior knowledge and provide natural uncertainty quantification, which would tie in nicely with our hierarchical Bayesian series.
Imagine estimating product-level hurdle models where products with sparse data borrow strength from their category.
Deep learning approaches open up the possibility of using unstructured features (text, images) in either stage.
If you have the opportunity to apply hurdle models in your own work, I'd love to hear about it!
Please don't hesitate to reach out with questions, insights, or stories through my email or LinkedIn.
If you have any feedback on this article, or would like to request another topic in causal inference/machine learning, please also feel free to reach out.
Thanks for reading!
