Causal Inference Is Eating Machine Learning


A health-tech company shipped a readmission-prediction model in early 2024. This is a composite case drawn from patterns documented by Hernán & Robins, but every detail maps to real deployment failures.

Accuracy on the held-out test set: 94%. The operations team used it to decide which patients to prioritize for follow-up calls. They expected readmission rates to drop.

Rates went up.

The model had captured every correlation in the data: older patients, certain zip codes, specific discharge diagnoses. It performed exactly as designed. The test metrics were clean. The confusion matrix looked textbook.

But when the team acted on those predictions (calling patients flagged as high-risk, rearranging discharge protocols) the relationships in the data shifted beneath them. Patients who received extra follow-up calls didn’t improve. Those who kept getting readmitted shared a different profile entirely: they couldn’t afford their medications, lacked reliable transportation to follow-up appointments, or lived alone without support for post-discharge care. The variables that predicted readmission weren’t the same variables that caused it.

The model never learned that distinction, because it was never designed to. It saw correlations and assumed they were handles you could pull. They weren’t. They were shadows cast by deeper causes the model couldn’t see.

A model that predicts readmission with 94% accuracy told the team exactly who would come back. It told them nothing about why, or what to do about it.

If you’ve built a model that predicts well but fails the moment its predictions become decisions, you’ve already felt this problem. You just didn’t have a name for it.

The name is confounding. The answer is causal inference. And in 2026, the tools to do it properly are finally mature enough for any data scientist to use.


The Question Your Model Can’t Answer

Machine Learning (ML) is built for one job: find patterns in data and predict outcomes. This is associational reasoning. It works brilliantly for spam filters, image classifiers, and recommendation engines. Pattern in, pattern out.

But business stakeholders rarely ask “what will happen next?” They ask “what should we do?” Should we raise the price? Should we change the treatment protocol? Should we offer this customer a discount?

These are causal questions. And answering them with associational models is like using a thermometer to set the thermostat. The thermometer tells you the temperature. It doesn’t tell you what would happen if you changed the dial.

Answering “what should we do?” with a tool designed for “what will happen?” is like using a thermometer to set the thermostat.

Judea Pearl, the computer scientist who won the 2011 Turing Award for his work on probabilistic and causal reasoning, organized this gap into what he calls the Ladder of Causation. The ladder has three rungs, and the gap between them explains why so many ML projects fail when they move from prediction to action.

Pearl’s three rungs of causal reasoning. The gap between Level 1 and Level 2 is where flawed decisions are made at scale. Image by the author.

Level 1: Association (“Seeing”). “Patients who take Drug X have better outcomes.” This is pure correlation. Every standard ML model operates here. It answers: what patterns exist in the data?

Level 2: Intervention (“Doing”). “If we give Drug X to this patient, will their outcome improve?” This requires understanding what happens when you change something. Pearl formalizes this with the do-operator: P(Y | do(X)). No amount of observational data, by itself, can answer this.

Level 3: Counterfactual (“Imagining”). “This patient took Drug X and recovered. Would they have recovered without it?” This requires reasoning about realities that never happened. It’s the highest form of causal thinking.

Here’s what each level looks like in practice. A Level 1 model at an e-commerce company says: “Users who viewed product pages for running shoes also bought protein bars.” Useful for recommendations. A Level 2 question from the same company: “If we send a 20% discount on protein bars to users who viewed running shoes, will purchases increase?” That requires knowing whether the discount causes purchases or whether the same users would have bought anyway. A Level 3 question: “This user bought protein bars after receiving the discount. Would they have bought them without it?” That requires reasoning about a world that didn’t happen.

Most ML operates on Level 1. Most business decisions require Level 2 or 3. That gap is where flawed decisions are made at scale.
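A toy simulation makes the gap concrete. In the sketch below (hypothetical numbers, standard-library Python only), a hidden health variable drives both who receives a drug and who recovers. The Level 1 quantity, P(Y | X), makes the drug look far better than it is; randomizing the assignment, the simulated equivalent of do(X), reveals the true effect.

```python
import random

random.seed(0)

def recovers(drug: bool, healthy: bool) -> bool:
    # True effect of the drug is +0.10; underlying health contributes +0.60
    return random.random() < 0.2 + 0.6 * healthy + 0.1 * drug

def gap(assign_drug):
    """Difference in recovery rates, treated minus untreated."""
    treated, untreated = [], []
    for _ in range(100_000):
        healthy = random.random() < 0.5
        drug = assign_drug(healthy)
        (treated if drug else untreated).append(recovers(drug, healthy))
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)

# Level 1 (seeing): healthier patients are more likely to get the drug,
# so the hidden confounder inflates the observed association
observed = gap(lambda healthy: random.random() < (0.8 if healthy else 0.2))

# Level 2 (doing): do(drug) severs the health -> drug arrow via randomization
interventional = gap(lambda healthy: random.random() < 0.5)

print(f"P(Y|X) gap:     {observed:.2f}")       # ~0.46: confounded
print(f"P(Y|do(X)) gap: {interventional:.2f}") # ~0.10: the true effect
```

The observational model sees roughly a 46-point improvement; the intervention delivers about 10. Acting on the Level 1 number is the readmission story in miniature.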


When Accuracy Lies

The gap between prediction and causation isn’t theoretical. It has a body count.

Consider the kidney stone study from 1986. Researchers compared two treatments for renal calculi. Treatment A outperformed Treatment B for small stones. Treatment A also outperformed Treatment B for large stones. But when the data was pooled across both groups, Treatment B appeared superior.

This is Simpson’s paradox. The lurking variable was stone severity. Doctors had prescribed Treatment A for harder cases. Pooling the data erased that context, flipping the apparent conclusion. A prediction model trained on the pooled data would confidently recommend Treatment B. It would be wrong.
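The reversal is visible in the raw arithmetic, using the success counts reported in the 1986 paper (Treatment A: 81/87 on small stones, 192/263 on large; Treatment B: 234/270 and 55/80):

```python
# Success counts (successes, total) from the 1986 kidney stone study
small = {"A": (81, 87), "B": (234, 270)}
large = {"A": (192, 263), "B": (55, 80)}

def rate(successes, total):
    return successes / total

# Within each severity stratum, Treatment A wins
small_a, small_b = rate(*small["A"]), rate(*small["B"])  # 0.93 vs 0.87
large_a, large_b = rate(*large["A"]), rate(*large["B"])  # 0.73 vs ~0.69

# Pool the strata and the conclusion flips: Treatment B "wins"
pooled_a = rate(81 + 192, 87 + 263)   # 273/350 = 0.78
pooled_b = rate(234 + 55, 270 + 80)   # 289/350 ~ 0.83

print(f"small stones: A {small_a:.2f} > B {small_b:.2f}")
print(f"large stones: A {large_a:.2f} > B {large_b:.2f}")
print(f"pooled:       A {pooled_a:.2f} < B {pooled_b:.2f}")
```

No sampling noise, no modeling choices: the flip is entirely an artifact of collapsing the severity variable.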

That’s a statistics textbook example. The hormone therapy case drew blood.

For decades, observational studies suggested that postmenopausal Hormone Replacement Therapy (HRT) reduced the risk of coronary heart disease. The evidence looked solid. Millions of women were prescribed HRT based on these findings. Then the Women’s Health Initiative, a large-scale randomized controlled trial published in 2002, revealed the opposite: HRT actually increased cardiovascular risk.

For decades, observational studies suggested hormone therapy protected hearts. A proper trial revealed it damaged them. Millions of prescriptions, one confound.

The confound was wealth. Healthier, wealthier women were more likely to both choose HRT and have lower heart disease rates. The observational models captured this correlation and mistook it for a treatment effect. A 2019 paper by Miguel Hernán in CHANCE used this exact case to argue that data science needs “a second chance to get causal inference right.”

How common is this error? A 2021 scoping review examined observational studies and found that 26% of them conflated prediction with causal claims. One in four published papers, in medical journals, where people make life-and-death decisions based on the results.

The core structure behind both cases is the confounding fork: a hidden common cause (Z) that influences both the treatment (X) and the outcome (Y), creating a spurious association between them. Stone severity drove both treatment choice and outcomes. Wealth drove both HRT adoption and heart health. In each case, the correlation between X and Y was real in the data. But acting on it as if X caused Y produced the wrong intervention.

The confounding fork: Z causes both X and Y, creating a correlation between X and Y that disappears (or reverses) when you control for Z. Image by the author.
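If Z is measured, you don’t need a trial to defuse the fork. The backdoor adjustment formula, P(Y | do(X)) = Σz P(Y | X, z) P(z), averages the stratum-specific contrasts instead of the pooled one. A sketch on simulated data (hypothetical numbers):

```python
import random

random.seed(1)

# Confounding fork: Z drives both treatment X and outcome Y.
# True causal effect of X on Y is +0.10 by construction.
rows = []
for _ in range(200_000):
    z = random.random() < 0.5
    x = random.random() < (0.8 if z else 0.2)       # Z -> X
    y = random.random() < 0.2 + 0.6 * z + 0.1 * x   # Z -> Y and X -> Y
    rows.append((z, x, y))

def p_y(x, z=None):
    """P(Y=1 | X=x), or P(Y=1 | X=x, Z=z) when z is given."""
    sel = [r for r in rows if r[1] == x and (z is None or r[0] == z)]
    return sum(r[2] for r in sel) / len(sel)

# Naive contrast: badly inflated by the fork
naive = p_y(True) - p_y(False)

# Backdoor adjustment: weight each stratum's contrast by P(z)
p_z = sum(r[0] for r in rows) / len(rows)
adjusted = sum(
    (p_y(True, z) - p_y(False, z)) * (p_z if z else 1 - p_z)
    for z in (True, False)
)

print(f"naive:    {naive:.2f}")     # ~0.46
print(f"adjusted: {adjusted:.2f}")  # ~0.10, recovering the true effect
```

In essence, this is what a backdoor identification step produces: the DAG tells you that conditioning on Z closes the spurious path, and the adjustment formula becomes the estimand.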

The lesson is uncomfortable: a model can have high accuracy, pass every validation check, and still give recommendations that make outcomes worse. Accuracy measures how well a model captures existing patterns. It says nothing about whether those patterns survive when you intervene.


The Toolkit Caught Up

For years, causal inference lived behind a wall of econometrics textbooks, custom R scripts, and a small circle of specialists. That wall has come down.

Microsoft Research built DoWhy, a Python library that reduces causal analysis to four explicit steps: model your assumptions, identify the causal estimand, estimate the effect, and refute your own result. That fourth step is what separates causal inference from “I ran a regression and it was significant.” DoWhy forces you to try to break your conclusion before you trust it.

Alongside DoWhy sits EconML, another Microsoft Research library that provides the estimation algorithms: Double Machine Learning (DML), causal forests, instrumental variable methods, and doubly robust estimators. Together, they form the PyWhy project, which is quickly becoming the standard causal analysis stack in Python.

DoWhy reduces causal analysis to four steps: model, identify, estimate, refute. That last step separates causal inference from “I ran a regression.”

The market signals align. Fortune Business Insights valued the global Causal Artificial Intelligence (AI) market at $81.4 billion in 2025, projecting $116 billion for 2026 (a 42.5% Compound Annual Growth Rate, or CAGR). A further 25% of organizations plan to adopt causal AI by 2026, which would bring total adoption among AI-driven organizations to nearly 70%.

Uber built CausalML for uplift modeling and treatment effect estimation. Netflix has published research on causal bandits for content recommendations. Amazon’s AWS team uses DoWhy for root cause analysis in microservice architectures, diagnosing why latency spikes occur rather than simply predicting when they will. These aren’t academic experiments. They’re production systems running at scale.

The practical barrier used to be expertise. You had to understand structural causal models and the backdoor criterion, and derive estimands by hand. DoWhy automates the identification step. You draw the DAG (encoding your domain knowledge), and the library determines which statistical estimand answers your causal question. That’s the part that used to take a PhD-level methods course to do manually.


Where Causal Methods Break Down

A fair objection: most ML applications work fine without causal reasoning. Recommendation systems, image classification, fraud detection, search ranking. Pattern in, pattern out. These problems genuinely don’t need causal structure, and adding it would be over-engineering.

Causal inference also carries a cost that prediction doesn’t. It requires assumptions. You must specify a Directed Acyclic Graph (DAG), a diagram encoding which variables cause which. If your DAG is wrong (a missing confounder, a reversed arrow) your causal estimate can be worse than a naive correlation. The garbage-in-garbage-out problem doesn’t disappear; it moves from the data to the assumptions.
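A wrong DAG can bite in a specific way (a sketch with hypothetical numbers): below, X and Y are causally unrelated, but both feed a downstream collider C. “Controlling for” C, a plausible-looking move if your arrow directions are wrong, manufactures an association out of nothing.

```python
import random

random.seed(2)

# X -> C <- Y: X and Y are independent causes of the collider C
rows = []
for _ in range(200_000):
    x = random.random() < 0.5
    y = random.random() < 0.5
    c = random.random() < 0.1 + 0.4 * x + 0.4 * y
    rows.append((x, y, c))

def p_y_given_x(rows, x):
    sel = [r for r in rows if r[0] == x]
    return sum(r[1] for r in sel) / len(sel)

# Unadjusted contrast: correctly near zero, since X does not affect Y
naive = p_y_given_x(rows, True) - p_y_given_x(rows, False)

# Conditioning on C=1 (the wrong-DAG move) induces a spurious
# negative association between X and Y
stratum = [r for r in rows if r[2]]
biased = p_y_given_x(stratum, True) - p_y_given_x(stratum, False)

print(f"unadjusted:     {naive:+.2f}")   # ~ +0.00
print(f"adjusted for C: {biased:+.2f}")  # ~ -0.19: bias created by adjusting
```

Adjusting for everything you can measure is not a safe default; the DAG decides what belongs in the adjustment set.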

The argument here isn’t that causal inference should replace prediction. It’s that causal inference must complement prediction whenever you move from pattern recognition to decision-making. The failure mode isn’t “ML doesn’t work.” The failure mode is “ML works for prediction, then gets misapplied to a causal question.” Knowing which question you’re answering is the skill that separates a model builder from a decision scientist.


Does Your Problem Need Causal Inference?

The Five-Question Diagnostic

Before you pick a method, run your problem through these five questions. If you answer “yes” to two or more, you need causal inference. If you answer “yes” to question 1 alone, you need causal inference.

  1. Are you making a decision or a prediction?
    Predicting who will churn = standard ML. Deciding which intervention prevents churn = causal inference.
  2. Would acting on your model change the underlying relationships?
    If your intervention alters the very patterns the model learned, your correlations will shift post-deployment. This is a causal problem.
  3. Could a confounding variable explain your result?
    If two variables (treatment and outcome) share a common cause, your observed association may vanish, reverse, or amplify once the confounder is controlled for. Think: the HRT case.
  4. Do you need to answer “what if?” or “why?”
    “What if we doubled the price?” is a Level 2 (intervention) question. “Why did this customer leave?” is a Level 3 (counterfactual) question. Both require causal reasoning.
  5. Is there selection bias in how treatments were assigned?
    If doctors prescribe Drug A to sicker patients, or if users self-select into a feature, comparing raw outcomes without adjustment is meaningless.
A simplified diagnostic flow. The full five-question version is in the text above. Image by the author.

Which Causal Method Fits Your Problem?

Once you know you need causal inference, the next question is which method. This matrix maps common situations to the right tool.

Image by the author.

If you’re unsure where to start: start with a DAG. Draw the causal relationships you believe exist between your treatment, outcome, and potential confounders. Even a rough DAG makes your assumptions explicit, which is the single most important step. You can refine the estimation method later.

A DoWhy Workflow in Practice

Here’s a concrete example: measuring whether a customer loyalty program actually increases annual spending (versus loyal customers, who would spend more anyway, self-selecting into the program).

# Install: pip install dowhy
import dowhy
from dowhy import CausalModel

# Step 1: MODEL your causal assumptions as a DAG
# Income affects both loyalty signup AND spending (confounder)
model = CausalModel(
    data=df,
    treatment="loyalty_program",
    outcome="annual_spending",
    common_causes=["income", "prior_purchases", "age"],
)

# Step 2: IDENTIFY the causal estimand
# DoWhy determines which statistical quantity answers your question
identified = model.identify_effect()
# Returns: E[annual_spending | do(loyalty_program=1)]
#        - E[annual_spending | do(loyalty_program=0)]

# Step 3: ESTIMATE the causal effect
estimate = model.estimate_effect(
    identified,
    method_name="backdoor.propensity_score_matching"
)
print(f"Causal effect: ${estimate.value:.2f}/yr")

# Step 4: REFUTE your own result
# Add a random variable that shouldn't affect the estimate
refutation = model.refute_estimate(
    identified, estimate,
    method_name="random_common_cause"
)
print(refutation)
# If the effect holds under random confounders, your result is robust

Four steps. Model your assumptions, identify the estimand, estimate the effect, then try to break your own result. The DoWhy documentation provides full tutorials on integrating EconML estimators for more advanced use cases (DML, causal forests, instrumental variables).

The refutation step deserves emphasis. In standard ML, you validate with held-out test sets. In causal inference, you validate by attempting to destroy your own estimate: adding random confounders, using placebo treatments, running the analysis on data subsets. If the effect survives, you have something real. If it collapses, you’ve saved yourself from a costly wrong decision.

If your model’s recommendations would change the relationships it learned from, you’ve left prediction territory. Welcome to causation.


What Changes Now

The convergence is already visible. Tech firms are hiring for causal reasoning: Microsoft built the entire PyWhy stack, Uber released CausalML, Netflix published research on causal inference in production. The skillset is no longer confined to economics PhD programs and epidemiology departments. It’s entering production ML teams.

Universities are adapting. Hernán’s classification of data science tasks into Description, Prediction, and Causal Inference (published through the Harvard School of Public Health) is becoming a standard pedagogical framework. The question is no longer “should data scientists learn causal inference?” It’s “how quickly can they?”

For the individual practitioner, the return on learning causal methods is asymmetric. The data scientist who can answer “what will happen?” is valuable. The one who can answer “what should we do?” (and demonstrate why the answer is robust) commands a different kind of trust in the room. That trust translates directly into influence over decisions, resource allocation, and strategy.

The learning curve is real but shorter than it looks. If you understand conditional probability and have built regression models, you already have 60% of the foundation. The remaining 40% is learning to think in graphs (DAGs), understanding the difference between conditioning and intervening, and knowing when to reach for which estimator. The PyWhy documentation, Brady Neal’s free online course on causal inference, and Pearl’s accessible The Book of Why cover that gap in weeks, not years.

Remember the health-tech company from the opening? After the readmission spike, they rebuilt their analysis using DoWhy. They drew a DAG, identified that socioeconomic factors were confounders (not causes) of readmission, and isolated the actual causal drivers: medication adherence and follow-up appointment access. They redesigned their intervention around those two levers.

Readmission rates dropped 18%.

The model’s accuracy didn’t change. What changed was the question it answered.

The next time a stakeholder asks “what should we do?”, you have two options: hand them a correlation and hope it survives contact with reality, or hand them a causal estimate with a refutation report showing exactly how hard you tried to break it. The tools exist. The math is settled. The code is four steps.

The only question left is whether you’ll keep predicting, or start causing.


References

  1. Pearl, J. & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
  2. Pearl, J. & Bareinboim, E. (2022). “On Pearl’s Hierarchy and the Foundations of Causal Inference.” Technical Report R-60, UCLA Cognitive Systems Laboratory.
  3. Hernán, M.A. (2019). “A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks.” CHANCE, 32(1), 42-49.
  4. Luijken, K. et al. (2021). “Prediction or causality? A scoping review of their conflation within current observational research.” 37, 35-46.
  5. Hernán, M.A. & Robins, J.M. (2020). “Causal inference and counterfactual prediction in machine learning for actionable healthcare.” Nature Machine Intelligence, 2, 369-375.
  6. Sharma, A. & Kiciman, E. (2020). DoWhy: An End-to-End Library for Causal Inference. Microsoft Research / PyWhy.
  7. Battocchi, K. et al. (2019). EconML: A Python Package for ML-Based Heterogeneous Treatment Effect Estimation. Microsoft Research / ALICE.
  8. Fortune Business Insights. (2025). “Causal AI Market Size, Industry Share | Forecast, 2026-2034.”
  9. Charig, C.R. et al. (1986). “Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy.” British Medical Journal, 292(6524), 879-882.
  10. PyWhy Contributors. (2024). “Tutorial on Causal Inference and its Connections to Machine Learning (Using DoWhy+EconML).” PyWhy Documentation.
