If you’ve studied causal inference before, you most likely already have a solid grasp of the basics: the potential outcomes framework, propensity score matching, and basic difference-in-differences. But foundational methods often break down when they meet real-world challenges. Sometimes the confounders are unmeasured, treatments roll out at different times, or effects vary across a population.
This article is aimed at readers who have a solid grasp of the basics and are now looking to expand their skill set with more advanced techniques. To make things relatable and tangible, we will use a recurring case study: evaluating whether a job training program has a positive impact on earnings. This classic causal question is especially well suited to our purposes because it is fraught with the complexities that arise in real-world settings, such as self-selection, unmeasured ability, and dynamic effects, making it an ideal test case for the advanced methods we’ll be exploring.
Contents
Introduction
1. Doubly Robust Estimation: Insurance Against Misspecification
2. Instrumental Variables: When Confounders Are Unmeasured
3. Regression Discontinuity: The Credible Quasi-Experiment
4. Difference-in-Differences: Navigating Staggered Adoption
5. Heterogeneous Treatment Effects: Moving Beyond the Average
6. Sensitivity Analysis: The Honest Researcher’s Toolkit
Putting It All Together: A Decision Framework
Final Thoughts
Part 1: Doubly Robust Estimation
Imagine we’re evaluating a training program where participants self-select into treatment. To estimate its effect on earnings, we must account for confounders like age and education. Traditionally, we have two paths: outcome regression (modeling earnings) or propensity score weighting (modeling the probability of treatment). Both paths share a critical vulnerability: if we specify our model incorrectly, e.g., forget an interaction or misspecify a functional form, our estimate of the treatment effect will be biased. Doubly Robust (DR) Estimation, also known as Augmented Inverse Probability Weighting (AIPW), combines both to provide a safeguard: our estimate is consistent if either the outcome model or the propensity model is correctly specified. The propensity model creates balance, while the outcome model clears residual imbalance. If one model is flawed, the other corrects it.
In practice, this involves three steps:
- Model the outcome: Fit two separate regression models (e.g., linear regression, machine learning models) to predict the outcome (earnings) from the covariates: one trained on the treated group to estimate μ^1(X), and another trained on the control group to estimate μ^0(X).
- Model treatment: Fit a classification model (e.g., logistic regression, gradient boosting) to estimate the propensity score, each individual’s probability of enrolling in the program: π^(X) = P(T=1∣X). For each person, π^(X) is a number between 0 and 1 representing their estimated likelihood of joining the training program, based on their characteristics X.
- Combine: Plug these predictions into the Augmented Inverse Probability Weighting estimator. This formula combines the regression predictions with the inverse-propensity-weighted residuals to produce a final, doubly robust estimate of the Average Treatment Effect (ATE).
Here is the code for a simple implementation in Python:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.model_selection import KFold
def doubly_robust_ate(Y, T, X, n_folds=5):
    n = len(Y)
    e = np.zeros(n)    # propensity scores
    mu1 = np.zeros(n)  # E[Y|X,T=1]
    mu0 = np.zeros(n)  # E[Y|X,T=0]
    kf = KFold(n_folds, shuffle=True, random_state=42)
    for train_idx, test_idx in kf.split(X):
        X_tr, X_te = X[train_idx], X[test_idx]
        T_tr = T[train_idx]
        Y_tr = Y[train_idx]
        # Propensity model
        ps = GradientBoostingClassifier(n_estimators=200)
        ps.fit(X_tr, T_tr)
        e[test_idx] = ps.predict_proba(X_te)[:, 1]
        # Outcome models (fit only on the appropriate groups)
        mu1_model = GradientBoostingRegressor(n_estimators=200)
        mu1_model.fit(X_tr[T_tr == 1], Y_tr[T_tr == 1])
        mu1[test_idx] = mu1_model.predict(X_te)
        mu0_model = GradientBoostingRegressor(n_estimators=200)
        mu0_model.fit(X_tr[T_tr == 0], Y_tr[T_tr == 0])
        mu0[test_idx] = mu0_model.predict(X_te)
    # Clip extreme propensities
    e = np.clip(e, 0.05, 0.95)
    # AIPW estimator
    treated_term = mu1 + T * (Y - mu1) / e
    control_term = mu0 + (1 - T) * (Y - mu0) / (1 - e)
    ate = np.mean(treated_term - control_term)
    # Standard error via the influence function
    influence = (treated_term - control_term) - ate
    se = np.std(influence, ddof=1) / np.sqrt(n)
    return ate, se
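To see the insurance property in action, here is a small self-contained simulation (synthetic data invented for illustration, not from any real program): the outcome models are deliberately misspecified as intercept-only, yet the AIPW estimate stays near the true ATE of 2 because the propensity model is correct.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Confounded data: X drives both treatment and outcome; true ATE = 2
X = rng.normal(size=(n, 1))
p = 1 / (1 + np.exp(-X[:, 0]))          # true propensity
T = rng.binomial(1, p)
Y = 2 * T + X[:, 0] + rng.normal(size=n)

# Deliberately WRONG outcome models: intercept-only, ignoring X entirely
mu1 = np.full(n, Y[T == 1].mean())
mu0 = np.full(n, Y[T == 0].mean())

# Correctly specified propensity model
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
e = np.clip(e, 0.05, 0.95)

# AIPW: the correct propensity model rescues the bad outcome models
ate_dr = np.mean(mu1 + T * (Y - mu1) / e - mu0 - (1 - T) * (Y - mu0) / (1 - e))
ate_naive = Y[T == 1].mean() - Y[T == 0].mean()

print(f"Naive difference in means: {ate_naive:.2f}")   # biased upward
print(f"AIPW with bad outcome model: {ate_dr:.2f}")    # close to 2
```

Swapping the experiment around (correct outcome models, broken propensity model) gives the same rescue in the other direction; that symmetry is exactly the DR property.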
When it falls short: The DR property protects against functional-form misspecification, but it cannot protect against unmeasured confounding. If a key confounder is missing from both models, our estimate remains biased. This is a fundamental limitation of all methods that rely on the selection-on-observables assumption, which leads us to our next method.
Part 2: Instrumental Variables
Sometimes, no amount of covariate adjustment can save us. Consider the challenge of unmeasured ability. People who voluntarily enrol in job training are likely more motivated and capable than those who don’t. These traits directly affect earnings, creating confounding that no set of observed covariates can fully capture.
Now, suppose the training program sends promotional mailers to randomly chosen households to encourage enrolment. Not everyone who receives a mailer enrols, and some people enrol without receiving one. But the mailer creates exogenous variation in enrolment, which is unrelated to ability or motivation.
This mailer is our instrument. Its effect on earnings flows entirely through its effect on program participation. We can use this clean variation to estimate the program’s causal impact.
The Core Insight: Instrumental variables (IV) exploit a powerful idea: find something that nudges people toward treatment but has no direct effect on the outcome except through that treatment. This “instrument” provides variation in treatment that is free of confounding.
The Three Requirements: For an instrument Z to be valid, three conditions must hold:
- Relevance: Z must affect treatment (a common rule of thumb is a first-stage F-statistic > 10).
- Exclusion Restriction: Z affects the outcome only through treatment.
- Independence: Z is unrelated to unmeasured confounders.
Conditions 2 and 3 are fundamentally untestable and require deep domain knowledge.
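Before the full 2SLS machinery, the intuition can be checked on simulated data (all numbers invented here): with a single binary instrument and no covariates, IV reduces to the Wald estimator, the effect of the mailer on earnings divided by its effect on enrolment. Unmeasured ability badly biases OLS, but the Wald ratio recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

ability = rng.normal(size=n)              # unmeasured confounder
mailer = rng.binomial(1, 0.5, size=n)     # randomized instrument
# Enrolment depends on the mailer AND on ability (self-selection)
training = (1.0 * mailer + ability + rng.normal(size=n) > 0.8).astype(int)
# True causal effect of training is 3; ability also raises earnings
earnings = 3 * training + 2 * ability + rng.normal(size=n)

# Naive OLS slope of earnings on training is confounded by ability
ols = np.cov(training, earnings)[0, 1] / np.var(training)

# Wald/IV: effect of Z on Y divided by effect of Z on T
itt_y = earnings[mailer == 1].mean() - earnings[mailer == 0].mean()
itt_t = training[mailer == 1].mean() - training[mailer == 0].mean()
wald = itt_y / itt_t

print(f"OLS estimate: {ols:.2f}")          # badly biased upward
print(f"IV (Wald) estimate: {wald:.2f}")   # close to 3
```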
Two-Stage Least Squares: The workhorse estimator for IV is Two-Stage Least Squares (2SLS). It isolates the variation in treatment driven by the instrument to estimate the effect. As the name suggests, it works in two stages:
- First stage: Regress treatment on the instrument (and any controls). This isolates the portion of treatment variation driven by the instrument.
- Second stage: Regress the outcome on the predicted treatment from stage one. Because predicted treatment only reflects instrument-driven variation, the second-stage coefficient captures the causal effect for compliers.
Here is a simple implementation in Python:
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS
# Assume 'data' is a pandas DataFrame with columns:
# earnings, training (endogenous), mailer (instrument), age, education
# ---- 1. Check instrument strength (first stage) ----
first_stage = smf.ols('training ~ mailer + age + education', data=data).fit()
# For a single instrument, the t-statistic on 'mailer' tests relevance
t_stat = first_stage.tvalues['mailer']
f_stat = t_stat ** 2
print(f"First-stage F-statistic on instrument: {f_stat:.1f} (t = {t_stat:.2f})")
# Rule of thumb: F > 10 indicates strong instrument
# ---- 2. Two-Stage Least Squares ----
# Formula syntax: outcome ~ exogenous + [endogenous ~ instrument]
iv_formula = 'earnings ~ 1 + age + education + [training ~ mailer]'
iv_result = IV2SLS.from_formula(iv_formula, data=data).fit()
# Display key results
print("\nIV Estimates (2SLS)")
print(f"Coefficient for training: {iv_result.params['training']:.3f}")
print(f"Standard error: {iv_result.std_errors['training']:.3f}")
print(f"95% CI: {iv_result.conf_int().loc['training'].values}")
# Optional: full summary
# print(iv_result.summary)
The Interpretation Trap — LATE vs. ATE: A key nuance of IV is that it doesn’t estimate the treatment effect for everyone. It estimates the Local Average Treatment Effect (LATE): the effect for compliers, the people who take the treatment only because they received the instrument. In our example, compliers are people who enrol only because they got the mailer. We learn nothing about “always-takers” (who would enrol regardless) or “never-takers” (who would never enrol). This is important context when interpreting results.
Part 3: Regression Discontinuity (RD)
Now, let’s imagine the job training program is offered only to workers who scored below 70 on a skills assessment. A worker scoring 69.5 is essentially equivalent to one scoring 70.5; the tiny difference in score is practically random noise. Yet one receives training and the other doesn’t.
By comparing outcomes for people just below the cutoff (who received training) with those just above (who didn’t), we approximate a randomized experiment in the neighbourhood of the threshold. This is a regression discontinuity (RD) design, and it provides causal estimates that rival the credibility of actual randomized controlled trials.
Unlike most observational methods, RD allows for powerful diagnostic checks:
- The Density Test: If people can manipulate their score to land just below the cutoff, those just below will differ systematically from those just above. We can check this by plotting the density of the running variable around the threshold. A smooth density supports the design, whereas a suspicious spike just below the cutoff suggests gaming.
- Covariate Continuity: Pre-treatment characteristics should be smooth through the cutoff. If age, education, or other covariates jump discontinuously at the threshold, something other than treatment is changing, and our RD estimate is not reliable. We formally test this by checking whether the covariates themselves show a discontinuity at the cutoff.
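Both diagnostics can be approximated by hand. The sketch below (simulated, manipulation-free scores; all names and numbers are my own) compares counts in narrow windows on each side of the cutoff as a crude density test, and checks a pre-treatment covariate for a jump; dedicated tools such as rddensity implement the formal versions.

```python
import numpy as np

rng = np.random.default_rng(2)
score = rng.uniform(40, 100, size=5_000)   # no manipulation in this simulation
cutoff, h = 70, 2.0

# Density test: with no sorting, counts just below vs. just above the
# cutoff should be roughly equal (binomial with p = 0.5)
below = np.sum((score >= cutoff - h) & (score < cutoff))
above = np.sum((score >= cutoff) & (score < cutoff + h))
z = (below - above) / np.sqrt(below + above)
print(f"Just below: {below}, just above: {above}, z = {z:.2f}")
# |z| > 1.96 would suggest manipulation of the running variable

# Covariate continuity: a pre-treatment covariate unrelated to treatment
# (here, age drawn independently of the score) should not jump at the cutoff
age = rng.normal(40, 5, size=score.size)
in_below = (score >= cutoff - h) & (score < cutoff)
in_above = (score >= cutoff) & (score < cutoff + h)
jump = age[in_below].mean() - age[in_above].mean()
print(f"Age difference across the cutoff: {jump:.2f} years")  # near 0
```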
The Bandwidth Dilemma: Every RD design faces an inherent dilemma:
- Narrow bandwidth (looking only at scores very near the cutoff): more credible, because people are truly comparable, but less precise, because we’re using fewer observations.
- Wide bandwidth (including scores farther from the cutoff): more precise, because we have more data, but riskier, because people far from the cutoff may differ systematically.
Modern practice uses data-driven bandwidth selection methods developed by Calonico, Cattaneo, and Titiunik, which formalize this trade-off. But the golden rule is to always check that our conclusions don’t flip when we adjust the bandwidth.
Modeling RD: In practice, we fit separate local linear regressions on either side of the cutoff and weight observations by their distance from the threshold. This gives more weight to observations closest to the cutoff, where comparability is highest. The treatment effect is the difference between the two regression lines at the cutoff point.
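These mechanics can be sketched directly with NumPy (a didactic version on simulated data with an invented true jump of 2; the rdrobust package handles bandwidth selection and bias correction properly). We fit a weighted line on each side using a triangular kernel and take the gap at the cutoff:

```python
import numpy as np

rng = np.random.default_rng(3)
n, cutoff, h = 4_000, 70, 8.0

score = rng.uniform(50, 90, size=n)
treated = (score < cutoff).astype(int)          # training offered below 70
# Earnings rise smoothly with the score, plus a true jump of 2 for the treated
earnings = 0.1 * score + 2 * treated + rng.normal(size=n)

def side_value_at_cutoff(mask):
    """Local linear fit on one side of the cutoff, triangular-kernel
    weighted, evaluated at the cutoff (x = 0 after centering)."""
    x = score[mask] - cutoff
    w = np.clip(1 - np.abs(x) / h, 0, None)     # triangular kernel weights
    keep = w > 0
    # np.polyfit's w multiplies residuals, so pass sqrt of the kernel weights
    slope, intercept = np.polyfit(x[keep], earnings[mask][keep], 1,
                                  w=np.sqrt(w[keep]))
    return intercept                            # fitted value at x = 0

tau = side_value_at_cutoff(score < cutoff) - side_value_at_cutoff(score >= cutoff)
print(f"RD estimate of the training effect: {tau:.2f}")  # true value: 2
```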
Here is a simple implementation in Python using the rdrobust package:
from rdrobust import rdrobust
# Estimate at default (optimal) bandwidth
result = rdrobust(y=data['earnings'], x=data['score'], c=70)
print(result.summary())
# Sensitivity evaluation over fixed bandwidths
for bw in [3, 5, 7, 10]:
    r = rdrobust(y=data['earnings'], x=data['score'], c=70, h=bw)
    # Use integer indices: 0 = conventional, 1 = bias-corrected, 2 = robust
    eff = r.coef[0]
    ci_low, ci_high = r.ci[0, :]
    print(f"Bandwidth={bw}: Effect={eff:.2f}, "
          f"95% CI=[{ci_low:.2f}, {ci_high:.2f}]")
Part 4: Difference-in-Differences
If you’ve studied causal inference, you’ve probably encountered the basic difference-in-differences (DiD) method: compare the change in outcomes for a treated group before and after treatment, subtract the corresponding change for a control group, and attribute the remaining difference to the treatment. The logic is intuitive and powerful. But over the past few years, methodological researchers have uncovered a significant problem with how it is commonly applied.
The Problem with Staggered Adoption: In the real world, treatments rarely switch on at a single point in time for a single group. Our job training program might roll out city by city over several years. The standard approach, throwing all the data into a two-way fixed effects (TWFE) regression with unit and time fixed effects, feels natural but can be deeply flawed.
The problem, formalized by Goodman-Bacon (2021), is that with staggered treatment timing and heterogeneous effects, the TWFE estimator doesn’t produce a clean weighted average of treatment effects. Instead, it makes comparisons that can include already-treated units as controls for later-treated units. Imagine City A is treated in 2018 and City B in 2020. When estimating the effect in 2021, TWFE may compare the change in City B to the change in City A; but if City A’s treatment effect evolves over time (e.g., grows or fades), that comparison is contaminated, because we’re using a treated unit as a control. The resulting estimate can be biased.
The Solution: Recent work by Callaway and Sant’Anna (2021), Sun and Abraham (2021), and others offers practical solutions to the staggered adoption problem:
- Group-time specific effects: Instead of estimating a single treatment coefficient, estimate separate effects for each treatment cohort at each post-treatment time period. This avoids the problematic implicit comparisons in TWFE.
- Clean control groups: Only use never-treated or not-yet-treated units as controls. This prevents already-treated units from contaminating the comparison group.
- Careful aggregation: Once you have clean cohort-specific estimates, aggregate them into summary measures using appropriate weights, typically weighting by cohort size.
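To make the group-time idea concrete, here is a hand-rolled ATT(g, t) on a simulated staggered panel (cohorts, effect sizes, and column names are all invented; the did/csdid packages implement the full estimator). Each estimate is a simple 2x2 DiD of cohort g against units not yet treated at time t, comparing period g−1 to period t:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
years = np.arange(2016, 2023)
# Cohorts: first treated in 2019, first treated in 2021, never treated (0)
cohorts = {2019: 100, 2021: 100, 0: 100}

rows = []
for g, n_units in cohorts.items():
    for u in range(n_units):
        fe = rng.normal()                   # unit fixed effect
        for t in years:
            y = fe + 0.5 * (t - 2016)       # common trend (parallel trends)
            if g and t >= g:
                y += 2.0                    # true treatment effect = 2
            rows.append({'cohort': g, 'unit': f'{g}-{u}', 'year': t,
                         'earnings': y + rng.normal(scale=0.5)})
df = pd.DataFrame(rows)

def att_gt(df, g, t):
    """ATT(g, t): 2x2 DiD of cohort g vs. clean controls, periods g-1 and t."""
    base, post = g - 1, t
    treat = df[df.cohort == g]
    # Clean controls only: never treated, or not yet treated at time t
    ctrl = df[(df.cohort == 0) | (df.cohort > t)]
    d_treat = (treat[treat.year == post].earnings.mean()
               - treat[treat.year == base].earnings.mean())
    d_ctrl = (ctrl[ctrl.year == post].earnings.mean()
              - ctrl[ctrl.year == base].earnings.mean())
    return d_treat - d_ctrl

print(f"ATT(g=2019, t=2020) = {att_gt(df, 2019, 2020):.2f}")  # near 2
print(f"ATT(g=2021, t=2022) = {att_gt(df, 2021, 2022):.2f}")  # near 2
```

Note that for ATT(2019, 2020) the 2021 cohort still counts as a clean control, exactly the kind of comparison TWFE does not guarantee.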
Diagnostic — The Event Study Plot: Before running any estimator, we should visualize the dynamics of treatment effects. An event study plot shows estimated effects in the periods before and after treatment, relative to a reference period (often the period just before treatment). The DiD approach rests on the parallel trends assumption: in the absence of treatment, the average outcomes for the treated and control groups would have followed parallel paths, i.e., any difference between them would have remained constant over time. Pre-treatment coefficients near zero support this assumption.
These approaches are implemented in the did package in R and the csdid package in Python. Here is a simple Python implementation:
import matplotlib.pyplot as plt
import numpy as np
def plot_event_study(estimates, ci_lower, ci_upper, event_times):
    """
    Visualize dynamic treatment effects with confidence intervals.

    Parameters
    ----------
    estimates : array-like
        Estimated effects for each event time.
    ci_lower, ci_upper : array-like
        Lower and upper bounds of confidence intervals.
    event_times : array-like
        Time periods relative to treatment (e.g., -5, -4, ..., +5).
    """
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.fill_between(event_times, ci_lower, ci_upper, alpha=0.2, color='steelblue')
    ax.plot(event_times, estimates, 'o-', color='steelblue', linewidth=2)
    ax.axhline(y=0, color='gray', linestyle='--', linewidth=1)
    ax.axvline(x=-0.5, color='firebrick', linestyle='--', alpha=0.7,
               label='Treatment onset')
    ax.set_xlabel('Periods Relative to Treatment', fontsize=12)
    ax.set_ylabel('Estimated Effect', fontsize=12)
    ax.set_title('Event Study', fontsize=14)
    ax.legend(fontsize=11)
    ax.grid(alpha=0.3)
    return fig
# Pseudocode using csdid (if available)
from csdid import did_multiplegt # hypothetical
results = did_multiplegt(df, 'earnings', 'treatment', 'city', 'yr')
plot_event_study(results.estimates, results.ci_lower, results.ci_upper, results.event_times)
Part 5: Heterogeneous Treatment Effects
All of the methods we’ve seen so far target the average effect, but the average can be misleading. Suppose training increases earnings by $5,000 for non-graduates but has zero effect for graduates. Reporting just the ATE misses the most actionable insight: whom should we target?
Treatment effect heterogeneity is the norm, not the exception. A drug might help younger patients while harming older ones. An educational intervention might benefit struggling students while having no impact on high performers.
Understanding this heterogeneity serves three purposes:
- Targeting: Allocate treatments to those who benefit most.
- Mechanism: Patterns of heterogeneity reveal how a treatment works. If training only helps workers in specific industries, that tells us something about the underlying mechanism.
- Generalisation: If we run a study in one population and want to apply the findings elsewhere, we need to know whether effects depend on characteristics that differ across populations.
The Conditional Average Treatment Effect (CATE): CATE captures how effects vary with observable characteristics:
τ(x) = E[Y(1) − Y(0) | X = x]
This function gives the expected treatment effect for individuals with characteristics X = x.
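A toy randomized simulation (numbers invented, echoing the $5,000 scenario above) makes the contrast concrete: the ATE averages over two very different subgroup effects, while the subgroup CATEs recover them.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

graduate = rng.binomial(1, 0.5, size=n)
T = rng.binomial(1, 0.5, size=n)             # randomized training
# True effects: $5,000 for non-graduates, $0 for graduates
tau = np.where(graduate == 1, 0, 5_000)
earnings = 30_000 + 10_000 * graduate + tau * T + rng.normal(0, 2_000, size=n)

def diff_in_means(mask):
    """Difference in mean earnings between treated and control within mask."""
    return earnings[mask & (T == 1)].mean() - earnings[mask & (T == 0)].mean()

ate = diff_in_means(np.ones(n, dtype=bool))
cate_grad = diff_in_means(graduate == 1)     # CATE given X = graduate
cate_nongrad = diff_in_means(graduate == 0)  # CATE given X = non-graduate

print(f"ATE: ${ate:,.0f}")                            # ~ $2,500
print(f"CATE (graduates): ${cate_grad:,.0f}")         # ~ $0
print(f"CATE (non-graduates): ${cate_nongrad:,.0f}")  # ~ $5,000
```

With only two subgroups, simple stratified means suffice; causal forests generalize this idea to many covariates at once.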
Modern Tools to Estimate CATE: Causal forests, developed by Athey and Wager, extend random forests to estimate CATEs. The key innovation is that the data used to determine the tree structure is separate from the data used to estimate effects within leaves, enabling valid confidence intervals. The algorithm works by:
- Growing many trees, each on a subsample of the data.
- At each split, choosing the variable that maximizes the heterogeneity in treatment effects between the resulting child nodes.
- Within each leaf, estimating a treatment effect (typically using a difference in means or a small linear model).
- The final CATE for a new point is the average of the leaf-specific estimates across all trees.
Here’s a snippet of a Python implementation:
from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
import numpy as np
import pandas as pd
# X: covariates (age, education, industry, prior earnings, etc.)
# T: binary treatment (enrolled in training)
# Y: outcome (earnings after the program)
# For feature importance later, we need feature names
feature_names = ['age', 'education', 'industry_manufacturing', 'industry_services',
                 'prior_earnings', 'unemployment_duration']

causal_forest = CausalForestDML(
    model_y=GradientBoostingRegressor(n_estimators=200, max_depth=4),   # outcome model
    model_t=GradientBoostingClassifier(n_estimators=200, max_depth=4),  # propensity model
    n_estimators=2000,       # number of trees
    min_samples_leaf=20,     # minimum leaf size
    random_state=42
)
causal_forest.fit(Y=Y, T=T, X=X)

# Individual treatment effect estimates (interpreted as CATE given X)
individual_effects = causal_forest.effect(X)

# Summary statistics
print(f"Average effect (ATE): ${individual_effects.mean():,.0f}")
print(f"Bottom quartile: ${np.percentile(individual_effects, 25):,.0f}")
print(f"Top quartile: ${np.percentile(individual_effects, 75):,.0f}")

# Variable importance: which features drive heterogeneity?
# (econml returns an importance for each feature; we sort and show the top ones)
importances = causal_forest.feature_importances_
for name, imp in sorted(zip(feature_names, importances), key=lambda x: -x[1]):
    if imp > 0.05:  # show only features with >5% importance
        print(f"  {name}: {imp:.3f}")
Part 6: Sensitivity Analysis
We have now covered five sophisticated methods, each of which requires assumptions we cannot fully verify. The question is not whether our assumptions are exactly correct but whether violations would be severe enough to overturn our conclusions. This is where sensitivity analysis comes in. It won’t make our results look more impressive, but it can make our causal analysis more honest and robust.
If an unmeasured confounder would need to be implausibly strong, i.e., more powerful than any observed covariate, to explain away the effect, our findings are robust. If modest confounding could eliminate the effect, we have to interpret the results with caution.
The Omitted Variable Bias Framework: A practical approach, formalized by Cinelli and Hazlett (2020), frames sensitivity in terms of two quantities:
- How strongly would the confounder relate to treatment assignment? (partial R² with treatment)
- How strongly would the confounder relate to the outcome? (partial R² with the outcome)
Using these two quantities, we can create a sensitivity contour plot: a map showing which regions of confounder strength would overturn our findings and which would not.
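One convenient summary from the same framework is the robustness value: the minimum confounder strength (as a partial R², applied equally to treatment and outcome) that would drive the estimate to zero. Here is a minimal sketch of the closed-form expression from Cinelli and Hazlett (2020); their sensemakr package implements the full analysis, and the t-statistic and degrees of freedom below are invented for illustration.

```python
import math

def robustness_value(t_stat, dof):
    """Minimum partial R-squared (with both treatment and outcome) that an
    unmeasured confounder needs to drive the estimate to zero.
    Closed form from Cinelli & Hazlett (2020), for q = 1."""
    f = abs(t_stat) / math.sqrt(dof)   # partial Cohen's f of the treatment
    f2 = f ** 2
    return 0.5 * (math.sqrt(f2 ** 2 + 4 * f2) - f2)

# Example: training coefficient with t = 4.0 and 994 residual dof
rv = robustness_value(4.0, 994)
print(f"Robustness value: {rv:.1%}")
# Reading: a confounder explaining less than this share of residual variance
# in BOTH treatment and outcome cannot fully explain away the estimate
```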
Here is a simple Python implementation:
def sensitivity_summary(estimated_effect, se, r2_benchmarks):
    """
    Illustrate robustness to unmeasured confounding using a simplified approach.

    r2_benchmarks: dict mapping covariate names to their partial R-squared
    with the outcome, providing real-world anchors for "how strong is too strong?"
    """
    print("SENSITIVITY ANALYSIS")
    print("=" * 55)
    print(f"Estimated effect: ${estimated_effect:,.0f} (SE: ${se:,.0f})")
    print(f"Effect would be fully explained by bias >= ${abs(estimated_effect):,.0f}")
    print()
    print("Benchmarking against observed covariates:")
    print("-" * 55)
    for name, r2 in r2_benchmarks.items():
        # Simplified: assume a confounder with the same R-squared as an
        # observed covariate would induce bias proportional to that R-squared
        implied_bias = abs(estimated_effect) * r2
        remaining = estimated_effect - implied_bias
        print(f"\n  If an unobserved confounder is as strong as '{name}':")
        print(f"    Implied bias: ${implied_bias:,.0f}")
        print(f"    Adjusted effect: ${remaining:,.0f}")
        would_overturn = "YES" if abs(remaining) < 1.96 * se else "No"
        print(f"    Would overturn conclusion? {would_overturn}")

# Example usage
sensitivity_summary(
    estimated_effect=3200,
    se=800,
    r2_benchmarks={
        'education': 0.15,
        'prior_earnings': 0.25,
        'age': 0.05
    }
)
Putting It All Together: A Decision Framework
With five methods and a sensitivity toolkit at your disposal, the practical question is which one to use, and when. Here is a flowchart to help with that decision.
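The gist of that decision logic can also be written down as a small rule-of-thumb function (my own rough distillation of the sections above, not a substitute for judgment):

```python
def suggest_method(unmeasured_confounding, has_instrument, has_cutoff_rule,
                   has_panel_data, staggered_timing, care_about_heterogeneity):
    """Map the key features of a study to the methods discussed above."""
    suggestions = []
    if unmeasured_confounding:
        if has_instrument:
            suggestions.append("Instrumental Variables (Part 2)")
        if has_cutoff_rule:
            suggestions.append("Regression Discontinuity (Part 3)")
        if has_panel_data:
            suggestions.append("Callaway-Sant'Anna style DiD (Part 4)"
                               if staggered_timing
                               else "Difference-in-Differences (Part 4)")
        if not suggestions:
            suggestions.append("Reconsider the design: no estimator fixes "
                               "unmeasured confounding for free")
    else:
        # Selection on observables is plausible
        suggestions.append("Doubly Robust Estimation (Part 1)")
    if care_about_heterogeneity:
        suggestions.append("Causal Forests for CATEs (Part 5)")
    suggestions.append("Always: Sensitivity Analysis (Part 6)")
    return suggestions

# Example: staggered city-by-city rollout with self-selection
for s in suggest_method(unmeasured_confounding=True, has_instrument=False,
                        has_cutoff_rule=False, has_panel_data=True,
                        staggered_timing=True, care_about_heterogeneity=True):
    print(s)
```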
Final Thoughts
Causal inference is fundamentally about reasoning rigorously under uncertainty. The methods we’ve explored in this article are all powerful tools, but they remain tools. Each demands judgment about when it applies and vigilance about when it fails. A simple method applied thoughtfully will always outperform a complex method applied blindly.
Recommended Resources for Going Deeper
Here are the resources I have found most valuable:
- For building statistical intuition around causal thinking: Regression and Other Stories by Gelman, Hill, and Vehtari
- For a rigorous treatment of the potential outcomes framework: Causal Inference for Statistics, Social, and Biomedical Sciences by Imbens and Rubin
- For domain-grounded causal reasoning: Causal Inference in Public Health by Glass et al.
- Athey and Imbens (2019), “Machine Learning Methods That Economists Should Know About” — Bridges ML and causal inference.
- Callaway and Sant’Anna (2021) — Essential reading on modern DiD with staggered adoption.
- Cinelli and Hazlett (2020) — The definitive practical guide to sensitivity analysis.
