The Limitations of Machine Learning
As a data scientist in today's digital age, it's essential to be equipped to answer a variety of questions that go far beyond simple pattern recognition. Typical machine learning is built on association; it seeks to find patterns in existing data to predict future observations under the assumption that the underlying system remains constant. When you train a model to predict house prices, you're asking the algorithm to find the most likely price given a set of features.
However, causal analysis introduces a "what if" component. It goes beyond observation to ask how the system would react if we actively changed a variable. This is the difference between noticing that people who buy expensive lattes are also more likely to buy sports cars and understanding whether lowering the price of that coffee will actually cause an increase in car sales. In the world of causal inference, we're essentially attempting to learn the underlying laws of a business or social system, allowing us to predict the outcomes of actions we haven't taken yet.
Causal analysis is critical in a variety of fields where we need to move beyond observing patterns to make decisions, particularly in areas like healthcare, marketing, and public policy. Consider a medical researcher evaluating a new blood pressure medication and its effect on heart attack severity. With historical data, you may see that patients taking the medication actually have more severe heart attacks. A standard ML (Machine Learning) model would suggest the drug is harmful. However, this is likely due to confounding: doctors only prescribe the medication to patients who already have poorer health. To find the truth, we must isolate the drug's actual impact from the noise of the patients' existing conditions.
In this article, I'll introduce some of the important concepts and tools in causal ML in an accessible manner. I'll only use libraries that manage data, calculate probabilities, and estimate regression parameters. This article is not a tutorial, but a starting point for those interested in but intimidated by causal inference methods. I was inspired by the online book by Matheus Facure Alves (see Suggested Reading). Note: For those unfamiliar with probability, E[X] refers to the average value that a random variable/quantity X takes.
The Potential Outcomes Framework
When we start a causal study, the questions we ask are far more specific than loss minimization or prediction accuracy. We typically start with the Average Treatment Effect (ATE), which tells us the mean impact of an intervention or action across an entire population.
In our medical example, we want to know the difference in heart attack severity if the entire population took the drug versus if the entire population didn't. To define this mathematically, we use the Potential Outcomes Framework. First, let's define a few variables:
- Y: The Outcome (e.g., a heart attack severity score from 0 to 100).
- T: The Treatment indicator, a binary "switch": T = 1 means the patient took the drug; T = 0 means the patient did not take the drug (the Control).
- Y(1): The outcome we would see if the patient was treated.
- Y(0): The outcome we would see if the patient was not treated.
The theoretical ATE is the expected difference between these two potential outcomes across the entire population:

ATE = E[Y(1) - Y(0)]
To handle the dilemma of unobserved outcomes, researchers use the Potential Outcomes Framework as a conceptual guide. In this framework, we assume that for each individual, there exist two "potential" results: Y(1) and Y(0). We only ever observe one of these two values, which is known as the Fundamental Problem of Causal Inference.
If a patient takes the medication (T=1), we see their factual outcome, Y(1). Their outcome without the medication, Y(0), remains a counterfactual, a state of the world that could have existed but did not.
ATE in a Perfect World
Since the individual treatment effect is the difference between these two values, it remains hidden from us. This shifts the entire goal of causal estimation away from the individual and toward the group. Because we cannot subtract a counterfactual from a factual for one person, we must find clever ways to compare groups of individuals.
If the group receiving the treatment is statistically comparable to the group that is not, we can use the average observed outcome of one group to stand in for the missing counterfactual of the other. This allows us to estimate the Average Treatment Effect by calculating the difference between the mean outcome of the treated group and the mean outcome of the control group:
ATE = E[Y | T=1] - E[Y | T=0]
Suppose that for those who took the drug, we observed a mean heart attack severity of 56/100, compared to 40/100 for those who didn't. If we attempt to estimate the causal effect by taking a simple difference in means, the data suggest that taking the drug led to a 16-point increase in severity.
E[Y | T=1] = 56, E[Y | T=0] = 40 -> Biased ATE = 16
Unless this drug is among the most dangerous ever created, there is likely another mechanism at play. This discrepancy arises because we can only interpret a simple difference in means as the Average Treatment Effect if the treatment was assigned through a Randomized Controlled Trial (RCT), which ensures completely random assignment of treatment groups. Without randomization, the treated and control groups are not exchangeable and differ in ways that make a direct comparison misleading.
Randomization
The reason an RCT is the default method for calculating the ATE is that it helps eliminate Selection Bias. In our medical example, the 16-point harm we observed likely occurred because doctors gave the drug to the highest-risk patients. In this scenario, the treated group was already predisposed to higher severity scores before they ever took the pill. When we use an RCT, we remove the human element of selection. With this randomized selection, we ensure that high-risk and low-risk patients are distributed equally between both groups.
Mathematically, randomization ensures that the treatment assignment is independent of the potential outcomes: (Y(1), Y(0)) ⊥ T.
Now, we can assume that the average outcome of the treated group is a perfect proxy for what would have happened if the entire population had been treated. Because the "Treated" and "Control" groups start as statistical clones of each other, any difference we see at the end of the study must be caused by the drug itself.
Observational Data and Confounders
In the real world, we are often forced to work with observational data. In these situations, the simple difference in means fails us due to the presence of confounders. A confounder is a variable that influences both the treatment and the outcome, creating a "backdoor path" that allows a non-causal correlation to flow between them.
In order to visualize these hidden relationships, causal researchers use Directed Acyclic Graphs (DAGs). A DAG is a specialized graph where variables are represented as nodes and causal relationships are represented as arrows. Directed means that the arrows have a specific direction, indicating a one-way causal flow from a cause to an effect. Acyclic means the graph contains no cycles; you cannot follow a sequence of arrows and end up back at the first variable, mainly because transitioning from one node to the next should represent a lapse in time. A confounder will reveal itself in a DAG by its directed connection to both the treatment and the outcome, as seen below.
Once we have identified the confounders through our DAG, the next step is to mathematically account for them. If we want to isolate the true effect of the drug, we need to compare patients who are similar in every way apart from whether or not they took the drug. In causal analysis, the most important tool for this is linear regression. By including the confounder as an independent variable, the model estimates the treatment's relationship with the outcome while holding the initial health of the patient constant. For our example, I generated a mock dataset where treatment assignment was dependent on initial health (I.H.). This can be seen in the code below, where both the probability of receiving the drug and the severity depend on the initial health score.

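A minimal sketch of such a data-generating process is shown below. The coefficients, noise level, and variable names are my assumptions for illustration; the essentials are that initial health drives both treatment assignment and severity, and that the true drug effect is a 10-point decrease.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Initial health score (higher = healthier); this is the confounder.
initial_health = rng.uniform(20, 90, n)

# Sicker patients are more likely to be prescribed the drug,
# so treatment assignment depends on initial health.
p_drug = 1 / (1 + np.exp(0.1 * (initial_health - 55)))
drug = rng.binomial(1, p_drug)

# Severity worsens with poor initial health; the true drug effect is -10.
severity = 90 - 0.5 * initial_health - 10 * drug + rng.normal(0, 5, n)

df = pd.DataFrame({
    "initial_health": initial_health,
    "drug": drug,
    "severity": severity,
})
```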
In this view, individuals who received the drug had a mean severity 3.47 points higher than those who did not. To find the truth, we run an OLS (Ordinary Least Squares) multiple linear regression model to adjust for the participants' initial health score.

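A sketch of that model using statsmodels' formula interface (an assumption on my part; any OLS routine would do):

```python
import statsmodels.formula.api as smf

# Regress severity on the treatment while holding initial health constant;
# the coefficient on drug is the adjusted treatment effect estimate.
ols_model = smf.ols("severity ~ drug + initial_health", data=df).fit()
print(ols_model.summary())
```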
The most important finding here is the coefficient of the treatment variable (drug). While the raw data suggested the drug was harmful, our coefficient is roughly -9.89. This means that when we control for the confounder of initial health, taking the drug actually decreases heart attack severity by nearly 10 points. This is very close to our true effect, which was a decrease of exactly 10 points!
This is a result more in line with our expectations, and that's because we eliminated a large source of selection bias by controlling for confounders. The beauty of linear regression in this context is that the setup is comparable to that of a typical regression problem. Transformations can be applied, diagnostic plots can be produced, and slopes can be interpreted as normal. However, because we are including confounders in our model, their effect on the outcome will not be absorbed into the treatment coefficient, something known as de-biasing or adjusting, as previously mentioned.
Matching and Propensity Scoring
While multiple linear regression is a powerful tool for de-biasing, it relies heavily on the assumption that the relationship between your confounders and the outcome is linear. In many real-world situations, your treated and control groups might be so fundamentally different that a regression model is forced to guess results in regions where it has no actual data.
To solve this, researchers often turn to Matching, a technique that shifts the focus from mathematical adjustment to data restructuring. Instead of using a formula to hold health constant, matching searches the control group for a "twin" for every treated individual. When we pair a patient who took the drug (T = 1) with a patient of nearly identical initial health who didn't (T = 0), we effectively prune our dataset into a Synthetic RCT.
In this balanced subset, the groups are finally exchangeable, allowing us to compare their outcomes directly to reveal the true Average Treatment Effect (ATE). It is almost as if each pair allows us to observe both the factual and the counterfactual states for a single type of observation. When we think about how to match two entries in a dataset, consider that each entry is represented by a vector in an n-dimensional space, where n − 1 is the number of features or confounders.
At first glance, it seems we could simply calculate the distance between these vectors using Euclidean distance. However, the problem with this approach is that all covariates are weighted equally, regardless of their actual causal impact. In high dimensions, a problem known as the curse of dimensionality, even an entry's closest match could still be fundamentally different in the ways that truly matter for the treatment.
Looking at the participants with the lowest health scores in our mock dataset below, we see that treated participant 74 and untreated participant 668 have nearly identical initial health scores. Because we are only dealing with one confounder here, these two are ideal candidates to be matched together. However, as dimensionality increases, it becomes impossible to find these matches by just eyeballing the numbers, and simple Euclidean distance fails to prioritize the variables that truly drive the selection bias.

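Assuming the mock df from earlier, a query like the following surfaces those candidates (participant IDs will vary with the simulation):

```python
# Inspect the participants with the lowest initial health scores; treated and
# control rows with nearly identical scores are natural matching candidates.
print(df.sort_values("initial_health").head(10))
```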
In practice, this process is generally executed as one-to-one matching, where each treated unit is paired with its single closest neighbor in the control group. To ensure these matches are high-quality, we use the Propensity Score: a single number representing the probability that a participant would receive the treatment given their characteristics, P(T = 1 | X). This score collapses our high-dimensional space into a single dimension that specifically reflects the likelihood of treatment given a set of covariates. We then use a k-Nearest Neighbors (k-NN) algorithm to perform a "fuzzy" search on this score.
To prevent poor matches, we can select a threshold to serve as the maximum allowable distance for a match. We can calculate propensity scores in various ways, the most common being logistic regression, but other ML methods capable of outputting probabilities, such as XGBoost or Random Forest, work as well. In the code below, I calculated propensities by setting up a logistic regression model that predicts drug participation from just initial health. In practice, you would have more confounders in your model.
As mentioned, the first step of propensity score matching is calculating the propensity score itself. In our example, initial health is our only confounder, so it will be the sole covariate in our simple logistic regression.

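A sketch using scikit-learn (statsmodels' Logit would work just as well):

```python
from sklearn.linear_model import LogisticRegression

# Model P(T = 1 | X): the probability of receiving the drug given the
# confounder(s), here just initial health.
ps_model = LogisticRegression().fit(df[["initial_health"]], df["drug"])
df["propensity"] = ps_model.predict_proba(df[["initial_health"]])[:, 1]
```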
As expected, participants 74 and 668 were assigned very similar propensities and would likely be matched. It is also often helpful to generate what is known as a Common Support plot, which displays the density of calculated propensity scores separated by treated and control. Ideally, we want to see as much overlap and symmetry as possible, as that means matching units will be easier. As seen below, selection bias is present in our dataset. It is a good exercise to examine the data generation code above and determine why.

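A sketch of how such a plot can be produced with matplotlib:

```python
import matplotlib.pyplot as plt

# Common Support plot: propensity score densities for treated vs. control.
# Heavy overlap means matches are easy; separation signals selection bias.
for flag, label in [(1, "Treated"), (0, "Control")]:
    plt.hist(df.loc[df["drug"] == flag, "propensity"],
             bins=30, density=True, alpha=0.5, label=label)
plt.xlabel("Propensity score")
plt.ylabel("Density")
plt.legend()
plt.show()
```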
Although not strictly necessary in the one-dimensional case, we can then use k-NN to match treated with untreated units based on their propensity score, as sketched below.
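This is a one-directional sketch (matching each treated unit to its nearest control; strictly speaking this estimates the effect on the treated, and matching in both directions recovers the full ATE):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

treated = df[df["drug"] == 1]
control = df[df["drug"] == 0]

# For each treated unit, find the control unit with the closest propensity score.
nn = NearestNeighbors(n_neighbors=1).fit(control[["propensity"]])
_, idx = nn.kneighbors(treated[["propensity"]])
matched = control.iloc[idx.ravel()]

# Average the outcome differences across the matched pairs.
ate_est = np.mean(treated["severity"].to_numpy() - matched["severity"].to_numpy())
print(f"Matching estimate: {ate_est:.2f}")
```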
If you recall from before, our linear regression yielded an ATE of -9.89, compared to our now-calculated value of -10.16. As we increase the complexity and number of covariates in our model, our propensity score matching ATE will likely get closer and closer to the underlying causal effect of -10.
Time-Invariant Effects Using Difference-in-Differences
While matching is great for de-biasing based on the variables we can see, it falls short when there are hidden factors, like a patient's genetic predisposition or a hospital's specific management style, that we haven't recorded in our data. If these unobserved confounders are time-invariant (meaning they stay constant over the study period), we can use Difference-in-Differences (DiD) to cancel them out.
Instead of just comparing the treated group to the control group at a single point in time, DiD looks at two groups over two periods: before and after the treatment. The logic is simple yet elegant: we calculate the change in the control group and assume the treated group would have changed by that same amount if they hadn't received the treatment. Any additional change observed in the treated group is attributed to the treatment itself. The equation for the DiD estimator is as follows:

DiD = (E[Y | T=1, Post] - E[Y | T=1, Pre]) - (E[Y | T=0, Post] - E[Y | T=0, Pre])
While this formula may appear intimidating at first glance, it is best read as the difference in the changes occurring before and after treatment. For example, imagine two ice cream shops in different towns. Before the weekend, Store A (our treatment group) sells 200 cones, and Store B (our control group) sells 300. On Saturday, a heat wave hits Store A's town, but not Store B's. By the end of the day, Store A's sales jump to 500, while Store B's sales rise to 400. A naive analysis of Store A would suggest the heat wave caused a +300 increase. However, the control shop (Store B) grew by +100 in the same period without any heat wave, perhaps due to a holiday or general summer weather.
The Difference-in-Differences approach subtracts this natural time trend of +100 from Store A's total growth: (500 - 200) - (400 - 300) = +200. It effectively cancels out any time-invariant confounders, factors like the store's location or its base popularity, that would have otherwise skewed our results. This reveals that the true causal impact of the heat wave was +200 units.
A major limitation of basic Difference-in-Differences (DiD) is that it doesn't account for factors that change over time. While the "change-in-change" logic successfully cancels out static, time-invariant confounders (like someone's genetic history or a hospital's geographic location), it remains vulnerable to time-varying confounders. These are factors that shift during the study period and affect the treatment and control groups differently.
In our heart attack study, for instance, even a DiD analysis could be biased if the hospitals administering the drug also underwent significant staffing changes or received upgraded equipment during the "Post" period. If we fail to account for these changing variables, the DiD estimator will incorrectly attribute their impact to the drug itself, resulting in a "polluted" causal estimate.
It is important to note that the simple cross-sectional data structure we used for Regression and Matching is insufficient for this method. To calculate a "change in the change," we need a temporal dimension in our dataset. Specifically, we need a variable indicating whether an observation occurred in the Pre-treatment or Post-treatment period for both the treated and control groups.
To resolve this, we move beyond simple subtraction and implement DiD within a Multiple Linear Regression framework. This allows us to explicitly "control" for time-varying factors, effectively isolating the treatment effect while holding external shifts constant.
The regression model is defined as:

Severity = b0 + b1·Drug + b2·Post + b3·(Drug × Post) + b4·QualityOfCare + error

where Drug indicates the treated group, Post indicates the post-treatment period, and the coefficient b3 on the Drug × Post interaction is the difference-in-differences estimator.
Below, a new synthetic dataset is constructed to reflect the required structure. I also added a Quality of Care variable for demonstration purposes. I didn't include the full simulation code due to its length, but it essentially modifies the previous logic by duplicating our observations across two distinct time periods.

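Since the full simulation is omitted, here is a purely illustrative preview of the structure it produces (column names and values are hypothetical):

```python
import pandas as pd

# Each participant appears twice: once pre-treatment (post=0) and once
# post-treatment (post=1); drug marks membership in the treated group.
panel_df = pd.DataFrame({
    "id":              [1, 1, 2, 2],
    "drug":            [1, 1, 0, 0],
    "post":            [0, 1, 0, 1],
    "quality_of_care": [5.1, 7.8, 5.0, 5.2],
    "severity":        [62.3, 48.1, 47.9, 45.6],
})
print(panel_df)
```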
Now that we have our data in the proper format, we can fit a linear regression model using the specification just described.

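A sketch of that specification, reusing the hypothetical panel_df from above:

```python
import statsmodels.formula.api as smf

# "drug * post" expands to drug + post + drug:post; the drug:post
# interaction coefficient is the difference-in-differences estimator.
did_model = smf.ols("severity ~ drug * post + quality_of_care",
                    data=panel_df).fit()
print(did_model.summary())
```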
The R-squared value of 0.324 indicates that the model explains roughly 32.4 percent of the variance in heart attack severity. In causal analysis, this is common, as many unmeasured factors like genetics are treated as noise. The intercept of 48.71 represents the baseline severity for the control group during the pre-treatment period. The drug coefficient of 12.75 confirms selection bias, showing the treated group initially had higher severity scores. Additionally, the quality-of-care coefficient suggests that each unit increase in that index corresponds to a 2.10-point reduction in severity.
The interaction term, drug:post, provides the difference-in-differences estimator, which reveals an estimated drug effect of -6.58. This tells us the medication reduced severity after adjusting for group differences and time trends, though the estimate is notably smaller in magnitude than the true effect of -10. This discrepancy occurs because the quality of care improved specifically for the treated group during the post-treatment period, by construction of the data generation process. Since these two changes happened simultaneously to the same group, they are perfectly correlated, or collinear.
The model essentially faces a mathematical stalemate: it cannot determine whether the improvement came from the drug or the better care, so it splits the credit between them. As with any linear regression, if two variables are perfectly correlated, a model might drop one entirely or produce highly unstable estimates. Nevertheless, all variables maintain p-values of 0.000, confirming that despite the split credit, the results remain statistically significant. In real data and analysis, we will encounter these kinds of situations, and it is important to know all the tools in your data science shed before you tackle a problem.
Conclusion and Final Thoughts
In this article, we explored the transition from standard ML to the logic of causal inference. We saw through synthetic examples that while simple differences in means can be misleading due to selection bias, methods like linear regression, propensity score matching, and difference-in-differences allow us to strip away confounders and isolate true impact.
Having these tools in our arsenal is not enough, however. As seen with our final model, even sophisticated techniques can yield issues when interventions overlap. While these methods are powerful in adjusting for confounding, they require a deep understanding of their underlying mechanics. Relying on model outputs without acknowledging the reality of collinearity or time-varying factors can lead to misleading conclusions.
At the same time, knowing when and how to apply these tools is a valuable skill for any data scientist. In my opinion, one of the best parts of doing statistical programming for causal inference is the fact that most of the methods stem from a few fundamental statistical models, making implementation easier than one might expect.
The real world is undeniably messy and full of data issues, and it is rare that we will observe a perfectly clean causal signal. Causal machine learning is ultimately about exploiting the right data while having the confidence that our variables allow for true adjustment. This article is my first step in documenting my causal inference journey, and I plan to release a part two that dives deeper into more topics, including Instrumental Variables (IV), Panel Regression, Double Machine Learning (DML), and Meta-Learners.
Suggested Reading
Facure, Matheus. Causal Inference for the Brave and True. Available at: https://matheusfacure.github.io/python-causality-handbook/
