For simplicity we’ll examine Simpson’s paradox by focusing on two cohorts, female and male adults.
Examining this data, we can make three statements about the three variables of interest:
- Gender is an independent variable (it doesn’t “listen to” the other two)
- Treatment depends on Gender (as we can see, in this setting the level given depends on Gender: women were given, for some reason, a higher dosage)
- End result depends on both Gender and Treatment
According to these, we can draw the causal graph as follows:
Notice how each arrow helps communicate the statements above. Just as important, the lack of an arrow pointing into Gender conveys that it’s an independent variable.
We also notice that, because it has arrows pointing to both Treatment and End result, Gender is considered a common cause of the two.
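As a minimal sketch, the three statements above can be encoded as a small directed graph in plain Python. The dictionary layout and the helper names `parents` and `common_causes` are my own illustrative choices, not part of any causal library.

```python
# Causal graph as an adjacency mapping: node -> nodes it points to.
# Node names mirror the three statements above.
causal_graph = {
    "Gender": ["Treatment", "End result"],
    "Treatment": ["End result"],
    "End result": [],
}


def parents(graph, node):
    """Return the nodes that have an arrow pointing into `node`."""
    return [p for p, children in graph.items() if node in children]


def common_causes(graph, a, b):
    """Return the nodes that point into both `a` and `b`."""
    return sorted(set(parents(graph, a)) & set(parents(graph, b)))


# No arrow points into Gender, so it is an independent variable.
print(parents(causal_graph, "Gender"))  # []

# Gender points into both Treatment and End result: a common cause.
print(common_causes(causal_graph, "Treatment", "End result"))  # ['Gender']
```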
The essence of Simpson’s paradox is that although the End result is affected by changes in Treatment, as expected, information also flows through a backdoor path via Gender.
The solution to this paradox, as you may have guessed by this stage, is that the common cause Gender is a confounding variable that needs to be controlled.
Controlling for a variable, in terms of a causal graph, means eliminating the relationship between Gender and Treatment.
This can be done in two ways:
- Pre data collection: Setting up a Randomised Control Trial (RCT) in which participants are given a dosage regardless of their Gender.
- Post data collection: In this made-up scenario the data has already been collected, so we need to deal with what’s known as Observational Data.
In both pre- and post-data collection, eliminating the dependency of Treatment on Gender (i.e., controlling for Gender) can be done by modifying the graph so that the arrow between them is removed, as follows:
Applying this “graphical surgery” means that the last two statements need to be modified (for convenience I’ll write all three):
- Gender is an independent variable
- Treatment is an independent variable
- End result depends on Gender and Treatment (but with no backdoor path)
This lets us obtain the causal relationship of interest: we can assess the direct impact of changing Treatment on the End result.
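The graphical surgery described above can be sketched as removing every arrow that points into Treatment. This is only an illustration on a plain dictionary; the function name `remove_incoming_arrows` is hypothetical.

```python
def remove_incoming_arrows(graph, node):
    """Return a copy of `graph` without any arrow pointing into `node`."""
    return {
        parent: [child for child in children if child != node]
        for parent, children in graph.items()
    }


# Graph before surgery: Gender is a common cause of Treatment and End result.
observed_graph = {
    "Gender": ["Treatment", "End result"],
    "Treatment": ["End result"],
    "End result": [],
}

# Graph after surgery: Treatment no longer depends on Gender.
intervened_graph = remove_incoming_arrows(observed_graph, "Treatment")
print(intervened_graph)
```

After the surgery, Gender still points to End result, but the path from Treatment back through Gender is gone, which matches the three modified statements.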
The process of controlling for a confounder, i.e. manipulating the data generation process, is formally known as applying an intervention. That is to say, we are no longer passive observers of the data; we are taking an active role in modifying it to assess the causal impact.
How is this manifested in practice?
In the case of the RCT, the researcher needs to make sure to control for important confounding variables. Here we limit the discussion to Gender (but in real-world settings you can imagine other variables such as Age, Social Status and anything else that might be relevant to one’s health).
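To make the difference concrete, here is a small simulation sketch of how dosage might be assigned in the two settings. The assignment probabilities (80% of women and 20% of men receiving the high dose in the observational case) are made up for illustration, as are the function names and the binary high/low dose simplification.

```python
import random


def simulate_assignment(n, randomised, seed=0):
    """Simulate (gender, treatment) pairs for `n` participants."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        gender = rng.choice(["female", "male"])
        if randomised:
            # RCT: a coin flip decides the dose, regardless of gender.
            p_high = 0.5
        else:
            # Observational setting (assumed numbers): dose depends on gender.
            p_high = 0.8 if gender == "female" else 0.2
        treatment = "high" if rng.random() < p_high else "low"
        rows.append((gender, treatment))
    return rows


def share_high(rows, gender):
    """Fraction of the given gender that received the high dose."""
    doses = [t for g, t in rows if g == gender]
    return sum(t == "high" for t in doses) / len(doses)


for label, randomised in [("observational", False), ("RCT", True)]:
    rows = simulate_assignment(20_000, randomised)
    print(label, round(share_high(rows, "female"), 2), round(share_high(rows, "male"), 2))
```

In the simulated RCT the share of high doses comes out roughly equal for both genders, which is the data-level counterpart of removing the arrow from Gender to Treatment.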
RCTs are considered the gold standard for causal analysis in many experimental settings thanks to their practice of controlling for confounding variables. That said, they have many drawbacks:
- They can be expensive to recruit individuals for and can be logistically complicated
- The intervention under investigation may not be physically possible or ethical to conduct (e.g., one can’t ask randomly chosen people to smoke or not for ten years)
- The artificial setting of a laboratory is not the true natural habitat of the population
Observational data, on the other hand, is much more available in industry and academia, and hence less expensive, and it could be more representative of the actual habits of individuals. But as illustrated in the Simpson’s diagram, it may have confounding variables that need to be controlled.
This is where ingenious solutions developed in the causal community over the past few decades are making headway. Detailing them is beyond the scope of this post, but I briefly mention how to learn more at the end.
To resolve this Simpson’s paradox with the given observational data, one:
- Calculates for each cohort the impact of a change in the treatment on the end result
- Calculates the weighted average of each cohort’s contribution to the population.
Here we’ll focus on intuition, but in a future post we’ll describe the maths behind this solution.
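For intuition only, the two steps listed above can be sketched on a toy table. Everything here is an assumption for illustration: the counts are made up, Treatment is simplified to a high or low dose, and the End result is simplified to whether a participant recovered.

```python
# Made-up counts: (gender, treatment) -> (participants, recovered)
counts = {
    ("female", "high"): (80, 32),
    ("female", "low"): (20, 6),
    ("male", "high"): (20, 16),
    ("male", "low"): (80, 56),
}


def recovery_rate(cells):
    """Recovered / participants, pooled over (participants, recovered) cells."""
    total = sum(n for n, _ in cells)
    recovered = sum(r for _, r in cells)
    return recovered / total


def naive_effect(counts):
    """High-dose minus low-dose recovery rate, ignoring Gender."""
    high = [v for (_, t), v in counts.items() if t == "high"]
    low = [v for (_, t), v in counts.items() if t == "low"]
    return recovery_rate(high) - recovery_rate(low)


def adjusted_effect(counts):
    """Per-cohort effect, averaged with weights equal to cohort share."""
    genders = sorted({g for g, _ in counts})
    population = sum(n for n, _ in counts.values())
    effect = 0.0
    for g in genders:
        # Step 1: impact of the treatment within this cohort.
        cohort_effect = (
            recovery_rate([counts[(g, "high")]]) - recovery_rate([counts[(g, "low")]])
        )
        # Step 2: weight the cohort by its share of the population.
        cohort_size = counts[(g, "high")][0] + counts[(g, "low")][0]
        effect += cohort_effect * cohort_size / population
    return effect


print(f"naive:    {naive_effect(counts):+.2f}")     # naive:    -0.14
print(f"adjusted: {adjusted_effect(counts):+.2f}")  # adjusted: +0.10
```

With these toy numbers the high dose helps within each cohort (+0.10 for both women and men), yet the pooled comparison points the other way (-0.14), because women received the high dose more often and recovered less often overall.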
I’m sure that many analysts, just like myself, have noticed Simpson’s paradox in their data at some point and hopefully have corrected for it. Now you know the name of this effect and can hopefully start to appreciate how useful causal tools are.
That said … being confused at this stage is OK 😕
I’ll be the first to admit that I struggled to understand this concept, and it took me three weekends of deep diving into examples to internalise it. This was the gateway drug to causality for me. Part of my process for understanding statistics is playing with data. For this purpose I created an interactive web application hosted on Streamlit which I call Simpson’s Calculator 🧮. I’ll write a separate post about it in the future.
Even if you are confused, the main takeaways of Simpson’s paradox are that:
- It’s a situation where trends can exist in subgroups but reverse for the whole.
- It can be resolved by identifying confounding variables between the treatment and the end result variables and controlling for them.
This raises the question: should we just control for all variables apart from the treatment and end result? Let’s keep this in mind when resolving Berkson’s paradox.
