of treatments in experiments has a remarkable tendency to balance confounders and other covariates across testing groups. This tendency provides a variety of favorable properties for analyzing the outcomes of experiments and drawing conclusions. Nevertheless, randomization only tends to balance covariates; balance is not guaranteed.
What if randomization doesn’t balance the covariates? Does imbalance undermine the validity of the experiment?
I grappled with this question for a while before I came to a satisfactory conclusion. In this article, I'll walk you through the thought process that led me to understand that experimental validity depends on the independence of the covariates and the treatment, not on balance.
Here are the specific topics that I'll cover:
- Randomization tends to balance covariates
- What causes covariate imbalance even with randomization
- Experimental validity is about independence, not balance
Randomization tends to balance covariates, but there is no guarantee
The Central Limit Theorem (CLT) tells us that a randomly chosen sample's mean is approximately normally distributed, with a mean equal to the population mean and a variance equal to the population variance divided by the sample size. This idea is very applicable to our conversation because we're interested in balance, i.e., whether the means of our random samples are close to each other. The CLT gives us a distribution for these sample means.
Because of the CLT, we can treat the mean of a sample the same way we would any other random variable. If you remember back to Probability 101: given the distribution of a random variable, we can calculate the probability that an individual draw from that distribution falls within a particular range.
Before we get too theoretical, let's jump into an example to build intuition. Say we want to run an experiment that needs two randomly chosen groups of rabbits. We'll assume that an individual rabbit's weight is approximately normally distributed with a mean of 3.5 lbs and a variance of 0.25 (a standard deviation of 0.5 lbs).
The simple Python function below calculates the probability that the mean of a random sample of rabbits falls in a particular range, given the population distribution and a sample size:
import numpy as np
from scipy.stats import norm

def normal_range_prob(lower, upper, pop_mean, pop_std, sample_size):
    """Probability that a sample mean falls between lower and upper."""
    # By the CLT, the sample mean has standard deviation pop_std / sqrt(n)
    sample_std = pop_std / np.sqrt(sample_size)
    upper_prob = norm.cdf(upper, loc=pop_mean, scale=sample_std)
    lower_prob = norm.cdf(lower, loc=pop_mean, scale=sample_std)
    return upper_prob - lower_prob
Let's say that we would consider two sample means balanced if they each fall within +/- 0.10 lbs of the population mean. Moreover, we'll start with a sample size of 100 rabbits per group. We can calculate the probability of a single sample mean falling in this range using our function like below:
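For instance (the snippet repeats the function definition so it runs on its own):

```python
import numpy as np
from scipy.stats import norm

def normal_range_prob(lower, upper, pop_mean, pop_std, sample_size):
    sample_std = pop_std / np.sqrt(sample_size)
    return (norm.cdf(upper, loc=pop_mean, scale=sample_std)
            - norm.cdf(lower, loc=pop_mean, scale=sample_std))

# Balance window of +/- 0.10 lbs around the 3.5 lb mean;
# the population std is sqrt(0.25) = 0.5 lbs
prob = normal_range_prob(3.4, 3.6, 3.5, 0.5, 100)
print(round(prob, 3))  # 0.954
```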

With a sample size of 100 rabbits, we have about a 95% chance of our sample mean falling within 0.1 lbs of the population mean. Because randomly sampling the two groups are independent events, we can use the product rule to calculate the probability of both samples being within 0.1 lbs of the population mean by simply squaring the single-sample probability. So, the probability of the two samples both being balanced and close to the population mean is about 91% (0.954²). If we had three test groups, the probability of all of them balancing near the mean would be 0.954³ ≈ 87%.
There are two relationships I want to call out here: (1) as the sample size goes up, the probability of balancing increases, and (2) as the number of test groups increases, the probability of all of them balancing goes down.
The table below shows the probability of all randomly assigned test groups balancing for multiple sample sizes and test group numbers:
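A table like that can be generated with a short script; the sample sizes and group counts below are illustrative choices, not necessarily the ones in the original table:

```python
import numpy as np
from scipy.stats import norm

def balance_prob(pop_std, half_width, n):
    """P(a single sample mean lands within +/- half_width of the population mean)."""
    z = half_width / (pop_std / np.sqrt(n))
    return norm.cdf(z) - norm.cdf(-z)

# Probability that ALL groups balance = (single-group probability) ** (number of groups)
for n in [25, 50, 100, 200]:
    row = {groups: round(balance_prob(0.5, 0.1, n) ** groups, 2) for groups in (2, 3, 5)}
    print(f"n={n}: {row}")
```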

Here we see that with a sufficiently large sample size, our simulated rabbit weight is very likely to balance, even with 5 test groups. But with a combination of smaller sample sizes and more test groups, that probability shrinks.
Now that we have an understanding of how randomization tends to balance covariates in favorable circumstances, we'll jump into a discussion of why covariates sometimes don't balance out.
Note: In this discussion, we only considered the likelihood that covariates balance near the population mean. Hypothetically, they could balance at a value away from the population mean, but that is even less likely. We ignored that possibility here, but I wanted to call out that it does exist.
Causes of covariate imbalance despite randomized assignment
In the previous section, we built intuition for why covariates tend to balance out under random assignment. Now we'll transition to discussing which factors can drive imbalances in covariates across testing groups.
Below are the five reasons I’ll cover:
- Bad luck in sampling
- Small sample sizes
- Extreme covariate distributions
- Many testing groups
- Many impactful covariates
Bad luck in sampling
Covariate balancing is always a matter of probability, and there is never a perfect 100% chance of balancing. For this reason, there is always a chance, even under ideal randomization conditions, that the covariates in an experiment won't balance.
Small sample sizes
When we have small sample sizes, the variance of the sample-mean distribution is large. This large variance raises the probability of big differences in the average covariates across testing groups, which can ultimately result in covariate imbalance.

Until now, we've also assumed that our treatment groups all have the same sample size. There are many circumstances where we will want different sample sizes across treatment groups. For example, we may have a preferred medication for patients with a particular illness, but we also want to test whether a new medication is better. For a test like this, we want to keep most patients on the preferred medication while randomly assigning some patients to the potentially better, but untested, medication. In situations like this, the smaller testing group will have a wider distribution for its sample mean and therefore a higher probability of a sample mean far from the population mean, which can cause imbalance.
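A quick sketch of that effect, reusing the rabbit numbers (population std 0.5, a +/- 0.1 balance window) for one large and one small group:

```python
import numpy as np
from scipy.stats import norm

def balance_prob(pop_std, half_width, n):
    # P(a single sample mean lands within +/- half_width of the population mean)
    z = half_width / (pop_std / np.sqrt(n))
    return norm.cdf(z) - norm.cdf(-z)

prob_large = balance_prob(0.5, 0.1, 1000)  # large group on the preferred treatment
prob_small = balance_prob(0.5, 0.1, 50)    # small group on the new treatment
print(round(prob_large, 3), round(prob_small, 3))
```

The small group balances noticeably less often, purely because its sample mean has a wider distribution.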
Extreme covariate distributions
The CLT says that the sample mean of a distribution is approximately normally distributed given a sufficient sample size. However, "sufficient" is not the same for all distributions. Extreme distributions (heavily skewed or heavy-tailed) require a larger sample size for the sample mean to become approximately normal. If a population has covariates with extreme distributions, larger samples will be required for the sample means to behave nicely. If your sample sizes are relatively large, but still too small to compensate for the extreme distributions, you may face the small-sample-size problem we discussed in the previous section even though you have a seemingly large sample.
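A small simulation illustrates this; the lognormal here is a stand-in for a hypothetical heavy-tailed covariate (think income rather than rabbit weight):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# 20,000 samples of size 30 from a heavy-tailed lognormal population
n, reps = 30, 20_000
sample_means = rng.lognormal(mean=0.0, sigma=2.0, size=(reps, n)).mean(axis=1)

# For a normal population, the skewness of the sample mean would be near 0;
# here the sample-mean distribution is still strongly right-skewed at n = 30
print(round(skew(sample_means), 1))
```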

Many testing groups
Ideally, we want all testing groups to have balanced covariates. As the number of testing groups increases, that becomes less and less likely. Even in the extreme case where a single testing group has a 99% chance of being near the population mean, with 100 groups the probability that all of them balance is 0.99¹⁰⁰ ≈ 37%, so more often than not at least one will fall outside that range.
While 100 testing groups may seem extreme, it is not unusual to have many testing groups. Common experimental designs include multiple factors to be tested, each at various levels. Imagine we're testing the efficacy of different plant nutrients on plant growth. We might want to test 4 different nutrients, each at 3 different concentration levels. If this experiment were full factorial (a test group for every possible combination of treatments), we would create 3⁴ = 81 test groups.
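Counting those combinations directly (the nutrient names here are placeholders):

```python
from itertools import product

nutrients = ["N1", "N2", "N3", "N4"]   # hypothetical nutrient labels
levels = ["low", "medium", "high"]

# One concentration level chosen per nutrient: 3 * 3 * 3 * 3 combinations
groups = list(product(levels, repeat=len(nutrients)))
print(len(groups))  # 81
```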
Many impactful covariates
In our rabbit experiment example, we only discussed a single covariate. In practice, we want all impactful covariates to balance out. The more impactful covariates there are, the less likely complete balance is to be achieved. Similar to the problem of too many testing groups, each covariate has some probability of not balancing; the more covariates, the less likely it is that all of them will balance. We should consider not only the covariates we know are important, but also the unmeasured ones we don't track or even know about. We want those to balance too.
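As a rough illustration, suppose each covariate independently balances with probability 0.95 (a simplifying assumption; real covariates are often correlated). The chance that all of them balance shrinks quickly with the covariate count:

```python
# Probability that ALL k covariates balance, assuming independence and a
# 0.95 per-covariate balance probability (both are illustrative assumptions)
probs = {k: 0.95 ** k for k in (1, 3, 5, 10, 20)}
for k, p in probs.items():
    print(f"{k} covariates: {p:.2f}")
```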
Those are five reasons that we may not see balance in our covariates. It isn't a comprehensive list, but it is enough to give us a good grasp of where the problem usually comes up. We are now in a good position to start talking about why experiments are valid even when covariates don't balance.
Experiment validity is about independence, not balance
Balanced covariates have benefits when analyzing the results of an experiment, but they aren't required for validity. In this section, we will explore why balance is helpful, but not necessary, for a valid experiment.
Advantages of balanced covariates
When covariates balance across test groups, treatment effect estimates tend to be more precise, with lower variance in the experimental sample.
It is often a good idea to include covariates in the analysis of an experiment. When covariates balance, estimated treatment effects are less sensitive to the inclusion and specification of covariates in the analysis. When covariates don't balance, both the magnitude and interpretation of the estimated treatment effect can depend more heavily on which covariates are included and how they are modeled.
Why balance is just not required for a legitimate experiment
While balance is ideal, it isn't required for a valid experiment. Experimental validity is about breaking the treatment's dependence on every covariate. Correct randomization always breaks the systematic relationship between treatment and all covariates, and once that dependence is broken, the experiment is valid.
Let's return to our rabbit example. If we allowed the rabbits to self-select their diet, there could be factors that affect both weight gain and diet selection. Maybe younger rabbits prefer the higher-fat diet, and younger rabbits are more likely to gain weight as they grow. Or perhaps there is a genetic marker that makes rabbits more likely to gain weight and more likely to prefer higher-fat meals. Self-selection can cause all sorts of confounding issues in the conclusions of our analysis.
If instead we randomized the diet assignment, the systematic relationships between diet (treatment) and age or genetics (confounders) would be broken, and our experimental process would be valid. As a result, any remaining association between treatment and covariates is due to chance rather than selection, and causal inference from the experiment is valid.
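A small simulation (with made-up numbers) makes the contrast concrete: under self-selection the treatment stays correlated with age, while under randomization it doesn't:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
age = rng.uniform(0, 5, n)  # hypothetical rabbit ages in years

# Self-selection: younger rabbits are more likely to pick the high-fat diet
p_high_fat = 1 / (1 + np.exp(age - 2.5))   # probability decreases with age
self_selected = rng.random(n) < p_high_fat

# Randomization: diet assigned by a fair coin flip, ignoring age entirely
randomized = rng.random(n) < 0.5

print(round(np.corrcoef(age, self_selected)[0, 1], 2))  # strongly negative
print(round(np.corrcoef(age, randomized)[0, 1], 2))     # near zero
```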

While randomization breaks the link between confounders and treatments and makes the experimental process valid, it doesn't guarantee that our experiment won't come to an incorrect conclusion.
Think about simple hypothesis testing from your intro statistics course. We randomly draw a sample from a population to decide whether the population mean is different from a given value. This process is valid, meaning it has well-defined long-run error rates, but bad luck in a single random sample can cause Type I or Type II errors. In other words, the approach is sound, even though it doesn't guarantee a correct conclusion every time.

Randomization in experimentation works the same way. It's a valid approach to causal inference, but that doesn't mean every individual randomized experiment will yield the correct conclusion. Chance imbalances and sampling variation can still affect the results of any individual experiment. The possibility of erroneous conclusions doesn't invalidate the approach.
Wrapping it up
Randomization tends to balance covariates across treatment groups, but it doesn't guarantee balance in any single experiment. What randomization guarantees is validity: the systematic relationship between treatment assignment and covariates is broken by design. Covariate balance improves precision, but it is not a prerequisite for valid causal inference. When imbalance occurs, covariate adjustment can mitigate its consequences. The key takeaway is that balance is desirable and helpful, but randomization (not balance) is what makes an experiment valid.
