From 2000 to 2013, a flood of research showed a striking correlation between the rate of risky behavior among adolescents and how often they ate meals with their family.
Study after study seemed to reach the same conclusion:
The greater the number of meals per week that adolescents had with their family, the lower their odds of indulging in substance abuse, violence, delinquency, vandalism, and many other problem behaviors.
A higher frequency of family meals also correlated with reduced stress, reduced incidence of childhood depression, and reduced frequency of suicidal thoughts. Eating together correlated with increased self-esteem and generally greater emotional well-being among adolescents.
Soon, the media got wind of these results, and they were packaged and distributed as easy-to-consume sound bites, such as this one:
“Studies show that the more often families eat together, the less likely kids are to smoke, drink, do drugs, get depressed, develop eating disorders and consider suicide, and the more likely they are to do well in school, delay having sex, eat their vegetables, learn big words and know which fork to use.” — TIME Magazine, “The magic of the family meal”, June 4, 2006
One of the largest studies on the subject was conducted in 2012 by the National Center on Addiction and Substance Abuse (CASA) at Columbia University. CASA surveyed 1,003 American teenagers aged 12 to 17 about various aspects of their lives.
CASA found the same, and in some cases startlingly clear, correlations between the number of meals adolescents had with their family and a broad range of behavioral and emotional parameters.
There was no escaping the conclusion.
Family meals make well-adjusted teens.
Until you read what is literally the last sentence in CASA’s 2012 white paper:
“Because this is a cross-sectional survey, the data cannot be used to establish causality or measure the direction of the relationships that are observed between pairs of variables in the White Paper.”
And so here we come to a few salient points.
The frequency of family meals may not be the only driver of the reduction in risky behaviors among adolescents. It may not even be the primary driver.
Families who eat together more frequently may do so simply because they already share a comfortable relationship and communicate well with one another.
Eating together may even be the effect of a healthy, well-functioning family.
And children from such families may simply be less likely to indulge in risky behaviors and more likely to enjoy better mental health.
Several other factors are also at play. Factors such as demography, the child’s personality, and the presence of the right role models at home, school, or elsewhere might make children less prone to risky behaviors and poor mental health.
Clearly, the truth, as is often the case, is murky and multivariate.
Although, make no mistake, ‘Eat together’ is not bad advice, as advice goes. The trouble with it is the following:
Most of the studies on this topic, including the CASA study, as well as a very thorough meta-analysis of 14 other studies published by Goldfarb et al. in 2013, did in fact carefully measure and tease out the partial effects of exactly these factors on adolescent risky behavior.
So what did the researchers find?
They found that the partial effect of the frequency of family meals on the observed rate of risky behaviors in adolescents was considerably diluted when other factors such as demography, personality, and the nature of the relationship with the family were included in the regression models. The researchers also found that in some cases, the partial effect of the frequency of family meals disappeared completely.
Here, for instance, is a finding from Goldfarb et al. (2013) (FFM = Frequency of Family Meals):
“The associations between FFM and the outcome in question were most likely to be statistically significant with unadjusted models or univariate analyses. Associations were less likely to be significant in models that controlled for demographic and family characteristics or family/parental connectedness. When methods like propensity score matching were used, no significant associations were found between FFM and alcohol or tobacco use. When methods to adjust for time-invariant individual characteristics were used, the associations were significant about half the time for substance use, five of 16 times for violence/delinquency, and two of two times for depression/suicide ideation.”
Wait, but what does all this have to do with bias?
The relevance to bias comes from two unfortunately co-existing properties of the frequency-of-family-meals variable:
- On one hand, most studies on the subject found that the frequency of family meals does have an intrinsic partial effect on the susceptibility to risky behavior. But the effect is weak once you factor in other variables.
- At the same time, the frequency of family meals is also heavily correlated with several other variables, such as the nature of interpersonal relationships with other family members, the nature of communication within the family, the presence of role models, the personality of the child, and demographics such as household income. All of these variables, it was found, have a strong joint correlation with the rate of indulgence in risky behaviors.
The way the math works is that if you unwittingly omit even a single one of these other variables from your regression model, the coefficient of the frequency of family meals gets biased in the negative direction. In the next two sections, I’ll show exactly why that happens.
This negative bias on the coefficient of the frequency of family meals will make it appear that simply increasing the number of times families sit together to eat should, by itself, considerably reduce the incidence of, oh, say, alcohol abuse among adolescents.
This phenomenon is known as Omitted Variable Bias. It is one of the most frequently occurring, and most easily missed, biases in regression studies. If not spotted and accounted for, it can lead to unfortunate real-world consequences.
For example, any social policy that disproportionately stresses the need to increase the number of times families eat together as a major means of reducing childhood substance abuse will inevitably miss its design goal.
Now, you might ask, isn’t much of this problem caused by choosing explanatory variables that correlate with one another so strongly? Isn’t it just an example of a sloppily conducted variable-selection exercise? Why not choose variables that are correlated only with the response variable?
After all, shouldn’t a skilled statistician be able to employ their ample training and imagination to identify a set of factors that have no more than a passing correlation with each other and that are likely to be strong determinants of the response variable?
Sadly, in any real-world setting, finding a set of explanatory variables that are only slightly (or not at all) correlated is the stuff of dreams, if even that.
But to paraphrase G. B. Shaw, if your imagination is full of ‘fairy princesses and noble natures and fearless cavalry charges’, you might just chance upon a complete set of perfectly orthogonal explanatory variables, as statisticians like to so evocatively call them. Then again, I’ll bet you the Brooklyn Bridge that even in your sweetest statistical dreamscapes, you will not find them. You are more likely to stumble into the non-conforming Loukas and the reality-embracing Captain Bluntschlis instead of greeting the quixotic Rainas and Major Saranoffs.
And so we must learn to live in a world where explanatory variables freely correlate with one another, while at the same time influencing the response of the model to varying degrees.
In our world, omitting one of these variables, whether by accident, through innocent ignorance of its existence, for lack of the means to measure it, or through sheer carelessness, causes the model to be biased. We might as well develop a better appreciation of this bias.
In the rest of this article, I’ll explore Omitted Variable Bias in detail. Specifically, I’ll cover the following:
- The definition and properties of omitted variable bias.
- A formula for estimating the omitted variable bias.
- An analysis of the omitted variable bias in a model of adolescent risky behavior.
- A demonstration and calculation of omitted variable bias in a regression model trained on a real-world dataset.
From a statistical perspective, omitted variable bias is defined as follows:
When an important explanatory variable is omitted from a regression model and the truncated model is fitted on a dataset, the expected values of the estimated coefficients of the non-omitted variables in the fitted model shift away from their true population values. This shift is known as omitted variable bias.
Even when a single important variable is omitted, the expected values of the coefficients of all the non-omitted explanatory variables in the model become biased. No variable is spared from the bias.
Magnitude of the bias
In linear models, the magnitude of the bias depends on the following three quantities:
- Covariance of the non-omitted variable with the omitted variable: The bias on a non-omitted variable’s estimated coefficient is directly proportional to the covariance of the non-omitted variable with the omitted variable, conditioned upon the rest of the variables in the model. In other words, the more tightly correlated the omitted variable is with the variables that are left behind, the heavier the price you pay for omitting it.
- Coefficient of the omitted variable: The bias on a non-omitted variable’s estimated coefficient is directly proportional to the population value of the coefficient of the omitted variable in the full model. The greater the influence of the omitted variable on the model’s response, the bigger the hole you dig for yourself by omitting it.
- Variance of the non-omitted variable: The bias on a non-omitted variable’s estimated coefficient is inversely proportional to the variance of the non-omitted variable, conditioned upon the rest of the variables in the model. The more scattered the non-omitted variable’s values are around its mean, the less affected it is by the bias. This is yet another place in which the well-known bias-variance tradeoff makes its presence felt.
Direction of the bias
In general, the direction of the omitted variable bias on the estimated coefficient of a non-omitted variable is unfortunately hard to assess. Whether the bias will inflate or attenuate the estimate is difficult to tell without actually knowing the omitted variable’s coefficient in the full model, and without knowing the conditional covariance and conditional variance of the non-omitted variable.
In this section, I’ll present the formula for the Omitted Variable Bias that applies to the coefficients of linear models only. But the general concepts and principles of how the bias works, and the factors it depends on, carry over easily to many other kinds of models.
Consider the following linear model, which regresses y on x_1 through x_m and a constant:
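y = γ_0*1 + γ_1*x_1 + γ_2*x_2 + … + γ_m*x_m + ϵ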
In this model, γ_1 through γ_m are the population values of the coefficients of x_1 through x_m respectively, and γ_0 is the intercept (a.k.a. the regression constant). ϵ is the regression error. It captures the variance in y that x_1 through x_m and γ_0 are jointly unable to explain.
As a side note, y, x_1 through x_m, 1, and ϵ are all column vectors of size n x 1, meaning they each contain n rows and 1 column, with n being the number of samples in the dataset on which the model operates.
Lest you get ready to take flight and flee, let me assure you that beyond mentioning the above fact, I will not go any further into matrix algebra in this article. But you have to let me say the following: if it helps, I find it useful to imagine an n x 1 column vector as a vertical cabinet with (n - 1) internal shelves and a number sitting on each shelf.
Anyway.
Now, let’s omit the variable x_m from this model. After omitting x_m, the truncated model looks like this:
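y = β_0*1 + β_1*x_1 + β_2*x_2 + … + β_(m-1)*x_(m-1) + ϵ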
In the above truncated model, I’ve replaced all the gammas with betas to remind us that after dropping x_m, the coefficients of the truncated model will be decidedly different from those in the full model.
The question is, how different are the betas from the gammas? Let’s find out.
If you fit (train) the truncated model on the training data, you’ll get a fitted model. Let’s represent the fitted model as follows:
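y = β_0_cap*1 + β_1_cap*x_1 + β_2_cap*x_2 + … + β_(m-1)_cap*x_(m-1) + e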
In the fitted model, β_0_cap through β_(m-1)_cap are the fitted (estimated) values of the coefficients β_0 through β_(m-1). ‘e’ is the residual error, which captures the variance in the observed values of y that the fitted model is unable to explain.
The theory says that the omission of x_m has biased the expected value of each coefficient β_0_cap through β_(m-1)_cap away from its true population value, γ_0 through γ_(m-1) respectively.
Let’s examine the bias on the estimated coefficient β_k_cap of the kth regression variable, x_k.
The amount by which the expected value of β_k_cap in the fitted truncated model is biased is given by the following equation:
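E(β_k_cap | x_1, …, x_m) = γ_k + γ_m * Covariance(x_k, x_m | rest of the variables) / Variance(x_k | rest of the variables)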
Let’s note the following things about the above equation:
- β_k_cap is the estimated coefficient of the non-omitted variable x_k in the truncated model. You get this estimate of β_k from fitting the truncated model on the data.
- E(β_k_cap | x_1 through x_m) is the expected value of the above-mentioned estimate, conditioned on all the observed values of x_1 through x_m. Note that x_m is actually not observed. We’ve omitted it, remember? Anyway, the expectation operator E() has the following meaning: if you train the truncated model on thousands of randomly drawn datasets, you’ll get thousands of different estimates of β_k_cap. E(β_k_cap) is the mean of all these estimates.
- γ_k is the true population value of the coefficient of x_k in the full model.
- γ_m is the true population value of the coefficient of the variable x_m that was omitted from the full model.
- The covariance term in the above equation represents the covariance of x_k with x_m, conditioned on the rest of the variables in the full model.
- Similarly, the variance term represents the variance of x_k conditioned on all the other variables in the full model.
The above equation tells us the following:
- First and foremost, had x_m not been omitted, the expected value of β_k_cap in the fitted model would have been γ_k. This is a property of all linear models fitted using the OLS technique: the expected value of each estimated coefficient in the fitted model is the unbiased population value of the respective coefficient.
- However, because of the missing x_m in the truncated model, the expected value of β_k_cap has become biased away from its population value, γ_k.
- The amount of bias is the ratio of the conditional covariance of x_k with x_m to the conditional variance of x_k, scaled by γ_m.
The above formula for the omitted variable bias should give you a first glimpse of the appalling carnage wreaked on your regression model should you unwittingly omit even a single explanatory variable that happens to be not only highly influential but also heavily correlated with one or more of the non-omitted variables in the model.
As we’ll see in the next section, that is, regrettably, exactly what happens in a certain kind of flawed model for estimating the rate of risky behavior in adolescents.
Let’s apply the formula for the omitted variable bias to a model that tries to explain the rate of risky behavior in adolescents. We’ll examine a scenario in which one of the regression variables is omitted.
But first, we’ll look at the full (non-omitted) version of the model. Specifically, let’s consider a linear model in which the rate of risky behavior is regressed on suitably quantified versions of the following four factors:
- the frequency of family meals,
- how well-informed a child thinks their parents are about what’s going on in their life,
- the quality of the relationship between parent and child, and
- the child’s intrinsic personality.
For simplicity, we’ll use the variables x_1, x_2, x_3 and x_4 to represent the above four regression variables.
Let y represent the response variable, namely the rate of risky behaviors.
The linear model is as follows:
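y = γ_0*1 + γ_1*x_1 + γ_2*x_2 + γ_3*x_3 + γ_4*x_4 + ϵ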
We’ll study the biasing effect of omitting x_2 (how well-informed a child thinks their parents are about what’s going on in their life) on the coefficient of x_1 (the frequency of family meals).
If x_2 is omitted from the above linear model and the truncated model is fitted, the fitted model looks like this:
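y = β_0_cap*1 + β_1_cap*x_1 + β_3_cap*x_3 + β_4_cap*x_4 + e
(The remaining variables keep their original subscripts so that β_1_cap stays paired with x_1.)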
In the fitted model, β_1_cap is the estimated coefficient of the frequency of family meals. Thus, β_1_cap quantifies the partial effect of the frequency of family meals on the rate of risky behavior in adolescents.
Using the formula for the omitted variable bias, we can state the expected value of the partial effect of x_1 as follows:
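E(β_1_cap | x_1, x_2, x_3, x_4) = γ_1 + γ_2 * Covariance(x_1, x_2 | x_3, x_4) / Variance(x_1 | x_3, x_4)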
Studies have shown that the frequency of family meals (x_1) happens to be heavily correlated with how well-informed a child thinks their parents are about what’s going on in their life (x_2). Now look at the covariance in the numerator of the bias term. Since x_1 is highly correlated with x_2, the large covariance makes the numerator large.
If that weren’t enough, the same studies have shown that x_2 (how well-informed a child thinks their parents are about what’s going on in their life) is itself heavily, and inversely, correlated with the rate of risky behavior that the child indulges in (y). Therefore, we’d expect the coefficient γ_2 in the full model to be large and negative.
The large covariance and the large negative γ_2 join forces to make the bias term large and negative. It’s easy to see how such a large negative bias will drive the expected value of β_1_cap deep into negative territory.
It’s this large negative bias that makes it seem like the frequency of family meals has an outsized partial effect in explaining the rate of risky behavior in adolescents.
All of this bias is caused by the inadvertent omission of a single highly influential variable.
Until now, I’ve relied on equations and formulae to provide a descriptive demonstration of how omitting an important variable biases a regression model.
In this section, I’ll show you the bias in action on real-world data.
For illustration, I’ll use the following dataset of automobiles published by UC Irvine.
Each row in the dataset contains 26 characteristics of a unique vehicle. The characteristics include the make, the number of doors, engine features such as fuel type, number of cylinders, and engine aspiration, physical dimensions of the vehicle such as length, breadth, height, and wheelbase, and the vehicle’s fuel efficiency on city and highway roads.
There are 205 unique vehicles in this dataset.
Our goal is to build a linear model for estimating the fuel efficiency of a vehicle in the city.
Out of the 26 variables covered by the data, only two variables, curb weight and horsepower, happen to be the most potent determinants of fuel efficiency. Why these two specifically? Because, out of the 25 potential regression variables in the dataset, only curb weight and horsepower have statistically significant partial correlations with fuel efficiency. If you are curious how I went about the process of identifying these variables, take a look at my article on the partial correlation coefficient.
A linear model of fuel efficiency (in the city) regressed on curb weight and horsepower is as follows:
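city_mpg = γ_1*curb_weight + γ_2*horsepower + ϵ    (1)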
Notice that the above model has no intercept. That’s because when either one of curb weight or horsepower is zero, the other one ought to be zero as well. And you will agree that it would be quite unusual to come across a vehicle with zero weight and zero horsepower but somehow sporting a positive mileage.
So next, we’ll filter out the rows in the dataset containing missing data. And from the remaining data, we’ll carve out two randomly chosen datasets for training and testing the model in an 80:20 ratio. After doing this, the training data happens to contain 127 vehicles.
If you were to train the model in equation (1) on the training data using Ordinary Least Squares, you would get the estimates γ_1_cap and γ_2_cap for the coefficients γ_1 and γ_2.
At the end of this article, you’ll find the link to the Python code for this training, plus all the other code used in this article.
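If you want a rough sense of what that code does, here is a minimal sketch, assuming the UCI automobile data has been loaded into a pandas DataFrame with columns named city_mpg, curb_weight and horsepower (the file name, column names, and split seed are placeholders and may differ from the linked code):

```python
import pandas as pd
import statsmodels.api as sm

# Load the autos data ('autos.csv' is a placeholder file name; the raw UCI file
# marks missing values with '?') and drop rows with missing values of interest.
df = pd.read_csv('autos.csv', na_values='?')
df = df.dropna(subset=['city_mpg', 'curb_weight', 'horsepower'])

# Carve out a random 80% training split.
train = df.sample(frac=0.8, random_state=42)

# Full model (equation 1): city_mpg regressed on curb_weight and horsepower, no intercept.
full_model = sm.OLS(train['city_mpg'], train[['curb_weight', 'horsepower']]).fit()

# Truncated model (equation 3): horsepower omitted.
truncated_model = sm.OLS(train['city_mpg'], train[['curb_weight']]).fit()

print(full_model.params)       # estimates of gamma_1 and gamma_2
print(truncated_model.params)  # estimate of beta_1

# Estimate the omitted variable bias on beta_1_cap by plugging sample values
# into the bias formula (equation 5).
gamma_2_cap = full_model.params['horsepower']
bias = gamma_2_cap * train['curb_weight'].cov(train['horsepower']) / train['curb_weight'].var()
print(bias)
```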
Meanwhile, here is the equation of the trained model:
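city_mpg = 0.0193*curb_weight − 0.2398*horsepower + e    (2)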
Now suppose you were to omit the variable horsepower from the model. The truncated model looks like this:
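city_mpg = β_1*curb_weight + ϵ    (3)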
If you were to train the model in equation (3) on the training data using OLS, you would get the following estimate for β_1:
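city_mpg = 0.01*curb_weight + e    (4)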
Thus, β_1_cap is 0.01. This is different from the 0.0193 in the full model.
Because of the omitted variable, the expected value of β_1_cap has become biased as follows:
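E(β_1_cap | curb_weight, horsepower) = γ_1 + γ_2 * Covariance(curb_weight, horsepower) / Variance(curb_weight)    (5)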
As mentioned earlier, in an unbiased linear model fitted using OLS, the expected value of β_1_cap would be the population value of β_1, which is γ_1. Thus, in an unbiased model:
E(β_1_cap) = γ_1
But the omission of horsepower has biased this expectation as shown in equation (5).
To calculate the bias, you need to know three quantities:
- γ_2: This is the population value of the coefficient of horsepower in the full model shown in equation (1).
- Covariance(curb_weight, horsepower): This is the population value of the covariance.
- Variance(curb_weight): This is the population value of the variance.
Unfortunately, none of these three values is computable, because the overall population of all vehicles is inaccessible to you. All you have is a sample of 127 vehicles.
In practice, though, you can estimate the bias by substituting sample values for the population values.
Thus, in place of γ_2, you can use γ_2_cap = −0.2398 from equation (2).
Similarly, using the training data of 127 vehicles as the data sample, you can calculate the sample covariance of curb_weight and horsepower, and the sample variance of curb_weight.
The sample covariance comes out to be 11392.85. The sample variance of curb_weight comes out to be 232638.78.
With these values, the bias term in equation (5) can be estimated as follows:
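bias ≈ γ_2_cap * sample Covariance(curb_weight, horsepower) / sample Variance(curb_weight)
= −0.2398 * 11392.85 / 232638.78
= −0.01174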
Getting a feel for the impact of the omitted variable bias
To get a sense of how strong this bias is, let’s return to the fitted full model:
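city_mpg = 0.0193*curb_weight − 0.2398*horsepower + e    (2)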
In the above model, γ_1_cap = 0.0193. Our calculation shows that the bias on the estimated value of γ_1 is 0.01174 in the negative direction. The magnitude of this bias is 0.01174/0.0193*100 = 60.83, in other words an alarming 60.83% of the estimated value of γ_1.
There is no gentle way to say this: omitting the highly influential variable horsepower has wreaked havoc on your simple linear regression model.
Omitting horsepower has precipitously attenuated the expected value of the estimated coefficient of the non-omitted variable curb_weight. Using equation (5), you can approximate the attenuated value of this coefficient as follows:
E(β_1_cap | curb_weight, horsepower)
= γ_1_cap + bias = 0.0193 − 0.01174 = 0.00756
Remember once more that you are working with estimates instead of the actual values of γ_1 and the bias.
Nevertheless, the attenuated value of γ_1_cap estimated this way (0.00756) matches closely with the estimate of 0.01 returned by fitting the truncated model of city_mpg (equation 4) on the training data. I’ve reproduced it below.
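city_mpg = 0.01*curb_weight + e    (4)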
Here are the links to the Python code and the data used for building and training the full and the truncated models, and for calculating the Omitted Variable Bias on E(β_1_cap).
By the way, every time you run the code, it will pull a randomly chosen set of training data from the overall autos dataset. Training the full and truncated models on this training data will yield slightly different estimated coefficient values. Therefore, every time you run the code, the bias on E(β_1_cap) will also be slightly different. In fact, this illustrates rather nicely why the estimated coefficients are themselves random variables and why they have their own expected values.
Let’s summarize what we learned.
- Omitted variable bias is caused when one or more important variables are omitted from a regression model.
- The bias affects the expected values of the estimated coefficients of all non-omitted variables. The bias causes the expected values to become either greater or smaller than their true population values.
- Omitted variable bias will make the non-omitted variables look either more important or less important than they actually are in terms of their influence on the response variable of the regression model.
- The magnitude of the bias on each non-omitted variable is directly proportional to how correlated the non-omitted variable is with the omitted variable(s), and also to how influential the omitted variable(s) are on the response variable of the model. The bias is inversely proportional to how dispersed the non-omitted variable is around its mean.
- In most real-world cases, the direction of the bias is hard to assess without computing it.