**Omitted variable bias** occurs when a relevant explanatory variable is left out of a regression model, which can bias the coefficient estimates of one or more explanatory variables that are in the model.

An omitted variable is often left out of a regression model for one of two reasons:

**1.** Data for the variable is simply not available.

**2.** The effect of the explanatory variable on the response variable is unknown.

In order for the omitted variable to actually bias the coefficients in the model, the following two requirements must be met:

**1.** The omitted variable must be correlated with one or more explanatory variables in the model.

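**2.** The omitted variable must be correlated with the response variable in the model. More precisely, it must have its own effect on the response variable, beyond what the included explanatory variables already account for.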
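The two requirements can be checked with a small simulation. The sketch below is only an illustration on made-up data: the helper name `slope_of_a_when_b_omitted`, the true model Y = 1 + 2A + (effect of B)·B + noise, and all numeric settings are assumptions introduced here, and the second requirement is taken to mean that B has its own effect on Y. The slope on A should drift away from its true value of 2 only when both requirements hold.

```python
import numpy as np

def slope_of_a_when_b_omitted(corr_ab, effect_b, n=50_000, seed=0):
    """Fit Y ~ A (leaving B out) on simulated data and return the slope on A.

    Assumed true model (illustrative only): Y = 1 + 2*A + effect_b*B + noise.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=n)
    # Build B so that corr(A, B) is approximately corr_ab.
    b = corr_ab * a + np.sqrt(1 - corr_ab**2) * rng.normal(size=n)
    y = 1 + 2 * a + effect_b * b + rng.normal(size=n)
    # Ordinary least squares with an intercept and A only (B is omitted).
    X = np.column_stack([np.ones(n), a])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

# Both requirements hold: B is correlated with A and affects Y.
both = slope_of_a_when_b_omitted(corr_ab=0.6, effect_b=3.0)
# Requirement 1 fails: B is uncorrelated with A.
no_corr = slope_of_a_when_b_omitted(corr_ab=0.0, effect_b=3.0)
# Requirement 2 fails: B has no effect of its own on Y.
no_effect = slope_of_a_when_b_omitted(corr_ab=0.6, effect_b=0.0)

print(f"both requirements met: {both:.2f}")
print(f"B uncorrelated with A: {no_corr:.2f}")
print(f"B has no effect on Y:  {no_effect:.2f}")
```

Under these assumptions, only the first case should produce a slope noticeably different from 2; in the other two, omitting B leaves the estimate for A essentially unbiased.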

**The Effects of Omitted Variable Bias**

Suppose we have two explanatory variables, A and B, and one response variable, Y. Suppose we fit a simple linear regression model with A as the only explanatory variable and we leave B out of the model.

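If B is correlated with A *and* correlated with Y, then leaving it out will bias the coefficient estimate of A. The direction of the bias depends on the signs of the two relationships: if B's relationships with A and with Y have the same sign (both positive or both negative), the coefficient estimate of A is positively biased; if they have opposite signs, it is negatively biased.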
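The direction of the bias can be made precise with the standard omitted variable bias formula for this two-variable setup. The symbols below are notation introduced here for illustration, and the result assumes the true model is linear with an error term ε that is uncorrelated with both A and B:

$$
Y = \beta_0 + \beta_1 A + \beta_2 B + \varepsilon
$$

$$
\hat{\beta}_1^{\,\text{(B omitted)}} \;\longrightarrow\; \beta_1 + \beta_2 \, \frac{\operatorname{Cov}(A, B)}{\operatorname{Var}(A)}
$$

The second term is the bias. Its sign is the sign of β₂ (the effect of B on Y) multiplied by the sign of the covariance between A and B, and it vanishes if either one is zero.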

**Example: Omitted Variable Bias**

Suppose we want to study the effect that square footage has on house price so we fit the following simple linear regression model:

House price = β₀ + β₁(square footage)

Suppose we find the estimated model to be:

**House price = 40,203.91 + 118.31(square footage)**

The way we would interpret the coefficient for square footage is that *each additional square foot is associated with an increase in house price of $118.31, on average.*

However, suppose we leave out the explanatory variable *age*, which turns out to be highly negatively correlated with square footage and highly negatively correlated with house price. This variable should be in the model, but it’s not. Thus, the coefficient estimate for square footage is likely biased.

Because *age* is negatively correlated with both the explanatory variable and the response variable in the model, the two relationships have the same sign, so we would expect the coefficient estimate for square footage to be positively biased.

Suppose we find data for house age and then include it in the model. The model then becomes:

House price = β₀ + β₁(square footage) + β₂(age)

Suppose we find the estimated model to be:

**House price = 123,426.20 + 81.06(square footage) – 1,291.04(age)**

Note that the coefficient estimate for square footage dropped substantially, from 118.31 to 81.06, which means it *was* positively biased in the previous model.

The way we would interpret the coefficient for square footage in this model is that *each additional square foot is associated with an average increase in house price of $81.06, assuming age is held constant.*
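The same pattern can be reproduced on synthetic data. The sketch below does not use the data behind the estimates above: every number in the data-generating process, and the helper name `ols`, is a made-up assumption chosen only so that older homes tend to be both smaller and cheaper. It fits the model with and without age and compares the square footage coefficients.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Hypothetical data-generating process (all numbers are illustrative).
age = rng.uniform(0, 60, size=n)                       # house age in years
sqft = 2_500 - 15 * age + rng.normal(0, 300, size=n)   # older homes are smaller
price = 120_000 + 80 * sqft - 1_300 * age + rng.normal(0, 20_000, size=n)

def ols(y, *columns):
    """Return least-squares coefficients: intercept first, then one per column."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

short_fit = ols(price, sqft)       # age omitted
long_fit = ols(price, sqft, age)   # age included

print(f"sqft coefficient, age omitted:  {short_fit[1]:.2f}")
print(f"sqft coefficient, age included: {long_fit[1]:.2f}")
print(f"age coefficient:                {long_fit[2]:.2f}")
```

In this simulated setup the true square footage effect is 80, so the fit that includes age should land near that value, while the fit that omits age should overstate it, mirroring the positive bias in the example.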

**What to Do About Omitted Variable Bias**

Unfortunately, omitted variable bias occurs often in the real world because there are usually some variables that *should* be included in a regression model but aren’t, either because data for them isn’t available or because their relationship with the response variable is unknown.

If possible, you should try to include any and all relevant explanatory variables in a regression model so that you can understand the true relationship between the explanatory variables and the response variable.

Leaving relevant explanatory variables out of a model can significantly affect the interpretation of the model, as we saw in the previous example with house prices.
