Often in statistics we’re interested in estimating the effect that one variable has on another. For example, perhaps we want to know:
- How does amount of time spent studying affect exam scores?
- How does a certain drug affect blood pressure?
- How does stress affect heart rate?
In each scenario, we want to understand how a predictor variable affects a response variable. Often, however, other variables also influence the response, and sometimes the predictor as well, which clouds the relationship between the two.
For example, suppose we use a certain drug as our predictor variable and blood pressure as our response variable. We are interested only in the effect that the drug has on blood pressure.
However, other variables like time spent exercising, overall diet, and stress levels also affect blood pressure.
Thus, if we run a simple linear regression using the drug as our predictor variable and blood pressure as our response variable, we can’t be sure that the regression coefficient will accurately capture the effect that the drug has on blood pressure. Outside factors (exercise, diet, stress, etc.) also play a role, and if they are related to who takes the drug, their influence gets mixed into the drug’s coefficient.
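To make the problem concrete, here is a minimal simulation sketch. The data-generating process, the variable names, and every number in it (a true effect of -2, the strength of the stress effect, the sample size) are assumptions invented for illustration, not values from a real study. Stress is treated as an unobserved factor that raises both drug use and blood pressure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical data-generating process (all numbers are made up).
stress = rng.normal(size=n)                  # outside factor we do not observe
drug = 0.8 * stress + rng.normal(size=n)     # stressed people take more of the drug
true_effect = -2.0                           # assumed causal effect of the drug
bp = 120 + true_effect * drug + 3.0 * stress + rng.normal(size=n)

# Simple linear regression of blood pressure on the drug, with stress left out.
X = np.column_stack([np.ones(n), drug])
intercept, slope = np.linalg.lstsq(X, bp, rcond=None)[0]

print(f"true effect: {true_effect}, simple regression slope: {slope:.2f}")
```

Under these assumptions the fitted slope lands noticeably closer to zero than the true effect of -2, because part of the stress effect is absorbed into the drug coefficient.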
One potential way to get around this problem is to use an instrumental variable.
What is an Instrumental Variable?
An instrumental variable is a third variable introduced into regression analysis that is correlated with the predictor variable, but related to the response variable only through that predictor. In other words, it has no direct effect on the response and is uncorrelated with the outside factors that influence the response. By using this variable, it becomes possible to estimate the causal effect that a predictor variable has on a response variable.
For example, suppose we want to estimate the effect that a certain drug has on blood pressure.
An example of an instrumental variable that we may use in this regression analysis is an individual’s proximity to a pharmacy.
This variable “proximity” would likely be correlated with whether or not the individual takes the certain drug because people who live far from a pharmacy have a harder time obtaining it in the first place.
However, the variable “proximity” is not expected to affect blood pressure directly. Any association it has with blood pressure should run entirely through the predictor variable.
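These two requirements can also be written in symbols. The notation here is an assumption introduced for illustration: Z is the instrument (proximity), X is the predictor (the drug), Y is the response (blood pressure), and u collects the outside factors in the model Y = B0 + B1(X) + u. With a single instrument, the two conditions lead to a simple ratio for the effect of X on Y:

```latex
\operatorname{Cov}(Z, X) \neq 0, \qquad \operatorname{Cov}(Z, u) = 0
\quad\Longrightarrow\quad
B_1 = \frac{\operatorname{Cov}(Z, Y)}{\operatorname{Cov}(Z, X)}
```

The numerator measures how much the response moves with the instrument, and the denominator rescales that by how much the predictor moves with the instrument.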
The way that we actually use an instrumental variable is through instrumental variables regression, sometimes called two-stage least squares regression.
Instrumental Variables Regression
Instrumental variables regression (or two-stage least squares regression) uses the following approach to estimate the effect that a predictor variable has on a response variable:
Stage 1: Fit a regression model using the instrumental variable as the predictor variable and the original predictor variable as the response.
In our specific example, we would first fit the following regression model:
Certain drug = B0 + B1(proximity)
This model gives us a predicted value of certain drug (cd) for each individual, which we’ll call cdhat.
Stage 2: Fit a second regression model using cdhat as the predictor variable and the original response variable as the response.
Next, we’ll fit the following regression model:
Blood pressure = B0 + B1(cdhat)
If the regression coefficient for cdhat turns out to be statistically significant, this is evidence of a causal effect of the drug on blood pressure, provided the instrument is valid. Note that the standard errors from a second stage fit by hand are not correct, so significance tests should come from a dedicated two-stage least squares routine.
The reasoning is that we used only “proximity” to come up with cdhat. Since proximity should be related to blood pressure only through the drug, any significant association between cdhat and blood pressure in the second-stage regression can be attributed to the certain drug.
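The two stages can be sketched in a few lines of NumPy. This is a minimal illustration on simulated data, not a full implementation: the `ols` helper, the simulated variables, and all coefficients (including the assumed true effect of -2) are hypothetical, and proximity is made independent of stress by construction so that it is a valid instrument. The code produces point estimates only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical data: stress is unobserved, proximity is the instrument.
stress = rng.normal(size=n)
proximity = rng.normal(size=n)               # independent of stress by construction
drug = 1.0 * proximity + 0.8 * stress + rng.normal(size=n)
bp = 120 - 2.0 * drug + 3.0 * stress + rng.normal(size=n)

def ols(x, y):
    """Return (intercept, slope) from a least-squares fit of y on x."""
    X = np.column_stack([np.ones(len(x)), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Stage 1: certain drug = B0 + B1(proximity), then keep the fitted values.
a0, a1 = ols(proximity, drug)
cdhat = a0 + a1 * proximity

# Stage 2: blood pressure = B0 + B1(cdhat).
b0, b1 = ols(cdhat, bp)

# For comparison: simple regression of blood pressure on the drug itself.
naive = ols(drug, bp)[1]

print(f"two-stage estimate: {b1:.2f}, simple regression: {naive:.2f}")
```

Under these assumptions the two-stage slope should sit close to the assumed effect of -2, while the simple regression is pulled toward zero by the omitted stress variable.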
Cautions on Using Instrumental Variables
An instrumental variable should only be used if it meets the following criteria:
- It is highly correlated with the predictor variable.
- It affects the response variable only through the predictor variable, not directly.
- It is not correlated with the other variables that are left out of the model (e.g. proximity is not correlated with exercise, diet, or stress).
If an instrumental variable does not meet these criteria, then it should not be used in the regression model because it will likely produce unreliable and biased results.
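Of the criteria above, only the first can be checked directly from the data. One common check is the F-statistic of the first-stage regression, with a widely used rule of thumb treating values below about 10 as a sign of a weak instrument. The sketch below assumes a single instrument; the helper `first_stage_f` and the simulated data are hypothetical and only illustrate the calculation.

```python
import numpy as np

def first_stage_f(instrument, predictor):
    """F-statistic for the slope when regressing predictor on one instrument."""
    n = len(instrument)
    r2 = np.corrcoef(instrument, predictor)[0, 1] ** 2
    # With one regressor and an intercept, F = R^2 / (1 - R^2) * (n - 2).
    return r2 / (1.0 - r2) * (n - 2)

rng = np.random.default_rng(2)
n = 500
proximity = rng.normal(size=n)
noise = rng.normal(size=n)

drug_strong = 1.0 * proximity + noise    # proximity matters a lot for drug use
drug_weak = 0.02 * proximity + noise     # proximity barely matters

f_strong = first_stage_f(proximity, drug_strong)
f_weak = first_stage_f(proximity, drug_weak)

print(f"strong instrument F: {f_strong:.1f}, weak instrument F: {f_weak:.1f}")
```

The other criteria concern variables that are not in the model, so they generally have to be argued from subject-matter knowledge rather than tested.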
Bonus: A Video Explanation of Instrumental Variables
The following video by Ashley Hodgson provides an excellent visual explanation of instrumental variables: