**Multicollinearity** in regression analysis occurs when two or more explanatory variables are highly correlated with each other, so that they do not provide unique or independent information to the regression model. If the correlation between variables is high enough, it can cause problems when fitting and interpreting the model.

For example, suppose you run a multiple linear regression with the following variables:

**Response variable:** max vertical jump

**Explanatory variables:** shoe size, height, time spent practicing

In this case, the explanatory variables shoe size and height are likely to be highly correlated since taller people tend to have larger shoe sizes. This means that multicollinearity is likely to be a problem in this regression.

Fortunately, it’s possible to detect multicollinearity using a metric known as the **variance inflation factor (VIF)**, which measures how strongly each explanatory variable is correlated with the other explanatory variables in a regression model.
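For reference, the standard definition of the VIF can be written as a short formula. The notation here is introduced for illustration and is not part of the original tutorial: $R_j^2$ is the R-squared obtained by regressing the $j$-th explanatory variable on all of the other explanatory variables.

$$
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
$$

When a variable is unrelated to the others, $R_j^2 = 0$ and the VIF equals 1. As $R_j^2$ approaches 1, the VIF grows without bound.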

This tutorial explains how to use VIF to detect multicollinearity in a regression analysis in Stata.

**Example: Multicollinearity in Stata**

For this example we will use the Stata built-in dataset called *auto*. Use the following command to load the dataset:

sysuse auto

We’ll use the **regress** command to fit a multiple linear regression model using price as the response variable and weight, length, and mpg as the explanatory variables:

regress price weight length mpg

Next, we’ll use the **vif** command (documented in current versions of Stata as **estat vif**) to check for multicollinearity:

vif

This produces a VIF value for each of the explanatory variables in the model. The value for VIF starts at 1 and has no upper limit. A general rule of thumb for interpreting VIFs is as follows:

- A value of 1 indicates there is no correlation between a given explanatory variable and any other explanatory variables in the model.
- A value between 1 and 5 indicates moderate correlation between a given explanatory variable and other explanatory variables in the model, but this is often not severe enough to require attention.
- A value greater than 5 indicates potentially severe correlation between a given explanatory variable and other explanatory variables in the model. In this case, the coefficient estimates and p-values in the regression output are likely unreliable.

We can see that the VIF values for both weight and length are greater than 5, which indicates that multicollinearity is likely a problem in the regression model.
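To see where a VIF value comes from, it can be reproduced by hand with an auxiliary regression. The following is a minimal sketch, assuming the same *auto* dataset. The scalar name `vif_length` is a hypothetical name chosen for illustration:

```stata
* Sketch: reproduce the VIF for length by hand (assumes the auto dataset)
sysuse auto, clear

* Auxiliary regression: length on the other explanatory variables
regress length weight mpg

* VIF = 1 / (1 - R-squared of the auxiliary regression)
scalar vif_length = 1 / (1 - e(r2))
display "VIF for length: " scalar(vif_length)
```

The displayed value should match the figure reported for length by the **vif** command after the full regression.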

**How to Deal with Multicollinearity**

Often the easiest way to deal with multicollinearity is to simply remove one of the problematic variables, since the variable you’re removing is likely redundant anyway and adds little unique or independent information to the model.

To determine which variable to remove, we can use the **corr** command to create a correlation matrix of the variables in the model. The correlation coefficients can help us identify which variables are highly correlated with each other and could be causing the multicollinearity:

corr price weight length mpg

We can see that length is highly correlated with both weight and mpg, and it has the lowest correlation with the response variable price. Thus, removing length from the model could solve the problem of multicollinearity without reducing the overall quality of the regression model.

To test this, we can perform the regression analysis again using just weight and mpg as explanatory variables:

regress price weight mpg

We can see that the adjusted R-squared of this model is **0.2735** compared to **0.3298** in the previous model. This indicates that the overall usefulness of the model decreased only slightly. Next, we can find the VIF values again using the **vif** command (Stata commands are case-sensitive, so it must be typed in lowercase):

vif

Both VIF values are below 5, which indicates that multicollinearity is no longer a problem in the model.
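The two models can also be compared side by side rather than by reading two separate outputs. This is a minimal sketch, assuming the same *auto* dataset. The stored-estimate names `full` and `reduced` are hypothetical names chosen for illustration:

```stata
* Sketch: compare the full and reduced models in one table
sysuse auto, clear

* Fit and store the model with all three explanatory variables
quietly regress price weight length mpg
estimates store full

* Fit and store the model without length
quietly regress price weight mpg
estimates store reduced

* Show coefficients plus N, R-squared, and adjusted R-squared for both
estimates table full reduced, stats(N r2 r2_a) b(%9.3f)
```

The `r2_a` row of the table holds the adjusted R-squared values discussed above, so the cost of dropping length is visible at a glance.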
