In regression analysis, multicollinearity occurs when two or more predictor variables are highly correlated with each other, such that they do not provide unique or independent information in the regression model.
If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the regression model.
One way to detect multicollinearity is with a metric known as the variance inflation factor (VIF), which measures how much the variance of a predictor's estimated regression coefficient is inflated by that predictor's correlation with the other explanatory variables in the model.
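For reference, the standard definition of the VIF can be written as a single formula. The notation here is our own and is not taken from SAS output: j indexes the predictor variables, and R_j^2 is the R-squared obtained by regressing predictor j on all of the other predictors in the model.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

When a predictor is uncorrelated with the others, R_j^2 = 0 and the VIF equals 1. As R_j^2 approaches 1, the VIF grows without bound.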
This tutorial explains how to calculate VIF in SAS.
Example: Calculating VIF in SAS
For this example we’ll create a dataset that describes the attributes of 10 basketball players:
```sas
/*create dataset*/
data my_data;
    input rating points assists rebounds;
    datalines;
90 25 5 11
85 20 7 8
82 14 7 10
88 16 8 6
94 27 5 6
90 20 7 9
76 12 6 6
75 15 9 10
87 14 9 10
86 19 5 7
;
run;

/*view dataset*/
proc print data=my_data;
run;
```
Suppose we would like to fit a multiple linear regression model using rating as the response variable and points, assists, and rebounds as the predictor variables.
We can use PROC REG to fit this regression model along with the VIF option to calculate the VIF values for each predictor variable in the model:
```sas
/*fit regression model and calculate VIF values*/
proc reg data=my_data;
    model rating = points assists rebounds / vif;
run;
```
From the Parameter Estimates table we can see the VIF values for each of the predictor variables:
- points: 1.76398
- assists: 1.96591
- rebounds: 1.17503
Note: Ignore the VIF for the “Intercept” in the model since this value is irrelevant.
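To see where one of these numbers comes from, the VIF for points can be reproduced by hand. The following is a minimal sketch, assuming the my_data dataset created earlier: it regresses points on the other two predictors, then computes 1 / (1 - R-squared) from that auxiliary regression. The dataset names aux_est and aux_vif and the variable vif_points are hypothetical names chosen for illustration.

```sas
/*auxiliary regression: regress one predictor on the remaining predictors*/
/*RSQUARE adds the _RSQ_ variable to the OUTEST= dataset*/
proc reg data=my_data outest=aux_est rsquare noprint;
    model points = assists rebounds;
run;
quit;

/*VIF = 1 / (1 - R-squared of the auxiliary regression)*/
data aux_vif;
    set aux_est;
    vif_points = 1 / (1 - _RSQ_);
    keep _DEPVAR_ _RSQ_ vif_points;
run;

proc print data=aux_vif;
run;
```

The value of vif_points should agree with the VIF that PROC REG reports for points, up to rounding. Repeating the step with assists or rebounds as the response gives the other two values.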
The value for VIF starts at 1 and has no upper limit. A rule of thumb for interpreting VIFs is as follows:
- A value of 1 indicates there is no correlation between a given predictor variable and any other predictor variables in the model.
- A value between 1 and 5 indicates moderate correlation between a given predictor variable and other predictor variables in the model, but this is often not severe enough to require attention.
- A value greater than 5 indicates potentially severe correlation between a given predictor variable and other predictor variables in the model. In this case, the coefficient estimates and p-values in the regression output are likely unreliable.
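With only three predictors the rule of thumb is easy to apply by eye, but for larger models it can help to label each VIF automatically. The sketch below assumes the my_data dataset from this example and captures the Parameter Estimates table with ODS OUTPUT. It also assumes the VIF column in that table is named VarianceInflation, which is worth confirming with PROC CONTENTS on your installation. The names vif_table, vif_flags, and severity are hypothetical.

```sas
/*save the Parameter Estimates table, including VIF values, to a dataset*/
ods output ParameterEstimates=vif_table;
proc reg data=my_data;
    model rating = points assists rebounds / vif;
run;
quit;

/*label each predictor using the rule of thumb (intercept excluded)*/
data vif_flags;
    set vif_table;
    where upcase(Variable) ne 'INTERCEPT';
    length severity $ 20;
    if VarianceInflation > 5 then severity = 'potentially severe';
    else if VarianceInflation > 1 then severity = 'moderate';
    else severity = 'none';
    keep Variable VarianceInflation severity;
run;

proc print data=vif_flags;
run;
```

The cutoffs of 1 and 5 come straight from the rule of thumb and can be changed to suit a stricter or looser standard.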
Because every VIF value for the predictor variables in our regression model is below 2, well under the threshold of 5, multicollinearity is not a problem in our example.
How to Deal with Multicollinearity
If you determine that multicollinearity is a problem in your regression model, there are a few common ways to deal with it:
1. Remove one or more of the highly correlated variables.
This is the quickest fix in most cases and is often an acceptable solution because the variables you’re removing are redundant anyway and add little unique or independent information to the model.
2. Linearly combine the predictor variables in some way, such as adding them together or subtracting one from the other.
By doing so, you can create one new variable that encompasses the information from both variables, and you no longer have an issue of multicollinearity.
3. Perform an analysis that is designed to account for highly correlated variables, such as principal component analysis or partial least squares (PLS) regression.
Rather than using the original predictors directly, these techniques replace them with a smaller set of uncorrelated components.
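The three remedies above can be sketched in SAS as follows. None of them is actually needed for the basketball data, since its VIF values are low, so this is an illustration only. The choice to drop assists, the combined variable points_assists, and the dataset names my_data_combined and pc_scores are hypothetical, and keeping two principal components is an arbitrary choice for the sketch.

```sas
/*option 1: drop one of the correlated predictors*/
proc reg data=my_data;
    model rating = points rebounds / vif;
run;
quit;

/*option 2: replace two predictors with a linear combination*/
data my_data_combined;
    set my_data;
    points_assists = points + assists;
run;

proc reg data=my_data_combined;
    model rating = points_assists rebounds / vif;
run;
quit;

/*option 3a: principal component analysis, then regress on the components*/
/*OUT= saves the component scores as Prin1, Prin2, ...*/
proc princomp data=my_data out=pc_scores;
    var points assists rebounds;
run;

proc reg data=pc_scores;
    model rating = Prin1 Prin2 / vif;
run;
quit;

/*option 3b: partial least squares regression*/
proc pls data=my_data method=pls;
    model rating = points assists rebounds;
run;
```

Because principal components are uncorrelated with one another by construction, the VIF values in option 3a should all equal 1.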