What is Perfect Multicollinearity? (Definition & Examples)


In statistics, multicollinearity occurs when two or more predictor variables are highly correlated with each other, such that they do not provide unique or independent information in the regression model.

If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the regression model.

The most extreme case of multicollinearity is known as perfect multicollinearity. This occurs when at least two predictor variables have an exact linear relationship between them.

For example, suppose we have a dataset in which the values of the predictor variable x2 are simply the values of x1 multiplied by 2.

This is an example of perfect multicollinearity: x2 = 2*x1 holds exactly for every observation.
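As a quick sketch with made-up numbers: if x2 is always exactly twice x1, the two columns carry identical information and their correlation is exactly 1:

```r
# hypothetical values where x2 is an exact multiple of x1
x1 <- c(2, 4, 5, 7, 9)
x2 <- 2 * x1

cor(x1, x2)  # 1: x2 carries no information beyond x1
```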

The Problem with Perfect Multicollinearity

When perfect multicollinearity is present in a dataset, the method of ordinary least squares is unable to produce unique estimates for the regression coefficients.

This is because it’s not possible to estimate the marginal effect of one predictor variable (x1) on the response variable (y) while holding another predictor variable (x2) constant: x2 moves in lockstep with x1, so the data contain no observations in which x1 varies while x2 stays fixed.

In short, perfect multicollinearity makes it impossible to estimate a value for every coefficient in a regression model.

How to Handle Perfect Multicollinearity

The simplest way to handle perfect multicollinearity is to drop one of the variables that has an exact linear relationship with another variable.

For example, in our previous dataset we could simply drop x2 as a predictor variable.

We would then fit a regression model using x1 as a predictor variable and y as the response variable.
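A minimal sketch of this fix, using made-up data in which x2 is always exactly 2 * x1: with both predictors included, R cannot estimate a coefficient for x2, but after dropping the redundant column every coefficient is estimable.

```r
# hypothetical data in which x2 is always exactly 2 * x1
df <- data.frame(y  = c(12, 15, 19, 24, 28),
                 x1 = c(2, 4, 5, 7, 9),
                 x2 = c(4, 8, 10, 14, 18))

# with both predictors, the coefficient for x2 comes back NA
coef(lm(y ~ x1 + x2, data=df))

# dropping the redundant predictor lets OLS estimate every coefficient
model <- lm(y ~ x1, data=df)
coef(model)  # intercept and x1 slope, neither is NA
```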

Examples of Perfect Multicollinearity

The following examples show the three most common scenarios of perfect multicollinearity in practice.

1. One Predictor Variable is a Multiple of Another

Suppose we want to use “height in centimeters” and “height in meters” to predict the weight of a certain species of dolphin.

Notice that in this dataset the value for “height in centimeters” is simply equal to “height in meters” multiplied by 100. This is a case of perfect multicollinearity.

If we attempt to fit a multiple linear regression model in R using this dataset, we won’t be able to produce a coefficient estimate for the “centimeters” predictor variable, since R drops the second of the two collinear terms:

#define data
df <- data.frame(weight=c(400, 460, 470, 475, 490, 440, 430, 490, 500, 540),
                 m=c(1.3, .7, .6, 1.3, 1.2, 1.5, 1.2, 1.6, 1.1, 1.4),
                 cm=c(130, 70, 60, 130, 120, 150, 120, 160, 110, 140))

#fit multiple linear regression model
model <- lm(weight~m+cm, data=df)

#view summary of model
summary(model)

Call:
lm(formula = weight ~ m + cm, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-70.501 -25.501   5.183  19.499  68.590 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  458.676     53.403   8.589 2.61e-05 ***
m              9.096     43.473   0.209    0.839    
cm                NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 41.9 on 8 degrees of freedom
Multiple R-squared:  0.005442,	Adjusted R-squared:  -0.1189 
F-statistic: 0.04378 on 1 and 8 DF,  p-value: 0.8395
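Note the line “(1 not defined because of singularities)” and the NA row for cm in the output above. To see exactly which linear dependency caused a coefficient to be dropped, base R’s alias() function reports how the aliased column is built from the others (sketch re-using the dolphin data above):

```r
# same dolphin data as above
df <- data.frame(weight=c(400, 460, 470, 475, 490, 440, 430, 490, 500, 540),
                 m=c(1.3, .7, .6, 1.3, 1.2, 1.5, 1.2, 1.6, 1.1, 1.4),
                 cm=c(130, 70, 60, 130, 120, 150, 120, 160, 110, 140))

model <- lm(weight ~ m + cm, data=df)

# alias() reports the exact linear dependency behind the NA coefficient;
# its "Complete" section shows that cm is 100 times m
alias(model)
```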

2. One Predictor Variable is a Transformed Version of Another

Suppose we want to use “points” and “scaled points” to predict the rating of basketball players.

Let’s assume that the variable “scaled points” is calculated as:

Scaled points = (points − μ_points) / σ_points

where μ_points and σ_points are the mean and standard deviation of “points.”

Notice that each value of “scaled points” is simply a standardized version of “points.” This is a case of perfect multicollinearity.

If we attempt to fit a multiple linear regression model in R using this dataset, we won’t be able to produce a coefficient estimate for the “scaled points” predictor variable:

#define data
df <- data.frame(rating=c(88, 83, 90, 94, 96, 78, 79, 91, 90, 82),
                 pts=c(17, 19, 24, 29, 33, 15, 14, 29, 25, 22))

df$scaled_pts <- (df$pts - mean(df$pts)) / sd(df$pts)

#fit multiple linear regression model
model <- lm(rating~pts+scaled_pts, data=df)

#view summary of model
summary(model)

Call:
lm(formula = rating ~ pts + scaled_pts, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.4932 -1.3941 -0.2935  1.3055  5.8412 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  67.4218     3.5896  18.783 6.67e-08 ***
pts           0.8669     0.1527   5.678 0.000466 ***
scaled_pts        NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.953 on 8 degrees of freedom
Multiple R-squared:  0.8012,	Adjusted R-squared:  0.7763 
F-statistic: 32.23 on 1 and 8 DF,  p-value: 0.0004663

3. The Dummy Variable Trap

Another scenario where perfect multicollinearity can occur is known as the dummy variable trap. This occurs when we want to use a categorical variable in a regression model and convert it into “dummy variables” that take on values of 0 or 1.

For example, suppose we would like to use the predictor variables “age” and “marital status” to predict income.

To use “marital status” as a predictor variable, we need to first convert it to a dummy variable.

To do so, we can let “Single” be the baseline category since it occurs most often, and create a 0/1 dummy variable for each of “Married” and “Divorced.”

A mistake would be to instead create a new dummy variable for every category, one each for “Single,” “Married,” and “Divorced.”

In this case, the dummy variables are redundant: in the dataset below, “married” is always exactly equal to 1 − “single,” so together with the intercept the predictor columns are perfectly collinear. This is an example of perfect multicollinearity.

If we attempt to fit a multiple linear regression model in R using this dataset, we won’t be able to produce a coefficient estimate for every predictor variable:

#define data
df <- data.frame(income=c(45, 48, 54, 57, 65, 69, 78, 83, 98, 104, 107),
                 age=c(23, 25, 24, 29, 38, 36, 40, 59, 56, 64, 53),
                 single=c(1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0),
                 married=c(0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1),
                 divorced=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0))

#fit multiple linear regression model
model <- lm(income~age+single+married+divorced, data=df)

#view summary of model
summary(model)

Call:
lm(formula = income ~ age + single + married + divorced, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.7075 -5.0338  0.0453  3.3904 12.2454 

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  16.7559    17.7811   0.942  0.37739   
age           1.4717     0.3544   4.152  0.00428 **
single       -2.4797     9.4313  -0.263  0.80018   
married           NA         NA      NA       NA   
divorced     -8.3974    12.7714  -0.658  0.53187   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8.391 on 7 degrees of freedom
Multiple R-squared:  0.9008,	Adjusted R-squared:  0.8584 
F-statistic:  21.2 on 3 and 7 DF,  p-value: 0.0006865
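Rather than building 0/1 columns by hand, a common way to avoid the dummy variable trap in R is to store marital status as a single factor: lm() then creates dummy variables for all but one (baseline) level automatically. A sketch using the same income and age values as above, with the two rows flagged as divorced coded as “divorced”:

```r
# same income and age values as above; marital status as one factor column
df <- data.frame(income=c(45, 48, 54, 57, 65, 69, 78, 83, 98, 104, 107),
                 age=c(23, 25, 24, 29, 38, 36, 40, 59, 56, 64, 53),
                 marital=factor(c("single", "single", "single", "single",
                                  "married", "single", "married", "divorced",
                                  "divorced", "married", "married")))

# lm() expands the factor into dummies, omitting the baseline level,
# so every coefficient is estimable (no NA values)
model <- lm(income ~ age + marital, data=df)
coef(model)
```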

Additional Resources

A Guide to Multicollinearity & VIF in Regression
How to Calculate VIF in R
How to Calculate VIF in Python
How to Calculate VIF in Excel
