One of the most common problems that you’ll encounter in machine learning is multicollinearity. This occurs when two or more predictor variables in a dataset are highly correlated.
When this occurs, a model may be able to fit a training dataset well but it may perform poorly on a new dataset it has never seen because it overfits the training set.
One way to get around the problem of multicollinearity is to use principal components regression, which calculates M linear combinations (known as “principal components”) of the original p predictor variables and then uses the method of least squares to fit a linear regression model using the principal components as predictors.
The drawback of principal components regression (PCR) is that it does not consider the response variable when calculating the principal components.
Instead, it only considers the magnitude of the variance among the predictor variables captured by the principal components. Because of this, it’s possible that in some cases the principal components with the largest variances aren’t actually able to predict the response variable well.
A technique that is related to PCR is known as partial least squares. Similar to PCR, partial least squares calculates M linear combinations (known as “PLS components”) of the original p predictor variables and uses the method of least squares to fit a linear regression model using the PLS components as predictors.
But unlike PCR, partial least squares attempts to find linear combinations that explain the variation in both the response variable and the predictor variables.
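The difference between the two approaches can be seen in a small NumPy sketch. The synthetic data here is an assumption made purely for illustration: the predictors share a high-variance direction that is unrelated to the response, and a lower-variance direction that drives it. The first principal component is taken from an SVD of the standardized predictors, while the first PLS component uses simple-regression slopes as weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic predictors (an assumption for illustration): a high-variance
# direction unrelated to y and a low-variance direction that drives y.
big = rng.normal(scale=2.0, size=n)
small = rng.normal(scale=1.0, size=n)
X = np.column_stack([big + small, big - small])
y = small + rng.normal(scale=0.1, size=n)

# Standardize the predictors and the response.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()

# First principal component: the direction of maximum variance in X.
# The response y plays no role in choosing it.
_, _, Vt = np.linalg.svd(Xs, full_matrices=False)
pc1 = Xs @ Vt[0]

# First PLS component: each weight is the slope from the simple
# regression of y on one predictor, so y shapes the direction.
phi = (Xs.T @ ys) / np.sum(Xs**2, axis=0)
pls1 = Xs @ phi

corr_pc1 = abs(np.corrcoef(pc1, ys)[0, 1])
corr_pls1 = abs(np.corrcoef(pls1, ys)[0, 1])
print(f"|corr(PC1, y)|  = {corr_pc1:.3f}")
print(f"|corr(PLS1, y)| = {corr_pls1:.3f}")
```

On data built this way, the first principal component follows the high-variance direction and is only weakly related to the response, while the first PLS component tracks it closely. Real datasets are rarely this clear-cut.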
Steps to Perform Partial Least Squares
In practice, the following steps are used to perform partial least squares.
1. Standardize the data such that all of the predictor variables and the response variable have a mean of 0 and a standard deviation of 1. This ensures that each variable is measured on the same scale.
2. Calculate Z1, … , ZM to be the M linear combinations of the original p predictors.
- Zm = ΣΦjmXj (summing over j = 1, …, p) for some constants Φ1m, Φ2m, …, Φpm, where m = 1, …, M.
- To calculate Z1, set each Φj1 equal to the coefficient from the simple linear regression of Y onto Xj. As a result, Z1 places the most weight on the predictors that are most strongly related to the response.
- To calculate Z2, regress each variable on Z1 and take the residuals. Then calculate Z2 using this orthogonalized data in exactly the same manner that Z1 was calculated.
- Repeat this process M times to obtain the M PLS components.
3. Use the method of least squares to fit a linear regression model using the PLS components Z1, … , ZM as predictors.
4. Lastly, use k-fold cross-validation to find the optimal number of PLS components to keep in the model. The “optimal” number of PLS components is typically the number that produces the lowest cross-validated estimate of the test mean-squared error (MSE).
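The four steps above can be sketched in plain NumPy. This is a minimal illustration that follows the steps as written, not a reference implementation: the names pls_fit and cv_mse and the synthetic data are hypothetical, and library implementations may compute or normalize the weights differently.

```python
import numpy as np

def pls_fit(X, y, n_components):
    """Sketch of steps 1-3. Returns a function that predicts y for new rows."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0)
    y_mean, y_std = y.mean(), y.std()
    Xres = (X - x_mean) / x_std            # step 1: standardize predictors
    ys = (y - y_mean) / y_std              # step 1: standardize response
    p = X.shape[1]
    R = np.eye(p)                          # maps standardized X to current residuals
    W = np.empty((p, n_components))        # maps standardized X to each Z_m
    Z = np.empty((X.shape[0], n_components))
    for m in range(n_components):
        # Step 2: weights are simple-regression slopes of y on each column.
        phi = (Xres.T @ ys) / np.sum(Xres**2, axis=0)
        z = Xres @ phi
        Z[:, m] = z
        W[:, m] = R @ phi
        # Regress each column on z and keep the residuals.
        load = (Xres.T @ z) / (z @ z)
        Xres = Xres - np.outer(z, load)
        R = R - np.outer(R @ phi, load)
    # Step 3: least squares of the response on the PLS components.
    theta, *_ = np.linalg.lstsq(Z, ys, rcond=None)
    coef = W @ theta

    def predict(X_new):
        return y_mean + y_std * (((X_new - x_mean) / x_std) @ coef)

    return predict

def cv_mse(X, y, n_components, k=5, seed=0):
    """Step 4: k-fold cross-validated MSE for a given number of components."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        predict = pls_fit(X[train], y[train], n_components)
        errors.append(np.mean((y[fold] - predict(X[fold])) ** 2))
    return float(np.mean(errors))

# Synthetic, highly correlated predictors (an assumption for illustration).
rng = np.random.default_rng(1)
n = 200
latent = rng.normal(size=(n, 2))
X = np.column_stack(
    [latent[:, 0] + 0.1 * rng.normal(size=n) for _ in range(3)]
    + [latent[:, 1] + 0.1 * rng.normal(size=n) for _ in range(2)]
)
y = latent[:, 0] + latent[:, 1] + rng.normal(scale=0.3, size=n)

scores = {m: cv_mse(X, y, m) for m in range(1, X.shape[1] + 1)}
best_m = min(scores, key=scores.get)
print(scores)
print("components with lowest CV MSE:", best_m)
```

Standardization is recomputed inside each training fold so that the held-out rows do not influence the scaling. When all p components are kept, the fitted values should match ordinary least squares, which is a useful sanity check for a sketch like this one.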
In cases where multicollinearity is present in a dataset, partial least squares tends to perform better than ordinary least squares regression. However, it’s a good idea to fit several different models so that we can identify the one that generalizes best to unseen data.
In practice, we fit many different types of models (PLS, PCR, Ridge, Lasso, Multiple Linear Regression, etc.) to a dataset and use k-fold cross-validation to identify the model that produces the lowest test MSE on new data.