In ordinary multiple linear regression, we use a set of *p* predictor variables and a response variable to fit a model of the form:

**Y = β _{0} + β_{1}X_{1} + β_{2}X_{2} + … + β_{p}X_{p} + ε**

The values for β_{0}, β_{1}, β_{2}, … , β_{p} are chosen using the least squares method, which minimizes the sum of squared residuals (RSS):

**RSS = Σ(y_{i} – ŷ_{i})^{2}**

where:

- **Σ**: A symbol that means “sum”
- **y_{i}**: The actual response value for the i^{th} observation
- **ŷ_{i}**: The predicted response value for the i^{th} observation
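As a concrete illustration, the RSS can be computed directly from a fitted least squares model. The sketch below uses NumPy; the `X` and `y` values are made up purely for illustration:

```python
import numpy as np

# Illustrative data: n = 5 observations, p = 2 predictors
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 10.1])

# Fit ordinary least squares: add an intercept column and solve
X1 = np.column_stack([np.ones(len(X)), X])     # [1, x1, x2] per row
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)  # [β0, β1, β2]

# RSS = Σ(y_i - ŷ_i)²
y_hat = X1 @ beta
rss = np.sum((y - y_hat) ** 2)
print(rss)
```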

**The Problem of Multicollinearity in Regression**

One problem that often occurs in practice with multiple linear regression is multicollinearity – when two or more predictor variables are highly correlated with each other, such that they do not provide unique or independent information in the regression model.

This can cause the coefficient estimates of the model to be unreliable and have high variance. That is, when the model is applied to a new set of data it hasn’t seen before, it’s likely to perform poorly.
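One way to see this instability is by simulation: fit least squares repeatedly on fresh samples, once with independent predictors and once with nearly collinear ones, and compare how much the coefficient estimates vary. A minimal sketch, where the data-generating process and all numeric values are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ols(X, y):
    """OLS coefficients (intercept first) via least squares."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return beta

def coef_std(correlated, n_sims=500, n=50):
    """Std. dev. of fitted coefficients across repeated samples."""
    betas = []
    for _ in range(n_sims):
        x1 = rng.normal(size=n)
        # Collinear case: x2 is almost an exact copy of x1
        x2 = x1 + rng.normal(scale=0.01, size=n) if correlated else rng.normal(size=n)
        y = 2 * x1 + 3 * x2 + rng.normal(size=n)
        betas.append(fit_ols(np.column_stack([x1, x2]), y))
    return np.std(betas, axis=0)

std_indep = coef_std(correlated=False)
std_collinear = coef_std(correlated=True)
print(std_indep)      # modest variability in the slope estimates
print(std_collinear)  # slope estimates vary wildly across samples
```

The true coefficients are the same in both settings; only the correlation between the predictors changes, yet the variance of the estimates explodes in the collinear case.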

**Avoiding Multicollinearity: Ridge & Lasso Regression**

Two methods we can use to get around this issue of multicollinearity are **ridge regression** and **lasso regression**.

**Ridge regression** seeks to minimize the following:

**RSS + λΣβ_{j}^{2}**

**Lasso regression** seeks to minimize the following:

**RSS + λΣ|β_{j}|**

In both equations, the second term is known as a *shrinkage penalty*.
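Both penalties are available in scikit-learn, where λ is passed as the `alpha` argument (note that scikit-learn's `Lasso` scales the RSS term by 1/(2n), so its `alpha` is not on the same scale as `Ridge`'s). A minimal sketch on simulated data:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # adds λΣβ_j² to the objective
lasso = Lasso(alpha=0.1).fit(X, y)  # adds λΣ|β_j| to the objective

print(ols.coef_)
print(ridge.coef_)  # shrunk toward zero relative to OLS
print(lasso.coef_)
```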

When λ = 0, this penalty term has no effect and both ridge regression and lasso regression produce the same coefficient estimates as least squares.

However, as λ approaches infinity, the shrinkage penalty becomes more influential and the predictor variables that aren’t important in the model get shrunk towards zero.

With lasso regression, it’s possible that some of the coefficients go *completely to zero* when λ gets sufficiently large.
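This sparsity is easy to demonstrate. In the sketch below (all values are assumptions chosen for illustration), only two of five predictors truly affect the response, and increasing `alpha` drives more and more coefficients exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
# Only the first two predictors actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

counts = []
for alpha in [0.01, 0.5, 5.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    counts.append(int(np.sum(coef != 0)))
    print(f"alpha={alpha}: {counts[-1]} nonzero coefficients")
```

At a large enough `alpha`, every coefficient is zeroed out; at moderate values, lasso keeps the two genuinely relevant predictors and drops the rest.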

**Pros & Cons of Ridge & Lasso Regression**

The **benefit** of ridge and lasso regression compared to least squares regression lies in the bias-variance tradeoff.

Recall that mean squared error (MSE) is a metric we can use to measure the accuracy of a given model and it is calculated as:

MSE = Var(*f̂*(x_{0})) + [Bias(*f̂*(x_{0}))]^{2} + Var(ε)

MSE = Variance + Bias^{2} + Irreducible error

The basic idea of both ridge and lasso regression is to introduce a little bias so that the variance can be substantially reduced, which leads to a lower overall MSE.

To illustrate this, consider the following chart, which plots bias^{2}, variance, and test MSE as functions of λ:

Notice that as λ increases, variance drops substantially with very little increase in bias. Beyond a certain point, though, variance decreases less rapidly and the shrinkage in the coefficients causes them to be significantly underestimated which results in a large increase in bias.

We can see from the chart that the test MSE is lowest when we choose a value for λ that produces an optimal tradeoff between bias and variance.

When λ = 0, the penalty term in ridge and lasso regression has no effect, and thus both produce the same coefficient estimates as least squares. However, by increasing λ to a certain point, we can reduce the overall test MSE.

This means the model fit by ridge and lasso regression can potentially produce smaller test errors than the model fit by least squares regression.
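A small simulation makes this concrete: with nearly collinear predictors and a small training set, the bias that ridge introduces is more than paid for by its variance reduction, lowering the average test MSE. The data-generating process and `alpha` value below are assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

def make_data(n):
    """Two nearly collinear predictors; x2 is almost a copy of x1."""
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)
    X = np.column_stack([x1, x2])
    y = x1 + x2 + rng.normal(size=n)
    return X, y

X_test, y_test = make_data(1000)

ols_mse, ridge_mse = [], []
for _ in range(200):
    X_train, y_train = make_data(20)  # small n makes the variance visible
    ols = LinearRegression().fit(X_train, y_train)
    ridge = Ridge(alpha=1.0).fit(X_train, y_train)
    ols_mse.append(mean_squared_error(y_test, ols.predict(X_test)))
    ridge_mse.append(mean_squared_error(y_test, ridge.predict(X_test)))

print(np.mean(ols_mse), np.mean(ridge_mse))
```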

The **drawback** of ridge and lasso regression is that the coefficients in the final model become difficult to interpret, since they are shrunk towards zero.

Thus, ridge and lasso regression should be used when you’re interested in optimizing for predictive ability rather than inference.

**Ridge vs. Lasso Regression: When to Use Each**

Both lasso regression and ridge regression are known as *regularization methods* because they both attempt to minimize the sum of squared residuals (RSS) along with some penalty term.

In other words, they constrain or *regularize* the coefficient estimates of the model.

This naturally brings up the question: **Is ridge or lasso regression better?**

In cases where only a small number of predictor variables are significant, **lasso regression** tends to perform better because it’s able to shrink insignificant variables completely to zero and remove them from the model.

However, when many predictor variables are significant in the model and their coefficients are roughly equal then **ridge regression** tends to perform better because it keeps all of the predictors in the model.

To determine which model is better at making predictions, we typically perform k-fold cross-validation and choose whichever model produces the lowest test mean squared error.
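A minimal sketch of this comparison with scikit-learn's `cross_val_score` (the simulated data, `alpha` values, and fold count are assumptions for illustration; in practice you would also tune `alpha` itself, e.g. with `RidgeCV`/`LassoCV`):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 0] + rng.normal(size=100)  # only one predictor matters

def cv_mse(model, k=5):
    """Mean test MSE across k folds (sklearn reports negated MSE)."""
    scores = cross_val_score(model, X, y, cv=k, scoring="neg_mean_squared_error")
    return -scores.mean()

ridge_mse = cv_mse(Ridge(alpha=1.0))
lasso_mse = cv_mse(Lasso(alpha=0.1))
print(f"ridge CV MSE: {ridge_mse:.3f}, lasso CV MSE: {lasso_mse:.3f}")
```

Whichever model yields the lower cross-validated MSE is the better choice for prediction on new data.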
