Ridge regression is a method we can use to fit a regression model when multicollinearity is present in the data.
In a nutshell, least squares regression tries to find coefficient estimates that minimize the sum of squared residuals (RSS):
RSS = Σ (y_i - ŷ_i)^2
where:
- Σ: A Greek symbol that means sum
- y_i: The actual response value for the ith observation
- ŷ_i: The predicted response value for the ith observation based on the multiple linear regression model
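As a small illustration of the formula above, the following sketch computes RSS for three observations. The values in `actual` and `predicted` are made up for this example and do not come from any dataset used in this tutorial.

```python
# Made-up response values and model predictions (illustration only)
actual = [110, 93, 175]
predicted = [105.0, 100.0, 170.0]

# RSS: sum of squared differences between actual and predicted values
rss = sum((y_i - y_hat_i) ** 2 for y_i, y_hat_i in zip(actual, predicted))
print(rss)  # 99.0
```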
In contrast, ridge regression seeks to minimize the following:
RSS + λ Σ β_j^2
where j ranges from 1 to p, the number of predictor variables, and λ ≥ 0.
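The second term, λ Σ β_j^2, is known as a shrinkage penalty. When λ = 0 the penalty has no effect and ridge regression produces the same coefficient estimates as least squares; as λ grows, the estimates shrink toward zero. In ridge regression, we select a value for λ that produces the lowest possible test MSE (mean squared error).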
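To make the effect of the shrinkage penalty concrete, here is a minimal sketch that solves the ridge problem in closed form on centered data and prints the coefficients for several values of λ. The helper `ridge_coefficients` and the simulated data are assumptions introduced for illustration; this is not how scikit-learn fits the model internally.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution on centered data (intercept is not penalized)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)   # center predictors
    yc = y - y.mean()         # center response
    p = Xc.shape[1]
    # Solve (Xc'Xc + lam * I) beta = Xc'yc
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Simulated data (illustration only): two highly correlated predictors
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.05, size=50)
X_toy = np.column_stack([x1, x2])
y_toy = 3 * x1 + rng.normal(size=50)

for lam in [0, 1, 10, 100]:
    beta = ridge_coefficients(X_toy, y_toy, lam)
    # The sum of squared coefficients does not increase as lam grows
    print(lam, beta, np.sum(beta ** 2))
```

With λ = 0 the result matches ordinary least squares, which tends to be unstable when predictors are nearly collinear; larger values of λ pull the coefficients toward zero.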
This tutorial provides a step-by-step example of how to perform ridge regression in Python.
Step 1: Import Necessary Packages
First, we’ll import the necessary packages to perform ridge regression in Python:
```python
import pandas as pd
from numpy import arange
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import RepeatedKFold
```
Step 2: Load the Data
For this example, we’ll use a dataset called mtcars, which contains information about 32 different cars. We’ll use hp as the response variable and the following variables as the predictors:
- mpg
- wt
- drat
- qsec
The following code shows how to load and view this dataset:
```python
#define URL where data is located
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

#read in data
data_full = pd.read_csv(url)

#select subset of data
data = data_full[["mpg", "wt", "drat", "qsec", "hp"]]

#view first six rows of data
data[0:6]

    mpg     wt  drat   qsec   hp
0  21.0  2.620  3.90  16.46  110
1  21.0  2.875  3.90  17.02  110
2  22.8  2.320  3.85  18.61   93
3  21.4  3.215  3.08  19.44  110
4  18.7  3.440  3.15  17.02  175
5  18.1  3.460  2.76  20.22  105
```
Step 3: Fit the Ridge Regression Model
Next, we’ll use the RidgeCV() function from sklearn to fit the ridge regression model and we’ll use the RepeatedKFold() function to perform k-fold cross-validation to find the optimal alpha value to use for the penalty term.
Note: scikit-learn uses the term “alpha” instead of “lambda” for the strength of the penalty.
For this example we’ll choose k = 10 folds and repeat the cross-validation process 3 times.
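As a rough sketch of what this setup implies, the code below counts the train/test splits that RepeatedKFold generates. It assumes 32 rows, the size of the standard mtcars dataset, and uses a placeholder array `X_demo` rather than the real predictors.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Same cross-validation settings as in this step
cv_demo = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Placeholder array (assumption): 32 rows, 4 predictor columns
X_demo = np.zeros((32, 4))

# Size of the held-out fold in every split
test_sizes = [len(test_idx) for _, test_idx in cv_demo.split(X_demo)]

print(len(test_sizes))                   # 30 splits: 10 folds x 3 repeats
print(min(test_sizes), max(test_sizes))  # 3 4
```

Each candidate alpha is therefore scored on 30 small held-out folds, and the scores are averaged.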
Also note that RidgeCV() only tests alpha values 0.1, 1, and 10 by default. However, we can define our own grid of alpha values from 0 to 0.99 in increments of 0.01 (arange() excludes the endpoint 1):
```python
#define predictor and response variables
X = data[["mpg", "wt", "drat", "qsec"]]
y = data["hp"]

#define cross-validation method to evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

#define model
model = RidgeCV(alphas=arange(0, 1, 0.01), cv=cv, scoring='neg_mean_absolute_error')

#fit model
model.fit(X, y)

#display alpha that produced the lowest cross-validated mean absolute error
print(model.alpha_)

0.99
```
The alpha (lambda) value that minimizes the cross-validated mean absolute error, the scoring metric specified in the model above, turns out to be 0.99.
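Because 0.99 is the largest value in the grid, the search may have stopped at the edge of the range rather than at a true optimum. The sketch below inspects the grid used above and builds a wider, log-spaced alternative. The name `alphas_wide` and its range are assumptions chosen for illustration; passing it to RidgeCV() may select a different alpha, and that result is not shown here.

```python
import numpy as np

# The grid used above: arange excludes the stop value, so it ends at 0.99
alphas_linear = np.arange(0, 1, 0.01)
print(len(alphas_linear), alphas_linear.min(), alphas_linear.max())

# A hypothetical wider alternative: 50 log-spaced values from 0.001 to 1000
alphas_wide = np.logspace(-3, 3, 50)
print(alphas_wide.min(), alphas_wide.max())
```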
Step 4: Use the Model to Make Predictions
Lastly, we can use the final ridge regression model to make predictions on new observations. For example, consider a new car with the following attributes:
- mpg: 24
- wt: 2.5
- drat: 3.5
- qsec: 18.5
The following code shows how to use the fitted ridge regression model to predict the value for hp of this new observation:
```python
#define new observation
new = [24, 2.5, 3.5, 18.5]

#predict hp value using ridge regression model
model.predict([new])

array([104.16398018])
```
Based on the input values, the model predicts this car to have an hp value of about 104.16.
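A ridge prediction is simply the fitted intercept plus the sum of each coefficient multiplied by its predictor value. The sketch below checks this on a small, made-up dataset; the values in `toy` are invented for illustration and are not the mtcars data, and `toy_model` is a separate model from the one fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Made-up data for illustration only (not the mtcars values)
toy = pd.DataFrame({
    "mpg": [20, 25, 18, 30, 15, 22],
    "wt": [3.0, 2.4, 3.5, 1.9, 4.0, 2.8],
    "hp": [120, 95, 160, 65, 210, 110],
})

# Fit a ridge model with a fixed alpha
toy_model = Ridge(alpha=0.99).fit(toy[["mpg", "wt"]], toy["hp"])

# New observation as a DataFrame so the column names match the training data
new_car = pd.DataFrame([[24, 2.5]], columns=["mpg", "wt"])

# Prediction from the model versus intercept + coefficients . predictors
by_predict = toy_model.predict(new_car)[0]
by_hand = toy_model.intercept_ + float(np.dot(toy_model.coef_, new_car.to_numpy()[0]))

print(by_predict, by_hand)  # the two values agree
```

Passing the new observation as a DataFrame with matching column names keeps the predictors in the same order as during fitting.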