Lasso Regression in Python (Step-by-Step)


Lasso regression is a method we can use to fit a regression model when multicollinearity is present in the data.

In a nutshell, least squares regression tries to find coefficient estimates that minimize the sum of squared residuals (RSS):

RSS = Σ(yi – ŷi)2

where:

  • Σ: A greek symbol that means sum
  • yi: The actual response value for the ith observation
  • ŷi: The predicted response value based on the multiple linear regression model

Conversely, lasso regression seeks to minimize the following:

RSS + λΣ|βj|

where j ranges from 1 to p predictor variables and λ ≥ 0.

This second term in the equation is known as a shrinkage penalty. In lasso regression, we select a value for λ that produces the lowest possible test MSE (mean squared error).

This tutorial provides a step-by-step example of how to perform lasso regression in Python.

Step 1: Import Necessary Packages

First, we’ll import the necessary packages to perform lasso regression in Python:

import pandas as pd
from numpy import arange
from sklearn.linear_model import LassoCV
from sklearn.model_selection import RepeatedKFold

Step 2: Load the Data

For this example, we’ll use a dataset called mtcars, which contains information about 33 different cars. We’ll use hp as the response variable and the following variables as the predictors:

  • mpg
  • wt
  • drat
  • qsec

The following code shows how to load and view this dataset:

#define URL where data is located
url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/mtcars.csv"

#read in data
data_full = pd.read_csv(url)

#select subset of data
data = data_full[["mpg", "wt", "drat", "qsec", "hp"]]

#view first six rows of data
data[0:6]

	mpg	wt	drat	qsec	hp
0	21.0	2.620	3.90	16.46	110
1	21.0	2.875	3.90	17.02	110
2	22.8	2.320	3.85	18.61	93
3	21.4	3.215	3.08	19.44	110
4	18.7	3.440	3.15	17.02	175
5	18.1	3.460	2.76	20.22	105

Step 3: Fit the Ridge Regression Model

Next, we’ll use the LassoCV() function from sklearn to fit the lasso regression model and we’ll use the RepeatedKFold() function to perform k-fold cross-validation to find the optimal alpha value to use for the penalty term.

Note: The term “alpha” is used instead of “lambda” in Python.

For this example we’ll choose k = 10 folds and repeat the cross-validation process 3 times.

Also note that LassoCV() only tests alpha values 0.1, 1, and 10 by default. However, we can define our own alpha range from 0 to 1 by increments of 0.01:

#define predictor and response variables
X = data[["mpg", "wt", "drat", "qsec"]]
y = data["hp"]

#define cross-validation method to evaluate model
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

#define model
model = LassoCV(alphas=arange(0, 1, 0.01), cv=cv, n_jobs=-1)

#fit model
model.fit(X, y)

#display lambda that produced the lowest test MSE
print(model.alpha_)

0.99

The lambda value that minimizes the test MSE turns out to be 0.99.

Step 4: Use the Model to Make Predictions

Lastly, we can use the final lasso regression model to make predictions on new observations. For example, the following code shows how to define a new car with the following attributes:

  • mpg: 24
  • wt: 2.5
  • drat: 3.5
  • qsec: 18.5

The following code shows how to use the fitted lasso regression model to predict the value for hp of this new observation:

#define new observation
new = [24, 2.5, 3.5, 18.5]

#predict hp value using lasso regression model
model.predict([new])

array([105.63442071])

Based on the input values, the model predicts this car to have an hp value of 105.63442071.

You can find the complete Python code used in this example here.

Leave a Reply

Your email address will not be published. Required fields are marked *