How to Get Regression Model Summary from Scikit-Learn


Often you may want to extract a summary of a regression model created using scikit-learn in Python.

Unfortunately, scikit-learn doesn’t offer many built-in functions for summarizing a regression model, since the library is typically used only for predictive purposes.

So, if you’re interested in getting a summary of a regression model in Python, you have two options:

1. Use limited functions from scikit-learn.

2. Use statsmodels instead.

The following examples show how to use each method in practice with the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'x1': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4],
                   'x2': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4],
                   'y': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90]})

#view first five rows of DataFrame
df.head()

   x1  x2   y
0   1   1  76
1   2   3  78
2   2   3  85
3   4   5  88
4   2   2  72

Method 1: Get Regression Model Summary from Scikit-Learn

We can use the following code to fit a multiple linear regression model using scikit-learn:

from sklearn.linear_model import LinearRegression

#initialize linear regression model
model = LinearRegression()

#define predictor and response variables
X, y = df[['x1', 'x2']], df.y

#fit regression model
model.fit(X, y)

We can then use the following code to extract the regression coefficients of the model along with the R-squared value of the model:

#display regression coefficients and R-squared value of model
print(model.intercept_, model.coef_, model.score(X, y))

70.4828205704 [ 5.7945 -1.1576] 0.766742556527

Using this output, we can write the equation for the fitted regression model:

y = 70.48 + 5.79x1 - 1.16x2

We can also see that the R-squared value of the model is 0.7667.

This means that 76.67% of the variation in the response variable can be explained by the two predictor variables in the model.
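As a quick sanity check, the fitted model can be used to generate a prediction for a new observation, which should agree with the equation above. This is a minimal sketch that rebuilds the model from the example data; the new observation (x1=3, x2=2) is chosen just for illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

#recreate the DataFrame and fitted model from the example above
df = pd.DataFrame({'x1': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4],
                   'x2': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4],
                   'y': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90]})
X, y = df[['x1', 'x2']], df.y
model = LinearRegression().fit(X, y)

#predict y for a hypothetical new observation with x1=3, x2=2
new = pd.DataFrame({'x1': [3], 'x2': [2]})
print(model.predict(new))  #≈ 85.55, i.e. 70.48 + 5.79*3 - 1.16*2
```

The predicted value is exactly what you get by plugging x1=3 and x2=2 into the fitted equation.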

Although this output is useful, we still don’t know the overall F-statistic of the model, the p-values of the individual regression coefficients, and other useful metrics that can help us understand how well the model fits the dataset.
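If you need to stay within scikit-learn, these missing statistics can be computed by hand using the standard OLS formulas. The following is a sketch, not part of the scikit-learn API; it assumes numpy and scipy are available:

```python
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'x1': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4],
                   'x2': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4],
                   'y': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90]})
X, y = df[['x1', 'x2']], df.y
model = LinearRegression().fit(X, y)

n, k = X.shape                       #11 observations, 2 predictors
resid = y - model.predict(X)
sse = (resid ** 2).sum()             #residual sum of squares
mse = sse / (n - k - 1)              #residual variance estimate

#standard errors from the diagonal of (X'X)^-1, with an intercept column
Xd = np.column_stack([np.ones(n), X])
se = np.sqrt(mse * np.diag(np.linalg.inv(Xd.T @ Xd)))

#t statistics and two-sided p-values for [intercept, x1, x2]
t_vals = np.append(model.intercept_, model.coef_) / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), df=n - k - 1)

#overall F-statistic from R-squared
r2 = model.score(X, y)
f_stat = (r2 / k) / ((1 - r2) / (n - k - 1))

print(p_vals.round(3))   #≈ [0.    0.001 0.309]
print(round(f_stat, 2))  #≈ 13.15
```

These values match the statistics reported by statsmodels, but as the next method shows, it is far less work to let statsmodels compute them for you.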

Method 2: Get Regression Model Summary from Statsmodels

If you’re interested in extracting a summary of a regression model in Python, you’re better off using the statsmodels package.

The following code shows how to use this package to fit the same multiple linear regression model as the previous example and extract the model summary:

import statsmodels.api as sm

#define response variable
y = df['y']

#define predictor variables
x = df[['x1', 'x2']]

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#view model summary
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.767
Model:                            OLS   Adj. R-squared:                  0.708
Method:                 Least Squares   F-statistic:                     13.15
Date:                Fri, 01 Apr 2022   Prob (F-statistic):            0.00296
Time:                        11:10:16   Log-Likelihood:                -31.191
No. Observations:                  11   AIC:                             68.38
Df Residuals:                       8   BIC:                             69.57
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         70.4828      3.749     18.803      0.000      61.839      79.127
x1             5.7945      1.132      5.120      0.001       3.185       8.404
x2            -1.1576      1.065     -1.087      0.309      -3.613       1.298
==============================================================================
Omnibus:                        0.198   Durbin-Watson:                   1.240
Prob(Omnibus):                  0.906   Jarque-Bera (JB):                0.296
Skew:                          -0.242   Prob(JB):                        0.862
Kurtosis:                       2.359   Cond. No.                         10.7
==============================================================================

Notice that the regression coefficients and the R-squared value match those calculated by scikit-learn, but we’re also provided with a ton of other useful metrics for the regression model.

For example, we can see the p-values for each individual predictor variable:

  • p-value for x1 = 0.001
  • p-value for x2 = 0.309

We can also see the overall F-statistic of the model, the adjusted R-squared value, the AIC value of the model, and much more.

Additional Resources

The following tutorials explain how to perform other common operations in Python:

How to Perform Simple Linear Regression in Python
How to Perform Multiple Linear Regression in Python
How to Calculate AIC of Regression Models in Python
