How to Calculate R-Squared in Python (With Example)


R-squared, often written R², is the proportion of the variance in the response variable that can be explained by the predictor variables in a linear regression model.

For a least-squares regression model that includes an intercept and is evaluated on the data used to fit it, the value for R-squared ranges from 0 to 1, where:

  • 0 indicates that the response variable cannot be explained by the predictor variables at all.
  • 1 indicates that the response variable can be perfectly explained, without error, by the predictor variables.
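
For reference, one common way to write the definition behind these two endpoints is shown below. The symbols are notation introduced here and do not appear elsewhere in this tutorial: y_i is an observed response, \hat{y}_i is the model's fitted value for that observation, and \bar{y} is the mean of the observed responses.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

Under this definition, a model whose fitted values are no better than the mean gives a value of 0, and a model whose fitted values match every observation exactly gives a value of 1.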

The following example shows how to calculate R² for a regression model in Python.

Example: Calculate R-Squared in Python

Suppose we have the following pandas DataFrame:

import pandas as pd

#create DataFrame
df = pd.DataFrame({'hours': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4, 3, 6],
                   'prep_exams': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4, 3, 2],
                   'score': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90, 75, 96]})

#view DataFrame
print(df)

    hours  prep_exams  score
0       1           1     76
1       2           3     78
2       2           3     85
3       4           5     88
4       2           2     72
5       1           2     69
6       5           1     94
7       4           1     94
8       2           0     88
9       4           3     92
10      4           4     90
11      3           3     75
12      6           2     96

We can use the LinearRegression class from scikit-learn (sklearn) to fit a regression model, then call the model's score() method to calculate its R-squared value:

from sklearn.linear_model import LinearRegression

#create linear regression model
model = LinearRegression()

#define predictor and response variables
X, y = df[["hours", "prep_exams"]], df["score"]

#fit regression model
model.fit(X, y)

#calculate R-squared of regression model
r_squared = model.score(X, y)

#view R-squared value
print(r_squared)

0.7175541714105901

The R-squared of the model turns out to be 0.7176.

This means that 71.76% of the variation in the exam scores can be explained by the number of hours studied and the number of prep exams taken.
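
As a sanity check, the same quantity can be computed by hand from the residual and total sums of squares, or with the r2_score function from sklearn.metrics. The sketch below is one way to do this, assuming the same DataFrame as above; the variable names (ss_res, ss_tot, r2_manual, r2_metric, r2_model) are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# same data as in the example above
df = pd.DataFrame({'hours': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4, 3, 6],
                   'prep_exams': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4, 3, 2],
                   'score': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90, 75, 96]})

X, y = df[["hours", "prep_exams"]], df["score"]
model = LinearRegression().fit(X, y)
y_pred = model.predict(X)

# residual sum of squares: variation left unexplained by the model
ss_res = np.sum((y - y_pred) ** 2)

# total sum of squares: variation of the response around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot      # computed from the two sums of squares
r2_metric = r2_score(y, y_pred)      # scikit-learn's metric function
r2_model = model.score(X, y)         # the method used in the example above

print(r2_manual, r2_metric, r2_model)
```

All three values should agree up to floating-point rounding, since score() on a LinearRegression model reports this same statistic.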

If we’d like, we could then compare this R-squared value to that of another regression model with a different set of predictor variables.

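In general, a higher R-squared value means the set of predictor variables in the model explains more of the variation in the response variable. Keep in mind, however, that R-squared never decreases when a predictor is added to a least-squares model, so the comparison is most meaningful between models with the same number of predictors.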
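
A comparison of this kind might look like the sketch below, which fits an hours-only model and the two-predictor model on the same data. It also reports adjusted R-squared, a common companion statistic that penalizes extra predictors. The adjusted_r2 helper and the results dictionary are hypothetical names written for this illustration and are not part of scikit-learn.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# same data as in the example above
df = pd.DataFrame({'hours': [1, 2, 2, 4, 2, 1, 5, 4, 2, 4, 4, 3, 6],
                   'prep_exams': [1, 3, 3, 5, 2, 2, 1, 1, 0, 3, 4, 3, 2],
                   'score': [76, 78, 85, 88, 72, 69, 94, 94, 88, 92, 90, 75, 96]})

def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors (illustrative helper)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

y = df["score"]
n = len(df)
results = {}

# fit one model per candidate set of predictors
for cols in (["hours"], ["hours", "prep_exams"]):
    X = df[cols]
    r2 = LinearRegression().fit(X, y).score(X, y)
    results[tuple(cols)] = (r2, adjusted_r2(r2, n, len(cols)))
    print(cols, "R-squared:", r2, "adjusted R-squared:", results[tuple(cols)][1])
```

Because the hours-only model is nested inside the two-predictor model, its R-squared cannot be larger; adjusted R-squared is one way to judge whether the extra predictor earns its place.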

Related: What is a Good R-squared Value?

Additional Resources

The following tutorials explain how to perform other common operations in Python:

How to Perform Simple Linear Regression in Python
How to Perform Multiple Linear Regression in Python
How to Calculate AIC of Regression Models in Python

One Reply to “How to Calculate R-Squared in Python (With Example)”

  1. I want to perform a multiple linear regression of variables with price but I am getting an error.
    X, Y = df[["floors", "waterfront", "lat", "bedrooms", "sqft_basement", "view", "bathrooms", "sqft_living15", "sqft_above", "grade", "sqft_living"]], df.price

    from sklearn.linear_model import LinearRegression
    lm = LinearRegression()
    lm.fit(X, Y)

    ValueError: Input X contains NaN.
    LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

    how do I get the multiple linear regression and R squared value
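
The error message in this comment indicates that at least one predictor column contains missing values, which LinearRegression cannot handle. One possible approach, among those the message itself suggests, is to drop the rows with missing values before fitting. The sketch below uses a small made-up DataFrame, since the commenter's dataset is not shown; the values and the choice of two columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical housing data with a few missing values (not the commenter's data)
df = pd.DataFrame({
    "sqft_living": [1180, 2570, np.nan, 1960, 1680, 5420, 1715, 1060],
    "bedrooms": [3, 3, 2, 4, np.nan, 4, 3, 3],
    "price": [221900, 538000, 180000, 604000, 510000, 1225000, 257500, 291850],
})

cols = ["sqft_living", "bedrooms"]

# count missing values per predictor to see where the NaNs are
print(df[cols].isna().sum())

# keep only the rows where every predictor and the response are present
clean = df.dropna(subset=cols + ["price"])

X, y = clean[cols], clean["price"]
lm = LinearRegression().fit(X, y)

# R-squared of the model fit on the complete rows
r_squared = lm.score(X, y)
print(r_squared)
```

Dropping rows discards data, so when many rows are affected, filling in the missing values instead (for example with SimpleImputer from sklearn.impute) may be a better fit.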
