How to Calculate SST, SSR, and SSE in Python


We often use three different sum of squares values to measure how well a regression line fits a dataset:

1. Sum of Squares Total (SST) – The sum of squared differences between individual data points (yi) and the mean of the response variable (ȳ).

  • SST = Σ(yi – ȳ)²

2. Sum of Squares Regression (SSR) – The sum of squared differences between predicted data points (ŷi) and the mean of the response variable (ȳ).

  • SSR = Σ(ŷi – ȳ)²

3. Sum of Squares Error (SSE) – The sum of squared differences between predicted data points (ŷi) and observed data points (yi).

  • SSE = Σ(ŷi – yi)²
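
To make the three definitions concrete, here is a minimal sketch in plain Python. The helper name sums_of_squares and the three-point dataset are hypothetical, chosen only so the arithmetic can be checked by hand; the predictions are assumed to come from the least squares line ŷ = 1 + 0.5x fitted to x = 1, 2, 3.

```python
def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for observed values y and predictions y_hat."""
    y_bar = sum(y) / len(y)  # mean of the response variable
    sst = sum((yi - y_bar) ** 2 for yi in y)               # observed vs. mean
    ssr = sum((fi - y_bar) ** 2 for fi in y_hat)           # predicted vs. mean
    sse = sum((fi - yi) ** 2 for fi, yi in zip(y_hat, y))  # predicted vs. observed
    return sst, ssr, sse

# hypothetical toy data: observed responses and their least squares predictions
y = [1, 3, 2]
y_hat = [1.5, 2.0, 2.5]

sst, ssr, sse = sums_of_squares(y, y_hat)
print(sst, ssr, sse)  # 2.0 0.5 1.5
```

Here SST (2.0) equals SSR (0.5) plus SSE (1.5), as expected for a least squares line with an intercept.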

The following step-by-step example shows how to calculate each of these metrics for a given regression model in Python.

Step 1: Create the Data

First, let’s create a dataset that contains the number of hours studied and exam score received for 20 different students at a certain university:

import pandas as pd

#create pandas DataFrame
df = pd.DataFrame({'hours': [1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                             3, 4, 4, 4, 5, 5, 6, 7, 7, 8],
                   'score': [68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                             88, 85, 89, 94, 93, 94, 96, 89, 92, 97]})

#view first five rows of DataFrame
df.head()

	hours	score
0	1	68
1	1	76
2	1	74
3	2	80
4	2	76

Step 2: Fit a Regression Model

Next, we’ll use the OLS() function from the statsmodels library to fit a simple linear regression model using score as the response variable and hours as the predictor variable:

import statsmodels.api as sm

#define response variable
y = df['score']

#define predictor variable
x = df[['hours']]

#add constant to predictor variables
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()
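
As an optional cross-check on the fitted model, the slope and intercept of a simple linear regression can also be computed in closed form. The sketch below uses only NumPy and repeats the data from Step 1 so that it runs on its own; the variable names (sxy, sxx, slope, intercept) are illustrative and not part of the original example.

```python
import numpy as np

# same data as in Step 1
hours = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                  3, 4, 4, 4, 5, 5, 6, 7, 7, 8])
score = np.array([68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                  88, 85, 89, 94, 93, 94, 96, 89, 92, 97])

# slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x)
sxy = np.sum((hours - hours.mean()) * (score - score.mean()))
sxx = np.sum((hours - hours.mean()) ** 2)
slope = sxy / sxx
intercept = score.mean() - slope * hours.mean()

print(intercept, slope)
```

Up to floating-point rounding, these two numbers should match the const and hours entries of model.params from the statsmodels fit above.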

Step 3: Calculate SST, SSR, and SSE

Lastly, we can use the following formulas to calculate the SST, SSR, and SSE values of the model:

import numpy as np

#calculate sse
sse = np.sum((model.fittedvalues - df.score)**2)
print(sse)

331.07488479262696

#calculate ssr
ssr = np.sum((model.fittedvalues - df.score.mean())**2)
print(ssr)

917.4751152073725

#calculate sst (equals ssr + sse for a least squares fit with an intercept)
sst = ssr + sse
print(sst)

1248.5499999999995

The metrics turn out to be:

  • Sum of Squares Total (SST): 1248.55
  • Sum of Squares Regression (SSR): 917.4751
  • Sum of Squares Error (SSE): 331.0749

For a least squares model that includes an intercept, these values satisfy SST = SSR + SSE:

  • SST = SSR + SSE
  • 1248.55 = 917.4751 + 331.0749
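
Because sst in Step 3 was itself computed as ssr + sse, a stronger check is to compute SST directly from its definition and compare. The sketch below is one way to do that; to stay self-contained it fits the line with np.polyfit instead of the statsmodels model, which is an assumption made for this snippet and yields the same least squares line.

```python
import numpy as np

# same data as in Step 1
hours = np.array([1, 1, 1, 2, 2, 2, 2, 2, 3, 3,
                  3, 4, 4, 4, 5, 5, 6, 7, 7, 8])
score = np.array([68, 76, 74, 80, 76, 78, 81, 84, 86, 83,
                  88, 85, 89, 94, 93, 94, 96, 89, 92, 97])

# np.polyfit with degree 1 returns [slope, intercept] of the least squares line
slope, intercept = np.polyfit(hours, score, 1)
fitted = intercept + slope * hours

sst = np.sum((score - score.mean()) ** 2)   # directly from the definition
ssr = np.sum((fitted - score.mean()) ** 2)
sse = np.sum((fitted - score) ** 2)

# compare with a tolerance, since floating-point sums rarely match exactly
print(np.isclose(sst, ssr + sse))  # True
```

Comparing with np.isclose rather than == avoids a false mismatch caused by rounding in the last few digits.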
