One of the key assumptions of linear regression is that the residuals are distributed with equal variance at each level of the predictor variable. This assumption is known as homoscedasticity.
When this assumption is violated, we say that heteroscedasticity is present in the residuals. When this occurs, the coefficient estimates remain unbiased, but the standard errors of the regression become unreliable, which in turn makes the p-values and confidence intervals untrustworthy.
One way to handle this issue is to use weighted least squares regression instead, which assigns a weight to each observation: observations with small error variance receive more weight, since they contain more information than observations with large error variance.
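To make the weighting concrete, here is a minimal sketch (using NumPy, purely for illustration and not part of the tutorial's workflow) of the closed-form solution that weighted least squares computes: minimizing the weighted sum of squared residuals yields b = (X'WX)^(-1) X'Wy, where W is a diagonal matrix of the weights:

import numpy as np

#closed-form weighted least squares: b = (X'WX)^(-1) X'Wy
#X: (n, p) design matrix (include a column of ones for the intercept)
#y: (n,) response vector
#w: (n,) weights (larger weight = smaller assumed error variance)
def wls_closed_form(X, y, w):
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

In practice we don't compute this by hand; the statsmodels library does it for us, as shown below.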
This tutorial provides a step-by-step example of how to perform weighted least squares regression in Python.
Step 1: Create the Data
First, let’s create the following pandas DataFrame that contains information about the number of hours studied and the final exam score for 16 students in some class:
import pandas as pd

#create DataFrame
df = pd.DataFrame({'hours': [1, 1, 2, 2, 2, 3, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8],
                   'score': [48, 78, 72, 70, 66, 92, 93, 75, 75, 80, 95, 97, 90, 96, 99, 99]})

#view first five rows of DataFrame
print(df.head())

   hours  score
0      1     48
1      1     78
2      2     72
3      2     70
4      2     66
Step 2: Fit Simple Linear Regression Model
Next, we’ll use functions from the statsmodels module to fit a simple linear regression model using hours as the predictor variable and score as the response variable:
import statsmodels.api as sm

#define predictor and response variables
y = df['score']
X = df['hours']

#add constant to predictor variables
X = sm.add_constant(X)

#fit linear regression model
fit = sm.OLS(y, X).fit()

#view model summary
print(fit.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                  score   R-squared:                       0.630
Model:                            OLS   Adj. R-squared:                  0.603
Method:                 Least Squares   F-statistic:                     23.80
Date:                Mon, 31 Oct 2022   Prob (F-statistic):           0.000244
Time:                        11:19:54   Log-Likelihood:                -57.184
No. Observations:                  16   AIC:                             118.4
Df Residuals:                      14   BIC:                             119.9
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         60.4669      5.128     11.791      0.000      49.468      71.465
hours          5.5005      1.127      4.879      0.000       3.082       7.919
==============================================================================
Omnibus:                        0.041   Durbin-Watson:                   1.910
Prob(Omnibus):                  0.980   Jarque-Bera (JB):                0.268
Skew:                          -0.010   Prob(JB):                        0.875
Kurtosis:                       2.366   Cond. No.                         10.5
==============================================================================
From the model summary we can see that the R-squared value of the model is 0.630.
Related: What is a Good R-squared Value?
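Before fitting the weighted model, it can be helpful to confirm that heteroscedasticity is actually present. One quick check (a minimal sketch using matplotlib, which this tutorial doesn't otherwise use) is to plot the residuals against the fitted values; a funnel or fan shape indicates non-constant variance:

import matplotlib.pyplot as plt

#plot residuals against fitted values
plt.scatter(fit.fittedvalues, fit.resid)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()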
Step 3: Fit Weighted Least Squares Model
Next, we can use the WLS() function from statsmodels to perform weighted least squares by defining the weights in such a way that the observations with lower variance are given more weight:
import statsmodels.formula.api as smf

#define weights to use: regress the absolute residuals on the fitted values,
#then square the fitted values of that regression to estimate each
#observation's error variance (weight = 1 / estimated variance)
wt = 1 / smf.ols('fit.resid.abs() ~ fit.fittedvalues', data=df).fit().fittedvalues**2
#fit weighted least squares regression model
fit_wls = sm.WLS(y, X, weights=wt).fit()
#view summary of weighted least squares regression model
print(fit_wls.summary())
                            WLS Regression Results
==============================================================================
Dep. Variable:                  score   R-squared:                       0.676
Model:                            WLS   Adj. R-squared:                  0.653
Method:                 Least Squares   F-statistic:                     29.24
Date:                Mon, 31 Oct 2022   Prob (F-statistic):           9.24e-05
Time:                        11:20:10   Log-Likelihood:                -55.074
No. Observations:                  16   AIC:                             114.1
Df Residuals:                      14   BIC:                             115.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         63.9689      5.159     12.400      0.000      52.905      75.033
hours          4.7091      0.871      5.407      0.000       2.841       6.577
==============================================================================
Omnibus:                        2.482   Durbin-Watson:                   1.786
Prob(Omnibus):                  0.289   Jarque-Bera (JB):                1.058
Skew:                           0.029   Prob(JB):                        0.589
Kurtosis:                       1.742   Cond. No.                         17.6
==============================================================================
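If you prefer to avoid the formula interface, the same weights can be constructed with the array-based API. This is a sketch equivalent to the smf.ols line above (the variable names here are illustrative):

#equivalent weight construction using the array-based API
abs_resid = fit.resid.abs()
var_fit = sm.OLS(abs_resid, sm.add_constant(fit.fittedvalues)).fit()
wt_alt = 1 / var_fit.fittedvalues**2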
From the output we can see that the R-squared value for this weighted least squares model increased to 0.676.
This indicates that the weighted least squares model explains more of the variance in exam scores and thus offers a better fit to the data than the simple linear regression model.
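As a usage note, the fitted WLS model can be used for prediction just like the OLS model. For example, to predict the exam score for a hypothetical student who studies 5 hours (the leading 1 supplies the constant term added in Step 2):

#predict exam score for a hypothetical student who studies 5 hours
print(fit_wls.predict([[1, 5]]))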
Additional Resources
The following tutorials explain how to perform other common tasks in Python:
How to Create a Residual Plot in Python
How to Create a Q-Q Plot in Python
How to Test for Multicollinearity in Python