How to Perform Quantile Regression in Python

Linear regression is a method we can use to understand the relationship between one or more predictor variables and a response variable.

Typically when we perform linear regression, we’re interested in estimating the mean value of the response variable.

However, we could instead use a method known as quantile regression to estimate any quantile or percentile value of the response value such as the 70th percentile, 90th percentile, 98th percentile, etc.

This tutorial provides a step-by-step example of how to use this function to perform quantile regression in Python.

Step 1: Load the Necessary Packages

First, we’ll load the necessary packages and functions:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

Step 2: Create the Data

For this example we’ll create a dataset that contains the hours studied and the exam score received for 100 students at some university:

#make this example reproducible

#create dataset
obs = 100

hours = np.random.uniform(1, 10, obs)
score = 60 + 2*hours + np.random.normal(loc=0, scale=.45*hours, size=100)

df = pd.DataFrame({'hours': hours, 'score': score})

#view first five rows

hours	score
0	5.939322	68.764553
1	7.436704	77.888040
2	6.424870	74.196060
3	5.903949	67.726441
4	4.812893	72.849046

Step 3: Perform Quantile Regression

Next, we’ll fit a quantile regression model using hours studied as the predictor variable and exam score as the response variable.

We’ll use the model to predict the expected 90th percentile of exam scores based on the number of hours studied:

#fit the model
model = smf.quantreg('score ~ hours', df).fit(q=0.9)

#view model summary

                         QuantReg Regression Results                          
Dep. Variable:                  score   Pseudo R-squared:               0.6057
Model:                       QuantReg   Bandwidth:                       3.822
Method:                 Least Squares   Sparsity:                        10.85
Date:                Tue, 29 Dec 2020   No. Observations:                  100
Time:                        15:41:44   Df Residuals:                       98
                                        Df Model:                            1
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept     59.6104      0.748     79.702      0.000      58.126      61.095
hours          2.8495      0.128     22.303      0.000       2.596       3.103

From the output, we can see the estimated regression equation:

90th percentile of exam score = 59.6104 + 2.8495*(hours)

For example, the 90th percentile of scores for all students who study 8 hours is expected to be 82.4:

90th percentile of exam score = 59.6104 + 2.8495*(8) = 82.4.

The output also displays the upper and lower confidence limits for the intercept and the predictor variable hours.

Step 4: Visualize the Results

We can also visualize the results of the regression by creating a scatterplot with the fitted quantile regression equation overlaid on the plot:

#define figure and axis
fig, ax = plt.subplots(figsize=(8, 6))

#get y values
get_y = lambda a, b: a + b * hours
y = get_y(model.params['Intercept'], model.params['hours'])

#plot data points with quantile regression equation overlaid
ax.plot(hours, y, color='black')
ax.scatter(hours, score, alpha=.3)
ax.set_xlabel('Hours Studied', fontsize=14)
ax.set_ylabel('Exam Score', fontsize=14)

Quantile regression in Python

Unlike a simple linear regression line, notice that this fitted line doesn’t represent the “line of best fit” for the data. Instead, it goes through the estimated 90th percentile at each level of the predictor variable.

Additional Resources

How to Perform Simple Linear Regression in Python
How to Perform Quadratic Regression in Python

Leave a Reply

Your email address will not be published. Required fields are marked *