The statsmodels library in Python offers a variety of functions and classes for fitting statistical models.

The following step-by-step example shows how to perform logistic regression using functions from statsmodels.

**Step 1: Create the Data**

First, let’s create a pandas DataFrame that contains three variables:

- Hours Studied (Integer value)
- Study Method (Method A or B)
- Exam Result (Pass or Fail)

We’ll fit a logistic regression model using hours studied and study method to predict whether or not a student passes a given exam.

The following code shows how to create the pandas DataFrame:

```
import pandas as pd

#create DataFrame
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B',
                              'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})

#view first five rows of DataFrame
df.head()

   result  hours method
0       0      1      A
1       1      2      A
2       0      2      A
3       0      2      B
4       0      3      B
```

**Step 2: Fit the Logistic Regression Model**

Next, we’ll fit the logistic regression model using the **logit()** function:

```
import statsmodels.formula.api as smf

#fit logistic regression model
model = smf.logit('result ~ hours + method', data=df).fit()

#view model summary
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.557786
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                 result   No. Observations:                   20
Model:                          Logit   Df Residuals:                       17
Method:                           MLE   Df Model:                            2
Date:                Mon, 22 Aug 2022   Pseudo R-squ.:                  0.1894
Time:                        09:53:35   Log-Likelihood:                -11.156
converged:                       True   LL-Null:                       -13.763
Covariance Type:            nonrobust   LLR p-value:                   0.07375
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -2.1569      1.416     -1.523      0.128      -4.932       0.618
method[T.B]     0.0875      1.051      0.083      0.934      -1.973       2.148
hours           0.4909      0.245      2.002      0.045       0.010       0.972
===============================================================================
```

The values in the **coef** column of the output tell us the average change in the log odds of passing the exam associated with each predictor variable.

For example:

- Using study method B is associated with an average increase of **.0875** in the log odds of passing the exam compared to using study method A.
- Each additional hour studied is associated with an average increase of **.4909** in the log odds of passing the exam.
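Since log odds are hard to interpret directly, it can help to exponentiate the coefficients to convert them into odds ratios. The sketch below repeats the data and model fit from the steps above so it runs on its own:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

#recreate the DataFrame and refit the model from the steps above
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B',
                              'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})
model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#exponentiate the coefficients to obtain odds ratios
odds_ratios = np.exp(model.params)
print(odds_ratios)
```

For example, each additional hour studied multiplies the odds of passing by roughly exp(.4909) ≈ 1.63, holding study method constant.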

The values in the **P>|z|** column represent the p-values for each coefficient.

For example:

- Study method has a p-value of **.934**. Since this value is not less than .05, there is not a statistically significant relationship between study method and whether or not a student passes the exam.
- Hours studied has a p-value of **.045**. Since this value is less than .05, there is a statistically significant relationship between hours studied and whether or not a student passes the exam.
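Rather than reading the p-values off the summary table, you can also extract them programmatically from the fitted results object. A small sketch (the data and model fit are repeated so the snippet is self-contained):

```python
import pandas as pd
import statsmodels.formula.api as smf

#recreate the DataFrame and refit the model from the steps above
df = pd.DataFrame({'result': [0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                   'hours': [1, 2, 2, 2, 3, 2, 5, 4, 3, 6, 5, 8, 8, 7, 6, 7, 5, 4, 8, 9],
                   'method': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A', 'B',
                              'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'A']})
model = smf.logit('result ~ hours + method', data=df).fit(disp=0)

#view the p-value for each coefficient
print(model.pvalues)

#keep only the predictors that are significant at the .05 level
significant = model.pvalues[model.pvalues < 0.05]
print(significant.index.tolist())
```

With this data, only **hours** clears the .05 threshold, matching the interpretation above.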

**Step 3: Evaluate Model Performance**

To assess the quality of the logistic regression model, we can look at two metrics in the output:

**1. Pseudo R-Squared**

This value can be thought of as a substitute for the R-squared value of a linear regression model.

It is calculated as one minus the ratio of the maximized log-likelihood of the full model to that of the null model (this is known as McFadden's pseudo R-squared).

This value can range from 0 to 1, with higher values indicating a better model fit.

In this example, the pseudo R-squared value is **.1894**, which is quite low. This tells us that the predictor variables in the model don’t do a very good job of predicting the value of the response variable.
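As a quick sanity check, McFadden's pseudo R-squared can be reproduced by hand from the two log-likelihood values shown in the summary output (Log-Likelihood = -11.156, LL-Null = -13.763):

```python
#log-likelihoods taken from the summary output above
ll_full = -11.156   #Log-Likelihood of the fitted model
ll_null = -13.763   #LL-Null (intercept-only model)

#McFadden's pseudo R-squared: 1 - (LL_full / LL_null)
pseudo_r2 = 1 - (ll_full / ll_null)
print(round(pseudo_r2, 4))  #matches Pseudo R-squ.: 0.1894
```

The fitted results object also exposes this value directly as `model.prsquared`.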

**2. LLR p-value**

This value can be thought of as a substitute for the p-value of the overall F-test of a linear regression model.

If this value is below a certain threshold (e.g. α = .05) then we can conclude that the model overall is “useful” and is better at predicting the values of the response variable compared to a model with no predictor variables.

In this example, the LLR p-value is **.07375**. Depending on the significance level we choose (e.g. .01, .05, .1) we may or may not conclude that the model as a whole is useful.
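The LLR p-value comes from a likelihood-ratio test: twice the difference between the full and null log-likelihoods approximately follows a chi-squared distribution with degrees of freedom equal to the number of predictors (Df Model = 2 here). A sketch reproducing the value from the summary output:

```python
from scipy.stats import chi2

#log-likelihoods taken from the summary output above
ll_full = -11.156
ll_null = -13.763

#likelihood-ratio test statistic: 2 * (LL_full - LL_null)
llr_stat = 2 * (ll_full - ll_null)

#p-value from a chi-squared distribution with Df Model = 2 degrees of freedom
p_value = chi2.sf(llr_stat, df=2)
print(p_value)  #close to the reported 0.07375 (small gap from rounding the log-likelihoods)
```

The fitted results object also exposes these values directly as `model.llr` and `model.llr_pvalue`.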

**Additional Resources**

The following tutorials explain how to perform other common tasks in Python:

How to Perform Linear Regression in Python

How to Perform Logarithmic Regression in Python

How to Perform Quantile Regression in Python