Logistic Regression is a statistical method that we use to fit a regression model when the response variable is binary. To assess how well a logistic regression model fits a dataset, we can look at the following two metrics:
- Sensitivity: The probability that the model predicts a positive outcome for an observation when indeed the outcome is positive. This is also called the “true positive rate.”
- Specificity: The probability that the model predicts a negative outcome for an observation when indeed the outcome is negative. This is also called the “true negative rate.”
One way to visualize these two metrics is by creating a ROC curve, which stands for “receiver operating characteristic” curve. This is a plot that displays the sensitivity and specificity of a logistic regression model.
The following step-by-step example shows how to create and interpret a ROC curve in Python.
Step 1: Import Necessary Packages
First, we’ll import the packages necessary to perform logistic regression in Python:
import pandas as pd import numpy as np from sklearn.model_selection import train_test_split from sklearn.linear_model import LogisticRegression from sklearn import metrics import matplotlib.pyplot as plt
Step 2: Fit the Logistic Regression Model
Next, we’ll import a dataset and fit a logistic regression model to it:
#import dataset from CSV file on Github url = "https://raw.githubusercontent.com/Statology/Python-Guides/main/default.csv" data = pd.read_csv(url) #define the predictor variables and the response variable X = data[['student', 'balance', 'income']] y = data['default'] #split the dataset into training (70%) and testing (30%) sets X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0) #instantiate the model log_regression = LogisticRegression() #fit the model using the training data log_regression.fit(X_train,y_train)
Step 3: Plot the ROC Curve
Next, we’ll calculate the true positive rate and the false positive rate and create a ROC curve using the Matplotlib data visualization package:
#define metrics y_pred_proba = log_regression.predict_proba(X_test)[::,1] fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba) #create ROC curve plt.plot(fpr,tpr) plt.ylabel('True Positive Rate') plt.xlabel('False Positive Rate') plt.show()
The more that the curve hugs the top left corner of the plot, the better the model does at classifying the data into categories.
As we can see from the plot above, this logistic regression model does a pretty poor job of classifying the data into categories.
To quantify this, we can calculate the AUC – area under the curve – which tells us how much of the plot is located under the curve.
The closer AUC is to 1, the better the model. A model with an AUC equal to 0.5 is no better than a model that makes random classifications.
Step 4: Calculate the AUC
We can use the following code to calculate the AUC of the model and display it in the lower right corner of the ROC plot:
#define metrics y_pred_proba = log_regression.predict_proba(X_test)[::,1] fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba) auc = metrics.roc_auc_score(y_test, y_pred_proba) #create ROC curve plt.plot(fpr,tpr,label="AUC="+str(auc)) plt.ylabel('True Positive Rate') plt.xlabel('False Positive Rate') plt.legend(loc=4) plt.show()
The AUC for this logistic regression model turns out to be 0.5602. Since this is close to 0.5, this confirms that the model does a poor job of classifying data.