Logistic regression is a type of regression we can use when the response variable is binary.

One common way to evaluate the quality of a logistic regression model is to create a **confusion matrix**, which is a 2×2 table that shows the predicted values from the model vs. the actual values from the test dataset.

The following step-by-step example shows how to create a confusion matrix in R.

**Step 1: Fit the Logistic Regression Model**

For this example we’ll use the **Default** dataset from the **ISLR** package. We’ll use student status, bank balance, and annual income to predict the probability that a given individual defaults on their loan.

The following code shows how to fit a logistic regression model to this dataset:

    #load necessary packages
    library(caret)
    library(InformationValue)
    library(ISLR)

    #load dataset
    data <- Default

    #split dataset into training and testing set
    set.seed(1)
    sample <- sample(c(TRUE, FALSE), nrow(data), replace=TRUE, prob=c(0.7,0.3))
    train <- data[sample, ]
    test <- data[!sample, ]

    #fit logistic regression model
    model <- glm(default~student+balance+income, family="binomial", data=train)

**Step 2: Create the Confusion Matrix**

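Next, we’ll use the **confusionMatrix()** function from the **InformationValue** package to create a confusion matrix. This function expects the actual values coded as 1's and 0's and the predictions as probabilities, so we first convert **default** from "Yes" and "No" to 1 and 0. Note that the **caret** package exports a function with the same name that takes different arguments, which is a common source of errors when both packages are loaded: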

    #use model to predict probability of default
    predicted <- predict(model, test, type="response")

    #convert defaults from "Yes" and "No" to 1's and 0's
    test$default <- ifelse(test$default=="Yes", 1, 0)

    #find optimal cutoff probability to use to maximize accuracy
    optimal <- optimalCutoff(test$default, predicted)[1]

    #create confusion matrix
    #(the InformationValue:: prefix avoids calling caret's confusionMatrix() by mistake;
    # no threshold is passed here, so the function's default cutoff is used, not optimal)
    InformationValue::confusionMatrix(test$default, predicted)

         0  1
    0 2912 64
    1   21 39
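To see what a cutoff probability does, here is a minimal base-R sketch that does not depend on InformationValue. The toy vectors `actual` and `prob` and the helpers `confusion_at()` and `error_at()` are hypothetical names made up for this illustration. A probability becomes a 0/1 prediction by comparing it with a cutoff, and a simple grid search keeps the cutoff with the lowest misclassification rate. This is the general idea behind searching for an optimal cutoff; the exact procedure inside `optimalCutoff()` may differ.

```r
# Toy data (hypothetical): actual classes and predicted probabilities
actual <- c(0, 0, 0, 0, 1, 0, 1, 1, 0, 1)
prob   <- c(0.05, 0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.15, 0.90)

# Cross-tabulate predicted vs. actual classes at a given cutoff.
# Rows are predicted classes, columns are actual classes.
confusion_at <- function(actual, prob, threshold = 0.5) {
  predicted <- ifelse(prob >= threshold, 1, 0)
  table(
    predicted = factor(predicted, levels = c(0, 1)),
    actual    = factor(actual, levels = c(0, 1))
  )
}

# Share of observations classified incorrectly at a given cutoff
error_at <- function(actual, prob, threshold = 0.5) {
  mean(ifelse(prob >= threshold, 1, 0) != actual)
}

# Confusion matrix at the conventional 0.5 cutoff
cm <- confusion_at(actual, prob)
print(cm)

# Grid search: evaluate the error rate at each candidate cutoff
# and keep the first cutoff that attains the minimum
grid   <- seq(0.01, 0.99, by = 0.01)
errors <- sapply(grid, function(t) error_at(actual, prob, t))
best   <- grid[which.min(errors)]
print(best)
```

Several cutoffs can tie for the lowest error rate, as they do on this toy data, so the "optimal" value depends on the tie-breaking rule as well as on the data.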

**Step 3: Evaluate the Confusion Matrix**

We can also calculate the following metrics using the confusion matrix:

- **Sensitivity:** The “true positive rate”, i.e. the percentage of individuals who actually defaulted that the model correctly predicted would default.
- **Specificity:** The “true negative rate”, i.e. the percentage of individuals who did *not* default that the model correctly predicted would *not* default.
- **Total misclassification rate:** The percentage of total incorrect classifications made by the model.

The following code shows how to calculate these metrics:

    #calculate sensitivity
    #(the InformationValue:: prefix avoids calling caret's function of the same name)
    InformationValue::sensitivity(test$default, predicted)
    [1] 0.3786408

    #calculate specificity
    InformationValue::specificity(test$default, predicted)
    [1] 0.9928401

    #calculate total misclassification error rate
    misClassError(test$default, predicted, threshold=optimal)
    [1] 0.027

The total misclassification error rate is **2.7%** for this model.
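As a cross-check, the same metrics can be computed by hand from the four counts in the confusion matrix. The sketch below is plain base R and assumes the rows of the matrix are predicted classes and the columns are actual classes, which is the reading that reproduces the sensitivity and specificity reported above. The variable names are chosen for this illustration only.

```r
# Counts copied from the confusion matrix above
# (assumed layout: rows = predicted class, columns = actual class)
cm <- matrix(c(2912, 21, 64, 39), nrow = 2,
             dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))

tp <- cm["1", "1"]  # predicted default, actually defaulted
fn <- cm["0", "1"]  # predicted no default, actually defaulted
fp <- cm["1", "0"]  # predicted default, did not default
tn <- cm["0", "0"]  # predicted no default, did not default

sens <- tp / (tp + fn)          # true positive rate: 39 / 103
spec <- tn / (tn + fp)          # true negative rate: 2912 / 2933
err  <- (fp + fn) / sum(cm)     # misclassification rate: 85 / 3036

round(c(sensitivity = sens, specificity = spec, error = err), 7)
```

The error rate from these counts is about 2.8%, slightly above the 2.7% reported by `misClassError()`. That gap is what one would expect if the matrix was produced at the function's default cutoff while the 2.7% figure uses the `optimal` cutoff.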

In general, the lower this rate, the better the model is able to predict outcomes. Keep in mind, however, that the sensitivity is only about 37.9%, so the model still misses most of the individuals who actually default even though its overall error rate is low.
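To put the error rate in context, it can help to compare it with a naive baseline that predicts "no default" for everyone. The sketch below uses only the counts from the confusion matrix above, again assuming the columns hold the actual classes, together with the reported 2.7% error rate. The variable names are illustrative.

```r
# Actual class totals in the test set, taken from the confusion matrix above
# (assumed layout: columns = actual class)
n_default    <- 64 + 39     # individuals who actually defaulted
n_no_default <- 2912 + 21   # individuals who did not default
n_total      <- n_default + n_no_default

# A rule that always predicts "no default" is wrong exactly
# for the individuals who defaulted
baseline_error <- n_default / n_total

# Misclassification error rate reported for the fitted model
model_error <- 0.027

round(c(baseline = baseline_error, model = model_error), 4)
```

Because defaults make up only about 3.4% of the test set, even this naive rule has a low error rate, so sensitivity and specificity are worth reading alongside the overall misclassification rate.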

