Logistic regression is a method that we can use to fit a regression model when the response variable is binary.
Logistic regression makes the following assumptions, each of which should be checked when fitting a model to a dataset:
Assumption #1: The Response Variable is Binary
Logistic regression assumes that the response variable only takes on two possible outcomes. Some examples include:
- Yes or No
- Male or Female
- Pass or Fail
- Drafted or Not Drafted
- Malignant or Benign
How to check this assumption: Simply count how many unique outcomes occur in the response variable. If there are more than two possible outcomes, you will need a different model: ordinal regression if the outcomes have a natural order, or multinomial logistic regression if they do not.
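As a minimal sketch of this check in Python, assuming the response values are stored in a plain list (the names `unique_outcomes`, `is_binary`, and the sample data are illustrative only):

```python
def unique_outcomes(response):
    """Return the distinct outcomes found in the response, sorted."""
    return sorted(set(response))


def is_binary(response):
    """True when the response takes on exactly two distinct outcomes."""
    return len(set(response)) == 2


# Made-up example data, for illustration only.
response = ["Pass", "Fail", "Pass", "Pass", "Fail"]

print(unique_outcomes(response))
print(is_binary(response))
```

Note that this sketch treats a missing-value marker (such as an empty string) as its own outcome, so missing values should be handled before counting.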
Assumption #2: The Observations are Independent
Logistic regression assumes that the observations in the dataset are independent of each other. That is, the observations should not come from repeated measurements of the same individual or be related to each other in any way.
How to check this assumption: The easiest way to check this assumption is to create a plot of residuals against time (i.e. the order of the observations) and observe whether or not there is a random pattern. If there is not a random pattern, then this assumption may be violated.
Assumption #3: There is No Multicollinearity Among Explanatory Variables
Multicollinearity occurs when two or more explanatory variables are highly correlated to each other, such that they do not provide unique or independent information in the regression model. If the degree of correlation is high enough between variables, it can cause problems when fitting and interpreting the model.
For example, suppose you want to perform logistic regression using whether or not a player is drafted as the response variable and the following variables as explanatory variables:
- Player height
- Player shoe size
- Hours spent practicing per day
In this case, height and shoe size are likely to be highly correlated since taller people tend to have larger shoe sizes. This means that multicollinearity is likely to be a problem if we use both of these variables in the regression.
How to check this assumption: The most common way to detect multicollinearity is by using the variance inflation factor (VIF), which measures how much the variance of a coefficient estimate is inflated because that predictor is correlated with the other predictor variables in the model. A VIF of 1 indicates that a predictor is uncorrelated with the others, while values above 5 are commonly treated as a sign of potentially problematic multicollinearity.
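The VIF calculation can be sketched in Python using only the standard library. This sketch relies on the fact that the VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix. The function names and the data values below are made up for illustration; in practice a statistics library would normally do this for you.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    k = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j]))
            / (norms[i] * norms[j])
            for j in range(k)
        ]
        for i in range(k)
    ]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(matrix)
    # Augment the matrix with the identity matrix on the right.
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(k)]
        for i, row in enumerate(matrix)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def vif(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Made-up data, for illustration only.
height = [170, 175, 180, 185, 190, 195, 200, 205]        # cm
shoe_size = [8.0, 8.5, 9.5, 10.0, 11.0, 11.5, 12.5, 13.0]
hours = [2, 1, 3, 2, 1, 3, 2, 1]                          # practice hours per day

vifs = vif([height, shoe_size, hours])
for name, value in zip(["height", "shoe_size", "hours"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

With data like this, where height and shoe size move almost in lockstep, both of those predictors receive very large VIF values, which suggests dropping or combining one of them.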
Assumption #4: There are No Extreme Outliers
Logistic regression assumes that there are no extreme outliers or influential observations in the dataset.
How to check this assumption: The most common way to test for extreme outliers and influential observations in a dataset is to calculate Cook’s distance for each observation. If there are indeed outliers, you can choose to (1) remove them, (2) replace them with a value like the mean or median, or (3) simply keep them in the model but make a note about this when reporting the regression results.
Assumption #5: There is a Linear Relationship Between Explanatory Variables and the Logit of the Response Variable
Logistic regression assumes that there exists a linear relationship between each explanatory variable and the logit of the response variable. Recall that the logit is defined as:
Logit(p) = log(p / (1-p)) where p is the probability of a positive outcome.
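To make the definition concrete, here is a small Python sketch of the logit and its inverse. The function names `logit` and `inverse_logit` are chosen here for illustration, and the log is the natural log.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined only for 0 < p < 1."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1 - p))


def inverse_logit(x):
    """Map a log-odds value back to a probability."""
    return 1 / (1 + math.exp(-x))


for p in (0.2, 0.5, 0.8):
    print(p, logit(p))
```

A probability of 0.5 corresponds to a logit of 0, probabilities above 0.5 give positive logits, and probabilities below 0.5 give negative logits.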
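How to check this assumption: A common way to see if this assumption is met is to use a Box-Tidwell test. For each continuous explanatory variable, the test adds an interaction term between the variable and its natural log to the model. If that term is statistically significant, the relationship between the variable and the logit is likely not linear.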
Assumption #6: The Sample Size is Sufficiently Large
Logistic regression assumes that the sample size of the dataset is large enough to draw valid conclusions from the fitted logistic regression model.
How to check this assumption: As a rule of thumb, you should have a minimum of 10 cases with the least frequent outcome for each explanatory variable. For example, if you have 3 explanatory variables and the expected probability of the least frequent outcome is 0.20, then you should have a sample size of at least (10*3) / 0.20 = 150.
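The rule of thumb above can be written as a short Python helper. The function name `minimum_sample_size` and its parameter names are illustrative, and the default of 10 cases per explanatory variable is simply the rule of thumb stated above, not a hard requirement.

```python
import math


def minimum_sample_size(n_predictors, p_least_frequent, cases_per_predictor=10):
    """Smallest sample size that satisfies the cases-per-predictor rule of thumb.

    n_predictors:        number of explanatory variables
    p_least_frequent:    expected probability of the least frequent outcome
    cases_per_predictor: required cases of that outcome per explanatory variable
    """
    if not 0 < p_least_frequent <= 0.5:
        raise ValueError("p_least_frequent must be in (0, 0.5]")
    required = cases_per_predictor * n_predictors / p_least_frequent
    # Round first so floating-point noise does not bump the result up by one.
    return math.ceil(round(required, 6))


# The worked example from the text: 3 explanatory variables, probability 0.20.
print(minimum_sample_size(3, 0.20))
```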
Assumptions of Logistic Regression vs. Linear Regression
In contrast to linear regression, logistic regression does not require:
- A linear relationship between the explanatory variable(s) and the response variable.
- The residuals of the model to be normally distributed.
- The residuals to have constant variance, also known as homoscedasticity.