In statistics, an influential observation is an observation in a dataset that, when removed, dramatically changes the coefficient estimates of a regression model.
The most common way to measure the influence of observations is to use Cook’s distance, which quantifies how much all of the fitted values in a regression model change when the ith observation is deleted.
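For reference, this definition is usually written as the formula below. The notation is chosen here for illustration: ŷ_j is the j-th fitted value from the full model, ŷ_j(i) is the j-th fitted value when observation i is left out, e_i and h_ii are the residual and leverage of observation i, p is the number of estimated coefficients (including the intercept), and s² is the mean squared error of the full model.

```latex
D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{p \, s^2}
    = \frac{e_i^2}{p \, s^2} \cdot \frac{h_{ii}}{\left( 1 - h_{ii} \right)^2}
```

The second form depends only on quantities from the full model, so the statistic can be computed without refitting the regression n times.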
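As a rule of thumb, any observation with a Cook’s distance greater than 1 is considered potentially influential. Note that this is not the same as high leverage: leverage describes only how unusual an observation’s predictor values are, whereas Cook’s distance combines leverage with the size of the residual.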
The following example shows how to calculate and interpret Cook’s distance for a given dataset to detect potential influential observations.
Example: Detecting Influential Observations
Suppose we have the following dataset with 14 observations:
Now suppose we fit a simple linear regression model. The regression output is shown below:
Using statistical software, we can calculate the following values for Cook’s distance for each observation:
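As a minimal sketch of what such software does, the Python code below computes Cook’s distance directly from its definition by refitting the model with each observation left out. The data is hypothetical and is not the dataset from this example, and the function names fit_line and cooks_distances are made up for illustration. It assumes a simple linear regression with one predictor.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def cooks_distances(x, y):
    """Cook's distance for each observation, by refitting without it."""
    n, p = len(x), 2  # p = 2 coefficients: intercept and slope
    b0, b1 = fit_line(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    # Mean squared error of the full model
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit with observation i left out
        b0_i, b1_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Total squared change across all n fitted values
        shift = sum((fi - (b0_i + b1_i * xi)) ** 2 for fi, xi in zip(fitted, x))
        distances.append(shift / (p * mse))
    return distances


# Hypothetical data for illustration only (not the dataset from this example)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25]
y = [3, 5, 6, 9, 11, 12, 15, 17, 18, 21, 23, 24, 27, 10]

distances = cooks_distances(x, y)
# Zero-based indices of observations above the rule-of-thumb cutoff of 1
flagged = [i for i, d in enumerate(distances) if d > 1]

for i, d in enumerate(distances):
    note = "  <-- exceeds 1" if d > 1 else ""
    print(f"obs {i + 1}: D = {d:.3f}{note}")
```

Refitting n times is easy to follow but slow for large datasets. Statistical packages typically use the equivalent residual-and-leverage form instead.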
Notice that the last observation has a Cook’s distance well above 1, which tells us that it’s an influential observation.
Suppose we remove this value from the dataset and fit a new simple linear regression model. The output for this model is shown below:
Notice that the regression coefficients for the intercept and x both changed dramatically. This tells us that removing the influential observation from the dataset completely changed the fitted regression model.
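The same comparison can be sketched in a few lines of Python. The data below is hypothetical and is not the dataset from this example, fit_line is an illustrative helper, and the index of the influential observation is assumed to be known already.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data for illustration only (not the dataset from this example)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25]
y = [3, 5, 6, 9, 11, 12, 15, 17, 18, 21, 23, 24, 27, 10]
influential = 13  # zero-based index of the flagged observation (assumed known)

# Fit with all observations
b0_full, b1_full = fit_line(x, y)

# Fit again with the flagged observation removed
x_reduced = x[:influential] + x[influential + 1:]
y_reduced = y[:influential] + y[influential + 1:]
b0_reduced, b1_reduced = fit_line(x_reduced, y_reduced)

print(f"all observations:    intercept = {b0_full:.3f}, slope = {b1_full:.3f}")
print(f"without observation: intercept = {b0_reduced:.3f}, slope = {b1_reduced:.3f}")
```

With this made-up data, a single point is enough to pull the slope far away from the trend followed by the other thirteen observations.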
The following plots show the difference between these two fitted regression equations:
Notice how much the one influential observation changes the regression line. By removing this observation, we were able to find a regression line that fits the remaining observations much more closely.
It’s important to note that Cook’s distance should be used as a way to identify potentially influential observations. However, just because an observation is influential doesn’t necessarily mean that it should be deleted from the dataset.
First, you should verify that the observation isn’t the result of a data entry error or some other odd occurrence. If it turns out to be a legitimate value, you can then decide to deal with it in one of the following ways:
- Delete it from the dataset.
- Leave it in the dataset.
- Replace it with an alternative value like the mean or median.
Depending on your specific scenario, one of these options may make more sense than the others.
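The three options can be sketched in Python as follows. The data is hypothetical, the flagged index is assumed to be known, and replacing only the response value with the median of the other responses is just one possible reading of the third option.

```python
from statistics import median

# Hypothetical data for illustration only (not the dataset from this example)
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 25]
y = [3, 5, 6, 9, 11, 12, 15, 17, 18, 21, 23, 24, 27, 10]
flagged = 13  # zero-based index of the influential observation (assumed known)

# Option 1: delete the observation from both variables
x_deleted = x[:flagged] + x[flagged + 1:]
y_deleted = y[:flagged] + y[flagged + 1:]

# Option 2: leave it in, i.e. keep using x and y unchanged

# Option 3: replace the response with the median of the other responses
# (assumption: only y is replaced; x is left as it is)
y_replaced = list(y)
y_replaced[flagged] = median(y_deleted)
```

Whichever option is chosen, it is worth refitting the model afterwards and recording which observations were changed or dropped.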
How to Calculate Cook’s Distance in Practice
The following tutorials explain how to calculate Cook’s distance for a given dataset in Python and R: