**Cook’s distance** is used to identify influential observations in a regression model.

The formula for Cook’s distance is:

$$D_i = \frac{r_i^2}{p \cdot MSE} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}$$

where:

- $r_i$ is the $i$-th residual
- $p$ is the number of coefficients in the regression model
- $MSE$ is the mean squared error
- $h_{ii}$ is the $i$-th leverage value
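The formula above can be sketched in a few lines of Python for the simple linear regression case (one predictor plus an intercept, so p = 2). This is only an illustrative sketch, not what SPSS runs internally: the function name `cooks_distance` is hypothetical, the MSE is assumed to be the residual sum of squares divided by n − p, and the leverage uses the closed form that holds for a single predictor. The `ad_spend` and `sales` values are made up for illustration and are not the tutorial's dataset.

```python
def cooks_distance(x, y):
    """Cook's distance for a simple linear regression of y on x (with intercept)."""
    n = len(x)
    p = 2  # coefficients in the model: intercept and slope

    # Ordinary least-squares fit
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residuals r_i and mean squared error (assumed here to be SSE / (n - p))
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(r ** 2 for r in residuals) / (n - p)

    # Leverage h_ii; this closed form is specific to one predictor plus intercept
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # D_i = (r_i^2 / (p * MSE)) * (h_ii / (1 - h_ii)^2)
    return [
        (r ** 2 / (p * mse)) * (h / (1 - h) ** 2)
        for r, h in zip(residuals, leverage)
    ]


# Hypothetical data for 12 stores (NOT the dataset used in this tutorial)
ad_spend = [10, 12, 15, 18, 20, 22, 25, 27, 30, 32, 35, 40]
sales = [25, 28, 31, 38, 40, 41, 50, 52, 55, 61, 90, 72]

for store, d in enumerate(cooks_distance(ad_spend, sales), start=1):
    print(store, round(d, 3))
```

Both factors matter: an observation needs a sizeable residual and a sizeable leverage to produce a large Cook’s distance.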

Cook’s distance effectively measures how much all of the fitted values in the model change when the $i$-th observation is deleted.
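To make the deletion interpretation concrete, the sketch below refits a simple linear regression with each observation left out in turn and sums the squared changes in all fitted values, scaled by p times the MSE of the full model. For ordinary least squares this gives the same values as the leverage-based formula. The helper names `fit_line` and `cooks_distance_by_deletion` are hypothetical, and the data are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def cooks_distance_by_deletion(x, y):
    """Cook's distance computed by refitting without each observation.

    x and y are expected to be plain Python lists of equal length.
    """
    n, p = len(x), 2  # p = 2 coefficients: intercept and slope
    a, b = fit_line(x, y)
    fitted = [a + b * xi for xi in x]
    # MSE of the full model, assumed to be SSE / (n - p)
    mse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - p)

    distances = []
    for i in range(n):
        # Refit the model with observation i removed
        a_i, b_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        # Total squared change in the fitted values of ALL n observations
        shift = sum((fj - (a_i + b_i * xj)) ** 2 for xj, fj in zip(x, fitted))
        distances.append(shift / (p * mse))
    return distances


# Hypothetical data: the last point sits far from the trend of the others
x = [1, 2, 3, 4, 5, 10]
y = [2, 4, 5, 8, 10, 12]
print([round(d, 3) for d in cooks_distance_by_deletion(x, y)])
```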

The larger the value for Cook’s distance, the more influential a given observation.

A general rule of thumb is that any observation with a Cook’s distance greater than 4/n (where *n* = total observations) is considered to be highly influential.

The following example shows how to calculate Cook’s distance for a regression model in SPSS:

**Example: How to Calculate Cook’s Distance in SPSS**

Suppose we have the following dataset in SPSS that contains information about total ad spend and total sales for 12 different retail stores:

Suppose that we would like to fit a simple linear regression model to this dataset, using **Ad_Spend** as the predictor variable and **Sales** as the response variable.

To do so, click the **Analyze** tab, then click **Regression**, then click **Linear**:

In the new window that appears, drag **Sales** to the **Dependent** panel and then drag **Ad_Spend** to the **Independent** panel:

Next, click the **Save** button and check the box next to **Cook’s** under **Distances**:

Then click **Continue**, followed by **OK**.

A new variable will be created in the **Data View** named **COO_1** that shows Cook’s distance for each observation:

Recall that, as a rule of thumb, any observation with a Cook’s distance greater than 4/n is considered to be highly influential.

In this particular dataset there are 12 observations, so any observation with a Cook’s distance greater than 4/12 = **0.333** is considered to be highly influential.

We can see that three observations in the dataset are just above this threshold.
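The 4/n rule of thumb is easy to apply programmatically once the Cook’s distance values are exported from SPSS. The sketch below is a minimal illustration: the function name `flag_influential` is hypothetical, and the list of distances is made up rather than copied from the **COO_1** column in this example.

```python
def flag_influential(distances):
    """Return the 4/n threshold and the (position, distance) pairs above it."""
    n = len(distances)
    threshold = 4 / n
    # Positions are 1-based so they line up with row numbers in a data table
    flagged = [(i + 1, d) for i, d in enumerate(distances) if d > threshold]
    return threshold, flagged


# Hypothetical Cook's distance values for 12 observations (NOT the tutorial's output)
example_distances = [0.02, 0.01, 0.05, 0.36, 0.00, 0.03,
                     0.08, 0.01, 0.04, 0.41, 0.02, 0.10]

threshold, flagged = flag_influential(example_distances)
print(f"Threshold 4/n = {threshold:.3f}")
for position, distance in flagged:
    print(f"Observation {position}: Cook's distance = {distance}")
```

Flagged observations are candidates for a closer look, not automatic deletions.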

To visualize the Cook’s distance values for each observation, click the **Graphs** menu, then click **Scatter/Dot** (found under **Legacy Dialogs** in many SPSS versions):

Then click **Simple Scatter**:

In the new window that appears, drag **Store_ID** to the **X Axis** and **Cook’s Distance** to the **Y Axis**:

Then click **OK**.

The following scatterplot will be generated, showing the 12 store IDs along the x-axis and Cook’s distance along the y-axis:

This plot helps us visualize Cook’s distance for each observation and allows us to quickly spot which observations have the highest Cook’s distance values.

**Notes on Cook’s Distance**

It’s important to keep in mind that Cook’s distance should be used as a way to **identify** potentially influential observations.

Just because an observation is influential doesn’t necessarily mean that it should be deleted from the dataset. First, you should verify that the observation isn’t a result of a data entry error or some other odd occurrence.

If it turns out to be a legitimate value, you can then decide if it’s appropriate to delete it, leave it, or replace it with an alternative value like the median.