A residual is the difference between an observed value and a predicted value in a regression model.
It is calculated as:
Residual = Observed value – Predicted value
If we plot the observed values and overlay the fitted regression line, the residual for each observation is the vertical distance between the observation and the regression line.
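As a minimal sketch of this definition in Python, the snippet below subtracts each predicted value from the corresponding observed value. The function name `residuals` and the three data points are hypothetical, chosen only to show that points above the line get positive residuals and points below it get negative ones.

```python
def residuals(observed, predicted):
    """Return observed minus predicted for each pair of values."""
    return [y - y_hat for y, y_hat in zip(observed, predicted)]

# Hypothetical values, chosen only to illustrate the sign convention.
observed = [10.0, 12.0, 15.0]
predicted = [11.0, 12.0, 13.5]
print(residuals(observed, predicted))  # [-1.0, 0.0, 1.5]
```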
One type of residual we often use to identify outliers in a regression model is known as a standardized residual.
It is calculated as:
ri = ei / s(ei) = ei / (RSE × √(1 – hii))
where:
- ei: The ith residual
- RSE: The residual standard error of the model
- hii: The leverage of the ith observation
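The formula can be sketched as a small Python function. The name `standardized_residual` and the example inputs are assumptions made for illustration, not part of any particular statistics library.

```python
import math

def standardized_residual(e_i, rse, h_ii):
    """Divide a raw residual by its estimated standard deviation, RSE * sqrt(1 - h_ii)."""
    if not 0 <= h_ii < 1:
        raise ValueError("this sketch expects a leverage in [0, 1)")
    return e_i / (rse * math.sqrt(1 - h_ii))

# Hypothetical inputs: residual 2.0, RSE 1.0, leverage 0.75.
print(standardized_residual(2.0, 1.0, 0.75))  # 4.0
```

Because 1 – hii shrinks as leverage grows, a high-leverage observation ends up with a larger standardized residual than a low-leverage observation with the same raw residual.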
In practice, we often consider any standardized residual with an absolute value greater than 3 to be an outlier.
This doesn’t necessarily mean that we’ll remove these observations from the model, but we should at least investigate them further to verify that they’re not a result of a data entry error or some other odd occurrence.
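A screening step along these lines might look like the sketch below. The function name `flag_outliers` and the list of standardized residuals are hypothetical; the cutoff is passed in as a parameter so that a different threshold can be tried.

```python
def flag_outliers(std_resid, threshold=3.0):
    """Return the positions of standardized residuals whose absolute value exceeds the threshold."""
    return [i for i, r in enumerate(std_resid) if abs(r) > threshold]

# Hypothetical standardized residuals, not taken from the example that follows.
std_resid = [0.4, -1.2, 3.6, -3.1, 2.4]
print(flag_outliers(std_resid))       # [2, 3]
print(flag_outliers(std_resid, 2.0))  # [2, 3, 4]
```

The function only reports positions. Deciding whether to keep, correct, or drop a flagged observation is left to the analyst.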
Note: Sometimes standardized residuals are also referred to as “internally studentized residuals.”
Example: How to Calculate Standardized Residuals
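Suppose we have a dataset with 12 total observations of a predictor variable x and a response variable y.
Fitting a simple linear regression model to this dataset with statistical software gives the following line of best fit:
y = 29.63 + 0.7553x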
Using this line, we can calculate the predicted value of y for each observation based on its value of x. For example, the first observation has x = 8, so its predicted value is:
y = 29.63 + 0.7553*(8) = 35.67
We can then calculate the residual for this observation as:
Residual = Observed value – Predicted value = 41 – 35.67 = 5.33
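The arithmetic for the first observation can be reproduced with a few lines of Python. This is only a sketch: the names `INTERCEPT`, `SLOPE`, and `predict` are introduced here for illustration, while the numbers are the ones given in the text.

```python
# Coefficients of the fitted line reported above.
INTERCEPT = 29.63
SLOPE = 0.7553

def predict(x):
    """Predicted value of y at x under the fitted line."""
    return INTERCEPT + SLOPE * x

x_first, y_first = 8, 41   # first observation, as given in the text
predicted = predict(x_first)
residual = y_first - predicted
print(round(predicted, 2), round(residual, 2))  # 35.67 5.33
```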
We can repeat this process to find the residual for each of the 12 observations.
We can also use statistical software to find that the residual standard error of the model is 4.44.
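For a simple linear regression, the residual standard error is the square root of the sum of squared residuals divided by n – 2, the residual degrees of freedom. The sketch below shows that calculation. The function name `residual_standard_error` and the four residuals are hypothetical and are not the residuals of the 12-observation example.

```python
import math

def residual_standard_error(resid, n_params=2):
    """sqrt(SSR / (n - n_params)); n_params=2 covers an intercept and one slope."""
    dof = len(resid) - n_params
    if dof <= 0:
        raise ValueError("need more observations than estimated parameters")
    return math.sqrt(sum(e ** 2 for e in resid) / dof)

# Hypothetical residuals: SSR = 4, degrees of freedom = 2.
resid = [1.0, -1.0, 1.0, -1.0]
print(round(residual_standard_error(resid), 4))  # 1.4142
```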
Although the calculation is beyond the scope of this tutorial, we can also use software to find the leverage statistic (hii) for each observation.
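For readers curious about what the software computes, the leverage in a simple linear regression with an intercept has a closed form: hii = 1/n + (xi – x̄)² / Σ(xj – x̄)². The sketch below applies it to hypothetical predictor values. The function name `leverages` and the data are assumptions for illustration only.

```python
def leverages(x):
    """Leverage of each observation in a simple linear regression with an intercept."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Hypothetical predictor values; the last one sits far from the others.
x = [1, 2, 3, 4, 10]
h = leverages(x)
print([round(v, 2) for v in h])  # [0.38, 0.28, 0.22, 0.2, 0.92]
```

The leverages of a model with an intercept and one slope always sum to 2, the number of estimated coefficients, which makes a convenient sanity check.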
We can then use the following formula to calculate the standardized residual for each observation:
ri = ei / (RSE × √(1 – hii))
For example, the standardized residual for the first observation is calculated as:
r1 = 5.33 / (4.44 × √(1 – h11)) = 1.404
where h11 is the leverage of the first observation.
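As a rough consistency check, the formula can be rearranged to recover the leverage implied by the three reported numbers: h = 1 – (e / (RSE × r))². The sketch below does this for the first observation. Because the inputs are rounded, the result is only approximate, and the variable names are introduced here for illustration.

```python
# Reported values for the first observation.
e_1 = 5.33    # residual
rse = 4.44    # residual standard error
r_1 = 1.404   # standardized residual

# Rearranging r = e / (rse * sqrt(1 - h)) gives h = 1 - (e / (rse * r)) ** 2.
implied_leverage = 1 - (e_1 / (rse * r_1)) ** 2
print(round(implied_leverage, 2))  # 0.27
```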
We can repeat this process to find the standardized residual for each observation.
We can then create a quick scatterplot of the predictor values against the standardized residuals to check visually whether any standardized residual exceeds an absolute value of 3.
From the plot we can see that none of the standardized residuals exceed an absolute value of 3. Thus, none of the observations appear to be outliers.
It’s worth noting that some researchers instead treat observations whose standardized residuals exceed an absolute value of 2 as outliers.
It’s up to you to decide, depending on the field you’re working in and the specific problem you’re working on, whether to use an absolute value of 2 or 3 as the threshold for outliers.