An outlier is an observation that lies abnormally far away from other values in a dataset. Outliers can be problematic because they can affect the results of an analysis.
One way to identify outliers in a dataset is to use Chauvenet’s Criterion, which uses the following process:
1. For each individual value xi in the dataset, calculate its deviation from the mean, measured in standard deviations:
Deviation = |xi – x̄| / s
where x̄ is the sample mean and s is the sample standard deviation.
2. Compare each deviation to the critical value that Chauvenet’s Criterion Table lists for the dataset’s sample size. Any data value whose deviation is greater than that critical value is declared an outlier.
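The two steps above can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `chauvenet_outliers` and the `sample` data are hypothetical, and the critical value is passed in by the caller after looking it up for the sample size.

```python
from statistics import mean, stdev


def chauvenet_outliers(values, critical_value):
    """Return the values whose deviation exceeds the given critical value.

    `critical_value` is assumed to come from a Chauvenet's Criterion table
    for a sample of size len(values).
    """
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    # Step 1: deviation of each value; Step 2: compare against the cutoff.
    return [x for x in values if abs(x - x_bar) / s > critical_value]


# Hypothetical data for illustration only (15 values, one suspicious).
sample = [10, 11, 11, 12, 12, 12, 13, 13, 13, 14, 14, 14, 15, 15, 45]

# 2.128 is the critical value for n = 15.
flagged = chauvenet_outliers(sample, critical_value=2.128)
print(flagged)
```

The function makes a single pass over the data, computing the mean and standard deviation once from the full sample.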
Chauvenet’s Criterion: An Example
Suppose we have the following dataset of 15 values:
The sample mean for this dataset is x̄ = 17.067 and the sample standard deviation is s = 10.096. For each individual data value, we can calculate its deviation as:
Deviation = |xi – x̄| / s
- The first data value would have a deviation of |4 – 17.067| / 10.096 = 1.294.
- The second data value would have a deviation of |6 – 17.067| / 10.096 = 1.096.
And so on. Applying the same formula to the remaining values gives a deviation for every data value in the dataset.
We can then refer to Chauvenet’s Criterion Table and find that the critical value that corresponds to a sample size of n=15 is 2.128. Thus, any value with a deviation greater than 2.128 can be considered an outlier.
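If the table is not at hand, the critical value can be approximated directly. The sketch below assumes the usual statement of the criterion: a value is flagged when the two-sided normal tail probability of its deviation is less than 1 / (2n). The variable names are illustrative.

```python
from statistics import NormalDist

n = 15  # sample size from the example

# Two-sided tail probability allowed by the criterion (assumed: 1 / (2n)).
tail_probability = 1 / (2 * n)

# Half of that probability sits in each tail of the standard normal curve.
critical_value = NormalDist().inv_cdf(1 - tail_probability / 2)

print(round(critical_value, 3))
```

For n = 15 the result should come out very close to the tabulated value of 2.128.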
It turns out that the value 42 has a deviation of |42 – 17.067| / 10.096 = 2.470, which is greater than 2.128.
Thus, the value 42 is the only outlier in this dataset.
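As a quick check of the arithmetic, the snippet below recomputes the deviations for the three data values quoted in this example (4, 6, and 42) using the stated mean and standard deviation. It covers only those three values, not the full dataset, and the variable names are illustrative.

```python
x_bar = 17.067          # sample mean stated in the example
s = 10.096              # sample standard deviation stated in the example
critical_value = 2.128  # table value for n = 15

results = {}
for x in (4, 6, 42):
    deviation = abs(x - x_bar) / s
    # Store the rounded deviation and whether it exceeds the critical value.
    results[x] = (round(deviation, 3), deviation > critical_value)

for x, (deviation, is_outlier) in results.items():
    print(x, deviation, is_outlier)
```

Of these three values, only 42 exceeds the critical value.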
Cautions on Using Chauvenet’s Criterion
Chauvenet’s Criterion assumes that the values in a dataset are normally distributed. If this assumption is not met, then using Chauvenet’s Criterion to identify outliers is likely not valid.
If you do use this method and find that a value is an outlier, you should first verify that the value is not a result of a data entry error. Sometimes data is simply entered incorrectly.
If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis. Just be sure to mention that you removed an outlier when you report your results.
Also, this method should only be applied once to a given dataset. For example, suppose we use this criterion to identify the value 42 as an outlier in the previous example and remove it from the dataset. We should not then recalculate the sample mean and sample standard deviation and compute the deviations again to find more outliers.