An outlier is an observation that lies abnormally far away from other values in a dataset.
Outliers can be problematic because they can distort summary statistics such as the mean and standard deviation, which in turn affects the results of an analysis.
We will use the following dataset in Excel to illustrate two methods for finding outliers:
Method 1: Use the interquartile range
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a dataset. It measures the spread of the middle 50% of values.
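We can define an observation to be an outlier if it lies more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1). In other words, an outlier is any value greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR.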
The following image shows how to calculate the interquartile range in Excel:
Next, we can use the formula mentioned above to assign a “1” to any value that is an outlier in the dataset:
We see that only one value – 164 – turns out to be an outlier in this dataset.
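For readers who want to check the same rule outside Excel, here is a minimal Python sketch of the IQR method. The function name `iqr_outliers`, the multiplier parameter `k`, and the sample values are all hypothetical; the values are not the dataset used in this article. The sketch assumes that `statistics.quantiles` with `method="inclusive"` interpolates quartiles the same way as Excel's QUARTILE.INC, so results may differ if your worksheet uses another quartile function.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Quartiles by inclusive linear interpolation (assumed to match QUARTILE.INC)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr  # anything below this is flagged
    upper = q3 + k * iqr  # anything above this is flagged
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample data, NOT the dataset from this article
data = [10, 12, 13, 15, 16, 18, 19, 21, 22, 60]
print(iqr_outliers(data))  # prints [60]
```

Here Q1 is 13.5 and Q3 is 20.5, so the IQR is 7 and the cutoffs are 3 and 31; only the value 60 falls outside them.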
Method 2: Use z-scores
A z-score tells you how many standard deviations a given value is from the mean. We use the following formula to calculate a z-score:
z = (X - μ) / σ
where:
- X is a single raw data value
- μ is the population mean
- σ is the population standard deviation
We can define an observation to be an outlier if it has a z-score less than -3 or greater than 3.
The following image shows how to calculate the mean and standard deviation for a dataset in Excel:
We can then use the mean and standard deviation to find the z-score for each individual value in the dataset:
We can then assign a “1” to any value that has a z-score less than -3 or greater than 3:
Using this method, we see that there are no outliers in the dataset.
Note: Sometimes a z-score of 2.5 is used instead of 3. In this case, the individual value of 164 would be considered an outlier since it has a z-score greater than 2.5.
When using the z-score method, use your best judgement to choose the cutoff beyond which a value counts as an outlier, and apply that cutoff consistently.
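The z-score rule, including the choice between a cutoff of 3 and 2.5, can be sketched in Python as follows. The function name `z_score_outliers`, its `threshold` parameter, and the sample values are hypothetical and are not taken from this article's dataset. The sketch uses the population standard deviation (`statistics.pstdev`) to match the σ in the formula above; if your worksheet uses a sample standard deviation, `statistics.stdev` would be the closer equivalent.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population standard deviation (assumption)
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Hypothetical sample data, NOT the dataset from this article
data = [10, 12, 13, 15, 16, 18, 19, 21, 22, 60]
print(z_score_outliers(data))                 # prints []
print(z_score_outliers(data, threshold=2.5))  # prints [60]
```

In this sample the value 60 has a z-score of roughly 2.89, so it is flagged with a cutoff of 2.5 but not with a cutoff of 3, which shows how much the result depends on the cutoff you choose.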
How to Handle Outliers
If an outlier is present in your data, you have a few options:
1. Make sure the outlier is not the result of a data entry error.
Sometimes a value is simply recorded incorrectly. If an outlier is present, first verify that the value was entered correctly.
2. Remove the outlier.
If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis. Just make sure to mention in your final report or analysis that you removed an outlier.
3. Assign a new value to the outlier.
If the outlier is the result of a data entry error and the correct value cannot be recovered, you may decide to assign a new value to it, such as the mean or the median of the dataset.
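Options 2 and 3 can be illustrated with a short Python sketch. The helper names `remove_outliers` and `replace_with_median`, as well as the sample values, are hypothetical. The replacement uses the median of the full dataset, as described above; this is one possible choice, not the only reasonable one.

```python
import statistics

def remove_outliers(values, flagged):
    """Option 2: drop every value that was flagged as an outlier."""
    flagged = set(flagged)
    return [v for v in values if v not in flagged]

def replace_with_median(values, flagged):
    """Option 3: replace each flagged value with the median of the dataset."""
    flagged = set(flagged)
    median = statistics.median(values)  # median of the full dataset
    return [median if v in flagged else v for v in values]

# Hypothetical sample data with 60 already flagged by one of the methods above
data = [10, 12, 13, 15, 16, 18, 19, 21, 22, 60]
print(remove_outliers(data, [60]))      # 60 is dropped
print(replace_with_median(data, [60]))  # 60 becomes 17.0
```

Whichever option you use, keep the original values somewhere so that the change can be reported and reversed.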