An outlier is an observation that lies abnormally far away from other values in a dataset. Outliers can be problematic because they can distort summary statistics such as the mean and standard deviation, which in turn affects the results of an analysis.
This tutorial explains how to identify and remove outliers in Python.
How to Identify Outliers in Python
Before you can remove outliers, you must first decide on what you consider to be an outlier. There are two common ways to do so:
1. Use the interquartile range.
The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a dataset. It measures the spread of the middle 50% of values.
You could define an observation to be an outlier if it lies more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1).
Outliers = Observations > Q3 + 1.5*IQR or < Q1 – 1.5*IQR
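As a quick illustration of this rule, the following sketch computes the two cutoffs for a small array and flags anything outside them. The sample values and the variable names (values, lower_fence, upper_fence) are made up for this example, and np.percentile is used with its default linear interpolation; other quantile conventions can give slightly different cutoffs.

```python
import numpy as np

# hypothetical sample values, chosen only for illustration
values = np.array([2, 4, 5, 6, 7, 8, 9, 30])

# first and third quartiles (NumPy's default linear interpolation)
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# cutoffs 1.5*IQR below Q1 and 1.5*IQR above Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# observations outside the cutoffs are flagged as outliers
iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]
print(iqr_outliers)  # [30]
```

Here Q1 = 4.75 and Q3 = 8.25, so the cutoffs are -0.5 and 13.5 and only the value 30 is flagged.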
2. Use z-scores.
A z-score tells you how many standard deviations a given value is from the mean. We use the following formula to calculate a z-score:
z = (X – μ) / σ
where:
- X is a single raw data value
- μ is the population mean
- σ is the population standard deviation
You could define an observation to be an outlier if it has a z-score less than -3 or greater than 3.
Outliers = Observations with z-scores > 3 or < -3
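To make the formula concrete, here is a minimal sketch that computes z-scores by hand with NumPy and applies the cutoff of 3. The 20 sample values are invented for illustration, and the sketch assumes the population standard deviation (ddof=0), which is also the default used by scipy.stats.zscore.

```python
import numpy as np

# hypothetical sample: 19 values between 9 and 12 plus one unusually large value
values = np.array([10, 12, 11, 9, 10, 11, 12, 10, 9, 11,
                   10, 12, 11, 9, 10, 11, 10, 12, 9, 60], dtype=float)

mu = values.mean()     # mean of the data
sigma = values.std()   # population standard deviation (ddof=0)

# z = (X - mu) / sigma for every observation
z_scores = (values - mu) / sigma

# observations with z-scores > 3 or < -3 are flagged as outliers
z_outliers = values[np.abs(z_scores) > 3]
print(z_outliers)  # [60.]
```

Keep in mind that with the population standard deviation, no z-score can exceed the square root of (n - 1) in absolute value, so a dataset needs at least 11 observations before any value can have a z-score greater than 3.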
How to Remove Outliers in Python
Once you decide on what you consider to be an outlier, you can then identify and remove them from a dataset. To illustrate how to do so, we’ll use the following pandas DataFrame:
    import numpy as np
    import pandas as pd
    import scipy.stats as stats

    #create dataframe with three columns 'A', 'B', 'C'
    #(normally distributed values scaled by 10, which matches the output shown below)
    np.random.seed(10)
    data = pd.DataFrame(np.random.randn(100, 3) * 10, columns=['A', 'B', 'C'])

    #view first 10 rows
    data[:10]

                A           B           C
    0   13.315865    7.152790  -15.454003
    1   -0.083838    6.213360   -7.200856
    2    2.655116    1.085485    0.042914
    3   -1.746002    4.330262   12.030374
    4   -9.650657   10.282741    2.286301
    5    4.451376  -11.366022    1.351369
    6   14.845370  -10.798049  -19.777283
    7  -17.433723    2.660702   23.849673
    8   11.236913   16.726222    0.991492
    9   13.979964   -2.712480    6.132042
We can then define and remove outliers using the z-score method or the interquartile range method:
Z-score method:
    #find absolute value of z-score for each observation
    z = np.abs(stats.zscore(data))

    #only keep rows in dataframe where every z-score has an absolute value less than 3
    data_clean = data[(z < 3).all(axis=1)]

    #find how many rows are left in the dataframe
    data_clean.shape

    (99, 3)
Interquartile range method:
    #find Q1, Q3, and interquartile range for each column
    Q1 = data.quantile(q=.25)
    Q3 = data.quantile(q=.75)
    IQR = data.apply(stats.iqr)

    #only keep rows in dataframe where no value lies more than 1.5*IQR below Q1 or above Q3
    data_clean = data[~((data < (Q1-1.5*IQR)) | (data > (Q3+1.5*IQR))).any(axis=1)]

    #find how many rows are left in the dataframe
    data_clean.shape

    (89, 3)
We can see that the z-score method identified and removed one observation as an outlier, while the interquartile range method identified and removed 11 total observations as outliers.
When to Remove Outliers
If one or more outliers are present in your data, you should first make sure that they’re not a result of data entry error. Sometimes an individual simply enters the wrong data value when recording data.
If the outlier turns out to be a result of a data entry error, you may decide to assign a new value to it such as the mean or the median of the dataset.
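For example, a replacement with the median might look like the sketch below. The heights, the position of the bad entry, and the variable names are all hypothetical, and computing the median from only the valid values is one reasonable choice among several, not the only way to do it.

```python
import pandas as pd

# hypothetical heights in cm; 1750 is assumed to be a typo
heights = pd.Series([172.0, 168.0, 181.0, 1750.0, 177.0, 165.0])

# assumption: the entry at index 3 was confirmed as a data entry error
bad_index = 3

# compute the median from the remaining valid values so the error does not influence it
replacement = heights.drop(bad_index).median()

# work on a copy so the original data is preserved
heights_fixed = heights.copy()
heights_fixed.loc[bad_index] = replacement

print(heights_fixed.loc[bad_index])  # 172.0
```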
If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis. Just make sure to mention in your final report or analysis that you removed an outlier.
If you’re working with several variables at once, you may want to use the Mahalanobis distance to detect outliers.
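As a rough sketch of that idea, the code below computes the squared Mahalanobis distance of each row from the center of the data and flags rows beyond a chi-square cutoff. The simulated data, the planted unusual point, and the 0.999 quantile cutoff are assumptions made for this illustration; the chi-square comparison is a common convention that relies on the data being approximately multivariate normal.

```python
import numpy as np
from scipy.stats import chi2

# hypothetical data: 200 positively correlated points plus one planted unusual point
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=200)
X = np.vstack([X, [3.0, -3.0]])  # the planted point has row index 200

# center of the data and inverse of the sample covariance matrix
center = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# squared Mahalanobis distance of each row: (x - center)^T * cov_inv * (x - center)
diff = X - center
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# assumed cutoff: 0.999 quantile of a chi-square distribution with one
# degree of freedom per variable
cutoff = chi2.ppf(0.999, df=X.shape[1])
flagged = np.where(d2 > cutoff)[0]
print(flagged)
```

Neither coordinate of the planted point (3, -3) is extreme on its own, but the combination runs against the positive correlation in the rest of the data, which is exactly the kind of outlier that per-column methods can miss.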