How to Remove Outliers in Python


An outlier is an observation that lies abnormally far away from other values in a dataset. Outliers can be problematic because they can distort summary statistics such as the mean and standard deviation, and so affect the results of an analysis.

This tutorial explains how to identify and remove outliers in Python.

How to Identify Outliers in Python

Before you can remove outliers, you must first decide on what you consider to be an outlier. There are two common ways to do so:

1. Use the interquartile range.

The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a dataset. It measures the spread of the middle 50% of values.

You could define an observation to be an outlier if it is more than 1.5 times the interquartile range above the third quartile (Q3) or more than 1.5 times the interquartile range below the first quartile (Q1).

Outliers = Observations > Q3 + 1.5*IQR  or  < Q1 – 1.5*IQR
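
To make the rule concrete, here is a minimal sketch that applies it to a small array. The values are hypothetical and chosen only for illustration, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
values = np.array([2, 4, 5, 6, 7, 8, 9, 10, 12, 40])

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything below `lower` or above `upper` is flagged
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Here Q1 is 5.25 and Q3 is 9.75, so the fences are -1.5 and 16.5 and only the value 40 is flagged.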

2. Use z-scores.

A z-score tells you how many standard deviations a given value is from the mean. We use the following formula to calculate a z-score:

z = (X – μ) / σ

where:

  • X is a single raw data value
  • μ is the population mean
  • σ is the population standard deviation

You could define an observation to be an outlier if it has a z-score less than -3 or greater than 3.

Outliers = Observations with z-scores > 3 or < -3
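
As a quick illustration of the formula, the sketch below computes z-scores by hand for a small array of hypothetical values and flags those beyond the cutoff of 3. It assumes the population standard deviation, which is NumPy's default.

```python
import numpy as np

# Hypothetical sample: 19 typical values and one extreme value (60)
values = np.array([10, 12, 11, 13, 9, 10, 11, 12, 10, 9,
                   11, 13, 12, 10, 11, 9, 10, 12, 11, 60])

mu = values.mean()      # mean of the data
sigma = values.std()    # population standard deviation (ddof=0)

# z = (X - mu) / sigma for every observation
z = (values - mu) / sigma

# Flag observations whose z-score is above 3 or below -3
outliers = values[np.abs(z) > 3]
print(outliers)
```

The sample has 20 values on purpose: with the population standard deviation, no z-score in a dataset of n values can exceed the square root of n - 1, so a dataset with 10 or fewer values can never produce a z-score above 3.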

How to Remove Outliers in Python

Once you decide on what you consider to be an outlier, you can identify and remove outliers from a dataset. To illustrate how to do so, we’ll use the following pandas DataFrame:

import numpy as np
import pandas as pd 
import scipy.stats as stats

#create dataframe with three columns 'A', 'B', 'C' of normally distributed values (mean 0, standard deviation 10)
np.random.seed(10)
data = pd.DataFrame(np.random.normal(0, 10, size=(100, 3)), columns=['A', 'B', 'C'])

#view first 10 rows 
data[:10]

           A          B          C
0  13.315865   7.152790 -15.454003
1  -0.083838   6.213360  -7.200856
2   2.655116   1.085485   0.042914
3  -1.746002   4.330262  12.030374
4  -9.650657  10.282741   2.286301
5   4.451376 -11.366022   1.351369
6  14.845370 -10.798049 -19.777283
7 -17.433723   2.660702  23.849673
8  11.236913  16.726222   0.991492
9  13.979964  -2.712480   6.132042

We can then define and remove outliers using the z-score method or the interquartile range method:

Z-score method:

#find absolute value of z-score for each observation
z = np.abs(stats.zscore(data))

#only keep rows in dataframe where every absolute z-score is less than 3
data_clean = data[(z<3).all(axis=1)]

#find how many rows are left in the dataframe 
data_clean.shape

(99, 3)

Interquartile range method:

#find Q1, Q3, and interquartile range for each column
Q1 = data.quantile(q=.25)
Q3 = data.quantile(q=.75)
IQR = data.apply(stats.iqr)

#only keep rows in dataframe that have values within 1.5*IQR of Q1 and Q3
data_clean = data[~((data < (Q1-1.5*IQR)) | (data > (Q3+1.5*IQR))).any(axis=1)]

#find how many rows are left in the dataframe 
data_clean.shape

(89, 3)

We can see that the z-score method identified and removed one observation as an outlier, while the interquartile range method identified and removed 11 total observations as outliers.

When to Remove Outliers

If one or more outliers are present in your data, you should first make sure that they’re not a result of data entry error. Sometimes an individual simply enters the wrong data value when recording data.

If the outlier turns out to be a result of a data entry error, you may decide to assign a new value to it such as the mean or the median of the dataset.
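
As a minimal sketch of that replacement step, the example below swaps a suspected data entry error for the median of the remaining values. The series, the name `ages`, and the plausibility rule (ages above 120 are treated as errors) are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical ages; 350 is assumed to be a data entry error
ages = pd.Series([34, 29, 41, 350, 38, 27])

# Assumed plausibility rule: flag any age above 120 as an error
is_error = ages > 120

# Median of the values that were not flagged
median_valid = ages[~is_error].median()

# Replace flagged values with that median, leaving the rest unchanged
ages_fixed = ages.mask(is_error, median_valid)
print(ages_fixed.tolist())
```

Computing the median from the unflagged values only keeps the erroneous entry from influencing its own replacement.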

If the value is a true outlier, you may choose to remove it if it will have a significant impact on your overall analysis. Just make sure to mention in your final report or analysis that you removed an outlier.

Additional Resources

If you’re working with several variables at once, you may want to use the Mahalanobis distance to detect outliers.
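
For a rough idea of what that looks like, here is a minimal sketch that computes squared Mahalanobis distances with NumPy. The simulated data, the planted extreme row, and the chi-square cutoff at the 99.9th percentile are all assumptions made for illustration, not a recommended default.

```python
import numpy as np
from scipy.stats import chi2

# Simulated two-variable data with one planted extreme row (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, -8.0]

# Center the data and invert the sample covariance matrix
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Squared Mahalanobis distance of each row from the mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

# Assumed cutoff: 99.9th percentile of a chi-square distribution
# with one degree of freedom per variable
cutoff = chi2.ppf(0.999, df=X.shape[1])
outlier_rows = np.where(d2 > cutoff)[0]
print(outlier_rows)
```

Unlike the column-by-column methods above, this approach takes the correlation between variables into account, so it can flag a row whose values are unusual in combination.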
