In statistics, binning is the process of placing numerical values into bins.
The most common form of binning is known as equal-width binning, in which we divide a dataset into k bins of equal width.
A less commonly used form of binning is known as equal-frequency binning, in which we divide a dataset into k bins that each contain an equal number of observations.
This tutorial explains how to perform equal-frequency binning in Python.
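To make the difference between the two definitions concrete, here is a minimal sketch on a small, made-up array (the values and the choice of k = 2 are illustrative assumptions, not part of the tutorial's dataset). It builds equal-width edges with np.linspace and equal-frequency edges with np.quantile, then counts the observations in each bin:

```python
import numpy as np

# Hypothetical toy data with one large outlier, used only for illustration
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
k = 2  # number of bins

# Equal-width: k intervals of identical length between the min and the max
width_edges = np.linspace(values.min(), values.max(), k + 1)

# Equal-frequency: edges placed at evenly spaced quantiles of the data
freq_edges = np.quantile(values, np.linspace(0, 1, k + 1))

width_counts, _ = np.histogram(values, bins=width_edges)
freq_counts, _ = np.histogram(values, bins=freq_edges)

print(width_edges, width_counts)  # counts are 9 and 1
print(freq_edges, freq_counts)    # counts are 5 and 5
```

With equal-width bins the outlier stretches the range, so nearly all values land in the first bin. With equal-frequency bins the edges move to where the data actually sits.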
Equal Frequency Binning in Python
Suppose we have a dataset that contains 100 values:
import numpy as np
import matplotlib.pyplot as plt

#create data
np.random.seed(1)
data = np.random.randn(100)

#view first 5 values
data[:5]

array([ 1.62434536, -0.61175641, -0.52817175, -1.07296862,  0.86540763])
If we create a histogram to display these values, Matplotlib's plt.hist() function will use 10 equal-width bins by default:
#create histogram with equal-width bins
n, bins, patches = plt.hist(data, edgecolor='black')
plt.show()

#display bin boundaries and frequency per bin
bins, n

(array([-2.3015387 , -1.85282729, -1.40411588, -0.95540447, -0.50669306,
        -0.05798165,  0.39072977,  0.83944118,  1.28815259,  1.736864  ,
         2.18557541]),
 array([ 3.,  1.,  6., 17., 19., 20., 14., 12.,  5.,  3.]))
Each bin has an equal width of approximately 0.4487, but the bins do not contain an equal number of observations. For example:
- The first bin extends from -2.3015387 to -1.85282729 and contains 3 observations.
- The second bin extends from -1.85282729 to -1.40411588 and contains 1 observation.
- The third bin extends from -1.40411588 to -0.95540447 and contains 6 observations.
And so on.
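The same boundaries and counts can be inspected without drawing a plot. The sketch below assumes the data generated above and uses np.histogram, which also defaults to 10 equal-width bins, together with np.diff to compute the width of each bin:

```python
import numpy as np

# Recreate the tutorial's data
np.random.seed(1)
data = np.random.randn(100)

# np.histogram returns the counts first, then the bin edges
counts, edges = np.histogram(data, bins=10)

# Width of each bin = difference between consecutive edges
widths = np.diff(edges)

print(widths)  # every entry is the same width
print(counts)  # the counts differ from bin to bin
```

Every entry of widths equals (max - min) / 10, while the counts vary because the values cluster near the center of the distribution.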
To create bins that contain an equal number of observations, we can use the following function:
#define function to calculate equal-frequency bins
def equalObs(x, nbin):
    nlen = len(x)
    return np.interp(np.linspace(0, nlen, nbin + 1),
                     np.arange(nlen),
                     np.sort(x))

#create histogram with equal-frequency bins
n, bins, patches = plt.hist(data, equalObs(data, 10), edgecolor='black')
plt.show()

#display bin boundaries and frequency per bin
bins, n

(array([-2.3015387 , -0.93576943, -0.67124613, -0.37528495, -0.20889423,
         0.07734007,  0.2344157 ,  0.51292982,  0.86540763,  1.19891788,
         2.18557541]),
 array([10., 10., 10., 10., 10., 10., 10., 10., 10., 10.]))
The bins no longer have equal widths, but each bin contains the same number of observations. For example:
- The first bin extends from -2.3015387 to -0.93576943 and contains 10 observations.
- The second bin extends from -0.93576943 to -0.67124613 and contains 10 observations.
- The third bin extends from -0.67124613 to -0.37528495 and contains 10 observations.
And so on.
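It may help to see where these boundaries come from. The sketch below repeats the equalObs function on the same data and compares its output with the sorted observations. With 100 values and 10 bins, np.linspace produces the positions 0, 10, ..., 100, and np.interp looks each one up in the sorted data. The variable names sorted_data and picked are introduced here only for illustration:

```python
import numpy as np

# Recreate the tutorial's data
np.random.seed(1)
data = np.random.randn(100)

def equalObs(x, nbin):
    nlen = len(x)
    return np.interp(np.linspace(0, nlen, nbin + 1),
                     np.arange(nlen),
                     np.sort(x))

edges = equalObs(data, 10)
sorted_data = np.sort(data)

# Positions 0, 10, ..., 90 fall exactly on sorted observations. Position 100
# lies past the last index (99), so np.interp clamps it to the maximum value.
picked = sorted_data[[0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99]]

print(np.allclose(edges, picked))  # compares the edges with the picked values
print(np.diff(edges))              # bin widths, which are no longer equal
```

Because each boundary sits on every tenth sorted observation, exactly 10 values fall between consecutive boundaries, provided the data contains no repeated values at a boundary.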
We can see from the histogram that the bins clearly differ in width, yet every bar has the same height, which confirms that each bin contains the same number of observations.
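As an alternative sketch, not the function used above, the bin edges can also be taken from evenly spaced quantiles with np.quantile. The helper name quantile_edges is hypothetical. Its interior edges differ slightly from those of equalObs because np.quantile interpolates between neighboring observations, but on this dataset the counts should come out the same:

```python
import numpy as np

# Recreate the tutorial's data
np.random.seed(1)
data = np.random.randn(100)

def quantile_edges(x, nbin):
    """Hypothetical helper: bin edges at nbin + 1 evenly spaced quantiles."""
    return np.quantile(x, np.linspace(0, 1, nbin + 1))

edges = quantile_edges(data, 10)

# Count the observations that fall between consecutive edges
counts, _ = np.histogram(data, bins=edges)

print(edges)
print(counts)  # 10 observations in every bin
```

Either approach can produce unequal counts when the data contains many repeated values, since identical observations cannot be split across two bins.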