A histogram is a chart that helps us visualize the distribution of values in a dataset.
It turns out that the number of bins used in a histogram can have a huge impact on how we interpret the data.
If we use too few bins, the true underlying pattern in the data can be hidden.
And if we use too many bins, we may just be visualizing the noise in a dataset.
Fortunately, we can use a method known as Sturges’ Rule to determine the optimal number of bins to use in a histogram.
Sturges’ Rule uses the following formula:
Optimal Bins = ⌈log₂(n) + 1⌉
- n: The total number of observations in the dataset.
- ⌈ ⌉: Symbols that mean “ceiling” – i.e. round the answer up to the nearest integer.
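The formula translates directly into a few lines of Python. The following is a minimal sketch using only the standard library; the function name `sturges_bins` is a hypothetical helper chosen for illustration, not part of any library.

```python
import math


def sturges_bins(n: int) -> int:
    """Return the number of bins suggested by Sturges' Rule for n observations."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Base-2 logarithm of n, plus 1, rounded up to the nearest integer.
    return math.ceil(math.log2(n) + 1)


print(sturges_bins(31))  # 6
```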
Example: Sturges’ Rule
Suppose we have the following dataset with n = 31 total observations:
We can use Sturges’ Rule to determine the optimal number of bins to use to visualize these values in a histogram:
Optimal Bins = ⌈log₂(31) + 1⌉ = ⌈4.954 + 1⌉ = ⌈5.954⌉ = 6.
According to Sturges’ Rule, we should use 6 bins in the histogram we use to visualize this distribution of values.
Here’s what a histogram with 6 bins would look like for this dataset:
Notice that 6 bins seems to be enough to give a good idea of the underlying distribution of values, without being so many that we’re just visualizing the noise in the data.
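To see how the bin count is actually used, the sketch below sorts 31 values into 6 equal-width bins and prints a rough text histogram. The values are made up for illustration (generated with a fixed random seed), not the dataset from this example, and `equal_width_counts` is a hypothetical helper.

```python
import math
import random


def equal_width_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; bin width would be zero")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts


# Made-up sample of 31 observations (an assumption, not the article's data).
rng = random.Random(0)
sample = [rng.gauss(50, 10) for _ in range(31)]

num_bins = math.ceil(math.log2(len(sample)) + 1)  # Sturges' Rule
counts = equal_width_counts(sample, num_bins)

for i, count in enumerate(counts, start=1):
    print(f"bin {i}: {'#' * count}")
```

Plotting libraries typically accept the bin count as a parameter, so the value computed here could be passed along instead of printing text bars.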
Common Values for Sturges’ Rule
The following table shows the optimal number of bins to use in a histogram based on the total number of observations in a dataset, according to Sturges’ Rule:
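A table like this can be regenerated from the formula. The sketch below prints the suggested bin count for a handful of sample sizes; the particular sizes listed are an arbitrary choice made for illustration.

```python
import math

# Sample sizes chosen for illustration only.
sample_sizes = [10, 30, 50, 100, 200, 500, 1000, 10000]

# Apply Sturges' Rule, ceil(log2(n) + 1), to each sample size.
table = {n: math.ceil(math.log2(n) + 1) for n in sample_sizes}

print(f"{'n':>6}  bins")
for n, bins in table.items():
    print(f"{n:>6}  {bins}")
```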
Alternatives to Sturges’ Rule
Sturges’ Rule is one of the most common methods for determining the number of bins to use in a histogram, but there are several alternatives, including:
The Square-root Rule: Number of bins = ⌈√n⌉
The Rice Rule: Number of bins = ⌈2 * ∛n⌉
The Freedman-Diaconis Rule: Bin width = (2 * IQR) / ∛n, where IQR is the interquartile range. The number of bins is then ⌈(max − min) / bin width⌉.
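These alternatives can be sketched in Python as well. The code below assumes the cube root of n in the Rice and Freedman-Diaconis rules, and it treats the Freedman-Diaconis expression as a bin width that is converted to a bin count by dividing the data range by it. The function names are hypothetical, and the quartiles come from `statistics.quantiles` with its default method, which is only one of several common quartile conventions.

```python
import math
import statistics


def square_root_bins(n: int) -> int:
    """Square-root Rule: ceil(sqrt(n))."""
    return math.ceil(math.sqrt(n))


def rice_bins(n: int) -> int:
    """Rice Rule: ceil(2 * cube root of n)."""
    return math.ceil(2 * n ** (1 / 3))


def freedman_diaconis_bins(values) -> int:
    """Freedman-Diaconis Rule: bin width = 2 * IQR / cube root of n."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the rule gives no usable bin width")
    width = 2 * iqr / n ** (1 / 3)
    # Convert the bin width into a bin count over the range of the data.
    return math.ceil((max(values) - min(values)) / width)


print(square_root_bins(31))  # 6
print(rice_bins(31))         # 7
```

Unlike the other rules, Freedman-Diaconis depends on the spread of the data and not only on n, so it needs the values themselves as input.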
Bonus: Sturges’ Rule Calculator
Use this free online calculator to automatically apply Sturges’ Rule to determine the optimal number of bins to use for a histogram based on the size of a dataset.