Aggregation bias occurs when it is wrongly assumed that the trends seen in aggregated data also apply to individual data points.
The easiest way to understand this type of bias is with a simple example.
Example: Aggregation Bias
Suppose researchers want to understand the relationship between the average number of years of education and average household income in a certain state. They obtain aggregated data for 4 different cities within the state and calculate the correlation between average education and average household income.
It turns out that the correlation between average number of years of education and average household income is 0.9632. This indicates a very strong positive correlation.
The researchers even create a scatterplot to visualize the relationship between average number of years of education and average household income:
Without actually looking at the individual data, they may publish a report claiming that years of education and household income are strongly positively correlated.
However, suppose a new researcher comes along a year later and obtains data for individual households across the same set of cities. Suppose she creates the following scatterplot of the data:
She calculates the correlation between the two variables and finds that it’s actually only 0.1788 – still a positive correlation but not nearly as strong as the correlation found by the previous researchers.
It turns out that aggregating the data masked the true trend between education and income at the individual level.
In fact, when we look at the scatterplot on a city-by-city basis, the relationship between education and income within each city is actually negative!
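The pattern described above can be reproduced with a small sketch. The numbers below are invented for illustration and are not the data behind the scatterplots in this example: the city labels, city centers, the within-city slope of -6 and the `pearson` helper are all assumptions. The sketch compares three correlations: one computed on city averages, one computed on all households pooled together, and one computed separately within each city.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical city centers: (average years of education, average income in $1000s).
# Cities with more education have higher average income.
city_centers = {"A": (10, 40.0), "B": (12, 55.0), "C": (14, 70.0), "D": (16, 85.0)}

# Hypothetical households: within each city, income FALLS by 6 (thousand)
# for every extra year of education relative to the city center.
households = {
    city: [(edu_c + d, inc_c - 6.0 * d) for d in range(-3, 4)]
    for city, (edu_c, inc_c) in city_centers.items()
}

# 1) Aggregated view: one (mean education, mean income) point per city.
avg_edu = [sum(e for e, _ in hh) / len(hh) for hh in households.values()]
avg_inc = [sum(i for _, i in hh) / len(hh) for hh in households.values()]
aggregate_r = pearson(avg_edu, avg_inc)

# 2) Individual view: all households pooled together.
all_edu = [e for hh in households.values() for e, _ in hh]
all_inc = [i for hh in households.values() for _, i in hh]
pooled_r = pearson(all_edu, all_inc)

# 3) City-by-city view: one correlation per city.
within_r = {
    city: pearson([e for e, _ in hh], [i for _, i in hh])
    for city, hh in households.items()
}

print(f"aggregate (city averages): {aggregate_r:.3f}")
print(f"pooled (all households):   {pooled_r:.3f}")
for city, r in within_r.items():
    print(f"within city {city}:           {r:.3f}")
```

With these made-up numbers, the city averages are strongly positively correlated, the pooled household data shows only a weak positive correlation, and every within-city correlation is negative. The within-city relationship is perfectly linear here only to keep the sketch deterministic; real household data would be noisier.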
Effects of Aggregation Bias
Aggregation bias is common in research because it is easy to assume that trends that appear at an aggregate level must also appear at an individual level. As the previous example showed, this is not always the case.
Aggregation bias can lead researchers to draw the wrong conclusions from a study and can make its results misleading. This type of bias is particularly harmful when it relates to correlations between variables.
Even if the correlation between two variables is positive in the aggregated data, the underlying correlation between the two variables at the individual observation level can actually be:
- Negative
- Zero (no correlation)
- Positive
The most direct way to avoid this type of bias is to conduct studies using individual data points rather than aggregated data points, so that the true relationship between the two variables at the individual level can be discovered.