The mean of a dataset represents the average value of the dataset.
It is calculated as:
Mean = Σxi / n
where:
- Σ: A symbol that means “sum”
- xi: The ith observation in a dataset
- n: The total number of observations in the dataset
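As a minimal sketch of the formula above, the calculation can be written out step by step in Python. The observations below are made up purely for illustration and do not come from any dataset in this article.

```python
# Hypothetical observations, chosen only to illustrate the formula
observations = [4, 8, 15, 16, 23, 42]

n = len(observations)        # n: total number of observations
total = sum(observations)    # Σxi: sum of all observations
mean = total / n             # Mean = Σxi / n

print(mean)  # 18.0
```

Every observation contributes to the total, which is why changing any single value changes the mean.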
There are two main advantages of using the mean to describe the “center” or “average” of a dataset:
Advantage #1: The mean uses all of the observations in a dataset in its calculation. In statistics, this is generally a good thing because it means we use all of the available information in the dataset.
Advantage #2: The mean is easy to calculate and interpret. It is simply the sum of all observations divided by the total number of observations, which is easy to compute (even by hand) and easy to explain.
However, there are two potential disadvantages of using the mean to summarize a dataset:
Disadvantage #1: The mean is affected by outliers. If a dataset has an extreme outlier, the outlier pulls the mean toward itself, which can make the mean an unreliable measure of the center of the dataset.
Disadvantage #2: The mean can be misleading with skewed datasets. When a dataset is skewed to the left or right, the mean can be a misleading way to measure the center of a dataset.
The following examples illustrate these advantages and disadvantages in practice.
Example 1: The Advantages of Using the Mean
Suppose we have the following histogram that shows the salaries of residents in a particular city:
Since this distribution is mostly symmetrical (if you split it down the middle, each half would look roughly equal) and there are no outliers, the mean is a useful way to describe the center of this dataset.
The mean turns out to be $63,000, which is located approximately in the center of the distribution:
In this particular example, we were able to benefit from both advantages of the mean:
Advantage #1: The mean uses all of the observations in a dataset in its calculation.
Since the distribution was mostly symmetrical and there were no extreme outliers, we were able to use every available salary to calculate the mean, which gave us a good idea of the “average” or “typical” salary in this particular city.
Advantage #2: The mean is easy to calculate and interpret. It’s easy to understand that the mean salary of $63,000 represents the “average” salary of an individual in this city.
While some individuals earn much more than this and some earn much less, this mean value gives us a good idea of a “typical” salary in this city.
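To make this concrete, here is a small sketch using Python's standard `statistics` module. The salary values are hypothetical, chosen only so that they are roughly symmetric and average $63,000 like the example above; they are not the data behind the histogram.

```python
import statistics

# Hypothetical, roughly symmetric salaries (not the article's actual data)
salaries = [48000, 55000, 60000, 63000, 66000, 71000, 78000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# For symmetric data without outliers, the mean and median agree closely
print(mean_salary, median_salary)
```

With these values the mean and the median are both 63,000, so either one describes the "typical" salary well.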
Example 2: The Disadvantages of Using the Mean
Suppose we have a distribution of salaries that is right skewed and we decide to calculate both the mean and median salary:
The higher values on the tail end of the distribution pull the mean away from the center and towards the long tail.
In this example, the mean suggests that the typical individual earns about $47,000 per year, while the median suggests that the typical individual earns only about $32,000 per year. The median is much more representative of the typical individual.
Because the distribution is skewed, the mean does a poor job of summarizing the “typical” or “average” value in this distribution.
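The same effect can be reproduced with a small, made-up sample. The salaries below are hypothetical and were chosen only so that the mean and median match the $47,000 and $32,000 figures quoted above; they are not the data behind the original plot.

```python
import statistics

# Hypothetical right-skewed salaries: most are modest, one is very high
salaries = [22000, 25000, 28000, 30000, 32000, 35000, 40000, 60000, 151000]

mean_salary = statistics.mean(salaries)      # pulled toward the long right tail
median_salary = statistics.median(salaries)  # middle value, unaffected by the tail

# Count how many individuals earn less than the mean
below_mean = sum(1 for s in salaries if s < mean_salary)

print(mean_salary, median_salary)             # 47000 32000
print(below_mean, "of", len(salaries), "earn less than the mean")
```

In this sketch, 7 of the 9 individuals earn less than the mean, which shows why the mean overstates the "typical" salary when the data is right skewed.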
Or suppose we have another distribution that contains information about the square footage of houses on a certain street and we decide to calculate both the mean and median of the dataset:
The mean is influenced by a couple of extremely large houses, which causes it to take on a much larger value than the median.
This causes the mean square footage value to be misleading and a poor measure of the “typical” square footage of a house on this street.
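A quick way to see the influence of outliers is to compute the mean and median with and without the extreme values. The square footages below are hypothetical and serve only as a sketch of the idea.

```python
import statistics

# Hypothetical square footage values for houses on one street
typical_houses = [1400, 1500, 1600, 1700, 1800]
large_houses = [6000, 8000]              # two extreme outliers
all_houses = typical_houses + large_houses

mean_typical = statistics.mean(typical_houses)
median_typical = statistics.median(typical_houses)
mean_all = statistics.mean(all_houses)
median_all = statistics.median(all_houses)

print("Without outliers:", mean_typical, median_typical)
print("With outliers:   ", round(mean_all, 2), median_all)
```

Adding the two large houses moves the median only from 1,600 to 1,700 square feet, while the mean nearly doubles from 1,600 to about 3,143 square feet.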