How to Calculate Mean, Median, and Mode with NumPy

In this article, you will learn how to calculate the mean, median, and mode using the NumPy library in Python. These three measures are essential for basic data analysis and statistics.

Introduction

Let’s see how to use NumPy to calculate the mean, median, and mode of a data series.

Set Up Your Environment

First things first: check that you have NumPy installed. If you need it, you can get NumPy through pip:

pip install numpy

You can then import NumPy into your Python script with:

import numpy as np

Calculating the Mean

The mean of a dataset, commonly known as the ‘average,’ is found by summing all of the numbers in the data series, then dividing by the quantity of numbers in the series. This serves as an indicator of the data’s central point.

Consider the following data series, for example:

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 97, 98, 99, 100]

You can find the mean with NumPy’s np.mean() like so:

mean_value = np.mean(data)
print(f"Mean: {mean_value:.2f}")

This will output:

Mean: 21.17

Note that we are rounding our results to 2 decimal places.

When you need a single datum that represents the data series’ center, especially if the dataset has a symmetrical distribution without any outliers, the mean is a great choice.
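To make the definition concrete, here is a minimal sketch that computes the mean by hand (sum divided by count) and compares it with np.mean(). The variable names manual_mean and numpy_mean are just illustrative choices.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 97, 98, 99, 100]

# Mean by definition: sum of the values divided by how many there are
manual_mean = sum(data) / len(data)

# NumPy's built-in equivalent
numpy_mean = np.mean(data)

print(f"Manual mean: {manual_mean:.2f}")
print(f"NumPy mean:  {numpy_mean:.2f}")
```

Both lines should print the same value, since np.mean() performs the same sum-and-divide calculation on a flat list of numbers.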

Calculating the Median

The median is the middle number of the data series once it is ordered from least to greatest. When the series has an even length, the median is the average of the two center numbers. The median suits asymmetric data better than the mean because it is not thrown off by outliers.

Let’s look at the previous dataset again:

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 97, 98, 99, 100]

Find the median with NumPy’s np.median() like this:

median_value = np.median(data)
print(f"Median: {median_value:.2f}")

The result will be:

Median: 6.50

When the data is skewed or has significant outliers, the median is far better suited to represent the series’ center than the mean.
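As a rough illustration of that point, the sketch below compares the mean and median of the same series with and without its four largest values. Treating everything above 50 as an outlier is an arbitrary cutoff chosen only for this example, and without_outliers is a hypothetical name.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 97, 98, 99, 100]

# Assumption for illustration: treat the four values near 100 as outliers
without_outliers = [x for x in data if x < 50]

# Print both statistics for each version of the series
for label, series in [("With outliers", data), ("Without outliers", without_outliers)]:
    print(f"{label}: mean={np.mean(series):.2f}, median={np.median(series):.2f}")
```

Dropping the four large values moves the mean from 21.17 to 5.70, while the median only shifts from 6.50 to 5.50.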

Calculating the Mode

The mode of a data series is the value that occurs most frequently. When several values tie for the highest count, the series is called multimodal. Particularly applicable to categorical data, the mode identifies the most popular category.

There is no direct NumPy mode function, but other Python modules can calculate the mode directly, including Python’s statistics module and SciPy’s stats module. However, since this is a NumPy tutorial, let’s go ahead and work out a purely NumPy solution ourselves.

Here is a function we can implement to find the mode of a data series. It relies on NumPy’s unique function, which can return each distinct value’s count and the index of its first occurrence, and on the argmax function to determine which value has the highest count:

def numpy_mode(data):
    # Sorted unique values, index of each value's first occurrence,
    # and the number of times each value appears
    (sorted_data, idx, counts) = np.unique(data, return_index=True, return_counts=True)

    # Index in the original data of the value with the highest count (i.e. the mode);
    # if several values tie, argmax picks the smallest of them
    index = idx[np.argmax(counts)]

    # Return the element with the highest count
    return data[index]

The mode calculation can then be performed:

data = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 97, 98, 99, 100]
mode_value = numpy_mode(data)
print(f"Mode: {mode_value}")

Resulting in the following:

Mode: 5

To identify the most prevalent value in a dataset, particularly with categorical or discrete data, the mode can’t be beat.
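The function above returns a single value, so it does not reveal ties in a multimodal series. Below is one possible variation that returns every value sharing the highest count, with a cross-check against multimode from Python’s standard statistics module (available in Python 3.8 and later). The numpy_modes name and the bimodal sample data are made up for this example.

```python
import numpy as np
from statistics import multimode  # standard library, Python 3.8+

def numpy_modes(data):
    """Return every value tied for the highest count, in sorted order."""
    values, counts = np.unique(data, return_counts=True)
    # Boolean mask keeps only the values whose count equals the maximum count
    return values[counts == counts.max()]

# Hypothetical sample in which 2 and 3 both appear twice
bimodal = [1, 2, 2, 3, 3, 4]

print(numpy_modes(bimodal))
print(multimode(bimodal))
```

Both calls should report 2 and 3 for this sample. For a series with a single mode, numpy_modes() returns a one-element array.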

If you can master these basic statistics with NumPy, more advanced techniques won’t seem so far out of reach.
