10 Essential Statistical Functions in Python

If you’re into data analysis, you know that statistics provides essential tools to summarize and understand data. You can use Python libraries like NumPy and SciPy for statistical analysis. But for basic analysis, you can also use Python’s built-in statistics module which offers a variety of functions.

In this tutorial, we’ll go over ten important statistical functions, from descriptive statistics to linear regression, by coding simple examples. We’ll use Python’s statistics module, introduced in Python 3.4 and improved in subsequent versions. You only need a recent version of Python to follow along; you don’t have to install any external libraries. Python 3.12 or later is preferable: covariance(), correlation(), and linear_regression() were added in Python 3.10, the weights argument of fmean() in 3.11, and ranked correlation in 3.12.

You can also follow along with the Google Colab notebook. Let’s get started!

1. Mean

Mean is the average value of a set of observations and is a common measure of central tendency. Because the mean is the sum of all observations divided by the number of observations, a single extreme value can shift it noticeably, so it is sensitive to outliers.

Mathematically, the mean is given by:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Where $x_i$ are the values in the dataset, and $n$ is the number of values.

First, import the statistics module so that we can use the different functions:

import statistics

To compute the mean, you can use the mean() function from the statistics module:

data = [10, 20, 30, 40, 50]
mean = statistics.mean(data)
print("Mean:", mean)

This outputs:

Mean: 30

If you always want a floating-point result, use the fmean() function instead. It also runs faster than mean() and, from Python 3.11, accepts an optional weights argument for computing a weighted mean.

2. Median

Median of a set of observations is the middle value when the data points are arranged in ascending or descending order. Unlike the mean, the median is generally less affected by outliers.

  • If the number of data points is even, the median is the average of the two middle values.
  • If the number of data points is odd, the median is the middle value.

To compute the median, you can use the median() function like so:

data = [15, 20, 35, 40, 50]
median = statistics.median(data)
print("Median:", median)

This outputs:

Median: 35

Here, data has an odd number of data points, so the median is the middle value, 35.
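To see the even-length case and the robustness to outliers in practice, here is a short sketch. The datasets are made up for illustration. It also uses median_low() and median_high(), two other functions in the statistics module that return an actual data point rather than an average:

```python
import statistics

# Even number of data points: the median is the average of the two middle values
even_data = [15, 20, 35, 40, 50, 60]
print("Median:", statistics.median(even_data))            # (35 + 40) / 2
print("Median low:", statistics.median_low(even_data))    # lower middle value
print("Median high:", statistics.median_high(even_data))  # upper middle value

# A single extreme value shifts the mean far more than the median
with_outlier = [15, 20, 35, 40, 500]
print("Mean with outlier:", statistics.mean(with_outlier))
print("Median with outlier:", statistics.median(with_outlier))
```

Replacing 50 with 500 pulls the mean up to 122, while the median stays at 35.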

3. Mode

Mode is the most frequently occurring value in a set of observations. You can find modes for both numerical and nominal data. Sometimes, the data can have more than one mode.

Here’s an example of finding the mode using the mode() function:

data = [1, 2, 2, 3, 4, 4, 4]
mode = statistics.mode(data)
print("Mode:", mode)

The mode of data is:

Mode: 4

When you use the mode() function on data with more than one mode, it returns only the first mode it encounters:

data = [1, 2, 2, 2, 3, 4, 4, 4, 7, 7, 7]
mode = statistics.mode(data)
print("Modes:", mode)

Which outputs:

Modes: 2

If you want to get all the modes, use the multimode() function like so:

data = [1, 2, 2, 2, 3, 4, 4, 4, 7, 7, 7]
modes = statistics.multimode(data)
print("Modes:", modes)

This returns all the modes:

Modes: [2, 4, 7]
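Because mode() and multimode() only count occurrences, they also work on nominal (non-numeric) data. The following sketch uses hypothetical survey responses to illustrate this:

```python
import statistics

# Hypothetical survey responses (nominal data)
colors = ["red", "blue", "green", "red", "blue", "red"]
print("Mode:", statistics.mode(colors))

# Two values tie for most frequent; multimode() lists them
# in the order they first appear in the data
sizes = ["M", "L", "M", "S", "L"]
print("Modes:", statistics.multimode(sizes))
```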

4. Standard Deviation

Standard deviation is a measure of the dispersion or variability in a dataset. It helps understand the deviation of the observations from the mean.

The sample standard deviation is given by:

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Where $x_i$ are the observations, $n$ is the number of values, and $\bar{x}$ is the sample mean of the dataset.

The stdev() function computes the sample standard deviation:

data = [12, 15, 22, 29, 35]
std_dev = statistics.stdev(data)
print(f"Standard Deviation: {std_dev:.3f}")

This outputs:

Standard Deviation: 9.555

To find the population standard deviation, you can use the pstdev() function.
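As a rough comparison, the sketch below computes both versions on the same data. The sample version divides the sum of squared deviations by n - 1, while the population version divides by n:

```python
import math
import statistics

data = [12, 15, 22, 29, 35]
n = len(data)

sample_sd = statistics.stdev(data)       # divides by n - 1
population_sd = statistics.pstdev(data)  # divides by n

print(f"Sample standard deviation: {sample_sd:.3f}")
print(f"Population standard deviation: {population_sd:.3f}")

# The two differ only by a factor of sqrt(n / (n - 1))
print(math.isclose(sample_sd, population_sd * math.sqrt(n / (n - 1))))
```

The population value is always the smaller of the two, and the gap shrinks as the number of observations grows.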

5. Variance

Variance, the square of the standard deviation, measures the spread of the dataset. The sample variance of the dataset is given by:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Where $x_i$ are the observations, $n$ is the number of values, and $\bar{x}$ is the sample mean of the dataset.

The variance() function computes the sample variance:

data = [8, 10, 12, 14, 16]
variance = statistics.variance(data)
print(f"Variance: {variance:.2f}")

Here’s the output:

Variance: 10.00

As with standard deviation, you can compute the population variance using the pvariance() function.
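To connect the definition to the library functions, here is a small sketch that computes the sum of squared deviations by hand and divides it by n - 1 (sample) and by n (population). The variable names are illustrative only:

```python
import statistics

data = [8, 10, 12, 14, 16]
n = len(data)
x_bar = statistics.mean(data)

# Sum of squared deviations from the mean
squared_deviations = sum((x - x_bar) ** 2 for x in data)

manual_sample_variance = squared_deviations / (n - 1)
manual_population_variance = squared_deviations / n

# Each pair should agree
print(manual_sample_variance, statistics.variance(data))
print(manual_population_variance, statistics.pvariance(data))
```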

6. Covariance

Covariance captures the joint variability of two datasets. Its sign indicates whether the two variables tend to move in the same direction (positive) or in opposite directions (negative).

You can calculate the covariance using the equation:

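$$\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Where:

  • $x_i$ and $y_i$ are the data points of variables X and Y
  • $\bar{x}$ and $\bar{y}$ are the means of variables X and Y
  • $n$ is the number of observations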

Here’s an example of calculating the covariance between data1 and data2:

data1 = [2, 4, 6, 8, 10]
data2 = [1, 3, 5, 7, 9]
covariance = statistics.covariance(data1, data2)
print("Covariance:", covariance)

You should get a similar output:

Covariance: 10.0

7. Quantiles

Quantiles are cut points that divide sorted data into contiguous intervals, each containing an equal share of the observations. They help you understand the distribution and spread of a dataset.

Quantiles are used in box plots and to identify outliers. Common types of quantiles include quartiles, deciles, and percentiles.

Quartiles divide the data into four equal parts. The three quartiles are:

  • Q1 – first quartile or the 25th percentile, below which 25% of the data falls.
  • Q2 – second quartile or the 50th percentile, below which 50% of the data falls.
  • Q3 – third quartile or the 75th percentile, below which 75% of the data falls.

Deciles divide the data into ten equal parts. The nine deciles are the 10th, 20th, …, 90th percentiles.

Percentiles divide the data into 100 equal parts. The 99 percentiles are the 1st, 2nd, …, 99th percentiles.

Here’s an example:

data = [1, 5, 7, 9, 10, 12, 16, 18, 19, 21]
# Quartiles
quantiles = statistics.quantiles(data, n=4)  
print("Quantiles (Quartiles):", quantiles)

And the output:

Quantiles (Quartiles): [6.5, 11.0, 18.25]
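One common use of quartiles is flagging outliers with the interquartile range (IQR). The sketch below applies the widely used 1.5 × IQR rule of thumb to the earlier data with one extreme value appended. The threshold and the extra value are assumptions for illustration, not something the statistics module prescribes:

```python
import statistics

# Hypothetical data: the earlier example with one extreme value appended
data = [1, 5, 7, 9, 10, 12, 16, 18, 19, 21, 60]

q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print("Outliers:", outliers)

# n=10 returns the nine cut points that separate the deciles
deciles = statistics.quantiles(data, n=10)
print("Number of decile cut points:", len(deciles))
```

Note that quantiles() also accepts method="inclusive", which treats the data as the full population rather than a sample and can give slightly different cut points.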

8. Correlation

You can use correlation to measure the strength and direction of the relationship between any two variables or sets of observations.

The correlation() function in the statistics module can calculate:

  • Pearson’s correlation coefficient, a number between -1 and 1 that measures both the strength and direction of the linear relationship between the variables. This is what you get with the default value of the method argument, ‘linear’.
  • Spearman’s rank correlation coefficient, which measures the strength of monotonic relationships, if you set the method argument to ‘ranked’ (available from Python 3.12).

Let’s take an example:

data1 = [1, 2, 3, 4, 5]
data2 = [2, 4, 6, 8, 10]
correlation = statistics.correlation(data1, data2)
print("Correlation:", correlation)

In this case, data2 is data1 scaled by a factor of 2. So the correlation coefficient evaluates to 1:

Correlation: 1.0

9. Linear Regression

Linear regression fits a straight line to model the linear relationship between a dependent variable and an independent variable. It finds the best fit line—the slope and the intercept—through ordinary least squares.

The simple linear regression equation is:

$$y = mx + b$$

Here, m is the slope and b is the y-intercept.

To find the best fit straight line for the given independent variable x and the dependent variable y, you can use the linear_regression() function like so:

x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 5, 7]
slope, intercept = statistics.linear_regression(x, y)
print("Slope:", slope)
print("Intercept:", intercept)

For this example, the slope and the intercept are as follows:

Slope: 0.9
Intercept: 1.5

10. Normal Distribution

Normal distribution is one of the most common probability distributions you’ll run into when working with real-world data. The density function is given by:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

Where μ is the mean and σ is the standard deviation.

The statistics module provides a NormalDist class to create normal distributions and work with them. The following snippet shows how to instantiate a normal distribution, evaluate the CDF, and use the inverse CDF to find the value at a given percentile:

# Create a normal distribution with mean 30 and standard deviation 10
normal_dist = statistics.NormalDist(mu=30, sigma=10)

# Calculate the probability of a value less than or equal to 20
probability = normal_dist.cdf(20)
print(f"Probability (CDF) of 20: {probability:.3f}")

# Find the value below which 97.5% of the distribution lies
value = normal_dist.inv_cdf(0.975)
print(f"Value at 97.5th percentile: {value:.3f}")

Here’s the output:

Probability (CDF) of 20: 0.159
Value at 97.5th percentile: 49.600

The value 49.600 lies about 1.96 standard deviations above the mean, so its z-score is roughly 1.96.
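NormalDist also has a zscore() method (Python 3.9 and later) that expresses a value as a number of standard deviations from the mean, and a from_samples() class method that estimates a distribution from data. The sketch below shows both; the samples list is hypothetical:

```python
import statistics

normal_dist = statistics.NormalDist(mu=30, sigma=10)

# zscore() computes (x - mu) / sigma
z = normal_dist.zscore(45)
print("Z-score of 45:", z)

# Probability of a value falling within one standard deviation of the mean
within_one_sigma = normal_dist.cdf(40) - normal_dist.cdf(20)
print(f"P(20 <= X <= 40): {within_one_sigma:.3f}")

# Estimate mu and sigma from hypothetical sample data
samples = [12, 15, 22, 29, 35]
fitted = statistics.NormalDist.from_samples(samples)
print(f"Fitted mean: {fitted.mean:.1f}, fitted stdev: {fitted.stdev:.3f}")
```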

Summary

That’s all for this tutorial. Here’s a quick review:

| Statistical Function | In Python | Significance |
| --- | --- | --- |
| Mean | statistics.mean | Calculates the average of the given data |
| Median | statistics.median | Finds the middle value in the sorted data |
| Mode | statistics.mode | Returns the most frequently occurring value in the data |
| Standard Deviation | statistics.stdev | Calculates the sample standard deviation of the data |
| Variance | statistics.variance | Calculates the sample variance, which measures the spread of the data points around the mean |
| Covariance | statistics.covariance | Calculates the sample covariance between two datasets |
| Quantiles | statistics.quantiles | Divides the data into equal-sized intervals |
| Correlation | statistics.correlation | Measures the strength and direction of the linear or monotonic relationship between two variables |
| Linear Regression | statistics.linear_regression | Performs simple linear regression on two sets of data: observations of the independent variable and the dependent variable |
| Normal Distribution | statistics.NormalDist class | Provides methods for working with normal distributions, including calculating probabilities and z-scores |

If you’re looking to learn statistics and perform statistical analysis with programming languages like Python and R, check out 7 Best YouTube Channels to Learn Statistics for Free.
