If you’re into data analysis, you know that statistics provides essential tools to summarize and understand data. You can use Python libraries like NumPy and SciPy for statistical analysis. But for basic analysis, you can also use Python’s built-in statistics module which offers a variety of functions.

In this tutorial, we’ll go over ten super important statistical functions—from descriptive statistics to linear regression—by coding simple examples. We’ll use Python’s statistics module, introduced in Python 3.4 and improved in subsequent versions. So you only need a recent version of Python, preferably Python 3.11 or later, to follow along; you don’t have to install any external libraries.

You can also follow along with the Google Colab notebook. Let’s get started!

## 1. Mean

The mean is the average value of a set of observations and a common measure of central tendency. Because it is the sum of all observations divided by their count, every value contributes to it, which makes the mean sensitive to outliers.

Mathematically, the mean is given by:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Where $x_i$ are the values in the dataset, and $n$ is the number of values.

First, import the statistics module so that we can use the different functions:

```python
import statistics
```

To compute the mean, you can use the **mean()** function from the statistics module:

```python
data = [10, 20, 30, 40, 50]
mean = statistics.mean(data)
print("Mean:", mean)
```

This outputs:

```
Mean: 30
```

If you always want a floating-point result, or if you need to compute a weighted mean, you can use the **fmean()** function instead.

## 2. Median

The median of a set of observations is the middle value when the data points are arranged in ascending or descending order. Unlike the mean, the median is generally less affected by outliers.

- If the number of data points is even, the median is the average of the two middle values.
- If the number of data points is odd, the median is the middle value.

To compute the median, you can use the **median()** function like so:

```python
data = [15, 20, 35, 40, 50]
median = statistics.median(data)
print("Median:", median)
```

This outputs:

```
Median: 35
```

Here, **data** has an odd number of data points, so 35, the middle value, is the median.
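To see the even-count case from the list above, here is a sketch with made-up data; **median_low()** and **median_high()** are companions that always return an actual data point instead of an average:

```python
import statistics

data = [15, 20, 35, 40]  # even number of data points

# median() averages the two middle values
mid = statistics.median(data)
print("Median:", mid)                # (20 + 35) / 2 = 27.5

# median_low() / median_high() pick a real data point instead
print(statistics.median_low(data))   # 20
print(statistics.median_high(data))  # 35
```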

## 3. Mode

The mode is the most frequently occurring value in a set of observations. You can find modes for both numerical and nominal data, and a dataset can have more than one mode.

Here’s an example of finding the mode using the **mode()** function:

```python
data = [1, 2, 2, 3, 4, 4, 4]
mode = statistics.mode(data)
print("Mode:", mode)
```

The mode of **data** is:

```
Mode: 4
```

When you use the **mode()** function on data with more than one mode, it returns only the mode it encounters first:

```python
data = [1, 2, 2, 2, 3, 4, 4, 4, 7, 7, 7]
mode = statistics.mode(data)
print("Modes:", mode)
```

Which outputs:

```
Modes: 2
```

If you want to get all the modes, use the **multimode()** function like so:

```python
data = [1, 2, 2, 2, 3, 4, 4, 4, 7, 7, 7]
modes = statistics.multimode(data)
print("Modes:", modes)
```

This returns all the modes:

```
Modes: [2, 4, 7]
```
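As noted above, the mode also works for nominal (categorical) data. Here is a quick sketch with made-up color labels:

```python
import statistics

# mode() accepts nominal data such as string labels
colors = ["red", "blue", "blue", "green", "blue"]
favorite = statistics.mode(colors)
print("Mode:", favorite)  # 'blue' occurs three times
```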

## 4. Standard Deviation

Standard deviation is a measure of the dispersion or variability in a dataset; it tells you how far the observations typically deviate from the mean.

The sample standard deviation is given by:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}}$$

Where $x_i$ are the observations, $n$ is the number of values, and $\mu$ is the sample mean of the dataset.

The **stdev()** function computes the sample standard deviation:

```python
data = [12, 15, 22, 29, 35]
std_dev = statistics.stdev(data)
print(f"Standard Deviation: {std_dev:.3f}")
```

This outputs:

```
Standard Deviation: 9.555
```

To find the population standard deviation, you can use the pstdev() function.
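As a quick sketch comparing the two on the same data: **pstdev()** divides by $n$ rather than $n - 1$, so the population figure is always less than or equal to the sample figure:

```python
import statistics

data = [12, 15, 22, 29, 35]

# stdev() divides the sum of squared deviations by n - 1,
# pstdev() divides it by n
sample_sd = statistics.stdev(data)
population_sd = statistics.pstdev(data)
print(f"Sample:     {sample_sd:.3f}")
print(f"Population: {population_sd:.3f}")
```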

## 5. Variance

Variance, the square of the standard deviation, measures the spread of the dataset. The sample variance of the dataset is given by:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n - 1}$$

Where $x_i$ are the observations, $n$ is the number of values, and $\mu$ is the sample mean of the dataset.

The **variance()** function computes the sample variance:

```python
data = [8, 10, 12, 14, 16]
variance = statistics.variance(data)
print(f"Variance: {variance:.2f}")
```

Here’s the output:

```
Variance: 10.00
```

As with standard deviation, you can compute the population variance using the pvariance() function.
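A short sketch on the same data shows the difference: the sum of squared deviations here is 40, which the sample variance divides by $n - 1 = 4$ and the population variance by $n = 5$:

```python
import statistics

data = [8, 10, 12, 14, 16]

# Sample variance divides the sum of squared deviations by n - 1
print("Sample variance:", statistics.variance(data))      # 40 / 4 = 10
# Population variance divides it by n
print("Population variance:", statistics.pvariance(data)) # 40 / 5 = 8
```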

## 6. Covariance

Covariance captures the *joint variability* of two datasets: it indicates how changes in one variable relate to changes in the other.

You can calculate the covariance using the equation:

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Where:

- $x_i$ and $y_i$ are the data points of variables $X$ and $Y$
- $\bar{x}$ and $\bar{y}$ are the means of variables $X$ and $Y$
- $n$ is the number of observations

Here’s an example of calculating the covariance between **data1** and **data2**:

```python
data1 = [2, 4, 6, 8, 10]
data2 = [1, 3, 5, 7, 9]
covariance = statistics.covariance(data1, data2)
print("Covariance:", covariance)
```

You should get a similar output:

```
Covariance: 10.0
```

## 7. Quantiles

Quantiles in a dataset divide the data into continuous equal-sized intervals. They are a statistical tool used to understand the distribution and spread of a data set.

Quantiles are used in box plots and to identify outliers. Common types of quantiles include quartiles, deciles, and percentiles.

**Quartiles** divide the data into four equal parts. The three quartiles are:

- Q1 – first quartile or the 25th percentile, below which 25% of the data falls.
- Q2 – second quartile or the 50th percentile, below which 50% of the data falls.
- Q3 – third quartile or the 75th percentile, below which 75% of the data falls.

**Deciles** divide the data into ten equal parts. The nine deciles are the 10th, 20th, …, 90th percentiles.

**Percentiles** divide the data into 100 equal parts. The 99 percentiles are the 1st, 2nd, …, 99th percentiles.

Here’s an example:

```python
data = [1, 5, 7, 9, 10, 12, 16, 18, 19, 21]

# Quartiles
quantiles = statistics.quantiles(data, n=4)
print("Quantiles (Quartiles):", quantiles)
```

And the output:

```
Quantiles (Quartiles): [6.5, 11.0, 18.25]
```
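Building on the same data, passing a different *n* gives other quantiles, and the *method* argument controls how cut points are computed; both parameters are part of **quantiles()**:

```python
import statistics

data = [1, 5, 7, 9, 10, 12, 16, 18, 19, 21]

# n=10 returns the nine deciles (10th, 20th, ..., 90th percentiles)
deciles = statistics.quantiles(data, n=10)
print("Deciles:", deciles)

# method='inclusive' treats the data as the full population
# rather than a sample, which shifts the cut points
quartiles_inc = statistics.quantiles(data, n=4, method="inclusive")
print("Inclusive quartiles:", quartiles_inc)
```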

## 8. Correlation

You can use correlation to measure the strength and direction of the linear relationship between two variables or sets of observations.

The **correlation()** function in the statistics module can calculate:

- Pearson’s correlation coefficient, a number between -1 and 1 that measures both the strength and direction of a linear relationship—such as strong positive, strong negative, or no linear relationship—between variables. This is the default, with the *method* argument set to ‘linear’.
- Spearman’s rank correlation coefficient, which measures the strength of a monotonic relationship, when you set the *method* argument to ‘ranked’ (available in Python 3.12+).

Let’s take an example:

```python
data1 = [1, 2, 3, 4, 5]
data2 = [2, 4, 6, 8, 10]
correlation = statistics.correlation(data1, data2)
print("Correlation:", correlation)
```

In this case, **data2** is **data1** scaled by a factor of 2. So the correlation coefficient evaluates to 1:

```
Correlation: 1.0
```

## 9. Linear Regression

Linear regression fits a straight line to model the linear relationship between a dependent variable and an independent variable. It finds the best fit line—the slope and the intercept—through ordinary least squares.

The simple linear regression equation is:

$$y = mx + b$$

Here, $m$ is the slope and $b$ is the y-intercept.

To find the best fit straight line for the given independent variable x and the dependent variable y, you can use the **linear_regression()** function like so:

```python
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 5, 7]
slope, intercept = statistics.linear_regression(x, y)
print("Slope:", slope)
print("Intercept:", intercept)
```

For this example, the slope and the intercept are as follows:

```
Slope: 0.9
Intercept: 1.5
```

## 10. Normal Distribution

The normal distribution is the probability distribution you’ll most often run into when working with real-world data. Its probability density function is given by:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

Where $\mu$ is the mean and $\sigma$ is the standard deviation.

The statistics module provides a **NormalDist** class to create normal distributions and work with them. The following snippet shows how to instantiate a normal distribution, calculate a cumulative probability (CDF), and find the value at a given percentile using the inverse CDF:

```python
# Create a normal distribution with mean 30 and standard deviation 10
normal_dist = statistics.NormalDist(mu=30, sigma=10)

# Calculate the probability of a value less than or equal to 20
probability = normal_dist.cdf(20)
print(f"Probability (CDF) of 20: {probability:.3f}")

# Find the value at the 97.5th percentile using the inverse CDF
value = normal_dist.inv_cdf(0.975)
print(f"Value at the 97.5th percentile: {value:.3f}")
```

Here’s the output:

```
Probability (CDF) of 20: 0.159
Value at the 97.5th percentile: 49.600
```
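**NormalDist** also offers a **zscore()** method for standardizing values, which makes interval probabilities easy to sketch; the example below reuses the same mean and standard deviation as above:

```python
import statistics

normal_dist = statistics.NormalDist(mu=30, sigma=10)

# zscore() standardizes a value: (x - mu) / sigma
z = normal_dist.zscore(20)
print("Z-score of 20:", z)  # (20 - 30) / 10 = -1.0

# Probability of falling within one standard deviation of the mean
within_one_sigma = normal_dist.cdf(40) - normal_dist.cdf(20)
print(f"P(20 <= X <= 40): {within_one_sigma:.3f}")  # about 0.683
```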

## Summary

That’s all for this tutorial. Here’s a quick review:

| Statistical Function | In Python | Significance |
|---|---|---|
| Mean | `statistics.mean` | Calculates the average of the given data |
| Median | `statistics.median` | Finds the middle value in the sorted data |
| Mode | `statistics.mode` | Returns the most frequently occurring value in the data |
| Standard Deviation | `statistics.stdev` | Calculates the sample standard deviation of the data |
| Variance | `statistics.variance` | Calculates the sample variance, measuring the spread of the data points around the mean |
| Covariance | `statistics.covariance` | Calculates the covariance between two datasets |
| Quantiles | `statistics.quantiles` | Divides the data into equal-sized intervals |
| Correlation | `statistics.correlation` | Measures the strength and direction of the linear or monotonic relationship between two variables |
| Linear Regression | `statistics.linear_regression` | Performs simple linear regression on observations of an independent and a dependent variable |
| Normal Distribution | `statistics.NormalDist` | Class with methods for working with normal distributions, including calculating probabilities and z-scores |

If you’re looking to learn statistics and perform statistical analysis with programming languages like Python and R, check out 7 Best YouTube Channels to Learn Statistics for Free.