How to Calculate Sample & Population Variance in Python


The variance is a way to measure the spread of values in a dataset.

The formula to calculate population variance is:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

where:

  • $\sum$: A symbol that means “sum”
  • $\mu$: Population mean
  • $x_i$: The ith element from the population
  • $N$: Population size

The formula to calculate sample variance is:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

where:

  • $\bar{x}$: Sample mean
  • $x_i$: The ith element from the sample
  • $n$: Sample size
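
As a rough illustration, the two formulas can be written out directly in plain Python. The helper names population_variance and sample_variance are hypothetical, chosen only for this sketch, and the small dataset is made up:

```python
def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Made-up values for illustration only (mean is 5, squared deviations sum to 32)
values = [2, 4, 4, 4, 5, 5, 7, 9]

pop_var = population_variance(values)   # 32 / 8
samp_var = sample_variance(values)      # 32 / 7
print(pop_var)
print(samp_var)
```

The only difference between the two helpers is the divisor, which mirrors the difference between the two formulas.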

We can use the variance and pvariance functions from the statistics module in Python's standard library to quickly calculate the sample variance and population variance (respectively) for a given list of values.

from statistics import variance, pvariance

#calculate sample variance
variance(x)

#calculate population variance
pvariance(x)
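
One practical difference between the two functions is the minimum amount of data they accept. Because the sample formula divides by n - 1, variance needs at least two values and raises a StatisticsError otherwise, while pvariance only needs one. A short sketch, using a made-up one-element list:

```python
from statistics import StatisticsError, pvariance, variance

single_value = [5]  # made-up one-element dataset

# Population variance of a single value is defined and equals 0
single_pvar = pvariance(single_value)
print(single_pvar)

# Sample variance needs at least two values, so this call raises an error
try:
    variance(single_value)
    raised = False
except StatisticsError:
    raised = True
print(raised)
```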

The following examples show how to use each function in practice.

Example 1: Calculating Sample Variance in Python

The following code shows how to calculate the sample variance of an array in Python:

from statistics import variance 

#define data
data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]

#calculate sample variance, rounded to three decimal places
print(round(variance(data), 3))

22.067

The sample variance turns out to be 22.067.

Example 2: Calculating Population Variance in Python

The following code shows how to calculate the population variance of an array in Python:

from statistics import pvariance 

#define data
data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]

#calculate population variance, rounded to three decimal places
print(round(pvariance(data), 3))

20.596

The population variance turns out to be 20.596.
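
Both functions also accept an optional second argument for a mean that has already been computed: xbar for variance and mu for pvariance. The sketch below reuses the dataset from the examples above and assumes the mean is calculated once with statistics.mean; the variable name m is just a placeholder:

```python
from statistics import mean, pvariance, variance

data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]

# Compute the mean once and reuse it for both calculations
m = mean(data)

sample_var = variance(data, xbar=m)
population_var = pvariance(data, mu=m)

print(round(sample_var, 3))      # 22.067
print(round(population_var, 3))  # 20.596
```

Passing a value that is not the actual mean of the data would give a wrong result, so this is only worth doing when the mean is already available.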

Notes on Calculating Sample & Population Variance

Keep in mind the following when calculating the sample and population variance:

  • You should calculate the population variance when the dataset you’re working with represents an entire population, i.e. every value that you’re interested in.
  • You should calculate the sample variance when the dataset you’re working with represents a sample taken from a larger population of interest.
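  • The sample variance of a given array of data will always be larger than the population variance of the same array (provided the values are not all identical), because the sample formula divides the same sum of squared deviations by n - 1 instead of N. The smaller divisor compensates for the fact that deviations measured from the sample mean tend to understate the spread around the true population mean.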
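
Since the two formulas share the same numerator and differ only in the divisor, the sample variance equals the population variance multiplied by n / (n - 1). A minimal sketch checking this with the dataset from the examples above; the list named constant is made up to show the edge case where both variances are zero:

```python
from statistics import pvariance, variance

data = [4, 8, 12, 15, 9, 6, 14, 18, 12, 9, 16, 17, 17, 20, 14]
n = len(data)

sample_var = variance(data)
population_var = pvariance(data)

# The two results differ only by the factor n / (n - 1)
rescaled = population_var * n / (n - 1)
print(abs(sample_var - rescaled) < 1e-9)  # True

# When every value is identical, both variances are 0
constant = [7, 7, 7, 7]
print(variance(constant) == pvariance(constant) == 0)  # True
```

As n grows, the factor n / (n - 1) approaches 1, so the gap between the two variances shrinks for larger datasets.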

Additional Resources

The following tutorials explain how to calculate other measures of spread in Python:

How to Calculate The Interquartile Range in Python
How to Calculate the Coefficient of Variation in Python
How to Calculate the Standard Deviation of a List in Python
