How to Calculate Summary Statistics for a Pandas DataFrame


You can use the following methods to calculate summary statistics for variables in a pandas DataFrame:

Method 1: Calculate Summary Statistics for All Numeric Variables

df.describe()

Method 2: Calculate Summary Statistics for All String Variables

df.describe(include='object')

Method 3: Calculate Summary Statistics Grouped by a Variable

df.groupby('group_column').mean()

df.groupby('group_column').median()

df.groupby('group_column').max()

...

The examples below show how to use each method in practice with this pandas DataFrame:

import pandas as pd
import numpy as np

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

#view DataFrame
print(df)

  team  points  assists  rebounds
0    A      18      5.0      11.0
1    A      22      NaN       8.0
2    A      19      7.0      10.0
3    A      14      9.0       6.0
4    B      14     12.0       6.0
5    B      11      9.0       5.0
6    B      20      9.0       9.0
7    B      28      4.0       NaN
8    B      30      5.0       6.0

Example 1: Calculate Summary Statistics for All Numeric Variables

The following code shows how to calculate the summary statistics for each numeric variable in the DataFrame:

df.describe()

          points    assists   rebounds
count   9.000000   8.000000   8.000000
mean   19.555556   7.500000   7.625000
std     6.366143   2.725541   2.199838
min    11.000000   4.000000   5.000000
25%    14.000000   5.000000   6.000000
50%    19.000000   8.000000   7.000000
75%    22.000000   9.000000   9.250000
max    30.000000  12.000000  11.000000

We can see the following summary statistics for each of the three numeric variables:

  • count: The count of non-null values
  • mean: The mean value
  • std: The sample standard deviation
  • min: The minimum value
  • 25%: The value at the 25th percentile
  • 50%: The value at the 50th percentile (also the median)
  • 75%: The value at the 75th percentile
  • max: The maximum value
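
As a rough cross-check, each row of the describe() output can be reproduced with an individual Series method. The sketch below does this for the points column and then asks describe() for different percentiles through its percentiles argument. The variable names manual and custom are only illustrative.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the examples above
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

# Recompute each describe() row for one column with individual methods
points = df['points']
manual = pd.Series({
    'count': points.count(),        # number of non-null values
    'mean': points.mean(),
    'std': points.std(),            # sample standard deviation (ddof=1)
    'min': points.min(),
    '25%': points.quantile(0.25),
    '50%': points.quantile(0.50),   # same as points.median()
    '75%': points.quantile(0.75),
    'max': points.max(),
})
print(manual)

# Request the 10th and 90th percentiles instead of the default quartiles
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)
```

The values in manual should match the points column of the describe() output shown above.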

Example 2: Calculate Summary Statistics for All String Variables

The following code shows how to calculate the summary statistics for each string variable in the DataFrame:

df.describe(include='object')

       team
count     9
unique    2
top       B
freq      5

We can see the following summary statistics for the one string variable in our DataFrame:

  • count: The count of non-null values
  • unique: The number of unique values
  • top: The most frequently occurring value
  • freq: The count of the most frequently occurring value
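
To see where the unique, top, and freq values come from, it can help to count the values directly. This is a minimal sketch using value_counts() and nunique() on the team column alone; the variable name counts is only illustrative.

```python
import pandas as pd

# Only the string column from the example DataFrame is needed here
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B']})

# Frequency of each distinct value, most frequent first
counts = df['team'].value_counts()
print(counts)

# Number of distinct values (the 'unique' row of describe)
print(df['team'].nunique())

# Most frequent value and its count (the 'top' and 'freq' rows of describe)
print(counts.idxmax(), counts.max())
```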

Example 3: Calculate Summary Statistics Grouped by a Variable

The following code shows how to calculate the mean value for all numeric variables, grouped by the team variable:

df.groupby('team').mean()

      points  assists  rebounds
team                           
A      18.25      7.0      8.75
B      20.60      7.8      6.50

The output displays the mean value for the points, assists, and rebounds variables, grouped by the team variable.
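
The example DataFrame has no string columns other than team, so mean() works without extra arguments. In recent pandas versions, a grouped mean() can raise a TypeError when other non-numeric columns are present. The sketch below adds a hypothetical position column (not part of the original data) and shows two ways to restrict the calculation to numeric columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

# Hypothetical extra string column, added only for illustration
df['position'] = ['G', 'G', 'F', 'F', 'G', 'F', 'F', 'C', 'C']

# Option 1: ask pandas to skip non-numeric columns
means = df.groupby('team').mean(numeric_only=True)
print(means)

# Option 2: select the numeric columns explicitly before aggregating
means_selected = df.groupby('team')[['points', 'assists', 'rebounds']].mean()
print(means_selected)
```

Both approaches should return the same group means as the output above, since missing values are skipped by default.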

Note that we can use similar syntax to calculate a different summary statistic, such as the median:

df.groupby('team').median()

      points  assists  rebounds
team                           
A       18.5      7.0       9.0
B       20.0      9.0       6.0

The output displays the median value for the points, assists, and rebounds variables, grouped by the team variable.
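
Rather than calling mean(), median(), and max() one at a time, several statistics can be computed in a single step with agg(). The sketch below shows one way to do this. The names summary, per_column, avg_points, and max_rebounds are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

# Several statistics for every numeric column at once;
# the result has two-level columns such as ('points', 'mean')
summary = df.groupby('team').agg(['mean', 'median', 'max'])
print(summary)

# Named aggregation: pick a specific statistic for specific columns
per_column = df.groupby('team').agg(
    avg_points=('points', 'mean'),
    max_rebounds=('rebounds', 'max'),
)
print(per_column)
```

The mean and median values in summary should agree with the two grouped outputs above.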

Note: The complete documentation for the describe() method is available in the official pandas documentation.

Additional Resources

The following tutorials explain how to perform other common tasks in pandas:

How to Count Observations by Group in Pandas
How to Find the Max Value by Group in Pandas
How to Identify Outliers in Pandas
