You can use the following methods to calculate summary statistics for variables in a pandas DataFrame:
Method 1: Calculate Summary Statistics for All Numeric Variables
df.describe()
Method 2: Calculate Summary Statistics for All String Variables
df.describe(include='object')
Method 3: Calculate Summary Statistics Grouped by a Variable
df.groupby('group_column').mean()
df.groupby('group_column').median()
df.groupby('group_column').max()
...
The following examples show how to use each method in practice with this pandas DataFrame:
import pandas as pd
import numpy as np

#create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

#view DataFrame
print(df)

  team  points  assists  rebounds
0    A      18      5.0      11.0
1    A      22      NaN       8.0
2    A      19      7.0      10.0
3    A      14      9.0       6.0
4    B      14     12.0       6.0
5    B      11      9.0       5.0
6    B      20      9.0       9.0
7    B      28      4.0       NaN
8    B      30      5.0       6.0
Example 1: Calculate Summary Statistics for All Numeric Variables
The following code shows how to calculate the summary statistics for each numeric variable in the DataFrame:
df.describe()

          points    assists   rebounds
count   9.000000   8.000000   8.000000
mean   19.555556   7.500000   7.625000
std     6.366143   2.725541   2.199838
min    11.000000   4.000000   5.000000
25%    14.000000   5.000000   6.000000
50%    19.000000   8.000000   7.000000
75%    22.000000   9.000000   9.250000
max    30.000000  12.000000  11.000000
We can see the following summary statistics for each of the three numeric variables:
- count: The count of non-null values
- mean: The mean value
- std: The sample standard deviation (n - 1 in the denominator)
- min: The minimum value
- 25%: The value at the 25th percentile
- 50%: The value at the 50th percentile (also the median)
- 75%: The value at the 75th percentile
- max: The maximum value
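As a rough cross-check of the list above, each row of the describe() output can be reproduced with a standalone Series method. The sketch below rebuilds the same DataFrame and recomputes the statistics for the assists column; the variable names summary and manual are illustrative only.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

summary = df.describe()
assists = df['assists']

# Recompute each describe() row for one column by hand
manual = pd.Series({
    'count': assists.count(),        # NaN values are excluded
    'mean': assists.mean(),
    'std': assists.std(),            # sample standard deviation (ddof=1)
    'min': assists.min(),
    '25%': assists.quantile(0.25),
    '50%': assists.median(),
    '75%': assists.quantile(0.75),
    'max': assists.max(),
})

print(manual)
```

Note that count is 8 rather than 9 for assists because the NaN in row 1 is skipped, and the other statistics are likewise computed on the non-null values only.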
Example 2: Calculate Summary Statistics for All String Variables
The following code shows how to calculate the summary statistics for each string variable in the DataFrame:
df.describe(include='object')

       team
count     9
unique    2
top       B
freq      5
We can see the following summary statistics for the one string variable in our DataFrame:
- count: The count of non-null values
- unique: The number of unique values
- top: The most frequently occurring value
- freq: The count of the most frequently occurring value
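The four values above can also be derived directly from the column, which may help when only one of them is needed. The following is a minimal sketch using value_counts() and nunique() on the team column; the names counts and manual are illustrative.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

team = df['team']
counts = team.value_counts()         # frequency of each distinct value

manual = {
    'count': int(team.count()),      # non-null values
    'unique': int(team.nunique()),   # number of distinct values
    'top': counts.idxmax(),          # most frequent value
    'freq': int(counts.max()),       # how often the top value occurs
}

print(manual)
```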
Example 3: Calculate Summary Statistics Grouped by a Variable
The following code shows how to calculate the mean value for all numeric variables, grouped by the team variable:
df.groupby('team').mean()

      points  assists  rebounds
team
A      18.25      7.0      8.75
B      20.60      7.8      6.50
The output displays the mean value for the points, assists, and rebounds variables, grouped by the team variable.
Note that we can use similar syntax to calculate a different summary statistic, such as the median:
df.groupby('team').median()

      points  assists  rebounds
team
A       18.5      7.0       9.0
B       20.0      9.0       6.0
The output displays the median value for the points, assists, and rebounds variables, grouped by the team variable.
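Rather than calling one method per statistic, several grouped statistics can be requested in one pass with agg(). The sketch below is one possible way to do this on the same DataFrame; the choice of mean, median, and max is only an example. It also passes numeric_only=True to mean(), which is not needed here (team is the only string column and serves as the group key) but can help when a DataFrame holds other non-numeric columns.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

# Several statistics at once; the result has two-level column labels
stats = df.groupby('team').agg(['mean', 'median', 'max'])
print(stats)

# The same idea restricted to a single column
points_stats = df.groupby('team')['points'].agg(['mean', 'median', 'max'])
print(points_stats)

# Explicitly limit the mean to numeric columns
means = df.groupby('team').mean(numeric_only=True)
print(means)
```

Individual values can be read from the combined result with a (column, statistic) pair, for example stats.loc['A', ('points', 'mean')].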
Note: You can find the complete documentation for the describe() function in the official pandas documentation.
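Two describe() parameters that the documentation covers are percentiles and include. The sketch below shows one way they might be used on the same DataFrame; the 10th and 90th percentiles are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the tutorial
df = pd.DataFrame({'team': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B'],
                   'points': [18, 22, 19, 14, 14, 11, 20, 28, 30],
                   'assists': [5, np.nan, 7, 9, 12, 9, 9, 4, 5],
                   'rebounds': [11, 8, 10, 6, 6, 5, 9, np.nan, 6]})

# Request the 10th and 90th percentiles instead of the default quartiles
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)

# Summarize numeric and string columns in a single table;
# statistics that do not apply to a column are shown as NaN
everything = df.describe(include='all')
print(everything)
```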
Additional Resources
The following tutorials explain how to perform other common tasks in pandas:
How to Count Observations by Group in Pandas
How to Find the Max Value by Group in Pandas
How to Identify Outliers in Pandas