A pivot table is a type of table that summarizes a dataset using summary statistics.
We can create pivot tables in Python by using the pivot_table function from the pandas package, which uses the following syntax:
pivot_table(data, index=None, values=None, aggfunc=’mean’, fill_value=None, …)
where:
- data: name of DataFrame
- index: columns of DataFrame to group by
- values: columns of DataFrame to aggregate
- aggfunc: aggregate function to use. The default is mean.
- fill_value: value to replace missing values with.
Example: Pivot Tables in Python
Suppose we have the following dataset that contains information about eight basketball players:
import pandas as pd data = {'team': ['E', 'E', 'C', 'C', 'C', 'W', 'W', 'W'], 'player': ['Andy', 'Ben', 'Chris', 'Dwight', 'Elias', 'Frank', 'Greg', 'Hank'], 'rebounds': [12, 14, 13, 7, 8, 8, 9, 13], 'points': [22, 24, 26, 26, 29, 32, 20, 14] } df = pd.DataFrame(data,columns=['team', 'player', 'rebounds','points']) df team player rebounds points 0 E Andy 12 22 1 E Ben 14 24 2 C Chris 13 26 3 C Dwight 7 26 4 C Elias 8 29 5 W Frank 8 32 6 W Greg 9 20 7 W Hank 13 14
To find the mean number of points and rebounds scored by players on each team, we can use the following syntax:
pd.pivot_table(df,index=['team'])
points rebounds
team
C 27 9.333333
E 23 13.000000
W 22 10.000000
To only find the mean number of points scored by players on each team, we can use the following syntax:
pd.pivot_table(df,index=['team'], values=['points'])
points
team
C 27
E 23
W 22
To find the total number of points scored by players on each team, we can use the following syntax:
pd.pivot_table(df,index=['team'], values=['points'], aggfunc='sum')
points
team
C 81
E 46
W 66
To find the max number of points scored by players on each team, we can use the following syntax:
pd.pivot_table(df,index=['team'], values=['points'], aggfunc='max')
points
team
C 29
E 34
W 32
To find the total number of individual players on each team, we can use the following syntax:
pd.pivot_table(df,index=['team'], values=['player'], aggfunc='count')
player
team
C 3
E 2
W 3
Dealing with Missing Values
Suppose you have missing values in your dataset:
import pandas as pd import numpy as np data = {'team': ['E', 'E', 'C', 'C', 'C', 'W', 'W', 'W'], 'player': ['Andy', 'Ben', 'Chris', 'Dwight', 'Elias', 'Frank', 'Greg', 'Hank'], 'rebounds': [12, 14, np.NaN, 7, 8, np.NaN, 9, 13], 'points': [22, 24, 26, 26, 29, 32, np.NaN, 14] } df = pd.DataFrame(data,columns=['team', 'player', 'rebounds','points']) df team player rebounds points 0 E Andy 12.0 22.0 1 E Ben 14.0 24.0 2 C Chris NaN 26.0 3 C Dwight 7.0 26.0 4 C Elias 8.0 29.0 5 W Frank NaN 32.0 6 W Greg 9.0 NaN 7 W Hank 13.0 14.0
To calculate summary statistics, you can use numpy functions which automatically disregard missing values. For example, here’s how to calculate the mean number of points scored by players on each team:
pd.pivot_table(df,index=['team'], values=['points'], aggfunc=[np.mean])
mean
points
team
C 27.0
E 23.0
W 23.0
Filtering Pivot Tables
Once you’ve created a pivot table, you can filter it.
For example, suppose we only want to view the mean statistics for players who are on team W:
pivotTable = pd.pivot_table(df,index=['team'])
pivotTable.query('team == ["W"]')
points rebounds
team
W 23.0 11.0
Or suppose we only want to view the mean statistics for players on teams E or W:
pivotTable.query('team == ["E", "W"]')
points rebounds
team
E 23.0 13.0
W 23.0 11.0
You can find the complete documentation for the pivot_table function here.