You can use the following syntax to group by and perform aggregations on multiple columns in a PySpark DataFrame:

```python
from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'),
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()
```

This particular example groups the rows of the DataFrame by the team column, then performs the following aggregations:

- Calculates the sum of the **points** column and uses the name **sum_pts**
- Calculates the mean of the **points** column and uses the name **mean_pts**
- Calculates the count of the **assists** column and uses the name **count_ast**

The following example shows how to use this syntax in practice.

**Example: How to Use Groupby Agg On Multiple Columns in PySpark**

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]

#define column names
columns = ['team', 'position', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()
```

```
+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+
```

We can use the following syntax to group the rows by the values in the **team** column and then calculate several aggregate metrics:

```python
from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'),
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()
```

```
+----+-------+--------+---------+
|team|sum_pts|mean_pts|count_ast|
+----+-------+--------+---------+
|   A|     63|   15.75|        4|
|   B|     55|   13.75|        4|
|   C|     53|    26.5|        2|
+----+-------+--------+---------+
```

The resulting DataFrame shows the sum of the **points** values, the mean of the **points** values, and the count of the **assists** values for each team.

For example, we can see:

- The sum of points for team A is **63**.
- The mean of points for team A is **15.75**.
- The count of assists for team A is **4**.

And so on.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame

How to Add New Rows to PySpark DataFrame