PySpark: How to Use Groupby Agg on Multiple Columns


You can use the following syntax to group by and perform aggregations on multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'), 
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()

This particular example groups the rows of the DataFrame by the team column, then performs the following aggregations:

  • Calculates the sum of the points column and uses the name sum_pts
  • Calculates the mean of the points column and uses the name mean_pts
  • Calculates the count of the assists column and uses the name count_ast
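
As a side note, the agg function also accepts a dictionary that maps a column name to the name of an aggregate function. This form is more compact, but it only allows one aggregation per column and the output column names are generated automatically (e.g. sum(points)):

#alternative dict syntax: one aggregation per column, auto-generated names
df.groupBy('team').agg({'points': 'sum', 'assists': 'count'}).show()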

The following example shows how to use this syntax in practice.

Example: How to Use Groupby Agg on Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+
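
Note that createDataFrame infers the column types from the Python values: team and position are inferred as strings, while points and assists are inferred as longs. To confirm this, you can inspect the schema:

#view the inferred schema
df.printSchema()

root
 |-- team: string (nullable = true)
 |-- position: string (nullable = true)
 |-- points: long (nullable = true)
 |-- assists: long (nullable = true)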

We can use the following syntax to group the rows by the values in the team column and then calculate several aggregate metrics:

from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'), 
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()

+----+-------+--------+---------+
|team|sum_pts|mean_pts|count_ast|
+----+-------+--------+---------+
|   A|     63|   15.75|        4|
|   B|     55|   13.75|        4|
|   C|     53|    26.5|        2|
+----+-------+--------+---------+

The resulting DataFrame shows the sum of the points values, the mean of the points values, and the count of the assists values for each team.

For example, we can see:

  • The sum of points for team A is 63.
  • The mean of points for team A is 15.75.
  • The count of assists for team A is 4.

And so on.
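
The same approach extends to grouping by more than one column, since groupBy accepts several column names. As a quick sketch using the same DataFrame from above, we could group by both team and position and then sort the result, since the row order of a grouped result is otherwise not guaranteed:

from pyspark.sql.functions import sum, mean, count

#group by team and position, aggregate, then sort for readability
df.groupBy('team', 'position').agg(sum('points').alias('sum_pts'),
                                   mean('points').alias('mean_pts'),
                                   count('assists').alias('count_ast')) \
  .orderBy('team', 'position').show()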

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark DataFrame
How to Add Multiple Columns to PySpark DataFrame
How to Add New Rows to PySpark DataFrame
