How to Calculate Sum by Group in PySpark


You can use the following syntax to calculate the sum by group in a PySpark DataFrame:

df.groupBy('team').sum('points').show()

This particular example calculates the sum of the values in the points column, grouped by the values in the team column of the DataFrame.

The following example shows how to use this syntax in practice.

Example: How to Calculate Sum by Group in PySpark

Suppose we have the following PySpark DataFrame that contains information about the points scored by basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 11], 
        ['A', 8], 
        ['A', 22], 
        ['B', 22], 
        ['B', 14], 
        ['B', 14],
        ['C', 13],
        ['C', 7],
        ['C', 15]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    22|
|   B|    22|
|   B|    14|
|   B|    14|
|   C|    13|
|   C|     7|
|   C|    15|
+----+------+

We can use the following syntax to calculate the sum of the values in the points column, grouped by the values in the team column:

#calculate sum of points, grouped by team
df.groupBy('team').sum('points').show()

+----+-----------+
|team|sum(points)|
+----+-----------+
|   A|         41|
|   B|         50|
|   C|         35|
+----+-----------+

The resulting DataFrame shows the sum of the points values for each team.

For example, we can see:

  • The sum of points values for all players on team A was 41.
  • The sum of points values for all players on team B was 50.
  • The sum of points values for all players on team C was 35.

Note: If there are null values in the points column, the sum function will ignore these values by default.

If you would like to give the sum(points) column a different name, you can use the alias function as follows:

#note: this import shadows Python's built-in sum function
from pyspark.sql.functions import sum

#calculate sum of points, grouped by team
df.groupBy('team').agg(sum('points').alias('points_sum')).show()

+----+----------+
|team|points_sum|
+----+----------+
|   A|        41|
|   B|        50|
|   C|        35|
+----+----------+

The resulting DataFrame shows the sum of points scored by each team and the sum column now uses the name points_sum, just as we specified in the alias function.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark DataFrame
How to Add Multiple Columns to PySpark DataFrame
How to Add New Rows to PySpark DataFrame
