You can use the following syntax to calculate the sum by group in a PySpark DataFrame:

df.groupBy('team').sum('points').show()

This particular example calculates the sum of the values in the **points** column, grouped by the values in the **team** column of the DataFrame.

The following example shows how to use this syntax in practice.

**Example: How to Calculate Sum by Group in PySpark**

Suppose we have the following PySpark DataFrame that contains information about the points scored by basketball players on various teams:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 11],
        ['A', 8],
        ['A', 22],
        ['B', 22],
        ['B', 14],
        ['B', 14],
        ['C', 13],
        ['C', 7],
        ['C', 15]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    22|
|   B|    22|
|   B|    14|
|   B|    14|
|   C|    13|
|   C|     7|
|   C|    15|
+----+------+

We can use the following syntax to calculate the sum of the values in the **points** column, grouped by the values in the **team** column:

#calculate sum of points, grouped by team
df.groupBy('team').sum('points').show()

+----+-----------+
|team|sum(points)|
+----+-----------+
|   A|         41|
|   B|         50|
|   C|         35|
+----+-----------+

The resulting DataFrame shows the sum of the points values for each team.

For example, we can see:

- The sum of values for all players on team A was **41**.
- The sum of values for all players on team B was **50**.
- The sum of values for all players on team C was **35**.

**Note**: If there are null values in the points column, the **sum** function will ignore these values by default.

If you would like to give the **sum(points)** column a different name, you can use the **alias** function as follows:

from pyspark.sql.functions import sum

#calculate sum of points, grouped by team
df.groupBy('team').agg(sum('points').alias('points_sum')).show()

+----+----------+
|team|points_sum|
+----+----------+
|   A|        41|
|   B|        50|
|   C|        35|
+----+----------+

The resulting DataFrame shows the sum of points scored by each team and the sum column now uses the name **points_sum**, just as we specified in the **alias** function.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame

How to Add New Rows to PySpark DataFrame