You can use the following syntax to group by multiple columns and perform an aggregation in a PySpark DataFrame:

```python
df.groupBy('team', 'position').sum('points').show()
```

This particular example calculates the sum of the values in the **points** column, grouped by the values in the **team** and **position** columns of the DataFrame.

By including multiple column names in the **groupBy** function, you can group by multiple columns before performing some aggregation.

The following example shows how to use this syntax in practice.

**Example: How to Use groupBy On Multiple Columns in PySpark**

Suppose we have the following PySpark DataFrame that contains information about the points scored by various basketball players:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()
```

```
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
```

We can use the following syntax to calculate the sum of the values in the **points** column, grouped by the values in both the **team** and **position** columns:

```python
#calculate sum of points, grouped by team and position
df.groupBy('team', 'position').sum('points').show()
```

```
+----+--------+-----------+
|team|position|sum(points)|
+----+--------+-----------+
|   A|   Guard|         19|
|   A| Forward|         44|
|   B|   Guard|         28|
|   B| Forward|         20|
+----+--------+-----------+
```

The resulting DataFrame shows the sum of the points values, grouped by team and position.

For example, we can see:

- The sum of points for all guards on team A is **19**.
- The sum of points for all forwards on team A is **44**.
- The sum of points for all guards on team B is **28**.
- The sum of points for all forwards on team B is **20**.

If you would like to give the **sum(points)** column a different name, you can use the **alias** function as follows:

```python
from pyspark.sql.functions import sum

#calculate sum of points, grouped by team and position
df.groupBy('team', 'position').agg(sum('points').alias('points_sum')).show()
```

```
+----+--------+----------+
|team|position|points_sum|
+----+--------+----------+
|   A|   Guard|        19|
|   A| Forward|        44|
|   B|   Guard|        28|
|   B| Forward|        20|
+----+--------+----------+
```

The resulting DataFrame shows the sum of points scored by each team and position with the sum column now named **points_sum**, just as we specified in the **alias** function.

**Note #1:** In this example we used **sum** as the aggregation metric, but you could use a different metric such as **count**, **mean**, **max**, etc.

**Note #2**: You can find the complete documentation for the PySpark **groupBy** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame

How to Add New Rows to PySpark DataFrame