You can use the following syntax to group by and perform aggregations on multiple columns in a PySpark DataFrame:

```python
from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'),
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()
```

This particular example groups the rows of the DataFrame by the team column, then performs the following aggregations:

- Calculates the sum of the **points** column and uses the name **sum_pts**
- Calculates the mean of the **points** column and uses the name **mean_pts**
- Calculates the count of the **assists** column and uses the name **count_ast**

The following example shows how to use this syntax in practice.

**Example: How to Use Groupby Agg On Multiple Columns in PySpark**

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 5],
        ['A', 'Guard', 8, 4],
        ['A', 'Forward', 22, 3],
        ['A', 'Forward', 22, 6],
        ['B', 'Guard', 14, 3],
        ['B', 'Guard', 14, 5],
        ['B', 'Forward', 13, 7],
        ['B', 'Forward', 14, 8],
        ['C', 'Forward', 23, 2],
        ['C', 'Guard', 30, 5]]

#define column names
columns = ['team', 'position', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()
```

```
+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      5|
|   A|   Guard|     8|      4|
|   A| Forward|    22|      3|
|   A| Forward|    22|      6|
|   B|   Guard|    14|      3|
|   B|   Guard|    14|      5|
|   B| Forward|    13|      7|
|   B| Forward|    14|      8|
|   C| Forward|    23|      2|
|   C|   Guard|    30|      5|
+----+--------+------+-------+
```

We can use the following syntax to group the rows by the values in the **team** column and then calculate several aggregate metrics:

```python
from pyspark.sql.functions import sum, mean, count

#group by team column and aggregate using multiple columns
df.groupBy('team').agg(sum('points').alias('sum_pts'),
                       mean('points').alias('mean_pts'),
                       count('assists').alias('count_ast')).show()
```

```
+----+-------+--------+---------+
|team|sum_pts|mean_pts|count_ast|
+----+-------+--------+---------+
|   A|     63|   15.75|        4|
|   B|     55|   13.75|        4|
|   C|     53|    26.5|        2|
+----+-------+--------+---------+
```

The resulting DataFrame shows the sum of the **points** values, the mean of the **points** values, and the count of the **assists** values for each team.

For example, we can see:

- The sum of points for team A is **63**.
- The mean of points for team A is **15.75**.
- The count of assists for team A is **4**.

And so on.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Sum Multiple Columns in PySpark DataFrame

How to Add Multiple Columns to PySpark DataFrame

How to Add New Rows to PySpark DataFrame