You can use the following methods to calculate the mean value by group in a PySpark DataFrame:

**Method 1: Calculate Mean Grouped by One Column**

    #calculate mean of 'points' grouped by 'team'
    df.groupBy('team').mean('points').show()

**Method 2: Calculate Mean Grouped by Multiple Columns**

    #calculate mean of 'points' grouped by 'team' and 'position'
    df.groupBy('team', 'position').mean('points').show()

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    #define data
    data = [['A', 'Guard', 11],
            ['A', 'Guard', 8],
            ['A', 'Forward', 22],
            ['A', 'Forward', 22],
            ['B', 'Guard', 14],
            ['B', 'Guard', 14],
            ['B', 'Guard', 13],
            ['B', 'Forward', 7],
            ['C', 'Guard', 8],
            ['C', 'Forward', 5]]

    #define column names
    columns = ['team', 'position', 'points']

    #create dataframe using data and column names
    df = spark.createDataFrame(data, columns)

    #view dataframe
    df.show()

    +----+--------+------+
    |team|position|points|
    +----+--------+------+
    |   A|   Guard|    11|
    |   A|   Guard|     8|
    |   A| Forward|    22|
    |   A| Forward|    22|
    |   B|   Guard|    14|
    |   B|   Guard|    14|
    |   B|   Guard|    13|
    |   B| Forward|     7|
    |   C|   Guard|     8|
    |   C| Forward|     5|
    +----+--------+------+

**Example 1: Calculate Mean Grouped by One Column**

We can use the following syntax to calculate the mean value in the **points** column grouped by the values in the **team** column:

    #calculate mean of 'points' grouped by 'team'
    df.groupBy('team').mean('points').show()

    +----+-----------+
    |team|avg(points)|
    +----+-----------+
    |   A|      15.75|
    |   B|       12.0|
    |   C|        6.5|
    +----+-----------+

From the output we can see:

- The average points value for players on team A is **15.75**.
- The average points value for players on team B is **12**.
- The average points value for players on team C is **6.5**.
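To see where these averages come from, the same arithmetic can be reproduced by hand. The following is a minimal plain-Python sketch, not PySpark: it groups the rows from the example DataFrame by team and divides each team's total points by its number of players. The names `points_by_team` and `team_means` are illustrative only.

```python
from collections import defaultdict

# Same (team, position, points) rows as the example DataFrame
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Guard', 13],
        ['B', 'Forward', 7],
        ['C', 'Guard', 8],
        ['C', 'Forward', 5]]

# Collect the points values for each team
points_by_team = defaultdict(list)
for team, _position, points in data:
    points_by_team[team].append(points)

# Mean = sum of the group's points / number of rows in the group
team_means = {team: sum(values) / len(values)
              for team, values in points_by_team.items()}

print(team_means)  # {'A': 15.75, 'B': 12.0, 'C': 6.5}
```

For team A, for example, the sum is 11 + 8 + 22 + 22 = 63 across 4 players, which gives 63 / 4 = 15.75 and matches the PySpark output.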

**Example 2: Calculate Mean Grouped by Multiple Columns**

We can use the following syntax to calculate the mean value in the **points** column grouped by the values in the **team** and **position** columns:

    #calculate mean of 'points' grouped by 'team' and 'position'
    df.groupBy('team', 'position').mean('points').show()

    +----+--------+------------------+
    |team|position|       avg(points)|
    +----+--------+------------------+
    |   A|   Guard|               9.5|
    |   A| Forward|              22.0|
    |   B|   Guard|13.666666666666666|
    |   B| Forward|               7.0|
    |   C| Forward|               5.0|
    |   C|   Guard|               8.0|
    +----+--------+------------------+

From the output we can see:

- The average points value for Guards on team A is **9.5**.
- The average points value for Forwards on team A is **22**.
- The average points value for Guards on team B is approximately **13.67**.

And so on.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Calculate Mean of Multiple Columns in PySpark

How to Calculate Sum by Group in PySpark