You can use the following syntax to calculate the mean value across multiple columns in a PySpark DataFrame:

from pyspark.sql import functions as F

#define columns to calculate mean for
mean_cols = ['game1', 'game2', 'game3']

#define function to calculate mean
find_mean = F.expr('+'.join(mean_cols)) / len(mean_cols)

#calculate mean across specific columns
df_new = df.withColumn('mean', find_mean)

This particular example creates a new column called **mean** that contains the mean of values across the **game1**, **game2** and **game3** columns in the DataFrame.
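To see why this works, the expression string can be assembled in plain Python (no Spark session needed); **F.expr** then parses the resulting string as a SQL arithmetic expression. The following is a minimal sketch of that string-building step:

```python
# Sketch of how the mean expression is assembled (plain Python, no Spark needed)
mean_cols = ['game1', 'game2', 'game3']

# '+'.join produces the SQL arithmetic string that F.expr will parse
expr_str = '+'.join(mean_cols)
print(expr_str)        # game1+game2+game3

# In PySpark this would then be used as:
#   df_new = df.withColumn('mean', F.expr(expr_str) / len(mean_cols))
print(len(mean_cols))  # 3
```

Because the expression is built from the list of column names, adding or removing a column from **mean_cols** automatically updates both the sum and the divisor.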

The following example shows how to use this syntax in practice.

**Example: How to Calculate Mean of Multiple Columns in PySpark**

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players during three different games:

**from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 25, 11, 10],
['Nets', 22, 8, 14],
['Hawks', 14, 22, 10],
['Kings', 30, 22, 35],
['Bulls', 15, 14, 12],
['Blazers', 10, 14, 18]]
#define column names
columns = ['team', 'game1', 'game2', 'game3']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------+-----+-----+-----+
| team|game1|game2|game3|
+-------+-----+-----+-----+
| Mavs| 25| 11| 10|
| Nets| 22| 8| 14|
| Hawks| 14| 22| 10|
| Kings| 30| 22| 35|
| Bulls| 15| 14| 12|
|Blazers| 10| 14| 18|
+-------+-----+-----+-----+**

Suppose we would like to add a new column called **mean** that contains the mean of points scored by each player across all three games.

We can use the following syntax to do so:

from pyspark.sql import functions as F

#define columns to calculate mean for
mean_cols = ['game1', 'game2', 'game3']

#define function to calculate mean
find_mean = F.expr('+'.join(mean_cols)) / len(mean_cols)

#calculate mean across specific columns
df_new = df.withColumn('mean', find_mean)

#view new DataFrame
df_new.show()

+-------+-----+-----+-----+------------------+
|   team|game1|game2|game3|              mean|
+-------+-----+-----+-----+------------------+
|   Mavs|   25|   11|   10|15.333333333333334|
|   Nets|   22|    8|   14|14.666666666666666|
|  Hawks|   14|   22|   10|15.333333333333334|
|  Kings|   30|   22|   35|              29.0|
|  Bulls|   15|   14|   12|13.666666666666666|
|Blazers|   10|   14|   18|              14.0|
+-------+-----+-----+-----+------------------+

Notice that the new **mean** column contains the mean of values across the **game1**, **game2** and **game3** columns.

For example:

- The mean of points for the **Mavs** player is (25 + 11 + 10) / 3 = **15.33**
- The mean of points for the **Nets** player is (22 + 8 + 14) / 3 = **14.67**
- The mean of points for the **Hawks** player is (14 + 22 + 10) / 3 = **15.33**

And so on.
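The per-row arithmetic above can be checked directly with plain Python, independent of Spark:

```python
# Verify the per-row means shown above using plain Python arithmetic
scores = {'Mavs': (25, 11, 10), 'Nets': (22, 8, 14), 'Hawks': (14, 22, 10)}

for team, games in scores.items():
    mean = sum(games) / len(games)
    print(team, round(mean, 2))
# Mavs 15.33
# Nets 14.67
# Hawks 15.33
```

These rounded values match the **mean** column that PySpark computed.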

Note that we used the **withColumn** function to return a new DataFrame with the **mean** column added and all other columns left unchanged.
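One caveat worth noting: because the expression adds the raw columns, a null in any of the game columns makes the whole sum, and therefore the mean, null for that row. One possible workaround, sketched below under the assumption that you want nulls treated as zero, is to wrap each column in SQL's **coalesce** when building the expression string:

```python
# Hedged sketch: build a null-tolerant expression by wrapping each column in
# coalesce(..., 0). Note this counts a missing score as 0, which may or may not
# be the statistic you want; dropping rows with nulls first is an alternative.
mean_cols = ['game1', 'game2', 'game3']
safe_expr = '+'.join(f'coalesce({c}, 0)' for c in mean_cols)
print(safe_expr)
# coalesce(game1, 0)+coalesce(game2, 0)+coalesce(game3, 0)

# In PySpark this would be used as:
#   df_new = df.withColumn('mean', F.expr(safe_expr) / len(mean_cols))
```

The example DataFrame in this tutorial has no nulls, so the simpler expression works as-is.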

You can find the complete documentation for the PySpark **withColumn** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Calculate the Sum of a Column in PySpark

How to Calculate Sum by Group in PySpark