You can use the following syntax to fill null values with the column mean in a PySpark DataFrame:

**from pyspark.sql.functions import mean

#define function to fill null values with column mean
def fillna_mean(df, include=set()):
    means = df.agg(*(mean(x).alias(x) for x in df.columns if x in include))
    return df.fillna(means.first().asDict())

#fill null values with mean in specific columns
df = fillna_mean(df, ['points', 'assists'])**

This particular example fills the null values in the **points** and **assists** columns of the DataFrame with their respective column means.
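Under the hood, `means.first().asDict()` builds an ordinary Python dict mapping each included column name to its mean, and `fillna()` substitutes those values wherever the column is null. Conceptually, the logic looks like this plain-Python sketch (the list-of-dicts data here is a hypothetical stand-in for the DataFrame, not actual PySpark objects):

```python
# Plain-Python sketch of fillna_mean: compute each column's mean over its
# non-null values, then substitute that mean wherever the value is None.
rows = [
    {'points': 11, 'assists': 4},
    {'points': None, 'assists': 2},
]
include = ['points', 'assists']

# per-column mean of the non-null values (what df.agg(mean(...)) computes)
means = {
    col: sum(r[col] for r in rows if r[col] is not None)
         / sum(1 for r in rows if r[col] is not None)
    for col in include
}

# replace each None with its column mean (what df.fillna(means_dict) does)
filled = [{c: (means[c] if v is None else v) for c, v in r.items()}
          for r in rows]
```

Here `means` comes out as `{'points': 11.0, 'assists': 3.0}`, and the `None` in the second row's `points` field is replaced by `11.0`.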

The following example shows how to use this syntax in practice.

**Example: How to Fill Null Values with Mean in PySpark**

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

**from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
['A', 'East', 8, 9],
['A', 'East', 10, 3],
['B', 'West', 6, 12],
['B', 'West', None, 2],
['C', 'East', 5, None]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|  null|      2|
|   C|      East|     5|   null|
+----+----------+------+-------+**

Notice that both the **points** and **assists** columns have one null value.

We can use the following syntax to fill in the null values in each column with the column mean:

**from pyspark.sql.functions import mean

#define function to fill null values with column mean
def fillna_mean(df, include=set()):
    means = df.agg(*(mean(x).alias(x) for x in df.columns if x in include))
    return df.fillna(means.first().asDict())

#fill null values with mean in specific columns
df = fillna_mean(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      5|
|   C|      East|     5|      6|
+----+----------+------+-------+**

Notice that the null values in both the **points** and **assists** columns have been replaced with their respective column means.

For example, the null value in the **points** column has been replaced with **8**, the mean of the non-null values in the **points** column.

Similarly, the null value in the **assists** column has been replaced with **6**, the mean of the non-null values in the **assists** column.
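As a quick sanity check, these means can be reproduced by hand from the non-null values in each column:

```python
# Non-null values from the example DataFrame (the null row is excluded,
# since mean() ignores nulls when aggregating)
points = [11, 8, 10, 6, 5]
assists = [4, 9, 3, 12, 2]

points_mean = sum(points) / len(points)     # 40 / 5 = 8.0
assists_mean = sum(assists) / len(assists)  # 30 / 5 = 6.0

print(points_mean, assists_mean)  # → 8.0 6.0
```

The means appear as whole numbers in the example output because **fillna()** casts each replacement value to the column's existing data type.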

**Note**: You can find the complete documentation for the PySpark **fillna()** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Calculate Mean of Multiple Columns in PySpark

How to Calculate the Mean by Group in PySpark

How to Calculate a Rolling Mean in PySpark