You can use the following syntax to fill null values with the column mean in a PySpark DataFrame:

**from pyspark.sql.functions import mean

#define function to fill null values with column mean
def fillna_mean(df, include=set()):
    means = df.agg(*(mean(x).alias(x) for x in df.columns if x in include))
    return df.fillna(means.first().asDict())

#fill null values with mean in specific columns
df = fillna_mean(df, ['points', 'assists'])**

This particular example fills the null values in the **points** and **assists** columns of the DataFrame with their respective column means.
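Under the hood, `means.first().asDict()` builds an ordinary Python dict mapping each included column name to its mean, and `fillna()` substitutes those values wherever the column is null. Conceptually, the logic looks like this plain-Python sketch (the list-of-dicts data here is a hypothetical stand-in for the DataFrame, not actual PySpark objects):

```python
# Plain-Python sketch of fillna_mean: compute each column's mean over its
# non-null values, then substitute that mean wherever the value is None.
rows = [
    {'points': 11, 'assists': 4},
    {'points': None, 'assists': 2},
]
include = ['points', 'assists']

# per-column mean of the non-null values (what df.agg(mean(...)) computes)
means = {
    col: sum(r[col] for r in rows if r[col] is not None)
         / sum(1 for r in rows if r[col] is not None)
    for col in include
}

# replace each None with its column mean (what df.fillna(means_dict) does)
filled = [{c: (means[c] if v is None else v) for c, v in r.items()}
          for r in rows]
```

Here `means` comes out as `{'points': 11.0, 'assists': 3.0}`, and the `None` in the second row's `points` field is replaced by `11.0`.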

The following example shows how to use this syntax in practice.

**Example: How to Fill Null Values with Mean in PySpark**

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

**from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
['A', 'East', 8, 9],
['A', 'East', 10, 3],
['B', 'West', 6, 12],
['B', 'West', None, 2],
['C', 'East', 5, None]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|  null|      2|
|   C|      East|     5|   null|
+----+----------+------+-------+**

Notice that both the **points** and **assists** columns have one null value.

We can use the following syntax to fill in the null values in each column with the column mean:

**from pyspark.sql.functions import mean

#define function to fill null values with column mean
def fillna_mean(df, include=set()):
    means = df.agg(*(mean(x).alias(x) for x in df.columns if x in include))
    return df.fillna(means.first().asDict())

#fill null values with mean in specific columns
df = fillna_mean(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      5|
|   C|      East|     5|      6|
+----+----------+------+-------+**

Notice that the null values in both the **points** and **assists** columns have been replaced with their respective column means.

For example, the null value in the **points** column has been replaced with **8**, the mean of the non-null values in the **points** column.

Similarly, the null value in the **assists** column has been replaced with **6**, the mean of the non-null values in the **assists** column.
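As a quick sanity check, these means can be reproduced by hand from the non-null values in each column:

```python
# Non-null values from the example DataFrame (the null row is excluded,
# since mean() ignores nulls when aggregating)
points = [11, 8, 10, 6, 5]
assists = [4, 9, 3, 12, 2]

points_mean = sum(points) / len(points)     # 40 / 5 = 8.0
assists_mean = sum(assists) / len(assists)  # 30 / 5 = 6.0

print(points_mean, assists_mean)  # → 8.0 6.0
```

The means appear as whole numbers in the example output because **fillna()** casts each replacement value to the column's existing data type.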

**Note**: You can find the complete documentation for the PySpark **fillna()** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Calculate Mean of Multiple Columns in PySpark

How to Calculate the Mean by Group in PySpark

How to Calculate a Rolling Mean in PySpark