You can use the following syntax to fill null values with the column median in a PySpark DataFrame (the **median** function used here requires PySpark 3.4 or later):

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(median(x).alias(x) for x in df.columns if x in include))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

This particular example fills the null values in the **points** and **assists** columns of the DataFrame with their respective column medians.
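Under the hood, **fillna_median** works in two steps: it first computes the median of each requested column, then passes those values to **fillna()** as a column-to-value mapping. The same logic can be sketched in plain Python (no Spark required) on a list of row dictionaries; **fill_with_median** and the sample rows below are hypothetical names used only for illustration:

```python
from statistics import median

def fill_with_median(rows, include):
    #step 1: compute the median of each requested column, ignoring nulls (None)
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in include
    }
    #step 2: replace each null with its column's median
    return [
        {col: (medians[col] if val is None and col in medians else val)
         for col, val in r.items()}
        for r in rows
    ]

rows = [{'points': 6, 'assists': 12}, {'points': None, 'assists': 2}]
print(fill_with_median(rows, ['points', 'assists']))
```

The PySpark version does the same thing, but computes the medians with a single aggregation job instead of a Python loop.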

The following example shows how to use this syntax in practice.

**Example: How to Fill Null Values with Median in PySpark**

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

**from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
['A', 'East', 8, 9],
['A', 'East', 10, 3],
['B', 'West', 6, 12],
['B', 'West', None, 2],
['C', 'East', 5, None]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
| A| East| 11| 4|
| A| East| 8| 9|
| A| East| 10| 3|
| B| West| 6| 12|
| B| West| null| 2|
| C| East| 5| null|
+----+----------+------+-------+**

Notice that both the **points** and **assists** columns have one null value.

We can use the following syntax to fill in the null values in each column with the column median:

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(median(x).alias(x) for x in df.columns if x in include))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      2|
|   C|      East|     5|      4|
+----+----------+------+-------+

Notice that the null values in both the **points** and **assists** columns have been replaced with their respective column medians.

For example, the null value in the **points** column has been replaced with **8**, which represents the median value in the **points** column.

Similarly, the null value in the **assists** column has been replaced with **4**, which represents the median value in the **assists** column.
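As a quick sanity check, these medians can be verified with Python's built-in **statistics** module on the non-null values from the example data (this only confirms the arithmetic; it does not use Spark):

```python
from statistics import median

#non-null values from the points and assists columns of the example DataFrame
points = [11, 8, 10, 6, 5]
assists = [4, 9, 3, 12, 2]

print(median(points))   # → 8
print(median(assists))  # → 4
```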

**Note**: You can find the complete documentation for the PySpark **fillna()** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Fill Null Values with Mean in PySpark

How to Calculate the Median of a Column in PySpark

How to Calculate the Median by Group in PySpark