You can use the following syntax to fill null values with the column median in a PySpark DataFrame (the **median** function used here requires PySpark 3.4 or later):

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(median(x).alias(x) for x in df.columns if x in include))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

This particular example fills the null values in the **points** and **assists** columns of the DataFrame with their respective column medians.
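Under the hood, **fillna_median** works in two steps: it first computes the median of each requested column, then passes those values to **fillna()** as a column-to-value mapping. The same logic can be sketched in plain Python (no Spark required) on a list of row dictionaries; **fill_with_median** and the sample rows below are hypothetical names used only for illustration:

```python
from statistics import median

def fill_with_median(rows, include):
    #step 1: compute the median of each requested column, ignoring nulls (None)
    medians = {
        col: median(r[col] for r in rows if r[col] is not None)
        for col in include
    }
    #step 2: replace each null with its column's median
    return [
        {col: (medians[col] if val is None and col in medians else val)
         for col, val in r.items()}
        for r in rows
    ]

rows = [{'points': 6, 'assists': 12}, {'points': None, 'assists': 2}]
print(fill_with_median(rows, ['points', 'assists']))
```

The PySpark version does the same thing, but computes the medians with a single aggregation job instead of a Python loop.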

The following example shows how to use this syntax in practice.

**Example: How to Fill Null Values with Median in PySpark**

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

**from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
['A', 'East', 8, 9],
['A', 'East', 10, 3],
['B', 'West', 6, 12],
['B', 'West', None, 2],
['C', 'East', 5, None]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
| A| East| 11| 4|
| A| East| 8| 9|
| A| East| 10| 3|
| B| West| 6| 12|
| B| West| null| 2|
| C| East| 5| null|
+----+----------+------+-------+**

Notice that both the **points** and **assists** columns have one null value.

We can use the following syntax to fill in the null values in each column with the column median:

from pyspark.sql.functions import median

#define function to fill null values with column median
def fillna_median(df, include=set()):
    medians = df.agg(*(median(x).alias(x) for x in df.columns if x in include))
    return df.fillna(medians.first().asDict())

#fill null values with median in specific columns
df = fillna_median(df, ['points', 'assists'])

#view updated DataFrame
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     8|      2|
|   C|      East|     5|      4|
+----+----------+------+-------+

Notice that the null values in both the **points** and **assists** columns have been replaced with their respective column medians.

For example, the null value in the **points** column has been replaced with **8**, which represents the median value in the **points** column.

Similarly, the null value in the **assists** column has been replaced with **4**, which represents the median value in the **assists** column.
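As a quick sanity check, these medians can be verified with Python's built-in **statistics** module on the non-null values from the example data (this only confirms the arithmetic; it does not use Spark):

```python
from statistics import median

#non-null values from the points and assists columns of the example DataFrame
points = [11, 8, 10, 6, 5]
assists = [4, 9, 3, 12, 2]

print(median(points))   # → 8
print(median(assists))  # → 4
```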

**Note**: You can find the complete documentation for the PySpark **fillna()** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Fill Null Values with Mean in PySpark

How to Calculate the Median of a Column in PySpark

How to Calculate the Median by Group in PySpark