How to Calculate the Median of a Column in PySpark


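You can use the following methods to calculate the median of a column in a PySpark DataFrame. Both rely on the median function in pyspark.sql.functions, which is available in PySpark 3.4.0 and later: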

Method 1: Calculate Median for One Specific Column

from pyspark.sql import functions as F

#calculate median of column named 'game1'
df.agg(F.median('game1')).collect()[0][0]

Method 2: Calculate Median for Multiple Columns

from pyspark.sql.functions import median 

#calculate median for game1, game2 and game3 columns
df.select(median(df.game1), median(df.game2), median(df.game3)).show()

The following examples show how to use each method in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10], 
        ['Nets', 22, 8, 14], 
        ['Hawks', 14, 22, 10], 
        ['Kings', 30, 22, 35], 
        ['Bulls', 15, 14, 12], 
        ['Blazers', 10, 14, 18]] 
  
#define column names
columns = ['team', 'game1', 'game2', 'game3'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+

Example 1: Calculate Median for One Specific Column

We can use the following syntax to calculate the median of values in only the game1 column of the DataFrame:

from pyspark.sql import functions as F

#calculate median of column named 'game1'
df.agg(F.median('game1')).collect()[0][0]

18.5

The median of values in the game1 column turns out to be 18.5.

We can verify this is correct by manually calculating the median of values in this column:

All values in the game1 column, sorted in ascending order: 10, 14, 15, 22, 25, 30

The two “middle” values are 15 and 22. The average of these two values is 18.5, which represents the median.
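The manual check above can also be sketched in plain Python, without Spark. This is only an illustration of the arithmetic: the variable names are made up for this example, and the values are copied from the game1 column.

```python
from statistics import median

# values copied from the game1 column of the example DataFrame
game1 = [25, 22, 14, 30, 15, 10]

# sort the values and pick the two middle ones (the count is even)
ordered = sorted(game1)
n = len(ordered)
mid_low, mid_high = ordered[n // 2 - 1], ordered[n // 2]

# the median is the average of the two middle values
manual_median = (mid_low + mid_high) / 2

print(ordered)            # [10, 14, 15, 22, 25, 30]
print(mid_low, mid_high)  # 15 22
print(manual_median)      # 18.5

# cross-check against Python's built-in median
print(median(game1))      # 18.5
```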

Example 2: Calculate Median for Multiple Columns

We can use the following syntax to calculate the median of values for the game1, game2 and game3 columns of the DataFrame:

from pyspark.sql.functions import median

#calculate median for game1, game2 and game3 columns
df.select(median(df.game1), median(df.game2), median(df.game3)).show()

+-------------+-------------+-------------+
|median(game1)|median(game2)|median(game3)|
+-------------+-------------+-------------+
|         18.5|         14.0|         13.0|
+-------------+-------------+-------------+

From the output we can see:

  • The median of values in the game1 column is 18.5.
  • The median of values in the game2 column is 14.
  • The median of values in the game3 column is 13.

Note: If there are null values in the column, the median function ignores them and calculates the median from the remaining non-null values.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate Mean of Multiple Columns in PySpark
How to Calculate the Mean by Group in PySpark
How to Sum Multiple Columns in PySpark
