You can use the following methods to calculate the median of a column in a PySpark DataFrame:

**Method 1: Calculate Median for One Specific Column**

```python
from pyspark.sql import functions as F

#calculate median of column named 'game1'
df.agg(F.median('game1')).collect()[0][0]
```

**Method 2: Calculate Median for Multiple Columns**

```python
from pyspark.sql.functions import median

#calculate median for game1, game2 and game3 columns
df.select(median(df.game1), median(df.game2), median(df.game3)).show()
```

The following examples show how to use each method in practice with the following PySpark DataFrame:

```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 25, 11, 10],
        ['Nets', 22, 8, 14],
        ['Hawks', 14, 22, 10],
        ['Kings', 30, 22, 35],
        ['Bulls', 15, 14, 12],
        ['Blazers', 10, 14, 18]]

#define column names
columns = ['team', 'game1', 'game2', 'game3']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+
```

**Example 1: Calculate Median for One Specific Column**

We can use the following syntax to calculate the median of values in the **game1** column of the DataFrame only:

```python
from pyspark.sql import functions as F

#calculate median of column named 'game1'
df.agg(F.median('game1')).collect()[0][0]

18.5
```

The median of values in the **game1** column turns out to be **18.5**.

We can verify this is correct by manually calculating the median of values in this column:

All values in the **game1** column, sorted in ascending order: 10, 14, **15**, **22**, 25, 30

The two “middle” values are **15** and **22**. The average of these two values is **18.5**, which represents the median.
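As a quick sanity check, the same calculation can be reproduced with Python's built-in `statistics` module (this runs outside Spark and is shown purely for verification):

```python
import statistics

#values from the game1 column of the DataFrame
game1 = [25, 22, 14, 30, 15, 10]

#statistics.median sorts the values and, for an even count,
#averages the two middle values
print(statistics.median(game1))  # → 18.5
```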

**Example 2: Calculate Median for Multiple Columns**

We can use the following syntax to calculate the median of values for the **game1**, **game2** and **game3** columns of the DataFrame:

```python
from pyspark.sql.functions import median

#calculate median for game1, game2 and game3 columns
df.select(median(df.game1), median(df.game2), median(df.game3)).show()

+-------------+-------------+-------------+
|median(game1)|median(game2)|median(game3)|
+-------------+-------------+-------------+
|         18.5|         14.0|         13.0|
+-------------+-------------+-------------+
```

From the output we can see:

- The median of values in the **game1** column is **18.5**.
- The median of values in the **game2** column is **14**.
- The median of values in the **game3** column is **13**.

**Note**: If there are null values in the column, the **median** function will ignore these values by default.
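To see what this null-handling behavior means in practice, the following plain-Python sketch mimics it (no Spark session needed; the `None` value is a hypothetical null added for illustration):

```python
import statistics

#game1 values with a simulated null, as Spark might see them
values = [25, 22, None, 14, 30, 15, 10]

#mimic the default behavior: drop nulls before computing the median
non_null = [v for v in values if v is not None]
print(statistics.median(non_null))  # → 18.5
```

The null is simply excluded, so the median is computed over the six remaining values.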

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate Mean of Multiple Columns in PySpark

How to Calculate the Mean by Group in PySpark

How to Sum Multiple Columns in PySpark