How to Calculate Percentiles in PySpark (With Examples)


You can use the following methods to calculate percentiles in a PySpark DataFrame:

Method 1: Calculate Percentiles for One Column

import pyspark.sql.functions as F   

#calculate 25th percentile for 'points' column
df.agg(F.expr('percentile(points, array(0.25))')[0].alias('%25')).show()

Method 2: Calculate Percentiles for One Column, Grouped by Another Column

import pyspark.sql.functions as F   

#calculate 25th, 50th and 75th percentile of 'points', grouped by 'team'
df_new = df.groupby('team').agg(F.expr('percentile(points, array(0.25))')[0].alias('%25'),
                                F.expr('percentile(points, array(0.50))')[0].alias('%50'),
                                F.expr('percentile(points, array(0.75))')[0].alias('%75'))

The examples below show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 8], 
        ['A', 'Guard', 15], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['A', 'Guard', 32], 
        ['B', 'Guard', 9],
        ['B', 'Guard', 15],
        ['B', 'Forward', 28],
        ['B', 'Guard', 31],
        ['B', 'Forward', 40]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|     8|
|   A|   Guard|    15|
|   A| Forward|    22|
|   A| Forward|    22|
|   A|   Guard|    32|
|   B|   Guard|     9|
|   B|   Guard|    15|
|   B| Forward|    28|
|   B|   Guard|    31|
|   B| Forward|    40|
+----+--------+------+

Example 1: Calculate Percentiles for One Column

We can use the following syntax to calculate the 25th percentile for the values in the points column:

import pyspark.sql.functions as F   

#calculate 25th percentile for 'points' column
df.agg(F.expr('percentile(points, array(0.25))')[0].alias('%25')).show()

+----+
| %25|
+----+
|15.0|
+----+

From the output we can see that the 25th percentile of values in the points column is 15.
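To see where the value 15 comes from, the calculation can be reproduced by hand. The sketch below is plain Python, not Spark code, and it assumes that the exact percentile function interpolates linearly between the two closest values in the sorted column. The helper name percentile_linear is hypothetical and exists only for this illustration.

```python
import math

def percentile_linear(values, p):
    """Percentile by linear interpolation between the two closest ranks.

    Hypothetical helper for illustration; assumes a non-empty list
    and a fraction p between 0 and 1.
    """
    ordered = sorted(values)
    # zero-based fractional position of the requested percentile
    position = (len(ordered) - 1) * p
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        # the position falls exactly on one value
        return float(ordered[lower])
    # otherwise blend the two neighbouring values
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# the ten values of the 'points' column from the DataFrame above
points = [8, 15, 22, 22, 32, 9, 15, 28, 31, 40]

result = percentile_linear(points, 0.25)
print(result)
```

With ten values, the 25th percentile sits at position 9 * 0.25 = 2.25 in the sorted list. The values at positions 2 and 3 are both 15, so the interpolated result is 15.0, which agrees with the Spark output.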

Example 2: Calculate Percentiles for One Column, Grouped by Another Column

We can use the following syntax to calculate the 25th, 50th and 75th percentile values in the points column grouped by the values in the team column:

import pyspark.sql.functions as F   

#calculate 25th, 50th and 75th percentile of 'points', grouped by 'team'
df_new = df.groupby('team').agg(F.expr('percentile(points, array(0.25))')[0].alias('%25'),
                                F.expr('percentile(points, array(0.50))')[0].alias('%50'),
                                F.expr('percentile(points, array(0.75))')[0].alias('%75'))

#view new DataFrame 
df_new.show()

+----+----+----+----+
|team| %25| %50| %75|
+----+----+----+----+
|   A|15.0|22.0|22.0|
|   B|15.0|28.0|31.0|
+----+----+----+----+

From the output we can see:

  • The 25th percentile of points values for team A is 15.
  • The 50th percentile of points values for team A is 22.
  • The 75th percentile of points values for team A is 22.

The values for team B are read the same way: its 25th, 50th and 75th percentiles are 15, 28 and 31.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark
How to Calculate Mean of Multiple Columns in PySpark
How to Calculate Sum by Group in PySpark
