You can use the following methods to calculate percentiles in a PySpark DataFrame:

**Method 1: Calculate Percentiles for One Column**

import pyspark.sql.functions as F

#calculate 25th percentile for 'points' column
df.agg(F.expr('percentile(points, array(0.25))')[0].alias('%25')).show()

**Method 2: Calculate Percentiles for One Column, Grouped by Another Column**

import pyspark.sql.functions as F

#calculate 25th, 50th and 75th percentile of 'points', grouped by 'team'
df_new = df.groupby('team').agg(F.expr('percentile(points, array(0.25))')[0].alias('%25'),
                                F.expr('percentile(points, array(0.50))')[0].alias('%50'),
                                F.expr('percentile(points, array(0.75))')[0].alias('%75'))
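
Note that the percentile() function used in both methods accepts an array of several percentages and returns an array of results, so the three separate aggregations in Method 2 can also be collapsed into a single call. The following is a minimal sketch of this variation, assuming the same df with team and points columns that is created below:

import pyspark.sql.functions as F

#calculate all three percentiles in a single percentile() call
pcts = df.groupby('team').agg(F.expr('percentile(points, array(0.25, 0.50, 0.75))').alias('pcts'))

#split the resulting array column into one column per percentile
df_new = pcts.select('team',
                     pcts.pcts[0].alias('%25'),
                     pcts.pcts[1].alias('%50'),
                     pcts.pcts[2].alias('%75'))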

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 8],
        ['A', 'Guard', 15],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['A', 'Guard', 32],
        ['B', 'Guard', 9],
        ['B', 'Guard', 15],
        ['B', 'Forward', 28],
        ['B', 'Guard', 31],
        ['B', 'Forward', 40]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|     8|
|   A|   Guard|    15|
|   A| Forward|    22|
|   A| Forward|    22|
|   A|   Guard|    32|
|   B|   Guard|     9|
|   B|   Guard|    15|
|   B| Forward|    28|
|   B|   Guard|    31|
|   B| Forward|    40|
+----+--------+------+

**Example 1: Calculate Percentiles for One Column**

We can use the following syntax to calculate the 25th percentile for the values in the **points** column:

import pyspark.sql.functions as F

#calculate 25th percentile for 'points' column
df.agg(F.expr('percentile(points, array(0.25))')[0].alias('%25')).show()

+----+
| %25|
+----+
|15.0|
+----+

From the output we can see that the 25th percentile of values in the **points** column is **15**.
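
If you only need the percentile as a plain Python number rather than as a DataFrame column, PySpark's built-in DataFrame.approxQuantile method is another option. The following is a minimal sketch of this alternative; keep in mind that approxQuantile returns actual values from the column rather than interpolated percentiles, so its result can differ slightly from percentile() for some datasets:

#approxQuantile returns a Python list, not a DataFrame
#a relativeError of 0 requests exact quantiles (slower on very large data)
q25 = df.approxQuantile('points', [0.25], 0)

print(q25)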

**Example 2: Calculate Percentiles for One Column, Grouped by Another Column**

We can use the following syntax to calculate the 25th, 50th and 75th percentiles of the values in the **points** column, grouped by the values in the **team** column:

import pyspark.sql.functions as F

#calculate 25th, 50th and 75th percentile of 'points', grouped by 'team'
df_new = df.groupby('team').agg(F.expr('percentile(points, array(0.25))')[0].alias('%25'),
                                F.expr('percentile(points, array(0.50))')[0].alias('%50'),
                                F.expr('percentile(points, array(0.75))')[0].alias('%75'))

#view new DataFrame
df_new.show()

+----+----+----+----+
|team| %25| %50| %75|
+----+----+----+----+
|   A|15.0|22.0|22.0|
|   B|15.0|28.0|31.0|
+----+----+----+----+

From the output we can see:

- The 25th percentile of points values for team A is **15**.
- The 50th percentile of points values for team A is **22**.
- The 75th percentile of points values for team A is **22**.

And so on.
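
As a quick sanity check, the grouped results can be reproduced outside of Spark with plain Python. The following sketch uses NumPy (which this tutorial does not otherwise rely on) on the hard-coded team A values from the example data:

import numpy as np

#points values for team A from the example DataFrame
team_a_points = [8, 15, 22, 22, 32]

#NumPy's default linear interpolation matches the values returned by percentile() above
print(np.percentile(team_a_points, [25, 50, 75]))

[15. 22. 22.]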

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Calculate Mean of Multiple Columns in PySpark

How to Calculate Sum by Group in PySpark