In statistics, quartiles are values that split up a dataset into four equal parts.
When analyzing a distribution, we’re typically interested in the following quartiles:
- First Quartile (Q1): The value located at the 25th percentile
- Second Quartile (Q2): The value located at the 50th percentile
- Third Quartile (Q3): The value located at the 75th percentile
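To make these definitions concrete before moving to Spark, here is a minimal sketch using NumPy (assuming NumPy is installed in your environment): for 101 evenly spaced values from 0 to 100, the three quartiles fall at 25, 50, and 75.

import numpy as np

#101 evenly spaced values from 0 to 100
values = list(range(101))

#the 25th, 50th, and 75th percentiles are Q1, Q2, and Q3
q1, q2, q3 = np.percentile(values, [25, 50, 75])

print(q1, q2, q3)

#25.0 50.0 75.0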
You can use the following syntax to calculate the quartiles for a column in a PySpark DataFrame:
#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)
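The first argument is the column name, the second is the list of probabilities (0.25, 0.5, and 0.75 correspond to Q1, Q2, and Q3), and the third is the relative error. A relative error of 0 asks Spark for exact quantiles, which can be expensive on very large DataFrames; passing a small positive value instead trades a little accuracy for speed. For example (a sketch, with 0.01 chosen arbitrarily):

#allow up to 1% relative error to speed up the calculation on large DataFrames
df.approxQuantile('points', [0.25, 0.5, 0.75], 0.01)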
The following example shows how to use this syntax in practice.
Example: How to Calculate Quartiles in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], ['Nets', 33], ['Lakers', 12],
        ['Kings', 15], ['Hawks', 19], ['Wizards', 24],
        ['Magic', 28], ['Jazz', 40], ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
We can use the following syntax to calculate the quartiles for the points column:
#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)

[15.0, 19.0, 28.0]
From the output we can see:
- The first quartile is located at 15.
- The second quartile is located at 19.
- The third quartile is located at 28.
By knowing only these three values, we can gain a good understanding of how the values in the points column are distributed.
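One common follow-up calculation is the interquartile range (IQR), the difference between the third and first quartiles. A minimal sketch, assuming the df from the example above:

#unpack the three quartiles returned by approxQuantile
q1, q2, q3 = df.approxQuantile('points', [0.25, 0.5, 0.75], 0)

#the interquartile range measures the spread of the middle 50% of values
iqr = q3 - q1

print(iqr)

#13.0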
Note: You can find the complete documentation for the PySpark approxQuantile function here.
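If you prefer to compute quartiles inside a DataFrame aggregation rather than through a DataFrame method, recent versions of PySpark (3.1 and later) also provide the percentile_approx function. A sketch, assuming the same df:

from pyspark.sql import functions as F

#calculate approximate quartiles of 'points' within an aggregation
#the result is an array column containing Q1, Q2, and Q3
df.agg(
    F.percentile_approx('points', [0.25, 0.5, 0.75]).alias('quartiles')
).show(truncate=False)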
The following tutorials explain how to perform other common tasks in PySpark: