In statistics, **quartiles** are values that split a dataset into four equal parts.

When analyzing a distribution, we’re typically interested in the following quartiles:

- First Quartile (**Q1**): The value located at the 25th percentile
- Second Quartile (**Q2**): The value located at the 50th percentile
- Third Quartile (**Q3**): The value located at the 75th percentile

You can use the following syntax to calculate the quartiles for a column in a PySpark DataFrame:

**#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)
**

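The third argument is the relative error. A value of 0 requests exact quantiles, which can be expensive on large datasets.

The following example shows how to use this syntax in practice.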

**Example: How to Calculate Quartiles in PySpark**

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

**from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 18],
['Nets', 33],
['Lakers', 12],
['Kings', 15],
['Hawks', 19],
['Wizards', 24],
['Magic', 28],
['Jazz', 40],
['Thunder', 24],
['Spurs', 13]]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
**

We can use the following syntax to calculate the quartiles for the **points** column:

**#calculate quartiles of 'points' column
df.approxQuantile('points', [0.25, 0.5, 0.75], 0)
[15.0, 19.0, 28.0]
**

From the output we can see:

- The first quartile is located at **15**.
- The second quartile is located at **19**.
- The third quartile is located at **28**.
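
To see where these three values come from, the sketch below reproduces them in plain Python using the nearest-rank rule: sort the values and take the one at rank ceil(p × n). This is only an illustration that agrees with the output above for this dataset. It is not a description of how Spark computes quantiles internally, and the helper name `nearest_rank_quantile` is hypothetical.

```python
import math

# values from the 'points' column of the example DataFrame
points = [18, 33, 12, 15, 19, 24, 28, 40, 24, 13]

def nearest_rank_quantile(values, p):
    """Hypothetical helper: value at 1-based rank ceil(p * n) in the sorted data."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered)))  # guard against rank 0 when p = 0
    return float(ordered[rank - 1])

# compute the three quartiles
quartiles = [nearest_rank_quantile(points, p) for p in (0.25, 0.5, 0.75)]
print(quartiles)
# [15.0, 19.0, 28.0]
```

Each result is an actual value from the column rather than an average of two neighboring values, so other tools that interpolate between neighbors may report slightly different quartiles for the same data.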

These three values alone give a good summary of how the values in the **points** column are distributed: the median is 19, and the middle half of the values falls roughly between 15 and 28.
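
As one example of what the quartiles tell us, the sketch below uses them to compute the interquartile range (IQR) and to check for unusually high or low values. The 1.5 × IQR cutoff is a common convention assumed here for illustration; it is not part of `approxQuantile`, and the variable names are hypothetical.

```python
# quartiles returned by approxQuantile in the example above
q1, q2, q3 = 15.0, 19.0, 28.0

# values from the 'points' column of the example DataFrame
points = [18, 33, 12, 15, 19, 24, 28, 40, 24, 13]

# interquartile range: the spread of the middle half of the values
iqr = q3 - q1

# assumed convention: flag values more than 1.5 * IQR outside Q1 and Q3
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in points if x < lower_fence or x > upper_fence]

print(iqr, lower_fence, upper_fence, outliers)
# 13.0 -4.5 47.5 []
```

Under this convention, none of the ten values would be flagged as an outlier.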

**Note**: You can find the complete documentation for the PySpark **approxQuantile** function in the official PySpark API reference.

