The easiest way to obtain a list of unique values in a PySpark DataFrame column is to use the **distinct()** function.

This tutorial provides several examples of how to use this function with the following PySpark DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11],
        ['A', 'East', 8],
        ['A', 'East', 10],
        ['B', 'West', 6],
        ['B', 'West', 6],
        ['C', 'East', 5]]

#define column names
columns = ['team', 'conference', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+
```

**Example 1: Find Unique Values in a Column**

We can use the following syntax to find the unique values in the **team** column of the DataFrame:

```python
df.select('team').distinct().show()

+----+
|team|
+----+
|   A|
|   B|
|   C|
+----+
```

We can see that the unique values in the **team** column are **A**, **B** and **C**.

**Example 2: Find and Sort Unique Values in a Column**

Suppose we used the following syntax to find the unique values in the **points** column:

```python
df.select('points').distinct().show()

+------+
|points|
+------+
|    11|
|     8|
|    10|
|     6|
|     5|
+------+
```

The output displays the unique values but they aren’t sorted in any way.

If we’d like, we can use the following syntax to find the unique values in the **points** column and return them sorted in **ascending order**:

```python
#find unique values in points column
df_points = df.select('points').distinct()

#display unique values in ascending order
df_points.orderBy('points').show()

+------+
|points|
+------+
|     5|
|     6|
|     8|
|    10|
|    11|
+------+
```

We can also use the argument **ascending=False** to return the unique values in **descending order** instead:

```python
#find unique values in points column
df_points = df.select('points').distinct()

#display unique values in descending order
df_points.orderBy('points', ascending=False).show()

+------+
|points|
+------+
|    11|
|    10|
|     8|
|     6|
|     5|
+------+
```

**Example 3: Find and Count Unique Values in a Column**

The following code shows how to find and count the occurrence of unique values in the **team** column of the DataFrame:

```python
df.groupBy('team').count().show()

+----+-----+
|team|count|
+----+-----+
|   A|    3|
|   B|    2|
|   C|    1|
+----+-----+
```

From the output we can see the three unique values (**A**, **B**, **C**) along with the number of times each unique value occurs.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Columns by Index in DataFrame

PySpark: How to Select Rows Based on Column Values
