You can use the following methods to select distinct rows in a PySpark DataFrame:

**Method 1: Select Distinct Rows in DataFrame**

    # display distinct rows only
    df.distinct().show()

**Method 2: Select Distinct Values from Specific Column**

    # display distinct values from 'team' column only
    df.select('team').distinct().show()

**Method 3: Count Distinct Rows in DataFrame**

    # count number of distinct rows
    df.distinct().count()

The examples below show how to use each of these methods in practice with this PySpark DataFrame:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    # define data
    data = [['A', 'Guard', 11],
            ['A', 'Guard', 8],
            ['A', 'Forward', 22],
            ['A', 'Forward', 22],
            ['B', 'Guard', 14],
            ['B', 'Guard', 14],
            ['B', 'Forward', 13],
            ['B', 'Forward', 7]]

    # define column names
    columns = ['team', 'position', 'points']

    # create DataFrame using data and column names
    df = spark.createDataFrame(data, columns)

    # view DataFrame
    df.show()

    +----+--------+------+
    |team|position|points|
    +----+--------+------+
    |   A|   Guard|    11|
    |   A|   Guard|     8|
    |   A| Forward|    22|
    |   A| Forward|    22|
    |   B|   Guard|    14|
    |   B|   Guard|    14|
    |   B| Forward|    13|
    |   B| Forward|     7|
    +----+--------+------+

**Example 1: Select Distinct Rows in DataFrame**

We can use the following syntax to select the distinct rows in the DataFrame:

    # display distinct rows only
    df.distinct().show()

    +----+--------+------+
    |team|position|points|
    +----+--------+------+
    |   A|   Guard|    11|
    |   A|   Guard|     8|
    |   A| Forward|    22|
    |   B|   Guard|    14|
    |   B| Forward|    13|
    |   B| Forward|     7|
    +----+--------+------+

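Notice that each row in the resulting DataFrame is distinct: the duplicated rows (A, Forward, 22) and (B, Guard, 14) now appear only once, so 6 of the original 8 rows remain. Spark does not guarantee the order of the rows returned by distinct(), so your output may be ordered differently.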

**Example 2: Select Distinct Values from Specific Column in DataFrame**

We can use the following syntax to select the distinct values from the **team** column in the DataFrame:

    # display distinct values from 'team' column only
    df.select('team').distinct().show()

    +----+
    |team|
    +----+
    |   A|
    |   B|
    +----+

The output shows the two distinct values from the **team** column: **A** and **B**.

**Example 3: Count Distinct Rows in DataFrame**

We can use the following syntax to count the number of distinct rows in the DataFrame:

    # count number of distinct rows
    df.distinct().count()

    6

The output tells us that there are **6** distinct rows in the entire DataFrame.
