You can use the following methods to select distinct rows in a PySpark DataFrame:
Method 1: Select Distinct Rows in DataFrame
#display distinct rows only
df.distinct().show()
Method 2: Select Distinct Values from Specific Column
#display distinct values from 'team' column only
df.select('team').distinct().show()
Method 3: Count Distinct Rows in DataFrame
#count number of distinct rows
df.distinct().count()
The examples below show how to use each of these methods in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
Example 1: Select Distinct Rows in DataFrame
We can use the following syntax to select the distinct rows in the DataFrame:
#display distinct rows only
df.distinct().show()
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
Notice that each row in the resulting DataFrame is distinct: the duplicated rows (A, Forward, 22) and (B, Guard, 14) now each appear only once.
Example 2: Select Distinct Values from Specific Column in DataFrame
We can use the following syntax to select the distinct values from the team column in the DataFrame:
#display distinct values from 'team' column only
df.select('team').distinct().show()
+----+
|team|
+----+
|   A|
|   B|
+----+
The output shows the two distinct values from the team column: A and B.
Example 3: Count Distinct Rows in DataFrame
We can use the following syntax to count the number of distinct rows in the DataFrame:
#count number of distinct rows
df.distinct().count()
6
The output tells us that the DataFrame contains 6 distinct rows (out of 8 rows in total).
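The count can be cross-checked without Spark. The following is a plain-Python sketch, not a PySpark API, that deduplicates the same eight rows with a set. It assumes the data is small enough to fit in memory, and the name unique_rows is illustrative.

```python
# same rows that were used to build the DataFrame
data = [['A', 'Guard', 11], ['A', 'Guard', 8],
        ['A', 'Forward', 22], ['A', 'Forward', 22],
        ['B', 'Guard', 14], ['B', 'Guard', 14],
        ['B', 'Forward', 13], ['B', 'Forward', 7]]

# lists are not hashable, so convert each row to a tuple before adding to a set
unique_rows = {tuple(row) for row in data}

print(len(data))         # 8 rows in total
print(len(unique_rows))  # 6 distinct rows
```

The two rows that disappear are the repeated (A, Forward, 22) and (B, Guard, 14) entries, which matches the result of df.distinct().count().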