PySpark: How to Drop Duplicate Rows from DataFrame


There are three common ways to drop duplicate rows from a PySpark DataFrame:

Method 1: Drop Rows with Duplicate Values Across All Columns

#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()

Method 2: Drop Rows with Duplicate Values Across Specific Columns

#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])

Method 3: Drop Rows with Duplicate Values in One Specific Column

#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])

The following examples show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

Example 1: Drop Rows with Duplicate Values Across All Columns

We can use the following syntax to drop rows that have duplicate values across all columns in the DataFrame:

#drop rows that have duplicate values across all columns
df_new = df.dropDuplicates()

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

A total of two rows were dropped from the DataFrame.

Example 2: Drop Rows with Duplicate Values Across Specific Columns

We can use the following syntax to drop rows that have duplicate values across the team and position columns in the DataFrame:

#drop rows that have duplicate values across 'team' and 'position' columns
df_new = df.dropDuplicates(['team', 'position'])

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B| Forward|    13|
+----+--------+------+

Notice that the resulting DataFrame has no rows with duplicate values across both the team and position columns.

Example 3: Drop Rows with Duplicate Values in One Specific Column

We can use the following syntax to drop rows that have duplicate values in the team column of the DataFrame:

#drop rows that have duplicate values in 'team' column
df_new = df.dropDuplicates(['team'])

#view DataFrame without duplicates
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   B|   Guard|    14|
+----+--------+------+

Notice that the resulting DataFrame has no rows with duplicate values in the team column.

Note: When duplicate rows are identified, only one row from each group of duplicates is kept in the DataFrame while all other duplicate rows are dropped. Because Spark processes data in parallel across partitions, which row is kept is not guaranteed unless you explicitly order the data first.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Select Columns by Index in DataFrame
