How to Count Number of Duplicate Rows in PySpark


You can use the following syntax to count the number of duplicate rows in a PySpark DataFrame:

import pyspark.sql.functions as F

df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .show()

The following example shows how to use this syntax in practice.

Example: Count Number of Duplicate Rows in PySpark DataFrame

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 22], 
        ['A', 'Forward', 22], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
  
#define column names
columns = ['team', 'position', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+

We can use the following syntax to count the number of duplicate rows in the DataFrame:

import pyspark.sql.functions as F

#count number of duplicate rows in DataFrame
df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .show()

+----------+
|sum(count)|
+----------+
|         4|
+----------+

We can see that there are 4 total duplicate rows in the DataFrame.

To view these duplicate rows, simply remove the last select function from the previous code:

import pyspark.sql.functions as F

#view duplicate rows in DataFrame
df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .show()

+----+--------+------+-----+
|team|position|points|count|
+----+--------+------+-----+
|   A| Forward|    22|    2|
|   B|   Guard|    14|    2|
+----+--------+------+-----+

We can see that team A, position Forward and points 22 occurs 2 times.

We can see that team B, position Guard and points 14 also occurs 2 times.

These represent the 4 duplicate rows.

Also note that you could use the collect function instead of the show function to return only a single value that represents the number of duplicate rows:

import pyspark.sql.functions as F

#count number of duplicate rows in DataFrame
df.groupBy(df.columns)\
    .count()\
    .where(F.col('count') > 1)\
    .select(F.sum('count'))\
    .collect()[0][0]

4

Notice that this particular syntax returns only the number 4.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Drop Duplicate Rows from DataFrame
PySpark: How to Drop Multiple Columns from DataFrame
PySpark: How to Add New Rows to DataFrame

Leave a Reply

Your email address will not be published. Required fields are marked *