You can use the following syntax to count the number of duplicate rows in a PySpark DataFrame:
import pyspark.sql.functions as F
df.groupBy(df.columns)\
  .count()\
  .where(F.col('count') > 1)\
  .select(F.sum('count'))\
  .show()
The following example shows how to use this syntax in practice.
Example: Count Number of Duplicate Rows in PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 22],
        ['A', 'Forward', 22],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
| A| Guard| 11|
| A| Guard| 8|
| A| Forward| 22|
| A| Forward| 22|
| B| Guard| 14|
| B| Guard| 14|
| B| Forward| 13|
| B| Forward| 7|
+----+--------+------+
We can use the following syntax to count the number of duplicate rows in the DataFrame:
import pyspark.sql.functions as F
#count number of duplicate rows in DataFrame
df.groupBy(df.columns)\
  .count()\
  .where(F.col('count') > 1)\
  .select(F.sum('count'))\
  .show()
+----------+
|sum(count)|
+----------+
| 4|
+----------+
We can see that there are 4 total duplicate rows in the DataFrame.
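As a quick cross-check, here is a minimal sketch (not part of the tutorial syntax above) that arrives at the same count of 4 using a window function: it partitions by every column, counts how many times each full row occurs, and keeps the rows whose count exceeds 1:

import pyspark.sql.functions as F
from pyspark.sql import Window

#partition by every column so identical rows fall into the same window
w = Window.partitionBy(*df.columns)

#count rows whose full set of values appears more than once
n_dupes = df.withColumn('n', F.count('*').over(w)).where('n > 1').count()

print(n_dupes)  #4 for the example DataFrame above

Note that df.count() - df.dropDuplicates().count() would instead return 2 here, since that expression counts only the surplus copies beyond the first occurrence of each row.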
To view these duplicate rows, simply remove the select function from the original counting code:
import pyspark.sql.functions as F
#view duplicate rows in DataFrame
df.groupBy(df.columns)\
  .count()\
  .where(F.col('count') > 1)\
  .show()
+----+--------+------+-----+
|team|position|points|count|
+----+--------+------+-----+
| A| Forward| 22| 2|
| B| Guard| 14| 2|
+----+--------+------+-----+
We can see that the row with team A, position Forward and points 22 occurs 2 times, and the row with team B, position Guard and points 14 also occurs 2 times. Together, these account for the 4 duplicate rows.
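Note that if you only want to treat rows as duplicates based on a subset of columns, you can pass those column names to the groupBy function instead of df.columns. For example, here is a minimal sketch (using the column names from the example DataFrame above) that checks for duplicate team and position combinations while ignoring the points column:

import pyspark.sql.functions as F

#count occurrences of each team and position combination, ignoring points
df.groupBy(['team', 'position'])\
  .count()\
  .where(F.col('count') > 1)\
  .show()

In this particular DataFrame, every team and position combination occurs exactly 2 times, so all four combinations would be returned.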
Also note that you could use the collect function instead of the show function to return only a single value that represents the number of duplicate rows:
import pyspark.sql.functions as F
#count number of duplicate rows in DataFrame
df.groupBy(df.columns)\
  .count()\
  .where(F.col('count') > 1)\
  .select(F.sum('count'))\
  .collect()[0][0]
4
Notice that this particular syntax returns only the number 4.
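One caveat worth noting: if the DataFrame contains no duplicate rows at all, the where clause filters out every row, so the sum is taken over zero rows and collect()[0][0] returns None rather than 0. Here is a minimal sketch (the variable name n_dupes is just for illustration) that stores the result and guards against that case:

import pyspark.sql.functions as F

#store the duplicate count in a variable, treating None as 0
n_dupes = df.groupBy(df.columns)\
            .count()\
            .where(F.col('count') > 1)\
            .select(F.sum('count'))\
            .collect()[0][0]

n_dupes = n_dupes if n_dupes is not None else 0

print(n_dupes)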
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Drop Duplicate Rows from DataFrame
PySpark: How to Drop Multiple Columns from DataFrame
PySpark: How to Add New Rows to DataFrame