PySpark: Get Rows Which Are Not in Another DataFrame


You can use the following syntax to get the rows in one PySpark DataFrame that are not in another DataFrame:

df1.exceptAll(df2).show()

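This particular example returns every row from the DataFrame named df1 that does not appear in the DataFrame named df2. Rows are compared across all columns, so a row counts as a match only if every value in it is the same.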
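One detail worth keeping in mind is that exceptAll preserves duplicates: if a row appears twice in df1 and once in df2, one copy remains in the result. A rough way to picture this, without Spark, is as a multiset difference. The sketch below is only a plain-Python model of that behavior; the helper name except_all and the sample rows are made up for illustration and are not part of PySpark.

```python
from collections import Counter

def except_all(rows1, rows2):
    """Plain-Python model of exceptAll: multiset difference of two lists of rows."""
    # Counter subtraction keeps only rows whose count stays positive
    remaining = Counter(rows1) - Counter(rows2)
    # sorted only so the printed order is stable; Spark does not guarantee row order
    return sorted(remaining.elements())

# hypothetical rows: ('A', 18) appears twice on the left and once on the right
rows1 = [("A", 18), ("A", 18), ("B", 22)]
rows2 = [("A", 18)]

result = except_all(rows1, rows2)
print(result)
```

Here one copy of ('A', 18) is cancelled by the matching row in rows2, while the second copy and ('B', 22) remain.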

The following example shows how to use this syntax in practice.

Example: Get Rows from One DataFrame that Are Not in Another DataFrame

Suppose we have the following DataFrame named df1:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data1 = [['A', 18], 
         ['B', 22], 
         ['C', 19], 
         ['D', 14],
         ['E', 30]]

#define column names
columns1 = ['team', 'points'] 
  
#create dataframe using data and column names
df1 = spark.createDataFrame(data1, columns1) 
  
#view dataframe
df1.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   D|    14|
|   E|    30|
+----+------+

And suppose we have another DataFrame named df2:

#define data
data2 = [['A', 18], 
         ['B', 22], 
         ['C', 19], 
         ['F', 22],
         ['G', 29]]

#define column names
columns2 = ['team', 'points'] 
  
#create dataframe using data and column names
df2 = spark.createDataFrame(data2, columns2) 
  
#view dataframe
df2.show()

+----+------+
|team|points|
+----+------+
|   A|    18|
|   B|    22|
|   C|    19|
|   F|    22|
|   G|    29|
+----+------+

We can use the following syntax to return all rows in df1 that do not exist in df2:

#display all rows in df1 that do not exist in df2
df1.exceptAll(df2).show() 

+----+------+
|team|points|
+----+------+
|   D|    14|
|   E|    30|
+----+------+

We can see that there are exactly two rows from the first DataFrame (teams D and E) that do not exist in the second DataFrame.

Note: You can find the complete documentation for the PySpark exceptAll function in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Do a Right Join in PySpark
How to Do a Left Join in PySpark
How to Do a Left Join on Multiple Columns in PySpark
