You can use the following syntax in PySpark to filter DataFrame rows where a value in a particular column is not in a particular list:
#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()
This particular example will filter the DataFrame to only contain rows where the value in the team column is not equal to A, D, or E.
The following example shows how to use this syntax in practice.
Example: How to Use “IS NOT IN” in PySpark
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2],
        ['D', 'East', 14, 2],
        ['E', 'West', 25, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
|   D|      East|    14|      2|
|   E|      West|    25|      2|
+----+----------+------+-------+
We can use the following syntax to filter the DataFrame to only show rows where the value in the team column is not equal to A, D, or E:
#define array of values
my_array = ['A', 'D', 'E']

#filter DataFrame to only contain rows where 'team' is not in my_array
df.filter(~df.team.isin(my_array)).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
The resulting DataFrame only contains rows where the value in the team column is not equal to A, D, or E.
Note: The tilde (~) operator is used in PySpark to negate a Column expression, i.e. to represent NOT.
By using this operator along with the isin function, we are able to filter the DataFrame to only contain rows where the value in a particular column is not in a list of values.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Use “Is Not Null”
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values