PySpark: How to Filter Rows Based on Values in a List


You can use the following syntax to filter a PySpark DataFrame for rows that contain a value from a specific list:

#specify values to filter for
my_list = ['Mavs', 'Kings', 'Spurs']

#filter for rows where team is in list
df.filter(df.team.isin(my_list)).show()

This particular example filters the DataFrame to only the rows where the value in the team column matches one of the values in the list that we specified.

The following example shows how to use this syntax in practice.

Example: How to Filter Rows Based on Values in List in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18], 
        ['Nets', 33], 
        ['Lakers', 12], 
        ['Mavs', 15], 
        ['Kings', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Nets', 40],
        ['Mavs', 24],
        ['Spurs', 13]]
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|   Mavs|    15|
|  Kings|    19|
|Wizards|    24|
|  Magic|    28|
|   Nets|    40|
|   Mavs|    24|
|  Spurs|    13|
+-------+------+

We can use the following syntax to filter the DataFrame for rows where the team column is equal to a team name in a specific list:

#specify values to filter for
my_list = ['Mavs', 'Kings', 'Spurs']

#filter for rows where team is in list
df.filter(df.team.isin(my_list)).show()

+-----+------+
| team|points|
+-----+------+
| Mavs|    18|
| Mavs|    15|
|Kings|    19|
| Mavs|    24|
|Spurs|    13|
+-----+------+

Notice that each row in the filtered DataFrame has a team value equal to either Mavs, Kings or Spurs, which are the three team names that we specified in our list.

Note #1: The isin function is case-sensitive.

Note #2: You can find the complete documentation for the PySpark isin function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use “OR” Operator
PySpark: How to Use “AND” Operator
PySpark: How to Use “NOT IN” Operator
