PySpark: Filter for Rows that Contain One of Multiple Values


You can use the following syntax to filter for rows in a PySpark DataFrame that contain one of multiple values:

#define array of substrings to search for
my_values = ['ets', 'urs']
regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()
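
Note that rlike interprets the joined string as a regular expression, so this approach works as-is only when your substrings contain no regex metacharacters. As a minimal sketch, if your search values might include special characters such as "." or "+", you could escape each value first with Python's re.escape so it is matched literally (the values below are hypothetical):

import re

#hypothetical substrings that contain regex metacharacters
my_values = ['A.J.', 'C+']

#escape each value so it is matched literally rather than as a regex
regex_values = "|".join(re.escape(v) for v in my_values)

df.filter(df.team.rlike(regex_values)).show()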

The following example shows how to use this syntax in practice.

Example: Filter for Rows that Contain One of Multiple Values in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession

#create SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 14], 
        ['Nets', 22], 
        ['Nets', 31], 
        ['Cavs', 27], 
        ['Kings', 26], 
        ['Spurs', 40],
        ['Lakers', 23],
        ['Spurs', 17]] 
  
#define column names
columns = ['team', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+------+------+
|  team|points|
+------+------+
|  Mavs|    14|
|  Nets|    22|
|  Nets|    31|
|  Cavs|    27|
| Kings|    26|
| Spurs|    40|
|Lakers|    23|
| Spurs|    17|
+------+------+

We can use the following syntax to filter the DataFrame to only contain rows where the team column contains “ets” or “urs” somewhere in the string:

#define array of substrings to search for
my_values = ['ets', 'urs']
regex_values = "|".join(my_values)

#filter DataFrame where team column contains any substring from array
df.filter(df.team.rlike(regex_values)).show()

+-----+------+
| team|points|
+-----+------+
| Nets|    22|
| Nets|    31|
|Spurs|    40|
+-----+------+

Notice that each of the rows in the resulting DataFrame contains either “ets” or “urs” in the team column.
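
If you instead want to keep only the rows that do not contain any of the substrings, one way is to negate the condition with the ~ operator; a minimal sketch:

#filter DataFrame where team column does NOT contain any substring from array
df.filter(~df.team.rlike(regex_values)).show()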

Note: We used the rlike function to search for partial string matches in the team column. You can find the complete documentation for the PySpark rlike function here.
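
Also note that rlike is case-sensitive by default. If you want to match values such as "ETS" or "Urs" as well, one option is to prepend the inline case-insensitivity flag (?i) to the pattern; a minimal sketch:

#prepend (?i) so the pattern matches regardless of case
df.filter(df.team.rlike("(?i)" + regex_values)).show()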

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use “OR” Operator
PySpark: How to Use “AND” Operator
PySpark: How to Use “NOT IN” Operator
