By default, the contains function in PySpark is case-sensitive.
However, you can use the following syntax to use a case-insensitive “contains” to filter a DataFrame where rows contain a specific string, regardless of case:
from pyspark.sql.functions import upper #perform case-insensitive filter for rows that contain 'AVS' in team column df.filter(upper(df.team).contains('AVS')).show()
The following example shows how to use this syntax in practice.
Example: How to Use Case-Insensitive “Contains” in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 14],
['Nets', 22],
['Nets', 31],
['Cavs', 27],
['CAVS', 26],
['Spurs', 40],
['mavs', 23],
['MAVS', 17],]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-----+------+
| team|points|
+-----+------+
| Mavs| 14|
| Nets| 22|
| Nets| 31|
| Cavs| 27|
| CAVS| 26|
|Spurs| 40|
| mavs| 23|
| MAVS| 17|
+-----+------+
Suppose we use the following syntax to filter the DataFrame to only contain rows where the team column contains “AVS” somewhere in the string:
#filter DataFrame where team column contains 'AVS' df.filter(df.team.contains('AVS')).show() +----+------+ |team|points| +----+------+ |CAVS| 26| |MAVS| 17| +----+------+
Notice that this syntax performs a case-sensitive search by default and only returns the rows where the team column contains “AVS” in all uppercase.
However, suppose we would like to perform a case-insensitive search and return all rows where the team column contains “AVS”, regardless of case.
We can use the following syntax to do so:
from pyspark.sql.functions import upper #perform case-insensitive filter for rows that contain 'AVS' in team column df.filter(upper(df.team).contains('AVS')).show() +----+------+ |team|points| +----+------+ |Mavs| 14| |Cavs| 27| |CAVS| 26| |mavs| 23| |MAVS| 17| +----+------+
Notice that this syntax performs a case-insensitive search and returns all rows where the team column contains “AVS”, regardless of case.
Note: We used the upper function to first convert all strings in the team column to uppercase and then searched for “AVS”, which is the equivalent of using a case-sensitive “contains” filter.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Use “OR” Operator
PySpark: How to Use “AND” Operator
PySpark: How to Use “NOT IN” Operator