You can use the following syntax to filter a PySpark DataFrame using a LIKE operator:
df.filter(df.team.like('%avs%')).show()
This particular example filters the DataFrame to only show rows where the string in the team column contains the pattern “avs” somewhere in the string.
The following example shows how to use this syntax in practice.
Example: How to Filter Using LIKE Operator in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 18],
['Nets', 33],
['Lakers', 12],
['Mavs', 15],
['Cavs', 19],
['Wizards', 24],
['Cavs', 28],
['Nets', 40],
['Mavs', 24],
['Spurs', 13]]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------+------+
| team|points|
+-------+------+
| Mavs| 18|
| Nets| 33|
| Lakers| 12|
| Mavs| 15|
| Cavs| 19|
|Wizards| 24|
| Cavs| 28|
| Nets| 40|
| Mavs| 24|
| Spurs| 13|
+-------+------+
We can use the following syntax to filter the DataFrame to only contain rows where the team column contains the pattern “avs” somewhere in the string:
#filter DataFrame where team column contains pattern like 'avs'
df.filter(df.team.like('%avs%')).show()
+----+------+
|team|points|
+----+------+
|Mavs|    18|
|Mavs|    15|
|Cavs|    19|
|Cavs|    28|
|Mavs|    24|
+----+------+
Notice that each row in the resulting DataFrame contains “avs” in the team column.
Note: You can find the complete documentation for the PySpark like function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Use “OR” Operator
PySpark: How to Use “AND” Operator
PySpark: How to Use “NOT IN” Operator