You can use the following methods to filter the rows of a PySpark DataFrame based on values in a Boolean column:
Method 1: Filter Based on Values in One Boolean Column
#filter for rows where value in 'all_star' column is True
df.filter(df.all_star==True).show()
Method 2: Filter Based on Values in Multiple Boolean Columns
#filter for rows where value in 'all_star' and 'starter' columns are both True
df.filter((df.all_star==True) & (df.starter==True)).show()
The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 18, True, False],
        ['B', 20, False, True],
        ['C', 25, True, True],
        ['D', 40, True, True],
        ['E', 34, True, False],
        ['F', 32, False, False],
        ['G', 19, False, False]]
#define column names
columns = ['team', 'points', 'all_star', 'starter']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+------+--------+-------+
|team|points|all_star|starter|
+----+------+--------+-------+
|   A|    18|    true|  false|
|   B|    20|   false|   true|
|   C|    25|    true|   true|
|   D|    40|    true|   true|
|   E|    34|    true|  false|
|   F|    32|   false|  false|
|   G|    19|   false|  false|
+----+------+--------+-------+
Example 1: Filter Based on Values in One Boolean Column
We can use the following syntax to filter the DataFrame to only contain rows where the value in the all_star column is true:
#filter for rows where value in 'all_star' column is True
df.filter(df.all_star==True).show()
+----+------+--------+-------+
|team|points|all_star|starter|
+----+------+--------+-------+
|   A|    18|    true|  false|
|   C|    25|    true|   true|
|   D|    40|    true|   true|
|   E|    34|    true|  false|
+----+------+--------+-------+
Notice that each of the rows in the filtered DataFrame has a value of true in the all_star column.
Example 2: Filter Based on Values in Multiple Boolean Columns
We can use the following syntax to filter the DataFrame to only contain rows where the value in the all_star column is true and the value in the starter column is true:
#filter for rows where value in 'all_star' and 'starter' columns are both True
df.filter((df.all_star==True) & (df.starter==True)).show()
+----+------+--------+-------+
|team|points|all_star|starter|
+----+------+--------+-------+
|   C|    25|    true|   true|
|   D|    40|    true|   true|
+----+------+--------+-------+
Notice that each of the rows in the filtered DataFrame has a value of true in both the all_star and starter columns.