You can use the following syntax to drop rows from a PySpark DataFrame based on multiple conditions:
import pyspark.sql.functions as F

#drop rows where team is 'A' and points > 10
df_new = df.filter(~((F.col('team') == 'A') & (F.col('points') > 10)))
This particular example will drop all rows from the DataFrame where the value in the team column is ‘A’ and the value in the points column is greater than 10.
The following example shows how to use this syntax in practice.
Example: Drop Rows Based on Multiple Conditions in PySpark
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', 11],
['A', 'Guard', 8],
['A', 'Forward', 22],
['A', 'Forward', 22],
['B', 'Guard', 14],
['B', 'Guard', 14],
['B', 'Forward', 13],
['B', 'Forward', 7]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    22|
|   A| Forward|    22|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
We can use the following syntax to drop all rows from the DataFrame where the value in the team column is ‘A’ and the value in the points column is greater than 10:
import pyspark.sql.functions as F

#drop rows where team is 'A' and points > 10
df_new = df.filter(~((F.col('team') == 'A') & (F.col('points') > 10)))

#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|     8|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
Notice that all three rows in the DataFrame where the value in the team column is ‘A’ and the value in the points column is greater than 10 have been dropped.
Note that a row must meet both of these conditions to be dropped from the DataFrame.
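If you instead want to drop rows that meet either condition, you can replace the & operator with the | (OR) operator. The following is a minimal sketch, assuming the same df defined above (the name df_either is just illustrative):

import pyspark.sql.functions as F

#drop rows where team is 'A' or points > 10 (either condition is enough)
df_either = df.filter(~((F.col('team') == 'A') | (F.col('points') > 10)))

With this version, a row is dropped as soon as either condition is true, so every row from team ‘A’ would be removed along with any row where points is greater than 10.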
Note #1: We used a single & symbol to filter based on two conditions, but you can include more & symbols if you’d like to filter by even more conditions.
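For example, the following sketch chains a third condition on the position column with an additional & (the name df_new2 is just illustrative):

import pyspark.sql.functions as F

#drop rows where team is 'A', points > 10, and position is 'Forward'
df_new2 = df.filter(~((F.col('team') == 'A') &
                      (F.col('points') > 10) &
                      (F.col('position') == 'Forward')))

Each individual comparison must be wrapped in its own parentheses because & and | bind more tightly than the comparison operators on PySpark Column objects.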
Note #2: You can find the complete documentation for the PySpark filter function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Replace Zero with Null
PySpark: How to Replace String in Column
PySpark: How to Drop Duplicate Rows from DataFrame