PySpark: How to Drop Rows that Contain a Specific Value


You can use the following methods to drop rows in a PySpark DataFrame that contain a specific value:

Method 1: Drop Rows with Specific Value

#drop rows where value in 'conference' column is equal to 'West'
df_new = df.filter(df.conference != 'West')

Method 2: Drop Rows with One of Several Specific Values

from pyspark.sql.functions import col

#drop rows where value in 'team' column is equal to 'A' or 'D'
df_new = df.filter(~col('team').isin(['A','D']))

The following examples show how to use each method in practice with a PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5],
        ['C', 'East', 15],
        ['C', 'West', 31],
        ['D', 'West', 24]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
|   C|      East|    15|
|   C|      West|    31|
|   D|      West|    24|
+----+----------+------+

Example 1: Drop Rows with Specific Value in PySpark

We can use the following syntax to drop rows that contain the value ‘West’ in the conference column of the DataFrame:

#drop rows where value in 'conference' column is equal to 'West'
df_new = df.filter(df.conference != 'West')

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   C|      East|     5|
|   C|      East|    15|
+----+----------+------+

Notice that all rows in the DataFrame that contained the value ‘West’ in the conference column have been dropped.

Example 2: Drop Rows with One of Several Specific Values in PySpark

We can use the following syntax to drop rows that contain the value ‘A’ or ‘D’ in the team column of the DataFrame:

from pyspark.sql.functions import col

#drop rows where value in 'team' column is equal to 'A' or 'D'
df_new = df.filter(~col('team').isin(['A','D']))

#view new DataFrame
df_new.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
|   C|      East|    15|
|   C|      West|    31|
+----+----------+------+

Notice that all rows in the DataFrame that contained the value ‘A’ or ‘D’ in the team column have been dropped.

Note: You can find the complete documentation for the PySpark filter function in the official PySpark API reference.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Replace Zero with Null
PySpark: How to Replace String in Column
PySpark: How to Drop Duplicate Rows from DataFrame
