How to Use “OR” Operator in PySpark (With Examples)


There are two common ways to filter a PySpark DataFrame by using an “OR” operator:

Method 1: Use “OR”

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter('points>9 or team=="B"').show()

Method 2: Use | Symbol

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter((df.points>9) | (df.team=="B")).show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Filter DataFrame Using “OR”

We can pass a SQL expression string containing the or keyword to the filter function to keep only the rows where the value in the points column is greater than 9 or the value in the team column is equal to B:

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter('points>9 or team=="B"').show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets at least one of the following conditions:

  • The value in the points column is greater than 9
  • The value in the team column is equal to “B”

Also note that this example uses only one or operator, but you can chain as many or operators as you’d like inside the filter function to filter on more conditions.
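As a rough sketch of chaining several conditions, the snippet below builds one filter string from any number of condition strings. The helper name any_of and the third condition (assists > 8) are illustrative assumptions and are not part of the original example.

```python
def any_of(*conditions: str) -> str:
    """Join SQL condition strings with 'or', wrapping each one in parentheses."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return " or ".join(f"({condition})" for condition in conditions)


# three conditions; the last one is a hypothetical extra for illustration
expr = any_of('points > 9', 'team == "B"', 'assists > 8')
print(expr)  # (points > 9) or (team == "B") or (assists > 8)

# with the DataFrame from this tutorial, the string would be passed as:
# df.filter(expr).show()
```

On the example DataFrame, this expression would keep every row except the team C row, since that is the only row that fails all three conditions.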

Example 2: Filter DataFrame Using | Symbol

We can use the following syntax with the filter function and the | symbol, with each condition wrapped in parentheses, to filter the DataFrame to only contain rows where the value in the points column is greater than 9 or the value in the team column is equal to B:

#filter DataFrame where points is greater than 9 or team equals "B"
df.filter((df.points>9) | (df.team=="B")).show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
+----+----------+------+-------+

Notice that each of the rows in the resulting DataFrame meets at least one of the following conditions:

  • The value in the points column is greater than 9
  • The value in the team column is equal to “B”

Also note that this DataFrame matches the DataFrame from the previous example.
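The parentheses around each condition in the | syntax matter because Python gives | a higher precedence than comparison operators such as > and ==. The sketch below uses Python's built-in ast module to show how the two forms are parsed. The bare names points and team are stand-ins for df.points and df.team, so no Spark session is needed to run it.

```python
import ast


def top_level_node(expression: str) -> ast.expr:
    """Parse a Python expression and return its outermost AST node."""
    return ast.parse(expression, mode="eval").body


without_parens = top_level_node('points > 9 | team == "B"')
with_parens = top_level_node('(points > 9) | (team == "B")')

# without parentheses: one chained comparison, points > (9 | team) == "B"
print(type(without_parens).__name__)                 # Compare
print(type(without_parens.comparators[0]).__name__)  # BinOp, i.e. 9 | team

# with parentheses: a single | joining two separate comparisons
print(type(with_parens).__name__)                    # BinOp
print(type(with_parens.left).__name__, type(with_parens.right).__name__)  # Compare Compare
```

Because the unparenthesized form is parsed as a chained comparison rather than an "or" of two conditions, PySpark will usually raise an error on it instead of filtering the rows as intended.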

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Multiple Columns
PySpark: How to Select Columns with Alias
PySpark: How to Select Columns by Index
