How to Filter by Date Range in PySpark (With Example)


You can use the following syntax to filter rows in a PySpark DataFrame based on a date range:

#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#filter DataFrame to only show rows between start and end dates
df.filter(df.start_date.between(*dates)).show()

This particular example filters the DataFrame to only contain rows where the date in the start_date column of the DataFrame is between 2019-01-01 and 2022-01-01.

The following example shows how to use this syntax in practice.

Example: How to Filter by Date Range in PySpark

Suppose we have the following PySpark DataFrame that contains information about the start date for various employees at a company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2017-10-25'],
        ['B', '2018-10-11'],
        ['C', '2018-10-17'],
        ['D', '2019-12-21'],
        ['E', '2021-04-14'],
        ['F', '2022-06-26']] 
  
#define column names
columns = ['employee', 'start_date']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+--------+----------+
|employee|start_date|
+--------+----------+
|       A|2017-10-25|
|       B|2018-10-11|
|       C|2018-10-17|
|       D|2019-12-21|
|       E|2021-04-14|
|       F|2022-06-26|
+--------+----------+

We can use the following syntax to filter the DataFrame to only contain rows where the date in the start_date column of the DataFrame is between 2019-01-01 and 2022-01-01:

#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#filter DataFrame to only show rows between start and end dates
df.filter(df.start_date.between(*dates)).show()

+--------+----------+
|employee|start_date|
+--------+----------+
|       D|2019-12-21|
|       E|2021-04-14|
+--------+----------+

Notice that the DataFrame has been filtered to only show the rows with the two dates in the start_date column that fall between 2019-01-01 and 2022-01-01.

Note that if you only want to know how many rows have a date within a specific date range, then you can use the count function as follows:

#specify start and end dates
dates = ('2019-01-01', '2022-01-01')

#count number of rows in DataFrame that fall between start and end dates
df.filter(df.start_date.between(*dates)).count()

2

This tells us that there are two rows in the DataFrame where the date in the start_date column falls between 2019-01-01 and 2022-01-01.

Note: You can find the complete documentation for the PySpark between function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark

Leave a Reply

Your email address will not be published. Required fields are marked *