You can use the following methods to calculate a difference between two dates in PySpark:
Method 1: Calculate Difference Between Dates in Days
from pyspark.sql import functions as F
df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date'))).show()
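Note that datediff(end, start) counts days from the start date to the end date, so it returns a negative value if the end date comes before the start date. Here is a minimal sketch (assuming a SparkSession named spark is already running) that swaps a pair of example dates to show this:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
#one-row DataFrame where the end date comes before the start date
tmp = spark.createDataFrame([('2023-01-15', '2020-10-25')], ['start_date', 'end_date'])
#datediff(end, start) counts days from start to end, so this returns -812
tmp.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date'))).show()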
Method 2: Calculate Difference Between Dates in Months
from pyspark.sql import functions as F
df.withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2)).show()
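The months_between function returns a whole number when the two dates fall on the same day of the month; otherwise the fractional part is based on a 31-day month. A quick sketch (again assuming a SparkSession named spark is already running):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
#one row with dates on the same day of the month, one row with mid-month dates
demo = spark.createDataFrame([('2023-01-15', '2023-02-15'),
                              ('2023-01-15', '2023-01-31')],
                             ['start_date', 'end_date'])
#first row returns exactly 1.0; second row returns 16/31 of a month (about 0.52)
demo.withColumn('diff_months', F.months_between(F.to_date('end_date'), F.to_date('start_date'))).show()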
Method 3: Calculate Difference Between Dates in Years
from pyspark.sql import functions as F
df.withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')) / 12, 2)).show()
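If you prefer completed whole years rather than a rounded fraction, one option (a sketch, not part of the methods above; the column name diff_years_whole is arbitrary) is to floor the month difference divided by 12:
from pyspark.sql import functions as F
#count only completed years between the two dates
df.withColumn('diff_years_whole', F.floor(F.months_between(F.to_date('end_date'), F.to_date('start_date')) / 12)).show()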
Each of these examples calculates a date difference using the ending dates from the end_date column and the starting dates from the start_date column in the DataFrame.
The following example shows how to use each method in practice.
Example: Calculate Difference Between Dates in PySpark
Suppose we have the following PySpark DataFrame that contains information about the start date and end date for various employees at a company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', '2020-10-25', '2023-01-15'],
['B', '2013-10-11', '2029-01-18'],
['C', '2015-10-17', '2022-04-15'],
['D', '2022-12-21', '2023-04-23'],
['E', '2021-04-14', '2023-07-25'],
['F', '2021-06-26', '2021-07-12']]
#define column names
columns = ['employee', 'start_date', 'end_date']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+--------+----------+----------+
|employee|start_date| end_date|
+--------+----------+----------+
| A|2020-10-25|2023-01-15|
| B|2013-10-11|2029-01-18|
| C|2015-10-17|2022-04-15|
| D|2022-12-21|2023-04-23|
| E|2021-04-14|2023-07-25|
| F|2021-06-26|2021-07-12|
+--------+----------+----------+
We can use the following syntax to calculate the date difference between each start and end date in terms of days, months and years:
from pyspark.sql import functions as F
#create new DataFrame with date differences columns
df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date')))\
  .withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2))\
  .withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')) / 12, 2)).show()
+--------+----------+----------+---------+-----------+----------+
|employee|start_date| end_date|diff_days|diff_months|diff_years|
+--------+----------+----------+---------+-----------+----------+
| A|2020-10-25|2023-01-15| 812| 26.68| 2.22|
| B|2013-10-11|2029-01-18| 5578| 183.23| 15.27|
| C|2015-10-17|2022-04-15| 2372| 77.94| 6.49|
| D|2022-12-21|2023-04-23| 123| 4.06| 0.34|
| E|2021-04-14|2023-07-25| 832| 27.35| 2.28|
| F|2021-06-26|2021-07-12| 16| 0.55| 0.05|
+--------+----------+----------+---------+-----------+----------+
Notice that three new columns were added to the DataFrame that show the date difference between each start and end date in terms of days, months and years.
For example:
- There are 812 days between 2020-10-25 and 2023-01-15.
- There are 26.68 months between 2020-10-25 and 2023-01-15.
- There are 2.22 years between 2020-10-25 and 2023-01-15.
And so on.
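You can sanity-check the day count for employee A with Python's built-in datetime module:
from datetime import date
#plain-Python day count between employee A's start and end dates
print((date(2023, 1, 15) - date(2020, 10, 25)).days)  #812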
Note that we used the round function to round the date differences for months and years to two decimal places, but you can round to however many decimal places you'd like.
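For example, passing a different second argument to round changes the precision, and omitting round altogether keeps the full fractional value (a sketch using the same DataFrame):
from pyspark.sql import functions as F
#round the month difference to four decimal places instead of two
df.withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 4)).show()
#or skip rounding entirely to keep the full fractional value
df.withColumn('diff_months', F.months_between(F.to_date('end_date'), F.to_date('start_date'))).show()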
Also note that we used the withColumn function several times, each call returning a new DataFrame with one column added and all existing columns left unchanged.
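If you're on PySpark 3.3 or later, the withColumns function accepts a dict and adds all of the new columns in a single call; here is a sketch equivalent to the chained version above, assuming that version is available:
from pyspark.sql import functions as F
#requires PySpark 3.3+; adds all three difference columns at once
df.withColumns({'diff_days': F.datediff(F.to_date('end_date'), F.to_date('start_date')),
                'diff_months': F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2),
                'diff_years': F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')) / 12, 2)}).show()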
You can find the complete documentation for the PySpark withColumn function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark