You can use the following methods to calculate a difference between two dates in PySpark:

**Method 1: Calculate Difference Between Dates in Days**

```python
from pyspark.sql import functions as F

df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date'))).show()
```

**Method 2: Calculate Difference Between Dates in Months**

```python
from pyspark.sql import functions as F

df.withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2)).show()
```

**Method 3: Calculate Difference Between Dates in Years**

```python
from pyspark.sql import functions as F

df.withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12, 2)).show()
```

Each of these examples calculates a date difference using the ending dates from the **end_date** column and the starting dates from the **start_date** column in the DataFrame.

The following example shows how to use each method in practice.

**Example: Calculate Difference Between Dates in PySpark**

Suppose we have the following PySpark DataFrame that contains information about the start date and end date for various employees at a company:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '2020-10-25', '2023-01-15'],
        ['B', '2013-10-11', '2029-01-18'],
        ['C', '2015-10-17', '2022-04-15'],
        ['D', '2022-12-21', '2023-04-23'],
        ['E', '2021-04-14', '2023-07-25'],
        ['F', '2021-06-26', '2021-07-12']]

#define column names
columns = ['employee', 'start_date', 'end_date']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()
```

```
+--------+----------+----------+
|employee|start_date|  end_date|
+--------+----------+----------+
|       A|2020-10-25|2023-01-15|
|       B|2013-10-11|2029-01-18|
|       C|2015-10-17|2022-04-15|
|       D|2022-12-21|2023-04-23|
|       E|2021-04-14|2023-07-25|
|       F|2021-06-26|2021-07-12|
+--------+----------+----------+
```

We can use the following syntax to calculate the date difference between each start and end date in terms of days, months and years:

```python
from pyspark.sql import functions as F

#create new DataFrame with date difference columns
df.withColumn('diff_days', F.datediff(F.to_date('end_date'), F.to_date('start_date')))\
  .withColumn('diff_months', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date')), 2))\
  .withColumn('diff_years', F.round(F.months_between(F.to_date('end_date'), F.to_date('start_date'))/12, 2)).show()
```

```
+--------+----------+----------+---------+-----------+----------+
|employee|start_date|  end_date|diff_days|diff_months|diff_years|
+--------+----------+----------+---------+-----------+----------+
|       A|2020-10-25|2023-01-15|      812|      26.68|      2.22|
|       B|2013-10-11|2029-01-18|     5578|     183.23|     15.27|
|       C|2015-10-17|2022-04-15|     2372|      77.94|      6.49|
|       D|2022-12-21|2023-04-23|      123|       4.06|      0.34|
|       E|2021-04-14|2023-07-25|      832|      27.35|      2.28|
|       F|2021-06-26|2021-07-12|       16|       0.55|      0.05|
+--------+----------+----------+---------+-----------+----------+
```

Notice that three new columns were added to the DataFrame that show the date difference between each start and end date in terms of days, months and years.

For example:

- There are **812 days** between 2020-10-25 and 2023-01-15.
- There are **26.68 months** between 2020-10-25 and 2023-01-15.
- There are **2.22 years** between 2020-10-25 and 2023-01-15.

And so on.

Note that we used the **round** function to round the date difference for the months and years to two decimal places, but you can round to however many decimal places you'd like.

Also note that we used the **withColumn** function several times to return a new DataFrame with several columns added and all other existing columns left the same.

You can find the complete documentation for the PySpark **withColumn** function here.

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Add Days to a Date Column in PySpark

How to Convert String to Date in PySpark

How to Convert Timestamp to Date in PySpark