PySpark: How to Calculate Difference Between Two Times


You can use the following syntax to calculate a difference between two times in a PySpark DataFrame:

from pyspark.sql.functions import col

df_new = df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))\
           .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60)\
           .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600)

This particular example calculates the difference between the times in the start_time and end_time columns in a DataFrame in terms of seconds, minutes and hours.

The following example shows how to use this syntax in practice.

Example: How to Calculate Difference Between Two Times in PySpark

Suppose we have the following PySpark DataFrame that contains a column of start times for some activity and a column of end times:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql import functions as F

#define data
data = [['2023-01-15 04:14:22', '2023-01-18 04:15:00'],
        ['2023-02-24 10:55:01', '2023-02-24 11:14:30'],
        ['2023-07-14 18:34:59', '2023-07-14 18:35:22'],
        ['2023-10-30 22:20:05', '2023-11-02 07:55:00']] 
  
#define column names
columns = ['start_time', 'end_time'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#convert string columns to timestamp
df = df.withColumn('start_time', F.to_timestamp('start_time', 'yyyy-MM-dd HH:mm:ss'))\
       .withColumn('end_time', F.to_timestamp('end_time', 'yyyy-MM-dd HH:mm:ss'))
         
#view DataFrame
df.show()

+-------------------+-------------------+
|         start_time|           end_time|
+-------------------+-------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|
|2023-02-24 10:55:01|2023-02-24 11:14:30|
|2023-07-14 18:34:59|2023-07-14 18:35:22|
|2023-10-30 22:20:05|2023-11-02 07:55:00|
+-------------------+-------------------+

We can use the following syntax to create a new DataFrame that contains three new columns that display the time difference between each start and end time in terms of seconds, minutes and hours:

from pyspark.sql.functions import col
#create new DataFrame with time differences
df_new = df.withColumn('seconds_diff', col('end_time').cast('long') - col('start_time').cast('long'))\
           .withColumn('minutes_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/60)\
           .withColumn('hours_diff', (col('end_time').cast('long') - col('start_time').cast('long'))/3600)

#view new DataFrame
df_new.show()

+-------------------+-------------------+------------+-------------------+--------------------+
|         start_time|           end_time|seconds_diff|       minutes_diff|          hours_diff|
+-------------------+-------------------+------------+-------------------+--------------------+
|2023-01-15 04:14:22|2023-01-18 04:15:00|      259238|  4320.633333333333|   72.01055555555556|
|2023-02-24 10:55:01|2023-02-24 11:14:30|        1169| 19.483333333333334| 0.32472222222222225|
|2023-07-14 18:34:59|2023-07-14 18:35:22|          23|0.38333333333333336|0.006388888888888889|
|2023-10-30 22:20:05|2023-11-02 07:55:00|      207295| 3454.9166666666665|  57.581944444444446|
+-------------------+-------------------+------------+-------------------+--------------------+

The resulting DataFrame contains the following three new columns:

  • seconds_diff: The difference between each start and end time in seconds.
  • minutes_diff: The difference between each start and end time in minutes.
  • hours_diff: The difference between each start and end time in hours.

Note that we used the withColumn function three times to return a new DataFrame with three columns added to the existing DataFrame.

You can find the complete documentation for the PySpark withColumn function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Convert String to Timestamp in PySpark
How to Convert Timestamp to Date in PySpark
How to Calculate a Difference Between Two Dates in PySpark

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *