PySpark: How to Add Days to a Date Column


You can use the following syntax to add a specific number of days to a date column in a PySpark DataFrame:

from pyspark.sql import functions as F

df.withColumn('date_plus_5', F.date_add(df['date'], 5)).show()

This particular example creates a new column called date_plus_5 that adds 5 days to each date in the date column.

The following example shows how to use this syntax in practice.

Example: How to Add Days to a Date Column in PySpark

Suppose we have the following PySpark DataFrame that contains information about sales made on various dates at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-01-15', 225],
        ['2023-02-24', 260],
        ['2023-07-14', 413],
        ['2023-10-30', 368],
        ['2023-11-03', 322],
        ['2023-11-26', 278]] 
  
#define column names
columns = ['date', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-01-15|  225|
|2023-02-24|  260|
|2023-07-14|  413|
|2023-10-30|  368|
|2023-11-03|  322|
|2023-11-26|  278|
+----------+-----+

Suppose we would like to add a new column that adds 5 days to each date in the date column.

We can use the following syntax to do so:

from pyspark.sql import functions as F

#add 5 days to each date in 'date' column
df.withColumn('date_plus_5', F.date_add(df['date'], 5)).show()

+----------+-----+-----------+
|      date|sales|date_plus_5|
+----------+-----+-----------+
|2023-01-15|  225| 2023-01-20|
|2023-02-24|  260| 2023-03-01|
|2023-07-14|  413| 2023-07-19|
|2023-10-30|  368| 2023-11-04|
|2023-11-03|  322| 2023-11-08|
|2023-11-26|  278| 2023-12-01|
+----------+-----+-----------+

Notice that the new date_plus_5 column contains each of the dates from the date column with five days added.
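Note that date_add performs calendar-aware arithmetic, so dates roll over month and year boundaries correctly (for example, 2023-02-24 plus 5 days lands on 2023-03-01, and 2023-11-26 plus 5 days lands on 2023-12-01). A minimal sketch of the same arithmetic in plain Python:

```python
from datetime import date, timedelta

# adding a timedelta rolls over month boundaries,
# just like date_add does in Spark
start = date(2023, 2, 24)
print(start + timedelta(days=5))  # 2023-03-01
```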

Note that if you would like to subtract 5 days instead, you can use the date_sub() function:

from pyspark.sql import functions as F

#subtract 5 days from each date in 'date' column
df.withColumn('date_sub_5', F.date_sub(df['date'], 5)).show()

+----------+-----+----------+
|      date|sales|date_sub_5|
+----------+-----+----------+
|2023-01-15|  225|2023-01-10|
|2023-02-24|  260|2023-02-19|
|2023-07-14|  413|2023-07-09|
|2023-10-30|  368|2023-10-25|
|2023-11-03|  322|2023-10-29|
|2023-11-26|  278|2023-11-21|
+----------+-----+----------+

Notice that the new date_sub_5 column contains each of the dates from the date column with five days subtracted.

Note that we used the withColumn function to return a new DataFrame with the date_sub_5 column added and all other columns left unchanged.

You can find complete details for the withColumn function in the official PySpark documentation.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark
How to Check Data Type of Columns in PySpark
