You can use the following syntax to round dates to the first day of the month in a PySpark DataFrame:
import pyspark.sql.functions as F
#add new column that rounds date to first day of month
df_new = df.withColumn('first_day_of_month', F.trunc('date', 'month'))
This particular example creates a new column named first_day_of_month that rounds each date in the date column to the first day of the month.
The following example shows how to use this syntax in practice.
Example: How to Round Date to First Day of Month in PySpark
Suppose we have the following PySpark DataFrame that contains information about the sales made on various days at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['2023-04-11', 22],
['2023-04-15', 14],
['2023-04-17', 12],
['2023-05-21', 15],
['2023-05-23', 30],
['2023-10-26', 45],
['2023-10-28', 32],
['2023-10-29', 47]]
#define column names
columns = ['date', 'sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----------+-----+
| date|sales|
+----------+-----+
|2023-04-11| 22|
|2023-04-15| 14|
|2023-04-17| 12|
|2023-05-21| 15|
|2023-05-23| 30|
|2023-10-26| 45|
|2023-10-28| 32|
|2023-10-29| 47|
+----------+-----+
Suppose we would like to round each date in the date column to the first day of the month.
We can use the following syntax to do so:
import pyspark.sql.functions as F
#add new column that rounds date to first day of month
df_new = df.withColumn('first_day_of_month', F.trunc('date', 'month'))
#view new DataFrame
df_new.show()
+----------+-----+------------------+
| date|sales|first_day_of_month|
+----------+-----+------------------+
|2023-04-11| 22| 2023-04-01|
|2023-04-15| 14| 2023-04-01|
|2023-04-17| 12| 2023-04-01|
|2023-05-21| 15| 2023-05-01|
|2023-05-23| 30| 2023-05-01|
|2023-10-26| 45| 2023-10-01|
|2023-10-28| 32| 2023-10-01|
|2023-10-29| 47| 2023-10-01|
+----------+-----+------------------+
The new first_day_of_month column contains each date from the date column rounded to the first day of the month.
For example, we can see:
- The date 2023-04-11 has been rounded to 2023-04-01.
- The date 2023-04-15 has been rounded to 2023-04-01.
- The date 2023-04-17 has been rounded to 2023-04-01.
And so on.
Note: You can find the complete documentation for the PySpark trunc function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark