You can use the following syntax to create a date column from year, month and day columns in a PySpark DataFrame:
from pyspark.sql import functions as F
df_new = df.withColumn('date', F.make_date('year', 'month', 'day'))
This particular example creates a new column called date by using the values in the year, month and day columns.
The following example shows how to use this syntax in practice.
Example: Create Date Column from Year, Month and Day
Suppose we have the following PySpark DataFrame that contains three columns to represent the year, month and day of a given date:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [[2021, 10, 30],
[2021, 12, 3],
[2022, 1, 14],
[2022, 3, 22],
[2022, 5, 24],
[2023, 3, 21],
[2023, 7, 18],
[2023, 11, 4]]
#define column names
columns = ['year', 'month', 'day']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+-----+---+
|year|month|day|
+----+-----+---+
|2021| 10| 30|
|2021| 12| 3|
|2022| 1| 14|
|2022| 3| 22|
|2022| 5| 24|
|2023| 3| 21|
|2023| 7| 18|
|2023| 11| 4|
+----+-----+---+
We can use the following syntax to create a new column called date that creates a date from the existing values in the year, month and day columns:
from pyspark.sql import functions as F
#create new DataFrame with 'date' column
df_new = df.withColumn('date', F.make_date('year', 'month', 'day'))
#view new DataFrame
df_new.show()
+----+-----+---+----------+
|year|month|day| date|
+----+-----+---+----------+
|2021| 10| 30|2021-10-30|
|2021| 12| 3|2021-12-03|
|2022| 1| 14|2022-01-14|
|2022| 3| 22|2022-03-22|
|2022| 5| 24|2022-05-24|
|2023| 3| 21|2023-03-21|
|2023| 7| 18|2023-07-18|
|2023| 11| 4|2023-11-04|
+----+-----+---+----------+
Notice that the new DataFrame contains a date column with dates created from the existing year, month and day columns.
We can use the following syntax to verify that the data type of the new date column is indeed a date:
#check data type of new 'date' column
dict(df_new.dtypes)['date']
'date'
The new column does indeed have a data type of date.
Note that we used the withColumn function to return a new DataFrame with a new column added and all other existing columns left the same.
You can find the complete documentation for the PySpark withColumn function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark