You can use the following syntax to extract the month from a date in a PySpark DataFrame:
from pyspark.sql.functions import month
df_new = df.withColumn('month', month(df['date']))
This particular example creates a new column called month that extracts the month from the date in the date column.
The following example shows how to use this syntax in practice.
Example: How to Extract Month from Date in PySpark
Suppose we have the following PySpark DataFrame that contains information about the sales made on various days at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['2023-04-11', 22],
['2023-04-15', 14],
['2023-04-17', 12],
['2023-05-21', 15],
['2023-05-23', 30],
['2023-10-26', 45],
['2023-10-28', 32],
['2023-10-29', 47]]
#define column names
columns = ['date', 'sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----------+-----+
| date|sales|
+----------+-----+
|2023-04-11| 22|
|2023-04-15| 14|
|2023-04-17| 12|
|2023-05-21| 15|
|2023-05-23| 30|
|2023-10-26| 45|
|2023-10-28| 32|
|2023-10-29| 47|
+----------+-----+
Suppose we would like to extract the month from each date in the date column.
We can use the following syntax to do so:
from pyspark.sql.functions import month
#extract month from date column
df_new = df.withColumn('month', month(df['date']))
#view new DataFrame
df_new.show()
+----------+-----+-----+
| date|sales|month|
+----------+-----+-----+
|2023-04-11| 22| 4|
|2023-04-15| 14| 4|
|2023-04-17| 12| 4|
|2023-05-21| 15| 5|
|2023-05-23| 30| 5|
|2023-10-26| 45| 10|
|2023-10-28| 32| 10|
|2023-10-29| 47| 10|
+----------+-----+-----+
The new month column contains the month of each date in the date column.
Also note that you could use the date_format function if you’d like to return the name of the month instead:
from pyspark.sql.functions import *
#extract month name from date column
df_new = df.withColumn('month', date_format('date', 'MMMM'))
#view new DataFrame
df_new.show()
+----------+-----+-------+
| date|sales| month|
+----------+-----+-------+
|2023-04-11| 22| April|
|2023-04-15| 14| April|
|2023-04-17| 12| April|
|2023-05-21| 15| May|
|2023-05-23| 30| May|
|2023-10-26| 45|October|
|2023-10-28| 32|October|
|2023-10-29| 47|October|
+----------+-----+-------+
The new month column now contains the name of the month for each date in the date column.
Note that we used the withColumn function to add a new column called month to the DataFrame while keeping all existing columns the same.
Note: You can find the complete documentation for the PySpark withColumn function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark