PySpark: How to Extract Month from Date


You can use the following syntax to extract the month from a date in a PySpark DataFrame:

from pyspark.sql.functions import month

df_new = df.withColumn('month', month(df['date']))

This particular example creates a new column called month that extracts the month from the date in the date column.

The following example shows how to use this syntax in practice.

Example: How to Extract Month from Date in PySpark

Suppose we have the following PySpark DataFrame that contains information about the sales made on various days at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-04-11', 22],
        ['2023-04-15', 14],
        ['2023-04-17', 12],
        ['2023-05-21', 15],
        ['2023-05-23', 30],
        ['2023-10-26', 45],
        ['2023-10-28', 32],
        ['2023-10-29', 47]]
  
#define column names
columns = ['date', 'sales']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-04-11|   22|
|2023-04-15|   14|
|2023-04-17|   12|
|2023-05-21|   15|
|2023-05-23|   30|
|2023-10-26|   45|
|2023-10-28|   32|
|2023-10-29|   47|
+----------+-----+

Suppose we would like to extract the month from each date in the date column.

We can use the following syntax to do so:

from pyspark.sql.functions import month

#extract month from date column
df_new = df.withColumn('month', month(df['date']))

#view new DataFrame
df_new.show()

+----------+-----+-----+
|      date|sales|month|
+----------+-----+-----+
|2023-04-11|   22|    4|
|2023-04-15|   14|    4|
|2023-04-17|   12|    4|
|2023-05-21|   15|    5|
|2023-05-23|   30|    5|
|2023-10-26|   45|   10|
|2023-10-28|   32|   10|
|2023-10-29|   47|   10|
+----------+-----+-----+

The new month column contains the month of each date in the date column.

Also note that you could use the date_format function if you’d like to return the name of the month instead:

from pyspark.sql.functions import *

#extract month name from date column
df_new = df.withColumn('month', date_format('date', 'MMMM'))

#view new DataFrame
df_new.show()

+----------+-----+-------+
|      date|sales|  month|
+----------+-----+-------+
|2023-04-11|   22|  April|
|2023-04-15|   14|  April|
|2023-04-17|   12|  April|
|2023-05-21|   15|    May|
|2023-05-23|   30|    May|
|2023-10-26|   45|October|
|2023-10-28|   32|October|
|2023-10-29|   47|October|
+----------+-----+-------+

The new month column now contains the name of the month for each date in the date column.

Note that we used the withColumn function to add a new column called month to the DataFrame while keeping all existing columns the same.

Note: You can find the complete documentation for the PySpark withColumn function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *