How to Group by Week in PySpark DataFrame


You can use the following syntax to group rows by week in a PySpark DataFrame:

from pyspark.sql.functions import weekofyear, sum

df.groupBy(weekofyear('date').alias('week')).agg(sum('sales').alias('sum_sales')).show()

This particular example groups the rows of the DataFrame by week based on the date in the date column and then calculates the sum of the values in the sales column by week.

The following example shows how to use this syntax in practice.

Example: How to Group by Week in PySpark

Suppose we have the following PySpark DataFrame that contains information about the sales made on various days at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['2023-04-11', 22],
        ['2023-04-15', 14],
        ['2023-04-17', 12],
        ['2023-05-21', 15],
        ['2023-05-23', 30],
        ['2023-10-26', 45],
        ['2023-10-28', 32],
        ['2023-10-29', 47]]
  
#define column names
columns = ['date', 'sales']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-----+
|      date|sales|
+----------+-----+
|2023-04-11|   22|
|2023-04-15|   14|
|2023-04-17|   12|
|2023-05-21|   15|
|2023-05-23|   30|
|2023-10-26|   45|
|2023-10-28|   32|
|2023-10-29|   47|
+----------+-----+

Suppose we would like to calculate the sum of the sales, grouped by week.

We can use the following syntax to do so:

from pyspark.sql.functions import weekofyear, sum

#calculate sum of sales by week
df.groupBy(weekofyear('date').alias('week')).agg(sum('sales').alias('sum_sales')).show()

+----+---------+
|week|sum_sales|
+----+---------+
|  15|       36|
|  16|       12|
|  20|       15|
|  21|       30|
|  43|      124|
+----+---------+

The resulting DataFrame shows the sum of sales by week.

For example, we can see:

  • The sum of sales for the 15th week of the year was 36.
  • The sum of sales for the 16th week of the year was 12.
  • The sum of sales for the 20th week of the year was 15.

And so on.

Note that you can also aggregate sales by a different metric if you’d like.

For example, you could use the following syntax to calculate the total count of sales, grouped by week:

from pyspark.sql.functions import weekofyear, count

#calculate count of sales by week
df.groupBy(weekofyear('date').alias('week')).agg(count('sales').alias('cnt_sales')).show()

+----+---------+
|week|cnt_sales|
+----+---------+
|  15|        2|
|  16|        1|
|  20|        1|
|  21|        1|
|  43|        3|
+----+---------+

The resulting DataFrame now shows the count of sales by week.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *