You can use the following syntax to group rows by date in a PySpark DataFrame:
from pyspark.sql.types import DateType
from pyspark.sql.functions import sum
#calculate sum of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date'))\
  .agg(sum('sales').alias('sum_sales')).show()
This particular example groups the rows of the DataFrame by the date portion of the timestamps in the ts column and then calculates the sum of the values in the sales column for each date.
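Note: casting the ts column to DateType is one way to extract the date. An equivalent sketch (assuming the same df with ts and sales columns) uses the built-in to_date function instead:
from pyspark.sql import functions as F
#extract the date from each timestamp with to_date, then sum sales by date
df.groupBy(F.to_date('ts').alias('date'))\
  .agg(F.sum('sales').alias('sum_sales')).show()
Both approaches produce the same date column to group on.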
The following example shows how to use this syntax in practice.
Example: How to Group by Date in PySpark
Suppose we have the following PySpark DataFrame that contains information about the sales made on various timestamps at some company:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
#define data
data = [['2023-01-15 04:14:22', 20],
        ['2023-01-15 10:55:01', 30],
        ['2023-01-15 18:34:59', 15],
        ['2023-01-16 21:20:25', 12],
        ['2023-01-16 22:20:05', 15],
        ['2023-01-17 04:17:02', 41]]
#define column names
columns = ['ts', 'sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#convert string column to timestamp
df = df.withColumn('ts', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))
#view dataframe
df.show()
+-------------------+-----+
| ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22| 20|
|2023-01-15 10:55:01| 30|
|2023-01-15 18:34:59| 15|
|2023-01-16 21:20:25| 12|
|2023-01-16 22:20:05| 15|
|2023-01-17 04:17:02| 41|
+-------------------+-----+
Suppose we would like to calculate the sum of the sales, grouped by date.
We can use the following syntax to do so:
from pyspark.sql.types import DateType
from pyspark.sql.functions import sum
#calculate sum of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date'))\
  .agg(sum('sales').alias('sum_sales')).show()
+----------+---------+
| date|sum_sales|
+----------+---------+
|2023-01-15| 65|
|2023-01-16| 27|
|2023-01-17| 41|
+----------+---------+
The resulting DataFrame shows the sum of sales by date.
For example, we can see:
- The sum of sales for 2023-01-15 is 65.
- The sum of sales for 2023-01-16 is 27.
- The sum of sales for 2023-01-17 is 41.
Note that you can also aggregate the sales column using a different metric if you'd like.
For example, you could use the following syntax to calculate the count of sales, grouped by date:
from pyspark.sql.types import DateType
from pyspark.sql.functions import count
#calculate count of sales by date
df.groupBy(df['ts'].cast(DateType()).alias('date'))\
  .agg(count('sales').alias('count_sales')).show()
+----------+-----------+
| date|count_sales|
+----------+-----------+
|2023-01-15| 3|
|2023-01-16| 2|
|2023-01-17| 1|
+----------+-----------+
The resulting DataFrame now shows the count of sales by date.
Note that we used the cast function with DateType to extract the date from each timestamp in the ts column.
We then used the agg function to calculate aggregate metrics, grouped by the unique dates.
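If you'd like, you can also compute several metrics in a single agg call. Here is a minimal sketch (assuming the same df from the example above; the average metric is an extra illustration, not part of the original example):
from pyspark.sql.types import DateType
from pyspark.sql.functions import sum, count, avg
#calculate the sum, count, and average of sales by date in one pass
df.groupBy(df['ts'].cast(DateType()).alias('date'))\
  .agg(sum('sales').alias('sum_sales'),
       count('sales').alias('count_sales'),
       avg('sales').alias('avg_sales')).show()
Each alias becomes a separate column in the resulting DataFrame.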
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Add Days to a Date Column in PySpark
How to Convert String to Date in PySpark
How to Convert Timestamp to Date in PySpark