# How to Calculate a Cumulative Sum in PySpark

You can use the following methods to calculate a cumulative sum in a PySpark DataFrame:

Method 1: Calculate Cumulative Sum of One Column

```
from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales column
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))
```

Method 2: Calculate Cumulative Sum of One Column, Grouped by Another Column

```
from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.partitionBy('store').orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales, grouped by store
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))
```

The following examples show how to use each method in practice.

## Example 1: Calculate Cumulative Sum of One Column

Suppose we have the following PySpark DataFrame that contains information about the sales made during 10 consecutive days at some grocery store:

```
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[1, 11],
        [2, 8],
        [3, 4],
        [4, 5],
        [5, 5],
        [6, 8],
        [7, 7],
        [8, 7],
        [9, 6],
        [10, 4]]

#define column names
columns = ['day', 'sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+---+-----+
|day|sales|
+---+-----+
|  1|   11|
|  2|    8|
|  3|    4|
|  4|    5|
|  5|    5|
|  6|    8|
|  7|    7|
|  8|    7|
|  9|    6|
| 10|    4|
+---+-----+
```

We can use the following syntax to calculate the cumulative sum of the values in the sales column:

```
from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales column
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))

#view new DataFrame
df_new.show()

+---+-----+---------+
|day|sales|cum_sales|
+---+-----+---------+
|  1|   11|       11|
|  2|    8|       19|
|  3|    4|       23|
|  4|    5|       28|
|  5|    5|       33|
|  6|    8|       41|
|  7|    7|       48|
|  8|    7|       55|
|  9|    6|       61|
| 10|    4|       65|
+---+-----+---------+
```

The resulting DataFrame contains a new column called cum_sales that shows the cumulative sum of the values in the sales column.

## Example 2: Calculate Cumulative Sum of One Column, Grouped by Another Column

Suppose we have the following PySpark DataFrame that contains information about the sales made during 5 consecutive days at two different grocery stores:

```
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 11],
        ['A', 2, 8],
        ['A', 3, 4],
        ['A', 4, 5],
        ['A', 5, 5],
        ['B', 6, 8],
        ['B', 7, 7],
        ['B', 8, 7],
        ['B', 9, 6],
        ['B', 10, 4]]

#define column names
columns = ['store', 'day', 'sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-----+---+-----+
|store|day|sales|
+-----+---+-----+
|    A|  1|   11|
|    A|  2|    8|
|    A|  3|    4|
|    A|  4|    5|
|    A|  5|    5|
|    B|  6|    8|
|    B|  7|    7|
|    B|  8|    7|
|    B|  9|    6|
|    B| 10|    4|
+-----+---+-----+
```

We can use the following syntax to calculate the cumulative sum of the values in the sales column, grouped by the values in the store column:

```
from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating cumulative sum
my_window = (Window.partitionBy('store').orderBy('day')
             .rowsBetween(Window.unboundedPreceding, 0))

#create new DataFrame that contains cumulative sales, grouped by store
df_new = df.withColumn('cum_sales', F.sum('sales').over(my_window))

#view new DataFrame
df_new.show()

+-----+---+-----+---------+
|store|day|sales|cum_sales|
+-----+---+-----+---------+
|    A|  1|   11|       11|
|    A|  2|    8|       19|
|    A|  3|    4|       23|
|    A|  4|    5|       28|
|    A|  5|    5|       33|
|    B|  6|    8|        8|
|    B|  7|    7|       15|
|    B|  8|    7|       22|
|    B|  9|    6|       28|
|    B| 10|    4|       32|
+-----+---+-----+---------+
```

The resulting DataFrame contains a new column called cum_sales that shows the cumulative sum of the values in the sales column, calculated separately for each value in the store column.