You can use the following syntax to calculate a rolling mean in a PySpark DataFrame:

from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating rolling mean
w = (Window.orderBy('day').rowsBetween(-3, 0))

#create new DataFrame that contains 4-day rolling mean column
df_new = df.withColumn('rolling_mean', F.avg('sales').over(w))

This particular example creates a new column that contains the 4-day rolling average of values in the **sales** column of the DataFrame.

The following example shows how to use this syntax in practice.

**Example: How to Calculate a Rolling Mean in PySpark**

Suppose we have the following PySpark DataFrame that contains information about the sales made during 10 consecutive days at some grocery store:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[1, 11],
        [2, 8],
        [3, 4],
        [4, 5],
        [5, 5],
        [6, 8],
        [7, 7],
        [8, 7],
        [9, 6],
        [10, 4]]

#define column names
columns = ['day', 'sales']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+---+-----+
|day|sales|
+---+-----+
|  1|   11|
|  2|    8|
|  3|    4|
|  4|    5|
|  5|    5|
|  6|    8|
|  7|    7|
|  8|    7|
|  9|    6|
| 10|    4|
+---+-----+

We can use the following syntax to calculate the 4-day rolling mean of values in the **sales** column:

from pyspark.sql import Window
from pyspark.sql import functions as F

#define window for calculating rolling mean
w = (Window.orderBy('day').rowsBetween(-3, 0))

#create new DataFrame that contains 4-day rolling mean column
df_new = df.withColumn('rolling_mean', F.avg('sales').over(w))

#view new DataFrame
df_new.show()

+---+-----+-----------------+
|day|sales|     rolling_mean|
+---+-----+-----------------+
|  1|   11|             11.0|
|  2|    8|              9.5|
|  3|    4|7.666666666666667|
|  4|    5|              7.0|
|  5|    5|              5.5|
|  6|    8|              5.5|
|  7|    7|             6.25|
|  8|    7|             6.75|
|  9|    6|              7.0|
| 10|    4|              6.0|
+---+-----+-----------------+

The resulting DataFrame contains a new column called **rolling_mean** that shows the rolling mean of the values in the **sales** column for the most recent 4 days.

For example, the rolling mean of values in the sales column on **day 4** is calculated as:

Rolling Mean = (11 + 8 + 4 + 5) / 4 = **7**

And the rolling mean of values in the sales column on **day 5** is calculated as:

Rolling Mean = (8 + 4 + 5 + 5) / 4 = **5.5**

And so on.
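To double-check these values without a Spark session, the same calculation can be sketched in a few lines of plain Python: each row is averaged with up to 3 preceding rows, which is exactly what **rowsBetween(-3, 0)** specifies.

```python
#sales values from the DataFrame above, in day order
sales = [11, 8, 4, 5, 5, 8, 7, 7, 6, 4]

rolling_mean = []
for i in range(len(sales)):
    #window covers up to 3 preceding rows plus the current row,
    #mirroring Window.rowsBetween(-3, 0)
    window = sales[max(0, i - 3): i + 1]
    rolling_mean.append(sum(window) / len(window))

print(rolling_mean[3])  #day 4 -> 7.0
print(rolling_mean[4])  #day 5 -> 5.5
```

The early rows average over fewer than 4 values (day 1 uses only itself), which is why the first few rolling means differ from a strict 4-day average.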

Note that you can calculate a rolling average using a different number of previous periods by simply changing the first value in the **rowsBetween** function.

For example, you could instead calculate a 5-day rolling average by using **rowsBetween(-4, 0)** instead.
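To see how the window size changes the results, here is a small pure-Python sketch (not PySpark) with a hypothetical helper `rolling_mean` whose parameter `n` plays the role of the window length, i.e. the equivalent of **rowsBetween(-(n - 1), 0)**:

```python
sales = [11, 8, 4, 5, 5, 8, 7, 7, 6, 4]

def rolling_mean(values, n):
    #average each value with up to n - 1 preceding values,
    #mirroring Window.rowsBetween(-(n - 1), 0) in PySpark
    result = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        result.append(sum(window) / len(window))
    return result

print(rolling_mean(sales, 4)[4])  #4-day mean on day 5 -> 5.5
print(rolling_mean(sales, 5)[4])  #5-day mean on day 5 -> 6.6
```

Widening the window from 4 to 5 rows changes day 5 from (8 + 4 + 5 + 5) / 4 = 5.5 to (11 + 8 + 4 + 5 + 5) / 5 = 6.6, because one more preceding day is pulled into the average.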

**Additional Resources**

The following tutorials explain how to perform other common tasks in PySpark:

How to Calculate the Mean of a Column in PySpark

How to Calculate Mean of Multiple Columns in PySpark

How to Calculate the Mean by Group in PySpark