You can use the following syntax with Window.partitionBy() to partition by multiple columns in PySpark:
from pyspark.sql.window import Window

partition_cols = ['col1', 'col2']

w = Window.partitionBy(*partition_cols)
This particular example passes the columns named col1 and col2 to the partitionBy function.
Note that the * operator is used to unpack an iterable into a function call.
Thus, by using the * operator we’re able to pass every element of partition_cols to the function without listing each one individually.
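For example, since partitionBy accepts a variable number of column arguments, the following two window specifications are equivalent:

from pyspark.sql.window import Window

partition_cols = ['col1', 'col2']

#unpacking the list is equivalent to naming each column directly
w1 = Window.partitionBy(*partition_cols)
w2 = Window.partitionBy('col1', 'col2')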
The following example shows how to use this syntax in practice.
Example: How to Use partitionBy() with Multiple Columns in PySpark
Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11],
        ['A', 'Guard', 8],
        ['A', 'Forward', 21],
        ['A', 'Forward', 22],
        ['A', 'Forward', 30],
        ['B', 'Guard', 14],
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]]

#define column names
columns = ['team', 'position', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    21|
|   A| Forward|    22|
|   A| Forward|    30|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|
+----+--------+------+
Suppose we would like to add a new column named id that contains row numbers for each row in the DataFrame, grouped by the team and position columns.
To do so, we can use the following syntax to pass each of these columns to the partitionBy function and then add a new column that contains row numbers. Note that row_number() requires an ordered window, so we order by a constant literal, lit('A'), which assigns the row numbers within each partition in an arbitrary order:
from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

#specify columns to partition by
partition_cols = ['team', 'position']

#specify window
w = Window.partitionBy(*partition_cols).orderBy(lit('A'))

#add column called 'id' that contains row numbers
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame
df.show()

+----+--------+------+---+
|team|position|points| id|
+----+--------+------+---+
|   A| Forward|    21|  1|
|   A| Forward|    22|  2|
|   A| Forward|    30|  3|
|   A|   Guard|    11|  1|
|   A|   Guard|     8|  2|
|   B| Forward|    13|  1|
|   B| Forward|     7|  2|
|   B|   Guard|    14|  1|
|   B|   Guard|    14|  2|
+----+--------+------+---+
The resulting DataFrame contains row numbers for each row, grouped by the team and position columns.
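If you’d like the row numbers to follow a meaningful order instead of the arbitrary order produced by the constant literal, you can order by an actual column within each partition. The following sketch (using points_rank as an illustrative name for the new column) numbers the rows in each team and position group from highest to lowest points:

from pyspark.sql.functions import row_number, desc
from pyspark.sql.window import Window

#order each partition by points from highest to lowest
w = Window.partitionBy('team', 'position').orderBy(desc('points'))

#row number 1 now marks the highest scorer in each group
df = df.withColumn('points_rank', row_number().over(w))

df.show()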
Note #1: In this example, we passed two column names to the partitionBy function, but you can include as many column names as you’d like.
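For instance, assuming the DataFrame also had a hypothetical season column, you could partition by three columns in exactly the same way:

from pyspark.sql.window import Window

#'season' is a hypothetical column used only for illustration
partition_cols = ['team', 'position', 'season']
w = Window.partitionBy(*partition_cols)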
Note #2: You can find the complete documentation for the PySpark partitionBy function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column