PySpark: How to Use partitionBy() with Multiple Columns

You can use the following syntax to use Window.partitionBy() with multiple columns in PySpark:

from pyspark.sql.window import Window

partition_cols = ['col1', 'col2']

w = Window.partitionBy(*partition_cols)

This particular example passes the columns named col1 and col2 to the partitionBy function.

Note that the * operator is used to unpack an iterable into a function call.

Thus, by using the * operator we’re able to pass each of the elements in partition_cols without specifying each element individually.

The following example shows how to use this syntax in practice.

Example: How to Use partitionBy() with Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11], 
        ['A', 'Guard', 8], 
        ['A', 'Forward', 21], 
        ['A', 'Forward', 22],
        ['A', 'Forward', 30], 
        ['B', 'Guard', 14], 
        ['B', 'Guard', 14],
        ['B', 'Forward', 13],
        ['B', 'Forward', 7]] 
#define column names
columns = ['team', 'position', 'points'] 
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
#view dataframe

|   A|   Guard|    11|
|   A|   Guard|     8|
|   A| Forward|    21|
|   A| Forward|    22|
|   A| Forward|    30|
|   B|   Guard|    14|
|   B|   Guard|    14|
|   B| Forward|    13|
|   B| Forward|     7|

Suppose we would like to add a new column named id that contains row numbers for each row in the DataFrame, grouped by the team and position columns.

To do so, we can use the following syntax to pass each of these columns to the partitionBy function and then add a new column that contains row numbers:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#specify columns to partition by
partition_cols = ['team', 'position']

#specify window
w = Window.partitionBy(*partition_cols).orderBy(lit('A'))

#add column called 'id' that contains row numbers
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame

|team|position|points| id|
|   A| Forward|    21|  1|
|   A| Forward|    22|  2|
|   A| Forward|    30|  3|
|   A|   Guard|    11|  1|
|   A|   Guard|     8|  2|
|   B| Forward|    13|  1|
|   B| Forward|     7|  2|
|   B|   Guard|    14|  1|
|   B|   Guard|    14|  2|

The resulting DataFrame contains row numbers for each row, grouped by the team and position columns.

Note #1: In this example, we passed two column names to the partitionBy function but you can include as many column names as you’d like.

Note #2: You can find the complete documentation for the PySpark partitionBy function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *