You can use the following syntax to use Window.orderBy() and sort in descending order in PySpark:
from pyspark.sql.functions import row_number, desc from pyspark.sql.window import Window #specify window w = Window.partitionBy('team').orderBy(desc('points')) #add column called 'id' that contains row numbers df = df.withColumn('id', row_number().over(w))
This particular example adds a new column to the DataFrame called id that contains row numbers for each row, grouped by team and sorted by the values in the points column in descending order.
The following example shows how to use this syntax in practice.
Example: How to Use Window.orderBy() Descending in PySpark
Suppose we have the following PySpark DataFrame that contains information about basketball players on various teams:
from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() #define data data = [['A', 'Guard', 11], ['A', 'Guard', 8], ['A', 'Forward', 21], ['A', 'Forward', 22], ['A', 'Forward', 30], ['B', 'Guard', 14], ['B', 'Guard', 14], ['B', 'Forward', 13], ['B', 'Forward', 7]] #define column names columns = ['team', 'position', 'points'] #create dataframe using data and column names df = spark.createDataFrame(data, columns) #view dataframe df.show() +----+--------+------+ |team|position|points| +----+--------+------+ | A| Guard| 11| | A| Guard| 8| | A| Forward| 21| | A| Forward| 22| | A| Forward| 30| | B| Guard| 14| | B| Guard| 14| | B| Forward| 13| | B| Forward| 7| +----+--------+------+
Suppose we would like to add a new column named id that contains row numbers for each row in the DataFrame, grouped by the team column and sorted based on the values in the points column in descending order.
We can use the following syntax to do so:
from pyspark.sql.functions import row_number, desc from pyspark.sql.window import Window #specify window w = Window.partitionBy('team').orderBy(desc('points')) #add column called 'id' that contains row numbers df = df.withColumn('id', row_number().over(w)) #view DataFrame df.show() +----+--------+------+---+ |team|position|points| id| +----+--------+------+---+ | A| Forward| 30| 1| | A| Forward| 22| 2| | A| Forward| 21| 3| | A| Guard| 11| 4| | A| Guard| 8| 5| | B| Guard| 14| 1| | B| Guard| 14| 2| | B| Forward| 13| 3| | B| Forward| 7| 4| +----+--------+------+---+
The new column named id contains row numbers for each row in the DataFrame, grouped by the team column and sorted based on the values in the points column in descending order.
For example:
- The row with the largest points value for team A receives an id value of 1.
- The row with the next largest points value for team A receives an id value of 2.
And so on.
Note #1: If we didn’t use the desc function within the orderBy function, the row numbers would have been assigned based on the values in the points column in ascending order instead.
Note #2: You can find the complete documentation for the PySpark Window.orderBy function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column