PySpark: How to Order by Multiple Columns


You can use the following syntax to order a PySpark DataFrame by multiple columns:

df.orderBy(['team', 'position', 'points']).show()

This particular example will order the rows of the DataFrame based on the values in the team column, then by the values in the position column, then by the values in the points column.

By default, the orderBy function sorts every column in ascending order (smallest to largest). To sort all of the listed columns in descending order instead, pass ascending=False:

df.orderBy(['team', 'position', 'points'], ascending=False).show()

The following example shows how to order by multiple columns in practice.

Example: How to Order PySpark DataFrame by Multiple Columns

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 4], 
        ['A', 'Forward', 8, 5], 
        ['B', 'Guard', 22, 6], 
        ['A', 'Forward', 22, 7], 
        ['C', 'Guard', 14, 12], 
        ['A', 'Guard', 14, 8],
        ['B', 'Forward', 13, 9],
        ['B', 'Center', 7, 9]]
  
#define column names
columns = ['team', 'position', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      4|
|   A| Forward|     8|      5|
|   B|   Guard|    22|      6|
|   A| Forward|    22|      7|
|   C|   Guard|    14|     12|
|   A|   Guard|    14|      8|
|   B| Forward|    13|      9|
|   B|  Center|     7|      9|
+----+--------+------+-------+

We can use the following syntax to order the rows based on the values in the team column, then by the values in the position column, then by the values in the points column:

df.orderBy(['team', 'position', 'points']).show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A| Forward|     8|      5|
|   A| Forward|    22|      7|
|   A|   Guard|    11|      4|
|   A|   Guard|    14|      8|
|   B|  Center|     7|      9|
|   B| Forward|    13|      9|
|   B|   Guard|    22|      6|
|   C|   Guard|    14|     12|
+----+--------+------+-------+

Notice that the rows are now ordered in the following way:

  • First, by the values in the team column (from A to Z)
  • Then, by the values in the position column (from A to Z)
  • Then, by the values in the points column (from smallest to largest)
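
To make the tie-breaking rule concrete, here is a minimal sketch in plain Python (not PySpark) that reproduces the same ordering with a tuple sort key. The names rows and ordered are illustrative only, and the snippet assumes the same eight rows as the DataFrame above; it is meant to illustrate the comparison logic, not how Spark sorts internally.

```python
# rows as (team, position, points, assists), same values as the DataFrame above
rows = [('A', 'Guard', 11, 4),
        ('A', 'Forward', 8, 5),
        ('B', 'Guard', 22, 6),
        ('A', 'Forward', 22, 7),
        ('C', 'Guard', 14, 12),
        ('A', 'Guard', 14, 8),
        ('B', 'Forward', 13, 9),
        ('B', 'Center', 7, 9)]

# tuple key: compare team first, use position only when teams tie,
# and use points only when both team and position tie
ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))

#view sorted rows
for r in ordered:
    print(r)
```

Later columns only matter for rows that are tied on every earlier column, which is why the order of the column names passed to orderBy matters.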

We can also pass ascending=False to sort all three columns in descending order:

df.orderBy(['team', 'position', 'points'], ascending=False).show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   C|   Guard|    14|     12|
|   B|   Guard|    22|      6|
|   B| Forward|    13|      9|
|   B|  Center|     7|      9|
|   A|   Guard|    14|      8|
|   A|   Guard|    11|      4|
|   A| Forward|    22|      7|
|   A| Forward|     8|      5|
+----+--------+------+-------+

Notice that the rows are now ordered in the following way:

  • First, by the values in the team column (from Z to A)
  • Then, by the values in the position column (from Z to A)
  • Then, by the values in the points column (from largest to smallest)

Note: The complete documentation for the orderBy function is available in the official PySpark API reference under DataFrame.orderBy.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Reorder Columns
