You can use the following syntax to order a PySpark DataFrame by multiple columns:
df.orderBy(['team', 'position', 'points']).show()
This particular example will order the rows of the DataFrame first by the values in the team column, then by the values in the position column, then by the values in the points column.
The orderBy function sorts values in ascending order (smallest to largest) by default, but you can use the following syntax to instead sort values in descending order:
df.orderBy(['team', 'position', 'points'], ascending=False).show()
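The ascending argument also accepts a list of booleans, one per column, which lets you mix sort directions. As a minimal sketch (using the same DataFrame created in the example below), the following would sort team and position in ascending order but points in descending order:

#sort team and position ascending, points descending
df.orderBy(['team', 'position', 'points'], ascending=[True, True, False]).show()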
The following example shows how to order by multiple columns in practice.
Example: How to Order PySpark DataFrame by Multiple Columns
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 4],
        ['A', 'Forward', 8, 5],
        ['B', 'Guard', 22, 6],
        ['A', 'Forward', 22, 7],
        ['C', 'Guard', 14, 12],
        ['A', 'Guard', 14, 8],
        ['B', 'Forward', 13, 9],
        ['B', 'Center', 7, 9]]

#define column names
columns = ['team', 'position', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A|   Guard|    11|      4|
|   A| Forward|     8|      5|
|   B|   Guard|    22|      6|
|   A| Forward|    22|      7|
|   C|   Guard|    14|     12|
|   A|   Guard|    14|      8|
|   B| Forward|    13|      9|
|   B|  Center|     7|      9|
+----+--------+------+-------+
We can use the following syntax to order the rows first by the values in the team column, then by the values in the position column, then by the values in the points column:
df.orderBy(['team', 'position', 'points']).show()
+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   A| Forward|     8|      5|
|   A| Forward|    22|      7|
|   A|   Guard|    11|      4|
|   A|   Guard|    14|      8|
|   B|  Center|     7|      9|
|   B| Forward|    13|      9|
|   B|   Guard|    22|      6|
|   C|   Guard|    14|     12|
+----+--------+------+-------+
Notice that the rows are now ordered in the following way:
- First, by the values in the team column (from A to Z)
- Then, by the values in the position column (from A to Z)
- Then, by the values in the points column (from smallest to largest)
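As a side note, the PySpark sort function is an alias for orderBy, so the following call should produce the same output as above:

#sort() is an alias for orderBy() and produces the same ordering
df.sort(['team', 'position', 'points']).show()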
We could also use the argument ascending=False to instead sort the rows in descending order:
df.orderBy(['team', 'position', 'points'], ascending=False).show()
+----+--------+------+-------+
|team|position|points|assists|
+----+--------+------+-------+
|   C|   Guard|    14|     12|
|   B|   Guard|    22|      6|
|   B| Forward|    13|      9|
|   B|  Center|     7|      9|
|   A|   Guard|    14|      8|
|   A|   Guard|    11|      4|
|   A| Forward|    22|      7|
|   A| Forward|     8|      5|
+----+--------+------+-------+
Notice that the rows are now ordered in the following way:
- First, by the values in the team column (from Z to A)
- Then, by the values in the position column (from Z to A)
- Then, by the values in the points column (from largest to smallest)
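If you only want some of the columns sorted in descending order, one alternative to the ascending argument (shown here as a sketch, not part of the original example) is to pass Column expressions with asc() and desc():

from pyspark.sql.functions import col

#sort team ascending, but position and points descending
df.orderBy(col('team').asc(), col('position').desc(), col('points').desc()).show()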
Note: You can find the complete documentation for the PySpark orderBy function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Reorder Columns