PySpark: How to Reshape DataFrame from Long to Wide


You can use the following syntax to convert a PySpark DataFrame from a long format to a wide format:

df_wide = df.groupBy('team').pivot('player').sum('points')

In this syntax, the values from the team column will be shown along the rows, the distinct values from the player column will be used as the new column names, and the sum of the values from the points column will be used as the cell values in the wide DataFrame.

The following example shows how to use this syntax in practice.

Related: The Difference Between Wide vs. Long Data

Example: Reshape PySpark DataFrame from Long to Wide

Suppose we have the following PySpark DataFrame in a long format:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 18], 
        ['A', 2, 33], 
        ['A', 3, 12], 
        ['A', 4, 15], 
        ['B', 1, 19],
        ['B', 2, 24],
        ['B', 3, 28],
        ['B', 4, 16]]

#define column names
columns = ['team', 'player', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()

+----+------+------+
|team|player|points|
+----+------+------+
|   A|     1|    18|
|   A|     2|    33|
|   A|     3|    12|
|   A|     4|    15|
|   B|     1|    19|
|   B|     2|    24|
|   B|     3|    28|
|   B|     4|    16|
+----+------+------+

We can use the following syntax to reshape this DataFrame from a long format to a wide format:

#create wide DataFrame
df_wide = df.groupBy('team').pivot('player').sum('points')

#view wide DataFrame
df_wide.show()

+----+---+---+---+---+
|team|  1|  2|  3|  4|
+----+---+---+---+---+
|   B| 19| 24| 28| 16|
|   A| 18| 33| 12| 15|
+----+---+---+---+---+

The DataFrame is now in a wide format.

The team values are now shown along the rows, the player numbers are used as the column names, and the sum of the points values is shown inside the DataFrame.

Note that we could instead use player as the rows and team as the columns if we’d like:

#create wide DataFrame
df_wide = df.groupBy('player').pivot('team').sum('points')

#view wide DataFrame
df_wide.show()

+------+---+---+
|player|  A|  B|
+------+---+---+
|     1| 18| 19|
|     3| 12| 28|
|     2| 33| 24|
|     4| 15| 16|
+------+---+---+

This DataFrame is also in a wide format.

Note: You can find the complete documentation for the PySpark pivot function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Create a Pivot Table in PySpark
How to Unpivot a PySpark DataFrame
How to Sort Pivot Table by Column in PySpark
