You can use the melt function with the following basic syntax to convert a PySpark DataFrame from a wide format to a long format:
df_long = df.melt(ids=['team'], values=['Guard', 'Forward', 'Center'], variableColumnName='position', valueColumnName='points')
This particular example converts a wide DataFrame named df to a long DataFrame named df_long.
The following example shows how to use this syntax in practice.
Related: The Difference Between Wide vs. Long Data
Example: Reshape PySpark DataFrame from Wide to Long
Suppose we have the following PySpark DataFrame in a wide format:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 22, 34, 17], ['B', 25, 10, 12]]

#define column names
columns = ['team', 'Guard', 'Forward', 'Center']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+-----+-------+------+
|team|Guard|Forward|Center|
+----+-----+-------+------+
|   A|   22|     34|    17|
|   B|   25|     10|    12|
+----+-----+-------+------+
We can use the following syntax to reshape this DataFrame from a wide format to a long format:
#create long DataFrame
df_long = df.melt(ids=['team'], values=['Guard', 'Forward', 'Center'],
                  variableColumnName='position',
                  valueColumnName='points')
#view long DataFrame
df_long.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    22|
|   A| Forward|    34|
|   A|  Center|    17|
|   B|   Guard|    25|
|   B| Forward|    10|
|   B|  Center|    12|
+----+--------+------+
The DataFrame is now in a long format.
Each team now appears in multiple rows, the position names are stored as values in the second column, and the points values are shown in the third column.
Note that we used the arguments variableColumnName and valueColumnName to specify the names to use for the second and third columns.
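Also note that the values argument is optional. Based on the PySpark documentation, if it is left as None then every column that is not listed in ids gets melted, which produces the same result in this example since team is the only id column. A minimal sketch:

#melt all non-id columns (equivalent result for this DataFrame)
df_long = df.melt(ids=['team'], values=None,
                  variableColumnName='position',
                  valueColumnName='points')

#view long DataFrame
df_long.show()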
Note: You can find the complete documentation for the PySpark melt function here.
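Keep in mind that melt was introduced in Spark 3.4 as an alias for the unpivot function, so the following code with the same arguments should produce an identical long DataFrame:

#unpivot produces the same long DataFrame as melt
df_long = df.unpivot(ids=['team'], values=['Guard', 'Forward', 'Center'],
                     variableColumnName='position',
                     valueColumnName='points')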
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Create a Pivot Table in PySpark
How to Unpivot a PySpark DataFrame
How to Sort Pivot Table by Column in PySpark