You can use the following syntax to explode a column that contains arrays in a PySpark DataFrame into multiple rows:
from pyspark.sql.functions import explode

#explode points column into rows
df_new = df.withColumn('points', explode(df.points))
This particular example explodes the arrays in the points column of a DataFrame into multiple rows.
The following example shows how to use this syntax in practice.
Example: How to Explode Array into Rows in a PySpark DataFrame
Suppose we have the following PySpark DataFrame that contains information about points scored in three different games by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'Guard', [11, 8, 25]],
['A', 'Forward', [14, 20, 22]],
['B', 'Guard', [21, 30, 6]],
['B', 'Forward', [22, 12, 34]]]
#define column names
columns = ['team', 'position', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+--------+------------+
|team|position| points|
+----+--------+------------+
| A| Guard| [11, 8, 25]|
| A| Forward|[14, 20, 22]|
| B| Guard| [21, 30, 6]|
| B| Forward|[22, 12, 34]|
+----+--------+------------+
Notice that the points column currently contains arrays.
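To confirm this, one quick check is to inspect the DataFrame's schema, which should show points as an array column:

#view schema of DataFrame
df.printSchema()

root
 |-- team: string (nullable = true)
 |-- position: string (nullable = true)
 |-- points: array (nullable = true)
 |    |-- element: long (containsNull = true)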
We can use the following syntax to explode the values from each of these arrays into their own rows:
from pyspark.sql.functions import explode

#explode points column into rows
df_new = df.withColumn('points', explode(df.points))

#view new DataFrame
df_new.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
|   A|   Guard|    11|
|   A|   Guard|     8|
|   A|   Guard|    25|
|   A| Forward|    14|
|   A| Forward|    20|
|   A| Forward|    22|
|   B|   Guard|    21|
|   B|   Guard|    30|
|   B|   Guard|     6|
|   B| Forward|    22|
|   B| Forward|    12|
|   B| Forward|    34|
+----+--------+------+
Notice that each value in the arrays from the points column has been exploded into its own row.
Note: You can find the complete documentation for the PySpark explode function in the official pyspark.sql.functions API reference.
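One related point worth knowing: explode drops any row where the array is null or empty. If you need to keep those rows, PySpark also provides an explode_outer function, which returns a null value instead of dropping the row. The sketch below assumes the same df created above:

from pyspark.sql.functions import explode_outer

#explode points column into rows, keeping rows where the array is null or empty
df_outer = df.withColumn('points', explode_outer(df.points))

Similarly, the posexplode function works like explode but also returns the position of each value within its original array.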
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: Get Rows Which Are Not in Another DataFrame
PySpark: How to Combine Rows with Same Column Values
PySpark: How to Drop Duplicate Rows from DataFrame