PySpark: How to Split String Column into Multiple Columns


You can use the following syntax to split a string column into multiple columns in a PySpark DataFrame:

from pyspark.sql.functions import split

#split team column using dash as delimiter
df_new = df.withColumn('location', split(df.team, '-').getItem(0)) \
           .withColumn('name', split(df.team, '-').getItem(1))

This particular example uses the split function to split the string in the team column of the DataFrame into two new columns called location and name based on where the dash occurs in the string.

The following example shows how to use this syntax in practice.

Example: Split String into Multiple Columns in PySpark

Suppose we have the following PySpark DataFrame that contains information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas-Mavs', 18], 
        ['Brooklyn-Nets', 33], 
        ['LA-Lakers', 12], 
        ['Houston-Rockets', 15], 
        ['Atlanta-Hawks', 19],
        ['Boston-Celtics', 24],
        ['Orlando-Magic', 28]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+---------------+------+
|           team|points|
+---------------+------+
|    Dallas-Mavs|    18|
|  Brooklyn-Nets|    33|
|      LA-Lakers|    12|
|Houston-Rockets|    15|
|  Atlanta-Hawks|    19|
| Boston-Celtics|    24|
|  Orlando-Magic|    28|
+---------------+------+

Suppose we would like to split the strings in the team column into two new columns based on where the dash occurs in the strings.

We can use the following syntax to do so:

from pyspark.sql.functions import split

#split team column using dash as delimiter
df_new = df.withColumn('location', split(df.team, '-').getItem(0)) \
           .withColumn('name', split(df.team, '-').getItem(1))

#view new DataFrame
df_new.show()

+---------------+------+--------+-------+
|           team|points|location|   name|
+---------------+------+--------+-------+
|    Dallas-Mavs|    18|  Dallas|   Mavs|
|  Brooklyn-Nets|    33|Brooklyn|   Nets|
|      LA-Lakers|    12|      LA| Lakers|
|Houston-Rockets|    15| Houston|Rockets|
|  Atlanta-Hawks|    19| Atlanta|  Hawks|
| Boston-Celtics|    24|  Boston|Celtics|
|  Orlando-Magic|    28| Orlando|  Magic|
+---------------+------+--------+-------+

Notice that the strings in the team column have been split into two new columns called location and name based on where the dash occurred in the string.

Note that we used the split function to split each string, which resulted in two new strings.

We then used getItem(0) to extract the first string and getItem(1) to extract the second string for each team.

Note: You can find the complete documentation for the PySpark split function here.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Concatenate Columns
PySpark: How to Check if Column Contains String
PySpark: How to Replace String in Column
PySpark: How to Convert String to Integer

Leave a Reply

Your email address will not be published. Required fields are marked *