PySpark: How to Split String in Column and Get Last Item


You can use the following syntax to split a string column in a PySpark DataFrame and get the last item resulting from the split:

from pyspark.sql.functions import split, col, size

#create new column that contains only last item from employees column
df_new = df.withColumn('last', split('employees', ' '))\
           .withColumn('last', col('last')[size('last') - 1])

This particular example splits each string in the employees column into an array using a space as the delimiter, then extracts the last item of that array and stores it in a new column named last. Because array indexing is zero-based, the last item sits at index size - 1.
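
To make the indexing step concrete, here is a minimal plain-Python sketch of the same per-row logic. It does not use Spark at all, and last_item is a hypothetical helper name chosen for illustration. Note that PySpark's split treats the delimiter as a regular expression while Python's str.split treats it literally; for a single space the two behave the same.

```python
# Plain-Python sketch of the per-row logic (no Spark required).
# last_item is a hypothetical helper, not part of PySpark.
def last_item(text, delimiter=" "):
    # analogous to split('employees', ' '): string -> list of items
    parts = text.split(delimiter)
    # analogous to indexing the array at size - 1 (zero-based)
    return parts[len(parts) - 1]

rows = ["Andy Bob Chad", "Doug Eric"]
print([last_item(r) for r in rows])  # ['Chad', 'Eric']
```

The index is computed from each row's own length, which is why the approach works when rows contain different numbers of items.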

The following example shows how to use this syntax in practice.

Example: Split String and Get Last Item in PySpark

Suppose we have the following PySpark DataFrame that contains information about employee names and total sales at various companies:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Andy Bob Chad', 200],
        ['Doug Eric', 139],
        ['Frank Greg Henry', 187],
        ['Ian John Ken Liam', 349]]
  
#define column names
columns = ['employees', 'sales'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----------------+-----+
|        employees|sales|
+-----------------+-----+
|    Andy Bob Chad|  200|
|        Doug Eric|  139|
| Frank Greg Henry|  187|
|Ian John Ken Liam|  349|
+-----------------+-----+

Suppose we would like to split the strings in the employees column and display the last item resulting from each split in a new column.

We can use the following syntax to do so:

from pyspark.sql.functions import split, col, size

#create new column that contains only last item from employees column
df_new = df.withColumn('last', split('employees', ' '))\
           .withColumn('last', col('last')[size('last') - 1])

#view new DataFrame
df_new.show()

+-----------------+-----+-----+
|        employees|sales| last|
+-----------------+-----+-----+
|    Andy Bob Chad|  200| Chad|
|        Doug Eric|  139| Eric|
| Frank Greg Henry|  187|Henry|
|Ian John Ken Liam|  349| Liam|
+-----------------+-----+-----+

Notice that the new column named last contains the last name from each of the lists in the employees column.

Also note that this syntax was able to get the last item from each list even though the lists had different lengths.

Note: The complete documentation for the PySpark split function is available in the official PySpark API reference under pyspark.sql.functions.split.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Concatenate Columns
PySpark: How to Check if Column Contains String
PySpark: How to Replace String in Column
PySpark: How to Convert String to Integer
