How to Concatenate Columns in PySpark (With Examples)


You can use the following methods to concatenate strings from multiple columns in PySpark:

Method 1: Concatenate Columns

from pyspark.sql.functions import concat

df_new = df.withColumn('team', concat(df.location, df.name))

This example uses the concat function to join the strings in the location and name columns into a new column called team, with no separator between them.

Method 2: Concatenate Columns with Separator

from pyspark.sql.functions import concat_ws

df_new = df.withColumn('team', concat_ws(' ', df.location, df.name))

This example uses the concat_ws function to join the strings in the location and name columns into a new column called team, with a space as the separator between them.

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Dallas', 'Mavs', 18], 
        ['Brooklyn', 'Nets', 33], 
        ['LA', 'Lakers', 12], 
        ['Boston', 'Celtics', 15], 
        ['Houston', 'Rockets', 19],
        ['Washington', 'Wizards', 24],
        ['Orlando', 'Magic', 28]] 
  
#define column names
columns = ['location', 'name', 'points'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----------+-------+------+
|  location|   name|points|
+----------+-------+------+
|    Dallas|   Mavs|    18|
|  Brooklyn|   Nets|    33|
|        LA| Lakers|    12|
|    Boston|Celtics|    15|
|   Houston|Rockets|    19|
|Washington|Wizards|    24|
|   Orlando|  Magic|    28|
+----------+-------+------+


Example 1: Concatenate Columns in PySpark

We can use the following syntax to concatenate the strings in the location and name columns into a new column called team:

from pyspark.sql.functions import concat

#concatenate strings in location and name columns
df_new = df.withColumn('team', concat(df.location, df.name))

#view new DataFrame
df_new.show()

+----------+-------+------+-----------------+
|  location|   name|points|             team|
+----------+-------+------+-----------------+
|    Dallas|   Mavs|    18|       DallasMavs|
|  Brooklyn|   Nets|    33|     BrooklynNets|
|        LA| Lakers|    12|         LALakers|
|    Boston|Celtics|    15|    BostonCeltics|
|   Houston|Rockets|    19|   HoustonRockets|
|Washington|Wizards|    24|WashingtonWizards|
|   Orlando|  Magic|    28|     OrlandoMagic|
+----------+-------+------+-----------------+

The new team column contains the strings from the location and name columns joined together with nothing between them.

Note: The complete documentation for the concat function is available in the official PySpark API reference under pyspark.sql.functions.concat.

Example 2: Concatenate Columns with Separator in PySpark

We can use the following syntax to concatenate the strings in the location and name columns into a new column called team, using a space as a separator:

from pyspark.sql.functions import concat_ws

#concatenate strings in location and name columns, using space as separator
df_new = df.withColumn('team', concat_ws(' ', df.location, df.name)) 

#view new DataFrame
df_new.show()

+----------+-------+------+------------------+
|  location|   name|points|              team|
+----------+-------+------+------------------+
|    Dallas|   Mavs|    18|       Dallas Mavs|
|  Brooklyn|   Nets|    33|     Brooklyn Nets|
|        LA| Lakers|    12|         LA Lakers|
|    Boston|Celtics|    15|    Boston Celtics|
|   Houston|Rockets|    19|   Houston Rockets|
|Washington|Wizards|    24|Washington Wizards|
|   Orlando|  Magic|    28|     Orlando Magic|
+----------+-------+------+------------------+

The new team column contains the strings from the location and name columns joined together with a space between them.

Note: The complete documentation for the concat_ws function is available in the official PySpark API reference under pyspark.sql.functions.concat_ws.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Check if Column Contains String
PySpark: How to Replace String in Column
PySpark: How to Convert String to Integer
