You can use the following methods to concatenate strings from multiple columns in PySpark:
Method 1: Concatenate Columns
from pyspark.sql.functions import concat df_new = df.withColumn('team', concat(df.location, df.name))
This particular example uses the concat function to concatenate together the strings in the location and name columns into a new column called team.
Method 2: Concatenate Columns with Separator
from pyspark.sql.functions import concat_ws df_new = df.withColumn('team', concat_ws(' ', df.location, df.name))
This particular example uses the concat_ws function to concatenate together the strings in the location and name columns into a new column called team, using a space as a separator between the strings.
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Dallas', 'Mavs', 18],
['Brooklyn', 'Nets', 33],
['LA', 'Lakers', 12],
['Boston', 'Celtics', 15],
['Houston', 'Rockets', 19],
['Washington', 'Wizards', 24],
['Orlando', 'Magic', 28]]
#define column names
columns = ['location', 'name', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----------+-------+------+
| location| name|points|
+----------+-------+------+
| Dallas| Mavs| 18|
| Brooklyn| Nets| 33|
| LA| Lakers| 12|
| Boston|Celtics| 15|
| Houston|Rockets| 19|
|Washington|Wizards| 24|
| Orlando| Magic| 28|
+----------+-------+------+
Example 1: Concatenate Columns in PySpark
We can use the following syntax to concatenate together the strings in the location and name columns into a new column called team:
from pyspark.sql.functions import concat #concatenate strings in location and name columns df_new = df.withColumn('team', concat(df.location, df.name)) #view new DataFrame df_new.show() +----------+-------+------+-----------------+ | location| name|points| team| +----------+-------+------+-----------------+ | Dallas| Mavs| 18| DallasMavs| | Brooklyn| Nets| 33| BrooklynNets| | LA| Lakers| 12| LALakers| | Boston|Celtics| 15| BostonCeltics| | Houston|Rockets| 19| HoustonRockets| |Washington|Wizards| 24|WashingtonWizards| | Orlando| Magic| 28| OrlandoMagic| +----------+-------+------+-----------------+
The new team column concatenates together the strings in the location and name columns.
Note: You can find the complete documentation for the PySpark concat function here.
Example 2: Concatenate Columns with Separator in PySpark
We can use the following syntax to concatenate together the strings in the location and name columns into a new column called team, using a space as a separator:
from pyspark.sql.functions import concat_ws #concatenate strings in location and name columns, using space as separator df_new = df.withColumn('team', concat_ws(' ', df.location, df.name)) #view new DataFrame df_new.show() +----------+-------+------+------------------+ | location| name|points| team| +----------+-------+------+------------------+ | Dallas| Mavs| 18| Dallas Mavs| | Brooklyn| Nets| 33| Brooklyn Nets| | LA| Lakers| 12| LA Lakers| | Boston|Celtics| 15| Boston Celtics| | Houston|Rockets| 19| Houston Rockets| |Washington|Wizards| 24|Washington Wizards| | Orlando| Magic| 28| Orlando Magic| +----------+-------+------+------------------+
The new team column concatenates together the strings in the location and name columns, using a space as a separator.
Note: You can find the complete documentation for the PySpark concat_ws function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Check if Column Contains String
PySpark: How to Replace String in Column
PySpark: How to Convert String to Integer