PySpark: How to Use groupBy and Concatenate Strings


You can use the following syntax to group by one column in a PySpark DataFrame and then concatenate together the strings that belong to the same group:

import pyspark.sql.functions as F

#group by store and concatenate list of employee names
df_new = df.groupBy('store')\
           .agg(F.concat_ws(', ', F.collect_list(df.employee))
           .alias('employee_names'))

This particular example groups the rows of the DataFrame based on the values in the store column and then concatenates all of the strings in the employee column that belong to the same group.

The following example shows how to use this syntax in practice.

Example: How to Use groupBy and Concatenate Strings in PySpark

Suppose we have the following PySpark DataFrame that contains information about employees at some company:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 1, 'Andy'], 
        ['A', 1, 'Bob'], 
        ['A', 2, 'Chad'], 
        ['B', 2, 'Diane'], 
        ['B', 1, 'Eric'],
        ['B', 4, 'Frida'],
        ['C', 2, 'Greg'],
        ['C', 3, 'Henry']]
  
#define column names
columns = ['store', 'quarter', 'employee'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+-------+--------+
|store|quarter|employee|
+-----+-------+--------+
|    A|      1|    Andy|
|    A|      1|     Bob|
|    A|      2|    Chad|
|    B|      2|   Diane|
|    B|      1|    Eric|
|    B|      4|   Frida|
|    C|      2|    Greg|
|    C|      3|   Henry|
+-----+-------+--------+

Suppose we would like to group by the values in the store column and then concatenate together each of the strings in the employee column that belong to the same group.

We can use the following syntax to do so:

import pyspark.sql.functions as F

#group by store and concatenate list of employee names
df_new = df.groupBy('store')\
           .agg(F.concat_ws(', ', F.collect_list(df.employee))
           .alias('employee_names'))

#view new DataFrame
df_new.show()

+-----+------------------+
|store|    employee_names|
+-----+------------------+
|    A|   Andy, Bob, Chad|
|    B|Diane, Eric, Frida|
|    C|       Greg, Henry|
+-----+------------------+

The new employee_names column uses a comma to concatenate together the names of the employees who work at the same store.

For example:

  • Andy, Bob and Chad all work at store A.
  • Diane, Eric and Frida all work at store B.
  • Greg and Henry both work at store C.
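
One caveat worth keeping in mind: collect_list does not guarantee the order of the collected values, since the order depends on how Spark partitions and shuffles the rows. If you need the names in a deterministic order, one option is to wrap the collected list in the sort_array function to sort the names alphabetically before concatenating them. Here is a minimal sketch of that approach:

import pyspark.sql.functions as F

#sort the collected names alphabetically so the output is deterministic
df_sorted = df.groupBy('store')\
              .agg(F.concat_ws(', ', F.sort_array(F.collect_list('employee')))
              .alias('employee_names'))

df_sorted.show()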

Note that we used the concat_ws function to concatenate together the employee names using a comma as a separator.

However, we could specify a different separator to use when concatenating the strings if we’d like.

For example, we can use the following syntax to concatenate the names of the employees using the & symbol as the separator:

import pyspark.sql.functions as F

#group by store and concatenate list of employee names
df_new = df.groupBy('store')\
           .agg(F.concat_ws(' & ', F.collect_list(df.employee))
           .alias('employee_names'))

#view new DataFrame
df_new.show()

+-----+--------------------+
|store|      employee_names|
+-----+--------------------+
|    A|   Andy & Bob & Chad|
|    B|Diane & Eric & Frida|
|    C|        Greg & Henry|
+-----+--------------------+

The employee_names column now uses the & symbol to concatenate together the names of the employees who work at the same store.
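
Also note that collect_list keeps every value, including duplicates. If the same employee name could appear more than once within a store and you only want each name listed once, you could swap in the collect_set function instead, which collects only distinct values (though it does not preserve any particular order). Here is a minimal sketch:

import pyspark.sql.functions as F

#use collect_set instead of collect_list to drop duplicate names
df_distinct = df.groupBy('store')\
                .agg(F.concat_ws(', ', F.collect_set('employee'))
                .alias('employee_names'))

df_distinct.show()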

Note: We used the alias function to specify a name for the new column that contains the concatenated employee names.
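
If you prefer SQL syntax, the same aggregation can also be written in Spark SQL, since concat_ws and collect_list are available as built-in SQL functions. Here is a minimal sketch that registers the DataFrame as a temporary view (the view name emp is arbitrary):

#register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('emp')

#equivalent aggregation written in Spark SQL
df_sql = spark.sql("""
    SELECT store,
           concat_ws(', ', collect_list(employee)) AS employee_names
    FROM emp
    GROUP BY store
""")

df_sql.show()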

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Use groupBy on Multiple Columns
PySpark: How to Use groupBy with Count Distinct
PySpark: How to Calculate Percentage of Total with groupBy
