You can use the following syntax to coalesce the values from multiple columns into one in a PySpark DataFrame:
from pyspark.sql.functions import coalesce

#coalesce values from points, assists and rebounds columns
df = df.withColumn('coalesce', coalesce(df.points, df.assists, df.rebounds))
This particular example creates a new column named coalesce that coalesces the values from the points, assists and rebounds columns into one column.
Note that the coalesce function simply returns the first non-null value in each row among the columns that you specify.
The following example shows how to use this syntax in practice.
Example: Coalesce Values from Multiple Columns into One in PySpark
Suppose we have the following PySpark DataFrame that contains information about the points, assists and rebounds for various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [[None, None, 3],
[None, 7, 4],
[19, 7, None],
[None, 9, None],
[14, None, 6]]
#define column names
columns = ['points', 'assists', 'rebounds']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+------+-------+--------+
|points|assists|rebounds|
+------+-------+--------+
|  null|   null|       3|
|  null|      7|       4|
|    19|      7|    null|
|  null|      9|    null|
|    14|   null|       6|
+------+-------+--------+
We can use the following syntax to create a new column named coalesce that coalesces the values from the points, assists and rebounds columns:
from pyspark.sql.functions import coalesce

#coalesce values from points, assists and rebounds columns
df = df.withColumn('coalesce', coalesce(df.points, df.assists, df.rebounds))

#view updated DataFrame
df.show()

+------+-------+--------+--------+
|points|assists|rebounds|coalesce|
+------+-------+--------+--------+
|  null|   null|       3|       3|
|  null|      7|       4|       7|
|    19|      7|    null|      19|
|  null|      9|    null|       9|
|    14|   null|       6|      14|
+------+-------+--------+--------+
Here is how the value in the coalesce column was chosen:
- First row: The first non-null value was 3.
- Second row: The first non-null value was 7.
- Third row: The first non-null value was 19.
- Fourth row: The first non-null value was 9.
- Fifth row: The first non-null value was 14.
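The row-by-row choices above all follow one rule: the first non-null value wins, scanning the columns left to right. As a sanity check, here is a minimal plain-Python sketch of that logic (the helper function name is illustrative, not part of PySpark), using the same five rows as the DataFrame:

#Sketch of coalesce's row-wise behavior: return the first non-None value
rows = [
    (None, None, 3),
    (None, 7, 4),
    (19, 7, None),
    (None, 9, None),
    (14, None, 6),
]

def coalesce_row(values):
    """Return the first non-None value in values, or None if all are None."""
    return next((v for v in values if v is not None), None)

result = [coalesce_row(row) for row in rows]
print(result)  #[3, 7, 19, 9, 14]

Note that if every column in a row is null, PySpark's coalesce likewise returns null for that row.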
Note: You can find the complete documentation for the PySpark coalesce function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Check if Column Contains String
PySpark: How to Replace String in Column
PySpark: How to Convert String to Integer