You can use the following methods to multiply two columns in a PySpark DataFrame:
Method 1: Multiply Two Columns
df_new = df.withColumn('revenue', df.price * df.amount)
This particular example creates a new column called revenue that multiplies the values in the price and amount columns.
Method 2: Multiply Two Columns Based on Condition
from pyspark.sql.functions import when

df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                                  .otherwise(df.price * df.amount))
This particular example creates a new column called revenue that returns 0 if the value in the type column is ‘refund’ and otherwise returns the product of the values in the price and amount columns.
The following examples show how to use each method in practice.
Example 1: Multiply Two Columns
Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store and the amount sold:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [[14, 2],
[10, 3],
[20, 4],
[12, 3],
[7, 3],
[12, 5],
[10, 2],
[10, 3]]
#define column names
columns = ['price', 'amount']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-----+------+
|price|amount|
+-----+------+
| 14| 2|
| 10| 3|
| 20| 4|
| 12| 3|
| 7| 3|
| 12| 5|
| 10| 2|
| 10| 3|
+-----+------+
We can use the following syntax to create a new column called revenue that multiplies the values in the price and amount columns:
#create new column called 'revenue' that multiplies price by amount
df_new = df.withColumn('revenue', df.price * df.amount)

#view new DataFrame
df_new.show()

+-----+------+-------+
|price|amount|revenue|
+-----+------+-------+
|   14|     2|     28|
|   10|     3|     30|
|   20|     4|     80|
|   12|     3|     36|
|    7|     3|     21|
|   12|     5|     60|
|   10|     2|     20|
|   10|     3|     30|
+-----+------+-------+
Notice that the values in the new revenue column are the product of the values in the price and amount columns.
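Note that df.price * df.amount references the columns as attributes of the DataFrame. If you prefer to reference columns by name instead, an equivalent version of this syntax uses the col function (a minimal sketch, assuming the same df from above):

from pyspark.sql.functions import col

#equivalent syntax: reference columns by name with col()
df_new = df.withColumn('revenue', col('price') * col('amount'))

This produces the same revenue column and can be handy when a column name contains spaces or clashes with an existing DataFrame attribute.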
Example 2: Multiply Two Columns Based on Condition
Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store, the amount sold, and whether each transaction was a sale or a refund:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [[14, 2, 'sale'],
[10, 3, 'sale'],
[20, 4, 'refund'],
[12, 3, 'sale'],
[7, 3, 'refund'],
[12, 5, 'refund'],
[10, 2, 'sale'],
[10, 3, 'sale']]
#define column names
columns = ['price', 'amount', 'type']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-----+------+------+
|price|amount| type|
+-----+------+------+
| 14| 2| sale|
| 10| 3| sale|
| 20| 4|refund|
| 12| 3| sale|
| 7| 3|refund|
| 12| 5|refund|
| 10| 2| sale|
| 10| 3| sale|
+-----+------+------+
We can use the following syntax to create a new column called revenue that returns 0 if the value in the type column is ‘refund’, and otherwise returns the product of the values in the price and amount columns:
from pyspark.sql.functions import when

#create new column called 'revenue'
df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                                  .otherwise(df.price * df.amount))

#view new DataFrame
df_new.show()

+-----+------+------+-------+
|price|amount|  type|revenue|
+-----+------+------+-------+
|   14|     2|  sale|     28|
|   10|     3|  sale|     30|
|   20|     4|refund|      0|
|   12|     3|  sale|     36|
|    7|     3|refund|      0|
|   12|     5|refund|      0|
|   10|     2|  sale|     20|
|   10|     3|  sale|     30|
+-----+------+------+-------+
Notice that the values in the new revenue column are dependent on the corresponding values in the type column.
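As a side note, when() calls can be chained to handle more than two cases. The following sketch assumes a hypothetical third value called ‘discount’ in the type column, which is not present in the example data:

from pyspark.sql.functions import when

#0 for refunds, half revenue for hypothetical 'discount' rows, full revenue otherwise
df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                                  .when(df.type=='discount', df.price * df.amount * 0.5)
                                  .otherwise(df.price * df.amount))

The conditions are checked in order and the first one that matches determines the value for that row.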
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Replace Zero with Null
PySpark: How to Replace String in Column
PySpark: How to Check Data Type of Columns in DataFrame