How to Multiply Two Columns in PySpark (With Examples)


You can use the following methods to multiply two columns in a PySpark DataFrame:

Method 1: Multiply Two Columns

df_new = df.withColumn('revenue', df.price * df.amount)

This particular example creates a new column called revenue that multiplies the values in the price and amount columns.
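
Note: you can also reference the columns with the col() function instead of dot notation, which avoids issues when a column name contains spaces or collides with a DataFrame attribute. An equivalent sketch:

from pyspark.sql.functions import col

#equivalent sketch: reference columns with col() instead of dot notation
df_new = df.withColumn('revenue', col('price') * col('amount'))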

Method 2: Multiply Two Columns Based on Condition

from pyspark.sql.functions import when

df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                       .otherwise(df.price * df.amount))

This particular example creates a new column called revenue that returns 0 if the value in the type column is 'refund', and otherwise returns the product of the values in the price and amount columns.
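
If you need to check more than one condition, you can chain several when() calls before the final otherwise(). A hypothetical sketch that also zeroes out a 'void' transaction type (not one of the types used in the examples below):

from pyspark.sql.functions import when

#hypothetical sketch: chain multiple when() conditions before otherwise()
df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                                  .when(df.type=='void', 0)
                                  .otherwise(df.price * df.amount))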

The following examples show how to use each method in practice.

Example 1: Multiply Two Columns

Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store and the amount sold:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 2], 
        [10, 3], 
        [20, 4], 
        [12, 3], 
        [7, 3],
        [12, 5],
        [10, 2],
        [10, 3]]
  
#define column names
columns = ['price', 'amount'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+
|price|amount|
+-----+------+
|   14|     2|
|   10|     3|
|   20|     4|
|   12|     3|
|    7|     3|
|   12|     5|
|   10|     2|
|   10|     3|
+-----+------+
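
Note that both columns are numeric (Spark infers Python integers as long), so they can be multiplied directly. If a column had instead been read in as a string, for example from a CSV file, you would need to cast it first. A minimal sketch:

#hypothetical sketch: cast a string column to an integer type before multiplying
df = df.withColumn('price', df.price.cast('int'))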

We can use the following syntax to create a new column called revenue that multiplies the values in the price and amount columns:

#create new column called 'revenue' that multiplies price by amount
df_new = df.withColumn('revenue', df.price * df.amount)

#view new DataFrame
df_new.show()

+-----+------+-------+
|price|amount|revenue|
+-----+------+-------+
|   14|     2|     28|
|   10|     3|     30|
|   20|     4|     80|
|   12|     3|     36|
|    7|     3|     21|
|   12|     5|     60|
|   10|     2|     20|
|   10|     3|     30|
+-----+------+-------+

Notice that the values in the new revenue column are the product of the values in the price and amount columns.
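
The same result can also be produced without withColumn() by using select() with alias(), or a SQL expression via expr(). Two equivalent sketches, assuming the same df as above:

from pyspark.sql.functions import expr

#alternative sketch: add the product column with select() and alias()
df_new = df.select('*', (df.price * df.amount).alias('revenue'))

#alternative sketch: compute the product with a SQL expression
df_new = df.withColumn('revenue', expr('price * amount'))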

Example 2: Multiply Two Columns Based on Condition

Suppose we have the following PySpark DataFrame that contains information about the price of various items at some store, the amount sold, and whether the transaction was a sale or a refund:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [[14, 2, 'sale'], 
        [10, 3, 'sale'], 
        [20, 4, 'refund'], 
        [12, 3, 'sale'], 
        [7, 3, 'refund'],
        [12, 5, 'refund'],
        [10, 2, 'sale'],
        [10, 3, 'sale']]
  
#define column names
columns = ['price', 'amount', 'type'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+-----+------+------+
|price|amount|  type|
+-----+------+------+
|   14|     2|  sale|
|   10|     3|  sale|
|   20|     4|refund|
|   12|     3|  sale|
|    7|     3|refund|
|   12|     5|refund|
|   10|     2|  sale|
|   10|     3|  sale|
+-----+------+------+

We can use the following syntax to create a new column called revenue that returns 0 if the value in the type column is 'refund' and the product of the values in the price and amount columns otherwise:

from pyspark.sql.functions import when

#create new column called 'revenue'
df_new = df.withColumn('revenue', when(df.type=='refund', 0)
                       .otherwise(df.price * df.amount))

#view new DataFrame
df_new.show()

+-----+------+------+-------+
|price|amount|  type|revenue|
+-----+------+------+-------+
|   14|     2|  sale|     28|
|   10|     3|  sale|     30|
|   20|     4|refund|      0|
|   12|     3|  sale|     36|
|    7|     3|refund|      0|
|   12|     5|refund|      0|
|   10|     2|  sale|     20|
|   10|     3|  sale|     30|
+-----+------+------+-------+

Notice that the values in the new revenue column are dependent on the corresponding values in the type column.
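
Once the revenue column exists, it can feed directly into aggregations. For example, a sketch that totals revenue by transaction type, assuming the df_new created above:

from pyspark.sql import functions as F

#hypothetical follow-up: total revenue for each transaction type
df_new.groupBy('type').agg(F.sum('revenue').alias('total_revenue')).show()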

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Replace Zero with Null
PySpark: How to Replace String in Column
PySpark: How to Check Data Type of Columns in DataFrame
