You can use the following methods to add a new column with a constant value to a PySpark DataFrame:
Method 1: Add New Column with Constant Numeric Value
from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()
Method 2: Add New Column with Constant String Value
from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()
The examples below show how to use each method in practice with this PySpark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Example 1: Add New Column with Constant Numeric Value
We can use the following syntax to add a new column to the DataFrame called salary that contains a value of 100 for each row:
from pyspark.sql.functions import lit

#add new column called 'salary' with value of 100 for each row
df.withColumn('salary', lit(100)).show()

+----+----------+------+-------+------+
|team|conference|points|assists|salary|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   100|
|   A|      East|     8|      9|   100|
|   A|      East|    10|      3|   100|
|   B|      West|     6|     12|   100|
|   B|      West|     6|      4|   100|
|   C|      East|     5|      2|   100|
+----+----------+------+-------+------+
Notice that the new column called salary has been added to the end of the DataFrame and each value in this new column is equal to 100, just as we specified.
Example 2: Add New Column with Constant String Value
We can use the following syntax to add a new column to the DataFrame called league that contains a value of ‘NBA’ for each row:
from pyspark.sql.functions import lit

#add new column called 'league' with value of 'NBA' for each row
df.withColumn('league', lit('NBA')).show()

+----+----------+------+-------+------+
|team|conference|points|assists|league|
+----+----------+------+-------+------+
|   A|      East|    11|      4|   NBA|
|   A|      East|     8|      9|   NBA|
|   A|      East|    10|      3|   NBA|
|   B|      West|     6|     12|   NBA|
|   B|      West|     6|      4|   NBA|
|   C|      East|     5|      2|   NBA|
+----+----------+------+-------+------+
Notice that the new column called league has been added to the end of the DataFrame and each value in this new column is equal to NBA, just as we specified.
Note #1: The withColumn function returns a new DataFrame that adds the specified column, or replaces an existing column of the same name, and leaves all other columns unchanged. The original DataFrame is not modified.
Note #2: The lit function creates a column with a literal value.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Check Data Type of Columns in DataFrame
PySpark: How to Print One Column of a DataFrame