You can use the following methods to create a new column in a PySpark DataFrame that contains random numbers:
Method 1: Create New Column with Random Decimal Numbers
from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()
Method 2: Create New Column with Random Integers
#note: importing round from pyspark.sql.functions shadows Python's built-in round()
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random whole numbers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
Example 1: Create New Column with Random Decimal Numbers
We can use the following syntax to add a new column named rand to the DataFrame that contains random decimal numbers between 0 and 100:
from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()

+-------+------+------------------+
|   team|points|              rand|
+-------+------+------------------+
|   Mavs|    18| 93.88044512577216|
|   Nets|    33|39.432553969527554|
| Lakers|    12|23.260361399084918|
|  Kings|    15| 2.339183228862929|
|  Hawks|    19| 82.53753350983487|
|Wizards|    24| 88.94415403143505|
|  Magic|    28| 80.81524027081029|
|   Jazz|    40| 59.56629641640896|
|Thunder|    24| 27.62195585886885|
|  Spurs|    13| 70.43214981152886|
+-------+------+------------------+
Notice that the new rand column contains random decimal numbers between 0 and 100.
Note #1: By specifying a value for seed within the rand() function, we generate the same random numbers each time we run the code, provided the DataFrame and its partitioning stay the same.
Note #2: The rand() function returns a value drawn uniformly from the interval [0, 1), so 0 can occur but 1 cannot. Thus, the number that we multiply rand() by sets the upper bound (exclusive) of the values that can be returned. In this example, we set that bound to 100.
Example 2: Create New Column with Random Integers
We can use the following syntax to add a new column named rand to the DataFrame that contains random integers between 0 and 100:
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random integers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()

+-------+------+----+
|   team|points|rand|
+-------+------+----+
|   Mavs|    18|94.0|
|   Nets|    33|39.0|
| Lakers|    12|23.0|
|  Kings|    15| 2.0|
|  Hawks|    19|83.0|
|Wizards|    24|89.0|
|  Magic|    28|81.0|
|   Jazz|    40|60.0|
|Thunder|    24|28.0|
|  Spurs|    13|70.0|
+-------+------+----+
Notice that the new rand column contains random whole numbers between 0 and 100. Because round() returns a double here, the values display as 94.0 rather than 94.
You can find the complete documentation for the PySpark rand function in the official PySpark API reference under pyspark.sql.functions.rand.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Random Sample of Rows
PySpark: How to Add New Rows to DataFrame
PySpark: How to Add New Column with Constant Value