You can use the following methods to create a new column in a PySpark DataFrame that contains random numbers:

**Method 1: Create New Column with Random Decimal Numbers**

```python
from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()
```

**Method 2: Create New Column with Random Integers**

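```python
#note: this import shadows Python's built-in round() in the current module
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random whole numbers between 0 and 100
#(round() returns a double, so values display as 94.0 rather than 94)
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()
```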

The examples below apply each method to the following PySpark DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#define data
data = [['Mavs', 18],
        ['Nets', 33],
        ['Lakers', 12],
        ['Kings', 15],
        ['Hawks', 19],
        ['Wizards', 24],
        ['Magic', 28],
        ['Jazz', 40],
        ['Thunder', 24],
        ['Spurs', 13]]

#define column names
columns = ['team', 'points']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()
```

```
+-------+------+
|   team|points|
+-------+------+
|   Mavs|    18|
|   Nets|    33|
| Lakers|    12|
|  Kings|    15|
|  Hawks|    19|
|Wizards|    24|
|  Magic|    28|
|   Jazz|    40|
|Thunder|    24|
|  Spurs|    13|
+-------+------+
```

**Example 1: Create New Column with Random Decimal Numbers**

We can use the following syntax to add a new column named **rand** to the DataFrame that contains random decimal numbers between 0 and 100:

```python
from pyspark.sql.functions import rand

#create new column named 'rand' that contains random floats between 0 and 100
df.withColumn('rand', rand(seed=23)*100).show()
```

```
+-------+------+------------------+
|   team|points|              rand|
+-------+------+------------------+
|   Mavs|    18| 93.88044512577216|
|   Nets|    33|39.432553969527554|
| Lakers|    12|23.260361399084918|
|  Kings|    15| 2.339183228862929|
|  Hawks|    19| 82.53753350983487|
|Wizards|    24| 88.94415403143505|
|  Magic|    28| 80.81524027081029|
|   Jazz|    40| 59.56629641640896|
|Thunder|    24| 27.62195585886885|
|  Spurs|    13| 70.43214981152886|
+-------+------+------------------+
```

Notice that the new **rand** column contains random decimal numbers between 0 and 100.

**Note #1**: Specifying a value for **seed** within the **rand()** function makes the results reproducible: the same random numbers are generated each time we run the code, provided the partitioning of the DataFrame does not change between runs.

**Note #2:** The **rand()** function returns a value from 0 (inclusive) to 1 (exclusive). Thus, the number that we multiply the **rand()** function by sets the upper bound of the values that can be returned. In this example, we set the upper bound to **100**.

**Example 2: Create New Column with Random Integers**

We can use the following syntax to add a new column named **rand** to the DataFrame that contains random whole numbers between 0 and 100:

```python
#note: this import shadows Python's built-in round() in the current module
from pyspark.sql.functions import rand, round

#create new column named 'rand' that contains random whole numbers between 0 and 100
df.withColumn('rand', round(rand(seed=23)*100, 0)).show()
```

```
+-------+------+----+
|   team|points|rand|
+-------+------+----+
|   Mavs|    18|94.0|
|   Nets|    33|39.0|
| Lakers|    12|23.0|
|  Kings|    15| 2.0|
|  Hawks|    19|83.0|
|Wizards|    24|89.0|
|  Magic|    28|81.0|
|   Jazz|    40|60.0|
|Thunder|    24|28.0|
|  Spurs|    13|70.0|
+-------+------+----+
```

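Notice that the new **rand** column contains random whole numbers between 0 and 100. Because **round()** returns a double when applied to a double column, the values are displayed as 94.0 rather than 94.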

You can find the complete documentation for the PySpark **rand** function in the official PySpark API reference.
