You can use the following syntax to add a new column with row numbers to a PySpark DataFrame:
from pyspark.sql.functions import row_number,lit from pyspark.sql.window import Window #add column called 'id' that contains row numbers from 1 to n w = Window().orderBy(lit('A')) df = df.withColumn('id', row_number().over(w))
This particular example will add a new column called id that contains row numbers ranging from 1 to n for the DataFrame.
The following example shows how to do so in practice.
Example: How to Add New Column with Row Numbers to PySpark DataFrame
Suppose we create the following PySpark DataFrame:
from pyspark.sql import SparkSession spark = SparkSession.builder.getOrCreate() #define data data = [['A', 'East', 11], ['A', 'East', 8], ['A', 'East', 10], ['B', 'West', 6], ['B', 'West', 6], ['C', 'East', 5]] #define column names columns = ['team', 'conference', 'points'] #create DataFrame using data and column names df = spark.createDataFrame(data, columns) #view DataFrame df.show() +----+----------+------+ |team|conference|points| +----+----------+------+ | A| East| 11| | A| East| 8| | A| East| 10| | B| West| 6| | B| West| 6| | C| East| 5| +----+----------+------+
The DataFrame currently contains three columns.
Suppose we would like to add a new column that contains row numbers.
We can use the following syntax to do so:
from pyspark.sql.functions import row_number,lit from pyspark.sql.window import Window #add column called 'id' that contains row numbers from 1 to n w = Window().orderBy(lit('A')) df = df.withColumn('id', row_number().over(w)) #view updated DataFrame df.show() +----+----------+------+---+ |team|conference|points| id| +----+----------+------+---+ | A| East| 11| 1| | A| East| 8| 2| | A| East| 10| 3| | B| West| 6| 4| | B| West| 6| 5| | C| East| 5| 6| +----+----------+------+---+
Notice that the new id column contains row numbers ranging from 1 to 6.
If you’d like, you can reorder the columns so that the id column appears at the front:
#move 'id' column to front
df = df.select('id', 'team', 'conference', 'points')
#view updated DataFrame
df.show()
+---+----+----------+------+
| id|team|conference|points|
+---+----+----------+------+
| 1| A| East| 11|
| 2| A| East| 8|
| 3| A| East| 10|
| 4| B| West| 6|
| 5| B| West| 6|
| 6| C| East| 5|
+---+----+----------+------+
Note: The syntax lit(‘A’) is simply used as an arbitrary value. You can replace ‘A’ with anything you’d like and the code will still work.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column