PySpark: How to Add New Column with Row Numbers


You can use the following syntax to add a new column with row numbers to a PySpark DataFrame:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#add column called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df = df.withColumn('id', row_number().over(w))

This particular example will add a new column called id that contains row numbers ranging from 1 to n for the DataFrame.

The following example shows how to do so in practice.

Example: How to Add New Column with Row Numbers to PySpark DataFrame

Suppose we create the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

The DataFrame currently contains three columns.

Suppose we would like to add a new column that contains row numbers.

We can use the following syntax to do so:

from pyspark.sql.functions import row_number,lit
from pyspark.sql.window import Window

#add column called 'id' that contains row numbers from 1 to n
w = Window().orderBy(lit('A'))
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame
df.show()

+----+----------+------+---+
|team|conference|points| id|
+----+----------+------+---+
|   A|      East|    11|  1|
|   A|      East|     8|  2|
|   A|      East|    10|  3|
|   B|      West|     6|  4|
|   B|      West|     6|  5|
|   C|      East|     5|  6|
+----+----------+------+---+

Notice that the new id column contains row numbers ranging from 1 to 6.

If you’d like, you can reorder the columns so that the id column appears at the front:

#move 'id' column to front
df = df.select('id', 'team', 'conference', 'points')

#view updated DataFrame
df.show()

+---+----+----------+------+
| id|team|conference|points|
+---+----+----------+------+
|  1|   A|      East|    11|
|  2|   A|      East|     8|
|  3|   A|      East|    10|
|  4|   B|      West|     6|
|  5|   B|      West|     6|
|  6|   C|      East|     5|
+---+----+----------+------+

Note: The syntax lit(‘A’) is simply used as an arbitrary value. You can replace ‘A’ with anything you’d like and the code will still work.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *