PySpark: How to Select Rows by Index in DataFrame
By default, a PySpark DataFrame does not have a built-in index.

However, it’s easy to add an index column which you can then use to select rows in the DataFrame based on their index value.

The following example shows how to do so in practice.

Example: Select Rows by Index in PySpark DataFrame

Suppose we create the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

We can use the following syntax to add a column called id that ranges from 1 to the last row in the DataFrame:

from pyspark.sql.functions import row_number, lit
from pyspark.sql.window import Window

#add column called 'id' that contains row numbers from 1 to n
#note: ordering by a constant literal moves all rows into a single partition,
#which is fine for small DataFrames but can be slow on large ones
w = Window().orderBy(lit('A'))
df = df.withColumn('id', row_number().over(w))

#view updated DataFrame
df.show()

+----+----------+------+---+
|team|conference|points| id|
+----+----------+------+---+
|   A|      East|    11|  1|
|   A|      East|     8|  2|
|   A|      East|    10|  3|
|   B|      West|     6|  4|
|   B|      West|     6|  5|
|   C|      East|     5|  6|
+----+----------+------+---+

Now we can use the where function along with the between function to select all rows between index values 2 and 5:

from pyspark.sql.functions import col

#select all rows between index values 2 and 5
df.where(col('id').between(2, 5)).show()

+----+----------+------+---+
|team|conference|points| id|
+----+----------+------+---+
|   A|      East|     8|  2|
|   A|      East|    10|  3|
|   B|      West|     6|  4|
|   B|      West|     6|  5|
+----+----------+------+---+

The output displays all rows in the DataFrame between index values 2 and 5.

If we’d like, we can instead use the filter function along with the isin function to select only the rows whose index values appear in a list:

#select rows with index values 1, 5 and 6
df.filter(df.id.isin(1, 5, 6)).show()

+----+----------+------+---+
|team|conference|points| id|
+----+----------+------+---+
|   A|      East|    11|  1|
|   B|      West|     6|  5|
|   C|      East|     5|  6|
+----+----------+------+---+

The output displays the rows in the DataFrame in index positions 1, 5 and 6.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows Based on Column Values
PySpark: How to Find Unique Values in a Column
