How to Select Top N Rows in PySpark DataFrame (With Examples)


There are two common ways to select the top N rows in a PySpark DataFrame:

Method 1: Use take()

df.take(10)

This method returns the top 10 rows as a list of Row objects.

Method 2: Use limit()

df.limit(10).show()

This method returns a new DataFrame containing the top 10 rows.

The following examples show how to use each of these methods in practice with the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11], 
        ['A', 'East', 8], 
        ['A', 'East', 10], 
        ['B', 'West', 6], 
        ['B', 'West', 6], 
        ['C', 'East', 5]] 
  
#define column names
columns = ['team', 'conference', 'points'] 
  
#create DataFrame using data and column names
df = spark.createDataFrame(data, columns) 
  
#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+

Example 1: Select Top N Rows Using take()

We can use the following syntax with the take() method to select the top 3 rows from the DataFrame:

#select top 3 rows from DataFrame
df.take(3)

[Row(team='A', conference='East', points=11),
 Row(team='A', conference='East', points=8),
 Row(team='A', conference='East', points=10)]

This method returns the top 3 rows of the DataFrame as a list of Row objects.

Example 2: Select Top N Rows Using limit()

We can use the following syntax with the limit() method to select the top 3 rows from the DataFrame:

#select top 3 rows from DataFrame
df.limit(3).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
+----+----------+------+

This method returns a DataFrame that contains only the top 3 rows of the original DataFrame.

Note that if you’d like to only select the top 3 rows for particular columns, you can specify those columns by using the select() function:

#select top 3 rows from DataFrame only for 'team' and 'points' columns
df.select('team', 'points').limit(3).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
+----+------+

Notice that only the top 3 rows for the team and points columns are shown in the resulting DataFrame.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Find Unique Values in a Column
