There are two common ways to select the top N rows in a PySpark DataFrame:
Method 1: Use take()
df.take(10)
This method returns a Python list containing the first 10 rows, each as a Row object.
Method 2: Use limit()
df.limit(10).show()
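Here limit() returns a new DataFrame that contains the first 10 rows, and show() prints it. Neither method sorts the data, so "top" means the first rows in the DataFrame's current order.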
The examples below show how to use each of these methods in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11],
        ['A', 'East', 8],
        ['A', 'East', 10],
        ['B', 'West', 6],
        ['B', 'West', 6],
        ['C', 'East', 5]]

#define column names
columns = ['team', 'conference', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+
Example 1: Select Top N Rows Using take()
We can use the following syntax with the take() method to select the top 3 rows from the DataFrame:
#select top 3 rows from DataFrame
df.take(3)

[Row(team='A', conference='East', points=11),
 Row(team='A', conference='East', points=8),
 Row(team='A', conference='East', points=10)]
This method returns a list of Row objects for the top 3 rows of the DataFrame.
Example 2: Select Top N Rows Using limit()
We can use the following syntax with the limit() method to select the top 3 rows from the DataFrame:
#select top 3 rows from DataFrame
df.limit(3).show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
+----+----------+------+
This method returns a DataFrame that contains only the top 3 rows of the original DataFrame.
Note that if you’d like to only select the top 3 rows for particular columns, you can specify those columns by using the select() function:
#select top 3 rows from DataFrame only for 'team' and 'points' columns
df.select('team', 'points').limit(3).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
+----+------+
Notice that only the top 3 rows for the team and points columns are shown in the resulting DataFrame.