You can use the following methods to select columns by index in a PySpark DataFrame:
Method 1: Select Specific Column by Index
#select first column in DataFrame
df.select(df.columns[0]).show()
Method 2: Select All Columns Except Specific One by Index
#select all columns except first column in DataFrame
df.drop(df.columns[0]).show()
Method 3: Select Range of Columns by Index
#select all columns between index 0 and 2, not including 2
df.select(df.columns[0:2]).show()
The following examples show how to use each of these methods in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11],
        ['A', 'East', 8],
        ['A', 'East', 10],
        ['B', 'West', 6],
        ['B', 'West', 6],
        ['C', 'East', 5]]

#define column names
columns = ['team', 'conference', 'points']

#create DataFrame using data and column names
df = spark.createDataFrame(data, columns)

#view DataFrame
df.show()

+----+----------+------+
|team|conference|points|
+----+----------+------+
|   A|      East|    11|
|   A|      East|     8|
|   A|      East|    10|
|   B|      West|     6|
|   B|      West|     6|
|   C|      East|     5|
+----+----------+------+
Example 1: Select Specific Column by Index
We can use the following syntax to select only the first column in the DataFrame:
#select first column in DataFrame
df.select(df.columns[0]).show()

+----+
|team|
+----+
|   A|
|   A|
|   A|
|   B|
|   B|
|   C|
+----+
Notice that only the first column (the team column) has been selected from the DataFrame.
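Under the hood, df.columns is an ordinary Python list of column names, so all of the standard list indexing rules apply. As a minimal sketch (using the column names from the example DataFrame above), negative indices count from the end of the list:

```python
#df.columns is a plain Python list, so standard indexing applies
#(column names taken from the example DataFrame above)
columns = ['team', 'conference', 'points']

#first column by index, as used with df.select(df.columns[0])
first = columns[0]

#negative indices count from the end, so this is the last column
last = columns[-1]

print(first, last)
```

This means df.select(df.columns[-1]).show() would display only the last column (the points column), without needing to know how many columns the DataFrame has.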
Example 2: Select All Columns Except Specific One by Index
We can use the following syntax to select all columns in the DataFrame except for the first column:
#select all columns except first column in DataFrame
df.drop(df.columns[0]).show()

+----------+------+
|conference|points|
+----------+------+
|      East|    11|
|      East|     8|
|      East|    10|
|      West|     6|
|      West|     6|
|      East|     5|
+----------+------+
Notice that all columns except the first column (the team column) have been selected from the DataFrame.
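Note that drop() accepts column names rather than positions, so dropping by index really works by looking the name up in df.columns first. A minimal sketch of that lookup for an arbitrary index (the index i below is illustrative, and the column names are taken from the example DataFrame):

```python
columns = ['team', 'conference', 'points']

#drop() takes a name, not a position, so resolve the index to a name first
i = 1  #hypothetical: drop the second column instead of the first

#column names at every position except index i
keep = [c for j, c in enumerate(columns) if j != i]

print(keep)
```

With the example DataFrame, df.drop(columns[1]).show() would then display only the team and points columns.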
Example 3: Select Range of Columns by Index
We can use the following syntax to select all columns in the DataFrame in the range of 0 to 2 (not including 2):
#select all columns between index 0 and 2, not including 2
df.select(df.columns[0:2]).show()

+----+----------+
|team|conference|
+----+----------+
|   A|      East|
|   A|      East|
|   A|      East|
|   B|      West|
|   B|      West|
|   C|      East|
+----+----------+
Notice that all columns in the range of 0 to 2 (not including 2) have been selected from the DataFrame.
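The slice follows standard Python semantics: the start index is included and the stop index is excluded. For non-contiguous columns, a list comprehension over the desired positions works as a sketch (the positions 0 and 2 below are illustrative):

```python
columns = ['team', 'conference', 'points']

#slices include the start index and exclude the stop index
assert columns[0:2] == ['team', 'conference']

#non-contiguous positions can be gathered explicitly
picked = [columns[i] for i in (0, 2)]

print(picked)
```

With the example DataFrame, df.select(picked).show() would then display the team and points columns.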
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Find Unique Values in a Column