There are three common ways to select multiple columns in a PySpark DataFrame:
Method 1: Select Multiple Columns by Name
#select 'team' and 'points' columns
df.select('team', 'points').show()
Method 2: Select Multiple Columns Based on List
#define list of columns to select
select_cols = ['team', 'points']
#select all columns in list
df.select(*select_cols).show()
Method 3: Select Multiple Columns Based on Index Range
#select all columns between index 0 and 2 (not including 2)
df.select(df.columns[0:2]).show()
The following examples show how to use each method in practice with the following PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
['A', 'East', 8, 9],
['A', 'East', 10, 3],
['B', 'West', 6, 12],
['B', 'West', 6, 4],
['C', 'East', 5, 2]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
| A| East| 11| 4|
| A| East| 8| 9|
| A| East| 10| 3|
| B| West| 6| 12|
| B| West| 6| 4|
| C| East| 5| 2|
+----+----------+------+-------+
Example 1: Select Multiple Columns by Name
We can use the following syntax to select the team and points columns of the DataFrame:
#select 'team' and 'points' columns
df.select('team', 'points').show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+
Notice that the resulting DataFrame only contains the team and points columns, just as we specified.
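If you prefer to reference columns as expressions rather than strings, an equivalent sketch uses PySpark's built-in col function, which produces the same result for this example:

from pyspark.sql.functions import col

#select 'team' and 'points' columns using column expressions
df.select(col('team'), col('points')).show()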
Example 2: Select Multiple Columns Based on List
We can use the following syntax to specify a list of column names and then select all columns in the DataFrame that belong to the list:
#define list of columns to select
select_cols = ['team', 'points']

#select all columns in list
df.select(*select_cols).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+
Notice that the resulting DataFrame only contains the column names that we specified in the list.
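The list-based approach is also convenient when you want to build the list of columns dynamically. As a minimal sketch, assuming you want every column except 'conference', you could use a list comprehension over df.columns:

#define columns to exclude
drop_cols = ['conference']

#build list of remaining columns and select them
select_cols = [c for c in df.columns if c not in drop_cols]
df.select(*select_cols).show()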
Example 3: Select Multiple Columns Based on Index Range
We can use the following syntax to select all columns in the DataFrame within a specific range of index positions:
#select all columns between index positions 0 and 2 (not including 2)
df.select(df.columns[0:2]).show()

+----+----------+
|team|conference|
+----+----------+
|   A|      East|
|   A|      East|
|   A|      East|
|   B|      West|
|   B|      West|
|   C|      East|
+----+----------+
Notice that the resulting DataFrame only contains the columns in index positions 0 and 1.
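Keep in mind that df.columns is simply a Python list of column names, so any list slicing or indexing works. As a sketch, assuming you wanted non-contiguous index positions such as 0 and 3, you could pass an explicit list of indices instead of a slice:

#select columns in index positions 0 and 3
df.select([df.columns[i] for i in [0, 3]]).show()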
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Select Columns by Index in DataFrame