PySpark: How to Drop Multiple Columns from DataFrame


There are two common ways to drop multiple columns in a PySpark DataFrame:

Method 1: Drop Multiple Columns by Name

#drop 'team' and 'points' columns
df.drop('team', 'points').show()

Method 2: Drop Multiple Columns Based on List

#define list of columns to drop
drop_cols = ['team', 'points']

#drop all columns in list
df.drop(*drop_cols).show()

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Drop Multiple Columns by Name

We can use the following syntax to drop the team and points columns from the DataFrame:

#drop 'team' and 'points' columns
df.drop('team', 'points').show()

+----------+-------+
|conference|assists|
+----------+-------+
|      East|      4|
|      East|      9|
|      East|      3|
|      West|     12|
|      West|      4|
|      East|      2|
+----------+-------+

Notice that the team and points columns have both been dropped from the DataFrame, just as we specified.

Example 2: Drop Multiple Columns Based on List

We can use the following syntax to specify a list of column names and then drop all columns in the DataFrame that belong to the list:

#define list of columns to drop
drop_cols = ['team', 'points']

#drop all columns in list
df.drop(*drop_cols).show()

+----------+-------+
|conference|assists|
+----------+-------+
|      East|      4|
|      East|      9|
|      East|      3|
|      West|     12|
|      West|      4|
|      East|      2|
+----------+-------+

Notice that the resulting DataFrame no longer contains any of the columns that we specified in the list.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Check if Column Exists in DataFrame
PySpark: How to Print One Column of a DataFrame
PySpark: How to Select Columns by Index in DataFrame
