You can use the following methods to exclude specific columns in a PySpark DataFrame:
Method 1: Exclude One Column
#select all columns except 'points' column
df_new = df.drop('points')
Method 2: Exclude Multiple Columns
#select all columns except 'conference' and 'points' columns
df_new = df.drop('conference', 'points')
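When the column names to exclude are already stored in a Python list, the list needs to be unpacked, because drop takes the names as separate arguments. The following is a minimal sketch of that pattern. The helper name drop_listed_columns is hypothetical (it is not part of PySpark), and df is assumed to be a PySpark DataFrame:

```python
def drop_listed_columns(df, cols_to_exclude):
    """Return a new DataFrame without the columns named in cols_to_exclude.

    Hypothetical helper, not part of PySpark.
    df is assumed to be a pyspark.sql.DataFrame.
    """
    # drop() expects column names as separate arguments, so unpack the list
    return df.drop(*cols_to_exclude)


# possible usage with a DataFrame named df (needs an active SparkSession):
# df_new = drop_listed_columns(df, ['conference', 'points'])
```

Unpacking with * keeps the list reusable elsewhere, for example when the same set of names is excluded from several DataFrames.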
The examples below show how to use each method in practice with this PySpark DataFrame:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
Example 1: Exclude One Column in PySpark
We can use the following syntax to select all columns in the DataFrame, excluding the points column:
#select all columns except 'points' column
df_new = df.drop('points')
#view new DataFrame
df_new.show()
+----+----------+-------+
|team|conference|assists|
+----+----------+-------+
|   A|      East|      4|
|   A|      East|      9|
|   A|      East|      3|
|   B|      West|     12|
|   B|      West|      4|
|   C|      East|      2|
+----+----------+-------+
Notice that all columns in the DataFrame are selected except for the points column.
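One detail to keep in mind is that drop does nothing when the DataFrame has no column with the given name, so a misspelled name goes unnoticed. The following is a minimal sketch of a stricter variant. The helper name drop_or_raise is hypothetical (not part of PySpark), df is assumed to be a PySpark DataFrame, and the check compares names exactly, which may be stricter than Spark's own name resolution:

```python
def drop_or_raise(df, *cols):
    """Drop the named columns, raising an error if any of them is absent.

    Hypothetical helper, not part of PySpark.
    df is assumed to be a pyspark.sql.DataFrame.
    """
    # collect requested names that do not appear in the DataFrame (exact match)
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"columns not found in DataFrame: {missing}")
    # every requested name exists, so drop() removes all of them
    return df.drop(*cols)


# possible usage with a DataFrame named df (needs an active SparkSession):
# df_new = drop_or_raise(df, 'points')
```

Whether a missing column should raise an error or be ignored depends on the pipeline, so this check is a design choice rather than a requirement.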
Example 2: Exclude Multiple Columns in PySpark
We can use the following syntax to select all columns in the DataFrame, excluding the conference and points columns:
#select all columns except 'conference' and 'points' columns
df_new = df.drop('conference', 'points')
#view new DataFrame
df_new.show()
+----+-------+
|team|assists|
+----+-------+
|   A|      4|
|   A|      9|
|   A|      3|
|   B|     12|
|   B|      4|
|   C|      2|
+----+-------+
Notice that all columns in the DataFrame are selected except for the conference and points columns.
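The same result can also be reached with select by building the list of columns to keep from df.columns. The following is a minimal sketch of that alternative. The helper names columns_except and select_except are hypothetical (not part of PySpark), and df is assumed to be a PySpark DataFrame:

```python
def columns_except(all_columns, excluded):
    """Return the names in all_columns that are not in excluded, in original order."""
    excluded = set(excluded)
    return [c for c in all_columns if c not in excluded]


def select_except(df, excluded):
    """Select every column of df except those named in excluded.

    Hypothetical helper, not part of PySpark.
    df is assumed to be a pyspark.sql.DataFrame.
    """
    # df.columns holds the column names in order; select() keeps the remaining ones
    return df.select(*columns_except(df.columns, excluded))


# possible usage with a DataFrame named df (needs an active SparkSession):
# df_new = select_except(df, ['conference', 'points'])
```

For simply removing columns, drop is shorter. The select form may be more convenient when the kept columns also need to be reordered or renamed in the same step.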
Note: The complete documentation for the PySpark drop function is available in the official PySpark API reference under DataFrame.drop.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Select Columns by Index in DataFrame