How to Drop First Column in PySpark DataFrame


You can use the following methods to drop the first column from a PySpark DataFrame:

Method 1: Drop First Column by Index Position

#create new DataFrame that drops first column by index position
df_new = df.drop(df.columns[0])

Method 2: Drop First Column by Name

#create new DataFrame that drops first column by name
#('col1' is a placeholder for the actual name of the first column)
df_new = df.drop('col1')

The examples below show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Drop First Column in PySpark by Index Position

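We can use the following syntax to drop the first column in the DataFrame by index position. Here df.columns returns the list of column names, so df.columns[0] is the name of the first column (team in this case), which is then passed to drop():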

#create new DataFrame that drops first column by index position
df_new = df.drop(df.columns[0])

#view new DataFrame
df_new.show()

+----------+------+-------+
|conference|points|assists|
+----------+------+-------+
|      East|    11|      4|
|      East|     8|      9|
|      East|    10|      3|
|      West|     6|     12|
|      West|     6|      4|
|      East|     5|      2|
+----------+------+-------+

Notice that only the first column (the team column) has been dropped from the DataFrame.

Example 2: Drop First Column in PySpark by Name

We can use the following syntax to drop the first column in the DataFrame by name:

#create new DataFrame that drops first column by name
df_new = df.drop('team')

#view new DataFrame
df_new.show()

+----------+------+-------+
|conference|points|assists|
+----------+------+-------+
|      East|    11|      4|
|      East|     8|      9|
|      East|    10|      3|
|      West|     6|     12|
|      West|     6|      4|
|      East|     5|      2|
+----------+------+-------+

The result matches Example 1: only the first column (the team column) has been dropped from the DataFrame.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Find Unique Values in a Column
