How to Keep Certain Columns in PySpark (With Examples)


You can use the following methods to only keep certain columns in a PySpark DataFrame:

Method 1: Specify Columns to Keep

from pyspark.sql.functions import col

#only keep columns 'col1' and 'col2'
df.select(col('col1'), col('col2')).show() 

Method 2: Specify Columns to Drop

#drop columns 'col3' and 'col4'
#(pass names as strings; older PySpark releases reject multiple Column objects in drop())
df.drop('col3', 'col4').show()

The examples below apply each method to this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Specify Columns to Keep

The following code shows how to keep only the team and points columns and display the result:

from pyspark.sql.functions import col

#create new DataFrame and only keep 'team' and 'points' columns
df.select(col('team'), col('points')).show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame only keeps the two columns that we specified.

Example 2: Specify Columns to Drop

The following code shows how to define a new DataFrame that drops the conference and assists columns from the original DataFrame:

#create new DataFrame that drops 'conference' and 'assists' columns
#(names are passed as strings so the call also works on older PySpark releases)
df.drop('conference', 'assists').show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   A|     8|
|   A|    10|
|   B|     6|
|   B|     6|
|   C|     5|
+----+------+

Notice that the resulting DataFrame drops the conference and assists columns from the original DataFrame and keeps the remaining columns.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Select Rows by Index in DataFrame
