How to Rename Columns in PySpark (With Examples)


You can use the following methods to rename columns in a PySpark DataFrame:

Method 1: Rename One Column

#rename 'conference' column to 'conf'
df = df.withColumnRenamed('conference', 'conf')

Method 2: Rename Multiple Columns

#rename 'conference' and 'team' columns
df = df.withColumnRenamed('conference', 'conf')\
       .withColumnRenamed('team', 'team_name')

Method 3: Rename All Columns

#specify new column names to use
col_names = ['the_team', 'the_conf', 'points_scored', 'total_assists']

#rename all column names with new names
df = df.toDF(*col_names)

The examples below show how to use each of these methods in practice, each starting from the following PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', 'East', 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', 6, 12], 
        ['B', 'West', 6, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Rename One Column in PySpark

We can use the following syntax to rename just the conference column in the DataFrame:

#rename 'conference' column to 'conf'
df = df.withColumnRenamed('conference', 'conf')

#view updated DataFrame
df.show()

+----+----+------+-------+
|team|conf|points|assists|
+----+----+------+-------+
|   A|East|    11|      4|
|   A|East|     8|      9|
|   A|East|    10|      3|
|   B|West|     6|     12|
|   B|West|     6|      4|
|   C|East|     5|      2|
+----+----+------+-------+

Notice that only the conference column has been renamed (to conf); all other column names are unchanged.
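
One caveat: withColumnRenamed does nothing if the DataFrame has no column with the given name, so a typo in the old name passes without an error. The following is a minimal sketch of a stricter wrapper. The helper name rename_strict is hypothetical (it is not part of PySpark), and it assumes df is a PySpark DataFrame, using only df.columns and df.withColumnRenamed.

```python
def rename_strict(df, old_name, new_name):
    """Rename one column, raising an error if the old name is missing.

    Hypothetical helper, not a PySpark API. `df` is assumed to be a
    PySpark DataFrame.
    """
    # exact, case-sensitive check; this may be stricter than Spark's
    # own column name resolution
    if old_name not in df.columns:
        raise ValueError(
            f"column {old_name!r} not found; available columns: {df.columns}"
        )
    return df.withColumnRenamed(old_name, new_name)

# usage (assumes df has a 'points' column):
# df = rename_strict(df, 'points', 'pts')
```

With this wrapper, a misspelled column name raises a ValueError instead of returning the DataFrame unchanged.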

Example 2: Rename Multiple Columns in PySpark

We can use the following syntax to rename the conference and team columns in the DataFrame:

#rename 'conference' and 'team' columns
df = df.withColumnRenamed('conference', 'conf')\
       .withColumnRenamed('team', 'team_name')

#view updated DataFrame
df.show()

+---------+----+------+-------+
|team_name|conf|points|assists|
+---------+----+------+-------+
|        A|East|    11|      4|
|        A|East|     8|      9|
|        A|East|    10|      3|
|        B|West|     6|     12|
|        B|West|     6|      4|
|        C|East|     5|      2|
+---------+----+------+-------+

Notice that the conference and team columns have been renamed while all other column names have remained the same.
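
When many columns need new names, chaining withColumnRenamed calls by hand becomes verbose. One possible approach is to keep the old and new names in a dictionary and apply them in a loop. The helper name rename_many below is hypothetical, and df is assumed to be a PySpark DataFrame.

```python
def rename_many(df, mapping):
    """Apply every old -> new rename in `mapping` using withColumnRenamed.

    Hypothetical helper, not a PySpark API. Names in `mapping` that do
    not exist in `df` are skipped, because withColumnRenamed is a no-op
    for missing columns.
    """
    for old_name, new_name in mapping.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df

# usage (assumes df has 'points' and 'assists' columns):
# df = rename_many(df, {'points': 'pts', 'assists': 'ast'})
```

In Spark 3.4 and later, the built-in df.withColumnsRenamed method accepts a dictionary of this form directly, so the loop is mainly useful on older versions.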

Example 3: Rename All Columns in PySpark

We can use the following syntax to rename all columns in the DataFrame:

#specify new column names to use
col_names = ['the_team', 'the_conf', 'points_scored', 'total_assists']

#rename all column names with new names
df = df.toDF(*col_names)

#view updated DataFrame
df.show()

+--------+--------+-------------+-------------+
|the_team|the_conf|points_scored|total_assists|
+--------+--------+-------------+-------------+
|       A|    East|           11|            4|
|       A|    East|            8|            9|
|       A|    East|           10|            3|
|       B|    West|            6|           12|
|       B|    West|            6|            4|
|       C|    East|            5|            2|
+--------+--------+-------------+-------------+

Notice that all of the column names have been renamed based on the new names that we specified.
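
Because toDF expects one new name per existing column, in the same order as the current columns, the list of names can also be built from df.columns instead of being typed out. The sketch below adds a prefix to every column name. The helper name add_prefix is hypothetical, and df is assumed to be a PySpark DataFrame.

```python
def add_prefix(df, prefix):
    """Return a DataFrame whose column names all start with `prefix`.

    Hypothetical helper, not a PySpark API.
    """
    # build one new name per existing column, preserving column order
    new_names = [f"{prefix}{name}" for name in df.columns]
    return df.toDF(*new_names)

# usage (prefixes every column name in df with 'nba_'):
# df = add_prefix(df, 'nba_')
```

The same pattern works for any transformation of the names, such as lowercasing them or replacing spaces with underscores.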

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Find Unique Values in a Column
