You can use the following syntax to remove spaces from each column name in a PySpark DataFrame:
from pyspark.sql import functions as F
#replace all spaces in column names with underscores
df_new = df.select([F.col(x).alias(x.replace(' ', '_')) for x in df.columns])
The following example shows how to use this syntax in practice.
Example: How to Remove Spaces from Column Names in PySpark
Suppose we have the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['Mavs', 18, 4],
['Nets', 33, 9],
['Hawks', 12, 7],
['Thunder', 15, 3],
['Lakers', 19, 2],
['Cavs', 24, 5],
['Magic', 28, 7]]
#define column names
columns = ['team name', 'points scored', 'total assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+---------+-------------+-------------+
|team name|points scored|total assists|
+---------+-------------+-------------+
| Mavs| 18| 4|
| Nets| 33| 9|
| Hawks| 12| 7|
| Thunder| 15| 3|
| Lakers| 19| 2|
| Cavs| 24| 5|
| Magic| 28| 7|
+---------+-------------+-------------+
Notice that each of the column names contains a space.
We can use the following syntax to replace the spaces in each column name with an underscore instead:
from pyspark.sql import functions as F
#replace all spaces in column names with underscores
df_new = df.select([F.col(x).alias(x.replace(' ', '_')) for x in df.columns])
#view new DataFrame
df_new.show()
+---------+-------------+-------------+
|team_name|points_scored|total_assists|
+---------+-------------+-------------+
| Mavs| 18| 4|
| Nets| 33| 9|
| Hawks| 12| 7|
| Thunder| 15| 3|
| Lakers| 19| 2|
| Cavs| 24| 5|
| Magic| 28| 7|
+---------+-------------+-------------+
Notice that the spaces in each column name have been replaced with an underscore.
Also note that you could simply remove the spaces from each column name without replacing them with another character by using the following syntax:
from pyspark.sql import functions as F
#remove all spaces in column names
df_new = df.select([F.col(x).alias(x.replace(' ', '')) for x in df.columns])
#view new DataFrame
df_new.show()
+--------+------------+------------+
|teamname|pointsscored|totalassists|
+--------+------------+------------+
| Mavs| 18| 4|
| Nets| 33| 9|
| Hawks| 12| 7|
| Thunder| 15| 3|
| Lakers| 19| 2|
| Cavs| 24| 5|
| Magic| 28| 7|
+--------+------------+------------+
Notice that the spaces in each column name have simply been removed.
Note that we used the PySpark replace function to replace the spaces in the column names.
You can find the complete documentation for the PySpark replace function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Remove Special Characters from Column
PySpark: How to Remove Specific Characters from Strings
PySpark: How to Reorder Columns in DataFrame