PySpark: How to Create Empty DataFrame with Column Names


You can use the following syntax to create an empty PySpark DataFrame with specific column names:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, FloatType

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame with specific column names
df = spark.createDataFrame([], schema=StructType(my_columns))

This particular example creates a DataFrame called df with three columns: team, position and points.

The following example shows how to use this syntax in practice.

Example: Create Empty PySpark DataFrame with Column Names

The following code creates an empty PySpark DataFrame with three named columns and then displays it:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

from pyspark.sql.types import StructType, StructField, StringType, FloatType

#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]

#create DataFrame with specific column names
df = spark.createDataFrame([], schema=StructType(my_columns))

#view DataFrame
df.show()

+----+--------+------+
|team|position|points|
+----+--------+------+
+----+--------+------+

We can see that an empty PySpark DataFrame has been created with the following column names: team, position and points.

We can also use the following syntax to view the schema of the DataFrame:

#view schema of DataFrame
df.printSchema()

root
 |-- team: string (nullable = true)
 |-- position: string (nullable = true)
 |-- points: float (nullable = true)

From the output we can see:

  • The team field is a string.
  • The position field is a string.
  • The points field is a float.

Note: The complete list of data types that you can specify for columns in a PySpark DataFrame is available in the documentation for the pyspark.sql.types module.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Create New DataFrame from Existing DataFrame
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Select Columns by Index in DataFrame
