You can use the following syntax to create an empty PySpark DataFrame with specific column names:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.types import StructType, StructField, StringType, FloatType
#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]
#create empty DataFrame with specific column names
df = spark.createDataFrame([], schema=StructType(my_columns))
This particular example creates an empty DataFrame called df with three columns: team and position, which hold strings, and points, which holds floats.
The following example shows how to use this syntax in practice.
Example: Create Empty PySpark DataFrame with Column Names
We can use the following syntax to create an empty PySpark DataFrame with specific column names:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
from pyspark.sql.types import StructType, StructField, StringType, FloatType
#specify column names and types
my_columns = [StructField('team', StringType(), True),
              StructField('position', StringType(), True),
              StructField('points', FloatType(), True)]
#create empty DataFrame with specific column names
df = spark.createDataFrame([], schema=StructType(my_columns))
#view DataFrame
df.show()
+----+--------+------+
|team|position|points|
+----+--------+------+
+----+--------+------+
We can see that an empty PySpark DataFrame has been created with the following column names: team, position and points.
We can also use the following syntax to view the schema of the DataFrame:
#view schema of DataFrame
df.printSchema()
root
 |-- team: string (nullable = true)
 |-- position: string (nullable = true)
 |-- points: float (nullable = true)
From the output we can see:
- The team field is a string.
- The position field is a string.
- The points field is a float.
Note: You can find a complete list of data types that you can specify for columns in a PySpark DataFrame in the documentation for the pyspark.sql.types module.