You can use the toDF() function to convert an RDD (resilient distributed dataset) to a DataFrame in PySpark:
my_df = my_RDD.toDF()
This particular example will convert the RDD named my_RDD to a DataFrame called my_df.
The following example shows how to use this syntax in practice.
Example: How to Convert RDD to DataFrame in PySpark
First, let’s create the following RDD:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [('A', 11),
('B', 19),
('C', 22),
('D', 25),
('E', 12),
('F', 41)]
#create RDD using data
my_RDD = spark.sparkContext.parallelize(data)
We can verify that this object is an RDD by using the type() function:
#check object type
type(my_RDD)

pyspark.rdd.RDD
We can see that the object my_RDD is indeed an RDD.
We can then use the following syntax to convert the RDD to a PySpark DataFrame:
#convert RDD to DataFrame
my_df = my_RDD.toDF()
#view DataFrame
my_df.show()
+---+---+
| _1| _2|
+---+---+
| A| 11|
| B| 19|
| C| 22|
| D| 25|
| E| 12|
| F| 41|
+---+---+
We can see that the RDD has been converted to a DataFrame.
We can verify that the my_df object is a DataFrame by using the type() function once again:
#check object type
type(my_df)

pyspark.sql.dataframe.DataFrame
We can see that the object my_df is indeed a DataFrame.
Note that the toDF() function uses column names _1 and _2 by default.
However, we can also specify column names to use within the toDF() function:
#convert RDD to DataFrame with specific column names
my_df = my_RDD.toDF(['player', 'assists'])
#view DataFrame
my_df.show()
+------+-------+
|player|assists|
+------+-------+
| A| 11|
| B| 19|
| C| 22|
| D| 25|
| E| 12|
| F| 41|
+------+-------+
Notice that the RDD has now been converted to a DataFrame with the column names player and assists.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Convert String to Integer in PySpark
How to Convert String to Date in PySpark
How to Convert String to Timestamp in PySpark