How to Read CSV File into PySpark DataFrame (3 Examples)


You can use the spark.read.csv() method to read a CSV file into a PySpark DataFrame.

Here are three common ways to do so:

Method 1: Read CSV File 

df = spark.read.csv('data.csv')

Method 2: Read CSV File with Header

df = spark.read.csv('data.csv', header=True) 

Method 3: Read CSV File with Specific Delimiter

df = spark.read.csv('data.csv', header=True, sep=';')

The following examples show how to use each method in practice.

Example 1: Read CSV File

Suppose I have a CSV file called data.csv with the following contents:

team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14

I can use the following syntax to read this CSV file into a PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv')

#view resulting DataFrame
df.show()

+----+-------+--------+
| _c0|    _c1|     _c2|
+----+-------+--------+
|team| points| assists|
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+

By default, PySpark assumes there is no header in the CSV file, so it treats the first line as a row of data and assigns _c0, _c1, _c2 as the column names.

Example 2: Read CSV File with Header

Once again, suppose I have a CSV file called data.csv with the following contents:

team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14

I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the first row should be used as the header:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True)

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+

Since I specified header=True, PySpark used the first row in the CSV file as the column names of the resulting DataFrame instead of treating it as data.

Example 3: Read CSV File with Specific Delimiter

Suppose I have a CSV file called data.csv with the following contents:

team; points; assists
'A'; 78; 12
'B'; 85; 20
'C'; 93; 23
'D'; 90; 8
'E'; 91; 14

I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the values in the file are separated by semicolons:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True, sep=';')

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+

Since I passed sep=';', PySpark used semicolons instead of the default commas as the delimiter when splitting each line of the CSV file into columns.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Check if Column Exists in DataFrame
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Print One Column of a DataFrame
