You can use the spark.read.csv() function to read a CSV file into a PySpark DataFrame.
Here are three common ways to do so:
Method 1: Read CSV File
df = spark.read.csv('data.csv')
Method 2: Read CSV File with Header
df = spark.read.csv('data.csv', header=True)
Method 3: Read CSV File with Specific Delimiter
df = spark.read.csv('data.csv', header=True, sep=';')
The following examples show how to use each method in practice.
Example 1: Read CSV File
Suppose I have a CSV file called data.csv with the following contents:
team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14
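If you want to follow along, one simple way to create this file locally is with plain Python (a quick sketch; the filename data.csv matches the examples in this tutorial):

```python
# write the sample contents shown above (including the spaces after commas) to data.csv
data = """team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14
"""

with open('data.csv', 'w') as f:
    f.write(data)
```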
I can use the following syntax to read this CSV file into a PySpark DataFrame:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv')

#view resulting DataFrame
df.show()

+----+-------+--------+
| _c0|    _c1|     _c2|
+----+-------+--------+
|team| points| assists|
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+
By default, PySpark assumes there is no header in the CSV file, so it simply uses _c0, _c1, _c2 as the column names. Note that by default it also reads every column as a string; you can pass inferSchema=True to have PySpark infer the data type of each column instead.
Example 2: Read CSV File with Header
Once again suppose I have a CSV file called data.csv with the following contents:
team, points, assists
'A', 78, 12
'B', 85, 20
'C', 93, 23
'D', 90, 8
'E', 91, 14
I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the first row should be used as the header:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True)

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+
Since we specified header=True, PySpark used the first row in the CSV file as the header row in the resulting DataFrame.
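The header-row idea is not unique to Spark. As a quick illustration using Python's standard csv module (a stdlib sketch, not PySpark), csv.DictReader likewise treats the first row as the header and maps each subsequent row onto those column names:

```python
import csv
import io

# in-memory sample data with a header row, mirroring data.csv
text = """team,points,assists
'A',78,12
'B',85,20
"""

# DictReader uses the first row as the field names
reader = csv.DictReader(io.StringIO(text))
print(reader.fieldnames)   # the column names taken from the header row
for row in reader:
    print(row['team'], row['points'])
```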
Example 3: Read CSV File with Specific Delimiter
Suppose I have a CSV file called data.csv with the following contents:
team; points; assists
'A'; 78; 12
'B'; 85; 20
'C'; 93; 23
'D'; 90; 8
'E'; 91; 14
I can use the following syntax to read this CSV file into a PySpark DataFrame and specify that the values in the file are separated by semi-colons:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#read CSV into PySpark DataFrame
df = spark.read.csv('data.csv', header=True, sep=';')

#view resulting DataFrame
df.show()

+----+-------+--------+
|team| points| assists|
+----+-------+--------+
| 'A'|     78|      12|
| 'B'|     85|      20|
| 'C'|     93|      23|
| 'D'|     90|       8|
| 'E'|     91|      14|
+----+-------+--------+
Since we used the sep argument, PySpark knew to use semi-colons as the delimiter for the values when reading the CSV file into the DataFrame.
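The delimiter concept works the same way outside of Spark. As a stdlib illustration (plain Python csv, not PySpark), passing delimiter=';' to csv.reader splits the same semicolon-separated data into fields:

```python
import csv
import io

# semicolon-separated sample, mirroring the file above (without the spaces)
text = """team;points;assists
'A';78;12
'B';85;20
"""

# tell csv.reader that fields are separated by semicolons
for row in csv.reader(io.StringIO(text), delimiter=';'):
    print(row)
```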
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Check if Column Exists in DataFrame
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Print One Column of a DataFrame