PySpark: How to Check Data Type of Columns in DataFrame


You can use the following methods in PySpark to check the data type of columns in a DataFrame:

Method 1: Check Data Type of One Specific Column

#return data type of 'conference' column
dict(df.dtypes)['conference']

Method 2: Check Data Type of All Columns

#return data type of all columns
df.dtypes

The following examples show how to use each method in practice with this PySpark DataFrame:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4], 
        ['A', None, 8, 9], 
        ['A', 'East', 10, 3], 
        ['B', 'West', None, 12], 
        ['B', 'West', None, 4], 
        ['C', 'East', 5, 2]] 
  
#define column names
columns = ['team', 'conference', 'points', 'assists'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      null|     8|      9|
|   A|      East|    10|      3|
|   B|      West|  null|     12|
|   B|      West|  null|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+

Example 1: Check Data Type of One Specific Column

We can use the following syntax to check the data type of the conference column in the DataFrame:

#return data type of 'conference' column
dict(df.dtypes)['conference']

'string'

The output tells us that the conference column has a data type of string.

To check the data type of another specific column, simply replace conference with a different column name:

#return data type of 'points' column
dict(df.dtypes)['points']

'bigint'

The output tells us that the points column has a data type of bigint.

Example 2: Check Data Type of All Columns

We can use the following syntax to check the data type of all columns in the DataFrame:

#return data type of all columns
df.dtypes

[('team', 'string'),
 ('conference', 'string'),
 ('points', 'bigint'),
 ('assists', 'bigint')]

The output shows each of the column names along with the data type of each column.

For example, we can see:

  • The team column has a data type of string.
  • The conference column has a data type of string.
  • The points column has a data type of bigint.
  • The assists column has a data type of bigint.

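Because df.dtypes returns a plain Python list of (column_name, type_name) tuples, you can filter it with ordinary Python. A sketch that selects the numeric columns, assuming the dtypes output shown above:

```python
#df.dtypes returns a list of (column_name, type_name) tuples;
#here we use the output shown above
dtypes = [('team', 'string'),
          ('conference', 'string'),
          ('points', 'bigint'),
          ('assists', 'bigint')]

#keep only columns whose type is numeric
numeric_cols = [name for name, dtype in dtypes
                if dtype in ('int', 'bigint', 'float', 'double')]

print(numeric_cols)  #['points', 'assists']
```

This pattern is useful when you want to apply an operation (such as an aggregation) only to columns of a certain type.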

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Check if Column Exists in DataFrame
PySpark: How to Select Columns by Index in DataFrame
PySpark: How to Print One Column of a DataFrame
