How to Convert String to Integer in PySpark (With Example)


You can use the following syntax to convert a string column to an integer column in a PySpark DataFrame:

from pyspark.sql.types import IntegerType

df = df.withColumn('my_integer', df['my_string'].cast(IntegerType()))

This particular example creates a new column called my_integer that contains the integer values from the string values in the my_string column.

The following example shows how to use this syntax in practice.

Example: How to Convert String to Integer in PySpark

Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', '11'], 
        ['B', '19'], 
        ['C', '22'], 
        ['D', '25'], 
        ['E', '12'], 
        ['F', '41'],
        ['G', '32'],
        ['H', '20']] 
  
#define column names
columns = ['team', 'points']
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+------+
|team|points|
+----+------+
|   A|    11|
|   B|    19|
|   C|    22|
|   D|    25|
|   E|    12|
|   F|    41|
|   G|    32|
|   H|    20|
+----+------+

We can use the following syntax to display the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('team', 'string'), ('points', 'string')]

We can see that the points column currently has a data type of string.

To convert this column from a string to an integer, we can use the following syntax:

from pyspark.sql.types import IntegerType

#create integer column from string column
df = df.withColumn('points_integer', df['points'].cast(IntegerType()))

#view updated DataFrame
df.show()

+----+------+--------------+
|team|points|points_integer|
+----+------+--------------+
|   A|    11|            11|
|   B|    19|            19|
|   C|    22|            22|
|   D|    25|            25|
|   E|    12|            12|
|   F|    41|            41|
|   G|    32|            32|
|   H|    20|            20|
+----+------+--------------+

We can use the dtypes attribute once again to view the data type of each column in the DataFrame:

#check data type of each column
df.dtypes

[('team', 'string'), ('points', 'string'), ('points_integer', 'int')]

We can see that the points_integer column has a data type of int.

We have successfully created an integer column from a string column. The original points column is unchanged and still has a data type of string.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

How to Convert String to Date in PySpark
How to Convert String to Timestamp in PySpark
How to Convert Timestamp to Date in PySpark
