You can use the following syntax to convert a string column to an integer column in a PySpark DataFrame:
from pyspark.sql.types import IntegerType
df = df.withColumn('my_integer', df['my_string'].cast(IntegerType()))
This particular example creates a new column called my_integer that contains the integer values from the string values in the my_string column.
The following example shows how to use this syntax in practice.
Example: How to Convert String to Integer in PySpark
Suppose we have the following PySpark DataFrame that contains information about points scored by various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', '11'],
        ['B', '19'],
        ['C', '22'],
        ['D', '25'],
        ['E', '12'],
        ['F', '41'],
        ['G', '32'],
        ['H', '20']]
#define column names
columns = ['team', 'points']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+------+
|team|points|
+----+------+
|   A|    11|
|   B|    19|
|   C|    22|
|   D|    25|
|   E|    12|
|   F|    41|
|   G|    32|
|   H|    20|
+----+------+
We can use the following syntax to display the data type of each column in the DataFrame:
#check data type of each column
df.dtypes
[('team', 'string'), ('points', 'string')]
We can see that the points column currently has a data type of string.
To convert this column from a string to an integer, we can use the following syntax:
from pyspark.sql.types import IntegerType
#create integer column from string column
df = df.withColumn('points_integer', df['points'].cast(IntegerType()))
#view updated DataFrame
df.show()
+----+------+--------------+
|team|points|points_integer|
+----+------+--------------+
|   A|    11|            11|
|   B|    19|            19|
|   C|    22|            22|
|   D|    25|            25|
|   E|    12|            12|
|   F|    41|            41|
|   G|    32|            32|
|   H|    20|            20|
+----+------+--------------+
We can use the dtypes attribute once again to view the data type of each column in the DataFrame:
#check data type of each column
df.dtypes
[('team', 'string'), ('points', 'string'), ('points_integer', 'int')]
We can see that the points_integer column has a data type of int.
We have successfully created an integer column from a string column.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Convert String to Date in PySpark
How to Convert String to Timestamp in PySpark
How to Convert Timestamp to Date in PySpark