You can use the following syntax to convert a string column to a timestamp column in a PySpark DataFrame:
from pyspark.sql import functions as F
df = df.withColumn('ts_new', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))
This particular example creates a new column called ts_new that contains the timestamp values parsed from the strings in the ts column, using the format yyyy-MM-dd HH:mm:ss.
The following example shows how to use this syntax in practice.
Example: How to Convert String to Timestamp in PySpark
Suppose we have the following PySpark DataFrame that contains information about sales made at various times at some company:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['2023-01-15 04:14:22', 225],
        ['2023-02-24 10:55:01', 260],
        ['2023-07-14 18:34:59', 413],
        ['2023-10-30 22:20:05', 368]]
#define column names
columns = ['ts', 'sales']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+-------------------+-----+
| ts|sales|
+-------------------+-----+
|2023-01-15 04:14:22| 225|
|2023-02-24 10:55:01| 260|
|2023-07-14 18:34:59| 413|
|2023-10-30 22:20:05| 368|
+-------------------+-----+
We can use the following syntax to display the data type of each column in the DataFrame:
#check data type of each column
df.dtypes
[('ts', 'string'), ('sales', 'bigint')]
We can see that the ts column currently has a data type of string.
To convert this column from a string to a timestamp, we can use the following syntax:
from pyspark.sql import functions as F
#convert 'ts' column from string to timestamp
df = df.withColumn('ts_new', F.to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss'))
#view updated DataFrame
df.show()
+-------------------+-----+-------------------+
| ts|sales| ts_new|
+-------------------+-----+-------------------+
|2023-01-15 04:14:22| 225|2023-01-15 04:14:22|
|2023-02-24 10:55:01| 260|2023-02-24 10:55:01|
|2023-07-14 18:34:59| 413|2023-07-14 18:34:59|
|2023-10-30 22:20:05| 368|2023-10-30 22:20:05|
+-------------------+-----+-------------------+
We can use the dtypes attribute once again to view the data type of each column in the DataFrame:
#check data type of each column
df.dtypes
[('ts', 'string'), ('sales', 'bigint'), ('ts_new', 'timestamp')]
We can see that the new column called ts_new has a data type of timestamp.
We have successfully converted a string column to a timestamp column.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
PySpark: How to Select Columns with Alias
PySpark: How to Select Columns by Index
PySpark: How to Select Multiple Columns