How to Select Only Numeric Columns in PySpark


You can use the following syntax to only select numeric columns in a PySpark DataFrame:

#find all numeric columns in DataFrame
numeric_cols = [c for c, t in df.dtypes if t.startswith('string')==False]

#select only numeric columns in DataFrame
df.select(*numeric_cols).show()

The following example shows how to use this syntax in practice.

Example: How to Select Only Numeric Columns in PySpark

Suppose we have the following PySpark DataFrame that contains five columns with information about various basketball players:

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'Guard', 11, 4, 22.4], 
        ['A', 'Guard', 8, 5, 34.1], 
        ['A', 'Forward', 22, 6, 35.1], 
        ['A', 'Forward', 22, 7, 18.7], 
        ['B', 'Guard', 14, 12, 20.2], 
        ['B', 'Guard', 14, 8, 15.6],
        ['B', 'Forward', 13, 9, 20.9],
        ['B', 'Center', 7, 9, 4.8]] 
  
#define column names
columns = ['team', 'position', 'points', 'assists', 'minutes'] 
  
#create dataframe using data and column names
df = spark.createDataFrame(data, columns) 
  
#view dataframe
df.show()

+----+--------+------+-------+-------+
|team|position|points|assists|minutes|
+----+--------+------+-------+-------+
|   A|   Guard|    11|      4|   22.4|
|   A|   Guard|     8|      5|   34.1|
|   A| Forward|    22|      6|   35.1|
|   A| Forward|    22|      7|   18.7|
|   B|   Guard|    14|     12|   20.2|
|   B|   Guard|    14|      8|   15.6|
|   B| Forward|    13|      9|   20.9|
|   B|  Center|     7|      9|    4.8|
+----+--------+------+-------+-------+

We can use the following syntax to display the data type of each column in the DataFrame:

#display data type of each column
df.dtypes

[('team', 'string'),
 ('position', 'string'),
 ('points', 'bigint'),
 ('assists', 'bigint'),
 ('minutes', 'double')]

We can then use the following syntax to get a list of all numeric columns in the DataFrame:

#find all numeric columns in DataFrame
numeric_cols = [c for c, t in df.dtypes if t.startswith('string')==False]

#view list of numeric columns
print(numeric_cols)

['points', 'assists', 'minutes']

We can then use the following syntax to only select the numeric columns in the DataFrame:

#select only numeric columns in DataFrame
df.select(*numeric_cols).show()

+------+-------+-------+
|points|assists|minutes|
+------+-------+-------+
|    11|      4|   22.4|
|     8|      5|   34.1|
|    22|      6|   35.1|
|    22|      7|   18.7|
|    14|     12|   20.2|
|    14|      8|   15.6|
|    13|      9|   20.9|
|     7|      9|    4.8|
+------+-------+-------+

The resulting DataFrame contains the points, assists and minutes columns, which are the three numeric columns in the DataFrame.

Note: We used the startswith function to find all of the columns in the DataFrame whose datatype did not start with ‘string’, which left us with only the numeric columns.

Additional Resources

The following tutorials explain how to perform other common tasks in PySpark:

PySpark: How to Select Rows Based on Column Values
PySpark: How to Select Rows by Index in DataFrame
PySpark: How to Select Columns by Index in DataFrame

Featured Posts

Leave a Reply

Your email address will not be published. Required fields are marked *