You can use the following methods to calculate summary statistics for columns in a PySpark DataFrame:
Method 1: Calculate Summary Statistics for All Columns
df.summary().show()
Method 2: Calculate Specific Summary Statistics for All Columns
df.summary('min', '25%', '50%', '75%', 'max').show()
Method 3: Calculate Summary Statistics for Only Numeric Columns
numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

df.select(*numeric_cols).summary().show()
The following examples show how to use each method in practice with the following PySpark DataFrame that contains information about various basketball players:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
#define data
data = [['A', 'East', 11, 4],
['A', 'East', 8, 9],
['A', 'East', 10, 3],
['B', 'West', 6, 12],
['B', 'West', 6, 4],
['C', 'East', 5, 2]]
#define column names
columns = ['team', 'conference', 'points', 'assists']
#create dataframe using data and column names
df = spark.createDataFrame(data, columns)
#view dataframe
df.show()
+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
| A| East| 11| 4|
| A| East| 8| 9|
| A| East| 10| 3|
| B| West| 6| 12|
| B| West| 6| 4|
| C| East| 5| 2|
+----+----------+------+-------+
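Note that Spark infers a data type for each column when the DataFrame is created. If you'd like to verify which types were inferred (this becomes relevant in Example 3 below), you can inspect the dtypes attribute, which returns a list of (column name, type) tuples:

#view the name and inferred data type of each column
df.dtypes

[('team', 'string'), ('conference', 'string'), ('points', 'bigint'), ('assists', 'bigint')]

Since the points and assists values were supplied as Python integers, Spark infers them as bigint, while team and conference are inferred as string.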
Example 1: Calculate Summary Statistics for All Columns
We can use the following syntax to calculate summary statistics for all columns in the DataFrame:
#calculate summary statistics for each column in DataFrame
df.summary().show()
+-------+----+----------+-----------------+------------------+
|summary|team|conference| points| assists|
+-------+----+----------+-----------------+------------------+
| count| 6| 6| 6| 6|
| mean|null| null|7.666666666666667| 5.666666666666667|
| stddev|null| null|2.422120283277993|3.9327683210007005|
| min| A| East| 5| 2|
| 25%|null| null| 6| 3|
| 50%|null| null| 6| 4|
| 75%|null| null| 10| 9|
| max| C| West| 11| 12|
+-------+----+----------+-----------------+------------------+
The output displays the following summary statistics for each column in the DataFrame:
- count: The number of values in the column
- mean: The mean value
- stddev: The standard deviation of values
- min: The minimum value
- 25%: The 25th percentile
- 50%: The 50th percentile (this is also the median)
- 75%: The 75th percentile
- max: The max value
Note that several of these statistics, such as the mean and standard deviation, aren't meaningful for string columns, which is why the summary output shows null for the team and conference columns in those rows.
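If you'd like to verify any of these values, one option is to compute them directly using the aggregate functions from the pyspark.sql.functions module. The following is a minimal sketch that calculates the mean and standard deviation of the points column, which should match the corresponding rows of the summary output:

from pyspark.sql.functions import mean, stddev

#calculate the mean and standard deviation of the points column directly
df.select(mean('points'), stddev('points')).show()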
Example 2: Calculate Specific Summary Statistics for All Columns
We can use the following syntax to calculate specific summary statistics for all columns in the DataFrame:
#calculate specific summary statistics for each column in DataFrame
df.summary('min', '25%', '50%', '75%', 'max').show()
+-------+----+----------+------+-------+
|summary|team|conference|points|assists|
+-------+----+----------+------+-------+
| min| A| East| 5| 2|
| 25%|null| null| 6| 3|
| 50%|null| null| 6| 4|
| 75%|null| null| 10| 9|
| max| C| West| 11| 12|
+-------+----+----------+------+-------+
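Keep in mind that the percentiles produced by the summary function are computed approximately. If you need percentiles of a single numeric column with a controlled error tolerance, one alternative (shown here as a sketch) is the approxQuantile method, where passing a relative error of 0 requests exact values:

#calculate the 25th, 50th and 75th percentiles of the points column exactly
df.approxQuantile('points', [0.25, 0.5, 0.75], 0.0)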
Example 3: Calculate Summary Statistics for Only Numeric Columns
We can use the following syntax to calculate summary statistics only for the numeric columns in the DataFrame:
#identify numeric columns in DataFrame
numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

#calculate summary statistics for only the numeric columns
df.select(*numeric_cols).summary().show()

+-------+-----------------+------------------+
|summary|           points|           assists|
+-------+-----------------+------------------+
|  count|                6|                 6|
|   mean|7.666666666666667| 5.666666666666667|
| stddev|2.422120283277993|3.9327683210007005|
|    min|                5|                 2|
|    25%|                6|                 3|
|    50%|                6|                 4|
|    75%|               10|                 9|
|    max|               11|                12|
+-------+-----------------+------------------+
Notice that summary statistics are displayed only for the two numeric columns in the DataFrame – the points and assists columns.
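Note that checking whether each type string starts with 'string' works in this example because team and conference are the only non-numeric columns. If your DataFrame also contains other non-numeric columns such as dates or booleans, a more robust approach (sketched below, though not required here) is to check the schema for numeric types directly:

from pyspark.sql.types import NumericType

#identify numeric columns by checking each field's data type in the schema
numeric_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]

#calculate summary statistics for only the numeric columns
df.select(*numeric_cols).summary().show()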
Note: You can find the complete documentation for the PySpark summary function here.
Additional Resources
The following tutorials explain how to perform other common tasks in PySpark:
How to Calculate the Mean of a Column in PySpark
How to Calculate Mean of Multiple Columns in PySpark
How to Calculate Sum by Group in PySpark