# How to Calculate Summary Statistics in PySpark

You can use the following methods to calculate summary statistics for columns in a PySpark DataFrame:

Method 1: Calculate Summary Statistics for All Columns

```
df.summary().show()
```

Method 2: Calculate Specific Summary Statistics for All Columns

```
df.summary('min', '25%', '50%', '75%', 'max').show()
```

Method 3: Calculate Summary Statistics for Only Numeric Columns

```
numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

df.select(*numeric_cols).summary().show()
```

The examples below show how to use each method in practice with the following PySpark DataFrame, which contains information about various basketball players:

```
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

#define data
data = [['A', 'East', 11, 4],
        ['A', 'East', 8, 9],
        ['A', 'East', 10, 3],
        ['B', 'West', 6, 12],
        ['B', 'West', 6, 4],
        ['C', 'East', 5, 2]]

#define column names
columns = ['team', 'conference', 'points', 'assists']

#create dataframe using data and column names
df = spark.createDataFrame(data, columns)

#view dataframe
df.show()

+----+----------+------+-------+
|team|conference|points|assists|
+----+----------+------+-------+
|   A|      East|    11|      4|
|   A|      East|     8|      9|
|   A|      East|    10|      3|
|   B|      West|     6|     12|
|   B|      West|     6|      4|
|   C|      East|     5|      2|
+----+----------+------+-------+
```

## Example 1: Calculate Summary Statistics for All Columns

We can use the following syntax to calculate summary statistics for all columns in the DataFrame:

```
#calculate summary statistics for each column in DataFrame
df.summary().show()

+-------+----+----------+-----------------+------------------+
|summary|team|conference|           points|           assists|
+-------+----+----------+-----------------+------------------+
|  count|   6|         6|                6|                 6|
|   mean|null|      null|7.666666666666667| 5.666666666666667|
| stddev|null|      null|2.422120283277993|3.9327683210007005|
|    min|   A|      East|                5|                 2|
|    25%|null|      null|                6|                 3|
|    50%|null|      null|                6|                 4|
|    75%|null|      null|               10|                 9|
|    max|   C|      West|               11|                12|
+-------+----+----------+-----------------+------------------+
```

The output displays the following summary statistics for each column in the DataFrame:

• count: The number of values in the column
• mean: The mean value
• stddev: The sample standard deviation of values
• min: The minimum value
• 25%: The 25th percentile
• 50%: The 50th percentile (an approximation of the median)
• 75%: The 75th percentile
• max: The max value

Note that for string columns the mean, stddev and percentile rows are null, and min and max are based on alphabetical order, so only count is usually worth interpreting.
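
The numeric values in the table can be cross-checked without Spark. The sketch below uses only Python's standard library on the points column from the example data. It assumes that stddev is the sample standard deviation (dividing by n - 1) and that each percentile is the value at the nearest rank in the sorted column. The helper nearest_rank_percentile is a hypothetical name used for illustration, not a PySpark function, and the nearest-rank rule is an assumption that happens to reproduce this output rather than a description of Spark's internal algorithm.

```python
import math
import statistics

# points column from the example DataFrame
points = [11, 8, 10, 6, 6, 5]

def nearest_rank_percentile(values, pct):
    """Return the value at the nearest rank for a percentile given as 0-100.

    Hypothetical helper for illustration; not part of PySpark.
    """
    ordered = sorted(values)
    # rank is 1-based: the smallest rank that covers pct percent of the values
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

mean = statistics.mean(points)      # arithmetic mean
stddev = statistics.stdev(points)   # sample standard deviation (n - 1)
quartiles = [nearest_rank_percentile(points, p) for p in (25, 50, 75)]

print(mean, stddev, quartiles)
print(statistics.median(points))    # interpolated median, for comparison
```

For points, the interpolated median is 7.0 while the 50% row shows 6, so on small data the 50% value is best read as an approximate median that is always one of the values in the column.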

## Example 2: Calculate Specific Summary Statistics for All Columns

We can use the following syntax to calculate specific summary statistics for all columns in the DataFrame:

```
#calculate specific summary statistics for each column in DataFrame
df.summary('min', '25%', '50%', '75%', 'max').show()

+-------+----+----------+------+-------+
|summary|team|conference|points|assists|
+-------+----+----------+------+-------+
|    min|   A|      East|     5|      2|
|    25%|null|      null|     6|      3|
|    50%|null|      null|     6|      4|
|    75%|null|      null|    10|      9|
|    max|   C|      West|    11|     12|
+-------+----+----------+------+-------+
```

## Example 3: Calculate Summary Statistics for Only Numeric Columns

We can use the following syntax to calculate summary statistics only for the numeric columns in the DataFrame:

```
#identify numeric columns in DataFrame
numeric_cols = [c for c, t in df.dtypes if not t.startswith('string')]

#calculate summary statistics for only the numeric columns
df.select(*numeric_cols).summary().show()

+-------+-----------------+------------------+
|summary|           points|           assists|
+-------+-----------------+------------------+
|  count|                6|                 6|
|   mean|7.666666666666667| 5.666666666666667|
| stddev|2.422120283277993|3.9327683210007005|
|    min|                5|                 2|
|    25%|                6|                 3|
|    50%|                6|                 4|
|    75%|               10|                 9|
|    max|               11|                12|
+-------+-----------------+------------------+
```

Notice that summary statistics are displayed only for the two numeric columns in the DataFrame – the points and assists columns.
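
The non-string filter works here because every other column is numeric, but it would also let boolean, date, or timestamp columns into numeric_cols if the DataFrame had any. A stricter filter can be sketched on the (name, type) pairs that df.dtypes returns. The helper name numeric_columns and the set of type names below are assumptions for illustration: they are the names Spark typically reports for its numeric types, so check them against your Spark version.

```python
# Type names that df.dtypes typically reports for Spark's numeric types
# (assumed list for illustration; decimals appear as e.g. 'decimal(10,2)')
NUMERIC_TYPE_NAMES = {'tinyint', 'smallint', 'int', 'bigint', 'float', 'double'}

def numeric_columns(dtypes):
    """Return the names of columns whose type name is numeric.

    Hypothetical helper; dtypes is a list of (column name, type name) pairs.
    """
    return [name for name, type_name in dtypes
            if type_name in NUMERIC_TYPE_NAMES or type_name.startswith('decimal')]

# Pairs shaped like df.dtypes for the example DataFrame,
# plus a hypothetical boolean column that is not in the original data
example_dtypes = [('team', 'string'), ('conference', 'string'),
                  ('points', 'bigint'), ('assists', 'bigint'),
                  ('starter', 'boolean')]

print(numeric_columns(example_dtypes))
```

With a live DataFrame, the result of numeric_columns(df.dtypes) could be passed to df.select in place of numeric_cols.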

Note: You can find the complete documentation for the PySpark summary function in the official PySpark API reference under DataFrame.summary.