# How to Calculate Standard Deviation in PySpark

You can use the following methods to calculate the standard deviation of a column in a PySpark DataFrame:

Method 1: Calculate Standard Deviation for One Specific Column

```python
from pyspark.sql import functions as F

# calculate standard deviation of values in 'game1' column
df.agg(F.stddev('game1')).collect()[0][0]
```

Method 2: Calculate Standard Deviation for Multiple Columns

```python
from pyspark.sql.functions import stddev

# calculate standard deviation for game1, game2 and game3 columns
df.select(stddev(df.game1), stddev(df.game2), stddev(df.game3)).show()
```

Note: The stddev function is an alias for stddev_samp, so it uses the sample standard deviation formula, which divides by n - 1.

To use the population standard deviation formula, which divides by n, use the stddev_pop function instead.
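To make the difference between the two formulas concrete, here is a minimal plain-Python sketch that uses the standard library rather than PySpark. The values are the game1 scores from the example DataFrame used in this article, and the variable names are only illustrative.

```python
import math
import statistics

# game1 values from the example DataFrame used in this article
values = [25, 22, 14, 30, 15, 10]

n = len(values)
mean = sum(values) / n

# sum of squared deviations from the mean
ss = sum((x - mean) ** 2 for x in values)

# sample formula divides by n - 1 (the formula stddev / stddev_samp use)
sample_sd = math.sqrt(ss / (n - 1))

# population formula divides by n (the formula stddev_pop uses)
population_sd = math.sqrt(ss / n)

# cross-check against the standard library implementations
assert math.isclose(sample_sd, statistics.stdev(values))
assert math.isclose(population_sd, statistics.pstdev(values))

print(round(sample_sd, 4))      # 7.5807
print(round(population_sd, 4))  # 6.9202
```

The sample value is always the larger of the two, and the gap shrinks as the number of rows grows.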

The examples below show how to use each method in practice with this PySpark DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# define data
data = [['Mavs', 25, 11, 10],
        ['Nets', 22, 8, 14],
        ['Hawks', 14, 22, 10],
        ['Kings', 30, 22, 35],
        ['Bulls', 15, 14, 12],
        ['Blazers', 10, 14, 18]]

# define column names
columns = ['team', 'game1', 'game2', 'game3']

# create dataframe using data and column names
df = spark.createDataFrame(data, columns)

# view dataframe
df.show()

+-------+-----+-----+-----+
|   team|game1|game2|game3|
+-------+-----+-----+-----+
|   Mavs|   25|   11|   10|
|   Nets|   22|    8|   14|
|  Hawks|   14|   22|   10|
|  Kings|   30|   22|   35|
|  Bulls|   15|   14|   12|
|Blazers|   10|   14|   18|
+-------+-----+-----+-----+
```

## Example 1: Calculate Standard Deviation for One Specific Column

We can use the following syntax to calculate the standard deviation of values in the game1 column of the DataFrame only:

```python
from pyspark.sql import functions as F

# calculate standard deviation of column named 'game1'
df.agg(F.stddev('game1')).collect()[0][0]

7.5806771905065755
```

The sample standard deviation of the values in the game1 column is approximately 7.5807.

## Example 2: Calculate Standard Deviation for Multiple Columns

We can use the following syntax to calculate the standard deviation of values for the game1, game2 and game3 columns of the DataFrame:

```python
from pyspark.sql.functions import stddev

# calculate standard deviation for game1, game2 and game3 columns
df.select(stddev(df.game1), stddev(df.game2), stddev(df.game3)).show()

+------------------+------------------+------------------+
|stddev_samp(game1)|stddev_samp(game2)|stddev_samp(game3)|
+------------------+------------------+------------------+
|7.5806771905065755| 5.741660619251774| 9.544631999192006|
+------------------+------------------+------------------+
```

From the output we can see:

- The standard deviation of values in the game1 column is 7.5807.
- The standard deviation of values in the game2 column is 5.7417.
- The standard deviation of values in the game3 column is 9.5446.

Note: The stddev function ignores null values in the column, so only the non-null values contribute to the result.