Many statistical tests assume that the data is normally distributed.

There are four common ways to check this assumption in R:

**1. (Visual Method) Create a histogram.**

- If the histogram is roughly “bell-shaped”, then the data is assumed to be normally distributed.

**2. (Visual Method) Create a Q-Q plot.**

- If the points in the plot roughly fall along a straight diagonal line, then the data is assumed to be normally distributed.

**3. (Formal Statistical Test) Perform a Shapiro-Wilk Test.**

- If the p-value of the test is greater than α = .05, we fail to reject the null hypothesis of normality and the data is assumed to be normally distributed.

**4. (Formal Statistical Test) Perform a Kolmogorov-Smirnov Test.**

- If the p-value of the test is greater than α = .05, we fail to reject the null hypothesis that the data follows the specified normal distribution, and the data is assumed to be normally distributed.

The following examples show how to use each of these methods in practice.

**Method 1: Create a Histogram**

The following code shows how to create a histogram for a normally distributed and non-normally distributed dataset in R:

    #make this example reproducible
    set.seed(0)

    #create data that follows a normal distribution
    normal_data <- rnorm(200)

    #create data that follows an exponential distribution
    non_normal_data <- rexp(200, rate=3)

    #define plotting region
    par(mfrow=c(1,2))

    #create histogram for both datasets
    hist(normal_data, col='steelblue', main='Normal')
    hist(non_normal_data, col='steelblue', main='Non-normal')

The histogram on the left shows a dataset that is normally distributed (roughly “bell-shaped”), while the histogram on the right shows a dataset that is not normally distributed: it is strongly right-skewed.
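Judging a “bell shape” by eye is easier with a reference curve. As a minimal sketch, the code below overlays on each histogram the normal density implied by the sample mean and standard deviation. The helper name hist_with_normal_curve is hypothetical and introduced here only for illustration.

```r
# Hypothetical helper (not part of base R): draws a density-scaled histogram
# and overlays the normal curve implied by the sample mean and sd.
hist_with_normal_curve <- function(values, title) {
  m <- mean(values)
  s <- sd(values)
  # freq = FALSE puts the y-axis on the density scale so the curve is comparable
  hist(values, freq = FALSE, col = 'steelblue', main = title)
  # curve() evaluates the expression over a grid of x values on the current plot
  curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)
  invisible(c(mean = m, sd = s))
}

#make this example reproducible
set.seed(0)
normal_data <- rnorm(200)
non_normal_data <- rexp(200, rate=3)

#define plotting region
par(mfrow=c(1,2))

fit_normal <- hist_with_normal_curve(normal_data, 'Normal')
fit_non_normal <- hist_with_normal_curve(non_normal_data, 'Non-normal')
```

When the bars track the curve closely, the bell-shape judgment is easier to defend; large gaps between bars and curve point to a departure from normality.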

**Method 2: Create a Q-Q Plot**

The following code shows how to create a Q-Q plot for a normally distributed and non-normally distributed dataset in R:

    #make this example reproducible
    set.seed(0)

    #create data that follows a normal distribution
    normal_data <- rnorm(200)

    #create data that follows an exponential distribution
    non_normal_data <- rexp(200, rate=3)

    #define plotting region
    par(mfrow=c(1,2))

    #create Q-Q plot for both datasets
    qqnorm(normal_data, main='Normal')
    qqline(normal_data)

    qqnorm(non_normal_data, main='Non-normal')
    qqline(non_normal_data)

In the Q-Q plot on the left, the points fall along a straight diagonal line, which is consistent with normally distributed data. In the Q-Q plot on the right, the points curve away from the line, which indicates that the data is not normally distributed.
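How closely the points follow a straight line can also be summarized with a number. The sketch below computes the correlation between the sample quantiles and the theoretical normal quantiles that qqnorm() returns. The helper name qq_correlation is hypothetical, and the value is only a rough descriptive summary, not a formal test with a cutoff.

```r
#make this example reproducible
set.seed(0)
normal_data <- rnorm(200)
non_normal_data <- rexp(200, rate=3)

# Hypothetical helper: correlation between the theoretical normal quantiles (x)
# and the sample quantiles (y) that a Q-Q plot would display.
qq_correlation <- function(values) {
  qq <- qqnorm(values, plot.it = FALSE)  # compute the points without drawing
  cor(qq$x, qq$y)
}

r_normal <- qq_correlation(normal_data)
r_non_normal <- qq_correlation(non_normal_data)

#values closer to 1 mean the points lie closer to a straight line
print(c(normal = r_normal, non_normal = r_non_normal))
```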

**Method 3: Perform a Shapiro-Wilk Test**

The following code shows how to perform a Shapiro-Wilk test on a normally distributed and non-normally distributed dataset in R:

    #make this example reproducible
    set.seed(0)

    #create data that follows a normal distribution
    normal_data <- rnorm(200)

    #perform shapiro-wilk test
    shapiro.test(normal_data)

    	Shapiro-Wilk normality test

    data:  normal_data
    W = 0.99248, p-value = 0.3952

    #create data that follows an exponential distribution
    non_normal_data <- rexp(200, rate=3)

    #perform shapiro-wilk test
    shapiro.test(non_normal_data)

    	Shapiro-Wilk normality test

    data:  non_normal_data
    W = 0.84153, p-value = 1.698e-13

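The p-value of the first test is not less than .05, so we fail to reject the null hypothesis of normality. There is not sufficient evidence that the data departs from a normal distribution.

The p-value of the second test *is* less than .05, so we reject the null hypothesis and conclude that the data is not normally distributed.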
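When the test is one step in a longer script, it is often more convenient to work with the p-value directly than to read the printed output. The following sketch wraps the decision rule in a small function. The name passes_shapiro and its alpha argument are hypothetical, chosen for illustration.

```r
# Hypothetical helper: returns TRUE when the Shapiro-Wilk test does NOT reject
# normality at the chosen significance level.
passes_shapiro <- function(values, alpha = 0.05) {
  # shapiro.test() requires between 3 and 5000 non-missing values
  result <- shapiro.test(values)
  unname(result$p.value) > alpha
}

#make this example reproducible
set.seed(0)
normal_data <- rnorm(200)
non_normal_data <- rexp(200, rate=3)

passes_shapiro(normal_data)
passes_shapiro(non_normal_data)
```

A result of TRUE only means the test found no evidence against normality. It does not prove that the data is normal, and with very large samples even small, harmless departures can produce a result of FALSE.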

**Method 4: Perform a Kolmogorov-Smirnov Test**

The following code shows how to perform a Kolmogorov-Smirnov test on a normally distributed and non-normally distributed dataset in R:

    #make this example reproducible
    set.seed(0)

    #create data that follows a normal distribution
    normal_data <- rnorm(200)

    #perform kolmogorov-smirnov test
    ks.test(normal_data, 'pnorm')

    	One-sample Kolmogorov-Smirnov test

    data:  normal_data
    D = 0.073535, p-value = 0.2296
    alternative hypothesis: two-sided

    #create data that follows an exponential distribution
    non_normal_data <- rexp(200, rate=3)

    #perform kolmogorov-smirnov test
    ks.test(non_normal_data, 'pnorm')

    	One-sample Kolmogorov-Smirnov test

    data:  non_normal_data
    D = 0.50115, p-value < 2.2e-16
    alternative hypothesis: two-sided

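The p-value of the first test is not less than .05, so we fail to reject the null hypothesis. There is not sufficient evidence that the data departs from a normal distribution.

The p-value of the second test *is* less than .05, so we reject the null hypothesis and conclude that the data does not follow the reference distribution.

Note that calling ks.test() with 'pnorm' and no further arguments compares the data to the *standard* normal distribution (mean 0, standard deviation 1). Data that is normal but has a different mean or standard deviation will also be rejected by this call.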
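The choice of reference distribution matters for the Kolmogorov-Smirnov test, because ks.test() passes any extra arguments on to the distribution function. The sketch below uses an assumed example dataset that is normal with mean 10 and standard deviation 2 (not part of the examples above) and compares it first to the standard normal distribution and then to a normal distribution with the sample mean and standard deviation.

```r
#make this example reproducible
set.seed(0)

# Assumed example data: normal, but with mean 10 and sd 2 rather than 0 and 1
shifted_normal <- rnorm(200, mean = 10, sd = 2)

#compare to the standard normal distribution (the default for 'pnorm')
ks_standard <- ks.test(shifted_normal, 'pnorm')

#compare to a normal distribution with the sample mean and standard deviation
ks_fitted <- ks.test(shifted_normal, 'pnorm',
                     mean = mean(shifted_normal),
                     sd = sd(shifted_normal))

print(ks_standard$p.value)
print(ks_fitted$p.value)
```

The first test rejects because the data is far from a mean of 0, even though it is normal. The second test is only approximate: estimating the mean and standard deviation from the same sample makes the reported p-value too large, which is the problem the Lilliefors correction is designed to address.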

**How to Handle Non-Normal Data**

If a given dataset is *not* normally distributed, we can often perform one of the following transformations to make it more normally distributed:

**1. Log Transformation:** Transform the values from x to **log(x)**.

**2. Square Root Transformation:** Transform the values from x to **√x**.

**3. Cube Root Transformation:** Transform the values from x to **x^(1/3)**.

These transformations are most useful for right-skewed data. Note that the log transformation requires strictly positive values and the square root transformation requires non-negative values.
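As a minimal sketch of how these transformations might be compared, the code below applies each one to the exponential data from the earlier examples and collects the Shapiro-Wilk W statistic for each version. The list name transformed and the vector name w_statistics are illustrative choices.

```r
#make this example reproducible
set.seed(0)

#create data that follows an exponential distribution (strictly positive)
non_normal_data <- rexp(200, rate=3)

#apply each transformation
transformed <- list(
  original    = non_normal_data,
  log         = log(non_normal_data),
  square_root = sqrt(non_normal_data),
  cube_root   = non_normal_data^(1/3)
)

#collect the Shapiro-Wilk W statistic for each version of the data
w_statistics <- sapply(transformed, function(values) {
  unname(shapiro.test(values)$statistic)
})

#values of W closer to 1 indicate data that looks more normal
print(round(w_statistics, 3))
```

Which transformation works best depends on the data, so it is worth comparing the candidates and rechecking the result with a histogram or Q-Q plot rather than assuming that any one of them will succeed.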

