# How to Easily Conduct a Kruskal-Wallis Test in R

A Kruskal-Wallis test is used to determine whether there is a statistically significant difference between the medians of three or more independent groups.

This test is the nonparametric equivalent of the one-way ANOVA and is typically used when the normality assumption is violated.

The Kruskal-Wallis test does not assume normality in the data and is much less sensitive to outliers than the one-way ANOVA.

## How to Conduct a Kruskal-Wallis Test in R

The following example illustrates how to conduct a Kruskal-Wallis test in R.

### Background

A researcher wants to know whether or not three drugs have different effects on back pain, so he recruits 30 individuals who all experience similar back pain and randomly splits them up into three groups to receive either Drug A, Drug B, or Drug C.

After one month of taking the drug, the researcher asks each individual to rate their back pain on a scale of 1 to 100, with 100 indicating the most severe pain.

The researcher conducts a Kruskal-Wallis test using a .05 significance level to determine if there is a statistically significant difference between the median back pain ratings across these three groups.

The following code creates the data frame we’ll be working with:

```
#make this example reproducible
set.seed(0)

#create data frame
data <- data.frame(drug = rep(c("A", "B", "C"), each = 10),
                   pain = c(runif(10, 40, 60),
                            runif(10, 45, 65),
                            runif(10, 55, 70)))

#view first six rows of data frame
head(data)

#  drug     pain
#1    A 57.93394
#2    A 45.31017
#3    A 47.44248
#4    A 51.45707
#5    A 58.16416
#6    A 44.03364
```

The first column in the data frame shows the drug that the person took for one month and the second column shows the reported back pain after one month, on a scale from 1 to 100.

### Exploring the Data

Before we perform the Kruskal-Wallis test, we can gain a better understanding of the data by finding the mean and standard deviation of back pain for each drug using the dplyr package:

```
#load dplyr package
library(dplyr)

#find mean and standard deviation of reported back pain for each drug group
data %>%
  group_by(drug) %>%
  summarise(mean = mean(pain),
            sd = sd(pain))

# A tibble: 3 x 3
#  drug   mean    sd
#
#1 A      52.7  5.60
#2 B      54.7  5.99
#3 C      61.9  4.88
```
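Because the Kruskal-Wallis test works on ranks rather than raw values, the median and interquartile range of each group are arguably more relevant summaries than the mean and standard deviation. The following is a minimal sketch using base R's aggregate() function. It recreates the same data frame so that it runs on its own, and the names group_summary, median, and iqr are chosen here purely for illustration:

```r
#recreate the example data so this block runs on its own
set.seed(0)
data <- data.frame(drug = rep(c("A", "B", "C"), each = 10),
                   pain = c(runif(10, 40, 60),
                            runif(10, 45, 65),
                            runif(10, 55, 70)))

#find median and interquartile range of reported back pain for each drug group
group_summary <- aggregate(pain ~ drug, data = data,
                           FUN = function(x) c(median = median(x), iqr = IQR(x)))

#flatten the matrix column that aggregate() returns into ordinary columns
group_summary <- do.call(data.frame, group_summary)
names(group_summary) <- c("drug", "median", "iqr")

#view the summary (one row per drug)
group_summary
```

If the medians tell a different story than the means, that is usually a sign of skew or outliers, which is exactly the situation in which a rank-based test is preferred.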

We can also create a boxplot for each of the three drugs to visualize the distribution of back pain for each group:

```
#create boxplots
boxplot(pain ~ drug,
        data = data,
        main = "Reported Pain by Drug",
        xlab = "Drug",
        ylab = "Reported Pain",
        col = "steelblue",
        border = "black")
```

Just from these boxplots we can see that the median reported pain is highest for the participants who used drug C.

We can also see that the spread of reported pain (the length of each box, which shows the interquartile range) is slightly larger among the participants who used drug A or drug B compared to those who used drug C.

Next, we’ll conduct the Kruskal-Wallis test to see if these visual differences are actually statistically significant.

### Conducting the Kruskal-Wallis Test

The general syntax to conduct a Kruskal-Wallis test in R is as follows:

kruskal.test(response variable ~ predictor variable, data = dataset)

In our example, we can use the following code to conduct the Kruskal-Wallis test, using pain as the response variable and drug as our predictor variable:

```
kruskal.test(pain ~ drug, data = data)

#	Kruskal-Wallis rank sum test
#
#data:  pain by drug
#Kruskal-Wallis chi-squared = 11.105, df = 2, p-value = 0.003879
```

From the output we can see that the chi-squared test statistic is 11.105 and the corresponding p-value is 0.003879. Since this p-value is less than the .05 significance level, we reject the null hypothesis and conclude that there is a statistically significant difference in reported pain levels among the three drugs.
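To see where the test statistic comes from, it can help to compute it by hand. The sketch below ranks all 30 observations together, sums the ranks within each group, and applies the standard Kruskal-Wallis formula without a tie correction. That simplification is an assumption that holds here because the simulated pain scores contain no tied values. The variable names (ranks, rank_sums, n_i, H) are illustrative and not part of any R API:

```r
#recreate the example data so this block runs on its own
set.seed(0)
data <- data.frame(drug = rep(c("A", "B", "C"), each = 10),
                   pain = c(runif(10, 40, 60),
                            runif(10, 45, 65),
                            runif(10, 55, 70)))

#rank all observations together, ignoring group membership
ranks <- rank(data$pain)

#sum of ranks and sample size within each drug group
rank_sums <- tapply(ranks, data$drug, sum)
n_i <- tapply(ranks, data$drug, length)
N <- length(ranks)

#Kruskal-Wallis H statistic (no tie correction, see the note above)
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / n_i) - 3 * (N + 1)

#p-value from a chi-squared distribution with (number of groups - 1) df
p_value <- pchisq(H, df = length(n_i) - 1, lower.tail = FALSE)

#compare the manual results with the built-in function
kt <- kruskal.test(pain ~ drug, data = data)
c(manual_H = H, builtin_H = unname(kt$statistic))
c(manual_p = p_value, builtin_p = kt$p.value)
```

When the data do contain ties, kruskal.test() applies a correction factor to the statistic, so a hand calculation like this one would need the same adjustment to match.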

### Analyzing Group Differences

Once we have identified that there is a statistically significant difference between the reported pain levels for the three drugs, we can then conduct a post hoc test to determine exactly which treatment groups differ from one another.

For our post hoc test, we will use the function pairwise.wilcox.test() to calculate pairwise comparisons between the groups using the following syntax:

pairwise.wilcox.test(response variable, grouping variable, p.adjust.method)

The following code illustrates how to apply this function to our data:

```
pairwise.wilcox.test(data$pain, data$drug, p.adjust.method = "BH")

#	Pairwise comparisons using Wilcoxon rank sum test
#
#data:  data$pain and data$drug
#
#  A      B
#B 0.3527 -
#C 0.0032 0.0220
#
#P value adjustment method: BH
```

The pairwise comparisons show that the difference between the reported pain levels for drug A and drug C is statistically significant (adjusted p-value = .0032) and the difference between the reported pain levels for drug B and drug C is statistically significant (adjusted p-value = .0220). The difference between drug A and drug B is not statistically significant (adjusted p-value = .3527).

These results line up with what we saw from the boxplots previously. We saw that the reported pain levels for participants on drug C were noticeably higher compared to drug A and drug B, and that there was only a subtle difference between drug A and drug B.
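The choice of adjustment method affects how conservative the pairwise comparisons are. As a rough sketch, the code below repeats the post hoc test with both the Benjamini-Hochberg ("BH") adjustment used above and the stricter Bonferroni adjustment, then pulls the adjusted p-values out of the p.value element of each result. The object names bh, bonf, bh_sig, and bonf_sig are chosen here for illustration:

```r
#recreate the example data so this block runs on its own
set.seed(0)
data <- data.frame(drug = rep(c("A", "B", "C"), each = 10),
                   pain = c(runif(10, 40, 60),
                            runif(10, 45, 65),
                            runif(10, 55, 70)))

#pairwise comparisons with the Benjamini-Hochberg adjustment
bh <- pairwise.wilcox.test(data$pain, data$drug, p.adjust.method = "BH")

#same comparisons with the more conservative Bonferroni adjustment
bonf <- pairwise.wilcox.test(data$pain, data$drug, p.adjust.method = "bonferroni")

#the adjusted p-values are stored as a matrix in the p.value element
bh$p.value
bonf$p.value

#flag which pairs are significant at the .05 level under each method
bh_sig <- bh$p.value < 0.05
bonf_sig <- bonf$p.value < 0.05
bh_sig
bonf_sig
```

Bonferroni-adjusted p-values are never smaller than their BH counterparts, so a pair that is significant under BH may lose significance under Bonferroni. The full list of supported methods is available in the built-in vector p.adjust.methods.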

### The Complete Code

You can find the complete code used in this analysis here:

```
#make this example reproducible
set.seed(0)

#create data frame
data <- data.frame(drug = rep(c("A", "B", "C"), each = 10),
                   pain = c(runif(10, 40, 60),
                            runif(10, 45, 65),
                            runif(10, 55, 70)))

#view first six rows of data frame
head(data)

#load dplyr package
library(dplyr)

#find mean and standard deviation of reported back pain for each drug group
data %>%
  group_by(drug) %>%
  summarise(mean = mean(pain),
            sd = sd(pain))

#visualize data
boxplot(pain ~ drug,
        data = data,
        main = "Reported Pain by Drug",
        xlab = "Drug",
        ylab = "Reported Pain",
        col = "steelblue",
        border = "black")

#conduct Kruskal-Wallis test
kruskal.test(pain ~ drug, data = data)

#conduct post-hoc test for pairwise comparisons