# A Guide to Conducting Chi-Square Tests in R

This guide explains how to conduct the following three Chi-Square tests in R:

- Chi-Square Test for Independence
- Chi-Square Test for Goodness of Fit
- Chi-Square Test for Homogeneity

## Chi-Square Test for Independence

We use a chi-square test for independence when we want to test whether or not there is a significant association between two categorical variables. The test has the following hypotheses:

Null hypothesis (H0): The two variables are independent

Alternative hypothesis (HA): The two variables are not independent

The test statistic is X² = Σ [ (Oᵢ – Eᵢ)² / Eᵢ ]

Where Σ is just a fancy symbol that means “sum”, Oᵢ is the observed frequency at level i of the variable, and Eᵢ is the expected frequency at level i of the variable.

If the p-value associated with the test statistic is less than our significance level (common choices are 0.10, 0.05, 0.01), then we can reject the null hypothesis and conclude that the two variables are not independent.
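To make the formula concrete, here is a quick sketch in R that computes X² by hand for a hypothetical set of observed and expected counts (the numbers below are made up purely for illustration):

```r
#hypothetical observed and expected counts (made-up numbers)
observed <- c(50, 30, 20)
expected <- c(40, 40, 20)

#apply the formula: sum of (O_i - E_i)^2 / E_i over all levels
sum((observed - expected)^2 / expected)

#[1] 5
```

In practice chisq.test() computes both the expected counts and this sum for us, as the examples below show.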

### Example

Suppose we want to know whether or not gender is associated with political party preference. We take a simple random sample of 500 voters and survey them on their political party preference. Using a 0.05 level of significance, we conduct a chi-square test for independence to determine if gender is associated with political party preference.

The code below illustrates how to conduct this test in R:

```
#generate dataset
data <- data.frame(gender = rep(c('Male', 'Female'), each = 250),
                   party = c(rep('Republican', 120), rep('Democrat', 90),
                             rep('Independent', 40), rep('Republican', 110),
                             rep('Democrat', 95), rep('Independent', 45)))

#view frequency table
tbl <- table(data$gender, data$party)
tbl

#         Democrat Independent Republican
#  Female       95          45        110
#  Male         90          40        120

#conduct chi-square test for independence
chisq.test(tbl)

#	Pearson's Chi-squared test
#
#data:  tbl
#X-squared = 0.86404, df = 2, p-value = 0.6492
```

The test statistic X2 is 0.86404 and the corresponding p-value is 0.6492. Since the p-value is not less than the 0.05 significance level, we do not reject the null hypothesis. We do not have sufficient evidence to say that gender is associated with political party preference.
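If we want to see the expected counts Eᵢ that the statistic is computed from, the object returned by chisq.test() stores them in its expected component (this sketch assumes the tbl object created above):

```r
#save the test result and view the expected counts under independence
res <- chisq.test(tbl)
res$expected
```

Each expected count equals the row total times the column total divided by the grand total; since each gender has 250 respondents here, both rows work out to 92.5 (Democrat), 42.5 (Independent), and 115 (Republican).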

We can also create a balloon plot using the balloonplot() function from the gplots library to help us visualize the association between gender and political party preference. This function displays a matrix in which each cell contains a dot whose size reflects the frequency of that combination of gender and political party preference:

```
#load gplots library
library("gplots")

#create balloon plot
balloonplot(t(tbl), main = "Gender vs. Political Party Preference",
            xlab = "", ylab = "", label = FALSE, show.margins = FALSE)
```

Notice that the “Female” and “Male” dots are roughly the same size for each political party. This indicates that gender is likely not associated with political party preference, which is consistent with the result of the chi-square test for independence.

## Chi-Square Test for Goodness of Fit

We use a chi-square goodness of fit test when we want to test whether or not a categorical variable follows a hypothesized distribution. The test has the following hypotheses:

Null hypothesis (H0): The variable follows the hypothesized distribution

Alternative hypothesis (HA): The variable does not follow the hypothesized distribution

Once again, the test statistic is X² = Σ [ (Oᵢ – Eᵢ)² / Eᵢ ]

If the p-value associated with the test statistic is less than our significance level, then we can reject the null hypothesis and conclude that the variable does not follow the hypothesized distribution.

### Example

An owner of a shop claims that 30% of all his weekend customers visit on Friday, 50% on Saturday, and 20% on Sunday. An independent researcher visits the shop on a random weekend and finds that 91 customers visit on Friday, 104 visit on Saturday, and 65 visit on Sunday. Using a 0.05 level of significance, we conduct a chi-square test for goodness of fit to determine if the data is consistent with the shop owner’s claim.

The code below illustrates how to conduct this test in R:

```
#define vector of observed values
observed <- c(91, 104, 65)

#define vector of expected proportions
expected <- c(.3, .5, .2)

#perform chi-square test for goodness of fit
chisq.test(observed, p = expected)

#	Chi-squared test for given probabilities
#
#data:  observed
#X-squared = 10.617, df = 2, p-value = 0.00495
```

The test statistic X2 is 10.617 and the corresponding p-value is 0.00495. Since the p-value is less than the 0.05 significance level, we reject the null hypothesis. We have sufficient evidence to say that the true customer distribution does not match the distribution specified by the shop owner.
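We can verify this statistic by hand using the formula above. For a goodness of fit test, each expected count is the total sample size multiplied by the corresponding hypothesized proportion (the observed and expected vectors below are repeated from the example):

```r
#observed counts and hypothesized proportions (from the example above)
observed <- c(91, 104, 65)
expected <- c(.3, .5, .2)

#expected counts: total sample size times each hypothesized proportion
exp_counts <- sum(observed) * expected   #78, 130, 52

#apply the formula: sum of (O_i - E_i)^2 / E_i
sum((observed - exp_counts)^2 / exp_counts)

#[1] 10.61667
```

This matches the X-squared value reported by chisq.test() up to rounding.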

## Chi-Square Test for Homogeneity

We use a chi-square test for homogeneity when we want to formally test whether or not there is a difference in proportions between several groups.

Null hypothesis (H0): The proportion of “successes” in each group is the same

Alternative hypothesis (HA): The proportion of “successes” in each group is not the same

Once again, the test statistic is X² = Σ [ (Oᵢ – Eᵢ)² / Eᵢ ]

If the p-value associated with the test statistic is less than our significance level, then we can reject the null hypothesis and conclude that not all of the groups have the same proportion of “successes.”

### Example

A basketball training facility wants to see if two new training programs improve the proportion of their players who pass a difficult shooting test. 172 players are randomly assigned to program 1, 173 to program 2, and 215 to the current program. After using the training programs for one month, the players then take a shooting test. The table below shows the number of players who pass the shooting test, based on which program they used.

|        | Program 1 | Program 2 | Current Program | Total |
|--------|-----------|-----------|-----------------|-------|
| Passed | 112       | 94        | 130             | 336   |
| Failed | 60        | 79        | 85              | 224   |
| Total  | 172       | 173       | 215             | 560   |

Using a 0.05 level of significance, we conduct a chi-square test for homogeneity to determine if the proportion of players who pass the shooting test is the same for each group.

The code below illustrates how to conduct this test in R:

```
#create contingency table
data <- as.table(rbind(c(112, 94, 130), c(60, 79, 85)))
dimnames(data) <- list(Outcome = c("Passed", "Failed"),
                       Program = c("Program 1", "Program 2", "Current Program"))

#view data
data

#         Program
#Outcome   Program 1 Program 2 Current Program
#  Passed        112        94             130
#  Failed         60        79              85

#conduct chi-square test for homogeneity
chisq.test(data)

#	Pearson's Chi-squared test
#
#data:  data
#X-squared = 4.2085, df = 2, p-value = 0.1219
```

The test statistic X2 is 4.2085 and the corresponding p-value is 0.1219. Since the p-value is not less than the 0.05 significance level, we do not reject the null hypothesis. We do not have sufficient evidence to say that the proportion of players who pass the shooting test differs between the programs.
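To see the sample proportions the test is comparing, we can divide the number of players who passed in each program by that program's total (the counts below are taken from the table above):

```r
#passes and totals for Program 1, Program 2, and the current program
passed <- c(112, 94, 130)
total  <- c(172, 173, 215)

#sample proportion of players who passed in each program
round(passed / total, 3)

#[1] 0.651 0.543 0.605
```

The proportions differ somewhat from program to program, but per the test above, not by more than we would expect from sampling variability alone.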

Once again we can create a balloon plot using the balloonplot() function from the gplots library to help us visualize the association between the outcome of the shooting test and the program:

```
#load gplots library
library("gplots")

#create balloon plot
balloonplot(t(data), main = "Program vs. Outcome", xlab = "", ylab = "",
            label = FALSE, show.margins = FALSE)
```

We can see that the number of players who passed varied a bit from program to program (especially in Program 1), but the differences were not statistically significant, so we cannot say the programs produced different passing rates.