How to Perform a Chi-Square Goodness of Fit Test in Stata


Chi-Square Goodness of Fit Test is used to determine whether or not a categorical variable follows a hypothesized distribution.

This tutorial explains how to perform a Chi-Square Goodness of Fit Test in Stata.

Example: Chi-Square Goodness of Fit Test in Stata

To illustrate how to perform this test, we will use a dataset called nlsw88, which contains information about labor statistics for women in the U.S. in 1988.

Use the following steps to perform a Chi-Square Goodness of Fit test to determine if the true distribution of race in this dataset is as follows: 70% White, 20% Black, 10% Other.

Step 1: Load and view the raw data.

First, we will load the data by typing in the following command:

sysuse nlsw88

We can view the raw data by typing in the following command:

br

View raw data in Stata

Each line displays information for an individual including their age, race, marital status, education level, and a variety of other factors.

Step 2: Load the goodness of fit package.

To perform a Goodness of Fit Test, we will need to install the csgof package. We can do so by typing in the following command:

findit csgof

A new window will pop up. Click the link that says csgof from https://stats.idre.ucla.edu/stat/stata/ado/analysis.

Another window will pop up. Click the link that says click here to install

The package should only take a few seconds to install.

Step 3: Perform the Goodness-of-Fit Test.

Once the package is installed, we can perform the Goodness of Fit Test on the data to determine if the true distribution of race is as follows: 70% White, 20% Black, 10% Other.

We will use the following syntax to perform the test:

csgof variable_of_interest, expperc(list_of_expected_percentages)

Here is the exact syntax we’ll use in our case:

csgof race, expperc(70, 20, 10)

Chi-Square goodness of fit output in Stata

Here is how to interpret the output:

Summary box: This box shows us the expected percent, expected frequency, and observed frequency for each race. For example:

  • The expected percent of white individuals was 70%. This is the percentage that we specified. 
  • The expected frequency of white individuals was 1,572.2. This is calculated using the fact that there were 2,246 individuals in the dataset, so 70% of that number is 1,572.2.
  • The observed frequency of white individuals was 1,637. This is the actual number of white individuals in the dataset.

Chisq(2): This is the Chi-Square test statistic for the Goodness of Fit Test. It turns out to be 218.13.

p: This is the p-value associated with the Chi-Square test statistic. It turns out to be 0. Since this is less than 0.05, we fail to reject the null hypothesis that the true distribution of race is 70% White, 20% Black, 10% Other. We have sufficient evidence to conclude that the true distribution of race is different from this hypothesized distribution.

3 Replies to “How to Perform a Chi-Square Goodness of Fit Test in Stata”

  1. Hi
    Thank for sharing this great peace of knowledge, I want make small correction about null hypothesis where you say we fail to reject null hypothesis, in fact we reject null hypothesis since p-value is less than 0.05, which is other way round of what you have said.

    1. Hi Ken…In Stata, you can perform a deviance goodness-of-fit test for a logistic regression model using the `lrtest` command. Here’s how you can do it:

      1. **Fit your logistic regression model:**
      First, you need to estimate your logistic regression model using the `logit` command. For example:
      “`
      logit outcome_var predictor_var1 predictor_var2 …
      “`

      2. **Obtain the fitted model:**
      After estimating the model, you can use the `predict` command to obtain the fitted values. Fitted values represent the predicted probabilities from your logistic regression model.
      “`
      predict fitted, xb
      “`

      3. **Perform the deviance goodness-of-fit test:**
      Use the `lrtest` command to perform the deviance goodness-of-fit test. This command compares the fit of your model to a saturated model (a model with a parameter for each observation), providing a chi-squared test statistic.
      “`
      lrtest
      “`
      This command will prompt you to specify the null hypothesis. For the deviance goodness-of-fit test, the null hypothesis is that the fitted model provides a good fit to the data.

      4. **Interpret the results:**
      After executing the `lrtest` command, Stata will provide the chi-squared test statistic and its associated p-value. If the p-value is small (typically less than 0.05), you may reject the null hypothesis and conclude that the fitted model does not provide a good fit to the data, indicating potential model misspecification.

      Here’s an example combining all the steps:

      “`stata
      // Step 1: Fit logistic regression model
      logit outcome_var predictor_var1 predictor_var2 …

      // Step 2: Obtain fitted values
      predict fitted, xb

      // Step 3: Perform deviance goodness-of-fit test
      lrtest
      “`

      Replace `outcome_var` with the name of your outcome variable and `predictor_var1`, `predictor_var2`, etc., with the names of your predictor variables. Make sure to include all relevant variables in your model.

Leave a Reply

Your email address will not be published. Required fields are marked *