How to Use the relevel() Function in R


Linear regression is a method we can use to quantify the relationship between one or more predictor variables and a response variable.

When we use a categorical variable as a predictor in the model, each of its coefficients in the model output represents the average difference in the response variable between one level and a specific baseline (reference) level of that variable, holding the other predictors constant.

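By default, R uses the first level of the factor as the baseline against which all other levels are compared. For a factor created with factor() or as.factor(), the first level is the first value in sorted order.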
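As a minimal sketch of where that default comes from, the snippet below builds a factor and inspects its levels. The vector `size` is made up for illustration and is not part of the example that follows.

```r
# hypothetical character vector, used only for illustration
size <- c("small", "large", "medium", "large")

# factor() sorts the unique values to build the levels
size_f <- factor(size)

# all levels, in the order R stores them
levels(size_f)       # "large" "medium" "small"

# the first level is the one a model treats as the baseline
levels(size_f)[1]    # "large"
```

Note that the baseline here is "large" simply because it sorts first, not because of any meaning attached to the values.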

However, sometimes you may want to specify which level of the categorical variable should be used as the baseline.

You can do so with the relevel() function in R, which uses the following basic syntax:

relevel(x, ref)

where:

  • x: An unordered factor
  • ref: The reference level, typically expressed as a string
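Before applying this to a regression model, here is a small sketch of what relevel() does to a factor on its own. The factor `grp` is a hypothetical example, not data from this tutorial.

```r
# hypothetical factor, used only for illustration
grp <- factor(c("a", "b", "c", "b"))
levels(grp)           # "a" "b" "c"

# move "c" to the front; the remaining levels keep their relative order
grp2 <- relevel(grp, ref = "c")
levels(grp2)          # "c" "a" "b"

# the data values themselves are unchanged, only the level order differs
as.character(grp2)    # "a" "b" "c" "b"
```

Because relevel() returns a new factor rather than modifying its input, the result has to be assigned back to a variable to take effect.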

The following example shows how to use the relevel() function with a linear regression model in R.

Example: How to Use the relevel() Function in R

Suppose we have the following data frame in R that contains information on three variables for 12 different basketball players:

  • points scored
  • hours spent practicing
  • training program used

#create data frame
df <- data.frame(points=c(7, 7, 9, 10, 13, 14, 12, 10, 16, 19, 22, 18),
                 hours=c(1, 2, 2, 3, 2, 6, 4, 3, 4, 5, 8, 6),
                 program=c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))

#view data frame
df

   points hours program
1       7     1       1
2       7     2       1
3       9     2       1
4      10     3       1
5      13     2       2
6      14     6       2
7      12     4       2
8      10     3       2
9      16     4       3
10     19     5       3
11     22     8       3
12     18     6       3

Suppose we would like to fit the following linear regression model:

points = β0 + β1hours + β2program

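In this example, program is a categorical variable that can take on three possible categories: program 1, program 2, or program 3. Because it has three levels, R represents it in the fitted model with two indicator (dummy) variables, one for each level other than the baseline.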
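To see how a three-level factor enters the model, the sketch below uses model.matrix() to display the columns R would build for the program variable. It recreates the program values as a standalone factor purely for illustration.

```r
# program values from the data frame above, stored as a factor
program <- factor(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3))

# model.matrix() returns the design matrix that lm() would build
X <- model.matrix(~ program)

# one intercept column plus one indicator column per non-baseline level
colnames(X)    # "(Intercept)" "program2" "program3"

# the three distinct rows: program 1 has zeros in both indicator columns
unique(X)
```

The level with zeros in every indicator column (program 1 here) is the baseline, which is why it never gets a coefficient of its own.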

We can use the following syntax to fit this regression model:

#convert 'program' to factor
df$program <- as.factor(df$program)

#fit linear regression model
fit <- lm(points ~ hours + program, data = df)

#view model summary
summary(fit)

Call:
lm(formula = points ~ hours + program, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5192 -1.0064 -0.3590  0.8269  2.4551 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.3013     0.9462   6.660 0.000159 ***
hours         0.9744     0.3176   3.068 0.015401 *  
program2      2.2949     1.1369   2.019 0.078234 .  
program3      6.8462     1.5499   4.417 0.002235 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.403 on 8 degrees of freedom
Multiple R-squared:  0.9392,	Adjusted R-squared:  0.9164 
F-statistic: 41.21 on 3 and 8 DF,  p-value: 3.276e-05

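The Estimate column of the regression table shows coefficients for program2 and program3 only, which means that program1 was used as the baseline level for the program variable. For example, the program2 estimate of 2.2949 means that players in program 2 scored about 2.29 more points on average than players in program 1, holding hours constant.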
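One way to check this reading of the output is to compare predictions for two players who differ only in their program. The sketch below refits the same model on the same data; the data frame `new` and the choice of 3 practice hours are arbitrary, made up for illustration.

```r
# same data as above, with program stored as a factor
df <- data.frame(points = c(7, 7, 9, 10, 13, 14, 12, 10, 16, 19, 22, 18),
                 hours = c(1, 2, 2, 3, 2, 6, 4, 3, 4, 5, 8, 6),
                 program = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)))

fit <- lm(points ~ hours + program, data = df)

# the baseline is the first level of the factor
levels(df$program)[1]    # "1"

# coefficient for program 2 relative to program 1 (about 2.2949)
coef(fit)["program2"]

# two hypothetical players with the same hours but different programs
new <- data.frame(hours = c(3, 3),
                  program = factor(c("1", "2"), levels = levels(df$program)))

# the gap between their predicted points equals the program2 coefficient
pred_gap <- diff(predict(fit, newdata = new))
pred_gap
```

The same gap appears for any value of hours, since the model has no interaction between hours and program.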

However, suppose that we would like program3 to be the baseline level used.

We can use the following syntax with the relevel() function to set program3 as the baseline level and then fit the linear regression model one more time:

#convert 'program' to factor (no effect if it already is one)
df$program <- as.factor(df$program)

#specify that program3 should be used as baseline level
df$program <- relevel(df$program, ref='3')

#fit linear regression model
fit <- lm(points ~ hours + program, data = df)

#view model summary
summary(fit)

Call:
lm(formula = points ~ hours + program, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5192 -1.0064 -0.3590  0.8269  2.4551 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  13.1474     1.9563   6.721  0.00015 ***
hours         0.9744     0.3176   3.068  0.01540 *  
program1     -6.8462     1.5499  -4.417  0.00223 ** 
program2     -4.5513     1.1777  -3.864  0.00478 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.403 on 8 degrees of freedom
Multiple R-squared:  0.9392,	Adjusted R-squared:  0.9164 
F-statistic: 41.21 on 3 and 8 DF,  p-value: 3.276e-05

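Notice that the Estimate column of the regression table now shows the coefficients for program1 and program2, which means that program3 was used as the baseline level for the program variable. The coefficient for hours, the residuals, and the overall fit statistics are identical to those of the first model; only the reference point for the program coefficients has changed.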
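Changing the baseline only re-expresses the same model. As a rough check, the sketch below fits both versions side by side and compares them. The names `fit1`, `fit3`, and `df3` are introduced here for illustration only.

```r
# same data as above, with program stored as a factor
df <- data.frame(points = c(7, 7, 9, 10, 13, 14, 12, 10, 16, 19, 22, 18),
                 hours = c(1, 2, 2, 3, 2, 6, 4, 3, 4, 5, 8, 6),
                 program = factor(c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)))

# model with the default baseline (program 1)
fit1 <- lm(points ~ hours + program, data = df)

# copy of the data with program 3 as the baseline
df3 <- df
df3$program <- relevel(df3$program, ref = "3")
fit3 <- lm(points ~ hours + program, data = df3)

# both models give the same fitted values
all.equal(fitted(fit1), fitted(fit3))    # TRUE

# new intercept = old intercept + old program3 coefficient (about 13.1474)
coef(fit1)["(Intercept)"] + coef(fit1)["program3"]
coef(fit3)["(Intercept)"]

# new program1 coefficient = negative of the old program3 coefficient
coef(fit3)["program1"]                   # about -6.8462
```

An alternative to relevel() is to set the full level order when creating the factor, for example factor(df$program, levels = c("3", "1", "2")), which also makes program 3 the first level.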

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Perform Simple Linear Regression in R
How to Perform Multiple Linear Regression in R
How to Create a Residual Plot in R
