This tutorial explains when and how to use the jitter function in R for scatterplots.
When to Use Jitter
Scatterplots are excellent for visualizing the relationship between two continuous variables. For example, the following scatterplot helps us visualize the relationship between height and weight for 100 athletes:
#define vectors of heights and weights
weights <- runif(100, 160, 240) #100 numbers randomly distributed between 160 and 240
heights <- (weights/3) + rnorm(100) #each weight divided by 3, plus some random noise

#create data frame of heights and weights
data <- as.data.frame(cbind(weights, heights))

#view first six rows of data frame
head(data)

#   weights  heights
#1 170.8859 57.20745
#2 183.2481 62.01162
#3 235.6884 77.93126
#4 231.9864 77.12520
#5 200.8562 67.93486
#6 169.6987 57.54977

#create scatterplot of heights vs weights
plot(data$weights, data$heights, pch = 16, col = 'steelblue')
However, on some occasions we may want to visualize the relationship between one continuous variable and another variable that is discrete, meaning it takes only a small number of distinct values.
For example, suppose we have the following dataset that shows the number of games a basketball player has started out of the first 10 games in a season as well as their average points per game:
#create data frame
games_started <- sample(1:10, 300, TRUE)
points_per_game <- 3*games_started + rnorm(300)
data <- as.data.frame(cbind(games_started, points_per_game))

#view first six rows of data frame
head(data)

#  games_started points_per_game
#1             9       25.831554
#2             9       26.673983
#3            10       29.850948
#4             4       12.024353
#5             4       11.534192
#6             1        4.383127
Points per game is a continuous variable, but games started is a discrete variable. If we attempt to create a scatterplot of these two variables, here is what it would look like:
#create scatterplot of games started vs average points per game
plot(data$games_started, data$points_per_game, pch = 16, col = 'steelblue')
From this scatterplot, we can tell that games started and average points per game have a positive relationship, but it’s hard to see the individual points because so many of them overlap at each value of games started.
By using the jitter function, we can add a bit of “noise” to the x-axis variable games started so that we can see the individual points on the plot more clearly:
#add jitter to games started
plot(jitter(data$games_started), data$points_per_game, pch = 16, col = 'steelblue')
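Because jitter draws fresh random noise on every call, the jittered plot looks slightly different each time the code is run. A minimal sketch of one way to make the result repeatable, assuming a fixed seed is acceptable (the seed value 1 and the names jittered_a and jittered_b are illustrative and not part of the original example):

```r
#illustrative data: 300 players, games started between 1 and 10
set.seed(1)
games_started <- sample(1:10, 300, TRUE)
points_per_game <- 3*games_started + rnorm(300)

#reset the seed before each jitter call so both calls draw the same noise
set.seed(1)
jittered_a <- jitter(games_started)
set.seed(1)
jittered_b <- jitter(games_started)

#check that the two jittered vectors match exactly
identical(jittered_a, jittered_b)

#plot using the reproducible jittered values
plot(jittered_a, points_per_game, pch = 16, col = 'steelblue')
```

Without the calls to set.seed, the two jittered vectors would differ, which is harmless for exploration but inconvenient when a figure needs to be regenerated exactly.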
We can optionally pass a second numeric argument, factor, to jitter to control how much noise is added. The default is 1, and larger values add more noise:
#add jitter to games started
plot(jitter(data$games_started, 2), data$points_per_game, pch = 16, col = 'steelblue')
We should be careful not to add too much jitter, though, as this can distort the original data too much:
plot(jitter(data$games_started, 20), data$points_per_game, pch = 16, col = 'steelblue')
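To see why a factor of 20 is too much, it helps to check how far jitter actually moves each point. Per the R documentation for jitter, when no amount is supplied the noise is drawn uniformly from plus or minus factor/5 times the smallest gap between distinct values, and that gap is 1 for these data. The sketch below uses an evenly filled stand-in vector x and a helper max_shift (both assumptions, chosen so the result does not depend on a random sample) to measure the largest shift for each factor:

```r
#stand-in for games_started: each value from 1 to 10 appears 30 times
x <- rep(1:10, each = 30)

#largest absolute shift applied by jitter for a given factor
max_shift <- function(factor) max(abs(jitter(x, factor) - x))

max_shift(1)   #bounded by 0.2
max_shift(2)   #bounded by 0.4
max_shift(20)  #bounded by 4, enough to push a point several columns over
```

With a bound of 0.2 or 0.4 the points stay close to their true column, whereas a bound of 4 lets a player who started 5 games appear anywhere between 1 and 9.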
Jittering Provides a Better View of the Data
Jittering is particularly useful when one of the levels of the discrete variable has far more values than the other levels.
For example, the following dataset contains 300 basketball players: 200 who all started exactly 2 of the first 5 games in the season, plus 100 whose number of games started is spread at random across 1 through 5:
#100 players whose games started are drawn at random from 1 to 5
games_started <- sample(1:5, 100, TRUE)
points_per_game <- 3*games_started + rnorm(100)
data <- as.data.frame(cbind(games_started, points_per_game))

#200 additional players who all started exactly 2 games
games_twos <- rep(2, 200)
points_twos <- 3*games_twos + rnorm(200)
data_twos <- as.data.frame(cbind(games_twos, points_twos))
names(data_twos) <- c('games_started', 'points_per_game')

#combine both groups into one data frame of 300 players
all_data <- rbind(data, data_twos)
When we visualize the number of games started vs average points per game, we can tell that there are more players who started 2 games, but it’s hard to tell how many more:
plot(all_data$games_started, all_data$points_per_game, pch = 16, col = 'steelblue')
Once we add jitter to the games started variable, though, we can see just how many more players there are who have started 2 games:
plot(jitter(all_data$games_started), all_data$points_per_game, pch = 16, col = 'steelblue')
Increasing the amount of jitter by a little bit reveals this difference even more:
plot(jitter(all_data$games_started, 1.5), all_data$points_per_game, pch = 16, col = 'steelblue')
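Jitter gives a visual impression of how many players fall at each value. When the exact counts matter, a frequency table is a simple complement to the plot. A minimal sketch that rebuilds the same kind of games started vector (the seed and the name counts are assumptions added for illustration):

```r
#100 players spread at random across 1 to 5, plus 200 who started 2 games
set.seed(1)
games_started <- c(sample(1:5, 100, TRUE), rep(2, 200))

#number of players at each value of games started
counts <- table(games_started)
counts
```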
Jittering for Visualizations Only
As mentioned before, jittering adds some random noise to data, which can be beneficial when we want to visualize data in a scatterplot. By using the jitter function, we can get a better picture of the true underlying relationship between two variables in a dataset.
However, when using a statistical analysis like regression, it doesn’t make sense to add random noise to variables in a dataset since this would impact the results of an analysis. Thus, jitter is only meant to be used for data visualization, not for data analysis.
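A minimal sketch of this point, assuming a simple linear regression of points per game on games started (the seed and the object names fit_raw and fit_jittered are illustrative): fitting the same model to a jittered predictor changes the estimated coefficients, so jitter is best applied inside the plot call only, leaving the data frame untouched.

```r
#illustrative data: 300 players, games started between 1 and 10
set.seed(1)
games_started <- sample(1:10, 300, TRUE)
points_per_game <- 3*games_started + rnorm(300)
data <- data.frame(games_started, points_per_game)

#model fit on the original data
fit_raw <- lm(points_per_game ~ games_started, data = data)

#same model fit on a jittered copy of the predictor (not recommended)
fit_jittered <- lm(points_per_game ~ jitter(games_started, 2), data = data)

#compare the coefficient estimates from the two fits
coef(fit_raw)
coef(fit_jittered)

#jitter inside the plot call only, so the stored data stay unchanged
plot(jitter(data$games_started, 2), data$points_per_game, pch = 16, col = 'steelblue')
```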