How to Create Violin Plots in R


A violin plot is a type of plot that shows the distribution of numeric values in a dataset.

Often described as a combination of a box plot and a kernel density plot, a violin plot is particularly useful for visualizing the shape of a distribution: it offers more detail than a box plot but a more concise summary than a kernel density plot.

The easiest way to create a violin plot in R is by using the geom_violin() function from the ggplot2 package.

The following example shows how to use this function in practice.

Example: How to Create a Violin Plot in R

Suppose that we collect data for the points scored by basketball players on three different teams.

We would like to create violin plots to visualize the distribution of points scored by players on each team.

We can use the geom_violin() function from the ggplot2 package with the following syntax to do so:

#load ggplot2 package
library(ggplot2)

#create data frame
data <- data.frame(team=c(rep('A',200), rep('B',200), rep('C',200)),
                   points=c(rnorm(200, 10, 3), rnorm(200, 22, 6), rnorm(200, 13, 2)))

#create violin plot of points by team
ggplot(data, aes(x=team, y=points, fill=team)) +
  geom_violin()

The following screenshot shows the plot produced by this code:

[Figure: violin plot in R created using the ggplot2 package]

Here is how to interpret the chart:

  • The x-axis displays the team names.
  • The y-axis displays the points scored by players on each team.
  • The legend on the right side of the plot displays the color that corresponds to each team.

Simply by looking at this chart we can gain a strong understanding of the distribution of points scored by players on each team along with how the distributions compare to each other.

For example, we can see that players on Team A scored the fewest points, on average.

Conversely, we can see that Team B had the highest points scored on average, but also the greatest variation in points scored.

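The length of each violin indicates the spread of values in each distribution. By default, geom_violin() trims each violin to the range of the data, so a longer violin means a wider range of values.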

For example, we can easily see that the variation in points scored is greatest among players on Team B because their point values on the y-axis range from less than 10 to nearly 40.
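To back up what the plot suggests, the summary statistics for each team can also be calculated directly. The following is a minimal sketch that recreates the data frame from the example; the set.seed() call and the seed value are assumptions added here for reproducibility and were not part of the original code, so the exact numbers will not match the plot above.

```r
#make this example reproducible (seed value is an arbitrary assumption)
set.seed(1)

#recreate the data frame from the example
data <- data.frame(team=rep(c('A', 'B', 'C'), each=200),
                   points=c(rnorm(200, 10, 3), rnorm(200, 22, 6), rnorm(200, 13, 2)))

#calculate mean points scored by team
team_means <- tapply(data$points, data$team, mean)

#calculate standard deviation of points scored by team
team_sds <- tapply(data$points, data$team, sd)

#calculate range (max minus min) of points scored by team
team_ranges <- tapply(data$points, data$team, function(x) diff(range(x)))

#view summary statistics side by side
round(cbind(mean=team_means, sd=team_sds, range=team_ranges), 2)
```

The team with the largest value in the range column should be the team with the longest violin in the plot.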

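It’s worth noting that in this example we used the rnorm() function to generate random values from a normal distribution for each of the three teams. Because these values are random, the exact plot will differ slightly each time the code is run unless a seed is set with set.seed().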

This function uses the following syntax:

rnorm(n, mean, sd)

where:

  • n: Number of values to generate in normal distribution
  • mean: Mean of distribution
  • sd: Standard deviation of distribution
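As a quick illustration of these arguments, the sketch below generates values using the same settings that were used for Team B and checks that the sample mean and standard deviation land close to the requested values. The seed value and the variable name x are assumptions chosen for this illustration.

```r
#make this example reproducible (seed value is an arbitrary assumption)
set.seed(1)

#generate 200 values from a normal distribution with mean of 22 and sd of 6
x <- rnorm(n=200, mean=22, sd=6)

#count number of values generated
length(x)

#calculate sample mean (should be close to 22, but not exactly 22)
mean(x)

#calculate sample standard deviation (should be close to 6, but not exactly 6)
sd(x)
```

The sample statistics only approximate the values passed to rnorm() because the values are drawn at random.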

We can see in the original code that we specified that Team B should have a standard deviation of 6, which was the highest among all three teams.

This explains why the length of the violin plot for Team B was the greatest in the plot: the standard deviation of points scored for this team was the greatest.
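Since a violin plot is often described as a combination of a box plot and a density plot, one optional variation is to draw a narrow box plot inside each violin so that the median and quartiles are visible as well. The following is only a sketch of that idea: the geom_boxplot() layer, its width of 0.1, the white fill, and the seed value are choices made for this illustration rather than part of the original example.

```r
#load ggplot2 package
library(ggplot2)

#make this example reproducible (seed value is an arbitrary assumption)
set.seed(1)

#recreate the data frame from the example
data <- data.frame(team=rep(c('A', 'B', 'C'), each=200),
                   points=c(rnorm(200, 10, 3), rnorm(200, 22, 6), rnorm(200, 13, 2)))

#create violin plot of points by team with a narrow box plot inside each violin
p <- ggplot(data, aes(x=team, y=points, fill=team)) +
  geom_violin() +
  geom_boxplot(width=0.1, fill='white')

#display the plot
print(p)
```

A small width keeps the box plot from covering the shape of the violin behind it.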

Note: You can find the complete documentation for the geom_violin() function in the official ggplot2 documentation.

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Label Points on a Scatterplot in R
How to Add Text Outside of a Plot in R
How to Create a Scatterplot with a Regression Line in R
