A Complete Guide to the Built-in Datasets in R


The R programming language comes with several built-in datasets that are useful for practicing building models, summarizing datasets, and creating visualizations.

You can find a complete list of available built-in datasets by typing the following into your R console:

library(help='datasets')
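If you would rather work with the list of datasets as an object than read it in a help window, the base data function can return it. The following is a minimal sketch; the variable name ds is only an illustration, and the exact number of entries depends on your version of R.

```r
# data() with a package argument returns an object describing that package's datasets
ds <- data(package = "datasets")

# The "results" element is a character matrix with one row per listed dataset
head(ds$results[, c("Item", "Title")])

# Count the listed entries (this number varies across R versions)
nrow(ds$results)
```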

The datasets package contains roughly 100 datasets (the exact count depends on your version of R), but some of the most popular ones include:

  • iris: A dataset that contains measurements on 4 different attributes (in centimeters) for 150 flowers, 50 from each of 3 different species.
  • mtcars: A dataset that contains measurements on 11 different attributes for 32 different cars.
  • airquality: A dataset that contains daily air quality measurements in New York City from May to September 1973, with 153 observations and 6 variables.
  • AirPassengers: A time series that contains the monthly totals of international airline passengers (in thousands) from 1949 to 1960.
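You can check the size and type of each of these datasets yourself. The sketch below uses only base R functions; dim is covered in more detail later in this guide, and AirPassengers is handled separately because it is stored as a time series rather than a data frame.

```r
# Rows and columns of the three data frames
dim(iris)
dim(mtcars)
dim(airquality)

# AirPassengers is a time series object, so inspect it differently
class(AirPassengers)      # object type
start(AirPassengers)      # first year and month
end(AirPassengers)        # last year and month
frequency(AirPassengers)  # observations per year
```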

The following example shows how to gain a quick understanding of any of these datasets, using the iris dataset for illustration.

Example: How to Analyze a Built-in Dataset in R

One of the easiest ways to gain a quick understanding of a built-in dataset is to use the head function, which by default displays the first six rows of the dataset.

#view first six rows of iris dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
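The head function also accepts an n argument if you want a different number of rows, and the related tail and str functions show the last rows and the structure of the dataset. A short sketch, with first_three and last_three as illustrative variable names:

```r
# View only the first three rows
first_three <- head(iris, n = 3)
first_three

# View the last three rows
last_three <- tail(iris, n = 3)
last_three

# Display the type of each column along with a few sample values
str(iris)
```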

You can also use the summary function to quickly summarize each variable in the dataset:

#summarize iris dataset
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  

For each of the numeric variables we can see the following information:

  • Min: The minimum value.
  • 1st Qu: The value of the first quartile (25th percentile).
  • Median: The median value.
  • Mean: The mean value.
  • 3rd Qu: The value of the third quartile (75th percentile).
  • Max: The maximum value.

For the only categorical variable in the dataset (Species) we see a frequency count of each value:

  • setosa: This species occurs 50 times.
  • versicolor: This species occurs 50 times.
  • virginica: This species occurs 50 times.
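If you only need some of these values, you can compute them directly for a single column. The sketch below reproduces the Sepal.Length quartiles and mean with quantile and mean, and the Species counts with table; the variable names are illustrative.

```r
# Minimum, quartiles, and maximum for one numeric column
sepal_quantiles <- quantile(iris$Sepal.Length, probs = c(0, 0.25, 0.5, 0.75, 1))
sepal_quantiles

# Mean of the same column
sepal_mean <- mean(iris$Sepal.Length)
sepal_mean

# Frequency count for the categorical column
species_counts <- table(iris$Species)
species_counts
```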

You can also use the dim function to get the dimensions of the dataset in terms of number of rows and number of columns:

#display rows and columns
dim(iris)

[1] 150   5

We can see that the dataset has 150 rows and 5 columns.
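If you need the row count, the column count, or the column names individually, the base functions nrow, ncol, and names return each one on its own. A brief sketch with illustrative variable names:

```r
# Number of rows only
n_rows <- nrow(iris)
n_rows

# Number of columns only
n_cols <- ncol(iris)
n_cols

# Names of the columns
col_names <- names(iris)
col_names
```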

We can also create some plots to visualize the values in the dataset.

For example, we can use the hist() function to create a histogram of the values for a certain variable:

#create histogram of values for sepal length
hist(iris$Sepal.Length,
     col='steelblue',
     main='Histogram',
     xlab='Length',
     ylab='Frequency')

This histogram allows us to visualize the distribution of values for the Sepal.Length variable.
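A histogram pools all three species together. As one possible next step, the sketch below uses the base boxplot function with a formula to compare sepal length across species; the title and axis labels are illustrative choices.

```r
# Create one boxplot of sepal length for each species
bp <- boxplot(Sepal.Length ~ Species,
              data = iris,
              col = 'steelblue',
              main = 'Sepal Length by Species',
              xlab = 'Species',
              ylab = 'Sepal Length (cm)')

# boxplot() invisibly returns the statistics it plotted,
# including the number of observations in each group
bp$n
```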

You can use each of the functions shown here to explore any of the other built-in datasets in R.
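For instance, the same steps can be applied to the airquality dataset. This is a sketch rather than a complete analysis; it also counts missing values per column with colSums and is.na, since some columns of this dataset contain NA values and summary reports them as well.

```r
# View the first six rows
head(airquality)

# Summarize each variable (columns with missing values also show an NA count)
summary(airquality)

# Display the number of rows and columns
dim(airquality)

# Count missing values in each column
missing_per_column <- colSums(is.na(airquality))
missing_per_column

# Create a histogram of the daily temperature values
hist(airquality$Temp,
     col = 'steelblue',
     main = 'Histogram',
     xlab = 'Temperature',
     ylab = 'Frequency')
```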

Additional Resources

The following tutorials explain how to perform other common tasks in R:

How to Create Summary Tables in R
How to Calculate Five Number Summary in R
How to Calculate Descriptive Statistics in R
