# A Complete Guide to the Built-in Datasets in R

The R programming language comes with several built-in datasets that are useful for practicing building models, summarizing datasets, and creating visualizations.

You can find a complete list of available built-in datasets by typing the following into your R console:

`library(help='datasets')`

There are over 50 built-in datasets. Some of the most popular ones include:

• iris: A dataset that contains measurements of 4 different attributes (in centimeters) for 150 flowers, 50 from each of 3 different species.
• mtcars: A dataset that contains measurements of 11 different attributes for 32 different cars.
• airquality: A dataset that contains daily air quality measurements in New York from May to September 1973, with 153 observations and 6 variables.
• AirPassengers: A dataset that contains the number of monthly airline passengers from 1949 to 1960.
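The sizes listed above can be checked directly in the console. The following is a minimal sketch that lists the datasets shipped in the `datasets` package and prints the dimensions of the ones mentioned here; the variable names `available` and `sizes` are only illustrative.

```r
# names of every dataset shipped with the 'datasets' package
available <- data(package = "datasets")$results[, "Item"]
length(available)

# rows and columns of the three data frames mentioned above
sizes <- sapply(list(iris = iris, mtcars = mtcars, airquality = airquality), dim)
rownames(sizes) <- c("rows", "columns")
sizes

# AirPassengers is a monthly time series rather than a data frame
class(AirPassengers)
start(AirPassengers)
end(AirPassengers)
```

Because AirPassengers is stored as a time series (`ts`) object, functions such as `start()` and `end()` are more informative for it than `dim()`.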

The following example explains how to gain a quick understanding of any of these datasets by using the iris dataset as an example.

## Example: How to Analyze a Built-in Dataset in R

One of the easiest ways to gain a quick understanding of a built-in dataset is by using the head function, which allows you to view the first six rows of the dataset.

```
#view first six rows of iris dataset
head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
```
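A few related calls are often useful alongside `head`. The sketch below shows how to change the number of rows displayed, view the end of the dataset, and inspect the type of each column; the name `first_three` is only illustrative.

```r
# view the first three rows instead of the default six
first_three <- head(iris, n = 3)
first_three

# view the last six rows of the dataset
tail(iris)

# show the type of each column along with a few sample values
str(iris)
```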

You can also use the summary function to quickly summarize each variable in the dataset:

```
#summarize iris dataset
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
```

For each of the numeric variables we can see the following information:

• Min: The minimum value.
• 1st Qu: The value of the first quartile (25th percentile).
• Median: The median value.
• Mean: The mean value.
• 3rd Qu: The value of the third quartile (75th percentile).
• Max: The maximum value.

For the only categorical variable in the dataset (Species) we see a frequency count of each value:

• setosa: This species occurs 50 times.
• versicolor: This species occurs 50 times.
• virginica: This species occurs 50 times.
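The individual pieces of this summary can also be computed on their own. The following sketch uses `table`, `quantile`, and `aggregate` from base R; the names `species_counts`, `sepal_quartiles`, and `species_means` are only illustrative.

```r
# frequency count of each value of the categorical variable
species_counts <- table(iris$Species)
species_counts

# first quartile, median, and third quartile of one numeric variable
sepal_quartiles <- quantile(iris$Sepal.Length, probs = c(0.25, 0.5, 0.75))
sepal_quartiles

# mean of each numeric variable within each species
species_means <- aggregate(. ~ Species, data = iris, FUN = mean)
species_means
```

Grouping by species with `aggregate` is one way to see differences that the overall summary hides, since `summary(iris)` pools all three species together.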

You can also use the dim function to get the dimensions of the dataset in terms of number of rows and number of columns:

```
#display rows and columns
dim(iris)

[1] 150   5
```

We can see that the dataset has 150 rows and 5 columns.
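If only one of the two dimensions is needed, or the column names and types are of interest, the following base R calls are a reasonable starting point. The name `column_types` is only illustrative.

```r
# number of rows and number of columns individually
nrow(iris)
ncol(iris)

# names of the columns
names(iris)

# class of each column
column_types <- sapply(iris, class)
column_types
```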

We can also create some plots to visualize the values in the dataset.

For example, we can use the hist() function to create a histogram of the values for a certain variable:

```
#create histogram of values for sepal length
hist(iris$Sepal.Length,
     col='steelblue',
     main='Histogram',
     xlab='Length',
     ylab='Frequency')
```

This histogram allows us to visualize the distribution of values for the Sepal.Length variable.
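Other base R plotting functions work the same way. As one possible next step, the sketch below uses `boxplot()` with a formula to compare the distribution of sepal length across the three species; the plot titles and the name `box_stats` are only illustrative.

```r
# create boxplots of sepal length for each species
box_stats <- boxplot(Sepal.Length ~ Species,
                     data = iris,
                     col = 'steelblue',
                     main = 'Sepal Length by Species',
                     xlab = 'Species',
                     ylab = 'Sepal Length (cm)')

# the group labels used on the x-axis
box_stats$names
```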

Feel free to use each of the functions shown here to explore any of the built-in datasets in R that you’d like.