How to Explore a Dataset in R Using Descriptive Statistics

This tutorial explains seven ways you can gain a better understanding of your dataset using descriptive statistics and various R functions. We will use the built-in R dataset iris throughout this tutorial.

1. View the first few rows of the dataset

The first step in understanding a dataset is to view its first few rows to see what the data looks like. One of the simplest ways to do so is with the head() function.

#view first six rows of iris dataset - R displays first six rows by default
head(iris)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1         5.1         3.5          1.4         0.2  setosa
#2         4.9         3.0          1.4         0.2  setosa
#3         4.7         3.2          1.3         0.2  setosa
#4         4.6         3.1          1.5         0.2  setosa
#5         5.0         3.6          1.4         0.2  setosa
#6         5.4         3.9          1.7         0.4  setosa

#alternatively, you can choose how many rows to display, e.g. first 2 rows
head(iris, 2)

# Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#1         5.1         3.5          1.4         0.2  setosa
#2         4.9         3.0          1.4         0.2  setosa
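
The end of the dataset can be inspected in the same way. The following is a minimal sketch using base R's tail() function, which mirrors head(), along with a negative row count in head(); the row counts chosen here are arbitrary.

```r
#view the last three rows of the iris dataset
tail(iris, 3)

#a negative n in head() drops rows from the end instead,
#e.g. drop the last 148 rows, leaving the first 2 rows
head(iris, -148)
```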

2. View the dimensions and structure of the dataset

Once you’ve seen the first few rows of the dataset, two more useful commands you can use to get an overview of your data are dim() and str():

dim() – tells you how many rows and columns are in the dataset

str() – tells you how many rows and columns are in the dataset, along with the class and first few values of each column

#view dimensions of the dataset
dim(iris)

#[1] 150 5  #150 rows and 5 columns

#view the structure of the dataset
str(iris)

# 'data.frame': 150 obs. of 5 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
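
If you only need one piece of this information, a few other base R functions return it directly. The sketch below is one possible way to get the row count, column count, column classes, and factor levels on their own.

```r
#number of rows and number of columns, separately
nrow(iris)
#[1] 150
ncol(iris)
#[1] 5

#class of each column
sapply(iris, class)
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

#levels of the Species factor
levels(iris$Species)
#[1] "setosa"     "versicolor" "virginica" 
```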

3. View the numerical summary of each variable

Once you know what the data looks like, the number of rows and columns in the dataset, and the class of each column, a common next step is to look at the summary of each column in the dataset using the summary() function, which gives you the following metrics for each column:

  • Minimum
  • 1st quartile
  • Median
  • Mean
  • 3rd quartile
  • Maximum

In addition, if a column contains missing values, the summary() function reports how many NAs it has. For “Factor” columns (like the Species column in the iris dataset), the summary() function simply gives you a count of the observations at each factor level.

#view summary of each column in iris dataset
summary(iris)

#  Sepal.Length   Sepal.Width  Petal.Length   Petal.Width       Species 
# Min.   :4.300 Min.   :2.000 Min.   :1.000 Min.   :0.100 setosa    :50
# 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
# Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
# Mean   :5.843 Mean   :3.057 Mean   :3.758 Mean   :1.199 
# 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 
# Max.   :7.900 Max.   :4.400 Max.   :6.900 Max.   :2.500 

#view summary of each column in iris dataset, formatting numbers to one significant digit
summary(iris, digits = 1)

# Sepal.Length Sepal.Width Petal.Length Petal.Width  Species 
#Min. :4       Min. :2     Min. :1      Min. :0.1    setosa :50 
#1st Qu.:5     1st Qu.:3   1st Qu.:2    1st Qu.:0.3  versicolor:50 
#Median :6     Median :3   Median :4    Median :1.3  virginica :50 
#Mean :6       Mean :3     Mean :4      Mean :1.2 
#3rd Qu.:6     3rd Qu.:3   3rd Qu.:5    3rd Qu.:1.8 
#Max. :8       Max. :4     Max. :7      Max. :2.5 
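
The iris dataset has no missing values, so no NA counts appear in the output above. The sketch below shows one way to check this, and what summary() reports when NAs are present. The data frame iris_na is a hypothetical copy created only for illustration.

```r
#count missing values in each column of iris (there are none)
colSums(is.na(iris))
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#           0            0            0            0            0 

#hypothetical copy of iris with two missing values, for illustration only
iris_na <- iris
iris_na$Sepal.Length[c(1, 2)] <- NA

#summary() now includes an NA's entry for the affected column
summary(iris_na$Sepal.Length)
```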

4. View the standard deviation of each variable

One metric that summary() does not report, but that may be important to know, is the standard deviation of each variable. Fortunately, we can easily find the standard deviation using the sapply() function and the sd() function:

#find standard deviation of first four columns in iris dataset
sapply(iris[ ,1:4], sd)

#Sepal.Length Sepal.Width Petal.Length Petal.Width 
#   0.8280661   0.4358663    1.7652982   0.7622377 
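
Note that sd() computes the sample standard deviation, which divides by n - 1 rather than n. As a quick check, the sketch below reproduces the value for one column by hand; the variable names x, n, and manual_sd are chosen only for this example.

```r
#sd() uses the sample formula, dividing by n - 1
x <- iris$Sepal.Length
n <- length(x)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

#compare the manual calculation with sd()
all.equal(manual_sd, sd(x))
#[1] TRUE
```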

5. View the range and interquartile range of each variable

Two more measures of spread that may be interesting to know are the interquartile range and range of each variable.

Interquartile range – the difference between the third quartile and the first quartile of a variable

Range – the difference between the largest and smallest values of a variable

Once again we can use the sapply() function along with the IQR() and range() functions to find these values. Note that range() returns the minimum and maximum of each variable rather than their difference.

#find the interquartile range of the first four columns
sapply(iris[ ,1:4], IQR)

#Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#         1.3          0.5          3.5          1.5 

#find the range (min and max values) of the first four columns
sapply(iris[ ,1:4], range)

#     Sepal.Length Sepal.Width Petal.Length Petal.Width
#[1,]          4.3         2.0          1.0         0.1
#[2,]          7.9         4.4          6.9         2.5
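
Since range() returns the minimum and maximum rather than a single number, one way to get the range as a difference is to wrap it in diff(). The sketch below also confirms that IQR() equals the third quartile minus the first quartile; the names ranges, x, and q are chosen only for this example.

```r
#range as a single number: largest value minus smallest value
ranges <- sapply(iris[ ,1:4], function(x) diff(range(x)))
ranges
#Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#         3.6          2.4          5.9          2.4 

#IQR() is the third quartile minus the first quartile
x <- iris$Sepal.Length
q <- quantile(x, c(0.25, 0.75))
all.equal(IQR(x), unname(q[2] - q[1]))
#[1] TRUE
```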

6. View the skewness of each variable

Skewness is a measure of the asymmetry of a dataset or distribution. This value can be positive or negative. A negative skew typically indicates that the tail is on the left side of the distribution. A positive value typically indicates that the tail is on the right.

To find the skewness of the values in each column of a dataset, we can use the skewness() function in the e1071 package:

#install e1071 (if not already installed), then load it
if(!require(e1071)){install.packages('e1071'); library(e1071)}

#find skewness of first four columns of iris dataset
sapply(iris[ ,1:4], skewness) 

#Sepal.Length Sepal.Width Petal.Length Petal.Width 
#   0.3086407   0.3126147   -0.2694109  -0.1009166 
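
If you prefer not to install a package, skewness can also be computed by hand. The sketch below defines a hypothetical helper, skew_manual(), that divides the third central moment by the cubed sample standard deviation. This is assumed to correspond to the default type = 3 calculation in e1071's skewness(); other definitions of skewness give slightly different values.

```r
#skewness as the third central moment divided by the cubed standard deviation
#(hypothetical helper, assumed to match the default type = 3 in e1071)
skew_manual <- function(x) {
  m3 <- mean((x - mean(x))^3)  #third central moment (divides by n)
  m3 / sd(x)^3                 #sd() divides by n - 1
}

#find skewness of first four columns of iris dataset
sapply(iris[ ,1:4], skew_manual)
```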

7. View the correlation between variables

It can also be useful to know the correlation between the variables in a dataset to see what type of linear relationships (if any) exist between the variables.

One simple way to see all of the pairwise correlations between the variables in a dataset is by using the cor() function to generate a correlation matrix.

Positive numbers in the matrix indicate a positive correlation between two variables, that is, when one variable increases, the other tends to increase as well.

Negative numbers in the matrix indicate a negative correlation between two variables, that is, when one variable increases, the other tends to decrease.

Numbers close to zero in the matrix indicate little to no linear association between two variables, although a nonlinear relationship may still exist.

#generate correlation matrix for first four columns of iris dataset
#and round values in matrix to one decimal place
round(cor(iris[ , 1:4]), 1)

#              Sepal.Length Sepal.Width Petal.Length Petal.Width
# Sepal.Length          1.0        -0.1          0.9         0.8
# Sepal.Width          -0.1         1.0         -0.4        -0.4
# Petal.Length          0.9        -0.4          1.0         1.0
# Petal.Width           0.8        -0.4          1.0         1.0
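
Keep in mind that rounding to one decimal place can hide detail: the 1.0 shown between Petal.Length and Petal.Width is a rounded value, not a perfect correlation. The sketch below looks at that single pair with more precision, and also shows the method argument of cor(), which can compute a rank-based (Spearman) correlation instead of the default Pearson correlation.

```r
#correlation between petal length and petal width, to two decimal places
r <- cor(iris$Petal.Length, iris$Petal.Width)
round(r, 2)
#[1] 0.96

#rank-based (Spearman) correlation matrix via the method argument
round(cor(iris[ , 1:4], method = "spearman"), 2)
```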
