This tutorial explains seven ways you can gain a better understanding of your dataset using descriptive statistics and various R functions. We will use the built-in R dataset **iris** throughout this tutorial.

**1. View the first few rows of the dataset**

The first step to gaining a better understanding of your dataset is to actually view the first few rows to see what the data looks like. One of the simplest ways to do so is by using the **head()** function.

    # view the first six rows of the iris dataset (head() displays six rows by default)
    head(iris)

    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1          5.1         3.5          1.4         0.2  setosa
    # 2          4.9         3.0          1.4         0.2  setosa
    # 3          4.7         3.2          1.3         0.2  setosa
    # 4          4.6         3.1          1.5         0.2  setosa
    # 5          5.0         3.6          1.4         0.2  setosa
    # 6          5.4         3.9          1.7         0.4  setosa

    # alternatively, choose how many rows to display, e.g. the first 2 rows
    head(iris, 2)

    #   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    # 1          5.1         3.5          1.4         0.2  setosa
    # 2          4.9         3.0          1.4         0.2  setosa
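
It can also help to look at the end of the dataset, since files sometimes end with stray or incomplete rows. A minimal sketch using base R's **tail()**, the counterpart of head(); the variable names `last_rows` and `first_five` are only illustrative:

```r
# tail() mirrors head(): last six rows by default
tail(iris)

# last 3 rows only; the original row names (148, 149, 150) are kept
last_rows <- tail(iris, 3)

# a negative n in head() drops rows from the end instead:
# dropping the final 145 rows of 150 leaves the first 5
first_five <- head(iris, -145)
```

Viewing both ends of the data is a quick sanity check that the whole dataset loaded as expected.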

**2. View the dimensions and structure of the dataset**

Once you’ve seen the first few rows of the dataset, two more useful commands you can use to get an overview of your data are **dim()** and **str()**:

**dim()** – tells you how many rows and columns are in the dataset

**str()** – tells you how many rows and columns are in the dataset along with the class of each column

    # view dimensions of the dataset
    dim(iris)

    # [1] 150   5
    # 150 rows and 5 columns

    # view the structure of the dataset
    str(iris)

    # 'data.frame': 150 obs. of  5 variables:
    #  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
    #  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
    #  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
    #  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
    #  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
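
When the same information is needed inside a script rather than printed to the console, the individual pieces can be pulled out with other base R functions. A small sketch; the variable names are only illustrative:

```r
# number of rows and columns as separate values
n_rows <- nrow(iris)   # 150
n_cols <- ncol(iris)   # 5

# column names and the class of each column
col_names   <- names(iris)
col_classes <- sapply(iris, class)

col_classes
# the four measurement columns are "numeric"; Species is "factor"
```

Storing these values makes it easy to check them later, for example to confirm that an import produced the expected number of columns.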

**3. View the numerical summary of each variable**

Once you know what the data looks like, the number of rows and columns in the dataset, and the class of each column, a common next step is to look at the summary of each column in the dataset using the **summary()** function, which gives you the following metrics for each column:

- Minimum
- 1st quartile
- Median
- Mean
- 3rd quartile
- Maximum

In addition, summary() reports how many NAs (missing values) each column contains, when there are any. For “Factor” columns (like the *Species* column in the iris dataset), summary() simply counts the observations in each factor level.

    # view summary of each column in the iris dataset
    summary(iris)

    #   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
    #  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
    #  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
    #  Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
    #  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
    #  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
    #  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

    # view a more compact summary of each column in the iris dataset
    # note: digits sets the number of significant digits, not decimal places,
    # and the exact formatting can differ between R versions
    summary(iris, digits = 1)

    #   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width        Species
    #  Min.   :4     Min.   :2    Min.   :1     Min.   :0.1   setosa    :50
    #  1st Qu.:5     1st Qu.:3    1st Qu.:2     1st Qu.:0.3   versicolor:50
    #  Median :6     Median :3    Median :4     Median :1.3   virginica :50
    #  Mean   :6     Mean   :3    Mean   :4     Mean   :1.2
    #  3rd Qu.:6     3rd Qu.:3    3rd Qu.:5     3rd Qu.:1.8
    #  Max.   :8     Max.   :4    Max.   :7     Max.   :2.5
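
The iris dataset has no missing values, so the NA count never shows up in the output above. The sketch below counts NAs directly and then introduces one missing value into a copy of the data (the copy `iris_na` is purely illustrative) to show how summary() reports it:

```r
# count missing values per column; all zeros for iris
na_counts <- colSums(is.na(iris))

# illustrative copy with one value deliberately set to NA
iris_na <- iris
iris_na$Sepal.Length[1] <- NA

# the summary of that column now includes an "NA's" entry
sl_summary <- summary(iris_na$Sepal.Length)
sl_summary

# for a factor column, table() gives the same per-level counts as summary()
species_counts <- table(iris$Species)
```

Working on a copy keeps the built-in dataset unchanged for the remaining steps.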

**4. View the standard deviation of each variable**

One metric that is missing from the summary() function that may be important to know is the **standard deviation** of each variable. Fortunately, we can easily find the standard deviation using the **sapply()** function and the **sd()** function:

    # find the standard deviation of the first four columns in the iris dataset
    sapply(iris[, 1:4], sd)

    # Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    #    0.8280661    0.4358663    1.7652982    0.7622377
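
To make clear what sd() computes, the sketch below reproduces the sample standard deviation by hand, using the n − 1 denominator:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The variable names are illustrative only.

```r
x <- iris$Sepal.Length
n <- length(x)

# sample standard deviation computed from the formula
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# built-in result for comparison
builtin_sd <- sd(x)

all.equal(manual_sd, builtin_sd)
# [1] TRUE

# extra arguments to sapply() are passed on to sd(), so missing values
# (if a dataset had any) can be skipped with na.rm = TRUE
sds <- sapply(iris[, 1:4], sd, na.rm = TRUE)
```

Without `na.rm = TRUE`, sd() returns NA for any column that contains a missing value.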

**5. View the range and interquartile range of each variable**

Two more measures of spread that may be interesting to know are the **interquartile range** and **range** of each variable.

**Interquartile range** – the difference between the third quartile and the first quartile of a variable

**Range** – the difference between the largest and smallest value of a variable

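Once again we can use the **sapply()** function along with the **IQR()** and **range()** functions to find these values. Note that range() returns the minimum and maximum of a variable rather than the difference between them.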

    # find the interquartile range of the first four columns
    sapply(iris[, 1:4], IQR)

    # Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    #          1.3          0.5          3.5          1.5

    # find the range (min and max values) of the first four columns
    sapply(iris[, 1:4], range)

    #      Sepal.Length Sepal.Width Petal.Length Petal.Width
    # [1,]          4.3         2.0          1.0         0.1
    # [2,]          7.9         4.4          6.9         2.5
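
Because range() returns the minimum and maximum rather than a single number, one way to get the range as a difference is to wrap it in diff(). The sketch below does that and also rebuilds the interquartile range from the quartiles; the variable names are illustrative:

```r
# range as a single number per column: max - min
range_width <- sapply(iris[, 1:4], function(x) diff(range(x)))
range_width
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width
#          3.6          2.4          5.9          2.4

# interquartile range rebuilt from the first and third quartiles
q <- quantile(iris$Sepal.Length, c(0.25, 0.75))
manual_iqr <- unname(q[2] - q[1])

all.equal(manual_iqr, IQR(iris$Sepal.Length))
# [1] TRUE
```

The range widths follow directly from the minimum and maximum values in the output above, e.g. 7.9 − 4.3 = 3.6 for Sepal.Length.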

**6. View the skewness of each variable**

**Skewness** is a measure of the asymmetry of a dataset or distribution. This value can be positive or negative. A negative skew typically indicates that the *tail* is on the left side of the distribution. A positive value typically indicates that the tail is on the right.

To find the skewness of the values in each column of a dataset, we can use the **skewness()** function from the **e1071** package:

    # install the e1071 package if it is not already installed, then load it
    if (!requireNamespace("e1071", quietly = TRUE)) {
      install.packages("e1071")
    }
    library(e1071)

    # find the skewness of the first four columns of the iris dataset
    sapply(iris[, 1:4], skewness)

    # Sepal.Length  Sepal.Width Petal.Length  Petal.Width
    #    0.3086407    0.3126147   -0.2694109   -0.1009166
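
For readers who want to see what is being computed, the sketch below calculates skewness in base R as the third central moment divided by the cubed sample standard deviation. This is assumed to correspond to the default `type = 3` of e1071's skewness(); the helper name `skew_type3` is hypothetical and not part of any package.

```r
# hypothetical helper: third central moment divided by sd^3
# (assumed to match e1071::skewness() with its default type = 3)
skew_type3 <- function(x) {
  x <- x[!is.na(x)]             # drop missing values
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / sd(x)^3                  # sd() uses the n - 1 denominator
}

manual_skew <- sapply(iris[, 1:4], skew_type3)
round(manual_skew, 7)
```

Other packages and other `type` settings use slightly different sample-size corrections, so small differences between tools are expected.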

**7. View the correlation between variables**

It can also be useful to know the correlation between the variables in a dataset to see what type of linear relationships (if any) exist between the variables.

One simple way to see all of the pairwise correlations between the variables in a dataset is by using the **cor()** function to generate a correlation matrix.

Positive numbers in the matrix indicate a positive correlation between two variables, that is, when one variable increases, the other tends to increase as well.

Negative numbers in the matrix indicate a negative correlation between two variables, that is, when one variable increases, the other tends to decrease.

Numbers close to zero in the matrix indicate little to no correlation between two variables, that is, there is little to no linear association between the variables.

    # generate a correlation matrix for the first four columns of the iris dataset
    # and round the values in the matrix to one decimal place
    round(cor(iris[, 1:4]), 1)

    #              Sepal.Length Sepal.Width Petal.Length Petal.Width
    # Sepal.Length          1.0        -0.1          0.9         0.8
    # Sepal.Width          -0.1         1.0         -0.4        -0.4
    # Petal.Length          0.9        -0.4          1.0         1.0
    # Petal.Width           0.8        -0.4          1.0         1.0
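
Rounding to one decimal place hides some detail: the matrix shows 1.0 for Petal.Length and Petal.Width even though the two variables are not perfectly correlated. The sketch below pulls out a single pairwise correlation at full precision and also computes a rank-based (Spearman) matrix, which does not assume a linear relationship; the variable names are illustrative:

```r
num_cols <- iris[, 1:4]

# full-precision Pearson correlation matrix (the default method)
cor_matrix <- cor(num_cols)

# a single pair, either computed directly or looked up in the matrix
r_petal <- cor(iris$Petal.Length, iris$Petal.Width)
r_petal_lookup <- cor_matrix["Petal.Length", "Petal.Width"]

# rank-based alternative for monotonic but non-linear relationships
cor_spearman <- cor(num_cols, method = "spearman")
round(cor_spearman, 2)
```

A correlation matrix is always symmetric with ones on the diagonal, so only the values above (or below) the diagonal need to be read.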