Often when we fit machine learning algorithms to datasets, we first split the dataset into a training set and a test set.
There are three common ways to split data into training and test sets in R:
Method 1: Use Base R
#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(df), replace=TRUE, prob=c(0.7,0.3))
train <- df[sample, ]
test <- df[!sample, ]
Method 2: Use caTools package
library(caTools)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample.split(df$any_column_name, SplitRatio = 0.7)
train <- subset(df, sample == TRUE)
test <- subset(df, sample == FALSE)
Method 3: Use dplyr package
library(dplyr)

#make this example reproducible
set.seed(1)

#create ID column
df$id <- 1:nrow(df)

#use 70% of dataset as training set and 30% as test set
train <- df %>% dplyr::sample_frac(0.70)
test <- dplyr::anti_join(df, train, by = 'id')
The following examples show how to use each method in practice with the built-in iris dataset in R.
Example 1: Split Data Into Training & Test Set Using Base R
The following code shows how to use base R to split the iris dataset into a training and test set, using 70% of the rows as the training set and the remaining 30% as the test set:
#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(iris), replace=TRUE, prob=c(0.7,0.3))
train <- iris[sample, ]
test <- iris[!sample, ]

#view dimensions of training set
dim(train)
[1] 106 5

#view dimensions of test set
dim(test)
[1] 44 5
From the output we can see:
- The training set is a data frame with 106 rows and 5 columns.
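- The test set is a data frame with 44 rows and 5 columns.
Since the original data frame had 150 total rows, the training set contains 106 / 150 ≈ 70.7% of the original rows. The split is not exactly 70% because this method assigns each row to the training set independently with probability 0.7, so the size of the training set varies with the random seed.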
We can also view the first few rows of the training set if we’d like:
#view first few rows of training set
head(train)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
8          5.0         3.4          1.5         0.2  setosa
9          4.4         2.9          1.4         0.2  setosa
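The base R split in Example 1 is only approximately 70/30, because each row is assigned at random on its own. If an exact 70/30 split is preferred, one option is to sample row indices without replacement. The following is a minimal sketch of that alternative; the names n_train, train_idx, train_exact and test_exact are illustrative and do not appear in the examples above.

```r
#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#number of rows to place in the training set (70% of 150 = 105)
n_train <- round(0.7 * nrow(iris))

#draw n_train distinct row indices (sampling without replacement)
train_idx <- sample(seq_len(nrow(iris)), size = n_train)

#sampled rows form the training set, all remaining rows form the test set
train_exact <- iris[train_idx, ]
test_exact <- iris[-train_idx, ]

#view dimensions of each set
dim(train_exact)
dim(test_exact)
```

With this approach the training set always has 105 rows and the test set 45 rows, whatever seed is used; only which rows land in each set changes.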
Example 2: Split Data Into Training & Test Set Using caTools
The following code shows how to use the caTools package in R to split the iris dataset into a training and test set, using 70% of the rows as the training set and the remaining 30% as the test set:
library(caTools)

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#use 70% of dataset as training set and remaining 30% as test set
sample <- sample.split(iris$Species, SplitRatio = 0.7)
train <- subset(iris, sample == TRUE)
test <- subset(iris, sample == FALSE)

#view dimensions of training set
dim(train)
[1] 105 5

#view dimensions of test set
dim(test)
[1] 45 5
From the output we can see:
- The training set is a data frame with 105 rows and 5 columns.
- The test set is a data frame with 45 rows and 5 columns.
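The training set in Example 2 has exactly 105 rows rather than an approximate count. This is consistent with how the caTools documentation describes sample.split(): it splits the vector it is given (here iris$Species) while preserving the relative ratios of the labels. The following is a minimal base R sketch of that idea (a stratified split), not the caTools implementation; the names idx_by_species, train_strat and test_strat are illustrative.

```r
#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#group row indices by species (50 rows per species in iris)
idx_by_species <- split(seq_len(nrow(iris)), iris$Species)

#within each species, pick 70% of the row indices at random
train_idx <- unlist(lapply(idx_by_species, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

#selected rows form the training set, all remaining rows form the test set
train_strat <- iris[train_idx, ]
test_strat <- iris[-train_idx, ]

#count rows per species in each set
table(train_strat$Species)
table(test_strat$Species)
```

Each species contributes 35 rows to the training set and 15 to the test set, so the class proportions of the full dataset carry over to both sets.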
Example 3: Split Data Into Training & Test Set Using dplyr
The following code shows how to use the dplyr package in R to split the iris dataset into a training and test set, using 70% of the rows as the training set and the remaining 30% as the test set:
library(dplyr)

#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#create ID variable
iris$id <- 1:nrow(iris)

#use 70% of dataset as training set and remaining 30% as test set
train <- iris %>% dplyr::sample_frac(0.7)
test <- dplyr::anti_join(iris, train, by = 'id')

#view dimensions of training set
dim(train)
[1] 105 6

#view dimensions of test set
dim(test)
[1] 45 6
From the output we can see:
- The training set is a data frame with 105 rows and 6 columns.
- The test set is a data frame with 45 rows and 6 columns.
Note that these training and test sets contain one extra ‘id’ column that we created.
Be sure not to use this column (or drop it entirely from the data frames) when fitting your machine learning algorithm.
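One way to drop the helper column is sketched below. To keep the snippet runnable without extra packages, it rebuilds stand-ins for the train and test data frames in base R instead of reusing the dplyr code from Example 3; the name train_ids is illustrative. It also checks that no row ended up in both sets before the column is removed.

```r
#load iris dataset
data(iris)

#make this example reproducible
set.seed(1)

#stand-ins for the train and test data frames from Example 3 (built with base R here)
iris$id <- seq_len(nrow(iris))
train_ids <- sample(iris$id, size = round(0.7 * nrow(iris)))
train <- iris[iris$id %in% train_ids, ]
test <- iris[!(iris$id %in% train_ids), ]

#count IDs that appear in both sets (0 means the sets do not overlap)
length(intersect(train$id, test$id))

#drop the helper column from both data frames
train$id <- NULL
test$id <- NULL

#view remaining column names
names(train)
```

With dplyr loaded, select(train, -id) removes the column in the same way.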