You can use the **sample.split()** function from the **caTools** package in R to split a data frame into training and testing sets for model building.

This function uses the following basic syntax:

**sample.split(Y, SplitRatio, …)**

where:

**Y**: vector of outcomes (labels)

**SplitRatio**: proportion of the data to use in the training set (for example, 0.8 for 80%)
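To make the return value concrete, here is a minimal sketch using a made-up label vector `y` (the vector, the seed, and the 0.75 ratio are assumptions chosen only for illustration). It shows that **sample.split()** returns a logical vector rather than the split data itself. The caTools documentation describes the function as preserving the relative ratios of the labels in **Y**, which the cross-tabulation lets you inspect:

```r
library(caTools)

# Hypothetical label vector: 60 "a" labels and 40 "b" labels
set.seed(1)
y <- rep(c("a", "b"), times = c(60, 40))

# sample.split() returns a logical vector the same length as y:
# TRUE marks training observations, FALSE marks test observations
mask <- sample.split(y, SplitRatio = 0.75)

# Confirm the type and length of the result
is.logical(mask)
length(mask)

# Cross-tabulate labels against the mask to inspect the per-label split
table(label = y, in_training = mask)
```

Because the result is only a TRUE/FALSE mask, it has to be applied to the data frame (for example with `subset()`) to produce the actual training and testing sets.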

The following example shows how to use this function in practice.

**Example: How to Use sample.split() in R**

Suppose we have a data frame in R with 1,000 rows that contains the number of **hours** studied by students and their corresponding **score** on a final exam:

**#make this example reproducible
set.seed(0)
#create data frame
df <- data.frame(hours=runif(1000, min=0, max=10),
score=runif(1000, min=40, max=100))
#view head of data frame
head(df)
hours score
1 8.966972 55.93220
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
5 9.082078 97.29928
6 2.016819 47.10139**

Suppose we would like to fit a simple linear regression model that uses hours studied to predict final exam score, training the model on 80% of the rows in the data frame and testing it on the remaining 20%.

The following code shows how to use the **sample.split()** function from the **caTools** package to split the data frame into training and testing sets:

**library(caTools)
#specify split
split = sample.split(df$score, SplitRatio=0.8)
#create training set
df_train = subset(df, split==TRUE)
#create test set
df_test = subset(df, split==FALSE)
#view number of rows in each set
nrow(df_train)
[1] 800
nrow(df_test)
[1] 200
**

We can see that our training dataset contains 800 rows, which represents 80% of the original dataset.

Similarly, we can see that our test dataset contains 200 rows, which represents 20% of the original dataset.
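Which rows land in each set depends on R's random number generator, so the assignment can change from run to run unless a seed is set first. The sketch below is one way to check this. The `score` vector stands in for `df$score`, and the seed value 42 is an arbitrary assumption:

```r
library(caTools)

# Hypothetical outcome vector standing in for df$score
set.seed(0)
score <- runif(1000, min = 40, max = 100)

# Reset the seed immediately before each call so both calls
# draw the same sequence of random numbers
set.seed(42)
split_a <- sample.split(score, SplitRatio = 0.8)

set.seed(42)
split_b <- sample.split(score, SplitRatio = 0.8)

# TRUE when the two splits assign every row identically
identical(split_a, split_b)

# Number of rows assigned to the training set
sum(split_a)
```

Calling `set.seed()` right before `sample.split()` is a simple way to make a split repeatable even when other random operations run earlier in the script.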

We can also view the first few rows of each set:

**#view head of training set
head(df_train)
hours score
1 8.966972 55.93220
5 9.082078 97.29928
6 2.016819 47.10139
7 8.983897 42.34600
8 9.446753 70.27030
9 6.607978 74.70895
#view head of testing set
head(df_test)
hours score
2 2.655087 71.84853
3 3.721239 81.09165
4 5.728534 62.99700
20 3.800352 47.95551
23 2.121425 89.17611
35 1.862176 98.07025
**

We can then proceed to train the regression model using the training set and assess its performance using the testing set.
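As a rough sketch of that last step, the code below fits the model with `lm()` on the training set and computes the root mean squared error (RMSE) on the testing set. RMSE is only one possible metric and is an assumption here, since the tutorial does not name one. The names `model`, `pred`, and `rmse` are illustrative:

```r
library(caTools)

# Recreate the example data and the 80/20 split
set.seed(0)
df <- data.frame(hours = runif(1000, min = 0, max = 10),
                 score = runif(1000, min = 40, max = 100))
split <- sample.split(df$score, SplitRatio = 0.8)
df_train <- subset(df, split == TRUE)
df_test <- subset(df, split == FALSE)

# Fit a simple linear regression using the training rows only
model <- lm(score ~ hours, data = df_train)

# Predict exam scores for the held-out test rows
pred <- predict(model, newdata = df_test)

# Root mean squared error on the test set (lower is better)
rmse <- sqrt(mean((df_test$score - pred)^2))
rmse
```

Because `hours` and `score` were simulated independently in this example, the model should show little predictive signal. With real data, the test-set error gives an estimate of how well the model generalizes to unseen rows.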

**Additional Resources**

The following tutorials explain how to perform other common tasks in R:

How to Perform K-Fold Cross Validation in R

How to Perform Multiple Linear Regression in R

How to Perform Logistic Regression in R