How to Create a Train and Test Set from a Pandas DataFrame


When fitting machine learning models to datasets, we often split the dataset into two sets:

1. Training Set: Used to train the model (typically 70-80% of the original dataset)

2. Testing Set: Used to get an unbiased estimate of the model's performance on data it has not seen (typically 20-30% of the original dataset)

In Python, there are two common ways to split a pandas DataFrame into a training set and testing set:

Method 1: Use train_test_split() from sklearn

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.2, random_state=0)

Method 2: Use sample() from pandas

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

The examples below show how to use each method with this pandas DataFrame:

import pandas as pd
import numpy as np

#make this example reproducible
np.random.seed(1)

#create DataFrame with 1,000 rows and 3 columns
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

#view first few rows of DataFrame
df.head()

   x1  x2  y
0   5   1  1
1  11   8  0
2  12   4  1
3   8   7  0
4   9   0  0

Example 1: Use train_test_split() from sklearn

The following code shows how to use the train_test_split() function from sklearn to split the pandas DataFrame into training and test sets:

from sklearn.model_selection import train_test_split

#split original DataFrame into training and testing sets
train, test = train_test_split(df, test_size=0.2, random_state=0)

#view first few rows of each set
print(train.head())

     x1  x2  y
687  16   2  0
500  18   2  1
332   4  10  1
979   2   8  1
817  11   1  0

print(test.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

From the output we can see that two sets have been created:

  • Training set: 800 rows and 3 columns
  • Testing set: 200 rows and 3 columns

Note that test_size sets the proportion of observations from the original DataFrame that is assigned to the testing set (0.2 means 20%), and random_state fixes the random seed so that the split is reproducible.
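The two arguments described above can be checked directly. The following is a minimal sketch that rebuilds the same DataFrame, confirms that two calls with the same random_state select the same rows, and shows that test_size also accepts an integer, which is treated as a row count rather than a proportion. The variable names (train_a, test_c, same_split, and so on) are chosen here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# rebuild the DataFrame used in this tutorial
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

# two calls with the same random_state select the same rows
train_a, test_a = train_test_split(df, test_size=0.2, random_state=0)
train_b, test_b = train_test_split(df, test_size=0.2, random_state=0)
same_split = (train_a.index.equals(train_b.index)
              and test_a.index.equals(test_b.index))
print(same_split)

# an integer test_size is interpreted as a number of rows
train_c, test_c = train_test_split(df, test_size=250, random_state=0)
print(train_c.shape, test_c.shape)
```

If random_state is omitted, each call will generally produce a different split, so it is worth setting whenever results need to be compared across runs.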

Example 2: Use sample() from pandas

The following code shows how to use the sample() function from pandas to split the pandas DataFrame into training and test sets:

#split original DataFrame into training and testing sets
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

#view first few rows of each set
print(train.head())

     x1  x2  y
993  22   1  1
859  27   6  0
298  27   8  1
553  20   6  0
672   9   2  1

print(test.head())

    x1  x2  y
9   16   5  0
11  12  10  0
19   5   9  0
23  28   1  1
28  18   0  1

#print size of each set
print(train.shape, test.shape)

(800, 3) (200, 3)

From the output we can see that two sets have been created:

  • Training set: 800 rows and 3 columns
  • Testing set: 200 rows and 3 columns

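Note that frac sets the proportion of observations from the original DataFrame that is assigned to the training set (0.8 means 80%), and random_state makes the split reproducible. The testing set is then formed from the remaining rows by dropping the index labels of the training set.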
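Because this method builds the testing set by dropping index labels, it relies on the DataFrame having a unique index. The sketch below first confirms that the two sets from this tutorial do not overlap, then uses a small hypothetical DataFrame (dup, not part of the original example) with repeated index labels to show how rows can be lost, and how resetting the index first avoids the problem.

```python
import numpy as np
import pandas as pd

# rebuild the DataFrame used in this tutorial
np.random.seed(1)
df = pd.DataFrame({'x1': np.random.randint(30, size=1000),
                   'x2': np.random.randint(12, size=1000),
                   'y': np.random.randint(2, size=1000)})

train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# the two sets share no rows and together cover the whole DataFrame
overlap = train.index.intersection(test.index)
print(len(overlap), len(train) + len(test) == len(df))

# hypothetical DataFrame whose index labels are repeated
dup = pd.DataFrame({'x': range(10)}, index=[0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
dup_train = dup.sample(frac=0.5, random_state=0)

# drop() removes every row that shares a sampled label, so rows go missing
dup_test_bad = dup.drop(dup_train.index)
print(len(dup_train) + len(dup_test_bad) < len(dup))

# resetting the index first gives each row a unique label
clean = dup.reset_index(drop=True)
clean_train = clean.sample(frac=0.5, random_state=0)
clean_test = clean.drop(clean_train.index)
print(len(clean_train) + len(clean_test) == len(clean))
```

The DataFrame created in this tutorial has the default unique index, so the split above is safe as written; the reset step is only needed when the index contains duplicates.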

Additional Resources

The following tutorials explain how to perform other common tasks in Python:

How to Perform Logistic Regression in Python
How to Create a Confusion Matrix in Python
How to Calculate Balanced Accuracy in Python
