Systematic Sampling in Pandas (With Examples)


Researchers often take samples from a population and use the data from the sample to draw conclusions about the population as a whole.

One commonly used sampling method is systematic sampling, which is implemented with a simple two step process:

1. Place each member of a population in some order.

2. Choose a random starting point and select every nth member to be in the sample.

This tutorial explains how to perform systematic sampling on a pandas DataFrame in Python.

Example: Systematic Sampling in Pandas

Suppose a teacher wants to obtain a sample of 100 students from a school that has 500 total students. She chooses to use systematic sampling in which she places each student in alphabetical order according to their last name, randomly chooses a starting point, and picks every 5th student to be in the sample.

The following code shows how to create a fake data frame to work with in Python:

import pandas as pd
import numpy as np
import string
import random

#make this example reproducible
np.random.seed(0)

#create simple function to generate random last names
def randomNames(size=6, chars=string.ascii_uppercase):
    return ''.join(random.choice(chars) for _ in range(size))

#create DataFrame
df = pd.DataFrame({'last_name': [randomNames() for _ in range(500)],
                   'GPA': np.random.normal(loc=85, scale=3, size=500)})

#view first six rows of DataFrame
df.head()

last_name	GPA
0	PXGPIV	86.667888
1	JKRRQI	87.677422
2	TRIZTC	83.733056
3	YHUGIN	85.314142
4	ZVUNVK	85.684160

And the following code shows how to obtain a sample of 100 students through systematic sampling:

#obtain systematic sample by selecting every 5th row
sys_sample_df = df.iloc[::5]

#view first six rows of DataFrame
sys_sample_df.head()

   last_name      gpa
3      ORJFW 88.78065
8      RWPSB 81.96988
13     RACZU 79.21433
18     ZOHKA 80.47246
23     QJETK 87.09991
28     JTHWB 83.87300

#view dimensions of data frame
sys_sample_df.shape

(100, 2)

Notice that the first member included in the sample was in the first row of the original data frame. Each subsequent member in the sample is located 5 rows after the previous member.

And from using shape() we can see that the systematic sample we obtained is a data frame with 100 rows and 2 columns.

Additional Resources

Types of Sampling Methods
Cluster Sampling in Pandas
Stratified Sampling in Pandas

Leave a Reply

Your email address will not be published.