Researchers often take samples from a population and use the data from the sample to draw conclusions about the population as a whole.

One commonly used sampling method is **systematic sampling**, which is implemented with a simple two-step process:

**1.** Place each member of a population in some order.

**2.** Choose a random starting point and select every *n*th member to be in the sample.

This tutorial explains how to perform systematic sampling in R.

**Example: Systematic Sampling in R**

Suppose a superintendent wants to obtain a sample of 100 students from a school that has 500 total students. She chooses to use systematic sampling in which she places each student in alphabetical order according to their last name, randomly chooses a starting point, and picks every 5th student to be in the sample.
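The interval of 5 follows from dividing the population size by the sample size. The short sketch below works through that arithmetic; the variable names `N`, `n`, `k`, and `positions_by_start` are chosen here for illustration only.

```r
# Population size and desired sample size from the example
N <- 500
n <- 100

# Sampling interval: pick every k-th student
k <- N / n
k

# For every possible starting point between 1 and k, list the
# positions that would be selected
positions_by_start <- lapply(1:k, function(r) seq(r, by = k, length.out = n))

# Check that each start yields exactly n positions, none beyond N
all(sapply(positions_by_start, function(p) length(p) == n && max(p) <= N))
```

Because 500 divides evenly by 100, every starting point gives a full sample of 100 students without running past the end of the list.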

The following code shows how to create a fake data frame to work with in R:

    #make this example reproducible
    set.seed(1)

    #create simple function to generate random last names
    randomNames <- function(n = 5000) {
      do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
    }

    #create data frame
    df <- data.frame(last_name = randomNames(500),
                     gpa = rnorm(500, mean=82, sd=3))

    #view first six rows of data frame
    head(df)

      last_name      gpa
    1     GONBW 82.19580
    2     JRRWZ 85.10598
    3     ORJFW 88.78065
    4     XRYNL 85.94409
    5     FMDCE 79.38993
    6     XZBJC 80.49061

And the following code shows how to obtain a sample of 100 students through systematic sampling:

    #define function to obtain systematic sample
    #N = population size, n = desired sample size
    obtain_sys = function(N,n){
      #sampling interval
      k = ceiling(N/n)
      #random starting point between 1 and k
      r = sample(1:k, 1)
      #row positions: every k-th row, starting at r
      seq(r, r + k*(n-1), k)
    }

    #obtain systematic sample
    sys_sample_df = df[obtain_sys(nrow(df), 100), ]

    #view first six rows of data frame
    head(sys_sample_df)

       last_name      gpa
    3      ORJFW 88.78065
    8      RWPSB 81.96988
    13     RACZU 79.21433
    18     ZOHKA 80.47246
    23     QJETK 87.09991
    28     JTHWB 83.87300

    #view dimensions of data frame
    dim(sys_sample_df)

    [1] 100   2

Notice that the first member included in the sample was in row 3 of the original data frame. Each subsequent member in the sample is located 5 rows after the previous member.
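One way to confirm this spacing is to look at the gaps between the row names of the sampled rows. The sketch below is self-contained and uses a stand-in data frame (`demo_df`, a hypothetical name) in place of the student data, together with the same sampling function as above.

```r
# Stand-in population of 500 rows (hypothetical; any data frame works)
demo_df <- data.frame(student_id = 1:500)

# Same sampling function as in the tutorial
obtain_sys <- function(N, n) {
  k <- ceiling(N / n)
  r <- sample(1:k, 1)
  seq(r, r + k * (n - 1), k)
}

set.seed(1)
demo_sample <- demo_df[obtain_sys(nrow(demo_df), 100), , drop = FALSE]

# Row names of the subset are the original row positions
rows <- as.integer(rownames(demo_sample))

# Gaps between consecutive selected rows; every gap should equal 5
unique(diff(rows))
```

If the only gap reported is 5, the sample is evenly spaced through the ordered list, as systematic sampling requires.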

Using **dim()**, we can see that the systematic sample we obtained is a data frame with 100 rows and 2 columns.

**Additional Resources**

Types of Sampling Methods

Stratified Sampling in R

Cluster Sampling in R

Hi Zach, does this only work if the sample size is a divisor of the number of rows in the data frame? For example, I tried to use your code to take a sample of 49 students out of 500:

sys_sample_df = df[obtain_sys(nrow(df), 49), ]

And my last four rows were NA. I assume this is because N/n leaves a remainder: in this case the function selects every 11th student, but by the time it reaches the end of the data frame it has selected only 45 students and still needs four more. It would also struggle if the sample were more than half of the total population (n > N/2). I've written a rough while loop that keeps sampling until it reaches the desired sample size, but I'm sure there is a much neater way:

    set.seed(1)

    #create simple function to generate random last names
    randomNames <- function(n = 5000) {
      do.call(paste0, replicate(5, sample(LETTERS, n, TRUE), FALSE))
    }

    #create data frame
    df <- data.frame(last_name = randomNames(500),
                     gpa = rnorm(500, mean=82, sd=3))

    #view first six rows of data frame
    head(df)

    #Function to sample
    obtain_sys = function(N,n){
      k = ceiling(N/n)
      r = sample(1:k, 1)
      seq(r, r + k*(n-1), k)
    }

    #Total population size
    N_tot = 500

    #Desired sample size
    n_samp = 49

    #Run initial sample
    sys_sample_df <- df[obtain_sys(N_tot, n_samp), ]

    #Remove any duplicates and NAs
    sys_sample_df <- sys_sample_df[!duplicated(sys_sample_df$last_name),]
    sys_sample_df <- sys_sample_df[!is.na(sys_sample_df$last_name),]

    #Keep sampling until the desired sample size is reached
    while(length(unique(sys_sample_df$last_name))<n_samp){

      #remove those already sampled
      df2 <- df[which(!df$last_name %in% sys_sample_df$last_name), ]

      #redefine samples and total
      N_tot2 <- N_tot - length(unique(sys_sample_df$last_name))
      n_samp2 <- n_samp - length(unique(sys_sample_df$last_name))

      #Run sample again
      sys_sample_df2 = df2[obtain_sys(N_tot2, n_samp2), ]

      #Remove any duplicates and NAs
      sys_sample_df2 <- sys_sample_df2[!duplicated(sys_sample_df2$last_name),]
      sys_sample_df2 <- sys_sample_df2[!is.na(sys_sample_df2$last_name),]

      #Join together samples
      sys_sample_df <- rbind(sys_sample_df, sys_sample_df2)

    }
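One possibly neater alternative to the loop above is the fractional-interval variant of systematic sampling: keep the interval k = N/n as a real number, draw a real-valued starting point inside the first interval, and round each position down. This is only a sketch under those assumptions, not the tutorial author's method, and the function name `obtain_sys_frac` is made up for illustration.

```r
# Fractional-interval systematic sampling (illustrative helper, not from the tutorial)
obtain_sys_frac <- function(N, n) {
  stopifnot(n >= 1, n <= N)
  k <- N / n                         # real-valued interval, not rounded up
  r <- runif(1, min = 0, max = k)    # random real start within the first interval
  idx <- floor(r + k * (0:(n - 1))) + 1
  pmin(idx, N)                       # guard against floating-point overshoot
}

set.seed(1)
idx <- obtain_sys_frac(500, 49)

length(idx)          # number of positions returned
anyDuplicated(idx)   # 0 means no repeated positions
range(idx)           # smallest and largest selected position
```

The positions could then be used as `df[obtain_sys_frac(nrow(df), 49), ]`. Since the interval is never rounded up, the positions cannot run past the last row, so no NA rows appear and no repeat sampling is needed. The trade-off is that the gaps between selected rows are no longer constant: for 49 out of 500 they alternate between 10 and 11.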