Bootstrapping in Statistics

This tutorial explains the statistical procedure known as bootstrapping and walks through an example.

Bootstrapping: The Background

Suppose we want to know the average height of a student in a particular school that has 1,000 students. Since it would take too long to measure the height of each student, we decide to randomly measure 100 students and use the average height of these 100 to estimate the average height of all 1,000 students.

[Figure: an example of a sample and a population in statistics]

The 1,000 students represent the population we are interested in.

The 100 students we randomly chose to measure represent the sample we will use to draw inferences about the population.

But how can we be sure that the average height of the students in the sample is an accurate representation of the average height of all 1,000 students at this school? For example, suppose we find that the average height of the 100 students in the sample is 68 inches. How do we know that this is a good estimate of the true average height of all 1,000 students?

One common way to account for this uncertainty is to produce a confidence interval, which is a range of values that we believe contains the true average height of all 1,000 students.

For example, we could produce a 95% confidence interval, which would allow us to be 95% confident that the true average height of students at this school lies within the lower and upper boundaries of the confidence interval.

In order to construct this confidence interval, we have two choices:

1. Create a confidence interval by making an assumption about the shape of the population. For example, we may make the assumption that the height of the 1,000 students at this school is normally distributed. Thus, to create a confidence interval for the population mean, we could simply use the following formula:

$$\bar{x} \pm t_{n-1} \cdot \frac{s}{\sqrt{n}}$$

where $\bar{x}$ is our sample mean, $t_{n-1}$ is the t critical value that comes from the t distribution table with n-1 degrees of freedom, $s$ is our sample standard deviation, and $n$ is our sample size.

By simply plugging in the necessary values into this formula, we could come up with a confidence interval for the true average height of students at this school.

2. Create a bootstrapped confidence interval by using the information from the sample we already have. Using this approach, we take repeated samples with replacement from the original sample. Thus, we generate lots of simulated samples and each of these simulated samples has its own mean.

Then, we can create a histogram that shows the distribution of these means and observe the sampling distribution of the mean. We can then generate a 95% confidence interval by finding the values of this distribution located at the percentiles 2.5% and 97.5%. 
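As a rough illustration of the first choice, the formula-based interval can be computed by hand in R and compared with the built-in t.test() function. This is only a sketch: the heights are simulated, and the mean of 68 inches and standard deviation of 3 inches are assumptions made for the example, not values from a real school.

```r
# Simulated heights (inches) stand in for the 100 measured students;
# the mean of 68 and SD of 3 are illustrative assumptions.
set.seed(1)
heights <- rnorm(100, mean = 68, sd = 3)

n      <- length(heights)
x_bar  <- mean(heights)
s      <- sd(heights)
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value

# x_bar +/- t_{n-1} * (s / sqrt(n))
t_interval <- c(lower = x_bar - t_crit * s / sqrt(n),
                upper = x_bar + t_crit * s / sqrt(n))
print(t_interval)

# Built-in equivalent, for comparison
builtin_interval <- t.test(heights, conf.level = 0.95)$conf.int
print(builtin_interval)
```

The two printed intervals should agree, since t.test() applies the same formula to a single sample.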

Bootstrapping: The Approach

Bootstrapping uses the following approach:

1. Obtain a simple random sample from the population.

2. Generate hundreds or thousands of simulated samples by taking repeated samples with replacement from this original sample.

3. Construct a confidence interval for the sample statistic we’re interested in using the sampling distribution formed by the simulated samples.

For example, suppose our original sample contains the values: 1, 2, 3, 4, 5

When we take our first bootstrapped sample, we may select the values: 1, 1, 2, 5, 5

When we take our second bootstrapped sample, we may select the values: 3, 4, 4, 4, 5

When we take our third bootstrapped sample, we may select the values: 1, 2, 3, 4, 4

Our bootstrapped samples have the following properties:

  • It’s possible for one value in the original sample to show up in the bootstrapped sample more than once, since we’re sampling from the original sample with replacement.
  • Each bootstrapped sample is the same size as the original sample. 

When we actually perform bootstrapping for a real problem, our sample sizes would likely be much larger and we would ideally take hundreds or thousands of bootstrapped samples using statistical software. 
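The resampling step described above can be sketched in base R with sample() and replicate(). The seed and the number of resamples (1,000) are arbitrary choices for illustration, so the values drawn will not match the three bootstrapped samples listed above.

```r
set.seed(1)  # arbitrary seed, so the draws are repeatable
original_sample <- c(1, 2, 3, 4, 5)

# One bootstrapped sample: same size as the original, drawn with replacement
one_resample <- sample(original_sample,
                       size = length(original_sample),
                       replace = TRUE)
print(one_resample)

# Many bootstrapped samples: each column of the matrix is one resample
n_resamples <- 1000
resamples <- replicate(
  n_resamples,
  sample(original_sample, size = length(original_sample), replace = TRUE)
)

# One mean per bootstrapped sample
resample_means <- colMeans(resamples)
print(dim(resamples))
print(summary(resample_means))
```

Each column contains only values from the original sample, possibly repeated, and has the same length as the original sample.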

Bootstrapping: An Example

For this example, we will use the built-in R dataset iris, which contains information about 150 different flowers:

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Suppose we are interested in finding the mean Petal.Length for all of the flowers in this dataset.

First, we take a simple random sample of 30 flowers.

Next, we generate 1,000 bootstrapped samples from this original sample of 30 and find the mean Petal.Length for each bootstrapped sample.

Then, we use a histogram to visualize the sampling distribution of the mean Petal.Length:

[Figure: histogram of the bootstrapped sample means of Petal.Length in R]

Notice that some of the samples had a mean Petal.Length as low as 3 while other samples had a mean Petal.Length as high as 5. 

To create a 95% bootstrapped confidence interval, we simply identify the percentiles 2.5% and 97.5% (since 95% of the sample means lie between these two values) of this distribution and use those values as the lower and upper bounds of our confidence interval, respectively. These values turn out to be:

(3.472, 4.578) 

Thus, our 95% bootstrapped confidence interval for the mean Petal.Length for all flowers in the dataset is (3.472, 4.578).

Note: here is the R code used to run this example:

#set the seed to make this example reproducible
#(the seed value here is an assumption, so the interval may differ slightly from the one reported above)
set.seed(0)

#load bootstrap library
library(boot)

#define original sample of 30
sample_rows <- sample(1:nrow(iris), 30)
original_sample <- iris[sample_rows, ]

#define function to find mean Petal.Length
mean_function <- function(x, indices) {
  return(mean(x[indices]))
}

#bootstrapping with 1000 replications
bootstrapped_samples <- boot(data = original_sample$Petal.Length,
                             statistic = mean_function, R = 1000)

#create histogram of mean Petal.Length for each bootstrapped sample
hist(bootstrapped_samples$t)

#get 95% percentile confidence interval for mean
boot.ci(bootstrapped_samples, conf = 0.95, type = "perc")
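For readers who want to see what happens under the hood, the same kind of percentile interval can be approximated in base R without the boot package. This is a minimal sketch rather than the exact procedure used above: the seed is an arbitrary assumption, and quantile() may interpolate percentiles slightly differently from the boot package, so the bounds will not match (3.472, 4.578) exactly.

```r
set.seed(1)  # arbitrary seed (an assumption); results differ from the interval above

# Simple random sample of 30 flowers, drawn without replacement
petal_sample <- iris$Petal.Length[sample(nrow(iris), 30)]

# 1,000 bootstrapped means: resample the 30 values with replacement each time
boot_means <- replicate(1000, mean(sample(petal_sample, replace = TRUE)))

# Percentile interval: the 2.5th and 97.5th percentiles of the bootstrapped means
percentile_ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(percentile_ci)
```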

Advantages of Bootstrapping

“The central idea is that it may sometimes be better to draw conclusions about the characteristics of a population strictly from the sample at hand, rather than by making perhaps unrealistic assumptions about the population.” Mooney & Duval, Bootstrapping, 1993

Bootstrapping doesn’t require you to make any assumptions about the shape of the population you’re studying. Unlike traditional methods that require a normality assumption to create a confidence interval or perform a hypothesis test, bootstrapping has no such requirement.

Bootstrapping can therefore be used for a wider variety of distributions, including unknown distributions, and with small sample sizes. Even sample sizes as small as 10 can be used with bootstrapping.

In particular, bootstrapping is especially useful for estimating confidence intervals for sample statistics that have no known sampling distributions (like medians) since it makes no assumptions about the shape of the population distributions. 
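As a sketch of this idea, the code below bootstraps a confidence interval for the median of a small, skewed sample. The ten values are simulated from an exponential distribution purely for illustration, and the 2,000 resamples are an arbitrary choice. Intervals from samples this small tend to be rough, so treat the output as a demonstration of the mechanics rather than a precise estimate.

```r
set.seed(1)  # arbitrary seed

# Ten right-skewed values simulated from an exponential distribution
# (an illustrative assumption, not real data)
small_sample <- rexp(10, rate = 1 / 5)

# Bootstrapped medians: resample the ten values with replacement each time
boot_medians <- replicate(2000, median(sample(small_sample, replace = TRUE)))

# 95% percentile interval for the median
median_ci <- quantile(boot_medians, probs = c(0.025, 0.975))

print(median(small_sample))
print(median_ci)
```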

Why is Bootstrapping Valid?

You may be wondering why bootstrapping is a valid approach to constructing confidence intervals if the entire approach relies on repeatedly taking samples from one randomly selected sample from the population. 

It turns out to be a valid approach because most samples will, if they’re randomly chosen, look quite similar to the population they came from. By definition, when you select a simple random sample, every element of a population has an equal chance of being selected to be in the sample, which means a simple random sample is likely to be representative of the population as a whole. Of course, the larger the sample, the higher the likelihood that the sample is representative of the population. 

This means that when we take bootstrapped samples from the original sample, we’re likely taking bootstrapped samples from a tiny version of the larger population. This is why it’s valid to create a sampling distribution from all of these bootstrapped samples and consequently generate a confidence interval for a sample statistic using this sampling distribution. 
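One way to get a feel for this argument is a small simulation in which, unlike in practice, we can see the whole population. The sketch below compares the spread of sample means obtained by repeatedly sampling the population with the spread obtained by bootstrapping a single sample. The population of 1,000 heights is simulated, and its mean and standard deviation are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed

# Hypothetical population of 1,000 student heights in inches (simulated)
population <- rnorm(1000, mean = 68, sd = 3)

# What we normally cannot do: draw many fresh samples of 100 from the population
true_sampling_means <- replicate(2000, mean(sample(population, 100)))

# What bootstrapping does: resample one observed sample of 100 with replacement
observed_sample <- sample(population, 100)
boot_means <- replicate(2000, mean(sample(observed_sample, replace = TRUE)))

# The two spreads should be of similar size
spread <- c(true_sampling = sd(true_sampling_means),
            bootstrap     = sd(boot_means))
print(spread)
```

Note that the bootstrapped means are centered on the mean of the observed sample rather than on the population mean. What the bootstrap approximates is the variability of the statistic, which is what the confidence interval is built from.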

Bootstrapping A Variety of Sample Statistics

Although the example we used in this tutorial focused on the sample mean, bootstrapping can be used to produce confidence intervals for a wide variety of sample statistics including the mean, median, mode, standard deviation, correlations, proportions, odds ratios, and regression coefficients among others.
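To illustrate, the sketch below applies the same percentile approach to three different statistics from the iris dataset. The helper bootstrap_ci() is a hypothetical function written for this example (it does not come from any package), and the seed and number of resamples are arbitrary choices.

```r
set.seed(1)  # arbitrary seed

# Hypothetical helper (not from any package): percentile bootstrap interval
# for any statistic computed from the rows of a data frame.
bootstrap_ci <- function(data, statistic, n_resamples = 1000, conf = 0.95) {
  estimates <- replicate(n_resamples, {
    rows <- sample(nrow(data), replace = TRUE)  # resample row indices
    statistic(data[rows, , drop = FALSE])
  })
  alpha <- 1 - conf
  quantile(estimates, probs = c(alpha / 2, 1 - alpha / 2))
}

# Correlation between petal length and petal width
cor_ci <- bootstrap_ci(iris, function(d) cor(d$Petal.Length, d$Petal.Width))

# Standard deviation of petal length
sd_ci <- bootstrap_ci(iris, function(d) sd(d$Petal.Length))

# Slope from regressing petal width on petal length
slope_ci <- bootstrap_ci(iris, function(d) {
  coef(lm(Petal.Width ~ Petal.Length, data = d))[["Petal.Length"]]
})

print(cor_ci)
print(sd_ci)
print(slope_ci)
```

Only the statistic function changes from one call to the next. Resampling whole rows keeps each flower's measurements together, which matters for statistics such as correlations and regression coefficients that depend on more than one column.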
