Simulating Data in R: The Bootstrap
Introduction
A common problem in data analysis is a lack of data. This is unfortunate because when it comes to data: more is more. The more data we have, the more accurate our estimates are and the more confidence we can have in the conclusions that we draw from that data. But data doesn’t just grow on trees. It takes resources to gather data; collecting data requires spending time, money, and/or manpower. For many applications, the cost of collecting data poses a real limitation. This is a tricky trade-off: collecting more data can produce more reliable results, but this improvement comes at a cost.
Additionally, sometimes it’s just not possible to collect more data – even if we have the resources to do so. Maybe the event under study has passed. Maybe it’s no longer possible to contact participants.
Thankfully, there are some techniques we can use in instances where gathering more data is not an option. One such technique is the bootstrap. At its heart, the bootstrap solves the problem of collecting more data samples, not by sampling from the population, but by sampling from the initial (seed) dataset. The intuition behind the bootstrap is that if we assume that our original sample is representative of the population - that is, that the distribution of our sample reflects the distribution of the population - then sampling from our sample should approximate sampling from the population. In practice, this assumption holds in some instances and fails in others. But, for the most part, it’s quite useful for overcoming data limitations.
The Problem
Suppose we are interested in estimating the population mean. Statistically, the sample mean gives us a quality estimate of the population mean. But this estimate is just that, an estimate. Crucially, we don’t know how precise this estimate is; in other words, we don’t know how much uncertainty, or variance, is in the estimate. The variance of an estimate can have significant implications for how we interpret our findings.
For example, as an extreme case, imagine that you’re managing a venue and you estimate the average attendance is 50 people, but the standard error of your estimate is 35. Clearly, you cannot count on having a consistent level of traffic. Whereas if the standard error of your estimate was, say, 2, you would be reasonably confident that you’ll have around 50 people each night. This would be an important distinction if your business relies on a consistent level of traffic or volume in order to operate.
Getting back to the point: clearly the variance of an estimate is important, so how can we calculate it? We could estimate this quantity exactly as we would estimate any other quantity: sample a bunch of observations, then estimate by computing the statistic on the sample.
But when that statistic is the sample mean, which itself is computed on a sample, then, to estimate its variance, we would need a sample of samples. That is, we would need to collect multiple datasets, which - for reasons outlined above - can be problematic.
Enter the bootstrap.
The Bootstrap
Normally, collecting additional samples would be expensive. The bootstrap allows us to generate simulated samples by resampling our original dataset. We can then use these samples to calculate the statistic of interest. In our case, the variance of the sample mean.
To perform the bootstrap, first we create \(m\) additional datasets, \(B_1, B_2, ..., B_m\) from our original dataset. Each of the \(B_i\) bootstrap samples is built by sampling with replacement \(n\) observations from the original dataset, where \(n\) is the number of observations in our original sample.
Then, on each of the \(B_i\) bootstrap samples, we compute the statistic of interest \(\alpha_i\), called the bootstrap statistic. This could be the mode, variance, max, min, or any other statistic. In our case, we’re interested in the mean.
Finally, using our \(\alpha_1, \alpha_2, ..., \alpha_m\) bootstrap statistics we can compute the statistic of interest. Again, this could be any statistic, but in our case we’re interested in the variance.
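The procedure above can be sketched in a few lines of R. This is a minimal illustration using a hypothetical toy vector `x`; the choice of data and the number of bootstrap samples `m` are assumptions for the example.

```r
# A minimal sketch of the bootstrap procedure, using a hypothetical toy sample
x <- c(4, 8, 15, 16, 23, 42)  # toy data (assumed for illustration)
n <- length(x)                # size of the original sample
m <- 1000                     # number of bootstrap samples B_1, ..., B_m

# Steps 1 and 2: build each B_i by sampling n observations with replacement,
# then compute its bootstrap statistic alpha_i (here, the mean)
alpha <- replicate(m, mean(sample(x, size = n, replace = TRUE)))

# Step 3: compute the statistic of interest over the alpha_i
# (here, the variance of the sample mean)
var(alpha)
```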
Demo
Now we will use the bootstrap to estimate the standard deviation of the sample mean.
Suppose we have a sample of observations:
\[X \sim \mathcal{N}(\mu = 100, \sigma = 5) \]
set.seed(265) # We set the seed to produce reproducible results for the random sampling functions
sample_data <- rnorm(n = 1E4, mean = 100, sd = 5)
Statistical theory tells us that the standard error of the mean is \(\frac{\sigma}{\sqrt{n}}\). Under our test conditions this evaluates to \(\frac{5}{\sqrt{10000}} = .05\). How well do the results of the bootstrap meet this expectation?
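The standard error formula follows from the variance of a mean of \(n\) independent observations, each with variance \(\sigma^2\):

\[\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\sigma^2 = \frac{\sigma^2}{n}, \qquad \mathrm{SE}(\bar{X}) = \frac{\sigma}{\sqrt{n}}\]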
The function bootstrap() returns a bootstrap sample of an input vector. We then use replicate() to calculate the mean of 1,000 bootstrap samples (here the sample mean is our bootstrap statistic).
bootstrap <- function(vect){
  # Sample n indices with replacement and return the corresponding observations
  n <- length(vect)
  bootstrap_indices <- sample(x = 1:n, size = n, replace = TRUE)
  vect[bootstrap_indices]
}
bootstrapped_sample_means <- replicate(n = 1E3, expr = mean(bootstrap(sample_data)))
Recall that statistical theory tells us that the standard deviation of the mean should be \(\frac{5}{\sqrt{10000}} = .05\).
bootstrap_sd <- sd(bootstrapped_sample_means)
bootstrap_sd
## [1] 0.04953899
Our bootstrap estimate is within 0.92 percent of the theorized value.
Since this is a toy example, we know the actual distribution that our sample is drawn from. Thus, we can also estimate the standard deviation by drawing additional samples directly from the actual distribution. How well does our bootstrap estimate compare to this direct estimate?
repeated_direct_sample_means <- replicate(n = 1E3, expr = mean(rnorm(n = 1E4, mean = 100, sd = 5)))
direct_sd <- sd(repeated_direct_sample_means)
direct_sd
## [1] 0.04911771
Our bootstrap estimate is within 0.86 percent of the direct estimate.
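For reference, the percentage differences quoted above can be computed directly; the numeric values below are the estimates printed earlier.

```r
theoretical_sd <- 5 / sqrt(1E4)  # 0.05, from statistical theory
bootstrap_sd   <- 0.04953899     # bootstrap estimate from above
direct_sd      <- 0.04911771     # direct-sampling estimate from above

# Percent difference from the theorized value (~0.92 percent)
100 * abs(bootstrap_sd - theoretical_sd) / theoretical_sd

# Percent difference from the direct estimate (~0.86 percent)
100 * abs(bootstrap_sd - direct_sd) / direct_sd
```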
Conclusions
In practice, it’s not always possible to collect additional data when we need it. In those situations, the bootstrap is a useful tool for estimating statistics that may otherwise be difficult to estimate.