When dealing with statistical models, one common issue is that we don’t have enough data. To tackle this problem, we can use the bootstrap, which generates “new” data from the data we already have.
Takeaway
The principle behind the bootstrap method is simple:
The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.
Here are a few important points:
- The bootstrap estimates statistics by sampling a dataset with replacement. This means the bootstrap does more than just generate data from the original data.
- The bootstrap can be used to quantify the uncertainty of a given estimator, and that’s what “estimate statistics” means.
How bootstrap works
In general, it works as follows:
- Choose a number of bootstrap samples to perform
- Choose a sample size
- For each bootstrap sample:
  - Draw a sample with replacement with the chosen size
  - Calculate the statistic on the sample
- Calculate the mean of the calculated sample statistics.
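The steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library: the helper name bootstrap_statistic, its parameters (n_bootstraps, sample_size, seed) and their default values are assumptions made for this sketch, not part of any library.

```python
import random
import statistics

def bootstrap_statistic(data, statistic, n_bootstraps=1000, sample_size=None, seed=0):
    """Return the per-sample statistics and their mean (hypothetical helper)."""
    rng = random.Random(seed)        # fixed seed so the sketch is repeatable
    if sample_size is None:
        sample_size = len(data)      # assumption: default to the original size
    stats = []
    for _ in range(n_bootstraps):
        # Draw a sample with replacement with the chosen size.
        sample = rng.choices(data, k=sample_size)
        # Calculate the statistic on the sample.
        stats.append(statistic(sample))
    # Calculate the mean of the calculated sample statistics.
    return stats, statistics.fmean(stats)

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
stats, estimate = bootstrap_statistic(data, statistics.fmean)
spread = statistics.stdev(stats)     # how much the statistic varies across samples
print(f"bootstrap mean: {estimate:.3f}, spread: {spread:.3f}")
```

The spread of the per-sample statistics is one way to read off the uncertainty of the estimator: the more the statistic changes from one bootstrap sample to the next, the less certain the estimate.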
More specifically, in statistical inference (or “machine learning”, the trendy term people use nowadays), it works like this:
- Choose a number of bootstrap samples to perform
- Choose a sample size
- For each bootstrap sample:
  - Draw a sample with replacement with the chosen size
  - Fit a model on the data sample (aka train dataset)
  - Estimate the skill of the model on the out-of-bag sample (aka test dataset)
- Calculate the mean of the sample of model skill estimates.
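As a rough sketch of the model-based procedure above, the snippet below uses a deliberately trivial "model" that always predicts the mean of its training data, and measures skill as mean absolute error on the out-of-bag points. The toy model, the error metric, and the names bootstrap_model_skill, n_bootstraps, sample_size and seed are all assumptions chosen for illustration. A real model would take the place of the two marked lines.

```python
import random
import statistics

def bootstrap_model_skill(data, n_bootstraps=200, sample_size=None, seed=0):
    """Mean out-of-bag error of a toy 'predict the training mean' model."""
    rng = random.Random(seed)
    if sample_size is None:
        sample_size = len(data)      # assumption: default to the original size
    indices = range(len(data))
    scores = []
    for _ in range(n_bootstraps):
        # Draw indices (not values) with replacement, so repeated values
        # in the data cannot be confused with each other.
        train_idx = rng.choices(indices, k=sample_size)
        chosen = set(train_idx)
        oob_idx = [i for i in indices if i not in chosen]
        if not oob_idx:
            continue                 # every point was drawn; nothing to test on
        # "Fit" the toy model: it predicts the mean of the train dataset.
        prediction = statistics.fmean([data[i] for i in train_idx])
        # Estimate skill on the out-of-bag sample (mean absolute error here).
        error = statistics.fmean([abs(data[i] - prediction) for i in oob_idx])
        scores.append(error)
    # Calculate the mean of the sample of model skill estimates.
    return statistics.fmean(scores)

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
score = bootstrap_model_skill(data)
print(f"mean out-of-bag error: {score:.3f}")
```

Rounds with no out-of-bag points are skipped here. That is one possible design choice, not the only one.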
Easy example
Now suppose we have a dataset with 6 data points (observations):
data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
We now apply the bootstrap method:
- Choose the number of bootstrap samples. Here we choose 1, which means we generate 1 bootstrap sample dataset. For the sample size, we choose 3, which means each bootstrap sample dataset contains 3 observations.
- Randomly select a data point (observation) from the original dataset, data. Here we randomly select 0.4.
- Repeat this procedure 3 times, since we chose a sample size of 3 above.
- Now we have a bootstrap dataset sample:
bootstrap_dataset = [0.4, 0.5, 0.4]
- We can also call this bootstrap_dataset by another name: train_data. The data points that are not in bootstrap_dataset (train_data), namely 0.1, 0.2, 0.3, 0.6, are called test_data (some call them OOB, for out-of-bag).
- We use train_data to build and fit a model, and use test_data to test our model’s accuracy. (This is how the bootstrap can be used to quantify the uncertainty of a given estimator, and that’s what “estimate statistics” means.)
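The split in this example can be reproduced with a short Python sketch. The first part reuses the bootstrap sample from the text as-is. The second part draws a fresh sample, and the seed value and the names new_train and new_test are arbitrary choices made for this illustration. Filtering by value only works here because data contains no repeated values.

```python
import random

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

# The bootstrap sample from the text, reused as-is.
bootstrap_dataset = [0.4, 0.5, 0.4]
train_data = bootstrap_dataset
# Out-of-bag points: everything in data that was never drawn.
test_data = [x for x in data if x not in train_data]
print(test_data)  # [0.1, 0.2, 0.3, 0.6]

# Drawing a fresh bootstrap sample of size 3 (the result depends on the seed).
rng = random.Random(42)              # arbitrary seed, for repeatability only
new_train = rng.choices(data, k=3)   # sampling with replacement
new_test = [x for x in data if x not in new_train]
print(new_train, new_test)
```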