Bootstrapping Prediction Errors

by Alex Gold   Last Updated July 02, 2018 01:19 AM

I'm trying to compute non-parametric bootstrap prediction errors for a model I'm building. A few resources suggest the following procedure, given an input matrix $X$, a response vector $y$, a matrix of new inputs to predict on $X_p$, and $N$ resamples (e.g. 10,000):

for $i$ in $1, \dots, N$:

  1. Resample the rows of $X$ (and the corresponding entries of $y$) with replacement to get $X_i$
  2. Retrain the model $m_i$ on $X_i$ and predict the responses: $\hat{y_i} = m_i(X_p)$
  3. Generate the residual vector $\epsilon_i = y - \hat{y_i}$
  4. Resample (shuffle) the residuals to yield $\epsilon^*_i$, then form the bootstrap estimates $\hat{y^b_i} = \hat{y_i} + \epsilon^*_i$

The bootstrap point estimate and prediction intervals can then be computed directly from the collected $\hat{y^b_i}$ (e.g. their mean and quantiles).
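To make the procedure concrete, here is a minimal sketch of the loop in Python with NumPy. Several things in it are assumptions rather than part of the sources: the model is a hypothetical k-nearest-neighbour regressor (`knn_predict`) standing in for whatever model is actually used, the data are synthetic, and the residuals in step 3 are computed on the resampled training rows, since $y$ and the predictions at $X_p$ generally have different lengths. The names `bootstrap_prediction_intervals`, `n_resamples`, and `alpha` are made up for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_new, k=5):
    """Hypothetical stand-in model: k-nearest-neighbour regression."""
    # Pairwise squared Euclidean distances, shape (n_new, n_train).
    d = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def bootstrap_prediction_intervals(X, y, X_p, n_resamples=1000,
                                   alpha=0.05, k=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = np.empty((n_resamples, len(X_p)))
    for i in range(n_resamples):
        # Step 1: resample rows of X (and matching y) with replacement.
        idx = rng.integers(0, n, size=n)
        X_i, y_i = X[idx], y[idx]
        # Step 2: "retrain" on the resample and predict at X_p.
        y_hat = knn_predict(X_i, y_i, X_p, k)
        # Step 3 (assumption): residuals on the resampled training rows.
        resid = y_i - knn_predict(X_i, y_i, X_i, k)
        # Step 4: draw one residual per prediction and add it back.
        eps_star = rng.choice(resid, size=len(X_p), replace=True)
        draws[i] = y_hat + eps_star
    # Summarise each column (one per row of X_p) by mean and quantiles.
    lower = np.quantile(draws, alpha / 2, axis=0)
    upper = np.quantile(draws, 1 - alpha / 2, axis=0)
    return draws.mean(axis=0), lower, upper


# Synthetic data, purely for illustration.
rng = np.random.default_rng(42)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_p = np.linspace(0.5, 5.5, 5).reshape(-1, 1)

mean, lower, upper = bootstrap_prediction_intervals(X, y, X_p, n_resamples=200)
```

Dropping the `eps_star` term leaves only the spread of the refitted predictions, so comparing the two interval widths is one way to see what step 4 contributes. In-sample residuals from a flexible model tend to be too small, so this sketch probably understates the interval width.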

So my question is: why is step 4 necessary? My intuition is that it has to do with the fact that I'm computing prediction intervals rather than confidence intervals, but I haven't found a good resource on this.


Sources: Slides 12-15

Bootstrap prediction interval (This would be perfect if I happened to be using OLS, but I'm not...)
