I am using recursive feature elimination based on random forest variable importance to select a minimal subset of predictors needed to model a continuous response.
My first priority is to obtain a robust ranking of variable importance along with an estimate of model performance (RMSE). I therefore embed the recursive feature elimination in an outer cross-validation scheme: I resample the training data 10 times into 80:20 training:testing splits (repeated k-fold CV with k = 5 and 2 repeats).
Now I'm struggling to understand the correct workflow. Which of the following is right?
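To make the outer scheme concrete: with k = 5 and 2 repeats, each observation is held out once per repeat, giving 10 resamples, each an 80:20 split. A minimal stdlib sketch of that resampling logic (schematic only, not caret's actual implementation; the function name is mine):

```python
import random

def repeated_kfold_indices(n, k=5, repeats=2, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated k-fold CV.

    With k=5 and repeats=2 this produces 10 resamples, each an
    80:20 train:test split of the n observations.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                      # new random fold assignment per repeat
        folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
        for held_out in range(k):
            test = folds[held_out]
            train = [i for f, fold in enumerate(folds) if f != held_out
                     for i in fold]
            yield train, test

resamples = list(repeated_kfold_indices(n=100))
print(len(resamples))        # 10 resamples
print(len(resamples[0][0]))  # 80 training observations in each
```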
(1) For every resample, fit the full model using all predictors and assess variable importance. Then average these importance values across resamples to obtain a single (and hopefully robust) importance score per variable. Use these averaged scores to reduce the number of predictors, then refit the reduced model on all resamples.
(2) Perform RFE separately within each resample. That is, for each resample: fit the full model, assess variable importance, drop the least important predictor(s), refit the model with the reduced predictor set, and repeat. Once the RFE is complete for every resample, average the results across resamples.
(3) Something else entirely.
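To spell out what I mean by option (2) on a single resample, here is a toy sketch. The model fit is faked by a hypothetical importance function; real code would refit a random forest on the resample's training data at every step:

```python
def rfe_one_resample(features, importance_fn, n_keep=1):
    """Workflow (2) on one resample: repeatedly fit, rank, and drop
    the least important feature until n_keep remain.
    Returns the elimination order, worst-first (i.e. a ranking)."""
    current = list(features)
    eliminated = []
    while len(current) > n_keep:
        scores = importance_fn(current)       # stand-in for: refit model, get importances
        worst = min(current, key=scores.get)  # least important surviving feature
        current.remove(worst)                 # drop it before the next refit
        eliminated.append(worst)
    return eliminated + current               # ordered worst -> best

# Toy importances standing in for a fitted random forest's scores:
toy_importance = {"x1": 0.9, "x2": 0.5, "x3": 0.1, "x4": 0.7}
ranking = rfe_one_resample(
    list(toy_importance),
    lambda feats: {f: toy_importance[f] for f in feats},
)
print(ranking)  # ['x3', 'x2', 'x4', 'x1'], worst to best
```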
From the verbose output of caret::rfe, it seems to me that (2) is what is implemented. But then, how are the results from the resamples averaged? I expect a slightly different subset of predictors to be chosen at each step in the different resamples. How, then, can results be averaged across resamples to obtain a single (and hopefully robust) ranking of variables at the end?
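For what it's worth, the only aggregation I can imagine working is something rank-based, e.g. averaging each variable's elimination position (or its selection frequency at a given subset size) across resamples. I don't know whether caret does exactly this; a toy sketch of what I have in mind:

```python
from collections import defaultdict

def average_ranks(per_resample_rankings):
    """Each ranking lists features worst -> best; a feature's rank is its
    position (0 = eliminated first). Averaging positions across resamples
    gives a single consensus ordering, even though individual resamples
    may disagree on the exact subset retained."""
    positions = defaultdict(list)
    for ranking in per_resample_rankings:
        for pos, feat in enumerate(ranking):
            positions[feat].append(pos)
    avg = {f: sum(p) / len(p) for f, p in positions.items()}
    # best first: highest average position means eliminated latest
    return sorted(avg, key=avg.get, reverse=True)

# Three resamples that disagree slightly, as I expect in practice:
rankings = [["x3", "x2", "x1"], ["x2", "x3", "x1"], ["x3", "x1", "x2"]]
print(average_ranks(rankings))  # ['x1', 'x2', 'x3']
```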