7.3: Boruta
Boruta addresses the limitations of the feature importance scores produced by tree-based models. It adds a randomization step on top of the variable importances obtained from random forests in order to identify the features that are genuinely, statistically significantly important.
A copy of the original dataset is created in which the values of each feature are randomly shuffled. These shuffled copies are called 'shadow features'. The original features and the shadow features are merged into a single dataset, and a random forest is trained on it. With the help of the z score of each feature's importance, a hypothesis test checks whether the feature carries significantly more information than the best-performing shadow feature. If found insignificant, the feature is dropped. The remaining features go through the same process iteratively, for as many rounds as the analyst specifies. The process stops when every feature has been either rejected or confirmed by the Boruta algorithm, or when the specified number of iterations is complete.
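As a rough illustration of the shadow-feature idea, the sketch below performs a single Boruta-style iteration with scikit-learn: it shuffles a copy of every column to build shadow features, trains a random forest on the merged data, and flags each real feature that outscores the best shadow feature as a "hit". This is a simplified, hypothetical sketch, not the full algorithm, which repeats this comparison over many iterations and applies a statistical test to the accumulated hit counts; in practice one would use a maintained implementation such as the BorutaPy package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Toy data: 5 informative features followed by 5 noise features.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=42)

# Shadow features: a copy of the data with every column shuffled
# independently, destroying any relationship with the target.
X_shadow = X.copy()
for j in range(X_shadow.shape[1]):
    rng.shuffle(X_shadow[:, j])  # shuffles the column view in place

# Train a random forest on the merged (original + shadow) dataset.
X_merged = np.hstack([X, X_shadow])
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X_merged, y)

n = X.shape[1]
real_imp = forest.feature_importances_[:n]
shadow_imp = forest.feature_importances_[n:]

# A real feature counts as a "hit" in this iteration if its importance
# exceeds that of the best shadow feature.
threshold = shadow_imp.max()
hits = real_imp > threshold
print("Hits this iteration:", hits)
```

Repeating this over many iterations and testing the hit counts (the full algorithm uses z scores of the importances) yields the confirmed/rejected decision for each feature.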