7.3: Boruta

Boruta addresses the limitations of the feature-importance method in tree-based models. It adds a layer of randomization on top of the variable-importance scores obtained from random forests in order to determine which features are truly important in a statistically valid way.

A copy of the original dataset is created in which the values of each feature are randomly shuffled. These shuffled copies are called 'shadow features'. The shadow features are appended to the original dataset, and a random forest is trained on the combined data. Using the z score of each feature's importance, a hypothesis test checks whether the feature carries significantly more information than its shadow counterparts. Features found insignificant are dropped. The remaining features go through the same process iteratively, for a number of runs specified by the analyst. The process stops when every feature has been either confirmed or rejected by the Boruta algorithm, or when the specified number of iterations is complete.
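The loop described above can be sketched in a simplified form. The function name `boruta_style_selection` is hypothetical, and this sketch replaces the z-score hypothesis test of full Boruta with the simpler criterion of counting how often a real feature outscores the best shadow feature; production use should rely on a complete implementation such as the BorutaPy package.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data: the first 5 columns are informative, the last 5 are noise.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

def boruta_style_selection(X, y, n_iter=20, random_state=0):
    """Simplified Boruta-style loop (hypothetical helper): count how often
    each real feature beats the best shadow feature, keep frequent winners."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_iter):
        # Shadow features: column-wise shuffled copies of the originals.
        shadow = np.apply_along_axis(rng.permutation, 0, X)
        combined = np.hstack([X, shadow])
        forest = RandomForestClassifier(n_estimators=100, random_state=0)
        forest.fit(combined, y)
        imp = forest.feature_importances_
        real_imp, shadow_imp = imp[:n_features], imp[n_features:]
        # A "hit" when a real feature outscores the best shadow feature.
        hits += real_imp > shadow_imp.max()
    # Keep features that win in most iterations; full Boruta replaces
    # this fixed cutoff with a statistical (binomial) test on the hits.
    return np.where(hits > n_iter / 2)[0]

selected = boruta_style_selection(X, y)
print("Selected feature indices:", selected)
```

Because the shadow features are shuffled copies, they carry no real signal; any real feature that consistently scores above the best shadow is therefore unlikely to be important by chance alone.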