7.2: Feature Importance of Tree Models

Tree-based models come in two flavors. The first is bootstrap-aggregated (bagged) models, which consist of individual decision trees trained independently; the final prediction is the average of the predictions from all the trees. Random forests are a bootstrap aggregation model. The second is boosting models, which are a sequence of connected trees in which each tree corrects the errors of the previous ones. XGBoost is an example of a boosting model.
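To make the two families concrete, here is a minimal sketch that fits one model of each kind on synthetic data. It assumes scikit-learn and the xgboost package are installed; the dataset and hyperparameters are arbitrary illustrations.

```python
# A minimal sketch of both ensemble families, assuming scikit-learn
# and the xgboost package are installed.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: each tree is trained on a bootstrap sample of the data;
# the forest averages the trees' predictions.
bagged = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting: trees are trained sequentially, each one fitting the
# errors of the ensemble built so far.
boosted = XGBClassifier(n_estimators=100, random_state=0).fit(X, y)
```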

A single decision tree is easy to interpret by looking at its tree diagram. From the visualization we can identify the most and least important features, keep the former, and discard the rest. Tree models also have a built-in form of feature selection: at each node, the tree splits on the feature that best separates the data. But this isn't foolproof and often leads to overfitting.
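As a small sketch of single-tree interpretation, continuing with the synthetic data above, scikit-learn's `export_text` prints the fitted tree so we can see which features the top splits use (a text dump here stands in for the tree diagram).

```python
# Sketch of inspecting a single fitted decision tree,
# reusing X and y from the previous snippet.
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the tree as indented text; features appearing near the root
# are typically the most informative splits.
print(export_text(tree, feature_names=[f"f{i}" for i in range(X.shape[1])]))
```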

For bagging and boosting models, feature importance can be calculated for each feature individually. Within each decision tree, a feature's importance is the sum of the impurity decreases at every node that splits on it, each weighted by the number of examples reaching that node. These weighted impurity decreases are then averaged across all the trees, and the per-feature averages are normalized so that they sum to 1. The feature importance obtained this way can be used for feature selection.
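Continuing the earlier sketch, scikit-learn exposes this impurity-based calculation as the fitted model's `feature_importances_` attribute; the selection threshold below is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np

# Impurity-based importances: within each tree, a feature's splits
# contribute their impurity decrease weighted by the samples at the node;
# scikit-learn averages across trees and normalizes the result to sum to 1.
importances = bagged.feature_importances_
assert np.isclose(importances.sum(), 1.0)

# Rank features and keep those above the mean importance
# (the mean is an arbitrary illustrative threshold).
ranked = np.argsort(importances)[::-1]
keep = ranked[importances[ranked] > importances.mean()]
X_selected = X[:, keep]
```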

There are a few caveats to using feature importance in tree-based models for feature selection.

1) Feature importance is always relative to the feature set used; it does not tell us anything about the statistical dependence between the target and the features.

2) Feature importance does not take co-dependence among features into account. When features are correlated, the importance may be split among them, so the reported values might not reflect each feature's true contribution.

3) Feature importance cannot be interpreted as a direct dependence between a predictor and the target.

4) If the model is weak and does not generalize well to test or validation data, its feature importance becomes less reliable for feature selection (see the sketch after this list).
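As a rough illustration of the fourth caveat, one might check holdout performance before relying on the importances; the accuracy cutoff below is an arbitrary placeholder, not a standard.

```python
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Only trust the importances if the model itself generalizes reasonably;
# 0.8 is an arbitrary illustrative threshold.
if model.score(X_te, y_te) > 0.8:
    print(model.feature_importances_)
else:
    print("Model generalizes poorly; importances are unreliable.")
```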

Despite these caveats, the low computational cost of tree models can make their built-in feature importance a convenient method for feature selection.