7.2: Feature Importance of Tree Models
Tree-based models come in two broad types. The first is bootstrap-aggregated (bagged) models, which consist of individual decision trees trained independently; the final prediction is the average of the predictions from all the individual trees. Random forests are a bootstrap-aggregation model. The second is boosting models, which are a series of connected trees in which each tree corrects the errors of the previous ones. An example of a boosting model is XGBoost.
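To make the two types concrete, here is a minimal sketch that fits one model of each kind on a synthetic dataset; it assumes scikit-learn and the xgboost package are installed, and the dataset and hyperparameters are placeholders chosen only for illustration:

```python
# A minimal sketch: one bagging model and one boosting model.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Bagging: trees are trained independently on bootstrap samples,
# and their predictions are averaged.
bagged = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Boosting: trees are trained sequentially, each one correcting the
# errors left by the trees before it.
boosted = XGBClassifier(n_estimators=100, random_state=0).fit(X, y)
```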
A single decision tree is easy to interpret by looking at its tree diagram. The visualization lets us identify the most and least important features, so we can keep the most important ones and discard the rest. Tree models also perform a form of built-in feature selection: at every node, the best feature to split on is chosen while the tree is built. But this built-in selection isn't foolproof and often leads to overfitting.
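As an illustration of reading a single tree, the sketch below fits a small decision tree on the Iris data and prints its splits as text; the dataset and depth limit are arbitrary choices for the example:

```python
# A minimal sketch of inspecting a single decision tree.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(data.data, data.target)

# Print the tree as indented text; each split shows the feature and
# threshold chosen, which hints at which features matter most.
print(export_text(tree, feature_names=list(data.feature_names)))
```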
For bagging and boosting models, feature importance can be calculated for each feature individually. Within each decision tree, importance is measured as the impurity decrease at each node, weighted by the number of examples that reach that node. These weighted impurity decreases are then averaged per feature across all the individual trees, and the per-feature averages are normalized so that they sum to 1. The feature importance obtained this way can be used for feature selection.
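As a sketch of this workflow, scikit-learn exposes this impurity-based calculation through the `feature_importances_` attribute, which can be fed to a selector; the mean-importance threshold below is one arbitrary choice among many:

```python
# A minimal sketch: impurity-based importances from a random forest,
# used to keep only features above the mean importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=3, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Importances are normalized across features and sum to 1.
print(forest.feature_importances_)
print(forest.feature_importances_.sum())  # ~1.0

# Keep features whose importance exceeds the mean; "mean" is a
# placeholder threshold, not a recommendation.
selector = SelectFromModel(forest, threshold="mean", prefit=True)
X_selected = selector.transform(X)
print(X_selected.shape)
```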
There are a few caveats to using feature importance in tree-based models for feature selection.
1) Feature importance is always relative to the feature set used and does not tell us anything about the statistical dependence between the target and the features.
2) Feature importance does not take co-dependence among features into account. Correlated features can split the credit between them, so the scores might not reflect the true extent of each feature's importance.
3) Feature importance cannot be interpreted as a direct dependence between a predictor and the target.
4) If the model is weak and cannot generalize well on test or validation data, then its feature importance becomes less reliable to use for feature selection (a sanity-check sketch follows this list).
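A minimal sanity check along the lines of caveat 4, assuming scikit-learn; the 0.7 accuracy bar is a hypothetical placeholder, not a recommended value:

```python
# A minimal sketch: check generalization before trusting importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

# Estimate out-of-sample performance with cross-validation.
score = cross_val_score(forest, X, y, cv=5).mean()

if score > 0.7:  # placeholder bar for "generalizes well enough"
    importances = forest.fit(X, y).feature_importances_
    print(importances)
else:
    print("Model too weak; importances may be unreliable.")
```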
Despite all the caveats, the small amount of computation required could make tree-model feature importance a convenient feature-selection method when the final model is also tree-based.