6.4: Wrapper Method
This method trains the model on subsets of features drawn from the original feature set. It is a greedy search: rather than exhaustively evaluating every possible combination of features, it iteratively adds or removes features based on inferences drawn from a model trained on the current subset. Each candidate subset is evaluated against a specified model metric, such as the F1 score for classification or R-squared for regression, and the method returns the feature set that gives the best results. Wrapper methods are computationally more expensive than filter methods and take more time to perform feature selection. Even after this search, the results might not always be desirable, as the wrapper method often leads to overfitting.
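To make the core wrapper step concrete, the sketch below scores one candidate feature subset by training a model on just those columns and measuring a cross-validated metric; a wrapper method repeats this over many candidate subsets and keeps the best one. The dataset, the logistic regression model, and the particular subsets scored are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def score_subset(cols):
    """Evaluate one candidate subset of feature indices: train the model
    on just those columns and return the mean cross-validated F1 score."""
    model = LogisticRegression(max_iter=5000)
    return cross_val_score(model, X[:, cols], y, scoring="f1").mean()

# A wrapper method compares many such subset scores (subsets here are arbitrary).
print(score_subset([0, 1, 2]), score_subset([0, 3, 7]))
```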
Four methods fall under the wrapper approach: forward selection, backward selection, stepwise selection, and recursive feature elimination.
6.4.1 Forward Selection
Forward selection starts with an empty set of features. First, a separate model is fit for each individual feature, and the single feature giving the highest R-squared is selected. In subsequent iterations, features are added to the model one at a time, as long as R-squared keeps increasing and the F-statistic remains significant.
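The following is a minimal sketch of this procedure, assuming statsmodels, a scikit-learn example dataset, and an illustrative stopping rule (an R-squared improvement of at least 1e-4); the F-statistic check described above could be added via the fitted model's f_pvalue attribute.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

selected, remaining = [], list(X.columns)
best_r2 = 0.0
while remaining:
    # Fit one model per candidate feature added to the current set.
    r2 = {f: sm.OLS(y, sm.add_constant(X[selected + [f]])).fit().rsquared
          for f in remaining}
    best = max(r2, key=r2.get)
    # Stop once R-squared no longer improves (threshold is an assumption).
    if r2[best] <= best_r2 + 1e-4:
        break
    selected.append(best)
    remaining.remove(best)
    best_r2 = r2[best]

print("Selected features:", selected)
```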
6.4.2 Backward Selection
This method starts with all the features. Features whose contribution to R-squared is not statistically significant, as judged by the F-statistic, are removed one at a time.
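scikit-learn's SequentialFeatureSelector implements this greedy backward search; note that it scores candidates by a cross-validated metric (here R-squared) rather than F-statistic significance, and the target of five features is an illustrative assumption.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# Start from all 10 features and greedily drop the one whose removal
# hurts the cross-validated R-squared the least, until 5 remain.
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=5, direction="backward", scoring="r2"
)
sfs.fit(X, y)
print("Kept feature mask:", sfs.get_support())
```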
6.4.3 Stepwise Selection
Stepwise feature selection is done by creating a regression model and ranking features based on their p-values. Features with a p-value above 0.05 are dropped, but a removed feature can still re-enter the model at a later step. The process continues until a convergence criterion is met. However, it can lead to overfitting and an increase in false positives.
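A minimal sketch of p-value-driven stepwise selection with statsmodels is shown below. A slightly looser removal threshold (0.10) than the entry threshold (0.05) is used to keep the loop from oscillating, a common convention; the threshold values, dataset, and function name are illustrative assumptions.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_diabetes

def stepwise_select(X, y, p_enter=0.05, p_remove=0.10):
    """Illustrative p-value-driven stepwise selection on a DataFrame X."""
    selected = []
    while True:
        changed = False
        # Forward step: add the remaining feature with the lowest p-value,
        # provided it clears the entry threshold.
        remaining = [c for c in X.columns if c not in selected]
        entry = pd.Series(
            {f: sm.OLS(y, sm.add_constant(X[selected + [f]])).fit().pvalues[f]
             for f in remaining},
            dtype=float,
        )
        if not entry.empty and entry.min() < p_enter:
            selected.append(entry.idxmin())
            changed = True
        # Backward step: drop the worst selected feature if its p-value
        # exceeds the removal threshold; dropped features may re-enter later.
        if selected:
            pvals = sm.OLS(y, sm.add_constant(X[selected])).fit().pvalues.drop("const")
            if pvals.max() > p_remove:
                selected.remove(pvals.idxmax())
                changed = True
        if not changed:
            return selected

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
print(stepwise_select(X, data.target))
```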
6.4.4 Recursive Feature Elimination
Recursive feature elimination (RFE) is an iterative procedure and an instance of backward feature elimination. The first model is trained on all the features, and features are then removed one by one based on a scoring function, typically the magnitude of the model's coefficients or feature importances. This is repeated until the desired number of features remains. RFE is among the most commonly used wrapper methods.
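A minimal sketch using scikit-learn's RFE with a logistic regression; the choice of estimator, the example dataset, and the target of ten features are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Fit on all 30 features, then repeatedly drop the feature with the
# smallest coefficient magnitude until 10 remain.
rfe = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10, step=1)
rfe.fit(X, y)
print("Kept feature mask:", rfe.support_)
print("Ranking (1 = kept):", rfe.ranking_)
```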
Now let's look at different datasets and the impact of each method on model performance.