6.3: Filter Method
The filter method relies on the statistical characteristics of the data itself. It is independent of the machine learning technique and is applied before training the model, so it requires little computational power and is usually fast. Its drawback is that it can fail to remove features that are not actually useful for the modeling problem. For this reason, it is often used as a preliminary feature selection step: the reduced set of features obtained from the filter method is then passed on to more sophisticated feature selection methods. This approach is particularly useful when the number of features is very high and computing power is limited, as the advanced methods can then work on the reduced set to give superior results.
The filter method mostly uses the correlation coefficient and hypothesis-testing techniques such as ANOVA, the t-test, and the chi-square test to identify the relationship between the dependent variable and the independent variables. Features that show no statistically significant relationship with the dependent variable are discarded before modeling.
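As a minimal sketch of this workflow, the snippet below uses scikit-learn's SelectKBest with an ANOVA F-test to keep the features most strongly related to a classification target. The data here is synthetic and the choice of k=5 is an illustrative assumption, not a recommendation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: 200 samples, 20 features, only a few informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=4, random_state=0)

# Score every feature against the target with an ANOVA F-test
# and keep the 5 highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print("Kept feature indices:", selector.get_support(indices=True))
print("Reduced shape:", X_reduced.shape)  # (200, 5)
```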
If the modeling problem is a classification problem and the feature is categorical, we can use the chi-square test to check whether the distribution of the feature's categories differs significantly across the target classes. If the p-value is above 0.05, the distribution of categories across classes is not statistically significant, which suggests a weak predictor, and we can remove the categorical feature.
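The sketch below shows how such a test might look in practice, assuming a pandas DataFrame with a categorical feature column "region" and a binary target column "churn"; both names and the data are purely illustrative.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Illustrative data: categorical feature vs. binary target.
df = pd.DataFrame({
    "region": ["north", "south", "east", "south", "north",
               "east", "west", "west", "north", "south"],
    "churn":  [1, 0, 0, 1, 1, 0, 0, 1, 0, 1],
})

# Contingency table of feature categories against target classes.
table = pd.crosstab(df["region"], df["churn"])

# Chi-square test of independence.
chi2, p_value, dof, expected = chi2_contingency(table)

# Keep the feature only if the association is significant at the 5% level.
if p_value > 0.05:
    print(f"p = {p_value:.3f}: drop 'region' (no significant association)")
else:
    print(f"p = {p_value:.3f}: keep 'region'")
```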
For classification problems and numerical features, we can perform ANOVA. A binary dependent variable taking values 1 or 0 can be treated as two groups, and we can calculate the mean of the numerical feature within each group. ANOVA then tests the hypothesis that the difference between these group means is statistically significant. If it is not, we can remove the numerical feature. ANOVA can also be applied the other way around, to a numeric dependent variable and categorical features, in other words, to regression models with categorical features.
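A minimal sketch of this test, assuming a numerical feature "income" and a binary target "churn" in a pandas DataFrame (both names and the values are hypothetical):

```python
import pandas as pd
from scipy.stats import f_oneway

# Illustrative data: numerical feature vs. binary target.
df = pd.DataFrame({
    "income": [42.0, 55.5, 31.2, 60.1, 48.3, 39.9, 52.7, 35.4],
    "churn":  [0,    1,    0,    1,    1,    0,    1,    0],
})

# Split the numerical feature by target class.
group_0 = df.loc[df["churn"] == 0, "income"]
group_1 = df.loc[df["churn"] == 1, "income"]

# One-way ANOVA: are the class means significantly different?
f_stat, p_value = f_oneway(group_0, group_1)

if p_value > 0.05:
    print(f"p = {p_value:.3f}: drop 'income' (class means not significantly different)")
else:
    print(f"p = {p_value:.3f}: keep 'income'")
```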
For regression problems and numerical features, we can use correlation to ascertain the strength of a feature. Although correlation is not causation, this technique is still useful for high-dimensional datasets. One caveat is deciding which correlation threshold to use for accepting or rejecting a feature; although the choice is subjective, a relatively high threshold such as 0.5 and above is a reasonable starting point. For linear regression with multiple features, using correlation as a feature selection method has its limitations. The biggest risk is losing out on potential interaction effects, because correlation only describes the relationship between a single feature and the dependent variable in isolation, without accounting for any other feature. The same can be said for the chi-square test and logistic regression. In such situations, regularization techniques such as Lasso are a sounder choice for feature selection in linear regression.
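As a rough sketch, the snippet below first filters numerical features by the absolute value of their Pearson correlation with the target, using the 0.5 threshold mentioned above, and then contrasts this with a Lasso fit that shrinks the coefficients of weak features toward zero. The column names, data, and threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Illustrative regression data: 'sqft' drives the target, 'noise' does not.
n = 200
df = pd.DataFrame({
    "sqft":  rng.normal(1500, 300, n),
    "noise": rng.normal(0, 1, n),
})
df["price"] = 200 * df["sqft"] + rng.normal(0, 10_000, n)

# Filter step: keep features whose absolute Pearson correlation
# with the target meets the 0.5 threshold.
correlations = df.drop(columns="price").corrwith(df["price"])
kept = correlations[correlations.abs() >= 0.5].index.tolist()
print("Correlation with price:\n", correlations.round(2))
print("Kept by correlation filter:", kept)

# Regularization alternative: Lasso shrinks the coefficients of
# uninformative features toward zero as part of fitting the model itself.
X, y = df.drop(columns="price"), df["price"]
lasso = LassoCV(cv=5).fit(X, y)
print("Lasso coefficients:", dict(zip(X.columns, lasso.coef_.round(2))))
```

Unlike the correlation filter, the Lasso step considers all features jointly while fitting the regression, which is why it is less prone to discarding features that only matter in combination with others.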