6.3: Filter Method

The filter method uses the characteristics of the data itself and is independent of the machine learning technique, since the selection happens before the model is trained. It requires little computational power and is usually fast. Its main drawback is that it can fail to remove features that are not actually useful for the modeling problem. For this reason, it is often used as a preliminary feature selection step: the reduced set of features obtained from the filter method is then passed to more sophisticated feature selection methods. This approach is particularly useful when the number of features is very high and computing power is limited, as the reduced feature set allows the more advanced methods to produce superior results.

The filter method mostly relies on the correlation coefficient and hypothesis-testing techniques such as ANOVA, the t-test, and the chi-square test to identify the relationship between the dependent variable and the independent variables. Features that show no statistically significant relationship with the dependent variable are discarded before modeling.

If the modeling problem is classification and the feature is categorical, we can use the chi-square test to check whether the distribution of the feature's categories differs significantly across classes. If the p-value is above 0.05, the distribution of categories across classes is not significantly different, which suggests a weak predictor, and the categorical feature can be removed.
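As a minimal sketch, the chi-square filter could look like the following, assuming a pandas DataFrame with a hypothetical categorical column payment_type and a binary target column; scipy.stats.chi2_contingency is applied to the contingency table of feature categories versus classes.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: a categorical feature and a binary class label.
df = pd.DataFrame({
    "payment_type": ["card", "cash", "card", "cash", "card", "card", "cash", "cash"],
    "target":       [1,      0,      1,      0,      1,      0,      0,      0],
})

# Contingency table of feature categories vs. class labels.
contingency = pd.crosstab(df["payment_type"], df["target"])

# Chi-square test of independence between the feature and the target.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Keep the feature only if the association is significant at the 5% level.
keep_feature = p_value <= 0.05
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, keep={keep_feature}")
```

In practice this check would be repeated for every categorical feature, keeping only those whose p-value clears the chosen significance level.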

For classification problems and numerical features, we can perform ANOVA. A binary dependent variable takes the value 1 or 0 and can be treated as two groups. We then calculate the mean of the numerical feature within each class and use ANOVA to test whether the difference in these class means is statistically significant. If it is not, the numerical feature can be removed. ANOVA can also be applied to a numeric dependent variable and categorical features, in other words, to regression problems with categorical features.
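The sketch below illustrates this with scipy.stats.f_oneway, assuming a hypothetical numerical column age and a binary target; the feature values are split by class and a one-way ANOVA F-test is applied.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical data: a numerical feature and a binary class label.
df = pd.DataFrame({
    "age":    [25, 47, 52, 31, 60, 29, 44, 58],
    "target": [0,  1,  1,  0,  1,  0,  0,  1],
})

# Split the feature values by class and run a one-way ANOVA F-test.
groups = [grp["age"].values for _, grp in df.groupby("target")]
f_stat, p_value = f_oneway(*groups)

# Drop the feature if the difference in class means is not significant.
keep_feature = p_value <= 0.05
print(f"F={f_stat:.3f}, p-value={p_value:.3f}, keep={keep_feature}")
```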

For regression problems and numerical features, we can use correlation to gauge the strength of a feature. Although correlation is not causation, the technique is still practical for high-dimensional datasets. One caveat is deciding the correlation threshold for accepting or rejecting a feature as useful; although the choice is subjective, a relatively high threshold such as 0.5 or above is a reasonable starting point. For linear regression with multiple features, using correlation for feature selection has its limitations. The biggest risk is losing potential interaction effects, because correlation only measures the relationship between a single feature and the dependent variable in isolation, without accounting for any other feature. The same can be said for the chi-square test and logistic regression. In such situations, regularization techniques such as Lasso are a sounder choice for feature selection in linear regression.
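A minimal sketch of correlation-based filtering, assuming hypothetical feature and target columns and the 0.5 threshold discussed above, could look like this:

```python
import pandas as pd

# Hypothetical regression dataset with several numerical features.
df = pd.DataFrame({
    "sqft":      [850, 1200, 1500, 900, 2000, 1750],
    "bedrooms":  [2,   3,    3,    2,   4,    3],
    "lot_noise": [7,   3,    9,    1,   5,    2],   # an irrelevant feature
    "price":     [150, 220,  280,  160, 360,  310],
})

features = ["sqft", "bedrooms", "lot_noise"]
threshold = 0.5  # subjective cut-off, as noted above

# Absolute Pearson correlation of each feature with the target.
correlations = df[features].corrwith(df["price"]).abs()

# Keep only features whose correlation magnitude clears the threshold.
selected = correlations[correlations >= threshold].index.tolist()
print(correlations.round(3).to_dict())
print("selected features:", selected)
```

If interaction effects are a concern, a regularized model such as sklearn.linear_model.Lasso can instead be fit on the full feature set, and features whose coefficients are shrunk to zero can be dropped.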