4.4: Conclusion
Given all the types of encoding and transformation discussed, a big question
arises: which encoding should we use for which feature? There are two ways to
approach this. The first method is to choose the correct transformation for
each feature based on the nature of the data, aided by domain knowledge.
The second method follows the
principle of doing the least harm. We try every encoding and transformation
that is possible for a feature, based on its type. The only exceptions are
cases where a transformation might cause harm, i.e., where it is not suitable
for the feature type or for the modeling technique. For example, we cannot take
a higher order feature engineering technique meant for numerical features and
apply it to categorical features.
We should also consider whether
a higher order feature engineering technique suits the modeling technique being
used. For example, we should not feed label-encoded nominal features to linear
models, because the model would treat the arbitrary integer labels as an
ordered numeric quantity.
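
The point about label encoding and linear models can be seen in a small sketch. The data below is a hypothetical toy example (three categories whose target means do not follow the label order), and the helper `fit_and_predict` is an illustrative name, not something from this chapter. It fits ordinary least squares with NumPy on the same feature encoded two ways.

```python
import numpy as np

# Hypothetical toy data: a nominal feature with three categories whose
# target means (10, 30, 20) do not follow the arbitrary label order 0, 1, 2.
categories = np.array([0, 1, 2, 0, 1, 2])
target = np.array([10.0, 30.0, 20.0, 10.0, 30.0, 20.0])


def fit_and_predict(features, y):
    """Fit ordinary least squares with an intercept and return fitted values."""
    design = np.column_stack([np.ones(len(y)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return design @ coef


# Label encoding: one numeric column, so the model can only fit a single slope.
label_pred = fit_and_predict(categories.reshape(-1, 1), target)

# One-hot encoding: one indicator column per category. The first column is
# dropped because the intercept already represents that category.
one_hot = np.eye(3)[categories][:, 1:]
one_hot_pred = fit_and_predict(one_hot, target)

label_error = np.abs(label_pred - target).max()
one_hot_error = np.abs(one_hot_pred - target).max()
print("max error, label-encoded:", label_error)
print("max error, one-hot:", one_hot_error)
```

With the label-encoded column the linear model is forced onto a straight line through the three labels and misses the middle category, while the one-hot version can give each category its own level.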
We can create multiple higher order features from an original
feature. If the new features exceed our computational capacity, we can
select a few of them from the list of higher order features. For this, we can
use techniques such as correlation and hypothesis tests, namely the
F-test and the Chi-square test. In some cases, we might still end up with more
than one type of higher order representation for an original feature. In such
situations, we can keep them all, as they can help improve model performance.
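
As a rough sketch of this selection step, the example below builds several higher order representations of one numerical feature and keeps the top few by absolute Pearson correlation with the target. The data, the candidate names, and the cutoff `top_k` are all assumptions made for illustration. Correlation is used here only because it needs nothing beyond NumPy; an F-test or Chi-square score could be substituted as the ranking criterion.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: the target depends on the log of x, plus small noise.
x = rng.uniform(1.0, 10.0, size=200)
y = 3.0 * np.log(x) + rng.normal(0.0, 0.1, size=200)

# Several higher order representations of the same original feature.
candidates = {
    "x": x,
    "x_squared": x ** 2,
    "log_x": np.log(x),
    "sqrt_x": np.sqrt(x),
}

# Score each candidate by its absolute Pearson correlation with the target.
scores = {
    name: abs(np.corrcoef(values, y)[0, 1])
    for name, values in candidates.items()
}

# Keep only the top_k candidates (an assumed budget of two features).
top_k = 2
selected = sorted(scores, key=scores.get, reverse=True)[:top_k]

for name in selected:
    print(name, round(scores[name], 3))
```

Note that more than one representation of the same original feature survives the cut, which matches the situation described above where several higher order forms are kept together.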