Chapter 10: Feature Engineering & Selection for Text Classification
10.1: Introduction
For text classification, the labels can be binary, multi-class, or
multi-label. Each label is mapped to a text corpus. Each of these corpora is
used to construct different types of features, which a feature extraction
process then uses to convert text documents into a machine-readable format.
To improve the efficiency of classifiers, reduce model complexity, and improve
explainability, we can apply feature selection before the feature extraction
step. To reduce model complexity even further, we can perform feature
reduction after feature extraction. Figure 10.1 shows this step-wise process,
which is carried out before training machine learning classifiers.
Fig 10.1: A step-wise process for converting a text corpus
into a format readable by machine learning models.
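The three label settings named at the start of this section differ only in
how many labels a single document may carry. The snippet below is a minimal
sketch of how each could be represented; the toy labels and the helper
to_multi_hot are hypothetical names chosen for illustration, not part of any
library.

```python
# Hypothetical toy labels for four documents (illustrative values only).
binary_labels = [1, 0, 0, 1]  # one of two classes, e.g. spam vs. not spam
multiclass_labels = ["sports", "politics", "tech", "sports"]  # exactly one class each
multilabel_labels = [{"sports", "politics"}, {"tech"}, set(), {"sports", "tech"}]  # zero or more


def to_multi_hot(label_sets):
    """Encode multi-label targets as one 0/1 column per class."""
    classes = sorted(set().union(*label_sets))
    rows = [[1 if c in labels else 0 for c in classes] for labels in label_sets]
    return classes, rows


classes, multi_hot = to_multi_hot(multilabel_labels)
```

In the multi-label case each document becomes a row of independent 0/1
indicators, so a document with no labels is simply a row of zeros.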
Feature construction and feature extraction are the bare minimum needed to
train machine learning or deep learning classifiers for text classification.
Feature selection and feature reduction are optional steps that can improve
performance over using the constructed and extracted features alone.
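The four stages described in this section can be sketched in plain Python.
This is a toy illustration under stated assumptions, not a recommended
implementation: the function names, the example documents, the
document-frequency threshold used for selection, and the random projection
used for reduction are all choices made here for illustration, and later
sections may use different techniques for each stage.

```python
import random
from collections import Counter


def construct_features(doc):
    """Feature construction: lowercase unigrams plus adjacent-word bigrams."""
    tokens = doc.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def select_features(feature_lists, min_df=2):
    """Feature selection: keep features seen in at least min_df documents."""
    doc_freq = Counter(f for feats in feature_lists for f in set(feats))
    return sorted(f for f, n in doc_freq.items() if n >= min_df)


def extract_features(feature_lists, vocabulary):
    """Feature extraction: map each document to a vector of feature counts."""
    index = {f: i for i, f in enumerate(vocabulary)}
    vectors = []
    for feats in feature_lists:
        row = [0] * len(vocabulary)
        for f in feats:
            if f in index:
                row[index[f]] += 1
        vectors.append(row)
    return vectors


def reduce_features(vectors, n_components=2, seed=0):
    """Feature reduction: project each vector onto random Gaussian directions."""
    rng = random.Random(seed)
    n_features = len(vectors[0])
    projection = [[rng.gauss(0, 1) for _ in range(n_components)]
                  for _ in range(n_features)]
    return [[sum(x * projection[i][j] for i, x in enumerate(row))
             for j in range(n_components)]
            for row in vectors]


docs = [
    "the match was a great game",
    "a great game for the team",
    "the election was close",
]

constructed = [construct_features(d) for d in docs]

# Bare minimum: construction, then extraction over every constructed feature.
full_vocabulary = sorted({f for feats in constructed for f in feats})
baseline = extract_features(constructed, full_vocabulary)

# Optional stages: selection before extraction, reduction after it.
vocabulary = select_features(constructed, min_df=2)
vectors = extract_features(constructed, vocabulary)
reduced = reduce_features(vectors, n_components=2)
```

On this toy corpus, selection shrinks the vocabulary from every constructed
unigram and bigram to the handful shared by at least two documents, and
reduction then maps each count vector to two dimensions. Either `baseline`,
`vectors`, or `reduced` could be passed to a classifier, which is the sense in
which only construction and extraction are strictly required.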