Chapter 10: Feature Engineering & Selection for Text Classification

10.1: Introduction

In text classification, labels can be binary, multi-class, or multi-label. Each label is associated with a corpus of text documents. These corpora are used to construct different types of features, which a feature extraction step then converts into a machine-readable format. To improve classifier efficiency, reduce model complexity, and improve explainability, we can apply feature selection before the feature extraction step. To reduce model complexity further, we can apply feature reduction after feature extraction. Figure 10.1 shows this step-wise process, which precedes the training of machine learning classifiers.

Fig 10.1: A step-wise process for converting a text corpus into a format readable by machine learning models.

Feature construction and feature extraction are the minimum steps required to train machine learning or deep learning classifiers for text classification. Feature selection and feature reduction are optional steps that can help improve performance on the constructed and extracted features.
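The four steps can be sketched end to end in a few lines of dependency-free Python. This is only an illustrative sketch, not the method used in the rest of the chapter. The toy documents, the function names, the document-frequency thresholds (min_df, max_df_ratio), and the use of a random sign projection for the reduction step are all assumptions chosen to keep the example short.

```python
import random
from collections import Counter

# Hypothetical toy corpus (assumption: not taken from the chapter).
docs = [
    "the match ended in a late goal",
    "the team scored a late goal",
    "the market closed lower today",
    "the market rallied after the report",
]

def construct_features(text):
    """Feature construction: candidate unigrams and bigrams for one document."""
    tokens = text.lower().split()
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def select_features(candidates_per_doc, min_df=2, max_df_ratio=0.9):
    """Feature selection: keep candidates that are neither too rare nor too common."""
    n_docs = len(candidates_per_doc)
    doc_freq = Counter(f for cands in candidates_per_doc for f in set(cands))
    return sorted(
        f for f, count in doc_freq.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

def extract_features(candidates, vocabulary):
    """Feature extraction: turn one document into a count vector over the vocabulary."""
    counts = Counter(candidates)
    return [counts[f] for f in vocabulary]

def reduce_features(vectors, n_components=2, seed=0):
    """Feature reduction: project count vectors onto a few random +1/-1 directions."""
    rng = random.Random(seed)
    n_features = len(vectors[0])
    projection = [
        [rng.choice((-1.0, 1.0)) for _ in range(n_components)]
        for _ in range(n_features)
    ]
    return [
        [sum(v[i] * projection[i][j] for i in range(n_features))
         for j in range(n_components)]
        for v in vectors
    ]

candidates_per_doc = [construct_features(d) for d in docs]       # construction
vocabulary = select_features(candidates_per_doc)                 # selection
matrix = [extract_features(c, vocabulary) for c in candidates_per_doc]  # extraction
reduced = reduce_features(matrix)                                # reduction

print(vocabulary)
print(matrix)
print(reduced)
```

Note that selection runs on the candidate features before any vectors are built, so the extraction step only ever produces columns for the selected vocabulary. Reduction then operates on the extracted numeric vectors, which is why it has to come last.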