10.3: Feature Selection

Feature selection in text classification applies to all features except meta-features. It can be performed through 1) filter methods, 2) embedded methods using document frequency with a bag-of-words or TF-IDF representation, 3) the genetic algorithm, and 4) ensemble feature selection.

The biggest problem with text features is that the number of unique features runs into the thousands. Moving from unigrams to bi-grams increases the number of unique features manifold. Not every token is useful, and many are simply noise, so we need to be highly selective and retain only a small set of discriminative features.

10.3.1  Filter Method

In text classification, filter-method feature selection is performed first, and the selected features are then passed to feature extraction before being fed into the model. Filter methods try to identify the least useful features. When done correctly, they can reduce computation time and help prevent overfitting.

10.3.1.1 Document Frequency

Document frequency is the number of documents in which a word or n-gram appears [3]; repeated occurrences within the same document are counted only once. It is different from the word occurrence count, which is the total number of times the specific word or n-gram appears across all documents, without discarding repetitions.

For example, suppose we are analyzing 100 documents. The word 'Lion' appears 10 times in a single document and nowhere else, while the word 'Bear' appears in 7 documents, occurring 4 times in one of them and once in each of the other 6. The word frequency for both 'Lion' and 'Bear' is 10, but the document frequency is 1 for 'Lion' and 7 for 'Bear'.
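To make the distinction concrete, the short snippet below computes both quantities for a toy tokenized corpus constructed to mirror this example (the corpus itself is made up purely for illustration).

```python
from collections import Counter

# Toy corpus mirroring the example: 'lion' appears 10 times in one document,
# 'bear' appears in 7 documents, 4 times in one of them, once in the others.
docs = [["lion"] * 10 + ["bear"] * 4] + [["bear"]] * 6 + [["deer"]] * 93

word_frequency = Counter()      # total occurrences across all documents
document_frequency = Counter()  # number of documents containing the word

for tokens in docs:
    word_frequency.update(tokens)
    document_frequency.update(set(tokens))  # set() discards repetitions within a document

print(word_frequency["lion"], word_frequency["bear"])          # 10 10
print(document_frequency["lion"], document_frequency["bear"])  # 1 7
```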

If we have to choose between word frequency and document frequency, we should base feature removal on document frequency, as it has more impact on the feature vector representation. A word can have a very high word frequency simply because it is repeated many times within one document, whereas a word that appears in many documents, even if less frequently within each one, has more bearing on the feature representation. Words with very high document frequency are similarly uninformative, since they carry little discriminative weight in the feature vector.

Words with either very low or very high document frequency therefore have little bearing on the feature representation and can be removed. The same principle can be applied to higher-order n-grams, syntactic n-grams, and taxonomy features.
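In practice, this kind of document frequency cut-off is available directly in common vectorizers. The sketch below is a minimal illustration using scikit-learn's CountVectorizer; the toy corpus and the thresholds min_df=2 and max_df=0.8 are arbitrary choices for demonstration, and a real corpus would typically use a larger lower bound.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the lion hunts at night",
    "the bear sleeps through winter",
    "the bear and the lion are large mammals",
    # in practice this would be the full document collection
]

# min_df: drop terms appearing in fewer than 2 documents (very low document frequency)
# max_df: drop terms appearing in more than 80% of documents (very high document frequency)
vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.8)
X = vectorizer.fit_transform(corpus)

print(X.shape)                             # (n_documents, n_selected_features)
print(vectorizer.get_feature_names_out())  # surviving unigrams and bi-grams
```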

10.3.1.2 Chi-Square

It is also known as the χ2 statistic [3] and measures the lack of independence between a term (t) and a class (c). It has a natural value of zero if t and c are independent; the higher the value, the more dependent the term is on the class. It is not reliable for low-frequency terms.

Using the two-way contingency table of a term t and a category c, where A is the number of times t and c co-occur, B is the number of times t occurs without c, C is the number of times c occurs without t, D is the number of times neither c nor t occurs, and N is the total number of documents, the Chi-square statistic is calculated with the formula below.
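In this notation, the standard formulation of the statistic, consistent with the definitions above, is:

\[
\chi^{2}(t, c) = \frac{N \,(AD - CB)^{2}}{(A + C)\,(B + D)\,(A + B)\,(C + D)}
\]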

10.3.1.3 Mutual Information

In this method, rare terms receive a higher score than common terms [3]. For multi-class problems, we calculate the MI value for every category and take the maximum MI value across all categories at the word level. It is calculated using the formula below.
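A commonly used estimate, reusing the contingency-table counts A, B, C, and N defined in the Chi-square section, is:

\[
I(t, c) = \log \frac{P(t \wedge c)}{P(t)\,P(c)} \;\approx\; \log \frac{A \times N}{(A + C)\,(A + B)}
\]

with the multi-class score taken as \( I_{\max}(t) = \max_{i} I(t, c_i) \), as described above.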

10.3.1.4 Proportional Difference

It measures how close two numbers are to being equal [5][6]. It helps find unigrams that occur mostly in documents of one class or the other. PD values are restricted between −1 and 1: a value near −1 indicates that a word occurs in approximately an equal number of documents in all categories, while a value of 1 indicates that a word occurs in the documents of only one category.

Below is the formula for calculating the proportional difference

 

As with mutual information, for multi-class problems we calculate the PD value for each category and take the maximum value across all categories at the word level.

10.3.1.5 Information Gain

It measures the number of bits of information obtained for category prediction by knowing the presence or absence of a term in a document [3][4]. It is calculated using the formula below.
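A commonly used formulation over m categories \(c_1, \ldots, c_m\), where \(\bar{t}\) denotes the absence of term t, is:

\[
G(t) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
       + P(t)\sum_{i=1}^{m} P(c_i \mid t)\log P(c_i \mid t)
       + P(\bar{t})\sum_{i=1}^{m} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
\]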

10.3.2  Metaheuristic Algorithms

The genetic algorithm has been found to perform well with support vector machines using a conceptual representation with WordNet features and a TF-IDF feature representation [7]. Instead of randomly selecting a fixed number of text tokens as features, we can select a percentage of text tokens from the total number of words [8].

The genetic algorithm can be used to decide which text tokens should be included in or excluded from the feature vector. Although metaheuristic methods do not perform well when the number of features is very high, as is the case in text classification, they have still been cited in numerous research papers.
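To make the idea concrete, the sketch below evolves a binary inclusion mask over the vocabulary with a hand-rolled genetic algorithm, scoring each mask by the cross-validated F1 of a logistic regression trained only on the selected token columns. It is a minimal illustration rather than the exact procedure from [7] or [8]; the toy corpus, population size, mutation rate, and number of generations are all arbitrary choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

texts = ["good movie", "bad movie", "great acting", "terrible plot",
         "loved the film", "hated the film", "brilliant story", "awful story"]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()
n_features = X.shape[1]

def fitness(mask):
    """Cross-validated F1 of a classifier trained only on the selected token columns."""
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(LogisticRegression(max_iter=1000),
                           X[:, mask.astype(bool)], labels,
                           cv=2, scoring="f1").mean()

# Initial population: random binary masks deciding which tokens to keep.
population = rng.integers(0, 2, size=(10, n_features))

for generation in range(20):
    scores = np.array([fitness(ind) for ind in population])
    parents = population[np.argsort(scores)[::-1][:5]]   # selection: best half survives
    children = []
    while len(children) < 5:
        p1, p2 = parents[rng.choice(len(parents), size=2, replace=False)]
        cut = rng.integers(1, n_features)                # single-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        flip = rng.random(n_features) < 0.05             # mutation: flip ~5% of bits
        children.append(np.where(flip, 1 - child, child))
    population = np.vstack([parents, np.array(children)])

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected tokens:", vectorizer.get_feature_names_out()[best.astype(bool)])
```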

Other metaheuristic feature selection algorithms for text classification are mentioned in the literature as well. However, these methods share one limitation: they do not perform well on high-dimensional data where the number of features is very large.

10.3.3  Ensemble Feature Selection

Linear and non-linear models can each be trained individually to predict the target variable, and each technique has its strengths and weaknesses. One way to leverage the strengths of different models is to perform ensemble learning by training several base learner models. We can also leverage the strengths of different types of features, such as n-grams, and of different feature extraction methods, such as TF-IDF or bag-of-words vectors.

On the one hand, this can significantly improve classification results; on the other hand, it can make the entire process too computationally heavy for practical use, since substantial computing power may be needed to obtain predictions from all base models on new data. To address this, we can 1) reduce the complexity of individual base models and 2) reduce the number of base models. Both can be achieved by performing feature selection at the level of the base models and of the ensemble model. This entire process can be done jointly through an algorithm called ensemble feature selection [9]. Figure 10.3.3 details the original ensemble feature selection algorithm.

 

Fig 10.3.3: Ensemble feature selection (Wang et al. 2020)

This algorithm was created for structured data; for text classification, we can adapt it at different levels. For the base models, instead of the lasso feature selection used in the original paper, we can use document frequency to remove words with very low or very high occurrence. Through grid search, we can find the optimal combination of lower and upper document frequency thresholds: for the lower threshold we use an absolute document count, and for the upper threshold we use the percentage of documents in which the word occurs. Reducing the number of words shrinks the feature vector and reduces model complexity.
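One way to run such a grid search is sketched below; the candidate thresholds, the toy corpus, the TF-IDF representation, and the logistic regression scorer are assumptions for illustration rather than values prescribed here.

```python
from itertools import product

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "good movie with great acting", "bad movie with terrible acting",
    "great plot and good story", "awful plot and bad story",
    "loved the brilliant film", "hated the awful film",
    "good story great film", "bad story terrible film",
    "brilliant acting good plot", "terrible acting awful plot",
]  # stand-in for the real labeled corpus
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

min_df_grid = [1, 2, 3]         # lower bound: absolute document count
max_df_grid = [0.5, 0.8, 1.0]   # upper bound: fraction of documents containing the word

results = {}
for min_df, max_df in product(min_df_grid, max_df_grid):
    model = make_pipeline(
        TfidfVectorizer(min_df=min_df, max_df=max_df),
        LogisticRegression(max_iter=1000),
    )
    results[(min_df, max_df)] = cross_val_score(
        model, texts, labels, cv=5, scoring="f1_macro"
    ).mean()

best = max(results, key=results.get)
print("best (min_df, max_df):", best, "mean F1:", round(results[best], 3))
```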

The original paper performs bootstrapping; in this adaptation, we perform 5- or 10-fold cross-validation to avoid overfitting. The entire labeled dataset can be divided into 3 parts: 1) training data for training the individual base models, 2) ensemble training data for training the ensemble model, and 3) test data, for example in a 60:20:20 ratio. For each cross-validation split, the training data is used to train base learner models such as logistic regression, random forest, etc. After training is complete, each base model predicts raw class probabilities for the ensemble training data and the test data. An ensemble classifier is then trained using the raw class probabilities on the ensemble training data as features and the actual labels as the outcome variable. Finally, the ensemble model predicts outcomes for the test data. A performance metric, such as the F1 score, is computed for each cross-validation split and averaged across all splits to assess model performance.
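For a single split, the procedure can be wired up roughly as follows; the toy corpus, the choice of base learners (logistic regression, random forest, naive Bayes), and the document-frequency thresholds are illustrative assumptions, and in the full adaptation this would be repeated for each cross-validation split.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = [
    "good movie great acting", "bad movie terrible acting",
    "loved the brilliant film", "hated the awful film",
    "great plot and good story", "awful plot and bad story",
] * 5                      # a tiny repeated toy corpus, stand-in for real data
labels = [1, 0, 1, 0, 1, 0] * 5

# 60:20:20 split: base-model training / ensemble training / test.
X_train_txt, X_rest_txt, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.4, stratify=labels, random_state=0)
X_ens_txt, X_test_txt, y_ens, y_test = train_test_split(
    X_rest_txt, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# Document-frequency filtering while building the feature vectors.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.9)
X_train = vectorizer.fit_transform(X_train_txt)
X_ens = vectorizer.transform(X_ens_txt)
X_test = vectorizer.transform(X_test_txt)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=0),
    MultinomialNB(),
]
for model in base_models:
    model.fit(X_train, y_train)

# Raw class probabilities from the base models become the ensemble's features.
meta_ens = np.hstack([m.predict_proba(X_ens) for m in base_models])
meta_test = np.hstack([m.predict_proba(X_test) for m in base_models])

ensemble = LogisticRegression(max_iter=1000)
ensemble.fit(meta_ens, y_ens)
print("test F1:", f1_score(y_test, ensemble.predict(meta_test), average="macro"))
```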

For the ensemble model, we perform feature selection through the genetic algorithm, exactly as in the original algorithm. The features discarded by the genetic algorithm are then mapped back to the base models: if none of a base model's class probability features appear in the list of features selected by the genetic algorithm, that model can be discarded. The end result is fewer base models, each of which is less complex.
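Continuing in the same spirit, this pruning step reduces to simple bookkeeping: map the meta-feature columns kept by the genetic algorithm back to the base models that produced them and drop any model with no surviving column. The column-naming scheme and the example selection below are hypothetical, chosen only to show the mapping.

```python
# Each base model contributes one meta-feature column per class.
model_names = ("logreg", "random_forest", "naive_bayes")
feature_names = [f"{name}_class{k}" for name in model_names for k in (0, 1)]

# Hypothetical output of the genetic algorithm run on the ensemble features:
# it kept only these meta-feature columns.
selected_by_ga = {"logreg_class0", "logreg_class1", "naive_bayes_class1"}
assert selected_by_ga <= set(feature_names)

# Keep a base model only if at least one of its class-probability columns survived.
kept_models = {
    name for name in model_names
    if any(f.startswith(name + "_") for f in selected_by_ga)
}
discarded_models = set(model_names) - kept_models

print("kept:", sorted(kept_models))            # ['logreg', 'naive_bayes']
print("discarded:", sorted(discarded_models))  # ['random_forest']
```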

Python implementations of these filter methods, metaheuristic methods, and ensemble feature selection are available in the companion Python library TextFeatureSelection.