10.5: Feature Reduction

Several methods are available for reducing the dimensionality of sparse feature vectors. The most notable are singular value decomposition (SVD) and non-negative matrix factorization (NMF). Both techniques aim to identify a latent space that represents the original data more concisely and at lower computational cost.

10.5.1  Singular Value Decomposition

Bag-of-words and TF-IDF methods create high-dimensional vectors, with one dimension per vocabulary term. These vectors carry useful information, but they are very long and sparse: most of their entries are zero.

Storing and processing high-dimensional data is computationally intensive and expensive. Ideally, we would like to reduce the bag-of-words and TF-IDF vectors to short, dense vectors that capture the most important dimensions of variation. A traditional method for doing this is singular value decomposition. SVD can reduce the TF-IDF matrix to a small number of feature columns, which makes training a machine learning classifier faster. It discards the less important parts of the matrix and produces an approximation in the specified number of dimensions. It is a classical second-order eigenvector technique: the components it generates from the original vectors are mutually orthogonal (and uncorrelated when the data are mean-centered). SVD is also widely used as a data compression technique.
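The reduction described above can be sketched in a few lines. This is a minimal illustration rather than a production recipe: it assumes NumPy is available, the small matrix X is made-up data standing in for a TF-IDF matrix, and the helper name truncated_svd is hypothetical. It keeps the k largest singular values and uses them to produce a dense, k-column representation of each document.

```python
import numpy as np

# Made-up stand-in for a TF-IDF matrix: 4 documents (rows) x 6 terms (columns).
X = np.array([
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0, 0.0],
])


def truncated_svd(matrix, k):
    """Hypothetical helper: return the top-k factors of the SVD of `matrix`."""
    # Full (thin) SVD; singular values come back sorted in descending order.
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


U_k, s_k, Vt_k = truncated_svd(X, k=2)

# Dense k-dimensional representation of each document (one row per document).
X_reduced = U_k * s_k

# Rank-k approximation of the original matrix, back in term space.
X_approx = X_reduced @ Vt_k

print("reduced shape:", X_reduced.shape)
print("approximation error:", np.linalg.norm(X - X_approx))
```

The rows of X_reduced can be passed to a classifier in place of the original sparse rows. For large sparse matrices, computing the full decomposition as done here is wasteful; scikit-learn's TruncatedSVD class is one option that works directly on sparse input.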

10.5.2  Non-Negative Matrix Factorization

This method is commonly abbreviated as NMF. It is suitable when the features consist of non-negative elements. In text classification, the values in both the bag-of-words and TF-IDF feature matrices are non-negative: the lowest value in either matrix is 0. Hence, we can use NMF to process these sparse matrices and find meaningful components. It is mostly used for topic modeling. Through NMF and topic modeling, we can identify the most common words in social media conversations, which helps us build a lexicon of useful words for the domain. A comprehensive lexicon can in turn help identify the classes for the classification model and broaden our understanding of the domain for which the model is being developed.
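To make the topic-modeling use concrete, the following is a minimal sketch of NMF using the multiplicative update rules, assuming NumPy is available. The function name nmf, the toy vocabulary, and the count matrix are all made up for illustration. The matrix X is factorized into a document-topic matrix W and a topic-term matrix H, both non-negative, and the largest weights in each row of H point to the words that characterize that topic.

```python
import numpy as np


def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Hypothetical helper: factorize non-negative X (docs x terms) as W @ H."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = X.shape
    # Random non-negative starting point.
    W = rng.random((n_docs, k)) + eps
    H = rng.random((k, n_terms)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative at every step.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Made-up bag-of-words counts: 4 documents x 6 terms.
vocab = ["goal", "match", "team", "vote", "election", "party"]
X = np.array([
    [2.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 2.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 2.0, 1.0],
])

W, H = nmf(X, k=2)

# List the three highest-weighted terms for each topic (row of H).
for topic_idx, weights in enumerate(H):
    top_terms = [vocab[i] for i in np.argsort(weights)[::-1][:3]]
    print(f"topic {topic_idx}: {top_terms}")
```

The top terms collected per topic are candidate entries for a domain lexicon. The result depends on the random initialization and on the choice of k, so the topics should be inspected rather than taken at face value. In practice, a library implementation such as scikit-learn's NMF class is likely a better choice than this hand-written loop.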