10.5: Feature Reduction
Several methods are available for reducing the dimensionality of sparse feature
vectors. The most notable are singular value decomposition (SVD) and
non-negative matrix factorization (NMF). Both techniques aim to identify a
latent space that represents the original data more concisely and at a lower
computational cost.
10.5.1 Singular Value Decomposition
Bag-of-words and TF-IDF methods create high-dimensional vectors that carry
useful information. However, these vectors are very long and sparse: most of
their entries are zero.
Storing and processing high-dimensional data is computationally intensive and
expensive. Hence, ideally, we would like to reduce the bag-of-words and TF-IDF
vectors to short, dense vectors that capture the most important dimensions of
variation. A traditional method of doing this is singular value decomposition
(SVD). SVD can reduce the TF-IDF matrix to a small number of feature columns,
which makes training a machine learning classifier faster. It discards the less
important parts of the matrix and produces an approximation in the specified
number of dimensions. It is a classical second-order eigenvector technique in
which the components generated from the original vectors are uncorrelated with
one another. SVD is also used as a data compression technique.
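The reduction described above can be sketched in a few lines. This is a minimal illustration using NumPy's dense np.linalg.svd, not a production approach for large sparse matrices. The toy count matrix X and the helper name truncated_svd are assumptions introduced here for illustration only.

```python
import numpy as np

# Toy document-term count matrix (rows = documents, columns = vocabulary terms).
# The values are invented purely for illustration.
X = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 2, 1, 1],
    [1, 0, 0, 0, 1, 0],
], dtype=float)


def truncated_svd(matrix, k):
    """Hypothetical helper: keep only the k largest singular values.

    Returns the documents projected into k dimensions and the
    rank-k approximation of the original matrix.
    """
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    reduced = U[:, :k] * s[:k]            # short, dense document vectors
    approximation = reduced @ Vt[:k, :]   # rank-k reconstruction of the input
    return reduced, approximation


reduced, approx = truncated_svd(X, k=2)
print("reduced shape:", reduced.shape)
print("reconstruction error:", np.linalg.norm(X - approx))
```

Each document is now described by two dense values instead of six sparse ones. Increasing k lowers the reconstruction error at the cost of longer vectors, which is the trade-off to tune in practice.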
10.5.2 Non-Negative Matrix Factorization
This method is commonly abbreviated as NMF. It is suitable when the features
consist of non-negative elements. In text classification, the values in both
the bag-of-words and TF-IDF feature matrices are non-negative: the lowest value
in either matrix is 0.
Hence, we can use NMF to process these sparse matrices and find meaningful
components. It is mostly used for topic modeling. Through NMF and
topic modeling, we can identify the most common words in social media
conversations. This can help us create a lexicon of useful words for the
domain. A comprehensive lexicon can aid in identifying the
different classes for the classification model, as well as in broadening our
understanding of the domain for which the classification model is being
developed.
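To make the topic-modeling use concrete, the following is a minimal sketch of NMF using the classical multiplicative update rules for the Frobenius-norm objective. It is one common way to compute the factorization, not necessarily the one a given library uses. The vocabulary, the count matrix V, and the function name nmf are all assumptions made up for this illustration.

```python
import numpy as np


def nmf(V, k, n_iter=500, seed=0, eps=1e-10):
    """Hypothetical helper: factor a non-negative V into W @ H with W, H >= 0.

    Uses multiplicative updates, which keep every entry non-negative
    because each step only multiplies by a non-negative ratio.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    W = rng.random((n_docs, k)) + eps   # document-to-topic weights
    H = rng.random((k, n_terms)) + eps  # topic-to-term weights
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Invented vocabulary and document-term counts, for illustration only.
vocab = ["goal", "match", "team", "vote", "election", "party"]
V = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 2, 0, 0, 0],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 3, 2, 2],
], dtype=float)

W, H = nmf(V, k=2)

# The highest-weighted terms in each row of H are candidate lexicon words.
for topic_idx, weights in enumerate(H):
    top_terms = [vocab[i] for i in np.argsort(weights)[::-1][:3]]
    print(f"topic {topic_idx}: {top_terms}")
```

The rows of H play the role of topics: reading off the highest-weighted terms in each row gives a starting list of domain words, while the rows of W indicate how strongly each document expresses each topic.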