10.4: Feature Extraction

In text classification, feature extraction means converting text documents into numerical matrices that machine learning algorithms can use as features for predicting labels. For traditional machine learning models, feature extraction is done mostly with either a bag-of-words matrix or a TF-IDF matrix.

10.4.1  Bag of Words

Bag of words is a scoring method in which each cell of the matrix holds the count of a word in a document: the count when the word is present and 0 when it is absent. Let us consider Example 10.4.1 below with 4 corpora.

Example 10.4.1

Corpus 1: I am very happy.

Corpus 2: I just had an awesome weekend.

Corpus 3: I just had lunch. You had lunch?

Corpus 4: Do you want chips?

Unique words from these 4 corpora, after removing punctuation marks, are: 'am', 'an', 'awesome', 'chips', 'Do', 'had', 'happy', 'I', 'just', 'lunch', 'very', 'want', 'weekend', 'you'

The bag-of-words vectors for these corpora appear in Table 10.4.1.

Table 10.4.1: Bag of words vector

Word        Corpus 1   Corpus 2   Corpus 3   Corpus 4
am              1          0          0          0
an              0          1          0          0
awesome         0          1          0          0
chips           0          0          0          1
Do              0          0          0          1
had             0          1          2          0
happy           1          0          0          0
I               1          1          1          0
just            0          1          1          0
lunch           0          0          2          0
very            1          0          0          0
want            0          0          0          1
weekend         0          1          0          0
you             0          0          1          1

In the above table, each column corresponds to a corpus and each row to a word. If a word is present in a corpus, its count appears in the corresponding cell; if the word is not present, the cell contains 0.

For example, the word lunch appears twice in corpus 3, so the corresponding cell contains the count 2. The word am appears once, and only in corpus 1, so that cell contains 1, while the cells for corpora 2, 3, and 4 contain 0.
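
The counts in Table 10.4.1 can be reproduced with a short pure-Python sketch. The tokenizer below is a simplification made for this example only: it strips punctuation and folds 'You' into 'you' so that the vocabulary matches the word list above; real pipelines would typically lowercase every token.

from string import punctuation

corpora = [
    "I am very happy.",
    "I just had an awesome weekend.",
    "I just had lunch. You had lunch?",
    "Do you want chips?",
]

def tokenize(text):
    # Strip punctuation and split on whitespace; "You" is folded into "you"
    # so the vocabulary matches the word list in the text.
    cleaned = "".join(ch for ch in text if ch not in punctuation)
    return ["you" if w == "You" else w for w in cleaned.split()]

# Vocabulary: unique words across all corpora, sorted case-insensitively
vocabulary = sorted({w for doc in corpora for w in tokenize(doc)}, key=str.lower)
print(vocabulary)

# One count vector per corpus (the columns of Table 10.4.1)
for i, doc in enumerate(corpora, start=1):
    tokens = tokenize(doc)
    counts = [tokens.count(word) for word in vocabulary]
    print(f"Corpus {i}: {counts}")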

10.4.2  Term Frequency Inverse Document Frequency

This is commonly abbreviated as TF-IDF. The term frequency and the inverse document frequency of each word are calculated and multiplied to obtain the TF-IDF score.

Term frequency (TF) is calculated by dividing the number of times a term appears in a corpus by the total number of terms in that corpus.

TF(t) = (Number of times term t appears in the corpus) / (Total number of terms in the corpus)

Inverse document frequency (IDF) is calculated by taking the logarithm of the total number of documents divided by the number of documents that contain the term.

IDF(t) = log(Total number of documents / Number of documents containing term t)

The TF-IDF vectors for the corpora in Example 10.4.1 appear in Table 10.4.2. For example, lunch appears twice among the seven terms of corpus 3, so its TF is 2/7, and it occurs in only one of the four corpora, so its IDF is log(4/1).

Table 10.4.2: TF-IDF vector
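
The two formulas can be applied directly to the corpora of Example 10.4.1. The sketch below follows them literally; note that the logarithm base is a convention (natural log here) and that libraries such as scikit-learn add smoothing terms, so the exact values in Table 10.4.2 may differ slightly depending on those choices.

import math

corpora = [
    "I am very happy.",
    "I just had an awesome weekend.",
    "I just had lunch. You had lunch?",
    "Do you want chips?",
]

def tokenize(text):
    # Same simple tokenizer as before: drop punctuation, fold "You" into "you".
    cleaned = "".join(ch for ch in text if ch not in ".,?!")
    return ["you" if w == "You" else w for w in cleaned.split()]

docs = [tokenize(doc) for doc in corpora]
vocabulary = sorted({w for doc in docs for w in doc}, key=str.lower)

def tf(term, doc):
    # Term frequency: occurrences of the term divided by total terms in the corpus.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log of (total documents / documents containing the term).
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

for i, doc in enumerate(docs, start=1):
    scores = {t: round(tf(t, doc) * idf(t, docs), 3) for t in vocabulary if t in doc}
    print(f"Corpus {i}: {scores}")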

10.4.3  Word2vec

Word2Vec is a shallow neural network that tries to capture the context of words [10]. In word2vec, individual words are represented as one-hot vectors and are used to build a vector space. The network has a single hidden layer, which is a fully connected dense layer; the weights of this hidden layer are the word embeddings. The output layer uses the Softmax activation function to produce probabilities for the target words in the vocabulary. Such a network is a "standard" multinomial (multi-class) classifier.
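
To make the architecture concrete, the following numpy sketch (toy dimensions and random weights, purely illustrative) traces one forward pass: the one-hot input simply selects one row of the hidden-layer weight matrix, which is why those weights end up being the word embeddings, and the Softmax output assigns a probability to every word in the vocabulary.

import numpy as np

vocab_size, embedding_dim = 14, 8          # toy sizes for illustration
rng = np.random.default_rng(0)

W_hidden = rng.normal(size=(vocab_size, embedding_dim))   # hidden-layer weights = word embeddings
W_output = rng.normal(size=(embedding_dim, vocab_size))   # output-layer weights

x = np.zeros(vocab_size)                   # one-hot vector for an input (focus) word
x[7] = 1.0

hidden = x @ W_hidden                      # selects row 7 of W_hidden: that word's embedding
logits = hidden @ W_output
probs = np.exp(logits) / np.exp(logits).sum()   # Softmax over the whole vocabulary

print(probs.shape, round(probs.sum(), 6))  # (14,) 1.0 -- one probability per vocabulary word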

The objective function of word2vec maximizes the (log) similarity between the vectors of words that appear close together in the context and minimizes the similarity of words that do not. The output probabilities themselves are produced by the Softmax function.

Since the classes are actual words, the number of output neurons is huge, and applying the Softmax function over such a large output layer is computationally expensive. To avoid this costly computation, noise-contrastive estimation is used (word2vec uses a simplified variant called negative sampling). This converts the multinomial classification problem into a binary classification problem.

Let C denote the context word and W the focus word, and let Vc and Vw denote the embeddings of the context word and the focus word, respectively. The model scores a pair (C, W) with the sigmoid of the dot product of Vc and Vw, as described below.

Given a pair of words, the task is to predict whether the second word is a context word of the first. For this, positive and negative samples are created. For example, ('Orange', 'juice') as a positive pair means the two words appear together in the same context, while ('Orange', 'king') as a negative pair means they do not. Positive samples are extracted from the context window, whereas negative samples are drawn randomly from the dictionary of all words. The number of negative samples k depends on the size of the data: for small datasets k is chosen between 5 and 20, and for large datasets between 2 and 5. Word pairs are then created consisting of the input word, the paired word, and a target label indicating whether the pair is a positive or a negative sample.
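
The sampling step can be sketched as follows. The sentence, the vocabulary, the window size, and the uniform random draw are all simplifications made for this illustration; real word2vec implementations draw negatives from a smoothed unigram distribution and avoid picking the true context words.

import random

sentence = ["I", "just", "had", "an", "awesome", "weekend"]
vocabulary = ["I", "just", "had", "an", "awesome", "weekend",
              "lunch", "you", "chips", "orange", "juice", "king"]

window_size = 2   # words on each side of the focus word that count as context
k = 5             # negative samples per positive pair (5-20 for small data, 2-5 for large)

pairs = []
for i, focus in enumerate(sentence):
    lo, hi = max(0, i - window_size), min(len(sentence), i + window_size + 1)
    for j in range(lo, hi):
        if j == i:
            continue
        # Positive sample: the focus word with a word from its context window.
        pairs.append((focus, sentence[j], 1))
        # Negative samples: k words drawn at random from the dictionary of all words.
        for _ in range(k):
            pairs.append((focus, random.choice(vocabulary), 0))

for focus, context, label in pairs[:8]:
    print(focus, context, label)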

For backpropagation, two matrices of the same size are created: an embedding matrix and a context matrix. The number of rows in each matrix is the vocabulary size of the corpus, and the number of columns is the chosen embedding size. The embedding matrix holds the input-word representations and the context matrix holds the context-word representations. At the start of the training process, both matrices are initialized with random values.

The dot product of the input embedding with each of the context embeddings gives the similarity between the input and context embeddings. The resulting dot products are converted into values between 0 and 1 using the sigmoid function. The prediction error for each pair of input and context words is obtained by subtracting the sigmoid output from the actual label, 1 for a positive sample and 0 for a negative sample. This error is then used to adjust the embeddings of the input word and the context word in each pair. The process is repeated over the dataset for the defined number of epochs. Finally, the context matrix is discarded and the embedding matrix is kept as the word embeddings.
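
A compact numpy sketch of this update follows. It uses a hypothetical four-word vocabulary and three hand-picked training pairs purely for illustration; it shows one skip-gram-with-negative-sampling update step, not a full word2vec implementation.

import numpy as np

rng = np.random.default_rng(1)
vocab = {"orange": 0, "juice": 1, "king": 2, "chips": 3}   # hypothetical tiny vocabulary
embedding_dim = 4
lr = 0.05

# Two matrices of the same size: embedding (input words) and context (context words),
# initialized with random values at the start of training.
embedding = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))
context = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

# (input word, context word, label): 1 for a positive sample, 0 for a negative sample.
training_pairs = [("orange", "juice", 1), ("orange", "king", 0), ("orange", "chips", 0)]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for epoch in range(50):
    for w, c, label in training_pairs:
        # Copy so both updates below use the pre-update vectors.
        vw, vc = embedding[vocab[w]].copy(), context[vocab[c]].copy()
        pred = sigmoid(vw @ vc)            # dot product -> sigmoid: predicted probability
        error = label - pred               # prediction error for this pair
        # Gradient step: move both vectors so the prediction approaches the label.
        embedding[vocab[w]] += lr * error * vc
        context[vocab[c]] += lr * error * vw

# The context matrix is discarded; the embedding matrix holds the word vectors.
print(sigmoid(embedding[vocab["orange"]] @ context[vocab["juice"]]))   # moves toward 1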