10.2: Feature Construction
For text classification, the most commonly used features are 1) N-grams (unigrams, bigrams, and trigrams), 2) syntactic N-grams (SN-grams), 3) domain-specific taxonomy features, and 4) meta features. This applies to both deep learning and traditional machine learning algorithms.
10.2.1 N-gram
In human language, words do not occur in isolation. Often, we need other supporting words to understand the whole situation. For example, if you want to invite your friend for coffee, you will not simply tell your friend 'Coffee'. Instead, you might say 'Let's go for coffee'. Here we needed three extra words, 'Let's go for', to express our core objective of inviting our friend for 'coffee'. Similarly, in some situations, we will need the words after the main context word to understand the situation. This is not always true, however. Imagine your friend is sitting at your house and you have prepared coffee in your kitchen. You walk towards your friend with an extra cup of coffee and offer it after saying only one word: 'coffee'. Most likely your friend will understand that you are offering him coffee and will accept it gracefully.
N-grams are sequences of elements such as words, characters, parts of speech, or dependency tags as they appear in the text. The n in n-gram refers to the number of elements in the sequence.
We can represent a text document with n-grams in multiple ways for feature construction, depending on whether we include the previous and next words. The most common types are uni-grams, bi-grams, and tri-grams. A uni-gram represents each word in isolation. A bi-gram considers one surrounding word in addition to the current word. A tri-gram considers the two previous words in addition to the current word. We can similarly go for quad-grams and so on. However, the number of unique n-grams increases manifold and becomes difficult to represent as machine learning feature embeddings and to train a classifier on. Beyond uni-grams, the words of an n-gram are concatenated with each other using a special character such as the underscore _ .
Example: For a sentence that reads "You are so adorable!", below are the corresponding n-grams.
uni-gram: ["you", "are", "so", "adorable", "!"]
bi-gram: ["you_are", "are_so", "so_adorable", "adorable_!"]
tri-gram: ["you_are_so", "are_so_adorable", "so_adorable_!"]
10.2.2 Syntactic N-gram
When n-grams are extracted by following the paths of the syntactic dependency tree, that is, in the order in which elements follow each other in the tree rather than in the text, we call such n-grams syntactic n-grams (sn-grams) [1].
Human spoken and written languages typically follow a hierarchical structure of sentences, clauses, phrases, and words. The dependency tree is a visual representation of this linguistic structure, in which the grammatical hierarchy is graphically displayed. Connecting points in the tree diagram are called nodes, with one word being the head and the other being the dependent of the relation. Syntactic dependencies are obtained through dependency parsing.
Consider the example sentence "You should not run any more", which can be a sarcastic remark to an obese person. Figure 10.2.2 shows what the dependency parse tree looks like, along with the dependency labels.
Fig 10.2.2: Syntactic dependency tree for the example sentence.
In this example, run is the root word. Below are the dependency relationships in this sentence.
(run, You), connected by a nominal subject relationship.
(run, should), connected by an auxiliary.
(run, not), connected by a negation modifier.
(run, more), connected by an adverb modifier.
(more, any), connected by an adverb modifier.
We follow the arrow-marked paths in the dependencies to obtain syntactic n-grams. In this example, all the relationships are between two words only, except for run -> more -> any, where the path spans three words. This is a candidate for a syntactic tri-gram.
We used the companion Python library SNgramExtractor for this book and extracted the following syntactic bi-grams and tri-grams from the example sentence.
Syntactic bi-grams: "run_You",
"run_should",
"run_not", "more_any",
"run_more"
Syntactic tri-gram: "run_more_any"
SNgramExtractor uses a spaCy language model for extracting sn-grams. We can also specify which language model we want to use. This allows us to use non-English language models and extract syntactic n-grams for non-English languages as well. SNgramExtractor follows an object-oriented pattern and processes one sentence at a time. For the example sentence 'You should not run any more', the following code extracts its syntactic bi-grams and tri-grams.
# pip install SNgramExtractor (assuming the class is exported at the package top level)
from SNgramExtractor import SNgramExtractor

text = 'You should not run any more'
SNgram_obj = SNgramExtractor(text,
                             meta_tag='original',
                             trigram_flag='yes',
                             nlp_model=None)
output = SNgram_obj.get_SNgram()
print('Original Text:', text)
print('SNGram bigram:', output['SNBigram'])
print('SNGram trigram:', output['SNTrigram'])
In this code, we have not specified the nlp_model parameter; it is set to None. In that case, en_core_web_sm is used as the default language model. Below is what the output looks like when the code is executed.
Original Text: You should not run any more
SNGram bigram: run_You run_should run_not more_any run_more
SNGram trigram: run_more_any
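As mentioned above, a different language model can be supplied through the nlp_model parameter. The sketch below is only an assumption of how this could look: it passes a loaded spaCy pipeline (here a German one) instead of relying on the default; whether the parameter expects a loaded pipeline object or a model name may differ in the actual library.
import spacy
from SNgramExtractor import SNgramExtractor  # assuming top-level export

# Hypothetical usage: a German pipeline, installed via
# python -m spacy download de_core_news_sm
nlp = spacy.load('de_core_news_sm')
text = 'Du solltest nicht mehr laufen'
SNgram_obj = SNgramExtractor(text,
                             meta_tag='original',
                             trigram_flag='yes',
                             nlp_model=nlp)  # assumption: a loaded pipeline is accepted here
print(SNgram_obj.get_SNgram())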
Unlike traditional n-grams, syntactic n-grams are less arbitrary. Dependency parsing is language-specific and differs for each language. Hence, by extracting syntactic n-grams, we can include linguistically rich features in our model. Additionally, as we are not including all the possible n-grams that could be created, their number is smaller than the number of traditional n-grams. Hence, syntactic n-gram features can be interpreted as a linguistic phenomenon, while traditional n-grams have no plausible linguistic interpretation and are merely a statistical artifact.
One shortcoming of syntactic n-grams is that they depend on the availability of a syntactic parser and lexical resources. Not all human languages have the lexical resources needed to build sn-grams. For the languages that do, extraction requires the construction of dependency parse trees, which increases processing time. In contrast, traditional n-grams are faster to compute.
10.2.3 Domain-Specific Taxonomy Features
The use of background knowledge is largely unexploited in text classification tasks. To include background knowledge in the form of metadata, ontology- or taxonomy-generated features can be incorporated into machine learning classifiers. Such features act as a second-level context and can lead to improved interpretability and performance of the model. To do so, we focus on semantic structures derived through hypernym relations between words.
Individual words are mapped to higher-order semantic concepts. For example, one of the WordNet hypernyms of the word tiger is the term mammal, which can be further mapped to animal. We can ultimately reach the most general term, which is entity.
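As a hedged illustration, the sketch below uses NLTK's WordNet interface (one possible tool, not necessarily the one used elsewhere in this book) to print the hypernym chains from the animal sense of tiger up to entity.
# Requires: pip install nltk, then nltk.download('wordnet')
from nltk.corpus import wordnet as wn

tiger = wn.synset('tiger.n.02')  # the animal sense of "tiger"

# hypernym_paths() returns every chain from the synset up to the root ("entity").
for path in tiger.hypernym_paths():
    print(' -> '.join(s.lemmas()[0].name() for s in path))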
Before we construct hypernym-based features, word-sense disambiguation should be performed to obtain the exact sense of each word in context. For example, the word bank can have different meanings in different contexts. Consider the sentences "I went to the bank to withdraw money." and "I was relaxing near the bank of the river in the morning.". The meaning of bank differs between them: in the first sentence, bank refers to a financial institution, and in the second, it refers to the sloping land alongside the river.
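One simple disambiguation option is NLTK's implementation of the Lesk algorithm, sketched below; this is only an illustration, not necessarily the method used in the chapter, and Lesk's sense choices are often imperfect.
# Requires nltk.download('wordnet') and a tokenizer model such as nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

s1 = "I went to the bank to withdraw money."
s2 = "I was relaxing near the bank of the river in the morning."

# lesk() returns the WordNet synset whose gloss overlaps most with the context words.
print(lesk(word_tokenize(s1), 'bank', 'n'))
print(lesk(word_tokenize(s2), 'bank', 'n'))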
For each word in the document, we find a hypernym path between the word and the most general term. The document-based taxonomy thus created is a tree-like structure: the word-specific tree structures are collated at the document level, and the document-level taxonomies can in turn be combined into a corpus-based taxonomy.
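A minimal sketch of this idea is given below. It uses NLTK's WordNet and a naive first-sense heuristic in place of proper word-sense disambiguation, and simply counts the hypernym concepts encountered for the noun tokens of a document.
from collections import Counter
from nltk.corpus import wordnet as wn

def hypernym_concept_counts(tokens):
    """Count hypernym concepts reached from each noun token (naive sketch)."""
    concepts = Counter()
    for tok in tokens:
        synsets = wn.synsets(tok, pos=wn.NOUN)
        if not synsets:
            continue
        sense = synsets[0]  # naive: first sense; a real pipeline would disambiguate first
        for path in sense.hypernym_paths():
            for s in path:
                concepts[s.name()] += 1
    return concepts

print(hypernym_concept_counts(['tiger', 'lion', 'bank']))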
One shortcoming of using the WordNet-based taxonomy feature is that WordNet is a general-purpose taxonomy. If a customized lexical resource can be created for the specific domain for which the classifier is being developed, it can sharply improve the classifier's performance and usability even further.
10.2.4 Meta Features
These features have previously been used to predict whether a question posted on StackOverflow will be deleted. StackOverflow is an online forum for software engineers and programmers to post their questions. The website is moderated by volunteers who delete questions that are too broad or off-topic. To reduce manual moderation and automate this task, several algorithms and approaches have been suggested. One such method [2] uses meta-features derived from past user-generated content and site-generated statistics. Meta features from the users were grouped into four categories: profile, community, content, and syntactic.
Profile features are defined based on the historical statistics of the user who posted the question: for example, how old the account that posted the question is, the number of previous questions, the number of previous answers, the ratio of questions to account age, and the ratio of answers to account age. On Twitter, the analogous statistics would be how many tweets have been posted, how many tweets were replied to, how many tweets were liked, and how many people follow the account. Similarly, many such statistics can be computed for users as scores based on the functionalities available on the social media website.
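As a hedged sketch, the snippet below computes profile features of this kind from a hypothetical user record; the field names and values are invented for illustration.
from datetime import datetime, timezone

# Hypothetical user record; field names are illustrative only.
user = {
    'account_created': datetime(2020, 1, 15, tzinfo=timezone.utc),
    'num_questions': 42,
    'num_answers': 10,
}

account_age_days = max((datetime.now(timezone.utc) - user['account_created']).days, 1)
profile_features = {
    'account_age_days': account_age_days,
    'num_questions': user['num_questions'],
    'num_answers': user['num_answers'],
    'question_to_age_ratio': user['num_questions'] / account_age_days,
    'answer_to_age_ratio': user['num_answers'] / account_age_days,
}
print(profile_features)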
Community features on StackOverflow are average reputation and popularity statistics for the user: for example, the average answer score, the average number of comments, and the average question score. On Twitter, we can similarly calculate the average number of retweets for a user's tweets, and the average number of likes and replies their tweets received in the past.
Many of the content and syntactic meta-features use LIWC scores. LIWC stands for Linguistic Inquiry and Word Count. LIWC2007 is a software package that contains 4,500 words and 64 hierarchical dictionary categories. Given an input natural language text document, it outputs a score for the input against each of the 64 categories after analyzing writing style and psychometric properties.
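Since LIWC itself is proprietary, the toy sketch below only illustrates the general idea of dictionary-category scoring with a made-up two-category dictionary; it is not LIWC and does not produce real LIWC scores.
# Toy dictionary with invented categories, far smaller than the real LIWC dictionary.
toy_dictionary = {
    'pronoun': {'i', 'you', 'he', 'she', 'we', 'they'},
    'negation': {'no', 'not', 'never'},
}

def category_scores(text):
    """Fraction of tokens falling into each dictionary category."""
    words = text.lower().split()
    return {cat: sum(w in vocab for w in words) / len(words)
            for cat, vocab in toy_dictionary.items()}

print(category_scores('You should not run any more'))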
Content features were computed based on the textual content of the question: for example, the average number of URLs, the number of previous tags, and the LIWC score for pronouns. On Twitter, we can similarly calculate the average number of hashtags a user has used in the past, and the number of links posted.
Syntactic features are computed based on the writing style of the text: for example, the number of characters, upper-case characters, lower-case characters, and digits in the text document. These can be used as they are.
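A minimal sketch of these character-level counts is given below.
def syntactic_features(text):
    """Character-level writing-style counts, usable directly as features."""
    return {
        'num_chars': len(text),
        'num_upper': sum(c.isupper() for c in text),
        'num_lower': sum(c.islower() for c in text),
        'num_digits': sum(c.isdigit() for c in text),
    }

print(syntactic_features('You should not run any more!'))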