2.2: Domain-Specific Feature Engineering
Feature engineering is one of the most basic building blocks in
training machine learning models. It can be likened to teaching children the
letters of the alphabet: once a child knows the letters, we can teach them
words. Along similar lines, we need to explicitly create features
that represent domain knowledge. Just because a pattern exists in the data does
not mean that the machine learning model will be able to learn it. It may or
may not. It is generally better to create such features explicitly. The ability
to create domain-specific features depends on how well we understand the
domain and the way the business operates on a day-to-day basis.
2.2.1 Ask Probing Questions
When starting work on a machine learning problem, it is
beneficial to interact with the end users. Two questions should be explored at
this stage, seeking clarification from the intended users where necessary.
The answers to these two questions help us understand the domain
better, and some of them can then guide us in feature engineering.
1) Try to understand how this task is done currently. If
the users do not already have a machine learning model, do they do it through a
rule-based system or manual calculations? Explore if some of these rules and
calculations can be used for creating features for the machine learning model.
2) What are the things that could impact the dependent
variable positively or negatively?
Let's take a look at a practical example from the hotel
industry. Hotels operate on a check-in date basis. Guests reserve their stay in
a hotel before arriving, and the day they arrive at the hotel is called the
"check-in" date. A revenue manager in a hotel might want to
know in advance how many rooms will be occupied in total on a
specific check-in date. This helps them price the rooms in a way that
maximizes profit.
As machine learning engineers, our goal is to
develop a model that predicts hotel occupancy for a specific future day.
Engineers who don't know how the hotel industry operates may
treat this as a univariate time series problem and build a forecasting
model accordingly. If we instead ask the revenue manager our two questions,
they will look like this:
1) How are you forecasting occupancy currently, as you do
not have a model? Even if it's gut feeling, what factors do you consider for
arriving at a conclusion?
2) What are the factors that could result in increased or
decreased occupancy for a specific check-in date?
One possible answer to the first question is that
they forecast occupancy by looking at demand so far for the particular check-in
date, and at demand for rooms on the same check-in date in the previous year.
This suggests that hotel demand is seasonal and that the demand accumulated
so far for a check-in date carries information about its final occupancy.
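The two signals mentioned in this answer can be turned into explicit features. The following is a minimal sketch in Python, not a definitive implementation. The reservation log layout, the sample numbers, and the names `rooms_on_books`, `occupancy_last_year`, and `build_features` are all assumptions made for illustration. The `days_to_check_in` feature is also an added assumption, included because demand so far is only meaningful relative to how far away the check-in date is.

```python
from datetime import date

# Hypothetical reservation log: (booking_date, check_in_date, rooms_booked).
RESERVATIONS = [
    (date(2024, 5, 1), date(2024, 7, 4), 3),
    (date(2024, 6, 10), date(2024, 7, 4), 5),
    (date(2024, 6, 28), date(2024, 7, 4), 2),
    (date(2024, 6, 1), date(2024, 7, 5), 4),
]

# Hypothetical final occupancy (rooms occupied) for past check-in dates.
PAST_OCCUPANCY = {date(2023, 7, 4): 42, date(2023, 7, 5): 38}


def rooms_on_books(reservations, check_in, as_of):
    """Rooms reserved for `check_in`, counting only bookings made by `as_of`."""
    return sum(
        rooms
        for booked_on, ci, rooms in reservations
        if ci == check_in and booked_on <= as_of
    )


def occupancy_last_year(past_occupancy, check_in):
    """Occupancy on the same calendar date one year earlier, or None if unknown."""
    try:
        prior = check_in.replace(year=check_in.year - 1)
    except ValueError:
        # Feb 29 has no counterpart in a non-leap year; fall back to Feb 28.
        prior = check_in.replace(year=check_in.year - 1, day=28)
    return past_occupancy.get(prior)


def build_features(reservations, past_occupancy, check_in, as_of):
    """One feature row for forecasting occupancy on `check_in`, as seen on `as_of`."""
    return {
        "days_to_check_in": (check_in - as_of).days,
        "rooms_on_books": rooms_on_books(reservations, check_in, as_of),
        "occupancy_last_year": occupancy_last_year(past_occupancy, check_in),
    }


features = build_features(
    RESERVATIONS, PAST_OCCUPANCY, date(2024, 7, 4), date(2024, 6, 15)
)
print(features)
```

Filtering on `booked_on <= as_of` keeps the feature limited to information that would have been available on the day the forecast is made. Matching on the same calendar date is one simple choice; matching on the same weekday of the corresponding week might suit some hotels better.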
For the second question, a possible answer is that hotel occupancy for a
specific date decreases if competing hotels reduce their prices for the same
check-in date. If the hotel has more negative reviews than
competing hotels, fewer guests will stay in the hotel. Price
affects hotel occupancy inversely. If the room price is very high, few guests
will come to stay, whereas if the price is very low, there might not be any
vacant rooms in the hotel. Online reputation directly affects hotel reservations.
Demand can also be affected if holidays or events are happening close to the
check-in date.
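The factors named in this answer (competitor pricing, relative online reputation, and nearby holidays or events) can likewise be expressed as features. The sketch below is illustrative only. The function name `market_features`, its parameters, and the idea of measuring reputation as a share of negative reviews are assumptions, and the competitor lists are assumed to be non-empty.

```python
from datetime import date
from statistics import mean


def market_features(check_in, our_price, competitor_prices,
                    our_negative_share, competitor_negative_shares,
                    event_dates):
    """Features describing price, reputation, and events for one check-in date.

    All inputs are hypothetical: prices are for the same check-in date, and
    negative shares are fractions of reviews that are negative (0.0 to 1.0).
    """
    return {
        # Above 1.0 means we are priced higher than the average competitor.
        "price_ratio": our_price / mean(competitor_prices),
        # Positive means a larger share of negative reviews than competitors.
        "negative_review_gap": our_negative_share - mean(competitor_negative_shares),
        # Distance in days to the closest holiday or event; None if no events.
        "days_to_nearest_event": min(
            (abs((event - check_in).days) for event in event_dates),
            default=None,
        ),
    }


row = market_features(
    check_in=date(2024, 7, 4),
    our_price=120,
    competitor_prices=[100, 110, 90],
    our_negative_share=0.25,
    competitor_negative_shares=[0.10, 0.20],
    event_dates=[date(2024, 6, 30), date(2024, 7, 20)],
)
print(row)
```

Expressing price and reviews relative to competitors, rather than as raw values, reflects the revenue manager's point that occupancy responds to how the hotel compares with its competitors on the same check-in date.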
2.2.2 Literature Review
Quite often, machine learning
engineers are required to solve business problems in new domains where they
lack subject matter expertise. In some of these cases, subject matter experts
(SMEs) are not accessible or access to them is limited. The reasons vary: SMEs
may have a heavy workload of their own, or they may be C-level executives with
limited time. In such situations, it can be helpful to read research papers and
books on the subject matter. In particular, try to understand the types of
features used and the way data was organized for modeling in similar machine
learning problems.
Let's take the example of predicting hotel
bookings for a check-in date. It has been argued and supported
with data [1] that a hotel's social media reputation, relative to that of
competitor hotels, has an impact on its room reservations. A
simple search on Google Scholar can surface useful research papers that
help us improve feature engineering for the project.