2.2: Domain-Specific Feature Engineering

Feature engineering is the most basic building block in training machine learning models. It can be likened to teaching the alphabet to children: once a child knows the letters, we can teach him/her words. Along similar lines, we need to explicitly create features that represent domain knowledge. Just because a pattern exists in the data does not mean that the machine learning model will be able to learn it; it may or may not. It is therefore better to create such features explicitly. The ability to create domain-specific features depends on how well one understands the domain and the way the business operates on a day-to-day basis.

2.2.1    Ask Probing Questions

When starting work on a machine learning problem, it is beneficial to interact with the end users. Two questions should be explored at this stage, and if necessary, clarification should be sought from the intended users. The answers can help us understand the domain better, and some of them can then guide feature engineering.

1) Try to understand how this task is done currently. If the users do not already have a machine learning model, do they do it through a rule-based system or manual calculations? Explore if some of these rules and calculations can be used for creating features for the machine learning model.

2) What are the things that could impact the dependent variable positively or negatively?

Let's take a look at a practical example from the hotel industry. Hotels operate on a check-in date basis. Guests reserve their stay in advance, and the day they arrive at the hotel is called the "check-in" date. A revenue manager in a hotel might want to know in advance how many rooms in total will be occupied by guests for a specific check-in date. This helps him/her price the rooms in a way that maximizes profit.

As machine learning engineers, our goal will be to develop a model that predicts hotel occupancy for a specific future day. Engineers who do not know how the hotel industry operates may assume that this is a univariate time series problem and develop a plain forecasting model. If we instead put our two questions to the revenue manager, they might look like this:

1) Since you do not have a model, how are you currently forecasting occupancy? Even if it is a gut feeling, what factors do you consider in arriving at a conclusion?

2) What are the factors that could result in increased or decreased occupancy for a specific check-in date?

A possible answer to the first question is that they forecast occupancy by looking at the demand received so far for the particular check-in date, and at the demand for rooms on the same check-in date in the previous year. This suggests that hotel demand is seasonal, and that the bookings already received for a check-in date carry information about its final occupancy.
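The two signals from the revenue manager's answer, demand so far and demand on the same date in the previous year, can be turned into explicit features. The following is a minimal sketch using only the Python standard library. The reservation records, the function names, and the feature names are all hypothetical and chosen for illustration; a real reservation system would also need to handle cancellations and multi-night stays, which are ignored here.

```python
from datetime import date

# Hypothetical reservation records: (booking_date, check_in_date, rooms_booked).
reservations = [
    (date(2023, 6, 1), date(2023, 7, 15), 2),
    (date(2023, 6, 20), date(2023, 7, 15), 3),
    (date(2023, 7, 10), date(2023, 7, 15), 1),
    (date(2024, 5, 30), date(2024, 7, 15), 4),
    (date(2024, 7, 12), date(2024, 7, 15), 2),
]


def rooms_on_the_books(records, check_in, as_of):
    """Rooms reserved for `check_in`, counting only bookings made on or before `as_of`."""
    return sum(
        rooms
        for booked_on, stay_date, rooms in records
        if stay_date == check_in and booked_on <= as_of
    )


def final_rooms(records, check_in):
    """Total rooms that ended up reserved for `check_in`."""
    return sum(rooms for _, stay_date, rooms in records if stay_date == check_in)


def same_date_previous_year(day):
    """Same calendar date one year earlier (Feb 29 falls back to Feb 28)."""
    try:
        return day.replace(year=day.year - 1)
    except ValueError:
        return day.replace(year=day.year - 1, day=28)


def build_features(records, check_in, as_of):
    """Features available on `as_of` for predicting occupancy on `check_in`."""
    return {
        "rooms_on_books": rooms_on_the_books(records, check_in, as_of),
        "days_to_check_in": (check_in - as_of).days,
        "rooms_same_date_last_year": final_rooms(
            records, same_date_previous_year(check_in)
        ),
    }


features = build_features(reservations, date(2024, 7, 15), as_of=date(2024, 6, 15))
print(features)
```

Only bookings made on or before the forecast date are counted in `rooms_on_books`, so the feature does not leak information that would be unavailable when the prediction is made. The sketch matches the previous year by calendar date, as in the revenue manager's answer; matching by day of the week instead is a design choice that would need to be checked with the users.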

For the second question, the answer could be that hotel occupancy for a specific date decreases if competing hotels reduce their prices for the same check-in date. If the hotel has more negative reviews than competing hotels, fewer guests will stay there; online reputation directly affects hotel reservations. The hotel's own price affects occupancy inversely: if the room price is very high, few guests will come to stay, whereas if the price is very low, there might not be any vacant rooms left in the hotel. Demand can also be affected if holidays or events fall close to the check-in date.
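The factors named in this answer (price relative to competitors, reviews relative to competitors, and nearby holidays or events) can each be expressed as a feature. The sketch below shows one possible encoding; the function names, the feature names, and the sample values are assumptions made for illustration and do not come from any particular dataset.

```python
from datetime import date


def price_ratio(own_price, competitor_prices):
    """Own price divided by the mean competitor price for the same check-in date.

    Values above 1 mean the hotel is pricier than its competitors.
    Assumes `competitor_prices` is non-empty.
    """
    return own_price / (sum(competitor_prices) / len(competitor_prices))


def negative_review_gap(own_negative_share, competitor_negative_shares):
    """Own share of negative reviews minus the mean competitor share.

    Positive values mean the hotel has relatively more negative reviews.
    Assumes `competitor_negative_shares` is non-empty.
    """
    mean_share = sum(competitor_negative_shares) / len(competitor_negative_shares)
    return own_negative_share - mean_share


def days_to_nearest_event(check_in, event_dates):
    """Absolute number of days between `check_in` and the closest holiday or event."""
    return min(abs((check_in - event_day).days) for event_day in event_dates)


# Hypothetical inputs for a single check-in date.
check_in = date(2024, 12, 23)
features = {
    "price_ratio": price_ratio(120.0, [100.0, 110.0, 90.0]),
    "negative_review_gap": negative_review_gap(0.25, [0.10, 0.20]),
    "days_to_nearest_event": days_to_nearest_event(
        check_in, [date(2024, 12, 25), date(2025, 1, 1)]
    ),
}
print(features)
```

Expressing price and reviews relative to competitors, rather than as raw values, encodes the comparison that the revenue manager described, so the model does not have to discover it on its own.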

2.2.2    Literature Review

Quite often, machine learning engineers are required to solve business problems in new domains where they lack subject matter expertise. In some of these cases, subject matter experts (SMEs) are not accessible, or access to them is limited. There can be many reasons for this: the SMEs may be busy with their own workload, or they may be C-level executives with limited time. In such situations, it can be helpful to read research papers and books on the subject matter. In particular, try to understand the types of features used and the way data was organized for modeling in similar machine learning problems.

Let's return to the example of predicting hotel bookings for a check-in date. It has been argued, and supported with data [1], that a hotel's social media reputation relative to competitor hotels has an impact on its room reservations. A simple search on Google Scholar can surface research papers that help us improve feature engineering for the project.