Chapter 6: Fundamentals of Feature Selection
6.1: Introduction
The objective of training a
machine learning model is to accurately predict the target variable using
features. Some features help in predicting the target variable accurately;
these can be considered 'signals'. Other features are weak at predicting the
target variable; these are called 'noise'.
If a model is trained to predict
hotel reservations for a specific hotel, we might consider the demand for
similar hotels in the vicinity as an indicator of market demand. In the absence
of demand data for competitors, we can use the room prices charged by
competitors as an indicator of market demand. Similarly, we can treat
day-of-week seasonality as another useful feature, i.e., which of the seven
days of the week the check-in falls on. Here we have two sets of features:
competitor pricing, and seven dummy variables indicating the day of the week.
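As a minimal sketch of how such features could be constructed, assuming a pandas DataFrame with a check-in date and a competitor price column (the column names here are hypothetical, not from the book's dataset), the day-of-week dummies might be created as follows:

```python
import pandas as pd

# Hypothetical booking data; the column names are illustrative only.
bookings = pd.DataFrame({
    "check_in_date": pd.to_datetime(["2023-07-01", "2023-07-02", "2023-07-03"]),
    "competitor_price": [120.0, 135.0, 110.0],
})

# Derive the day name and declare all seven days as categories so that
# get_dummies always produces seven columns, even if some days are absent.
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
bookings["day_of_week"] = pd.Categorical(
    bookings["check_in_date"].dt.day_name(), categories=day_names)

# One-hot encode the day of week into seven dummy variables.
day_dummies = pd.get_dummies(bookings["day_of_week"], prefix="dow")

# Combine competitor pricing with the day-of-week dummies.
features = pd.concat([bookings[["competitor_price"]], day_dummies], axis=1)
print(features.head())
```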
Let us consider another feature:
employee strength, the number of employees who report to work at the hotel
each day. This number is not constant and fluctuates daily, and it is unlikely
to be a good indicator of demand for hotel rooms. A smaller number of employees
on a given day might affect hotel operations, and a larger number might
increase running costs. However, neither has any significant impact on how
people book hotel rooms on a travel website. This is an example of a noise
feature.
It must be noted that some features
are weak signals and fall somewhere between 'signal' and 'noise'. For example,
demand for hotel rooms is seasonal in tourist locations: during the peak
tourist season we can expect occupancy close to 100%, whereas during the
off-season occupancy might be much lower, so demand varies from month to
month. In contrast, the day of the week, Monday through Sunday, does not give
a clear indication of occupancy levels, and its effect varies from month to
month. Here, the day of the week could be a noisy feature compared with the
month of the year of the check-in date.
There can be an interaction between
a strong signal feature and a weak signal feature: the combined presence of
both features in the model can give higher predictive power than either
feature provides on its own. Hence, our objective in feature selection is to
identify the combination of features that gives the best predictive power.
Feature selection is even more relevant for high-dimensional data, where we
need to reduce noise and identify features that are rich in signal. A smaller
number of features can significantly reduce model training time as well as
prediction time during production deployment. It also makes the model less
complex and its predictions easier to explain to a lay audience.
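The following is a purely synthetic sketch of such an interaction; the data is simulated (not the hotel example), and the XOR-style target and decision tree are chosen only to make the effect easy to see. Each feature alone predicts the target at roughly chance level, while both features together predict it almost perfectly:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Simulated data: the target depends on the two features only jointly
# (an XOR-style interaction), not on either feature alone.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

model = DecisionTreeClassifier(max_depth=3, random_state=0)

# Accuracy using each feature individually hovers around chance (~0.5),
# while using both features together is close to 1.0.
for cols, label in [([0], "feature 1 only"), ([1], "feature 2 only"),
                    ([0, 1], "both features")]:
    score = cross_val_score(model, X[:, cols], y, cv=5).mean()
    print(f"{label}: mean CV accuracy = {score:.2f}")
```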
We will be performing
cross-validation during feature selection. The methods discussed in this
chapter return a list of selected features for each cross-validation fold. For
ease of understanding, we will look at how many of the selected features are
common across all cross-validation folds.
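As a minimal sketch of this idea, using simulated data and a simple univariate selector (scikit-learn's SelectKBest, which here merely stands in for the methods discussed later in this chapter), we can collect the features selected in each fold and take their intersection:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

# Simulated high-dimensional data; the selector and the value of k are
# placeholders for the feature selection methods covered in this chapter.
X, y = make_regression(n_samples=500, n_features=30, n_informative=8,
                       noise=10.0, random_state=0)
feature_names = np.array([f"f{i}" for i in range(X.shape[1])])

selected_per_fold = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit the selector on the training portion of this fold only.
    selector = SelectKBest(score_func=f_regression, k=10)
    selector.fit(X[train_idx], y[train_idx])
    selected_per_fold.append(set(feature_names[selector.get_support()]))

# Features selected in every fold: the stable subset we will focus on.
common_features = set.intersection(*selected_per_fold)
print(f"Common across all folds: {sorted(common_features)}")
```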