Chapter 6: Fundamentals of Feature Selection

6.1: Introduction

The objective of training a machine learning model is to accurately predict the target variable using features. Some features genuinely help in predicting the target variable; these can be considered 'signals'. Other features contribute little or nothing to the prediction; these are referred to as 'noise'.

If a model is trained to predict hotel reservations for a specific hotel, we might consider the demand at similar hotels in the vicinity as an indicator of market demand. In the absence of demand data for competitors, we can use competitors' room pricing as a proxy for market demand. Similarly, we can consider weekday seasonality, i.e., the day of the week, as another useful feature. Here we have two sets of features: competitor pricing and seven dummy variables, one for each day of the week.
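As a minimal sketch of how such dummy variables can be created (assuming a pandas DataFrame with hypothetical columns competitor_price and check_in_date, which are not from the original example data):

import pandas as pd

# Hypothetical booking data: competitor pricing and check-in dates
df = pd.DataFrame({
    "competitor_price": [120.0, 135.0, 110.0, 150.0],
    "check_in_date": pd.to_datetime(
        ["2023-07-03", "2023-07-04", "2023-07-08", "2023-07-09"]
    ),
})

# Derive the day of the week from the check-in date
df["day_of_week"] = df["check_in_date"].dt.day_name()

# One dummy variable per day of the week (7 columns when all days are present)
dummies = pd.get_dummies(df["day_of_week"], prefix="dow")
features = pd.concat([df[["competitor_price"]], dummies], axis=1)
print(features)

The resulting feature matrix contains the competitor price alongside one indicator column per weekday, which is the representation discussed above.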

Let us take another feature as an example. Employee strength is the number of employees who come to work at the hotel each day. It is not constant and fluctuates daily. This information is unlikely to be a good indicator of demand for hotel rooms. Fewer employees on a given day might affect hotel operations, and a greater number of employees might increase running costs, but neither has any significant impact on how people book hotel rooms on a travel website. This is an example of a noise feature.

It must be noted that some features are weak signals and fall somewhere between 'signal' and 'noise'. For example, demand for hotel rooms is seasonal in tourist locations: during the peak tourist season we can expect bookings close to 100%, whereas during the off-season occupancy might be lower. In this situation, demand varies from month to month. In contrast, the day of the week (Monday through Sunday) does not give a clear indication of occupancy levels, since its effect changes from one month to the next. Here, the day of the week could be a noisy feature compared to the month of the check-in date.

There can be an interaction between a strong signal feature and a weak signal feature: the combined presence of both features in the model can give higher predictive power than either feature provides on its own. Hence, our objective in feature selection is to identify the combination of features that gives the best predictive power. Feature selection is even more relevant for high-dimensional data, where we need to reduce noise and identify features that are rich in signal. A smaller number of features can significantly reduce model training time as well as prediction latency in production deployment. It also makes the model less complex and its predictions easier to explain to a layperson.
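To illustrate the idea of an interaction, here is a small sketch on synthetic data (the feature names and the data-generating process are hypothetical, chosen only so that the target depends on the product of a strong and a weak feature). Adding the interaction term improves the cross-validated fit over using the two features separately:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500

# Hypothetical features: a strong signal and a weak signal
strong = rng.normal(size=n)
weak = rng.normal(size=n)
# The target depends partly on the product of the two features
y = 3 * strong + 2 * strong * weak + rng.normal(scale=0.5, size=n)

X_base = np.column_stack([strong, weak])
X_inter = np.column_stack([strong, weak, strong * weak])  # add interaction term

for name, X in [("without interaction", X_base), ("with interaction", X_inter)]:
    score = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean R^2 = {score:.3f}")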

We will perform cross-validation during feature selection. The methods discussed in this chapter return a list of selected features for each cross-validation fold. For ease of understanding, we will look at how many of the selected features are common across all folds.
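As a minimal sketch of this idea (not the exact procedure used later in the chapter), one can run a selector such as scikit-learn's SelectKBest inside each fold and then intersect the per-fold feature sets; the data below is synthetic:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import KFold

# Synthetic data: 20 features, of which 5 are informative
X, y = make_regression(n_samples=300, n_features=20, n_informative=5, random_state=42)

selected_per_fold = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    selector = SelectKBest(score_func=f_regression, k=5)
    selector.fit(X[train_idx], y[train_idx])
    selected_per_fold.append(set(np.flatnonzero(selector.get_support())))

# Features selected in every fold
common = set.intersection(*selected_per_fold)
print("Selected per fold:", selected_per_fold)
print("Common across all folds:", common)

Features that appear in the selected set of every fold are the most stable candidates to keep.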