5.3: Putting Everything Together
After learning different methods of feature engineering, let's now put our knowledge into practice and measure model performance. We will use both linear and tree-based nonlinear models for benchmarking purposes. For the linear models, we will use linear and logistic regression; for the tree-based nonlinear models, we will use Lightgbm and Xgboost.
In some cases, linear models will perform better than nonlinear models, and vice versa. We will select the model which gives the best performance and is easy to explain to a non-technical audience. These results will form the first benchmark performance. We will then try to find models that perform better than this benchmark using the methods discussed in section III.
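As a hedged illustration of this benchmarking loop (a minimal sketch, not the exact pipeline used in this book), the snippet below fits the three regressor types and compares their RMSE on a held-out split. The names X_train, X_test, y_train, and y_test are assumed placeholders for already-prepared data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

# Candidate models for a regression benchmark; for a classification
# problem the analogous set would be LogisticRegression,
# LGBMClassifier, and XGBClassifier.
models = {
    "linear": LinearRegression(),
    "lightgbm": LGBMRegressor(random_state=42),
    "xgboost": XGBRegressor(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)   # X_train, y_train: assumed prepared data
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE = {rmse:.1f}")

The model with the best metric is then weighed against how easily it can be explained before it is adopted as the benchmark.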
5.3.1 Hotel Total Room Booking
Let's try to understand the hotel total room prediction data. It is a regression problem in which the total occupancy for hotels on a specific check-in date is the dependent variable. We tried Lightgbm, Xgboost, and linear regression; Lightgbm regression gave the best performance. Figure 5.3.1 shows the performance of the Lightgbm tree model on cross-validation test, validation, and external test data.
The average RMSE for the cross-validation test and validation data together is 16.4. The average RMSE for the external test data is 12.9. This model predicts the total number of rooms that will be sold for a future check-in date, from which the hotel property manager can gauge market demand and set the price for the unsold rooms.
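A minimal sketch of how such per-fold and average RMSE figures can be produced, assuming X and y are NumPy arrays holding the engineered features and the total occupancy (the generic KFold scheme here is illustrative, not necessarily the exact splitting behind these numbers):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_rmse = []
for train_idx, test_idx in kf.split(X):
    model = LGBMRegressor(random_state=42)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y[test_idx], preds)))

# Average RMSE over the folds, comparable to the 16.4 reported above.
print("per-fold RMSE:", np.round(fold_rmse, 1))
print("average RMSE:", round(float(np.mean(fold_rmse)), 1))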
For any prediction of total occupancy, RMSE indicates the extent to which the prediction might be in error. An RMSE of 16.4 means predictions might be off by roughly 16 rooms more or 16 rooms fewer. On the external test data, the model performs better, at an RMSE of 12.9. There are three issues with this model. The first issue is that RMSE differs considerably across the different test sets. The second issue is that, within the external test data, RMSE differs for each cross-validation fold. The third issue is the high RMSE itself: we would like a model with the lowest RMSE possible. If we can get a model with a lower RMSE, it will be easier to convince the stakeholders to use the model.
These three issues make the model unreliable to use. We will try to use feature selection to reduce the noise in the model. We will discuss different methods for feature selection across the chapters of section III.
Figure 5.3.1: Performance of the Lightgbm tree model on cross-validation test, validation, and external test data for hotel total room booking prediction.
5.3.2 Hotel Booking Cancellation
Let's try to understand the model performance for the hotel booking cancellation data. It is a classification problem in which 1 means canceled and 0 means not canceled. We tried Lightgbm, Xgboost, and logistic regression models for classification.
We are using the precision score to assess model performance. Overbooking is a practice wherein hotels sometimes sell more rooms than they have available. As a result, a situation can arise where more guests arrive at the hotel on the check-in date than there are rooms to accommodate them. On the other hand, many guests cancel their bookings before the check-in date. This leads to the opposite problem, wherein hotels are left with an unsold inventory of rooms.
If we can develop a model that identifies, with a high degree of precision, the bookings most likely to be canceled, hotels can minimize the loss from unsold inventory and increase profit by engaging in overbooking.
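As an illustrative sketch (not the book's exact pipeline), the snippet below scores an Xgboost cancellation classifier by precision and shows how raising the probability threshold trades recall for precision, which suits the overbooking use case where a false "will cancel" flag is costly. X_train, X_test, y_train, and y_test are assumed prepared splits.

from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score

clf = XGBClassifier(random_state=42)
clf.fit(X_train, y_train)   # assumed prepared splits

preds = clf.predict(X_test)   # default 0.5 probability threshold
print("precision:", precision_score(y_test, preds))
print("recall:", recall_score(y_test, preds))

# Flag a booking as a cancellation only when the model is confident;
# a higher threshold typically raises precision at the cost of recall.
proba = clf.predict_proba(X_test)[:, 1]
confident = (proba >= 0.8).astype(int)
print("precision at 0.8 threshold:", precision_score(y_test, confident))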
Xgboost had the highest precision, followed by Lightgbm, although Xgboost had lower recall than Lightgbm. Figure 5.3.2 shows the performance of the Xgboost tree model on cross-validation test, validation, and external test data for hotel booking cancellation.
The Xgboost classifier has a few issues. First, the precision values do not rise above 0.9 on either test dataset. Second, the precision score varies across cross-validation samples; for example, the external test data has its lowest precision score for the first cross-validation training data and its highest for the third. Third, although the precision scores are close to each other, there is a difference of 0.04 between the external test data and the combination of cross-validation test and validation data.
Figure 5.3.2: Performance of the Xgboost tree model on cross-validation test, validation, and external test data for hotel booking cancellation.
We will try the different methods of feature selection discussed across the chapters of section III to help resolve the noise issue and identify features that are high in signal.
5.3.3 Car Sales
Let's try to understand the model performance for the car sales data. It is a regression problem in which the selling price of the car is the dependent variable. We tried linear regression, Lightgbm, and Xgboost regressors to model the car price. The dependent variable holds prices of used cars in Indian rupees.
Lightgbm performed better than the other modeling techniques. Figure 5.3.3 shows the performance of Lightgbm for the combination of cross-validation test and validation data, as well as for the external test data. RMSE for the linear model is unreasonably high. For the Lightgbm model, RMSE came to 406,737.9 for the combined cross-validation test and validation data, and 264,851.9 for the external test data. This means that while predicting the price of a used car, the model can err on average by a little more than 400,000 Indian rupees. At an exchange rate of 1 USD = 80 Indian rupees, that is an error of about $5,000 (406,737.9 / 80 ≈ 5,084). This is a very high error margin for a model that is trying to predict the price of used cars.
There are two more problems. First, RMSE is not consistent across the two test datasets. A good model should generalize well; here, the model cannot generalize equally to the test, validation, and external test data. Second, RMSE varies hugely across the different test samples in cross-validation, and this is true even for the external test data. The model with the provided set of features is thus unable to predict reliably on the external test data across the cross-validation folds.
Figure 5.3.3: Performance of the Lightgbm tree model on cross-validation test, validation, and external test data for used car price prediction.
We will try to reduce the noise in the features with the help of the feature selection techniques in the next section. The aim is to develop a model that can predict with a smaller, acceptable RMSE.
5.3.4 Coupon Recommendation
We used Lightgbm, Xgboost, and logistic regression models and checked precision and recall. Lightgbm performed the best; its results for the coupon recommendation dataset are presented in figure 5.3.4.
Figure 5.3.4: Performance of the Lightgbm tree model on cross-validation test, validation, and external test data for coupon recommendation.
Precision for the combined cross-validation test and validation data is close to 70, whereas for the external test data it is 72.44. Recall for the two datasets is 74.5 and 77.9, respectively. Neither the precision nor the recall is satisfactory for this model. Our goal for this dataset will be to remove noise from the data so that we can improve both precision and recall.
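A hedged sketch of computing these two metrics for a Lightgbm classifier on both evaluation sets; the variable names (X_train, y_train, X_valid, y_valid, X_ext, y_ext) are illustrative placeholders rather than the book's own.

from lightgbm import LGBMClassifier
from sklearn.metrics import precision_score, recall_score

clf = LGBMClassifier(random_state=42)
clf.fit(X_train, y_train)   # assumed prepared training split

# Compare the validation-side data with the external test data,
# mirroring the gap discussed above (precision near 70 vs. 72.44).
for name, X_eval, y_eval in [
    ("validation", X_valid, y_valid),
    ("external test", X_ext, y_ext),
]:
    preds = clf.predict(X_eval)
    print(f"{name}: precision={precision_score(y_eval, preds):.4f}, "
          f"recall={recall_score(y_eval, preds):.4f}")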
In the next section of this book, we will try to use feature selection to find features that are high in signal and to remove features that are high in noise.