6.5: Putting Everything Together
We tried all the methods discussed in this chapter on the four datasets. In this section, we will look at the best results achieved for each dataset.
Certain feature selection methods are computationally expensive, and some are outright computationally prohibitive. The backward and stepwise variants of sequential feature selection fall into the prohibitive category. We tried executing these methods on an Intel i7 machine with 64 GB of RAM. For the linear and Xgboost models, execution often did not complete even after 48 hours. At that point, we stopped the Jupyter notebook cell and moved on to the next method. This criterion was applied uniformly: any method that ran for more than 48 hours was stopped.
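To make the cost concrete, the sketch below shows one way backward sequential selection can be set up with scikit-learn's SequentialFeatureSelector wrapped around an Xgboost model. The dataset, feature counts, and hyperparameters are illustrative placeholders, not the actual configuration used in this chapter.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from xgboost import XGBRegressor

# Illustrative stand-in data; the real datasets are far larger.
X, y = make_regression(n_samples=1000, n_features=50, random_state=42)

# Backward selection starts from all features and drops one per step,
# refitting the model with cross-validation for every candidate removal.
# The number of model fits grows roughly quadratically with the feature
# count, which is why this can run past any reasonable time budget.
selector = SequentialFeatureSelector(
    XGBRegressor(n_estimators=100),
    n_features_to_select=10,
    direction="backward",
    cv=3,
    n_jobs=-1,
)
selector.fit(X, y)
print("Selected feature mask:", selector.get_support())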
6.5.1 Hotel Total Room Booking
We tried different models and feature selection methods. Of all the combinations, Xgboost regression used with the filter method gave the best performance. For the cross-validation test and validation data, the RMSE was observed to be 13.9, while the RMSE for the external test data was 5.9. The detailed results for each cross-validation can be seen in figure 6.5.1.
Figure 6.5.1: Performance of the Xgboost tree model with the filter method for feature selection on cross-validation test, validation, and external test data for hotel total room booking prediction.
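As a rough sketch of this winning pipeline, the code below chains a univariate filter (SelectKBest with f_regression, one common filter choice) with an Xgboost regressor and reports RMSE per cross-validation fold. The data, the value of k, and the model settings are illustrative assumptions, not the exact setup behind figure 6.5.1.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

# Illustrative stand-in for the hotel booking data.
X, y = make_regression(n_samples=2000, n_features=40, noise=10, random_state=0)

pipe = make_pipeline(
    SelectKBest(f_regression, k=15),                   # filter step
    XGBRegressor(n_estimators=300, learning_rate=0.1)  # model step
)

# The scorer returns negative RMSE, so flip the sign per fold.
rmse = -cross_val_score(pipe, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", np.round(rmse, 2))
print("Mean RMSE:", round(rmse.mean(), 2))

Keeping the filter inside the pipeline means the feature scores are recomputed within each fold, so no information from the held-out fold leaks into the selection step.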
These results are an improvement over those reported in chapter 5, but there are still a few areas for improvement. Firstly, the difference in results between the two test datasets is quite large, and the results fluctuate drastically across cross-validations, as can be seen by comparing the first and third cross-validation results on the external test data. Secondly, despite being better than those in chapter 5, the RMSE values are still quite high. We will need to try other feature selection methods to see if the results improve.
6.5.2 Hotel Booking Cancellation
We tried different models and feature selection methods. Of all the combinations, the Xgboost classifier used with the filter method gave the best performance. For the cross-validation test and validation data, precision was recorded at 0.835, and at 0.877 for the external test data. The detailed results for each cross-validation can be seen in figure 6.5.2.
Figure 6.5.2: Performance of the Xgboost tree model with the filter method for feature selection on cross-validation test, validation, and external test data for hotel booking cancellation prediction.
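The classification analogue follows the same filter-then-model pattern. The sketch below uses mutual information as one possible filter and scores the Xgboost classifier on precision per fold; all names and parameters are illustrative, not the configuration behind figure 6.5.2.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Illustrative stand-in for the cancellation data (binary target).
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10,
                           random_state=0)

pipe = make_pipeline(
    SelectKBest(mutual_info_classif, k=12),  # filter step
    XGBClassifier(n_estimators=200),         # model step
)

precision = cross_val_score(pipe, X, y, cv=5, scoring="precision")
print("Precision per fold:", precision.round(3))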
The precision achieved in chapter 5 was better than what the filter method obtained here, although recall improved slightly compared to the previous results. Even so, we still need to improve both precision and recall on both test datasets, and bring their performance to a similar level. Finally, the results in the first cross-validation are poorer than in the rest of the cross-validations. All of these points suggest trying other feature selection methods to find a better solution.
6.5.3 Car Sales
We tried different models and feature selection methods. Of all the combinations, Lightgbm regression used with the filter method gave the best performance. For the cross-validation test and validation data, the RMSE was observed to be 398,263, and 230,356 for the external test data. The detailed results for each cross-validation can be seen in figure 6.5.3.
Figure 6.5.3: Performance of the Lightgbm tree model with the filter method for feature selection on cross-validation test, validation, and external test data for used car price prediction.
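The evaluation scheme used throughout this chapter scores a pipeline both on cross-validation folds and on a held-out external test set. The sketch below mirrors that scheme with Lightgbm; the split sizes, filter choice, and hyperparameters are illustrative assumptions, not the settings behind figure 6.5.3.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from lightgbm import LGBMRegressor

# Illustrative stand-in for the car sales data.
X, y = make_regression(n_samples=3000, n_features=50, noise=20, random_state=1)

# Hold out an "external" test set that never enters cross-validation.
X_train, X_ext, y_train, y_ext = train_test_split(X, y, test_size=0.2,
                                                  random_state=1)

pipe = make_pipeline(
    SelectKBest(f_regression, k=20),
    LGBMRegressor(n_estimators=300, learning_rate=0.05),
)

cv_rmse = -cross_val_score(pipe, X_train, y_train, cv=5,
                           scoring="neg_root_mean_squared_error")
pipe.fit(X_train, y_train)
ext_rmse = np.sqrt(mean_squared_error(y_ext, pipe.predict(X_ext)))
print("CV RMSE per fold:", np.round(cv_rmse, 2))
print("External test RMSE:", round(ext_rmse, 2))

Comparing the per-fold numbers against the external figure is exactly how the inconsistency discussed next becomes visible.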
The results obtained through the filter method are marginally better than those presented in chapter 5, but they suffer from the same issues we saw there: results are inconsistent across the different test datasets, and also across the different cross-validations. If we had to use this model, we could not say with confidence that it would generalize to new, unseen data as well as it does on the test, validation, and external test data. Also, the RMSE values are far too high for car price predictions to be considered acceptable. Since these results are not good enough for the model to be usable, we will try other feature selection methods in subsequent chapters.
6.5.4 Coupon Recommendation
We tried different models and feature selection methods. Of all the combinations, the Xgboost classifier used with the filter method gave the best performance. For the cross-validation test and validation data, precision was recorded at 0.708, and at 0.755 for the external test data, although recall worsened slightly compared to the previous results. The detailed precision results for each cross-validation can be seen in figure 6.5.4 below.
Figure 6.5.4: Performance of the Xgboost tree model with the filter method for feature selection on cross-validation test, validation, and external test data for coupon recommendation.
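Because precision and recall moved in opposite directions here, it is worth tracking both metrics per fold. The sketch below does so with cross_validate; the chi-squared filter (with a min-max scaler in front, since chi2 requires non-negative inputs) and all parameters are illustrative choices, not the exact setup used for this dataset.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

# Illustrative stand-in for the coupon recommendation data.
X, y = make_classification(n_samples=2000, n_features=25, n_informative=8,
                           random_state=7)

pipe = make_pipeline(
    MinMaxScaler(),            # chi2 needs non-negative features
    SelectKBest(chi2, k=10),   # filter step
    XGBClassifier(n_estimators=200),
)

scores = cross_validate(pipe, X, y, cv=5, scoring=["precision", "recall"])
print("Precision per fold:", scores["test_precision"].round(3))
print("Recall per fold:   ", scores["test_recall"].round(3))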
This model suffers from the same issues as its predecessor in chapter 5: precision and recall are not good enough, and the results on the external test data are not consistent with those on the cross-validation test and validation data. We will try other feature selection methods in the following chapters to see if any improvement is possible.