7.9: Putting Everything Together

For the four datasets, we will apply all the feature selection methods discussed in the book. In this section, we present the best results obtained for each dataset across the modeling techniques and feature selection methods in this chapter, and we briefly compare that best performance against the best performance obtained in previous chapters. As before, we use cross-validation, and the methods discussed in this chapter return a list of selected features for each cross-validation fold. For ease of understanding, we will also look at how many of the selected features are common across all cross-validation folds.
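As a minimal sketch of this bookkeeping, the snippet below counts the features that appear in every fold's selection. The variable fold_selected_features and the column names inside it are hypothetical placeholders for whatever the selection method returns in each cross-validation run.

# Hypothetical per-fold selections; in practice these come from the
# feature selection method run inside each cross-validation fold.
fold_selected_features = [
    ["lead_time", "adr", "total_of_special_requests"],   # fold 1
    ["lead_time", "adr", "arrival_date_week_number"],    # fold 2
    ["lead_time", "adr", "total_of_special_requests"],   # fold 3
]

# Features selected in every cross-validation fold.
common = set.intersection(*map(set, fold_selected_features))
print(f"{len(common)} features common to all folds: {sorted(common)}")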

For the tree-based models, we have considered the top 90 percent of features by feature importance. For linear regression, we have kept the top 95 percent of features based on the beta coefficients.
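One way to implement this cutoff, shown here as a rough sketch rather than the book's exact code, is to rank features by their absolute score and keep the smallest set that covers the desired share of the total importance or coefficient mass (applying the cutoff to the top 90 percent of features by count is an equally valid reading). The helper top_fraction_by_score is a hypothetical name, and feature_importances_ / coef_ refer to the standard scikit-learn-style attributes.

import numpy as np

def top_fraction_by_score(feature_names, scores, fraction=0.90):
    """Keep the smallest set of features whose absolute scores cover
    the given fraction of the total score mass."""
    scores = np.abs(np.asarray(scores, dtype=float))
    order = np.argsort(scores)[::-1]                      # highest score first
    cumulative = np.cumsum(scores[order]) / scores.sum()  # cumulative share of the mass
    keep = order[: int(np.searchsorted(cumulative, fraction)) + 1]
    return [feature_names[i] for i in keep]

# Tree models: top 90 percent of the importance; linear regression: top 95 percent of |beta|.
# selected_tree   = top_fraction_by_score(X.columns.tolist(), tree_model.feature_importances_, 0.90)
# selected_linear = top_fraction_by_score(X.columns.tolist(), linear_model.coef_, 0.95)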

7.9.1 Hotel Total Room Booking

We tried different models and feature selection methods. Of all the methods, Lightgbm regression, used with feature importance selection for the top 90 percent of features, gave the best performance. On the cross-validation test and validation data, the RMSE was 17.9, and on the external test data it was 12.3. The detailed results for each cross-validation fold can be seen in figure 7.9.1.


Figure 7.9.1 Performance of the Lightgbm tree model with feature importance feature selection on cross-validation test, validation, and external test data for hotel total room booking prediction

These results are worse than those presented in chapter 6. In addition, they are inconsistent across the different cross-validation folds and across the test and validation sets. Hence, we will discard this method.
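For reference, a self-contained sketch of this kind of per-fold evaluation is shown below. It uses synthetic data in place of the hotel booking table and a plain cumulative-importance cutoff, so the numbers it prints will not match figure 7.9.1; it only illustrates the Lightgbm-plus-feature-importance pipeline evaluated here.

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the hotel total room booking data.
X, y = make_regression(n_samples=600, n_features=25, noise=15.0, random_state=0)

fold_rmse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit once on all features to obtain importances for this fold.
    base = LGBMRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    imp = base.feature_importances_.astype(float)
    order = np.argsort(imp)[::-1]
    cum = np.cumsum(imp[order]) / imp.sum()
    keep = order[: int(np.searchsorted(cum, 0.90)) + 1]   # top 90% of importance mass

    # Refit on the selected features only and score the fold.
    model = LGBMRegressor(random_state=0).fit(X[train_idx][:, keep], y[train_idx])
    preds = model.predict(X[test_idx][:, keep])
    fold_rmse.append(mean_squared_error(y[test_idx], preds) ** 0.5)

print("per-fold RMSE:", np.round(fold_rmse, 1))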

7.9.2 Hotel Booking Cancellation

We tried different models and feature selection methods. Of all the methods, the Xgboost classifier, used with Boruta feature selection, gave the best performance. On the cross-validation test and validation data, precision was 0.835, and on the external test data it was 0.881. The detailed results for each cross-validation fold can be seen in figure 7.9.2.

The results obtained are very similar to those obtained in the previous chapter. Although precision improved marginally, recall worsened from its previous level. There is no real advantage in these results, as the model suffers from the same inadequacies as the model in chapter 6. In such a scenario, it is up to the judgment of the analyst whether the solution should be accepted or whether we should keep searching for other solutions. In our case, we would still like to try feature selection using metaheuristic techniques in chapter 8.

Figure 7.9.2 Performance of the Xgboost tree model with the Boruta method for feature selection on cross-validation test, validation, and external test data for hotel booking cancellation prediction
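A minimal sketch of the Boruta-plus-Xgboost combination is shown below. It assumes the commonly used BorutaPy implementation from the boruta package and uses synthetic data in place of the cancellation table, so the printed scores are illustrative rather than the values reported above.

from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic stand-in for the hotel booking cancellation data.
X, y = make_classification(n_samples=1000, n_features=30, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Boruta compares each feature against shuffled "shadow" copies using the
# estimator's feature importances and keeps only the confirmed features.
selector = BorutaPy(XGBClassifier(eval_metric="logloss"), n_estimators="auto", random_state=0)
selector.fit(X_train, y_train)

# Refit the classifier on the confirmed features only and score the held-out data.
clf = XGBClassifier(eval_metric="logloss").fit(selector.transform(X_train), y_train)
preds = clf.predict(selector.transform(X_test))
print("precision:", round(precision_score(y_test, preds), 3),
      "recall:", round(recall_score(y_test, preds), 3))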

7.9.3 Car Sales

Lasso regression performed the best for the car sales data. For the cross-validation test and validation data, the RMSE was 233161, whereas for the external test data it was 260101. This is better than the results obtained in chapter 6.

Figure 7.9.3 shows the model performance across the different cross-validation folds. With Lasso feature selection, the gap between the results on the external test data and those on the other test and validation data is smaller than in the previously achieved results. The model still suffers from two issues. First, the RMSE is still above acceptable limits, as an error margin of over 200000 Indian rupees is still very high. Second, the RMSE is not very consistent across the cross-validation folds. Hence, we need further improvements to find a model with an acceptable RMSE.

Figure 7.9.3 Performance of the Lasso regression model on cross-validation test, validation, and external test data for used car price prediction
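The sketch below shows the general shape of this approach, with Lasso doing the selection through its L1 penalty. The synthetic data and the alpha value stand in for the used car table and whatever regularisation strength was actually tuned, so the printed RMSE is only illustrative.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the used car price data.
X, y = make_regression(n_samples=800, n_features=40, n_informative=10, noise=25.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale before Lasso so the L1 penalty treats all features comparably.
scaler = StandardScaler().fit(X_train)
lasso = Lasso(alpha=1.0).fit(scaler.transform(X_train), y_train)

selected = np.flatnonzero(lasso.coef_)            # features with non-zero coefficients
preds = lasso.predict(scaler.transform(X_test))
rmse = mean_squared_error(y_test, preds) ** 0.5
print(f"{selected.size} features retained, test RMSE = {rmse:.1f}")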

7.9.4 Coupon Recommendation

We tried different models and feature selection methods. Of all the methods, the Xgboost classifier, used with feature importance selection for the top 90 percent of features, gave the best performance. On the cross-validation test and validation data, precision was 0.700, and on the external test data it was 0.753. These results are worse than the best performance recorded in chapter 6. In addition, recall worsened slightly compared with the previous results. The detailed precision results for each cross-validation fold can be seen in figure 7.9.4.

We would like to try the metaheuristic feature selection methods to see if they can bring any improvement.

Figure 7.9.4 Performance of the Xgboost tree model with feature importance feature selection for the top 90 percent of features on cross-validation test, validation, and external test data for the coupon recommendation dataset
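As a rough sketch of how per-fold precision and recall can be tracked for such a classifier, the snippet below runs an Xgboost model under stratified cross-validation on synthetic stand-in data. The feature-importance cutoff described above is omitted for brevity, and the printed scores are illustrative only.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from xgboost import XGBClassifier

# Synthetic stand-in for the coupon recommendation data.
X, y = make_classification(n_samples=1200, n_features=30, n_informative=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(XGBClassifier(eval_metric="logloss"), X, y,
                        cv=cv, scoring=["precision", "recall"])

print("per-fold precision:", np.round(scores["test_precision"], 3))
print("per-fold recall:   ", np.round(scores["test_recall"], 3))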