9.4: Putting Everything Together
After understanding the different methods of model explainability, let us now apply them to the hotel total room booking and hotel booking cancellation datasets.
9.4.1 Hotel Total Room Booking
We tried all four methods of explaining the Lightgbm model: the partial dependence plot, the accumulated local effects plot, permutation feature importance, and a surrogate model. For both datasets, we were able to achieve an acceptable level of model performance through metaheuristic feature selection.
We tried a surrogate linear regression model. However, its RMSE was more than 30, so we will not use a surrogate model to understand the Lightgbm regression model. Permutation feature importance is the easiest method to interpret, as it ranks the features in decreasing order of importance to the model. Let us start with this method; the result can be seen in figure 9.4.1.1.
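A plot like the one in figure 9.4.1.1 can be computed with scikit-learn's permutation_importance. The sketch below is illustrative only: it assumes a fitted, scikit-learn-compatible regressor `model` (for example, lightgbm.LGBMRegressor) and a held-out set `X_test`, `y_test` (hypothetical names, not taken from the chapter's code).

```python
# Minimal sketch: permutation feature importance for a fitted regressor.
# `model`, `X_test`, and `y_test` are assumed to exist (hypothetical names).
import numpy as np
from sklearn.inspection import permutation_importance

result = permutation_importance(
    model, X_test, y_test,
    scoring="neg_root_mean_squared_error",  # importance = drop in RMSE
    n_repeats=10,                           # shuffle each feature 10 times
    random_state=42,
)

# List features in decreasing order of mean importance
order = np.argsort(result.importances_mean)[::-1]
for idx in order:
    print(X_test.columns[idx], round(result.importances_mean[idx], 4))
```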
Figure 9.4.1.1 Permutation feature importance plot of Lightgbm regression model for the hotel total room booking dataset.
The first feature is a higher-order feature of the cumulative rooms sold for the hotel, for a specific check-in date, at a given lead time. Total rooms sold is also affected by seasonality, as evidenced by the second most important feature, the encoded month of the year.
To a layman, we can explain that the sold-room inventory for a check-in date and the monthly seasonality of booking demand have the biggest impact on total room demand for that check-in date. Being able to explain the model in layman's terms helps the machine learning engineer convince users to adopt the model.
We will now look at the partial
dependence plot for the Lightgbm regression model in figure 9.4.1.2.
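A partial dependence plot of this kind might be produced with scikit-learn, as in the sketch below. The second feature name is an assumption for illustration; the first, CumulativeNumberOfRoomsNet_Quartile_Encoded, appears later in this section.

```python
# Minimal sketch: partial dependence plots for selected features.
# `model` and `X_test` are assumed to exist (hypothetical names).
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

features = [
    "CumulativeNumberOfRoomsNet_Quartile_Encoded",  # used later in this section
    "MonthOfYear_Encoded",                          # hypothetical column name
]
PartialDependenceDisplay.from_estimator(model, X_test, features)
plt.tight_layout()
plt.show()
```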
Figure 9.4.1.2 Partial dependence plot of Lightgbm regression model for the hotel total room booking dataset.
The most impactful features have the sharpest curves. The most important is the subplot in the second row, first column, which shows a higher-order feature of the cumulative number of net rooms sold; the relationship is almost linear. The second most impactful feature is the last subplot, in the third row and third column, an encoded version of the month feature, with months encoded from 0 to 11. Since this is a categorical feature, it is not appropriate to draw a conclusion from the shape of the relationship. We can, however, observe the different levels of hotel reservations across months. Now let us look at the accumulated local effects plot for the same dataset in figure 9.4.1.3.
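The chapter does not name a specific library for accumulated local effects; one possible choice is the alibi package, as in the hedged sketch below, which assumes a fitted regressor `model`, a NumPy array `X_test`, and a list `feature_names` (hypothetical names).

```python
# Minimal sketch: accumulated local effects with the alibi library
# (one possible choice of library, not necessarily the one used here).
from alibi.explainers import ALE, plot_ale

ale = ALE(model.predict, feature_names=feature_names)
exp = ale.explain(X_test)   # ALE curves for every feature
plot_ale(exp, n_cols=3)     # grid of per-feature ALE subplots
```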
Figure 9.4.1.3 Accumulated local effects plot of Lightgbm regression model for the hotel total room booking dataset.
The accumulated local effects plot explains the feature effects after accounting for the correlation among features. We can see that the most important feature is the same as in the partial dependence plot. For the second most important feature, the extent of the impact is smaller, but it still shows the second-sharpest changes across the different values of the feature.
Beyond the inferences drawn from the three plots individually, we see that there is very little difference between the partial dependence plot and the accumulated local effects plot. The accumulated local effects plot has one advantage: it overcomes the weakness of the partial dependence plot, which is unreliable when features are correlated. Although the partial dependence and accumulated local effects plots carry more information than the permutation feature importance plot, the latter is more legible and easier to read when the model has a huge number of features.
For the hotel booking cancellation dataset, we will restrict the overall model explanation to the accumulated local effects plot and permutation feature importance. For the rest of this section, we will discuss explaining individual predictions for the hotel total room booking dataset.
Figure 9.4.1.4 Individual
Conditional Expectation plot of Lightgbm regression model for the hotel total
room booking dataset for the first 10 rows of external test data.
The ICE plot in figure 9.4.1.4 suggests a degree of non-linearity between the feature 'CumulativeNumberOfRoomsNet_Quartile_Encoded' and the dependent variable. In many cases, the total number of bookings increases as we move towards the higher quartiles of the number of net cumulative rooms sold. In some cases, however, it decreases after increasing for a short while. Hence, the relationship could be non-linear.
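An ICE plot for the first 10 rows can also be generated with scikit-learn; the sketch below assumes a fitted regressor `model` and a DataFrame `X_ext` holding the external test data (hypothetical names).

```python
# Minimal sketch: ICE curves for the first 10 external test rows.
# `model` and `X_ext` are assumed to exist (hypothetical names).
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

PartialDependenceDisplay.from_estimator(
    model,
    X_ext.head(10),                                   # first 10 rows only
    ["CumulativeNumberOfRoomsNet_Quartile_Encoded"],  # feature from the text
    kind="individual",                                # ICE instead of PDP
)
plt.show()
```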
Let us now look at the LIME interpretation of a single row of data, taken from the 4th index of the external test data, as displayed in figure 9.4.1.5.
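A LIME explanation of this kind might be produced as in the sketch below, assuming a fitted regressor `model` and DataFrames `X_train` and `X_ext` (hypothetical names).

```python
# Minimal sketch: LIME explanation for the row at index 4.
# `model`, `X_train`, and `X_ext` are assumed to exist (hypothetical names).
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    mode="regression",
)
exp = explainer.explain_instance(
    X_ext.iloc[4].values,   # the 4th-index row discussed in the text
    model.predict,
    num_features=10,        # restrict the display to the top 10 features
)
exp.show_in_notebook(show_table=True)
```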
Figure 9.4.1.5 LIME plot of Lightgbm
regression model for the hotel total room booking dataset for the 4th row of
external test data.
The above plot has three parts. Let us understand the first part. It displays the predicted value, with higher values in orange and smaller values in blue; the model's prediction of 189.01 is a high value. The second part shows a positive or negative relationship indicator for each feature. For example, for the DayOfWeek_Encoded feature, demand for total rooms increases for days that are farther from Monday. Similarly, the AdjustedLeadTimeCumulativeNumberOfRoomsNet_Quartile_Encoded feature, which is the interaction between lead time and the net number of rooms quartile feature, has a negative relationship with total room demand. The second part of the plot also shows the current value of each feature against a threshold identified by LIME. For example, the DayOfMonth_Encoded feature has a negative relationship with the total rooms sold for a check-in date; that is, more rooms are sold towards the beginning of the month, and sales gradually decrease as the month passes. Here the value is 20, which is higher than the threshold of 15, so the check-in date for which the model has predicted falls later in the month.
The third part of the plot is a table that simply marks each feature in orange or blue, depending on whether the feature has a positive or negative relationship with the dependent variable. The second column of the table shows the actual value of each feature for the specific row.
Now let us look at the counterfactual model explanation for the same observation in the 4th row, in figure 9.4.1.6.
Figure 9.4.1.6 Counterfactual plot
of Lightgbm regression model for the hotel total room booking dataset for the
4th row of external test data.
We can clearly see that the encoded month of the year feature has the highest impact, as identified by the counterfactual plot: a small change in the value of the month can bring a drastic change in the model prediction. This is followed by the AdjustedLeadTimeCumulativeRevenueNet_Quartile_Encoded feature.
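The chapter does not name its counterfactual tool; one common choice is the dice-ml library. The sketch below is an assumption-laden illustration: `train_df`, `continuous_cols`, `X_ext`, the outcome name 'TotalRooms', and the desired range are all hypothetical.

```python
# Minimal sketch: counterfactuals with dice-ml (one possible library choice).
# All names and the desired range below are hypothetical.
import dice_ml

data = dice_ml.Data(
    dataframe=train_df,                   # features plus the target column
    continuous_features=continuous_cols,  # list of numeric feature names
    outcome_name="TotalRooms",            # hypothetical target name
)
m = dice_ml.Model(model=model, backend="sklearn", model_type="regressor")
explainer = dice_ml.Dice(data, m)

cf = explainer.generate_counterfactuals(
    X_ext.iloc[[4]],           # the 4th-index row as a one-row frame
    total_CFs=3,               # ask for three counterfactual examples
    desired_range=[220, 260],  # illustrative target interval for the prediction
)
cf.visualize_as_dataframe(show_only_changes=True)
```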
Now let us look at the SHAP model explanation for the same observation in the 4th row, in figure 9.4.1.7.
Figure 9.4.1.7 SHAP plot of Lightgbm
regression model for the hotel total room booking dataset for the 4th row of
external test data.
This plot is a simple and easy-to-understand explanation of the prediction for the 4th row of data. The plot ranks the extent of the impact each feature had on this specific prediction. The month of the year and the day of the week have the most impact on the prediction for the 4th row. This indicates a strong trend and seasonality impact for this check-in date.
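A per-row SHAP plot like figure 9.4.1.7 can be generated with the shap library; the sketch below assumes a fitted tree-based regressor `model` and a DataFrame `X_ext` (hypothetical names).

```python
# Minimal sketch: SHAP explanation for the row at index 4.
# `model` and `X_ext` are assumed to exist (hypothetical names).
import shap

explainer = shap.TreeExplainer(model)  # exact SHAP values for tree models
shap_values = explainer(X_ext)         # one Explanation row per observation
shap.plots.waterfall(shap_values[4])   # feature contributions for row 4
```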
9.4.2 Hotel Booking Cancellation
We will look at permutation feature importance and the accumulated local effects plot for the overall model explanation in this section. We tried creating a logistic regression surrogate model; however, its precision on the external test data was found to be very low, at 0.19. Hence, we will not use the surrogate model explanation for the hotel booking cancellation dataset.
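For reference, a global surrogate of this kind can be built by fitting a simple model to the black-box model's predicted labels, as in the sketch below; `clf`, `X_train`, `X_ext`, and `y_ext` are hypothetical names.

```python
# Minimal sketch: a logistic regression surrogate for a black-box classifier.
# `clf`, `X_train`, `X_ext`, and `y_ext` are assumed to exist (hypothetical).
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

surrogate = LogisticRegression(max_iter=1000)
surrogate.fit(X_train, clf.predict(X_train))  # mimic the black-box labels

# Judge the surrogate by its precision on the external test data
print(precision_score(y_ext, surrogate.predict(X_ext)))
```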
Between the partial dependence plot and the accumulated local effects plot, the latter is more robust as it considers the correlation among features, so we will discuss the latter. As the number of features in the model is quite high, we will restrict our model explanation to the topmost features. Let us now look at figure 9.4.2.1 for the top 7 features, ranked by the variation each feature shows with respect to the dependent variable.
Figure 9.4.2.1 Accumulated local
effects plot of Xgboost classification model for the hotel booking cancellation
dataset
We can see from the plot that lead time, followed by average daily rate (ADR), makes a huge impact on the dependent variable. This is confirmed by the large variation in the plot, as well as by the distinctly visible scatter of data points.
Let us now look at permutation feature importance for the top 40 features in figure 9.4.2.2.
Figure 9.4.2.2 Permutation feature
importance plot of Xgboost classification model for the hotel booking
cancellation dataset
Permutation feature importance suggests that country is the biggest contributor to booking cancellation in the model, followed by lead time and agent. Guests from certain countries, as well as reservations through certain agents, are more likely to lead to cancellation than others. While reporting this, we also need to consider the ethical aspects of the model, so that it is not inherently biased and discriminatory towards different nationalities.
After understanding the model as a whole, let us now explore individual predictions made by the model. Let us start with the ICE plots in figure 9.4.2.3, using 10 example observations from the external test data.
Figure 9.4.2.3 Individual
Conditional Expectation plot of Xgboost classification model for the hotel
booking cancellation dataset for the first 10 rows of external test data.
The ICE plot in figure 9.4.2.3 suggests that lead time and previous cancellations are clear indicators of the likelihood of a reservation being canceled. In some cases it is difficult to differentiate, as seen for square-root lead time values between 7.5 and 10; in comparison to the other features, however, these two give a clear indication of cancellation behavior.
Now let us look at the LIME plot for the Xgboost model in figure 9.4.2.4. As the number of features is large, we will focus on the top features only.
Figure 9.4.2.4 LIME plot of Xgboost classification model for the hotel booking cancellation dataset for the 4th row of external test data.
The prediction value of 0.97 is the probability of the reservation being canceled. The table on the right-hand side of figure 9.4.2.4 shows the actual values of the different features, based on which the Xgboost model predicted 0.97. The current value of the LeadTime_Sqrt feature is 19.57, which is higher than the threshold of 12.49 identified by LIME for this feature, beyond which the likelihood of cancellation increases.
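The classification variant of the earlier LIME sketch differs mainly in the mode and the use of predicted probabilities; `clf`, `X_train`, and `X_ext` below are hypothetical names.

```python
# Minimal sketch: LIME for a classifier, explaining the row at index 4.
# `clf`, `X_train`, and `X_ext` are assumed to exist (hypothetical names).
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["not canceled", "canceled"],
    mode="classification",
)
exp = explainer.explain_instance(
    X_ext.iloc[4].values,
    clf.predict_proba,      # classification uses class probabilities
    num_features=10,
)
exp.show_in_notebook(show_table=True)
```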
We tried counterfactual explanations; however, they were not conclusive for the 4th observation in the external test data. Hence, we will look at the SHAP explanation for the 4th observation in the external test data, restricted to the top 20 features as identified by SHAP.
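A top-20 view like the one in the next figure can be requested through the max_display argument; the sketch below assumes a fitted Xgboost classifier `clf` and a DataFrame `X_ext` (hypothetical names).

```python
# Minimal sketch: SHAP waterfall limited to the top 20 features.
# `clf` and `X_ext` are assumed to exist (hypothetical names).
import shap

explainer = shap.TreeExplainer(clf)
shap_values = explainer(X_ext)
shap.plots.waterfall(shap_values[4], max_display=20)  # top 20 features only
```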
Figure 9.4.2.5 SHAP plot of Xgboost classification model for the hotel booking cancellation dataset for the 4th row of external test data.
The highest impact on the model prediction comes from the encoded country feature and the square root of lead time, respectively. This matches the permutation feature importance in figure 9.4.2.2.