11.3: Ensemble Learning
Consider the problem of predicting room bookings for a hotel. Imagine that linear regression predicts with high accuracy for December, random forest performs best for the first six months of the year, and XGBoost outperforms both for the remaining months.
If we can leverage the individual strengths of these models, we can build a highly accurate model for predicting hotel bookings. Ensemble learning helps us do exactly that: it combines the strengths of different models to produce a more reliable prediction.
The simplest form of ensembling is called averaging, wherein we take the predictions from multiple models and use their mathematical average as the final prediction. If some modeling techniques perform better than others, we can assign each model a weight that reflects how accurate it is; this gives us weighted average ensembling. Conditional averaging is another approach, in which we choose between models based on conditions placed on the output values. For example, if model 1 predicts values under 50 accurately and model 2 predicts values above 50 accurately, we can simply add an if-else condition that uses model 1's prediction for values under 50 and model 2's prediction otherwise.
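To make these three schemes concrete, here is a minimal sketch using NumPy arrays. The prediction values, the weights, and the under/over-50 threshold are purely illustrative stand-ins for real model outputs and validation-derived weights.

```python
import numpy as np

# Hypothetical predictions from three regression models on the same test set.
pred_lr = np.array([42.0, 55.0, 48.0, 61.0])   # linear regression
pred_rf = np.array([40.0, 58.0, 47.0, 63.0])   # random forest
pred_xgb = np.array([41.0, 57.0, 49.0, 60.0])  # XGBoost

# Simple averaging: every model contributes equally.
avg_pred = (pred_lr + pred_rf + pred_xgb) / 3

# Weighted averaging: weights reflect how much we trust each model
# (illustrative values; in practice they come from validation performance).
weights = np.array([0.2, 0.3, 0.5])
weighted_pred = weights[0] * pred_lr + weights[1] * pred_rf + weights[2] * pred_xgb

# Conditional averaging: use model 1 when its prediction is under 50,
# otherwise fall back to model 2, mirroring the if-else condition above.
cond_pred = np.where(pred_lr < 50, pred_lr, pred_rf)
```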
Bagging is a useful form of ensembling for high-variance techniques, such as decision trees, which are prone to overfitting. We can create multiple models of the same technique simply by changing the random seed, tuning parameters, or varying the number of features and records sampled from the dataset. Bagging becomes more impactful as the number of models grows. We can also use BaggingClassifier and BaggingRegressor in Sklearn to do the same task. In contrast, in boosting-based ensemble learning, new models are added sequentially, depending on how well the previous models performed; poorly predicted samples can be given higher weight in later rounds.
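The following is a short sketch of both ideas with Sklearn on a synthetic regression dataset standing in for the hotel-booking data. BaggingRegressor wraps decision trees trained on random subsets of records and features, while GradientBoostingRegressor (one of several boosting implementations) adds trees sequentially. The dataset sizes and hyperparameters are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for the hotel-booking features (illustrative only).
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bagging: many decision trees, each fit on a random subset of records and features.
bagging = BaggingRegressor(
    DecisionTreeRegressor(),
    n_estimators=50,
    max_samples=0.8,    # fraction of records used per tree
    max_features=0.8,   # fraction of features used per tree
    random_state=42,
)
bagging.fit(X_train, y_train)
print("Bagging R^2:", bagging.score(X_test, y_test))

# Boosting: trees added sequentially, each correcting the previous ones' errors.
boosting = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
boosting.fit(X_train, y_train)
print("Boosting R^2:", boosting.score(X_test, y_test))
```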
Another form of ensemble learning is called stacking. For stacking, the training data is divided into two parts. The first part is used to train base models built with a diverse set of techniques. These individual models then predict values for the second part of the dataset, and those predictions are used as features. Using these features together with the dependent variable from the second part of the training data, a final ensemble model is trained. For stacking to be successful, we should use a diverse set of techniques for the base models, such as linear and non-linear methods.
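Below is a minimal sketch of this two-part stacking scheme, again on synthetic data; the choice of base models (one linear, one non-linear) and the 50/50 split are assumptions made for illustration. Sklearn also provides StackingRegressor and StackingClassifier, which implement a similar idea using cross-validated predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Illustrative data; the split mirrors the two-part scheme described above.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_part1, X_part2, y_part1, y_part2 = train_test_split(X, y, test_size=0.5, random_state=0)

# Part 1: train a diverse set of base models (one linear, one non-linear).
base_models = [LinearRegression(), RandomForestRegressor(random_state=0)]
for model in base_models:
    model.fit(X_part1, y_part1)

# Part 2: base-model predictions become the features of the final model.
meta_features = np.column_stack([model.predict(X_part2) for model in base_models])

# Final ensemble model trained on the stacked predictions and part 2's targets.
meta_model = LinearRegression()
meta_model.fit(meta_features, y_part2)

# To predict on new data, stack the base models' predictions first.
def stacked_predict(X_new):
    stacked = np.column_stack([model.predict(X_new) for model in base_models])
    return meta_model.predict(stacked)
```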