11.3: Ensemble Learning

For the case of hotel room-booking prediction, let's imagine that linear regression predicts with a high degree of accuracy for the month of December, random forest performs best for the first six months of the year, and XGBoost, on the other hand, outperforms both for the remaining months.

If we can leverage the individual strengths of these different models, we can come up with a highly accurate model for predicting hotel bookings. Ensemble learning helps us do just that: it combines the strengths of different models to provide a more reliable prediction.

The simplest form of ensembling is called averaging, wherein we take the predictions from multiple models and compute their mathematical average to arrive at the final prediction. If some modeling techniques perform better than others, we can assign each model a weight that signifies the degree to which that model is accurate; this gives us weighted average ensembling. Ensembling by conditional averaging is another approach, in which we choose between (or average) two models' predictions based on conditions placed on the output values. For example, if model 1 predicts accurately for values under 50 and model 2 predicts accurately for values above 50, we can simply add an if-else condition that uses model 1's prediction below that threshold and model 2's prediction above it.
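The following is a minimal sketch of these three variants, assuming preds_1 and preds_2 hold predictions from two already-trained regression models (the values, weights, and the threshold of 50 are illustrative, not tuned):

```python
import numpy as np

# Hypothetical predictions from two trained models on the same records.
preds_1 = np.array([42.0, 55.0, 48.0, 61.0])   # e.g. linear regression output
preds_2 = np.array([40.0, 58.0, 45.0, 65.0])   # e.g. random forest output

# Simple averaging: every model contributes equally.
simple_avg = (preds_1 + preds_2) / 2

# Weighted averaging: the more accurate model gets the larger weight.
weighted_avg = 0.7 * preds_1 + 0.3 * preds_2

# Conditional averaging: trust model 1 below the threshold of 50,
# model 2 above it, as described in the text.
conditional = np.where(preds_1 < 50, preds_1, preds_2)

print(simple_avg, weighted_avg, conditional)
```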

Bagging is a useful form of ensembling for high-variance techniques, such as decision trees, which are prone to overfitting. We create multiple models with the same technique simply by changing the random seed, the parameter settings, or the number of features and records sampled from the dataset, and then combine their predictions. This becomes more impactful as the number of models grows. We can also use BaggingClassifier and BaggingRegressor in Sklearn to do the same task. In contrast, for boosting-based ensemble learning, newer models are added sequentially, each one depending on how well the previous models performed; the training records can also be weighted so that later models focus on the examples the earlier models got wrong.
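Below is a short sketch of bagging with Sklearn's BaggingRegressor, assuming X and y are the hotel-booking features and target (synthetic data is generated here only to make the snippet self-contained; note that the keyword is named estimator in recent scikit-learn releases and base_estimator in older ones):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(500, 5)                              # 500 records, 5 features
y = X @ rng.rand(5) + rng.normal(0, 0.1, 500)     # synthetic target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Bag 50 decision trees, each fit on a bootstrap sample of the rows
# and a random 80% subset of the features.
bagger = BaggingRegressor(
    estimator=DecisionTreeRegressor(),
    n_estimators=50,
    max_features=0.8,
    random_state=42,
)
bagger.fit(X_train, y_train)
print("R^2 on held-out data:", bagger.score(X_test, y_test))
```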

Another form of ensemble learning is called stacking. For stacking, the training data is divided into two parts. The first part is used to train several models with a diverse set of techniques. These individual models then predict values for the second part of the dataset, and those predictions are used as features. Using these features together with the dependent variable from the second part of the training data, a final ensemble model is trained. For stacking to be successful, the base models should come from diverse techniques, such as a mix of linear and non-linear methods.
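The sketch below follows this two-part scheme, assuming X and y hold the training data (synthetic here so the snippet runs on its own); the choice of linear regression and random forest as base models, and linear regression as the final (meta) model, is illustrative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(400, 4)
y = X @ rng.rand(4) + rng.normal(0, 0.1, 400)

# Part 1 trains the diverse base models; part 2 trains the final model.
X_part1, X_part2, y_part1, y_part2 = train_test_split(
    X, y, test_size=0.5, random_state=0
)

base_models = [LinearRegression(),
               RandomForestRegressor(n_estimators=100, random_state=0)]
for model in base_models:
    model.fit(X_part1, y_part1)

# Base-model predictions on part 2 become the meta-model's features.
meta_features = np.column_stack([m.predict(X_part2) for m in base_models])
meta_model = LinearRegression().fit(meta_features, y_part2)

# At prediction time, new data flows through the base models first,
# then through the meta-model.
X_new = rng.rand(3, 4)
stacked_pred = meta_model.predict(
    np.column_stack([m.predict(X_new) for m in base_models])
)
print(stacked_pred)
```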