1.3: Preventing Overfitting

A dataset can be divided into a training dataset and external test data in an 80:20, 70:30, or even 60:40 ratio, depending on the availability of data. A further split of the remaining training data then carves out validation data, and what is left can be divided once more into training and development test data in the same ratio. After performing feature engineering and feature selection, we train our model in iterations. At each iteration, we train the model and test its performance on the development test data and the validation data. Once we obtain a model of acceptable performance, we test it on the external test data to confirm that it generalizes well to unseen data.
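A minimal sketch of this splitting scheme using scikit-learn is shown below. The dataset, the 80:20 ratios, and the variable names are illustrative assumptions, not part of the original text.

```python
# Sketch of a nested split: external test data first, then validation data
# carved out of the remaining training data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# First split: hold out the external test data (80:20).
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Second split: carve validation data out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.20, random_state=42
)

# A further split of X_train into training and development test data
# would follow the same pattern.
```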

The bias-variance tradeoff tells us that adding more features can lead to overfitting, and so can repeated rounds of feature selection. Even if the model has learned our data in a way that gives the best predictive power for the dependent variable, that does not guarantee it will generalize well to unseen data. It only guarantees that the model performs well on the development test data and validation data used to validate it.

One way to overcome this challenge is to test the model on multiple development test sets to understand how well it can generalize. This can be done through cross-validation.

Generally, we perform 5-fold cross-validation. After separating the external test data and validation data, the remaining training data is divided into 5 equal parts: 4 parts are used for training and the remaining part serves as the development test set, while the validation data is also used to measure model performance. We repeat this 5 times, each time holding out a different part, and average the model metric across the 5 development test samples and the validation data. Doing this reduces the likelihood of overfitting. In addition to cross-validation, we test the model on the external test data to ensure that the cross-validation score is indeed reliable.
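The sketch below shows 5-fold cross-validation with scikit-learn; the estimator, the accuracy metric, and the synthetic dataset are illustrative assumptions.

```python
# 5-fold cross-validation: each fold trains on 4 parts and scores on the
# remaining part (the development test data for that iteration).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Averaged accuracy:", np.mean(scores))
```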

If the averaged model metric across the 5 cross-validation folds is very different from the model metric on the external test data, we need to probe the dataset further to check whether the data distribution differs between the cross-validation and validation data on one hand and the external test data on the other. The object of the probe should be to ensure that the data distribution is close to real-world values in all of the datasets involved: 1) the training data, 2) the development test data across all cross-validation samples, 3) the validation data, and 4) the external test data. Any anomaly identified should be investigated further.
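One possible way to probe such a mismatch for a numeric feature is a two-sample distribution test; the Kolmogorov-Smirnov test and the synthetic arrays below are illustrative assumptions, not a prescribed method.

```python
# Compare the distribution of one numeric feature between the cross-validation
# data and the external test data using a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
feature_cv = rng.normal(loc=0.0, scale=1.0, size=800)    # feature values in the CV data
feature_ext = rng.normal(loc=0.0, scale=1.0, size=200)   # same feature in external test data

statistic, p_value = ks_2samp(feature_cv, feature_ext)
print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3f}")
# A very small p-value suggests the two distributions differ and warrants
# the further investigation discussed above.
```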

Using 5-fold cross-validation is subject to the availability of data. Although it is often treated as a universal standard, we should use 5-fold only when we have sufficient data. When data is scarce, we can increase cross-validation from 5-fold to 10-fold to get a more robust averaged model metric. 'More' and 'less' are subjective when it comes to datasets. For example, for training a Twitter sentiment classifier, 8,000 tweets can be considered small compared with the billions of tweets in the Twitter database; if we still have to model with such a small dataset, it is wise to use 10-fold cross-validation. However, when modeling the average lifespan of endangered species, 8,000 observations can be considered voluminous, and we can limit cross-validation to 5-fold.

If the data has distinct strata, we should ensure that the train and test samples in cross-validation, as well as the external hold-out sample, are all representative of every stratum. In such cases we should proceed with stratified k-fold cross-validation. For example, suppose we are developing a model to predict the likelihood of a specific disease among people living in the US, given body vitals and characteristics such as blood pressure, sugar level, heart rate, sodium level, cholesterol, daily minutes of exercise, body weight, and height. Such a dataset should be collected from all the states. A person's state of residence likely has little bearing on their overall health, so we would not use the state as a feature to train our model. However, we should ensure that in the training and test samples across all cross-validation folds, each state has roughly the same percentage of observations. This will help the model generalize well to real-world circumstances.
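A sketch of such stratified splitting is shown below, assuming a hypothetical state label used only as the stratification target, not as a model feature; the data is synthetic and illustrative.

```python
# Stratified k-fold splitting: pass the stratum (state) as the stratification
# target so each fold keeps roughly the same percentage of observations per state.
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))                            # body vitals, etc.
state = rng.choice(["CA", "TX", "NY", "FL"], size=1000)   # stratum, not a model feature

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, state)):
    states, counts = np.unique(state[train_idx], return_counts=True)
    proportions = dict(zip(states, np.round(counts / len(train_idx), 2)))
    print(f"Fold {fold}: state proportions in train = {proportions}")
```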

When performing the train-test split, and even when setting aside the external test data, we should give due regard to the unit of data. Most often, a single observation is the unit of data, and such a dataset can be split freely into cross-validation samples. However, when an observation belongs to a distinguishable group of related observations, the whole group should be treated as the unit of data.

For example, if we are evaluating the effectiveness of a new medicine in regulating the blood pressure of individual patients, we record the dosage amount and the resulting blood pressure over multiple days. In this case, all of the data recorded for a specific patient is one unit of data. While splitting data into training data, development test data, validation data, and external test data, we should ensure that a patient's data is present entirely in only one of these sets. If a patient's data appears in both the training and test data simultaneously, the results can be biased: the model is already familiar with similar values for that patient from the training data, which leads to over-optimistic results for that patient and, at an aggregate level, to over-optimistic model performance. Hence, we should identify the units of data and ensure that data from a single unit never appears in more than one of the training, development test, validation, or external test sets.
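One way to enforce this in practice is group-aware splitting; the sketch below uses scikit-learn's GroupKFold with a hypothetical patient_id and synthetic measurements as illustrative assumptions.

```python
# Group-aware splitting: GroupKFold guarantees that all rows for a given
# patient land in either the training part or the test part of a fold, never both.
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(42)
n_rows = 500
patient_id = rng.integers(0, 50, size=n_rows)        # 50 patients, many days each
X = rng.normal(size=(n_rows, 3))                     # e.g. dosage and other measurements
y = rng.normal(loc=120, scale=10, size=n_rows)       # e.g. recorded blood pressure

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=patient_id)):
    overlap = set(patient_id[train_idx]) & set(patient_id[test_idx])
    print(f"Fold {fold}: patients appearing in both train and test = {len(overlap)}")  # always 0
```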

If cross-validation is done right, feature engineering and feature selection can help us identify a smaller set of features that gives the best performance for predicting the target variable.