1.3: Preventing Overfitting
A dataset can be divided into a training dataset and external test data in an 80:20, 70:30, or even 60:40
ratio, depending on the availability of data. A validation set is then carved out of the training portion,
and the remaining training data is divided further into train and development test sets in the same ratio.
We then train our model in iterations after performing feature engineering and feature selection. At each
iteration, we train the model and measure its performance on the development test data and the validation
data. Once we get a model of acceptable performance, we test it on the external test data to
ensure it generalizes well to unseen data.
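As a rough illustration of this three-way split, the sketch below uses scikit-learn's train_test_split on a synthetic dataset; the library choice, the 80:20 ratios, and all variable names are assumptions made for the example rather than part of the procedure itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in practice X and y come from the real dataset.
X, y = make_classification(n_samples=1000, random_state=42)

# First split: 80% training pool, 20% external (hold-out) test data.
X_pool, X_external, y_pool, y_external = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Second split: carve a validation set out of the training pool.
X_train_full, X_val, y_train_full, y_val = train_test_split(
    X_pool, y_pool, test_size=0.20, random_state=42
)

# Third split: divide the remaining training data into train and
# development test sets in the same ratio.
X_train, X_dev, y_train, y_dev = train_test_split(
    X_train_full, y_train_full, test_size=0.20, random_state=42
)
```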
The bias-variance tradeoff tells us that adding more features can lead to overfitting; overfitting can also
creep in during feature selection. Even if the model has learned our data in a way that gives the best
predictive power for the dependent variable, that does not guarantee it will generalize well to unseen data.
It only guarantees that the model performs well on the development test data and validation data used to
validate it.
One way to overcome this challenge is to test the model on multiple development test sets to understand
how well it generalizes. This is done through cross-validation.
Generally, we perform 5-fold cross-validation. After separating out the external test data and the
validation data, the remaining training data is divided into 5 equal parts: 4 parts are used for training
and the remaining part serves as the development test set. We also use the validation data to measure model
performance. We repeat this 5 times and average the model metric across the 5 development test and
validation evaluations. Doing this reduces the likelihood of overfitting. In addition to cross-validation,
we test the model on the external test data to confirm that the cross-validation score is indeed reliable.
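The following sketch shows one way to implement this scheme with scikit-learn; the synthetic dataset, the logistic regression stand-in for the model, and accuracy as the metric are all assumptions chosen only to make the example runnable.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=2000, random_state=0)

# Hold back external test data and a validation set first.
X_pool, X_external, y_pool, y_external = train_test_split(
    X, y, test_size=0.20, random_state=0
)
X_cv, X_val, y_cv, y_val = train_test_split(
    X_pool, y_pool, test_size=0.20, random_state=0
)

# 5-fold cross-validation: 4 parts train, 1 part development test,
# with the validation set scored on every fold as well.
dev_scores, val_scores = [], []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, dev_idx in kf.split(X_cv):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_cv[train_idx], y_cv[train_idx])
    dev_scores.append(accuracy_score(y_cv[dev_idx], model.predict(X_cv[dev_idx])))
    val_scores.append(accuracy_score(y_val, model.predict(X_val)))

print("mean dev accuracy:", np.mean(dev_scores))
print("mean val accuracy:", np.mean(val_scores))

# Finally, check the external test data to confirm the
# cross-validation score is reliable.
final_model = LogisticRegression(max_iter=1000).fit(X_cv, y_cv)
print("external test accuracy:",
      accuracy_score(y_external, final_model.predict(X_external)))
```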
If the average model metric across the 5 cross-validation folds is very different from the metric on the
external test data, we need to probe the dataset further to check whether the data distribution in the
cross-validation folds and validation data differs from that of the external test data. The object of the
probe should be to ensure that the data distribution is close to real-world values in all four datasets:
1) training data, 2) development test data in all cross-validation folds, 3) validation data, and
4) external test data. If any anomaly is identified, it should be investigated further.
Using 5-fold cross-validation is subject to the availability of data. Although it is used as a
near-universal standard, we should use 5-fold only when we have sufficient data. When data is limited,
we can increase cross-validation from 5-fold to 10-fold to get a more robust averaged model metric.
'More' and 'less' are subjective when it comes to datasets.
For example, for training a Twitter sentiment classifier, 8,000 tweets can be considered small if we compare
it against the billions of tweets in the Twitter database. If we still have to model with such a small
dataset of tweets, it is wise to use 10-fold cross-validation. However, when modeling the average lifespan
of an endangered species, 8,000 observations can be considered voluminous, and we can limit
cross-validation to 5-fold.
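Moving from 5-fold to 10-fold is only a change in the number of splits; a minimal sketch with scikit-learn's cross_val_score, again on assumed synthetic data and a placeholder model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# With a small dataset, 10-fold cross-validation gives a more
# robust averaged metric than 5-fold.
X_small, y_small = make_classification(n_samples=800, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_small, y_small, cv=10)
print("10-fold mean accuracy:", scores.mean())
```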
If the data has distinct strata, we should ensure that the train and test samples in cross-validation, as
well as the external hold-out sample, are all representative of every stratum. In such cases we should use
stratified k-fold cross-validation. For example, suppose we are developing a model to predict the likelihood
of a specific disease among people living in the US, given body vitals and characteristics such as blood
pressure, sugar level, heart rate, sodium level, cholesterol, daily minutes of exercise, body weight,
height, and so on. Such a dataset should be collected from all the states. Which state a person lives in may
have little impact on their overall health and well-being, so we would not use the state as a feature to
train our model. However, we should ensure that in the training and test samples across all
cross-validation folds, each state contributes the same percentage of observations. This ensures that the
model will generalize well to real-world circumstances.
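One way to achieve this, assuming scikit-learn and entirely synthetic data, is to pass the state labels as the stratification target of StratifiedKFold so that every fold preserves each state's share of observations; the state list, sample sizes, and features below are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical example: 1,000 people spread across 5 states.
rng = np.random.default_rng(0)
states = rng.choice(["CA", "TX", "NY", "FL", "WA"], size=1000)
X = rng.normal(size=(1000, 8))         # body vitals and characteristics
y = rng.integers(0, 2, size=1000)      # disease occurrence (0/1)

# Passing the state labels to split() makes each fold preserve the
# state proportions, even though state itself is not a model feature.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, states)):
    train_share = np.mean(states[train_idx] == "CA")
    test_share = np.mean(states[test_idx] == "CA")
    print(f"fold {fold}: CA share train={train_share:.2f}, test={test_share:.2f}")
```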
When splitting data into train and test sets, and even when setting aside external test data, we should give
due regard to the unit of data. Most often a single observation is the unit of data, and we can split such a
dataset directly into cross-validation samples. However, when an observation belongs to a distinguishable
group of related observations, the whole group should be treated as a single unit of data.
For example, if we are evaluating the effectiveness of a new medicine in regulating the blood pressure of
individual patients, we will record the dosage amount and the resulting blood pressure over multiple days.
In this case, all the data recorded for a specific patient is one unit of data. While splitting the data
into the training set, external test data, validation data, and development test data, we should ensure that
a patient's data is present entirely in only one of those sets. In no case should a patient's data be
present in more than one of training, external test, validation, or development test data. If a patient's
data appears in both the training and test data simultaneously, the results could be biased: the model is
already familiar with similar values for that patient from the training data, which leads to over-optimistic
results for that patient and, at an aggregate level, over-optimistic model performance. Hence, we should
identify the units of data and ensure that data from a single unit is never present in training, test,
and/or external test data simultaneously.
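A sketch of such patient-level splitting with scikit-learn's group-aware splitters, on hypothetical synthetic data: GroupShuffleSplit holds out whole patients for the external test set, and GroupKFold keeps each patient's data inside a single fold. The patient counts and features are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Hypothetical example: daily readings for 50 patients, 20 days each.
rng = np.random.default_rng(1)
patient_id = np.repeat(np.arange(50), 20)
X = rng.normal(size=(1000, 3))                 # dosage and other measurements
y = rng.normal(loc=120, scale=10, size=1000)   # resulting blood pressure

# Hold out external test data at the patient level, so no patient
# appears in both the training pool and the external test set.
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=1)
pool_idx, external_idx = next(outer.split(X, y, groups=patient_id))

# Cross-validate on the pool; GroupKFold keeps each patient's data
# entirely inside a single fold.
gkf = GroupKFold(n_splits=5)
for train_idx, dev_idx in gkf.split(X[pool_idx], y[pool_idx],
                                    groups=patient_id[pool_idx]):
    train_patients = set(patient_id[pool_idx][train_idx])
    dev_patients = set(patient_id[pool_idx][dev_idx])
    assert train_patients.isdisjoint(dev_patients)  # no patient overlap
```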
If cross-validation is done right,
feature engineering and feature selection can help us identify a smaller set of
features that give the best performance for predicting the target variable.
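As one possible illustration of doing this right, the sketch below places a feature-selection step inside a scikit-learn Pipeline so that it is re-fitted on each training fold rather than on the full dataset; SelectKBest, the choice of k, and the logistic regression model are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=8, random_state=0)

# Feature selection inside the pipeline is re-fitted on each training
# fold, so the cross-validation score does not leak information from
# the fold used as the development test.
pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=10)),
    ("model", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)
print("mean accuracy with selected features:", scores.mean())
```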