7.6: Linear Regression
The most common feature selection method for anyone who has gone through a linear regression tutorial is to remove features whose p-value is above 0.05. This approach to feature selection in linear models is problematic. Ideally, we should include or exclude features based on whether they are good or bad predictors of the outcome variable, not simply by observing the p-value.
The p-value tells us how strongly a feature is associated with the variance in the outcome variable after all the other features have explained their share of that variance. A high p-value, however, does not prove that the feature and the dependent variable are unrelated. The threshold itself is also subjective: some analysts use 0.05, while others insist on more stringent levels.
The p-value of a feature, say X1, in a regression model does not tell us what will happen to the other features' coefficients and p-values if we drop X1 for being insignificant. Non-significant features may become significant after an insignificant feature is removed, and features that were significant may likewise become non-significant. Another drawback of removing every feature whose p-value is above 0.05 is that features that are difficult to explain to a layperson can end up being kept in the model just because they are statistically significant.
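As a minimal sketch of this cascading effect, the snippet below fits an OLS model with statsmodels on a small synthetic dataset (the column names y, x1, x2, x3 are hypothetical), then refits after dropping x1 and shows that the remaining coefficients and p-values shift.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data (hypothetical): x1 and x2 are correlated, x3 is pure noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + rng.normal(scale=0.5, size=200)
x3 = rng.normal(size=200)
y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=200)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2, "x3": x3})

# Full model: p-value of each feature, given all the others
full = sm.OLS(df["y"], sm.add_constant(df[["x1", "x2", "x3"]])).fit()
print(full.pvalues)

# Drop x1 and refit: the coefficients and p-values of x2 and x3 can shift
reduced = sm.OLS(df["y"], sm.add_constant(df[["x2", "x3"]])).fit()
print(reduced.pvalues)
```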
Below are circumstances in which we can still include features that are not statistically significant.
1) A categorical variable is represented through dummy encoding, and some of the dummy features are statistically significant while others are not. In such a case, we should keep all the dummy features, including the categories whose p-value is above 0.05.
2) An interaction term between two variables is significant, while one of the original features is insignificant. We cannot keep the interaction effect without keeping the main effects, so it is not appropriate to drop the insignificant feature (see the sketch after this list).
3) The model has an insignificant feature, but removing it reduces the model's predictive power. If our main goal is prediction, we are better off keeping such a feature.
4) Features that are the central focus of the research can be kept even if they are statistically insignificant. For example, suppose we are modelling the probability of a car accident as a function of the quantity of alcohol consumed before driving. Even if alcohol consumption turns out to be statistically insignificant, we cannot remove it, because it is the central focus of the research.
5) In certain fields, we need convincing reasons to exclude particular features. In biostatistics, for example, removing age, gender, and similar variables from the model is counterintuitive.
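The following sketch illustrates point 2 above on synthetic data (the variables x1, x2, and y are hypothetical): the formula term x1 * x2 keeps both main effects alongside the interaction, even if one main effect turns out to be insignificant.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: the x1:x2 interaction matters, x2 on its own barely does
rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = rng.normal(size=300)
y = 1.0 * x1 + 0.05 * x2 + 1.5 * x1 * x2 + rng.normal(size=300)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# 'x1 * x2' expands to x1 + x2 + x1:x2, so both main effects stay in the model
model = smf.ols("y ~ x1 * x2", data=df).fit()
print(model.summary().tables[1])  # x2 may look insignificant but is kept anyway
```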
It must be noted that we are not obliged to always keep insignificant features in the model. We can remove an insignificant feature in the situations below.
1) If the beta coefficient of the feature is nearly zero or negligible, it will have minimal impact on the error and R-squared of the model. In such cases, we can remove the feature if it is insignificant.
2) If a feature is highly multicollinear with the other features, dropping it will have little impact on the regression error and goodness of fit (a VIF check, sketched after this list, can flag this situation).
3) If other, stronger features are present that have a very strong bivariate correlation with the outcome variable as well as with the insignificant feature, we can drop the insignificant feature.
4) If a feature is hard to explain
and statistically insignificant, we can remove such a feature.
5) If the presence or absence of the
insignificant feature has a negligible impact on the predictive power of the
model, we can remove it.
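As a rough illustration of the multicollinearity case in point 2, the snippet below computes variance inflation factors (VIFs) on synthetic data; the variable names and the 5-10 threshold mentioned in the comment are only rule-of-thumb assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical data: x2 is nearly a copy of x1, so it adds little information
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# A VIF above roughly 5-10 is a common rule-of-thumb warning of multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))  # x1 and x2 show very high VIFs; dropping one costs little
```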
Instead of using the p-value, we can use the beta coefficient for selecting features in a linear regression model. The beta coefficient indicates how important the feature is in the model; features with small betas matter less. If all the features are on the same scale of measurement through a transformation such as the z-score, and there is no collinearity, we can compare features by their beta coefficients and select the important ones.
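A minimal sketch of this idea, assuming hypothetical columns income, age, and spend: z-score every variable, fit OLS, and rank the features by the absolute value of their standardized betas.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data on very different scales: z-scoring makes betas comparable
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "income": rng.normal(50_000, 15_000, size=300),
    "age": rng.normal(40, 10, size=300),
})
df["spend"] = 0.02 * df["income"] + 5.0 * df["age"] + rng.normal(0, 100, size=300)

# z-score every column, then fit; the standardized betas are directly comparable
z = (df - df.mean()) / df.std()
model = sm.OLS(z["spend"], sm.add_constant(z[["income", "age"]])).fit()
print(model.params.drop("const").abs().sort_values(ascending=False))
```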
If there is multicollinearity among the features, we first need to fix it. If multicollinearity is hard to fix, we can use ridge regression or ElasticNet regression instead, and then use the resulting coefficients to keep only the features that have a strong impact on the dependent variable.
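One possible sketch, using scikit-learn's ElasticNetCV on synthetic multicollinear data (the feature construction here is hypothetical): standardize the features, fit the penalized model, and keep only the features whose coefficients remain clearly non-zero.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

# Hypothetical multicollinear data: x1 and x2 are almost the same feature
rng = np.random.default_rng(4)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)
X = np.column_stack([x1, x2, x3])
y = 3.0 * x1 + 0.2 * x3 + rng.normal(size=300)

# Standardize, then let ElasticNet shrink or zero out weak and redundant features
X_std = StandardScaler().fit_transform(X)
enet = ElasticNetCV(l1_ratio=0.5, cv=5, random_state=0).fit(X_std, y)
print(enet.coef_)  # keep only the features whose coefficients are clearly non-zero
```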