7.6: Linear Regression

The most common method of feature selection for anyone who has been through a linear regression tutorial is to remove features whose p-value is above 0.05. This approach to feature selection in linear models is problematic. Ideally, we should include or exclude features based on whether they are good or bad predictors of the outcome variable, not simply by observing the p-value.

The p-value tells us how strongly a feature is associated with the variance in the outcome variable after the other features have explained their share of that variance. A high p-value, however, does not prove that the feature and the dependent variable are unrelated. The significance threshold is also subjective: some use 0.05, while others stick to more stringent levels.

The p-value of a feature, say X1, in a regression model does not tell us what will happen to the other features' coefficients and p-values if we drop X1 for being insignificant. Non-significant features may become significant after an insignificant feature is removed, and features that were significant can likewise become non-significant. Another drawback of keeping only features whose p-value is below 0.05 is that features that are difficult to explain to a layperson may be retained in the model simply because they are significant.
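To make this concrete, below is a minimal sketch using synthetic data and statsmodels (all variable names are illustrative, not from any real dataset). Two correlated predictors share the explained variance, so both can look insignificant; dropping one of them changes the p-value of the other.

```python
# Sketch with synthetic data: dropping one "insignificant" feature
# can change the p-values of the remaining features.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)   # strongly correlated with x1
x3 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x3 + rng.normal(size=n)

X_full = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
full_fit = sm.OLS(y, X_full).fit()
print(full_fit.pvalues)       # x1 and x2 share variance, so both can look weak

# Refit after dropping x2: x1's p-value typically shrinks dramatically.
X_reduced = X_full.drop(columns="x2")
reduced_fit = sm.OLS(y, X_reduced).fit()
print(reduced_fit.pvalues)
```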

Below are circumstances in which we can still include features that are not statistically significant.

1) A categorical variable is represented through dummy encoding, and of the resulting dummy features some are statistically significant while others are not. In such a case, we should keep all the dummy features, including the categories whose p-value is above 0.05 (see the sketch after this list).

2) An interaction term between two variables is significant, while one of the original features is insignificant. We cannot keep the interaction effect without keeping the main effects, so it is not appropriate to drop the insignificant feature.

3) The model has an insignificant feature, and removing it reduces the model's predictive power. We are better off keeping such a feature if our main goal is prediction.

4) Features that are the main focus of the research can be kept even if they are statistically insignificant. For example, suppose we are predicting the probability of a car accident from the quantity of alcohol consumed while driving. Even if alcohol consumption is statistically insignificant, we cannot remove the feature, as it is the central focus of the research.

5) In certain areas of science, we need convincing reasons to exclude certain features. For example, in biostatistics, removing age, gender, and similar variables from the model is counterintuitive.
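As an illustration of point 1), here is a hypothetical sketch with synthetic data: the dummies created from a single categorical variable are assessed and kept as a group, using a joint F-test, rather than pruned level by level by their individual p-values.

```python
# Sketch with synthetic data: dummy-encoded categories are kept or dropped
# together as one group, not one level at a time.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
city = rng.choice(["A", "B", "C"], size=n)
income = rng.normal(50, 10, size=n)
y = 0.8 * income + 5.0 * (city == "B") + rng.normal(scale=5, size=n)

df = pd.DataFrame({"income": income, "city": city})
X = pd.get_dummies(df, columns=["city"], drop_first=True).astype(float)
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.pvalues)   # city_B may be significant while city_C is not

# Test the dummy group jointly instead of judging individual p-values.
print(fit.f_test("city_B = 0, city_C = 0"))
```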

It must be noted that we are not compelled to always keep insignificant features in the model. We can remove an insignificant feature in the situations below.

1) If the beta coefficient of the feature is nearly zero or negligible, it has minimal impact on the error and the R-squared of the model. In such cases, we can remove the insignificant feature.

2) If a feature is highly collinear with other features, dropping it has little impact on the regression error and goodness of fit (a variance-inflation check is sketched after this list).

3) If other, stronger features are present that have a very strong bivariate correlation with the outcome variable as well as with the insignificant feature, we can drop the insignificant feature.

4) If a feature is hard to explain and statistically insignificant, we can remove such a feature.

5) If the presence or absence of the insignificant feature has a negligible impact on the predictive power of the model, we can remove it.
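For point 2) above, a common check is the variance inflation factor (VIF). The sketch below uses synthetic data and statsmodels; the high-VIF pair flags the features whose removal costs little in terms of fit.

```python
# Sketch with synthetic data: checking multicollinearity with variance
# inflation factors before deciding whether a feature is safe to drop.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)   # x1 and x2 show high VIFs; dropping one of them costs little fit
```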

Instead of using the p-value, we can use the beta coefficient for selecting features in a linear regression model. The beta coefficient indicates how important a feature is in the model; features with a small beta are of less importance. If all the features are on the same scale of measurement through some transformation such as the z-score, and there is no collinearity, we can compare features by their beta coefficients and select the important ones.
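A minimal sketch of this idea, assuming scikit-learn and synthetic data: standardize the features, fit ordinary least squares, and rank the features by the absolute value of their coefficients.

```python
# Sketch with synthetic data: z-score the features so their coefficients are
# comparable, then rank features by the absolute standardized coefficient.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
y = 3.0 * X["x1"] + 0.2 * X["x2"] + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

importance = pd.Series(np.abs(model.coef_), index=X.columns).sort_values(ascending=False)
print(importance)   # features with the smallest standardized betas matter least
```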

If there is multicollinearity amongst the features, we first need to fix it. If multicollinearity is hard to fix, we can use ridge regression or ElasticNet regression instead, and then use the resulting coefficients to keep only those features that have a strong impact on the dependent variable.
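The sketch below illustrates this with synthetic, collinear data and scikit-learn: ridge and ElasticNet are fitted on standardized features, and features whose coefficients are near zero are dropped. The 0.01 cutoff is purely illustrative, not a recommended default.

```python
# Sketch with synthetic, collinear data: fit ridge and ElasticNet on
# standardized features and keep features with meaningfully non-zero coefficients.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV, ElasticNetCV

rng = np.random.default_rng(4)
n = 400
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # highly collinear with x1
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)                    # irrelevant feature
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3, "x4": x4})
y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

X_std = StandardScaler().fit_transform(X)

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_std, y)
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X_std, y)

coefs = pd.DataFrame({"ridge": ridge.coef_, "elastic_net": enet.coef_}, index=X.columns)
print(coefs)

# Keep features whose ElasticNet coefficient is not near zero; the 0.01
# threshold is an arbitrary illustration.
selected = coefs.index[np.abs(coefs["elastic_net"]) > 0.01].tolist()
print(selected)
```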