7.1: Lasso, Ridge, and ElasticNet
Least absolute shrinkage and
selection operator (Lasso), Ridge, and ElasticNet are regularization techniques
that help prevent overfitting. Overfitting happens when a model performs well on
the training data but poorly on unseen data. To counter this, a regularization
term can be added to the model's objective that penalizes large coefficients,
shrinking them towards zero. Lasso and Ridge, also known as L1 and L2
regularization, are two different methods of regularizing a linear model;
ElasticNet combines both penalties. These techniques can also be used for
feature selection.
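
For reference, the three penalized objectives can be written side by side. This is a standard textbook formulation; the symbols used here, lambda for the regularization strength, beta for the coefficients, n observations, and p features, are assumed notation rather than symbols defined earlier in this chapter:

\text{Lasso (L1):}\quad \hat{\beta} = \arg\min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert

\text{Ridge (L2):}\quad \hat{\beta} = \arg\min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda \sum_{j=1}^{p}\beta_j^{2}

\text{ElasticNet:}\quad \hat{\beta} = \arg\min_{\beta}\ \sum_{i=1}^{n}\bigl(y_i - x_i^{\top}\beta\bigr)^2 + \lambda_1 \sum_{j=1}^{p}\lvert\beta_j\rvert + \lambda_2 \sum_{j=1}^{p}\beta_j^{2}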
Let's discuss Lasso first. Imagine
two highly correlated independent variables x1 and x2 with weights 11.1 and
10.5 respectively. Because the two variables carry nearly the same information,
Lasso will tend to suppress the coefficient of one of them to exactly 0. The
new weights might look like 11.1 (or some different value) for x1, and 0 for
x2, or the other way around. Since the coefficients of some features become
exactly 0 under Lasso, those variables contribute nothing to the predicted
outcome. Hence, as a byproduct of L1 regularization, we can perform feature
selection: dropping the zeroed features makes the model more robust, less
complex, and computationally faster, as the sketch below illustrates.
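
Here is a minimal sketch of this selection effect using scikit-learn. The synthetic data and the alpha value are illustrative assumptions, not taken from the text:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: x1 and x2 carry almost the same signal,
# x3 is independently useful, x4 is pure noise.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
x1 = signal + rng.normal(scale=0.01, size=n)
x2 = signal + rng.normal(scale=0.01, size=n)
x3 = rng.normal(size=n)
x4 = rng.normal(size=n)
y = 10.0 * signal + 3.0 * x3 + rng.normal(size=n)

X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3, x4]))
lasso = Lasso(alpha=0.1).fit(X, y)  # alpha chosen for illustration

# Typically one of the near-duplicate pair and the noise feature end up
# with coefficients of exactly 0; the rest survive, slightly shrunk.
print(np.round(lasso.coef_, 2))
print("kept feature indices:", np.flatnonzero(lasso.coef_ != 0))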
There are a few limitations of the
Lasso model. First, Lasso is more appropriate when we have a handful of
variables. If we have a large number of features, for example a huge number of
dummy variables, it will arbitrarily select one out of each group of correlated
features. As a result, features that have very little bearing on the outcome,
based on business understanding, can get selected, while variables with a
relatively higher impact might get rejected. Lasso selects a useful set of
features, not necessarily the most important ones. Second, when features are
collinear, applying Lasso feature selection to them directly is not the best
choice. However, if we standardize the features and apply a feature extraction
method such as PCA to the standardized features, we can perform Lasso feature
selection on the resulting PCA components, which are uncorrelated by
construction. The third limitation is that Lasso gives no indication of which
features are statistically significant. To find out, we can feed the variables
selected by Lasso into an ordinary linear regression model and inspect its
p-values.
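
One way to do this refit, continuing the sketch above (X, y, and lasso are reused from it; statsmodels is an assumed tooling choice, and the resulting p-values should be read cautiously because the same data was used for selection):

import statsmodels.api as sm

# Refit only the columns Lasso kept; OLS reports a t-statistic and
# p-value per coefficient, which Lasso itself cannot provide.
kept = np.flatnonzero(lasso.coef_ != 0)
ols = sm.OLS(y, sm.add_constant(X[:, kept])).fit()
print(ols.summary())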
Ridge regression shrinks
coefficients towards zero but does not make them exactly zero. The new weights
will be smaller in magnitude and can become negligible, but never zero. For
structured data with sparse features, e.g. hundreds of thousands of dummy
variables, Ridge, or ElasticNet with a high weight on the ridge penalty, can
still be used for feature selection. Although it will not literally 'remove'
features, it will assign relatively higher weights to the useful ones, which
can then be kept by thresholding on coefficient magnitude, as in the sketch
below.
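
A sketch of this ranking-style selection with Ridge, reusing X, y, and np from the earlier sketch. SelectFromModel and the specific threshold are illustrative choices, not prescribed by the text:

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge

# Ridge leaves every coefficient nonzero, so instead of looking for exact
# zeros we keep features whose |coefficient| exceeds half the mean
# magnitude (an illustrative threshold).
selector = SelectFromModel(Ridge(alpha=1.0), threshold="0.5*mean").fit(X, y)
print(np.round(selector.estimator_.coef_, 2))  # all nonzero, but tiny for noise
print("kept feature indices:", np.flatnonzero(selector.get_support()))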