7.1: Lasso, Ridge, and ElasticNet

Least absolute shrinkage and selection operator (Lasso), Ridge, and ElasticNet are regularization techniques that help prevent overfitting. Overfitting happens when a model performs well on the training data but poorly on unseen data. To address this, a regularization penalty can be added to the model's objective to discourage large coefficients: shrinking coefficient magnitudes trades a small increase in bias for a larger reduction in variance. Lasso and Ridge, also known as L1 and L2 regularization, are two different penalties for regularizing a linear model, and ElasticNet combines the two. These techniques can also be used for feature selection.
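For reference, the three penalized least-squares objectives can be written as follows. This is a generic form: the symbols α (overall penalty strength) and ρ (ElasticNet's L1/L2 mixing ratio) are introduced here for illustration, and the exact scaling of each term varies by library.

```latex
\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_1
\qquad \text{(Lasso, L1)}

\min_{\beta}\ \lVert y - X\beta \rVert_2^2 + \alpha \lVert \beta \rVert_2^2
\qquad \text{(Ridge, L2)}

\min_{\beta}\ \lVert y - X\beta \rVert_2^2
  + \alpha \left( \rho \lVert \beta \rVert_1 + \frac{1-\rho}{2} \lVert \beta \rVert_2^2 \right)
\qquad \text{(ElasticNet)}
```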

Let's discuss Lasso first. Imagine two correlated independent variables x1 and x2 with weights 11.1 and 10.5 respectively. Lasso will tend to suppress the coefficient of one of the two variables to 0, so the new weights might be 11.1 (or some other value) for x1 and 0 for x2, or vice versa. Because some coefficients become exactly 0 under Lasso, those variables effectively contribute nothing to the prediction of the outcome variable. Hence, as a byproduct of L1 regularization, we get feature selection: dropping the zeroed-out features makes the model more robust, less complex, and computationally faster.
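Here is a minimal sketch of this behavior using scikit-learn on synthetic data (the alpha values and data-generation parameters are illustrative, not from the original text):

```python
# Lasso drives some coefficients to exactly zero; Ridge only shrinks them.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 100 samples, 10 features, but only 3 actually drive the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=42)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", np.round(lasso.coef_, 2))  # several exact zeros
print("Ridge coefficients:", np.round(ridge.coef_, 2))  # small but nonzero
print("Features kept by Lasso:", np.flatnonzero(lasso.coef_))
```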

There are a few limitations of the Lasso model. Firstly, Lasso is more appropriate when we have a handful of variables. If we have a large number of features, for example a huge number of dummy variables, it will arbitrarily select one out of each group of correlated features. As a result, features that have very little bearing on the outcome, based on business understanding, can get selected, while variables with a relatively higher impact might get rejected. Lasso selects a useful set of features, not necessarily the most important ones. Secondly, when features are collinear, Lasso is not the best choice for feature selection. However, if we standardize the features and apply a feature extraction method such as PCA to them, we can perform Lasso feature selection on the uncorrelated PCA components instead. The third limitation is that Lasso gives no indication of which features are statistically significant. To find out, we can feed the variables selected by Lasso into an ordinary linear regression model and inspect the p-values. The sketch below illustrates both workarounds.
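A sketch of the two workarounds, again on synthetic data (the number of components, alpha values, and variable names are assumptions for illustration): Lasso on PCA components to sidestep collinearity, then an OLS refit on Lasso-selected features to obtain p-values.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)

# Workaround for collinearity: standardize, extract PCA components,
# then run Lasso feature selection on the (uncorrelated) components.
X_std = StandardScaler().fit_transform(X)
components = PCA(n_components=5).fit_transform(X_std)
lasso_pca = Lasso(alpha=0.5).fit(components, y)
print("PCA components kept:", np.flatnonzero(lasso_pca.coef_))

# Workaround for significance: refit OLS on the Lasso-selected features
# and read the p-values from the summary.
lasso = Lasso(alpha=1.0).fit(X_std, y)
selected = np.flatnonzero(lasso.coef_)
ols = sm.OLS(y, sm.add_constant(X_std[:, selected])).fit()
print(ols.summary())
```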

Ridge regression shrinks coefficients towards zero but doesn't make them exactly zero: the new weights reduce in magnitude and can become negligible, but not zero. For structured data with sparse features, e.g. hundreds of thousands of dummy variables, Ridge, or ElasticNet with a high ridge (L2) weight, can still guide feature selection. Although it will practically not 'remove' features, it will assign larger weights to the more useful features, which can then be ranked.
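The sketch below (synthetic data; alpha and l1_ratio values are illustrative) shows ranking features by Ridge coefficient magnitude, and ElasticNet's l1_ratio knob in scikit-learn's parameterization, where l1_ratio close to 0 behaves like ridge and l1_ratio=1 is pure lasso:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Ridge

X, y = make_regression(n_samples=150, n_features=12, n_informative=4,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=1.0).fit(X, y)
# No coefficient is exactly zero, but the magnitudes still rank the features.
ranking = np.argsort(-np.abs(ridge.coef_))
print("Features ranked by |ridge coef|:", ranking)

# ElasticNet leaning towards ridge: mostly shrinkage, a little sparsity.
enet = ElasticNet(alpha=1.0, l1_ratio=0.1).fit(X, y)
print("Nonzero ElasticNet coefficients:", np.count_nonzero(enet.coef_))
```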