Explained: How Does L1 Regularization Perform Feature Selection?


Feature selection is the technique of choosing an optimal subset of features from a given set of features; an optimal feature subset is one that maximizes the performance of the model on the given task.

Feature selection can be a manual or fairly explicit process when performed with filter or wrapper methods. In these methods, features are added or removed iteratively based on the value of a fixed measure, which quantifies the relevance of the feature in making the prediction. The measure might be information gain, variance or the chi-squared statistic, and the algorithm accepts/rejects a feature based on a fixed threshold on the measure. Note that these methods are not a part of the model training stage and are performed prior to it.
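As a concrete sketch, a filter method can be as simple as computing a measure per feature and applying a threshold before training. Below is a minimal example of my own using variance as the measure (the helper name `variance_filter` is hypothetical, not from any library):

```python
import numpy as np

def variance_filter(X, threshold=0.01):
    """Keep only the columns of X whose variance exceeds the threshold.

    A minimal filter-style selector: the measure (variance) and the
    threshold are fixed up front, independent of any model.
    """
    variances = X.var(axis=0)
    keep = variances > threshold
    return X[:, keep], np.flatnonzero(keep)

# Three features: informative, noisy, and constant.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0.0, 1.0, 200),   # high variance: kept
    rng.normal(0.0, 0.5, 200),   # moderate variance: kept
    np.full(200, 3.0),           # constant: variance 0, rejected
])
X_selected, kept_idx = variance_filter(X, threshold=0.01)
print(kept_idx)  # -> [0 1]
```

The constant column carries no information about the target, so any reasonable threshold discards it before the model ever sees it.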

Embedded methods perform feature selection implicitly, without any pre-defined selection criterion, deriving it from the training data itself. This intrinsic feature selection process is a part of the model training stage. The model learns to select features and make relevant predictions at the same time. In later sections, we will describe the role of regularization in performing this intrinsic feature selection.

Regularization and Model Complexity

Regularization is the technique of penalizing the complexity of the model to avoid overfitting and achieve generalization on the task.

Here, the complexity of the model is analogous to its power to adapt to the patterns in the training data. Consider a simple polynomial model in x with degree d: as we increase the degree d of the polynomial, the model gains flexibility to capture patterns in the observed data.

Overfitting and Underfitting

If we try to fit a polynomial model with d < 3 on a set of training samples derived from a cubic polynomial with some noise, the model will not be able to capture the distribution of the samples to a sufficient extent. The model simply lacks the power or complexity to model data generated from a degree-3 (or higher-order) polynomial. Such a model is said to underfit the training data.

Working with the same example, assume we now have a model with d ≥ 3. With the increased complexity, it should be easy for the model to estimate the original cubic polynomial that was used to generate the data (e.g. by setting the coefficients of all terms with exponent > 3 to 0). But if the training process is not terminated at the right time, the model will continue to use its additional flexibility to reduce the training error further and start capturing the noise in the samples too. This reduces the training error significantly, but the model now overfits the training data. The noise will change in real-world settings (or in the test phase), so any knowledge gained by fitting it does not transfer, resulting in high test error.
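This behaviour is easy to reproduce numerically. The sketch below is a toy setup of my own: it fits polynomials of increasing degree to noisy samples from a cubic, then compares training and test error. Typically, the degree-1 fit underfits (high error everywhere) while the degree-9 fit achieves the lowest training error at the cost of worse generalization than the cubic:

```python
import numpy as np

rng = np.random.default_rng(42)

def cubic(x):
    # The "true" data-generating polynomial (degree 3).
    return 1.0 - 2.0 * x + 0.5 * x**2 + 3.0 * x**3

# Noisy samples from the cubic.
x_train = rng.uniform(-1, 1, 20)
y_train = cubic(x_train) + rng.normal(0, 0.2, 20)
x_test = rng.uniform(-1, 1, 200)
y_test = cubic(x_test) + rng.normal(0, 0.2, 200)

train_mse, test_mse = {}, {}
for d in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=d)
    train_mse[d] = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse[d] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {d}: train MSE {train_mse[d]:.4f}, test MSE {test_mse[d]:.4f}")
```

The training error is guaranteed to decrease as the degree grows (each lower-degree model is a special case of the higher-degree one); the test error is what exposes under- and overfitting.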

How do we determine the optimal model complexity?

In practical settings, we have little-to-no understanding of the data-generating process or the true distribution of the data. Finding a model with the right complexity, such that neither underfitting nor overfitting occurs, is a challenge.

One technique is to start with a sufficiently powerful model and then reduce its complexity via feature selection. The fewer the features, the lower the complexity of the model.

As discussed in the previous section, feature selection can be explicit (filter, wrapper methods) or implicit. Redundant features that have insignificant relevance in determining the value of the response variable should be eliminated, to avoid the model learning uncorrelated patterns from them. Regularization also performs a similar task. So, how are regularization and feature selection connected in achieving the common goal of optimal model complexity?

L1 Regularization As A Feature Selector

Continuing with our polynomial model, we represent it as a function f of the input x, with parameters θ and degree d,

$$f(x; \theta) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d = \sum_{j=0}^{d} \theta_j x^j$$

For a polynomial model, each power of the input can be regarded as a feature, forming a vector of the form,

$$\mathbf{x} = \left[ x^0, x^1, x^2, \dots, x^d \right]$$
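In code, this feature vector is just the successive powers of the input; NumPy's `vander` builds it directly:

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])  # three sample inputs
d = 3                           # degree of the polynomial

# Each row is the feature vector [x^0, x^1, ..., x^d] for one sample.
features = np.vander(x, N=d + 1, increasing=True)
print(features)  # row for x=2.0 is [1, 2, 4, 8]
```

With this representation, the polynomial model is simply a linear model over the powers of x, which is why linear-model results about L1 regularization apply to it.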

We also define an objective function whose minimization leads us to the optimal parameters; it includes a term penalizing the complexity of the model (here, the L1 norm of the parameters, scaled by λ).

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i; \theta) - y_i \right)^2 + \lambda \sum_{j=0}^{d} \left| \theta_j \right|$$

To determine the minima of this function, we need to analyze all of its critical points, i.e. points where the derivative is zero or undefined.

The partial derivative w.r.t. one of the parameters, θ_j, can be written as,

$$\frac{\partial J}{\partial \theta_j} = \frac{2}{N} \sum_{i=1}^{N} \left( f(x_i; \theta) - y_i \right) x_i^j + \lambda \, \mathrm{sgn}(\theta_j)$$

where the sgn function is defined as,

$$\mathrm{sgn}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$$

Note: The derivative of the absolute value function is different from the sgn function defined above. The original derivative is undefined at x = 0. We augment the definition to remove the point of non-differentiability at x = 0 and make the function defined across the entire domain. Such augmented definitions are also used by ML frameworks when the underlying computation involves the absolute value function. Check this thread on the PyTorch forum.
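We can sanity-check the derivative formula numerically. The sketch below (helper names `objective` and `gradient` are my own) compares the analytic subgradient against a central finite-difference estimate at a point where every parameter is nonzero, so the absolute value is differentiable there:

```python
import numpy as np

def sgn(v):
    # Augmented sign: -1 for negative, 0 at exactly 0, +1 for positive,
    # matching np.sign and the convention frameworks use for |x| at 0.
    return np.sign(v)

def objective(theta, X, y, lam):
    # MSE term plus L1 penalty, as in the objective J above.
    residual = X @ theta - y
    return np.mean(residual ** 2) + lam * np.sum(np.abs(theta))

def gradient(theta, X, y, lam):
    # Analytic (sub)gradient: (2/N) * X^T r + lambda * sgn(theta).
    n = len(y)
    residual = X @ theta - y
    return (2.0 / n) * X.T @ residual + lam * sgn(theta)

rng = np.random.default_rng(0)
X = np.vander(rng.uniform(-1, 1, 50), N=4, increasing=True)
y = rng.normal(size=50)
theta = np.array([0.3, -0.7, 1.2, 0.5])  # all nonzero: |.| smooth here
lam = 0.1

eps = 1e-6
num_grad = np.array([
    (objective(theta + eps * e, X, y, lam)
     - objective(theta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(4)
])
print(np.max(np.abs(num_grad - gradient(theta, X, y, lam))))  # close to zero
```

Away from θ_j = 0 the two agree to numerical precision; at exactly zero only the augmented sgn gives a defined value.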

By computing the partial derivative of the objective function w.r.t. a single parameter θ_j and setting it to zero, we can construct an equation that relates the optimal value of θ_j to the predictions, targets, and features.

$$\frac{\partial J}{\partial \theta_j} = 0$$
$$\frac{2}{N} \sum_{i=1}^{N} \left( f(x_i; \theta) - y_i \right) x_i^j = -\lambda \, \mathrm{sgn}(\theta_j)$$

Let us examine the equation above. If we assume that the inputs and targets are centered about the mean (i.e. the data was standardized in the preprocessing step), the term on the LHS is effectively proportional to the covariance between the j-th feature and the difference between the predicted and target values.
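To see why the LHS is a covariance, recall that for mean-centered variables a and b, cov(a, b) = (1/N) Σ aᵢbᵢ. A quick numeric check of that identity, with made-up centered data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# A centered feature and a centered residual (predictions minus targets).
feature = rng.normal(size=n)
feature -= feature.mean()
residual = 0.8 * feature + rng.normal(size=n)
residual -= residual.mean()

lhs_term = (1.0 / n) * np.sum(residual * feature)  # summation from the equation
covariance = np.cov(residual, feature, bias=True)[0, 1]
print(abs(lhs_term - covariance))  # ~0: identical for centered data
```

So the stationarity condition really does constrain a covariance, which is what makes the feature-selection reading possible.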

Statistical covariance between two variables quantifies how strongly the two variables vary together; a near-zero covariance indicates that one variable carries little linear information about the other.

The sign function on the RHS forces the covariance on the LHS to assume one of only three values, −λ, 0 or λ (since the sign function only returns -1, 0 and 1). If the feature is redundant and does not influence the predictions, this covariance will be nearly zero, bringing the corresponding parameter to zero. This results in the feature being eliminated from the model.

Imagine the sign function as a canyon carved by a river. You can walk in the canyon (i.e. the river bed), but to get out of it you have to climb huge barriers or steep slopes. L1 regularization induces a similar 'thresholding' effect on the gradient of the loss function: the gradient must be powerful enough to break over the barrier, or the parameter eventually settles at zero.

For a more grounded example, consider a dataset that contains samples derived from a straight line (parameterized by two coefficients) with some added noise. The optimal model should have no more than two parameters, otherwise it will adapt to the noise present in the data (using the added freedom/power of the polynomial). Changing the parameters of the higher powers of the polynomial does not reduce the difference between the targets and the model's predictions, so the covariance between those features and the residual stays near zero.

During the training process, a constant step gets added to or subtracted from the gradient of the loss function. If the gradient of the loss function (MSE) is smaller than this constant step, the parameter will eventually reach a value of 0. Observe the equations below, depicting how parameters are updated with gradient descent,

$$\theta_j \leftarrow \theta_j - \alpha \frac{\partial J}{\partial \theta_j}$$
$$\theta_j \leftarrow \theta_j - \alpha \left( \frac{\partial L_{\mathrm{MSE}}}{\partial \theta_j} + \lambda \, \mathrm{sgn}(\theta_j) \right)$$

If the gradient of the MSE loss is much smaller than λ, which itself is a small number, the update is dominated by the nearly constant step αλ·sgn(θ_j). The sign of this step depends on sgn(θ_j), whose output depends on θ_j. If θ_j is positive, i.e. greater than 0, sgn(θ_j) equals 1, hence the update is approximately θ_j − αλ, pushing it towards zero.

To overcome the constant step αλ·sgn(θ_j) that drives the parameter to zero, the gradient of the loss function must be larger than this step. And for the loss gradient to be large, the value of the feature must affect the output of the model significantly.
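Putting the pieces together, here is a small sketch of my own, using plain subgradient descent (not a production solver), that fits a degree-4 polynomial to noisy straight-line data. The slope parameter survives; the higher-order parameters end up pinned near zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n, degree, lam, alpha = 200, 4, 0.1, 0.1

# Data from a straight line y = 2x plus noise.
x = rng.uniform(-1, 1, n)
y = 2.0 * x + rng.normal(0, 0.1, n)

# Polynomial features x^1..x^degree, centered (as assumed above),
# so no intercept term is needed.
X = np.vander(x, N=degree + 1, increasing=True)[:, 1:]
X -= X.mean(axis=0)
y = y - y.mean()

theta = np.zeros(degree)
for _ in range(3000):
    grad_mse = (2.0 / n) * X.T @ (X @ theta - y)
    # MSE gradient plus the constant regularization step.
    theta -= alpha * (grad_mse + lam * np.sign(theta))

print(np.round(theta, 3))
# theta[0] (the slope of x) stays large, somewhat shrunk by the penalty;
# theta[1:] (coefficients of x^2..x^4) hover near zero, oscillating
# within roughly +/- alpha*lam of 0 under plain subgradient steps.
```

Because the higher powers barely correlate with the residual, their MSE gradients never exceed the constant step, so L1 regularization holds those parameters at zero, exactly the feature-selection effect described above.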

This is how a feature is eliminated; or, more precisely, how its corresponding parameter, whose feature does not correlate with the output of the model, is zeroed out by L1 regularization during training.

Further Reading And Conclusion

  • To get more insights on the topic, I have posted a question on the r/MachineLearning subreddit, and the resulting thread contains different explanations that you may want to read.
  • Madiyar Aitbayev also has an interesting blog covering the same question, but with a geometric explanation.
  • Brian Keng’s blog explains regularization from a probabilistic perspective.
  • This thread on CrossValidated explains why the L1 norm encourages sparse models. A detailed blog by Mukul Ranjan explains why the L1 norm encourages the parameters to become zero, and not the L2 norm.

"L1 regularization performs feature selection" is a simple statement that most ML learners agree with, without diving deep into how it actually works internally. This blog is an attempt to bring my understanding and mental model to the readers in order to answer the question in an intuitive manner. For suggestions and doubts, you can find my email at my website. Keep learning and have a nice day ahead!
