Courage to Learn ML: Demystifying L1 & L2 Regularization (part 3)

I’m glad you brought up this question. To get straight to the point, we typically avoid p values lower than 1 because they lead to non-convex optimization problems. Let me illustrate this with a picture showing the shapes of Lp norms for various values of p. Take a close look at the case p = 0.5; you’ll notice that the shape is decidedly non-convex.

The shape of Lp norms for different p values. Source: https://lh5.googleusercontent.com/EoX3sngY7YnzCGY9CyMX0tEaNuKD3_ZiF4Fp3HQqbyqPtXks2TAbpTj5e4tiDv-U9PT0MAarRrPv6ClJ06C0HXQZKHeK40ZpVgRKke8-Ac0TAqdI7vWFdCXjK4taR40bdSdhGkWB

This becomes even clearer when we look at a 3D representation, assuming we’re optimizing over three weights. In this case, it’s evident that the problem isn’t convex, with numerous local minima appearing along the boundary.

Source: https://ekamperi.github.io/images/lp_norms_3d.png
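To see this non-convexity concretely, here is a minimal NumPy sketch (my own illustration, not from the original post): for p = 0.5, two points inside the unit “ball” can have a midpoint that falls outside it, which is exactly what convexity forbids.

```python
import numpy as np

def lp_norm(w, p):
    """Compute ||w||_p = (sum_i |w_i|^p)^(1/p); for p < 1 this is only a quasi-norm."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

# Two points that lie on the boundary of the unit "ball" for p = 0.5
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
midpoint = 0.5 * (a + b)

print(lp_norm(a, 0.5))         # 1.0 -> inside the closed unit ball
print(lp_norm(b, 0.5))         # 1.0 -> inside the closed unit ball
print(lp_norm(midpoint, 0.5))  # 2.0 -> the midpoint falls OUTSIDE, so the ball is non-convex

# For comparison, the L2 ball is convex: the same midpoint stays inside
print(lp_norm(midpoint, 2))    # ~0.707
```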

The reason we typically avoid non-convex problems in machine learning is their complexity. With a convex problem, you’re guaranteed a global minimum, which generally makes it easier to solve. Non-convex problems, on the other hand, often have multiple local minima and can be computationally intensive and unpredictable. These are exactly the kinds of challenges we aim to sidestep in ML.

When we use techniques like Lagrange multipliers to optimize a function under constraints, it’s crucial that those constraints are convex functions. This ensures that adding them to the original problem doesn’t alter its fundamental properties or make it harder to solve. This aspect is critical; otherwise, adding constraints could introduce new difficulties into the original problem.
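As a brief illustration in my own notation (not from the original post), here are the constrained view and the penalized view of regularization; the correspondence between them via a multiplier λ is only well behaved when the constraint set ||w||_p ≤ t is convex, i.e. when p ≥ 1.

```latex
% Constrained form: minimize the loss within a norm budget t
\min_{w} \; L(w) \quad \text{subject to} \quad \lVert w \rVert_p \le t

% Penalized (Lagrangian) form: fold the constraint into the objective
\min_{w} \; L(w) + \lambda \, \lVert w \rVert_p
```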

Your question touches on an interesting aspect of deep learning. It’s not that we prefer non-convex problems; rather, we frequently encounter them and must deal with them in the field of deep learning. Here’s why:

  1. The nature of deep learning models leads to a non-convex loss surface: Most deep learning models, particularly neural networks with hidden layers, inherently have non-convex loss functions. This is due to the complex, non-linear transformations that occur inside these models. The combination of these non-linearities and the high dimensionality of the parameter space typically results in a loss surface that is non-convex.
  2. Local minima are less of an issue in deep learning: In the high-dimensional spaces typical of deep learning, local minima are not as problematic as they might be in lower-dimensional spaces. Research suggests that many of the local minima in deep learning are close in value to the global minimum. Moreover, saddle points, where the gradient is zero but which are neither maxima nor minima, are more common in such spaces and pose a bigger challenge.
  3. Advanced optimization techniques are effective in non-convex spaces: Methods such as stochastic gradient descent (SGD) and its variants have been particularly effective at finding good solutions in these non-convex spaces (see the sketch after this list). While these solutions may not be global minima, they are often good enough to achieve high performance on practical tasks.
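To make the third point concrete, here is a tiny, self-contained sketch (my example, not the author’s) of noisy gradient descent on a one-dimensional non-convex function; it settles into one of several minima, which is often good enough in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # A simple non-convex function with several local minima
    return np.sin(3 * w) + 0.1 * w ** 2

def grad(w):
    # Analytical gradient of the loss above
    return 3 * np.cos(3 * w) + 0.2 * w

w = rng.uniform(-3, 3)  # random initialization
lr = 0.01
for step in range(2000):
    noise = rng.normal(scale=0.1)   # crude stand-in for minibatch noise
    w -= lr * (grad(w) + noise)     # stochastic-gradient-style update

print(f"settled at w = {w:.3f}, loss = {loss(w):.3f}")
```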

Even though deep learning models are non-convex, they excel at capturing complex patterns and relationships in large datasets. Moreover, research into non-convex optimization is continually progressing, enhancing our understanding. Looking ahead, we may be able to handle non-convex problems more efficiently and with fewer concerns.

Recall the image we discussed earlier showing the shapes of Lp norms for various values of p. As p increases, the Lp norm’s shape evolves. For instance, at p = 3 it resembles a square with rounded corners, and as p approaches infinity it becomes a perfect square.
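A quick numerical check (my own sketch, not from the post) of why the shape approaches a square: as p grows, the Lp norm of any fixed vector converges to its largest absolute coordinate, i.e. the max-norm, whose unit ball is exactly a square.

```python
import numpy as np

w = np.array([0.5, -2.0, 1.5])
for p in [1, 2, 3, 4, 10, 100]:
    lp = np.sum(np.abs(w) ** p) ** (1.0 / p)
    print(f"p = {p:>3}: ||w||_p = {lp:.4f}")

print("max |w_i|  =", np.max(np.abs(w)))  # the limit as p -> infinity
```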

The shape of Lp norms for different p values. Source: https://lh5.googleusercontent.com/EoX3sngY7YnzCGY9CyMX0tEaNuKD3_ZiF4Fp3HQqbyqPtXks2TAbpTj5e4tiDv-U9PT0MAarRrPv6ClJ06C0HXQZKHeK40ZpVgRKke8-Ac0TAqdI7vWFdCXjK4taR40bdSdhGkWB

In the context of our optimization problem, consider higher-order norms like L3 or L4. Similar to L2 regularization, where the loss function’s contours and the constraint region intersect at rounded edges, these higher norms would also encourage weights to approach zero, just like L2 regularization. (If this part isn’t clear, feel free to revisit Part 2 for a more detailed explanation.) Based on this observation, we can discuss the two main reasons why L3 and L4 norms aren’t commonly used:

  1. L3 and L4 norms have effects similar to L2 without offering significant new benefits (they push weights toward zero, but not exactly to zero). L1 regularization, in contrast, zeroes out weights and introduces sparsity, which is useful for feature selection (see the gradient sketch after this list).
  2. Computational complexity is another important aspect. Regularization affects the complexity of the optimization process. L3 and L4 norms are computationally heavier than L2, making them less practical for most machine learning applications.
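Here is a short sketch (mine, not the author’s) illustrating the first point via the gradient of the penalty term |w|^p near zero: for p = 1 the pull toward zero stays constant, which is what drives weights exactly to zero, while for p = 2, 3, 4 the pull fades as the weight shrinks, so weights end up small but rarely exactly zero.

```python
import numpy as np

def penalty_grad(w, p):
    """Gradient of |w|^p with respect to w (for w != 0)."""
    return p * np.sign(w) * np.abs(w) ** (p - 1)

for w in [1.0, 0.1, 0.01]:
    row = ", ".join(f"p={p}: {penalty_grad(w, p):.4f}" for p in (1, 2, 3, 4))
    print(f"w = {w:>4}: {row}")

# For p = 1 the gradient magnitude is always 1.0, no matter how small w gets.
# For p = 2, 3, 4 it shrinks with w, so the push toward exactly zero vanishes.
```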

To sum up, while L3 and L4 norms could be used in theory, they don’t provide unique advantages over L1 or L2 regularization, and their computational cost makes them a less practical choice.

Yes, it’s indeed possible to combine L1 and L2 regularization, a technique known as Elastic Net regularization. This approach blends the properties of both L1 (lasso) and L2 (ridge) regularization and can be useful, though it comes with its own trade-offs.

Elastic Net regularization is a linear combination of the L1 and L2 regularization terms: it adds both the L1 and the L2 norm to the loss function. As a result, it has two parameters to tune, lambda1 and lambda2.

Elastic Net regularization. Source: https://wikimedia.org/api/rest_v1/media/math/render/svg/a66c7bfcf201d515eb71dd0aed5c8553ce990b6e
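For a linear model, the Elastic Net estimate is commonly written as below (the standard textbook form, matching the two tuning parameters mentioned above), with λ1 weighting the L1 term and λ2 weighting the L2 term:

```latex
\hat{\beta} \;=\; \underset{\beta}{\operatorname{argmin}} \Big( \lVert y - X\beta \rVert^{2} \;+\; \lambda_{2} \lVert \beta \rVert^{2} \;+\; \lambda_{1} \lVert \beta \rVert_{1} \Big)
```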

By combining both regularization techniques, Elastic Net can improve the model’s generalization capability, reducing the risk of overfitting more effectively than using either L1 or L2 alone.

Let’s break down its benefits:

  1. Elastic Net provides more stability than L1. L1 regularization can lead to sparse models, which is useful for feature selection, but it can also be unstable in certain situations. For example, among highly correlated variables, L1 regularization tends to select one feature arbitrarily while driving the others’ coefficients to zero, whereas Elastic Net can distribute the weights more evenly among those variables.
  2. L2 regularization can be more stable than L1, but it doesn’t encourage sparsity. Elastic Net aims to balance these two aspects, potentially resulting in more robust models.

However, Elastic Net regularization introduces an extra hyperparameter that demands careful tuning. Achieving the right balance between L1 and L2 regularization and optimal model performance requires additional computational effort. This added complexity is why it isn’t used as frequently as plain L1 or L2.
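For reference, here is a minimal usage sketch with scikit-learn’s ElasticNet (my example, not from the post). In scikit-learn the two lambdas are re-parameterized as an overall strength alpha and a mixing ratio l1_ratio.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Toy data: 100 samples, 10 features, only the first 3 actually matter
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))
true_coef = np.array([3.0, -2.0, 1.5] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.1, size=100)

# alpha = overall penalty strength; l1_ratio = share of the L1 term in the mix
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(X, y)

print("learned coefficients:", np.round(model.coef_, 3))
# The L1 part pushes the irrelevant coefficients toward (or exactly to) zero,
# while the L2 part keeps the estimates of correlated features more stable.
```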
