
SNRAdam: Improving the Adam Optimizer


Optimizers are essential tools in the modeling stack. One of the most widely used optimizers is the Adam optimizer introduced by Kingma and Ba [paper]. Adam keeps running averages of the gradient (the momentum term) and of the squared gradient (the energy term, i.e., the second moment) using exponential moving average (EMA) filters, and it normalizes the momentum term by the square root of the energy term before taking a step.

Implementation of Adam from PyTorch documentation
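In Python, the core update for a single parameter looks roughly like the sketch below (a simplified version with the usual bias correction, not the actual PyTorch source; the function name and defaults are purely illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One simplified Adam update for a single parameter (or array of parameters)."""
    m = beta1 * m + (1 - beta1) * grad        # momentum term: EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2   # energy term: EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)              # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```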

This seems like a very good idea, right? For one, it resembles a diagonal approximation of second-order optimization, and for another, when the gradients are noisy, the denominator (the square root of the energy term) is large relative to the numerator (the momentum term) and the steps are small. Conversely, when the gradients are consistent, the denominator is roughly equal in magnitude to the numerator, and we take constant-sized steps equal to the learning rate. That is why this optimizer is the de facto choice among ML researchers and practitioners.
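As a rough numerical illustration of this behavior (the helper below is purely illustrative and not part of any optimizer), the ratio m̂ / √v̂ is close to 1 for a constant gradient and much smaller for a noisy, zero-mean gradient:

```python
import numpy as np

def adam_ratio(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return |m_hat| / (sqrt(v_hat) + eps) after running Adam's EMAs over grads."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return abs(m_hat) / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
consistent = np.full(1000, 0.5)            # constant gradient
noisy = rng.normal(0.0, 0.5, size=1000)    # zero-mean noisy gradient

print(adam_ratio(consistent))  # ~1.0: effective step ~ learning rate
print(adam_ratio(noisy))       # typically well below 1: much smaller step
```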

Many variations of the Adam optimizer have been proposed and studied (see AdamW from the PyTorch docs and optimizers like QHAdam from the torch_optimizer package). Here, we propose a variation that does not deviate significantly from Adam for noisy “recent” gradients but greatly amplifies the effective step size for parameters with a consistent “recent” gradient history (how recent depends on the EMA parameters β1 and β2).

Our modification is simple, yet effective: we replace the EMA filter of the gradient energy term with an EMA filter of the gradient variance term. This yields the final update equation θ(t) = θ(t-1) − γ · SNR, where SNR is the signal-to-noise ratio of the “recent” gradient history (SNR is a term borrowed from the signal-processing literature and refers to the ratio of the mean of a signal to its standard deviation). Parameters with a high SNR, i.e., with consistent “recent” gradient histories, therefore see a much larger step size (as large as infinity if the gradient is constant) than those with noisy “recent” gradient histories (i.e., low gradient SNRs). The implementation of this optimizer is simple and is given below for completeness.
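Below is a minimal sketch of one way to write such an optimizer as a PyTorch Optimizer subclass. The class name, the Adam-style bias correction, and the eps term added to the standard deviation are implementation choices of this sketch rather than a definitive reference implementation:

```python
import torch
from torch.optim import Optimizer

class SNRAdam(Optimizer):
    """Adam variant that normalizes the momentum term by the EMA-based
    standard deviation of the gradient (rather than the root of its raw
    second moment), so the step is proportional to the gradient's SNR."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        defaults = dict(lr=lr, betas=betas, eps=eps)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr, (beta1, beta2), eps = group["lr"], group["betas"], group["eps"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                grad = p.grad
                state = self.state[p]
                if len(state) == 0:
                    state["step"] = 0
                    state["exp_avg"] = torch.zeros_like(p)  # EMA of gradient (mean)
                    state["exp_var"] = torch.zeros_like(p)  # EMA of gradient variance
                state["step"] += 1
                t = state["step"]
                exp_avg, exp_var = state["exp_avg"], state["exp_var"]

                # Momentum term: identical to Adam.
                exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
                # Variance term: EMA of the squared deviation from the running mean.
                dev = grad - exp_avg
                exp_var.mul_(beta2).addcmul_(dev, dev, value=1 - beta2)

                # Adam-style bias correction (an assumption of this sketch).
                mean_hat = exp_avg / (1 - beta1 ** t)
                std_hat = (exp_var / (1 - beta2 ** t)).sqrt().add_(eps)

                # Step size proportional to the gradient SNR (mean / std).
                p.add_(mean_hat / std_hat, alpha=-lr)
        return loss
```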

We ran experiments using a simple 100K-parameter Vision Transformer model (loosely inspired by this medium post) on the MNIST dataset with batch size 4096 over 20 epochs with learning rate 1e-3. The results indicate faster convergence for the proposed optimizer compared to the Adam optimizer (we plot one of the runs, but we observed the same behavior consistently for this model and dataset combination):

Training loss converges faster for SNRAdam compared to Adam
Validation loss converges faster for SNRAdam compared to Adam
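For reference, a hypothetical training setup mirroring these hyperparameters could look like the following, using the SNRAdam sketch above; the small MLP is only a stand-in for the ~100K-parameter Vision Transformer, and the data-pipeline details are assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Stand-in model: a small MLP in place of the ~100K-parameter ViT.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = SNRAdam(model.parameters(), lr=1e-3)

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=4096, shuffle=True)

for epoch in range(20):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = F.cross_entropy(model(images), labels)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: last training loss {loss.item():.4f}")
```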

In order to disentangle the source of the gains, we set the batch size equal to the dataset size (i.e., full-batch gradient descent) and compare the two algorithms. This shows whether the gains come from correcting for the “stochastic” part of stochastic gradient descent or from the noise in the gradient along the optimization trajectory (the gradient-descent portion). We see that the gains come from compensating for the latter (noise in the gradient coming from the trajectory rather than from SGD):

Comparison of the two algorithms for batch size = dataset size (train)
Comparison of the two algorithms for batch size = dataset size (validation)
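In code, this ablation amounts to replacing the minibatch loader in the hypothetical setup above with a full-batch loader (again only an illustrative sketch):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Full-batch variant: batch size = dataset size, which removes minibatch
# sampling noise so only trajectory noise remains in the gradient.
train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
full_batch_loader = DataLoader(train_set, batch_size=len(train_set), shuffle=False)
```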
