An introduction to the mechanics of AutoDiff, exploring its mathematical principles, implementation strategies, and applications in today's most widely used frameworks
At the center of machine learning lies the optimization of loss (objective) functions. This optimization relies heavily on computing gradients of those functions with respect to the model parameters. As Baydin et al. (2018) explain in their comprehensive survey [1], these gradients guide the iterative updates in optimization algorithms such as stochastic gradient descent (SGD):
θₜ₊₁ = θₜ − α ∇_θ L(θₜ)
Where:
- θₜ represents the model parameters at step t
- α is the learning rate
- ∇_θ L(θₜ) denotes the gradient of the loss function L with respect to the parameters θ
This straightforward update rule belies the complexity of computing gradients in deep neural networks with millions or even billions of parameters.
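As a toy illustration of the update rule itself, here is a minimal sketch using JAX's automatic differentiation. The linear model, squared-error loss, and synthetic data below are illustrative assumptions, not something prescribed by any particular framework or by the survey:

```python
# Minimal SGD sketch: θ_{t+1} = θ_t − α ∇_θ L(θ_t), with ∇_θ L computed by autodiff.
# The model, loss, and data are hypothetical and chosen only for illustration.
import jax
import jax.numpy as jnp

def loss(theta, x, y):
    # Squared-error loss for a simple linear model y ≈ x · θ
    pred = x @ theta
    return jnp.mean((pred - y) ** 2)

# jax.grad differentiates loss with respect to its first argument (θ)
grad_loss = jax.grad(loss)

def sgd_step(theta, x, y, alpha=0.01):
    # One gradient-descent update
    return theta - alpha * grad_loss(theta, x, y)

# Synthetic data from a known parameter vector
key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 3))
true_theta = jnp.array([1.0, -2.0, 0.5])
y = x @ true_theta

theta = jnp.zeros(3)
for _ in range(200):
    theta = sgd_step(theta, x, y)
# theta now approximates true_theta
```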