Hamiltonian mechanics is a way to describe how physical systems, like planets or pendulums, move over time, focusing on energy rather than just forces. By reframing complex dynamics through an energy lens, this nineteenth-century physics framework now powers cutting-edge generative AI. It uses generalized coordinates \( q \) (like position) and their conjugate momenta \( p \) (related to momentum), forming a phase space that captures the system's state. This approach is especially useful for complex systems with many parts, making it easier to find patterns and conservation laws.
Mathematical Reformulation: From Second-Order to First-Order ⚙️
Newton's \( F = m\ddot{q} \) requires solving second-order differential equations, which become unwieldy for constrained systems or when identifying conserved quantities.
The Core Idea
Hamiltonian mechanics splits \( \ddot{q} = F(q)/m \) into two first-order equations by introducing the conjugate momentum \( p \):
\[
\dot{q} = \frac{\partial H}{\partial p} \quad \text{(Position)}, \qquad \dot{p} = -\frac{\partial H}{\partial q} \quad \text{(Momentum)}
\]
It decomposes acceleration into complementary momentum/position flows. This phase space perspective reveals hidden geometric structure.
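For a concrete example, take a unit-mass harmonic oscillator with \( H(q, p) = \frac{p^2}{2} + \frac{kq^2}{2} \). Hamilton's equations immediately give
\[
\dot{q} = \frac{\partial H}{\partial p} = p, \qquad \dot{p} = -\frac{\partial H}{\partial q} = -kq,
\]
which together reproduce \( \ddot{q} = -kq \) without ever writing down a second-order equation.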
Lagrangian Prelude: Action Principles
The Lagrangian \( \mathcal{L}(q, \dot{q}) = K - U \) leads to the Euler-Lagrange equations via variational calculus:
\[
\frac{d}{dt}\left( \frac{\partial \mathcal{L}}{\partial \dot{q}} \right) - \frac{\partial \mathcal{L}}{\partial q} = 0
\]
Kinetic Energy Symbol
Note that the \( K \) in \( \mathcal{L}(q, \dot{q}) = K - U \) is also commonly written as \( T \).
But these remain second-order. The critical leap comes through the Legendre transformation \( \dot{q} \rightarrow p \). The Hamiltonian is derived from the Lagrangian by defining the conjugate momentum as \( p_i = \frac{\partial \mathcal{L}}{\partial \dot{q}_i} \); the Hamiltonian can then be written as:
\[
H(q,p) = \sum_i p_i \dot{q}_i - \mathcal{L}(q, \dot{q})
\]
We can write \( H(q,p) \) more intuitively as:
\[
H(q,p) = K(p) + U(q)
\]
This flips the script: instead of \( \dot{q} \)-centric dynamics, we get symplectic phase flow.
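As a quick check, for the standard Lagrangian \( \mathcal{L} = \frac{1}{2}m\dot{q}^2 - U(q) \) the conjugate momentum is \( p = \partial \mathcal{L} / \partial \dot{q} = m\dot{q} \), and the Legendre transformation gives
\[
H(q,p) = p\dot{q} - \mathcal{L} = \frac{p^2}{m} - \frac{p^2}{2m} + U(q) = \frac{p^2}{2m} + U(q),
\]
exactly the kinetic-plus-potential form above.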
Why This Matters
The Hamiltonian becomes the system's total energy \( H = K + U \) for many physical systems. It also provides a framework where time evolution is a canonical transformation, a symmetry preserving the fundamental Poisson bracket structure \( \{q_i, p_j\} = \delta_{ij} \).
For more about canonical and non-canonical transformations and Poisson brackets, including detailed math and examples, check out the TorchEBM post on Hamiltonian mechanics.
Newton vs. Lagrange vs. Hamilton: A Philosophical Showdown
Aspect | Newtonian | Lagrangian | Hamiltonian |
---|---|---|---|
State Variables | Position \( x \) and velocity \( \dot{x} \) | Generalized coordinates \( q \) and velocities \( \dot{q} \) | Generalized coordinates \( q \) and conjugate momenta \( p \) |
Formulation | Second-order differential equations (\( F = ma \)) | Principle of least action (\( \delta \int L \, dt = 0 \)): \( L = K - U \) | First-order differential equations from the Hamiltonian function (phase flow \( dH \)): \( H = K + U \) |
Identifying Symmetries | Manual identification or through specific methods | Noether's theorem | Canonical transformations and Poisson brackets |
Machine Learning Connection | Physics-informed neural networks, simulations | Optimal control, reinforcement learning | Hamiltonian Monte Carlo (HMC) sampling, energy-based models |
Energy Conservation | Not inherent (must be derived) | Built-in through conservation laws | Central (Hamiltonian is the energy) |
General Coordinates | Possible, but often cumbersome | Natural fit | Natural fit |
Time Reversibility | Yes | Yes | Yes, especially in symplectic formulations |
Hamilton’s Equations: The Geometry of Phase Space ⚙️
The phase space is a mathematical space in which we can represent the set of possible states of a physical system. For a system with \( n \) degrees of freedom, the phase space is \( 2n \)-dimensional, often visualized as a map where each point \( (q, p) \) represents a unique state. The evolution of the system is described by the motion of a point in this space, governed by Hamilton's equations.
This formulation offers several benefits. It makes it straightforward to identify conserved quantities and symmetries through canonical transformations and Poisson brackets, which gives deeper insight into the system's behavior. For instance, Liouville's theorem states that the volume in phase space occupied by an ensemble of systems remains constant over time, expressed as:
\[
\frac{\partial \rho}{\partial t} + \{\rho, H\} = 0
\]
or equivalently:
\[
\frac{\partial \rho}{\partial t} + \sum_i \left(\frac{\partial \rho}{\partial q_i}\frac{\partial H}{\partial p_i} - \frac{\partial \rho}{\partial p_i}\frac{\partial H}{\partial q_i}\right) = 0
\]
where \( \rho(q, p, t) \) is the density function. This lets us describe phase space flows and how they preserve volume under symplectic transformations. The connection to symplectic geometry yields mathematical properties that are directly relevant to many numerical methods. For instance, it is what lets Hamiltonian Monte Carlo perform well in high dimensions, by defining MCMC proposals that increase the probability of accepting a sample (particle).
Symplecticity: The Sacred Invariant
Hamiltonian flows preserve the symplectic 2-form \( \omega = \sum_i dq_i \wedge dp_i \).
Symplectic 2-form \( \omega \)
The symplectic 2-form, denoted \( \omega = \sum_i dq_i \wedge dp_i \), is a mathematical object used in symplectic geometry. It measures the area of parallelograms formed by vectors in the tangent space of phase space.
- \( dq_i \) and \( dp_i \): Infinitesimal changes in position and momentum coordinates.
- \( \wedge \): The wedge product, which combines differential forms in an antisymmetric way, meaning that \( dq_i \wedge dp_i = -dp_i \wedge dq_i \).
- \( \sum_i \): Sum over all degrees of freedom.
Imagine a phase space where each point represents a state of a physical system. The symplectic form assigns a value to every pair of vectors, effectively measuring the area of the parallelogram they span. This area is preserved under Hamiltonian flows.
Key Properties
- Closed: Its exterior derivative vanishes, \( d\omega = 0 \). This property ensures that the form doesn't change under continuous transformations.
- Non-degenerate: The form is non-degenerate if \( \omega(X, Y) = 0 \) for all \( Y \) implies \( X = 0 \). This ensures that every non-zero vector has a unique "partner" vector such that their pairing under \( \omega \) is non-zero.
Example
For a simple harmonic oscillator with one degree of freedom, \( \omega = dq \wedge dp \). This measures the area of parallelograms in phase space spanned by vectors representing changes in position and momentum.
A Very Simplistic PyTorch Code:
While PyTorch doesn't directly handle differential forms, you can conceptually represent the symplectic form using tensors:
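Here is one minimal sketch of that idea (the matrix \( J \) and the `omega` helper below are illustrative choices, not part of any particular library): the form is encoded as the block matrix \( J \), and \( \omega(u, v) = u^T J v \) is evaluated on pairs of phase-space vectors.

```python
import torch

# Represent the symplectic form on a 2n-dimensional phase space (q_1..q_n, p_1..p_n)
# by the block matrix J = [[0, I], [-I, 0]], so that omega(u, v) = u^T J v.
n = 2  # degrees of freedom
J = torch.zeros(2 * n, 2 * n)
J[:n, n:] = torch.eye(n)
J[n:, :n] = -torch.eye(n)

def omega(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Evaluate the symplectic 2-form on two phase-space vectors."""
    return u @ J @ v

u, v = torch.randn(2 * n), torch.randn(2 * n)
# Antisymmetry mirrors dq ^ dp = -dp ^ dq: omega(u, v) equals -omega(v, u)
print(omega(u, v).item(), -omega(v, u).item())
```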
This code illustrates the antisymmetric nature of the wedge product.
Numerically, this means good integrators must respect:
\[
\frac{\partial (q(t + \epsilon), p(t + \epsilon))}{\partial (q(t), p(t))}^T J \, \frac{\partial (q(t + \epsilon), p(t + \epsilon))}{\partial (q(t), p(t))} = J, \quad \text{where } J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}
\]
Breaking Down the Formula
- Geometric numerical integration: Solves differential equations while preserving geometric properties of the system.
- Symplecticity: A geometric property inherent to Hamiltonian systems. It ensures that the area of geometric structures (e.g., parallelograms) in phase space \( (q, p) \) remains constant over time. This is encoded in the symplectic form \( \omega = \sum_i dq_i \wedge dp_i \).
- A numerical method is symplectic: if it preserves \( \omega \). The Jacobian matrix of the transformation from \( (q(t), p(t)) \) to \( (q(t + \epsilon), p(t + \epsilon)) \) must satisfy the condition above.
- Jacobian matrix \( \frac{\partial (q(t + \epsilon), p(t + \epsilon))}{\partial (q(t), p(t))} \): Quantifies how small changes in the initial state \( (q(t), p(t)) \) propagate to the next state \( (q(t + \epsilon), p(t + \epsilon)) \).
- \( q(t) \) and \( p(t) \): Position and momentum at time \( t \).
- \( q(t + \epsilon) \) and \( p(t + \epsilon) \): Updated position and momentum after one time step \( \epsilon \).
- \( \frac{\partial}{\partial (q(t), p(t))} \): Partial derivatives with respect to the initial state.
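For a one-degree-of-freedom harmonic oscillator \( H = \frac{1}{2}(q^2 + p^2) \), the one-step maps of simple integrators are linear, so their Jacobians are constant matrices \( A \) and the condition above reduces to \( A^T J A = J \). A small sketch of that check (the step size and matrices below are illustrative):

```python
import torch

eps = 0.1
J = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])  # symplectic form for one (q, p) pair

# Forward Euler: q' = q + eps*p, p' = p - eps*q  (acting on the vector [q, p])
A_euler = torch.tensor([[1.0, eps], [-eps, 1.0]])

# Symplectic Euler: kick p first (p' = p - eps*q), then drift q with the new p
A_symplectic = torch.tensor([[1.0 - eps**2, eps], [-eps, 1.0]])

for name, A in [("forward Euler", A_euler), ("symplectic Euler", A_symplectic)]:
    print(name, torch.allclose(A.T @ J @ A, J, atol=1e-6))
# forward Euler: False (A^T J A = (1 + eps^2) J, so phase-space area grows each step)
# symplectic Euler: True (det A = 1, the phase-space area is preserved)
```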
How are We Going to Solve it?
Numerical solvers for differential equations inevitably introduce errors that affect solution accuracy. These errors manifest as deviations from the true trajectory in phase space, particularly noticeable in energy-conserving systems like the harmonic oscillator. The errors fall into two main categories: local truncation error, arising from the approximation of continuous derivatives with discrete steps (proportional to \( \mathcal{O}(\epsilon^{n+1}) \), where \( \epsilon \) is the step size and \( n \) depends on the method); and global accumulation error, which compounds over integration time.
Forward Euler Method Fails at This!
Key Issue: Energy Drift from Non-Symplectic Updates
The forward Euler method (FEM) violates the geometric structure of Hamiltonian systems, resulting in energy drift in long-term simulations. Let’s dissect why.
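To see the drift concretely, take the harmonic oscillator \( H(q,p) = \frac{1}{2}(q^2 + p^2) \). One forward Euler step multiplies the energy by a fixed factor larger than one:
\[
q_{n+1} = q_n + \epsilon p_n, \quad p_{n+1} = p_n - \epsilon q_n
\quad \Longrightarrow \quad
H_{n+1} = \tfrac{1}{2}\left(q_{n+1}^2 + p_{n+1}^2\right) = (1 + \epsilon^2)\, H_n,
\]
so the trajectory spirals outward no matter how small \( \epsilon \) is.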
For a detailed exploration of how methods like forward Euler perform in Hamiltonian systems and why they don't preserve symplecticity, including mathematical breakdowns and practical examples, check out this post on Hamiltonian mechanics from the TorchEBM library documentation.
To overcome this, we turn to symplectic integrators: methods that respect the underlying geometry of Hamiltonian systems, leading us naturally to the Leapfrog Verlet method, a robust symplectic alternative. 🚀
Symplectic Numerical Integrators 💻
Leapfrog Verlet
For a separable Hamiltonian \( H(q,p) = K(p) + U(q) \), where the corresponding probability distribution is given by:
\[
P(q,p) = \frac{1}{Z} e^{-U(q)} e^{-K(p)},
\]
the Leapfrog Verlet integrator proceeds as follows:
\[
\begin{aligned}
p_{i}\left(t + \frac{\epsilon}{2}\right) &= p_{i}(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial q_{i}}(q(t)) \\
q_{i}(t + \epsilon) &= q_{i}(t) + \epsilon \frac{\partial K}{\partial p_{i}}\left(p\left(t + \frac{\epsilon}{2}\right)\right) \\
p_{i}(t + \epsilon) &= p_{i}\left(t + \frac{\epsilon}{2}\right) - \frac{\epsilon}{2} \frac{\partial U}{\partial q_{i}}(q(t + \epsilon))
\end{aligned}
\]
This Störmer-Verlet scheme preserves symplecticity exactly, with local error \( \mathcal{O}(\epsilon^3) \) and global error \( \mathcal{O}(\epsilon^2) \). You can read more about numerical methods and analysis in Python here.
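A minimal sketch of the scheme above for a separable Hamiltonian with a unit mass matrix, using autograd for \( \partial U / \partial q \) (the `leapfrog` helper and its signature are illustrative, not the TorchEBM API):

```python
import torch

def leapfrog(q, p, potential, eps=0.1, n_steps=100):
    """Störmer-Verlet (leapfrog) trajectory for H(q, p) = U(q) + 0.5 * |p|^2."""
    grad_U = lambda x: torch.autograd.grad(potential(x).sum(), x)[0]
    q = q.clone().requires_grad_(True)
    p = p - 0.5 * eps * grad_U(q)                        # initial half-step momentum
    for step in range(n_steps):
        q = (q + eps * p).detach().requires_grad_(True)  # full-step position
        # intermediate full kicks merge two half-kicks; the final one stays a half-kick
        p = p - (eps if step < n_steps - 1 else 0.5 * eps) * grad_U(q)
    return q.detach(), p

# Harmonic oscillator U(q) = 0.5 q^2: the energy stays close to its initial value
U = lambda q: 0.5 * q**2
q0, p0 = torch.tensor([1.0]), torch.tensor([0.0])
qT, pT = leapfrog(q0, p0, U, eps=0.1, n_steps=1000)
print((U(qT) + 0.5 * pT**2).item())  # stays near 0.5, the initial energy H(q0, p0)
```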
How Exactly?
Want to know exactly how the Leapfrog Verlet method ensures symplecticity, with detailed equations and proofs? The TorchEBM library documentation on Leapfrog Verlet breaks it down step by step.
Why Symplecticity Matters
They’re the reversible neural nets of physics simulations!
Symplectic integrators like Leapfrog Verlet are critical for long-term stability in Hamiltonian systems.
- Phase space preservation: The volume in \( (q, p) \)-space is conserved exactly, avoiding artificial energy drift.
- Approximate energy conservation: While the energy \( H(q,p) \) isn't perfectly conserved (due to the \( \mathcal{O}(\epsilon^2) \) error), it oscillates near the true value over exponentially long timescales.
- Practical relevance: This makes symplectic integrators indispensable in molecular dynamics and Hamiltonian Monte Carlo (HMC), where accurate sampling relies on stable trajectories.

Euler's method (first-order) systematically injects energy into the system, causing the characteristic outward spiral seen in the plots. The modified Euler method (second-order) significantly reduces this energy drift. Most importantly, symplectic integrators like the Leapfrog method preserve the geometric structure of Hamiltonian systems even with relatively large step sizes by maintaining phase-space volume conservation. This structural preservation is why Leapfrog remains the preferred method for long-time simulations in molecular dynamics and astronomy, where energy conservation is critical, despite the visible polygon-like discretization artifacts at large step sizes.
Non-symplectic methods (e.g., Euler-Maruyama) often fail catastrophically in these settings.
Integrator | Symplecticity | Order | Type |
---|---|---|---|
Euler Method | ❌ | 1 | Explicit |
Symplectic Euler | ✅ | 1 | Explicit |
Leapfrog (Verlet) | ✅ | 2 | Explicit |
Runge-Kutta 4 | ❌ | 4 | Explicit |
Forest-Ruth Integrator | ✅ | 4 | Explicit |
Yoshida Sixth-order | ✅ | 6 | Explicit |
Heun’s Method (RK2) | ❌ | 2 | Explicit |
Third-order Runge-Kutta | ❌ | 3 | Explicit |
Implicit Midpoint Rule | ✅ | 2 | Implicit (solving equations) |
Fourth-order Adams-Bashforth | ❌ | 4 | Multi-step (explicit) |
Backward Euler Method | ❌ | 1 | Implicit (solving equations) |
For more details on things like local and global errors, or what these integrators are best suited for, there's a handy write-up over at Hamiltonian mechanics: Why Symplecticity Matters that covers it all.
Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) method that leverages Hamiltonian dynamics to efficiently sample from complex probability distributions, particularly in Bayesian statistics and Machine Learning.
From Phase Space to Probability Space
HMC interprets the target distribution \( P(z) \) as a Boltzmann distribution:
\[
P(z) = \frac{1}{Z} e^{\frac{-E(z)}{T}}
\]
Substituting the Hamiltonian into this formulation gives us a joint density:
\[
P(q,p) = \frac{1}{Z} e^{-U(q)} e^{-K(p)}, \quad \text{where } U(q) = -\log\left[\, p(q)\, p(q \mid D) \,\right]
\]
where \( p(q) \) is the prior, \( p(q \mid D) \) is the likelihood given the data \( D \), and \( T = 1 \), so it drops out. We estimate our posterior distribution using the potential energy \( U(q) \), since \( P(q,p) \) consists of two independent probability distributions.
We augment with an artificial momentum \( p \sim \mathcal{N}(0, M) \), then simulate Hamiltonian dynamics to propose a new \( q' \) based on the distribution of the position variables: \( U(q) \) acts as the "potential energy" of the target distribution \( P(q) \), creating valleys at high-probability regions.
For more on HMC, check out this explanation or this tutorial.
Physical Systems: \( H(q,p) = U(q) + K(p) \) represents total energy
Sampling Systems: \( H(q,p) = -\log P(q) + \frac{1}{2}p^T M^{-1} p \) defines exploration dynamics
The kinetic energy, with the popular form \( K(p) = \frac{1}{2}p^T M^{-1} p \) (often Gaussian), injects momentum to traverse these landscapes. Crucially, the mass matrix \( M \) plays the role of a preconditioner: a diagonal \( M \) adapts to parameter scales, while a dense \( M \) can align with the correlation structure. \( M \) is symmetric, positive definite, and typically diagonal.
What’s Positive Definite?
Positive Definite: For any non-zero vector \( x \), the expression \( x^T M x \) is always positive. This ensures stability and efficiency.

The plots depict how different matrices \( M \) influence the shape of the quadratic form \( Q(x) = x^T M x \):
a) Positive Definite Form: A bowl-shaped surface where all eigenvalues are positive, indicating a minimum.
b) Negative Definite Form: An inverted bowl where all eigenvalues are negative, indicating a maximum.
c) Indefinite Form: A saddle-shaped surface with both positive and negative eigenvalues, indicating neither a maximum nor a minimum.
Each subplot includes the matrix \( M \) and the corresponding quadratic form \( Q(x) = x^T M x \). Image by the author.
\[
x^T M x > 0
\]
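A quick numerical check of this condition (a sketch; the helper name is illustrative) is to look at the eigenvalues of a candidate mass matrix:

```python
import torch

def is_positive_definite(M: torch.Tensor) -> bool:
    # A symmetric matrix is positive definite iff all its eigenvalues are > 0,
    # which is equivalent to x^T M x > 0 for every non-zero x.
    return bool((torch.linalg.eigvalsh(M) > 0).all())

M_good = torch.tensor([[2.0, 0.3], [0.3, 1.0]])  # a valid HMC mass matrix
M_bad = torch.tensor([[1.0, 2.0], [2.0, 1.0]])   # indefinite: eigenvalues 3 and -1
print(is_positive_definite(M_good), is_positive_definite(M_bad))  # True False
```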
Kinetic Energy Choices
- Gaussian (Standard HMC): \( K(p) = \frac{1}{2}p^T M^{-1} p \). Yields Euclidean trajectories; efficient for moderate dimensions.
- Relativistic (Riemannian HMC): \( K(p) = \sqrt{p^T M^{-1} p + c^2} \). Limits the maximum velocity, preventing divergences in ill-conditioned spaces.
- Adaptive (Surrogate Gradients): Learn \( K(p) \) via neural networks to match the target geometry.
Key Intuition
The Hamiltonian \( H(q,p) = U(q) + \frac{1}{2}p^T M^{-1} p \) creates an energy landscape where momentum carries the sampler through high-probability regions, avoiding random-walk behavior.
The HMC Algorithm
The algorithm involves:
- Initialization: Start with an initial position \( q_0 \) and sample momentum \( p_0 \sim \mathcal{N}(0, M) \).
- Leapfrog Integration: Use the leapfrog method to approximate Hamiltonian dynamics. For a step size \( \epsilon \) and \( L \) steps, update:
- Half-step momentum: \( p(t + \frac{\epsilon}{2}) = p(t) - \frac{\epsilon}{2} \frac{\partial U}{\partial q}(q(t)) \)
- Full-step position: \( q(t + \epsilon) = q(t) + \epsilon \frac{\partial K}{\partial p}(p(t + \frac{\epsilon}{2})) \), where \( K(p) = \frac{1}{2} p^T M^{-1} p \), so \( \frac{\partial K}{\partial p} = M^{-1} p \)
- Full-step momentum: \( p(t + \epsilon) = p(t + \frac{\epsilon}{2}) - \frac{\epsilon}{2} \frac{\partial U}{\partial q}(q(t + \epsilon)) \)
This is repeated \( L \) times to get the proposed \( q' \) and \( p' \).
- Metropolis-Hastings Acceptance: Accept the proposed \( q' \) with probability \( \min(1, e^{H(q_0, p_0) - H(q', p')}) \), where \( H(q,p) = U(q) + K(p) \).
This process generates a Markov chain with stationary distribution \( P(q) \), leveraging Hamiltonian dynamics to take larger, more efficient steps compared with random-walk methods. A minimal code sketch of the algorithm follows.
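This sketch assumes an identity mass matrix \( M = I \) and uses autograd for \( \partial U / \partial q \); the `hmc_sample` helper and its signature are illustrative, not the TorchEBM API.

```python
import torch

def hmc_sample(log_prob, q_init, n_samples=500, eps=0.1, n_steps=20):
    """Minimal HMC: U(q) = -log_prob(q), K(p) = 0.5 * |p|^2 (mass matrix M = I)."""
    def grad_U(q):
        q = q.detach().requires_grad_(True)
        return -torch.autograd.grad(log_prob(q), q)[0]

    samples, q = [], q_init.clone()
    for _ in range(n_samples):
        p = torch.randn_like(q)                              # 1. momentum p ~ N(0, I)
        q_new, p_new = q.clone(), p - 0.5 * eps * grad_U(q)  # initial half-kick
        for step in range(n_steps):                          # 2. leapfrog integration
            q_new = q_new + eps * p_new
            p_new = p_new - (eps if step < n_steps - 1 else 0.5 * eps) * grad_U(q_new)
        # 3. Metropolis-Hastings acceptance on the total energy H = U + K
        H_old = -log_prob(q) + 0.5 * (p ** 2).sum()
        H_new = -log_prob(q_new) + 0.5 * (p_new ** 2).sum()
        if torch.rand(()) < torch.exp(H_old - H_new):
            q = q_new.detach()
        samples.append(q.clone())
    return torch.stack(samples)

# Usage: sample a 2-D standard normal, log P(q) = -0.5 |q|^2 (up to a constant)
samples = hmc_sample(lambda q: -0.5 * (q ** 2).sum(), torch.zeros(2))
print(samples.mean(0), samples.var(0))  # roughly 0 and 1 in each dimension
```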
Why Better Than Random Walk?
HMC navigates high-dimensional spaces along energy contours, like following mountain paths instead of wandering randomly!
A recap of Hamilton's equations:
\[
\begin{cases}
\dot{q} = \nabla_p K(p) = M^{-1}p & \text{(Guided exploration)} \\
\dot{p} = -\nabla_q U(q) = \nabla_q \log P(q) & \text{(Bayesian updating)}
\end{cases}
\]
This coupled system drives \( (q,p) \) along iso-probability contours of \( P(q) \), with momentum rotating rather than resetting at each step as in Random Walk Metropolis. The key parameters, the integration time \( \tau = L\epsilon \) and the step size \( \epsilon \), balance exploration against computational cost:
- Short \( \tau \): Local exploration, higher acceptance
- Long \( \tau \): Global moves, risk of U-turns (periodic orbits)
Key Parameters and Tuning
Tuning \( M \) to match the covariance of \( P(q) \) (e.g., via warmup adaptation) and setting \( \tau \sim \mathcal{O}(1/\lambda_{\text{max}}) \), where \( \lambda_{\text{max}} \) is the largest eigenvalue of \( \nabla^2 U \), often yields optimal mixing.
TorchEBM Library 📚
Oh, by the way, I've been messing around with this stuff in Python and threw together a library called TorchEBM. It's got some tools for energy-based, score matching, diffusion- and flow-based models and the HMC bits I've been fiddling with. Nothing fancy, just a researcher's sandbox for testing ideas like these. If you're into coding this sort of thing, poke around on TorchEBM GitHub and let me know what you think; PRs welcome! Been fun tinkering with it while writing this post.
Reference to Energy-Based Models
Energy-based models (EBMs) are a class of generative models that define a probability distribution over data points using an energy function. The probability of a data point is proportional to \( e^{-E(x)} \), where \( E(x) \) is the energy function. This formulation is directly analogous to the Boltzmann distribution in statistical physics, where the probability is related to the energy of a state. In Hamiltonian mechanics, the Hamiltonian function \( H(q, p) \) represents the total energy of the system, and the probability distribution in phase space is given by \( e^{-H(q,p)/T} \), where \( T \) is the temperature.
In EBMs, Hamiltonian Monte Carlo (HMC) is often used to sample from the model's distribution. HMC leverages Hamiltonian dynamics to propose new states, which are then accepted or rejected based on the Metropolis-Hastings criterion. This method is particularly effective for high-dimensional problems, as it reduces the correlation between samples and allows more efficient exploration of the state space. For instance, in image generation tasks, HMC can sample from the distribution defined by the energy function, facilitating the generation of high-quality images.
EBMs define probability through Hamiltonians:
\[
p(x) = \frac{1}{Z}e^{-E(x)} \quad \leftrightarrow \quad H(q,p) = E(q) + K(p)
\]
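As a small illustration of that correspondence (the double-well energy below is a toy choice), the EBM energy plays the role of \( U(q) \), and the force HMC needs is just the score \( \nabla_x \log p(x) = -\nabla_x E(x) \):

```python
import torch

# A toy EBM energy: double-well E(x) = (x^2 - 1)^2, so p(x) is proportional to exp(-E(x))
def energy(x: torch.Tensor) -> torch.Tensor:
    return ((x**2 - 1.0)**2).sum()

x = torch.tensor([0.5], requires_grad=True)
(grad_E,) = torch.autograd.grad(energy(x), x)

# In HMC terms: U(x) = E(x), and dp/dt = -dU/dx = -dE/dx = d log p(x)/dx (the score)
print(energy(x).item(), (-grad_E).item())
```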
Potential Research Directions 🔮
Symplecticity in Machine Learning Models

Incorporate the symplectic structure of Hamiltonian mechanics into machine learning models to preserve properties like energy conservation, which is crucial for long-term predictions. This could mean generalizing Hamiltonian Neural Networks (HNNs), as discussed in Hamiltonian Neural Networks, to more complex systems, or developing new architectures that preserve symplecticity.
HMC for Complex Distributions
HMC for sampling from complex, high-dimensional, and multimodal distributions, such as those encountered in deep learning. Combining HMC with other techniques, like parallel tempering, could handle distributions with multiple modes more effectively.
Combining Hamiltonian Mechanics with Other ML Techniques
Integrate Hamiltonian mechanics with reinforcement learning to guide exploration in continuous state and action spaces. Using it to model the environment could improve exploration strategies, with potential applications in robotics. Moreover, using Hamiltonian mechanics to define approximate posteriors in variational inference could lead to more flexible and accurate approximations.
Hamiltonian GANs
Employing Hamiltonian formalism as an inductive bias for the generation of physically plausible videos with neural networks.
Wanna Team Up on This? 🤓
If some of you clever folks, wherever you're doing your high-level wizardry, are into research collaboration, I'd love to talk generative models over coffee ☕️ (virtual or IRL in London). If you're into pushing these ideas further, hit me up! Follow me on Twitter/BlueSky or GitHub; I'm usually rambling about this stuff there. Also on LinkedIn and Medium/TDS if you're curious. To find out more about my research interests, check out my personal website.
Conclusion
Hamiltonian mechanics reframes physical systems through energy, using phase space to reveal symmetries and conservation laws via first-order equations. Symplectic integrators like Leapfrog Verlet preserve this structure, ensuring stability in simulations, which is crucial for applications like molecular dynamics and Hamiltonian Monte Carlo (HMC). HMC leverages these dynamics to sample complex distributions efficiently, bridging classical physics with modern machine learning.