Let’s take stock of exactly where we are in this journey it implement HMC for differential equations.
To do HMC, we concatenate all the parameters we wish to estimate into a single vector (call it $\theta$). Then, create a momentum variable (call it $p$ for some weird reason. Why do physicists denote MMMMMMomentum with $p$?) of the same dimension as $\theta$ and consider the joint distribution
We choose a momentum distribution for $p$ and then use The Hamiltonian to define a set of differential equations governing the dynamics of an imaginary particle on some high dimensional surface. The differential equations are
Here, $V = \log(\pi(\theta))$, the log density of our parameters. Given some initial random momentum, the ODEs above say that the particle’s rate of change per unit time is equivalent to the momentum, but the momentum’s rate of change depends on the shape of the surface. So the momentum governs the trajectory in time over the surface, and the surface governs how the momentum will change in time. Yes, that certainly sounds like a differential equation problem.
I won’t get into how we integrate this system. I think Colin Carrol has done a very good job boiling down the essence of HMC into something more digestible, but if you want the whole shebang on HMC, please read Michael Betancourt’s “A Conceptual Introduction of Hamiltonian Monte Carlo”.
We Solve ODEs to Compute Gradients of ODEs to Use in Other ODEs
So HMC is just (ok not JUST, but for the purposes of this blog post) a set of differential equations which can be cleverly integrated to get samples from a posterior. There is a bit of a hick up when we do this with ODEs. The $dp / dt$ is a vector in which each component is $\partial V / \partial \phi$. Here, $\phi$ is a particular parameter of our model, and so is an element of $\theta$. If $\phi$ is used to parameterize a differential equation for $y$, then computing gradients is especially difficult because we usually solve differential equations numerically, not analytically. This prevents us from using the chain rule to compute $\partial V / \partial \phi = \partial V/ \partial y \times \partial y / \partial \phi$.
Woe is me, I can’t compute the gradient of my second differential equation to plug into my first differential equation. But, if I solve a third differential equation, then I can compute the gradient of my second differential equation to plug into my first differential equation without having to solve and differentiate the second differential equation! Here is where Xzibit would come out and whisper “Yo Dowg, we heard you like differential equations”.
I understand that last part might be hard to parse so let me explain. In late 2018, a team at The University of Toronto published a new neural network architecture known as the “Neural ODE”. The details of that are not important, but they essentially faced the same problems as we do: they needed gradients of an ODE’s solution with respect to the parameters. In that paper, they use The Adjoint Method to do so. Again, not important how this method works exactly (maybe the topic of a future post). The adjoint method allows for the computation of a functional’s gradient directly, which means instead of computing $\partial y / \partial \phi$ and then $\partial V/ \partial y$, we can compute $\partial V / \partial \phi$ directly. This computation is performed by solving the differential equation for $y$ as well as some other differential equations at the same time.
Go ahead and look at the paper. Appendix B.2 basically outlines a set of differential equations, one of which is the equation for which we want to do inference, and another is a differential equation for $d L / d \theta$ (which I called $d V / d \phi$).
So, to recap, if we solve the adjoint differential equation, we can plug the result into the Hamiltonian differential equation without having to compute derivatives for the differential equation on which we intend to perform bayesian analysis. We solve ODEs to compute gradients of ODEs to use in other ODEs.
Here is where the wheels fall off. I actually tried implementing this for a very simple differential equation, but I don’t think I’ve really mastered the material. I’m going to lean on my mentor to help me understand how Theano does the AD while I bone up on some AD theory and make some toy models in python.
I think the path is a little clearer now. I can’t say the adjoint method is intuitive, and HMC isn’t either, but they are becoming less opaque black boxes than they previously were. I have a lot of doubts on if this can be done by the end of summer (most of them have to do with my own ability), but damnit if I don’t give it my best shot.