<p>Ph.Demetri: Applied Mathematician. Statistician. Data Scientist. By Demetri Pananos (apananos@uwo.ca). Feed: https://dpananos.github.io/feed.xml</p>
<h1><a href="https://dpananos.github.io/posts/2019/08/blog-post-21">GSoC 2019: It’s Over!</a> (2019-08-30)</h1>
<p>It’s the end of August, and Google Summer of Code 2019 is over. This blog post is meant to outline what I’ve accomplished, what I’ve failed to accomplish, what I’ve learned, and how I’ve felt over these last four months.</p>
<h1 id="my-project">My Project</h1>
<p>We’ve (I say we because you, dear reader, have been part of this journey just as much as I have) been working to <a href="https://dpananos.github.io/posts/2019/05/blog-post-12/">add ODE capabilities</a> to PyMC3. In principle, this doesn’t sound so hard. I mean, all we need to do is solve a different ODE related to our first ODE and then plug the result into a larger ODE so that NUTS can sample from the posterior <a href="https://dpananos.github.io/posts/2019/05/blog-post-13/">(yo dawg, we heard you like ODEs)</a>. Using Theano to compute gradients isn’t so bad. We used Theano to compute the gradients of an ODE and then used <a href="https://dpananos.github.io/posts/2019/05/blog-post-15/">gradient descent to fit that ODE to data</a>. From there, it was easy to plop our code into an existing ODE inference notebook, wrap it in a class, <a href="https://dpananos.github.io/posts/2019/06/blog-post-16/">and then sample</a>. All of that happened May/June of this year.</p>
<p>When July came, we spent most of our time writing the API and writing tests. Personally, I learned a lot more in July than I learned in the previous months, mostly because OOP and writing unit tests were completely new to me. I also got way better at git (<em>insert i_know_git_fu.gif</em>).</p>
<p>Then August came, we wrapped everything in a pretty bow and <a href="https://github.com/pymc-devs/pymc3/pull/3590">made a PR</a>!</p>
<p>A lot of the work leading up to the PR was done in <a href="https://github.com/Dpananos/ODEGSoC">this</a> repo. This really wasn’t meant for people to go through, but you’re free to poke around and see what I was up to during the summer. In that repo are the very beginnings of <code class="highlighter-rouge">DifferentialEquation</code>, the meat of my contribution.</p>
<h1 id="about-differentialequation">About <code class="highlighter-rouge">DifferentialEquation</code></h1>
<p><code class="highlighter-rouge">DifferentialEquation</code> is a Theano Op which computes the gradients of ODEs using the forward sensitivity analysis and computes solutions to the ODE via numerical integration. You use it in the same way that you use other PyMC3 functions to build models. <a href="https://github.com/Dpananos/pymc3/blob/gsoc_ode/docs/source/notebooks/bayesian_estimation_of_ode_parameters.ipynb">Here</a> is a notebook I wrote on how to use <code class="highlighter-rouge">DifferentialEquation</code> for both linear/non-linear scalar/vector ODEs.</p>
<p>All you have to do is:</p>
<ul>
<li>Write your differential equation as a function in Python</li>
<li>Pass that function to <code class="highlighter-rouge">DifferentialEquation</code></li>
<li>Tell <code class="highlighter-rouge">DifferentialEquation</code> at what times your data was observed, and</li>
<li>Tell <code class="highlighter-rouge">DifferentialEquation</code> how many parameters and states your system has.</li>
</ul>
<p>Then, go build your model! The notebook I’ve linked to above is just the first step in a series of notebooks I have planned. Soon, I’d like to add a notebook on how <code class="highlighter-rouge">DifferentialEquation</code> can be used to estimate population pharmacokinetics.</p>
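<p>To make those steps concrete, here is a minimal sketch of the kind of Python function the first bullet asks for. Everything here (the exponential-decay model, the rate 0.7, the variable names) is my own toy example, and I integrate it with SciPy purely for illustration; the point is the <code class="highlighter-rouge">(y, t, p)</code> shape of the function you hand to <code class="highlighter-rouge">DifferentialEquation</code>.</p>

```python
import numpy as np
from scipy.integrate import odeint

# Toy one-state, one-parameter model: y' = -p[0] * y.
# This (state, time, parameters) signature is the shape of function
# the first step asks you to write.
def odefunc(y, t, p):
    return -p[0] * y

times = np.linspace(0, 5, 11)              # times the "data" was observed
solution = odeint(odefunc, [3.0], times, args=([0.7],))

# Sanity check against the analytic solution y(t) = 3 * exp(-0.7 t)
print(np.allclose(solution[:, 0], 3.0 * np.exp(-0.7 * times), atol=1e-4))
```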
<h1 id="biggest-challenges">Biggest Challenges</h1>
<p>I think I was pretty clear about this in the first blog post; I was not certain I was going to succeed (but that is kind of the reason I stuck around). I was incredibly shocked to hear I got the GSoC position, and then was thoroughly intimidated by the material I had to learn. The math was not the hard part (it was actually the most fun. Surprise, I’m a math nerd), but translating math to code is not always the easiest for me. I remember sitting in a Starbucks fiddling around with <code class="highlighter-rouge">autograd</code> (after reading Colin Carroll’s great blog posts on MCMC) trying to compute the sensitivities (i.e. gradients) of a tiny little ODE. I just didn’t think I understood it well enough to do it, but when I did compute those sensitivities, I was over the moon. I felt like I had actually learned something; like I had achieved a higher understanding. It was an amazing feeling.</p>
<p>After that, things got harder. Python is a tool for me, and so I don’t usually write classes, let alone design APIs. When I was writing the API for <code class="highlighter-rouge">DifferentialEquation</code>, I ran into a bug so infuriating that I decided to scrap the whole thing and start again. To anyone reading this, this is a legitimate way of debugging, do not @ me.</p>
<p>I thought once August hit I was in the home stretch, but I could not have been more wrong. Besides the various changes to the code from the PR review (which I expected), Michael (my mentor) and I noticed a problem with <code class="highlighter-rouge">DifferentialEquation</code> returning a tensor of the wrong shape. This led to a week of poking and prodding at the API, trying to find exactly what was wrong with it. This led to a lot of learning on my part (especially about Theano, as well as <code class="highlighter-rouge">python setup.py develop</code>; why did no one tell me about that?). We eventually decided the change was not worth it at the moment, but will likely return to it at a later date.</p>
<p>I faced a lot of stuff I was not sure I would overcome, but luckily, I had great support. My mentor, Michael Osthege, never got frustrated with me. He was always understanding when I was having a rough time, super flexible to meet me even in the dead of the German night, and was all around a great person to show me the ropes. Thanks to Colin Carroll and Junpeng Lao for the kind words of encouragement on Twitter, and thank you to the remainder of the PyMC3 devs for welcoming me into the community.</p>
<h1 id="would-i-do-gsoc-again">Would I Do GSoC Again?</h1>
<p>I won’t lie, GSoC was capital <em>H</em> Hard. I think my project was pretty ambitious, and it might have been better to propose we create the beginnings of the API instead of the API itself. I faced a lot of stress, but I think it was worth it. I set out a goal to contribute to open source when I started my PhD. That initially meant typos, then documentation, but I didn’t anticipate contributing functionality on this level. It’s a cool feeling, and I think I would want to do it again.</p>
<h1><a href="https://dpananos.github.io/posts/2019/08/blog-post-22">OK, So I Was Wrong About LogisticRegression</a> (2019-08-30)</h1>
<p><a href="https://twitter.com/zacharylipton/status/1167298276686589953">Zachary Lipton recently tweeted that sklearn’s <code class="highlighter-rouge">LogisticRegression</code> uses a penalty by default</a>. This resulted in some heated Twitter debates about the differences in attitudes between statistics and machine learning researchers and the responsibility of users to read the documentation, amongst other things.</p>
<p>My initial reaction was to jump to the defense of the sklearn developers. In essence, I thought (and to some extent still do think) that practitioners should take responsibility for the tools they use to draw the conclusions they draw. I’ve had some time to sit on that for a day or so and let other facts come to light.</p>
<p>While I maintain data analysts/scientists/researchers need to be mindful of their tools, I have reluctantly also come to agree that the designers of these tools should be mindful of their use. Below, I outline some of my thinking, linking to interesting discussions and facts as they have been shown to me.</p>
<h2 id="arguments-for-a-more-transparent-logisticregression">Arguments For A More Transparent <code class="highlighter-rouge">LogisticRegression</code></h2>
<p>The arguments for why <code class="highlighter-rouge">LogisticRegression</code> should have an unbiased implementation as the default are numerous. I won’t go through them all here (you can read <a href="https://ryxcommar.com/2019/08/30/scikit-learns-defaults-are-wrong/">@ryxcommar’s very thoughtful and well argued article here</a> should you need a refresher). In short, if the function’s name is <code class="highlighter-rouge">LogisticRegression</code>, then we should expect that it performs logistic regression and not some penalized variant thereof. Few people read the docs (for reasons that remain strange to me), and so it can be very easy to go through your data scientific life blissfully unaware that you are biasing results.</p>
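<p>To see concretely what a default penalty does to the estimate, here is a small self-contained sketch. I use plain NumPy/SciPy rather than sklearn itself, and the simulated data, the penalty strength, and all variable names are invented for illustration: minimizing the logistic negative log-likelihood with an L2 penalty shrinks the coefficient toward zero relative to the plain maximum-likelihood fit, which is exactly the bias being discussed.</p>

```python
import numpy as np
from scipy.optimize import minimize

# Simulate logistic-regression data with a known slope
rng = np.random.default_rng(0)
n, true_beta = 500, 2.0
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_beta * x)))

def nll(beta, lam):
    # Logistic negative log-likelihood plus an optional L2 penalty
    z = beta * x
    return np.sum(np.logaddexp(0.0, z) - y * z) + lam * beta ** 2

mle = minimize(lambda b: nll(b[0], 0.0), x0=[0.0]).x[0]     # plain logistic regression
ridge = minimize(lambda b: nll(b[0], 10.0), x0=[0.0]).x[0]  # penalized variant

print(mle, ridge)  # the penalized estimate sits strictly closer to zero
```

<p>The same shrinkage happens silently inside sklearn’s default, where the penalty strength is controlled by <code class="highlighter-rouge">C</code>.</p>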
<p>I thought it was apparent that sklearn was an ML library first and foremost, and thus preferred to trade off variance in exchange for bias. The devs have made that abundantly clear <a href="https://github.com/scikit-learn/scikit-learn/issues/6738#issuecomment-252799153">in several GitHub issues</a>. I was wrong on this.</p>
<h2 id="creating-tools-for-science-is-a-great-power-and-with-great-power">Creating Tools For Science Is A Great Power, And With Great Power…</h2>
<p>A lot of my initial reaction was to put blame primarily on the user. After all, no one forced anybody to use sklearn. If only the user had read the docs and understood what they were doing, then this all could have been avoided. When I make a mistake, I don’t blame someone else, I say “damn, I should have read the docs!”. Silly user, it’s your fault!</p>
<p>Another argument I had was that a lot of software has really crappy or unintuitive defaults that we have just come to live with. Stata drops missing data, SAS spits out superfluous statistics making it tough for people to find what they need, and R uses a Welch’s t-test in <code class="highlighter-rouge">t.test</code> and Clopper-Pearson intervals in <code class="highlighter-rouge">binom.test</code>, in contrast to what even graduate students are taught in class. If users are, for instance, familiar with Wald intervals, <a href="https://stats.stackexchange.com/questions/4713/binomial-confidence-interval-estimation-why-is-it-not-symmetric">then the use of Clopper-Pearson intervals may result in some confusion</a> when users realize the interval is not symmetric about the estimated mean. If these are sane defaults given the context of statistics (and they are, at least the ones in R are; no one is saying they aren’t), then why shouldn’t automatic regularization be the standard in the context of ML, which, as I’ve said before, routinely trades variance in exchange for bias? Context is important, silly user! Read the docs!</p>
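<p>That asymmetry is easy to see numerically. A quick sketch using the standard beta-quantile form of the Clopper-Pearson interval (the counts here are made up for illustration):</p>

```python
import numpy as np
from scipy.stats import beta

k, n, alpha = 3, 20, 0.05
phat = k / n

# Clopper-Pearson limits via beta quantiles
lower = beta.ppf(alpha / 2, k, n - k + 1)
upper = beta.ppf(1 - alpha / 2, k + 1, n - k)

# The interval covers phat but is not symmetric about it:
# for phat < 0.5 the upper arm is wider than the lower arm
print(lower, phat, upper)
```

<p>A Wald interval, by contrast, is symmetric about <code class="highlighter-rouge">phat</code> by construction, which is why users raised on Wald intervals find this surprising.</p>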
<p>I was quick to blame the user because they really are the last line of defense. If you design a study, analyze the data, and then send it for review, it is up to you to make sure that your paper accurately reflects what you said you did. No one else can do that for you. The reviewers act in this capacity in a small way, but if you said you did logistic regression, how are they to know any different? This brings up the issue of code review, but let’s leave that alone for now.</p>
<p>But, as I do, I forgot that <em>we’re all human</em>. We get tired, we skim, we look for shortcuts, and we fail to find the details we really need. Even the best of us do that. Even I do this, and I think I have been the most critical of people who do this. Now, I don’t find this a completely exculpatory argument. I mean, how hard is it to google the docs or run <code class="highlighter-rouge">?LogisticRegression</code> in an IPython session? But, I digress.</p>
<p>And that is where the design of scientific tools can come to the rescue. Making tools easier and clearer to use can combat a lot of this human error. I’ve reluctantly come to the conclusion that sklearn should have implemented an unbiased logistic regression by default (or at least named <code class="highlighter-rouge">LogisticRegression</code> more appropriately).</p>
<p>Moreover, I think the devs may need to consider that their library is more popular than perhaps they initially thought it might become. That sort of popularity brings a lot of power, and with great power…well, you know the rest.</p>
<p>I was especially disappointed to read <a href="https://www.reddit.com/r/statistics/comments/8de54s/is_r_better_than_python_at_anything_i_started/dxmnaef/">this</a> comment by reddit user /u/shaggorama, which states that sklearn quietly deprecated their bootstrap validator after it came to light that the implemented method was more or less ad hoc and did not actually implement the bootstrap. I find this particularly egregious because it is arguably more deceptive than <code class="highlighter-rouge">LogisticRegression</code>. Not only that, but I was shocked to see Andreas Muller plainly ask <a href="https://github.com/scikit-learn/scikit-learn/issues/6738#issuecomment-252798270">why anyone would want an unbiased implementation of logistic regression</a>.</p>
<p>If we should expect that practitioners act in good faith, then tool makers should create tools which allow practitioners to act in good faith easily. To do otherwise, I have come to conclude, is to obstruct good science.</p>
<h2 id="so-what">So What?</h2>
<p>While I maintain users have a responsibility to know what they are doing, I jumped too quickly to blame the user. After reading some of the links I have posted here, I’ve changed my mind. Users can only shoulder so much responsibility, and I think if toolmakers want to make good tools then they should consider the perspective of the user.</p>
<h1><a href="https://dpananos.github.io/posts/2019/07/blog-post-20">GSoC 2019: A PR is Made!</a> (2019-08-05)</h1>
<p>Another short update: I’ve made a PR to merge <code class="highlighter-rouge">pymc3.ode</code> into PyMC3!</p>
<p>I’ve gotten some good feedback on the API and am currently working on a shape fix so that users won’t need to <code class="highlighter-rouge">reshape</code> the result from <code class="highlighter-rouge">DifferentialEquation</code>. This is actually giving me a lot of anxiety because I think it requires some insight into Theano which I just don’t have. I’ll keep you updated with what happens when I write the final blog post in two weeks’ time!</p>
<p>Here is the <a href="https://github.com/pymc-devs/pymc3/pull/3578">PR</a>!</p>
<h1><a href="https://dpananos.github.io/posts/2019/07/blog-post-19">GSoC 2019: Testing an API</a> (2019-07-22)</h1>
<p>This is a really short update. Here are a couple of things I have been working on since the last blog post.</p>
<ol>
<li>Continuing to write the API so that it is intuitive</li>
<li>Writing unit tests</li>
<li>Benchmarking</li>
</ol>
<p>When I last made a blog post about the API, users were to pass the ODE parameters and initial conditions as a single vector. That was just to get things off the ground. Now that the API is working, I can go back and make that a little more user friendly. The <code class="highlighter-rouge">ODEModel</code> (the meat and potatoes of the API) inherits from the <code class="highlighter-rouge">theano.Op</code> class, so in the <code class="highlighter-rouge">make_node</code> method I changed</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="k">def</span> <span class="nf">make_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">as_tensor_variable</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">theano</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="nb">type</span><span class="p">()])</span>
</code></pre></div></div>
<p>to</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">make_node</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">odeparams</span><span class="p">,</span> <span class="n">initial_condition</span><span class="p">):</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">ndim</span><span class="p">(</span><span class="n">odeparams</span><span class="p">)</span><span class="o">></span><span class="mi">1</span><span class="p">:</span>
<span class="n">odeparams</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ravel</span><span class="p">(</span><span class="n">odeparams</span><span class="p">)</span>
<span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="n">ndim</span><span class="p">(</span><span class="n">initial_condition</span><span class="p">)</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
<span class="n">initial_condition</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ravel</span><span class="p">(</span><span class="n">initial_condition</span><span class="p">)</span>
<span class="n">odeparams</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">as_tensor_variable</span><span class="p">(</span><span class="n">odeparams</span><span class="p">)</span>
<span class="n">initial_condition</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">as_tensor_variable</span><span class="p">(</span><span class="n">initial_condition</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">odeparams</span><span class="p">,</span><span class="n">initial_condition</span><span class="p">])</span>
<span class="k">return</span> <span class="n">theano</span><span class="o">.</span><span class="n">Apply</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="p">[</span><span class="n">x</span><span class="p">],</span> <span class="p">[</span><span class="n">x</span><span class="o">.</span><span class="nb">type</span><span class="p">()])</span>
</code></pre></div></div>
<p>The little bit of logic at the beginning is to get around some shape problems with <code class="highlighter-rouge">theano</code>. So now to use the API, instead of doing</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">all_params</span> <span class="o">=</span> <span class="n">pm</span><span class="o">.</span><span class="n">math</span><span class="o">.</span><span class="n">stack</span><span class="p">([</span><span class="o"><</span><span class="n">parameters</span> <span class="n">here</span><span class="o">></span><span class="p">,</span> <span class="o"><</span><span class="n">initial</span> <span class="n">conditions</span> <span class="n">here</span><span class="o">></span><span class="p">])</span>
<span class="n">forward</span> <span class="o">=</span> <span class="n">ode_model</span><span class="p">(</span><span class="n">all_params</span><span class="p">)</span>
</code></pre></div></div>
<p>you can do</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">forward</span> <span class="o">=</span> <span class="n">ode_model</span><span class="p">(</span><span class="n">odeparams</span> <span class="o">=</span> <span class="p">[</span><span class="o"><</span><span class="n">parameters</span> <span class="n">here</span><span class="o">></span><span class="p">],</span> <span class="n">initial_condition</span> <span class="o">=</span> <span class="p">[</span><span class="o"><</span><span class="n">initial</span> <span class="n">conditions</span> <span class="n">here</span><span class="o">></span><span class="p">])</span>
</code></pre></div></div>
<p>which I think is much better. Explicit is better than implicit. This change actually makes some models run <em>much faster</em>, which is interesting.</p>
<h1><a href="https://dpananos.github.io/posts/2019/07/blog-post-18">Bayes Workshop</a> (2019-07-09)</h1>
<p>The link is <a href="https://drive.google.com/drive/folders/1XygO5U21VHObuuZty7aWJWuLD4vSTkKa?usp=sharing">here</a>.</p>
<h1><a href="https://dpananos.github.io/posts/2019/07/blog-post-17">GSoC 2019: Designing an API</a> (2019-07-08)</h1>
<p>Let’s take stock of where we are on our journey to add ODE capabilities to PyMC3.</p>
<ul>
<li>
<p>We know that HMC requires derivatives of the log likelihood, which means taking derivatives of the differential equation’s solution with respect to its parameters. That sounds hard, but we’ve discovered that it is actually as easy as solving more differential equations. The method by which we compute derivatives of the differential equation’s solution is called the “Forward Sensitivity Method”.</p>
</li>
<li>
<p>We’ve done a little proof of concept for the forward sensitivity method. We wrote a computation graph in Theano to compute these derivatives via automatic differentiation, and then used those derivatives to do gradient descent. Since we successfully fit our differential equation to data, we became more confident in our approach.</p>
</li>
<li>
<p>We’ve dropped our computation graph into an existing notebook and were very surprised that the model sampled. We estimated the correct parameter value, the Gelman-Rubin statistic indicated our chains converged, and the number of effective samples was quite large. All in all, a great success.</p>
</li>
</ul>
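<p>The proof of concept in the second bullet can be replayed with nothing but SciPy. For a toy model $y' = -\theta y$ (my example here, not the one from the notebooks), the sensitivity $s = \partial y / \partial \theta$ obeys $s' = -\theta s - y$, and integrating the augmented system reproduces the analytic sensitivity $-t \, y_0 e^{-\theta t}$:</p>

```python
import numpy as np
from scipy.integrate import odeint

theta, y0 = 1.5, 2.0

def augmented(state, t, theta):
    # Original ODE plus its forward-sensitivity equation
    y, s = state
    dydt = -theta * y
    dsdt = -theta * s - y    # (df/dy) * s + df/dtheta
    return [dydt, dsdt]

t = np.linspace(0.0, 2.0, 21)
sol = odeint(augmented, [y0, 0.0], t, args=(theta,))  # s(t0) = 0 for a pure parameter

# Analytic solution and sensitivity for comparison
y_exact = y0 * np.exp(-theta * t)
s_exact = -t * y0 * np.exp(-theta * t)
print(np.allclose(sol[:, 0], y_exact, atol=1e-5),
      np.allclose(sol[:, 1], s_exact, atol=1e-5))
```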
<p>Now, we are at the point where we have to think about generalizing, and honestly, I’ve been dreading this point.</p>
<p>It is not news that I am a <em>bad</em> software engineer (in fact, that is precisely the reason I did GSoC). So, I am going to lean very heavily on you, dear reader, to help me through this. I’m going to be very open with what I am thinking, and I hope that you will chime in when necessary to tell me that things can be done better, faster, safer.</p>
<p>So, here we go.</p>
<h1 id="thinking-about-an-api">Thinking About an API</h1>
<p>I don’t think an entire blog post outlining how the API works is interesting. Instead, I’m going to highlight some relevant bits, and if you have questions, you can ask me directly.</p>
<p>I’m keeping all my code in <a href="https://github.com/Dpananos/ODEGSoC/blob/master/Scripts/ode_api.py">this</a> repo. The API takes as input a function which computes the appropriate derivatives for the differential equation given the system’s current state, time, and parameters. Then, the <a href="https://github.com/Dpananos/ODEGSoC/blob/master/Scripts/ode_api.py#L151">computation graph</a> I wrote will be able to produce a compiled Theano function which returns an augmented system (that is, the derivatives for the system and the derivatives to be used in the forward sensitivity method).</p>
<p>Before we begin integrating the augmented system, we need to construct appropriate initial conditions. In order to do that, we need to understand how the initial conditions look for the forward sensitivity method. For a differential equation</p>
<script type="math/tex; mode=display">y' = f(y,t,p)</script>
<p>with sensitivities defined by</p>
<script type="math/tex; mode=display">\dfrac{d}{dt} \left( \dfrac{\partial y}{\partial p} \right) = \dfrac{\partial f}{\partial y} \dfrac{\partial y}{\partial p} + \dfrac{\partial f}{\partial p}</script>
<p>the sensitivities (the derivative of the ODE’s solution with respect to the parameters) is a matrix, namely</p>
<script type="math/tex; mode=display">\left( \dfrac{\partial y}{\partial p} \right)_{i,j} = \dfrac{\partial y_i}{\partial p_j}</script>
<p>I tried to get my website to show the actual matrix version of the above, but no dice. Note that, at $t_0$, if $p_j$ is a model parameter, then $\partial y_i / \partial p_j = 0$. If $p_j$ is an initial condition for one of the equations in our system (i.e. $y_i(t_0) = p_j$), then $\partial y_i / \partial p_j = 1$ (can you see why?). If we arrange the columns of $\partial y / \partial p$ so that all the model parameters are adjacent, and all the initial conditions are adjacent, that means that the identity matrix will appear somewhere in our initial condition for $\partial y / \partial p$. I’ve written a little <a href="https://github.com/Dpananos/ODEGSoC/blob/master/Scripts/ode_api.py#L30">helper method</a> to make this initial condition and then flatten it (it needs to be flattened in order to be passed to the integrator).</p>
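<p>As a sanity check on that column layout, here is a hedged NumPy sketch of such a helper (my own reimplementation of the idea, not the code from the repo): zeros in the columns for model parameters, an identity block in the columns for initial conditions, then flattened for the integrator.</p>

```python
import numpy as np

def sensitivity_ic(n_states, n_params):
    """Initial condition for dy/dp, flattened for the ODE integrator.

    Columns are ordered [model parameters | initial conditions], matching
    the layout described above.
    """
    dydp = np.zeros((n_states, n_params + n_states))
    dydp[:, n_params:] = np.eye(n_states)   # dy_i/dp_j = 1 when p_j = y_i(t0)
    return dydp.ravel()

# A 2-state system with 3 pure parameters: a flat vector of length 2 * 5
ic = sensitivity_ic(n_states=2, n_params=3)
print(ic.reshape(2, 5))
```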
<p>The notebook I posted 2 weeks ago used a “cached simulator” to avoid having to integrate the system twice for the same parameters in the forward and backward pass, thereby increasing the speed of the computation. I’ve made that <a href="https://github.com/Dpananos/ODEGSoC/blob/master/Scripts/ode_api.py#L87">a part of the API</a> so it is, more or less, automatic. Small digression: for some reason this was the HARDEST thing to implement. I struggled for longer than I care to admit. To fix whatever bug I was experiencing, I rewrote the entire API in a fit of rage and frustration.</p>
<p><a href="https://github.com/Dpananos/ODEGSoC/blob/master/Scripts/test_scalar_ode_1_param.ipynb">Here</a> is a notebook with a small example showing the API working. There are a few changes I plan to make that I hope will result in some speed ups, but as it stands all you need to do is: pass your ODE function, pass the initial time that your system evolves forward from, pass an array of times you would like to evaluate your ODE at, pass the number of states your system has, and pass the number of parameters your model has (pure parameters, not the number of parameters plus the number of initial conditions you plan on estimating).</p>
<h1><a href="https://dpananos.github.io/posts/2019/06/blog-post-16">GSoC 2019: A Sampling Notebook</a> (2019-06-24)</h1>
<p>OK, I will keep this one short and sweet, no math. We have a sampling notebook.</p>
<p>This isn’t completely new. The notebook I wrote is based, almost entirely, on <a href="https://github.com/pymc-devs/pymc3/blob/master/docs/source/notebooks/ODE_parameter_estimation.ipynb">this</a> notebook by Sanmitra Ghosh. Sanmitra computes the requisite gradient and Jacobian analytically. If you remember, my last blog post showed how to construct a Theano computation graph to compute those quantities automatically. This + That = Sampling Notebook (did I say no math? Wow, how long did I last? Not even a paragraph? I’m awful).</p>
<p><a href="https://github.com/Dpananos/ODEGSoC/blob/master/Notebooks/Sampling%20SIR%20model.ipynb">Here is my notebook</a>. But don’t snoop around the rest of the repo! You’re going to ruin all the fun surprises I have in store (and you will definitely see my buggy code).</p>
<p>It really was a matter of plug and play, and if you are really in need of ODE capabilities in PyMC3 (like… you need them yesterday), I’m fairly certain you can fork this notebook and run with it.</p>
<p>I’m currently working on putting this all in an API. I’m thinking you would do something like</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">ode_model</span> <span class="o">=</span> <span class="n">ODEModel</span><span class="p">(</span><span class="n">odefunc</span><span class="p">)</span>
<span class="n">ode_op</span> <span class="o">=</span> <span class="n">ODEop</span><span class="p">(</span><span class="n">ode_model</span><span class="p">)</span>
<span class="k">with</span> <span class="n">pm</span><span class="o">.</span><span class="n">Model</span><span class="p">()</span> <span class="k">as</span> <span class="n">model</span><span class="p">:</span>
<span class="c">#Write priors here</span>
<span class="n">solution</span> <span class="o">=</span> <span class="n">ode_op</span><span class="p">(</span><span class="n">parameters</span><span class="p">)</span>
<span class="c">#Write likelihood</span>
</code></pre></div></div>
<p>There are still some bugs to work out in my API (most notably working with asynchronous data), but I will save that for a later date. For now, feel free to fork and play with the notebook.</p>
<h1><a href="https://dpananos.github.io/posts/2019/05/blog-post-15">GSoC 2019: Gradient Descent for ODEs (But This Time, In Theano)</a> (2019-06-10)</h1>
<p>A little while ago, I wrote a post on doing gradient descent for ODEs. In that post, I used <code class="highlighter-rouge">autograd</code> to do the automatic differentiation. While neat, it was really a way for me to get familiar with some math that I was to use for GSoC. After taking some time to learn more about <code class="highlighter-rouge">theano</code>, I’ve reimplemented the blog post, this time using <code class="highlighter-rouge">theano</code> to perform the automatic differentiation. If you’ve read the previous post, then skip right to the code.</p>
<p>Gradient descent usually isn’t used to fit Ordinary Differential Equations (ODEs) to data (at least, that isn’t how the Applied Mathematics departments to which I have been a part have done it). Nevertheless, that doesn’t mean that it can’t be done. For some of my recent GSoC work, I’ve been investigating how to compute gradients of solutions to ODEs without access to the solution’s analytical form. In this blog post, I describe how these gradients can be computed and how they can be used to fit ODEs to synchronous data with gradient descent.</p>
<h2 id="up-to-speed-with-odes">Up To Speed With ODEs</h2>
<p>I realize not everyone might have studied ODEs. Here is everything you need to know:</p>
<p>A differential equation relates an unknown function $y \in \mathbb{R}^n$ to its own derivative through a function $f: \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^m \rightarrow \mathbb{R}^n$, which also depends on time $t \in \mathbb{R}$ and possibly a set of parameters $\theta \in \mathbb{R}^m$. We usually write ODEs as</p>
<script type="math/tex; mode=display">y' = f(y,t,\theta) \quad y(t_0) = y_0</script>
<p>Here, we refer to the vector $y$ as “the system”, since the ODE above really defines a system of equations. The problem is usually equipped with an initial state of the system $y(t_0) = y_0$ from which the system evolves forward in $t$. Solutions to ODEs in analytic form are often <em>very hard</em>, if not impossible, to obtain, so most of the time we just numerically approximate the solution. It doesn’t matter how this is done because numerical integration is not the point of this post. If you’re interested, look up the class of <em>Runge-Kutta</em> methods.</p>
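<p>For concreteness, here is a minimal fixed-step fourth-order Runge–Kutta integrator applied to the toy problem $y' = -\theta y$, whose exact solution is $y_0 e^{-\theta t}$. This is a sketch for intuition only; in practice you would call a library integrator like <code class="highlighter-rouge">scipy.integrate.odeint</code>:</p>

```python
import numpy as np

def rk4_step(f, y, t, h, theta):
    # One classical fourth-order Runge-Kutta step of size h
    k1 = np.asarray(f(y, t, theta))
    k2 = np.asarray(f(y + h / 2 * k1, t + h / 2, theta))
    k3 = np.asarray(f(y + h / 2 * k2, t + h / 2, theta))
    k4 = np.asarray(f(y + h * k3, t + h, theta))
    return y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def integrate(f, y0, ts, theta):
    # March the system forward over the time grid ts
    ys = [np.asarray(y0, dtype=float)]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        ys.append(rk4_step(f, ys[-1], t0, t1 - t0, theta))
    return np.array(ys)

# y' = -theta * y with y(0) = 1 and theta = 1: exact solution exp(-t)
f = lambda y, t, theta: -theta * y
ts = np.linspace(0, 5, 101)
ys = integrate(f, np.array([1.0]), ts, 1.0)
# maximum absolute error against the exact solution; RK4's global error is O(h^4)
print(np.max(np.abs(ys[:, 0] - np.exp(-ts))))
```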
<h2 id="computing-gradients-for-odes">Computing Gradients for ODEs</h2>
<p>In this section, I’m going to be using derivative notation rather than $\nabla$ for gradients. I think it is less ambiguous.</p>
<p>If we want to fit an ODE model to data by minimizing some loss function $\mathcal{L}$, then gradient descent looks like</p>
<script type="math/tex; mode=display">\theta_{n+1} = \theta_n - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}</script>
<p>In order to compute the gradient of the loss, we need the gradient of the solution, $y$, with respect to $\theta$. The gradient of the solution is the hard part here because it cannot be computed (a) analytically (because analytic solutions are hard AF), or (b) through automatic differentiation without differentiating through the numerical integration of our ODE (which seems computationally wasteful).</p>
<p>Thankfully, years of research into ODEs yields a way to do this (that is not the adjoint method. Surprise! You thought I was going to say the adjoint method didn’t you?). Forward mode sensitivity analysis calculates gradients by extending the ODE system to include the following equations:</p>
<script type="math/tex; mode=display">\dfrac{d}{dt}\left( \dfrac{\partial y}{\partial \theta} \right) = \mathcal{J}_f \dfrac{\partial y}{\partial \theta} +\dfrac{\partial f}{\partial \theta}</script>
<p>Here, $\mathcal{J}_f$ is the Jacobian of $f$ with respect to $y$. The forward sensitivity analysis is <em>just another differential equation</em> (see how it relates the derivative of the unknown $\partial y / \partial \theta$ to itself?)! In order to compute the gradient of $y$ with respect to $\theta$ at time $t_i$, we compute</p>
<script type="math/tex; mode=display">\dfrac{\partial y}{\partial \theta} = \int_{t_0}^{t_i} \mathcal{J}_f \dfrac{\partial y}{\partial \theta} + \dfrac{\partial f}{\partial \theta} \, dt</script>
<p>I know this looks scary, but since forward mode sensitivities are just ODEs, we actually just get this from what we can consider to be a black box</p>
<script type="math/tex; mode=display">\dfrac{\partial y}{\partial \theta} = \operatorname{BlackBox}(f(y,t,\theta), t_0, y_0, \theta)</script>
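<p>As a sanity check on a case with a known answer, take the scalar ODE $y' = -\theta y$ with $y(0) = y_0$. Then $\mathcal{J}_f = -\theta$, $\partial f / \partial \theta = -y$, and the exact sensitivity is $\partial y / \partial \theta = -t \, y_0 e^{-\theta t}$. Integrating the augmented system recovers it (a toy sketch, not part of the GSoC code):</p>

```python
import numpy as np
from scipy.integrate import odeint

def augmented(Y, t, theta):
    # Y[0] is the state y; Y[1] is the sensitivity s = dy/dtheta
    y, s = Y
    dydt = -theta * y        # the original ODE
    dsdt = -theta * s - y    # J_f * s + df/dtheta
    return [dydt, dsdt]

theta, y0 = 2.0, 1.0
ts = np.linspace(0, 3, 61)
# The sensitivity starts at 0 because theta does not affect the initial condition
sol = odeint(augmented, [y0, 0.0], ts, args=(theta,))
exact = -ts * y0 * np.exp(-theta * ts)
print(np.max(np.abs(sol[:, 1] - exact)))  # agreement up to integrator tolerance
```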
<p>So now that we have our gradient in hand, we can use the chain rule to write</p>
<script type="math/tex; mode=display">\dfrac{\partial \mathcal{L}}{\partial \theta} =\dfrac{\partial \mathcal{L}}{\partial y} \dfrac{\partial y}{\partial \theta}</script>
<p>We can use automatic differentiation to compute $\dfrac{\partial \mathcal{L}}{\partial y}$.</p>
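<p>Putting the pieces together on the same toy problem, the chain-rule gradient of a squared-error loss can be checked against a finite difference (a sketch; all names here are illustrative):</p>

```python
import numpy as np
from scipy.integrate import odeint

def augmented(Y, t, theta):
    # state y and its sensitivity s = dy/dtheta for y' = -theta * y
    y, s = Y
    return [-theta * y, -theta * s - y]

def solve(theta, ts, y0=1.0):
    return odeint(augmented, [y0, 0.0], ts, args=(theta,))

ts = np.linspace(0, 3, 31)
y_obs = solve(2.0, ts)[:, 0]          # pretend these are noiseless data

theta = 1.5
sol = solve(theta, ts)
y, s = sol[:, 0], sol[:, 1]
# L = sum_i (y_i - y_obs_i)^2, so dL/dy = 2 (y - y_obs); chain rule with s
grad = np.sum(2 * (y - y_obs) * s)

# Central finite difference as an independent check
eps = 1e-5
loss = lambda th: np.sum((solve(th, ts)[:, 0] - y_obs) ** 2)
fd = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)
print(grad, fd)  # the two should agree closely
```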
<p>OK, so that is some math (interesting to me, maybe not so much to you). Let’s actually implement this in python.</p>
<h2 id="gradient-descent-for-the-sir-model">Gradient Descent for the SIR Model</h2>
<p>The SIR model is a set of differential equations which govern how a disease spreads through a homogeneously mixed closed population. I could write an entire thesis on this model and its various extensions (in fact, I have), so I’ll let you read about those in your free time.</p>
<p>The system, shown below, is parameterized by a single parameter:</p>
<script type="math/tex; mode=display">\dfrac{dS}{dt} = -\theta SI \quad S(0) = 0.99</script>
<script type="math/tex; mode=display">\dfrac{dI}{dt} = \theta SI - I \quad I(0) = 0.01</script>
<p>Let’s define the system, the appropriate derivatives, generate some observations and fit $\theta$ using gradient descent. Here is what you’ll need to get started:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">theano</span>
<span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="k">as</span> <span class="n">tt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">scipy.integrate</span>
</code></pre></div></div>
<p>Let’s then define the ODE system</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">t</span><span class="p">,</span> <span class="n">theta</span><span class="p">):</span>
<span class="s">"""This is the ODE system.
The function can act on either numpy arrays or theano TensorVariables
Args:
y (vector): system state
t (float): current time (optional)
theta (vector): parameters of the ODEs
Returns:
dydt (list): result of the ODEs
"""</span>
<span class="k">return</span> <span class="p">[</span>
<span class="o">-</span><span class="n">theta</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="c">#= dS/dt</span>
<span class="n">theta</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">-</span> <span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="c">#= dI/dt</span>
<span class="p">]</span>
</code></pre></div></div>
<p>and create a computation graph with <code class="highlighter-rouge">theano</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Define the differential Equation</span>
<span class="c">#Present state of the system</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">dvector</span><span class="p">(</span><span class="s">'y'</span><span class="p">)</span>
<span class="c">#Parameter: Basic reproductive ratio</span>
<span class="n">p</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">dscalar</span><span class="p">(</span><span class="s">'p'</span><span class="p">)</span>
<span class="c">#Present state of the gradients: will always be 0 unless the parameter is the initial condition</span>
<span class="n">dydp</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">dvector</span><span class="p">(</span><span class="s">'dydp'</span><span class="p">)</span>
<span class="n">f_tensor</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">stack</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="bp">None</span><span class="p">,</span> <span class="n">p</span><span class="p">))</span>
<span class="c">#Now compute gradients</span>
<span class="c">#Use Rop to compute the Jacobian vector product</span>
<span class="n">Jdfdy</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">Rop</span><span class="p">(</span><span class="n">f_tensor</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">dydp</span><span class="p">)</span>
<span class="n">grad_f</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">jacobian</span><span class="p">(</span><span class="n">f_tensor</span><span class="p">,</span> <span class="n">p</span><span class="p">)</span>
<span class="c">#This is the time derivative of dydp</span>
<span class="n">ddt_dydp</span> <span class="o">=</span> <span class="n">Jdfdy</span> <span class="o">+</span> <span class="n">grad_f</span>
<span class="c">#Compile the system as a theano function</span>
<span class="c">#Args:</span>
<span class="c">#y - array of length 2 representing current state of the system (S and I)</span>
<span class="c">#dydp - array of length 2 representing current state of the gradient (dS/dp and dI/dp)</span>
<span class="n">system</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>
<span class="n">inputs</span><span class="o">=</span><span class="p">[</span><span class="n">y</span><span class="p">,</span> <span class="n">dydp</span><span class="p">,</span> <span class="n">p</span><span class="p">],</span>
<span class="n">outputs</span><span class="o">=</span><span class="p">[</span><span class="n">f_tensor</span><span class="p">,</span> <span class="n">ddt_dydp</span><span class="p">]</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Next, we’ll define the augmented system (that is, the ODE plus the sensitivities).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#create a function to spit out derivatives</span>
<span class="k">def</span> <span class="nf">ODESYS</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span><span class="n">t</span><span class="p">,</span><span class="n">p</span><span class="p">):</span>
<span class="s">"""
Args:
Y (vector): current state and gradient
t (scalar): current time
p (vector): parameters
Returns:
derivatives (vector): derivatives of state and gradient
"""</span>
<span class="n">dydt</span><span class="p">,</span> <span class="n">dydp</span> <span class="o">=</span> <span class="n">system</span><span class="p">(</span><span class="n">Y</span><span class="p">[:</span><span class="mi">2</span><span class="p">],</span> <span class="n">Y</span><span class="p">[</span><span class="mi">2</span><span class="p">:],</span> <span class="n">p</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">dydt</span><span class="p">,</span> <span class="n">dydp</span><span class="p">])</span>
</code></pre></div></div>
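<p>If you don’t have a working <code class="highlighter-rouge">theano</code> install, the same augmented right-hand side can be written out by hand for the SIR model, since its Jacobian and parameter derivative are easy to compute on paper. This pure-NumPy sketch is equivalent in spirit to <code class="highlighter-rouge">ODESYS</code> above (not a drop-in replacement):</p>

```python
import numpy as np
from scipy.integrate import odeint

def odesys_np(Y, t, p):
    # Y = [S, I, dS/dp, dI/dp]
    S, I = Y[0], Y[1]
    dS = -p * S * I                      # dS/dt
    dI = p * S * I - I                   # dI/dt
    # Jacobian of f with respect to (S, I), worked out by hand
    J = np.array([[-p * I, -p * S],
                  [ p * I,  p * S - 1.0]])
    dfdp = np.array([-S * I, S * I])     # df/dp
    sens = J @ Y[2:] + dfdp              # forward sensitivity equations
    return [dS, dI, sens[0], sens[1]]

Y0 = np.array([0.99, 0.01, 0.0, 0.0])
t_dense = np.linspace(0, 5, 101)
sol = odeint(odesys_np, Y0, t_dense, args=(8.0,))
# sol[:, :2] holds (S, I); sol[:, 2:] holds their gradients with respect to p
```

The gradients in <code class="highlighter-rouge">sol[:, 2:]</code> can be verified against central finite differences of the plain SIR solution.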
<p>We’ll optimize the $L_2$ norm of the error. This is done in <code class="highlighter-rouge">theano</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Tensor for observed data</span>
<span class="n">t_y_obs</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">dmatrix</span><span class="p">(</span><span class="s">'y_obs'</span><span class="p">)</span>
<span class="c">#Tensor for predictions</span>
<span class="n">t_y_pred</span> <span class="o">=</span> <span class="n">tt</span><span class="o">.</span><span class="n">dmatrix</span><span class="p">(</span><span class="s">'y_pred'</span><span class="p">)</span>
<span class="c">#Define error and cost</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">t_y_obs</span> <span class="o">-</span> <span class="n">t_y_pred</span>
<span class="n">Cost</span> <span class="o">=</span> <span class="n">err</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span><span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="n">Cost_gradient</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">tensor</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">Cost</span><span class="p">,</span><span class="n">t_y_pred</span><span class="p">)</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">t_y_obs</span><span class="p">,</span> <span class="n">t_y_pred</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="n">Cost</span><span class="p">)</span>
<span class="n">del_C</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span><span class="n">inputs</span> <span class="o">=</span> <span class="p">[</span><span class="n">t_y_obs</span><span class="p">,</span> <span class="n">t_y_pred</span><span class="p">],</span> <span class="n">outputs</span><span class="o">=</span><span class="n">Cost_gradient</span><span class="p">)</span>
</code></pre></div></div>
<p>Create some observations from which to fit</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Initial Condition</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">19920908</span><span class="p">)</span>
<span class="n">Y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">0.99</span><span class="p">,</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">])</span>
<span class="c">#Space to compute solutions</span>
<span class="n">t_dense</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">101</span><span class="p">)</span>
<span class="c">#True param value</span>
<span class="n">p_true</span> <span class="o">=</span> <span class="mi">8</span>
<span class="n">y_hat_theano</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">integrate</span><span class="o">.</span><span class="n">odeint</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span> <span class="n">y0</span><span class="o">=</span><span class="n">Y0</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="n">t_dense</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">p_true</span><span class="p">,))</span>
<span class="n">S_obs</span><span class="p">,</span><span class="n">I_obs</span><span class="p">,</span> <span class="o">*</span><span class="n">_</span> <span class="o">=</span> <span class="n">y_hat_theano</span><span class="o">.</span><span class="n">T</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mf">0.1</span><span class="p">,</span><span class="n">size</span> <span class="o">=</span> <span class="n">y_hat_theano</span><span class="o">.</span><span class="n">T</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">y_obs</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">c_</span><span class="p">[</span><span class="n">S_obs</span><span class="p">,</span><span class="n">I_obs</span><span class="p">]</span>
</code></pre></div></div>
<p>Perform Gradient Descent</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="n">p_gd</span> <span class="o">=</span> <span class="mf">1.1</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mf">0.01</span>
<span class="n">num_steps</span> <span class="o">=</span> <span class="mi">1000</span>
<span class="n">prev_cost</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s">'inf'</span><span class="p">)</span>
<span class="n">tol</span> <span class="o">=</span> <span class="mf">1e-5</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="c">#Evaluate solution</span>
<span class="n">y_hat_theano</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">integrate</span><span class="o">.</span><span class="n">odeint</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span> <span class="n">y0</span><span class="o">=</span><span class="n">Y0</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="n">t_dense</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">p_gd</span><span class="p">,))</span>
<span class="c">#Splice out the numerical solution and numerical gradients</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">y_hat_theano</span><span class="p">[:,:</span><span class="mi">2</span><span class="p">]</span>
<span class="n">gradients</span> <span class="o">=</span> <span class="n">y_hat_theano</span><span class="p">[:,</span><span class="mi">2</span><span class="p">:]</span>
<span class="c">#Perform the gradient step</span>
<span class="n">p_gd</span><span class="o">-=</span> <span class="n">learning_rate</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">del_C</span><span class="p">(</span><span class="n">y_obs</span><span class="p">,</span><span class="n">y_pred</span><span class="p">)</span><span class="o">*</span><span class="n">gradients</span><span class="p">)</span>
<span class="c">#Has the loss changed a large amount?</span>
<span class="n">cost</span> <span class="o">=</span> <span class="n">C</span><span class="p">(</span><span class="n">y_obs</span><span class="p">,</span><span class="n">y_pred</span><span class="p">)</span>
<span class="c">#If so, keep going. Stop when the loss stops shrinking</span>
<span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="n">cost</span> <span class="o">-</span> <span class="n">prev_cost</span><span class="p">)</span><span class="o"><</span><span class="n">tol</span><span class="p">:</span>
<span class="k">break</span>
<span class="n">prev_cost</span> <span class="o">=</span> <span class="n">cost</span>
</code></pre></div></div>
<p>And lastly, compare our fitted curves to the true curves</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Solution under the true parameter value, for comparison</span>
<span class="n">y_hat_analytical</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">integrate</span><span class="o">.</span><span class="n">odeint</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span> <span class="n">y0</span><span class="o">=</span><span class="n">Y0</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="n">t_dense</span><span class="p">,</span> <span class="n">args</span><span class="o">=</span><span class="p">(</span><span class="n">p_true</span><span class="p">,))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">dpi</span> <span class="o">=</span> <span class="mi">120</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t_dense</span><span class="p">,</span><span class="n">y_obs</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">marker</span> <span class="o">=</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'S observed'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t_dense</span><span class="p">,</span><span class="n">y_obs</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span><span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">marker</span> <span class="o">=</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'I observed'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t_dense</span><span class="p">,</span><span class="n">y_hat_theano</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'k'</span><span class="p">,</span> <span class="n">zorder</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'fitted'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t_dense</span><span class="p">,</span><span class="n">y_hat_theano</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'k'</span><span class="p">,</span> <span class="n">zorder</span> <span class="o">=</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t_dense</span><span class="p">,</span><span class="n">y_hat_analytical</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mi">3</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'true'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t_dense</span><span class="p">,</span><span class="n">y_hat_analytical</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<p>Here is our result!</p>
<div style="text-align:center"><img src="/images/blog/final.png" /></div>Demetri Pananosapananos@uwo.caA little while ago, I wrote a post on doing gradient descent for ODEs. In that post, I used autograd to do the automatic differentiation. While neat, it was really a way for me to get familiar with some math that I was going to use for GSoC. After taking some time to learn more about theano, I’ve reimplemented the blog post, this time using theano to perform the automatic differentiation. If you’ve read the previous post, then skip right to the code.GSoC 2019: ODE Inception2019-05-21T00:00:00-07:002019-05-21T00:00:00-07:00https://dpananos.github.io/posts/2019/05/blog-post-13<p>Let’s take stock of exactly where we are in this journey to implement HMC for differential equations.</p>
<p>To do HMC, we concatenate all the parameters we wish to estimate into a single vector (call it $\theta$). Then, create a momentum variable (call it $p$ for some weird reason. Why do physicists denote MMMMMMomentum with $p$?) of the same dimension as $\theta$ and consider the joint distribution</p>
<script type="math/tex; mode=display">\pi(\theta,p) = \pi(p \vert \theta)\pi(\theta)</script>
<p>We choose a momentum distribution for $p$ and then use the Hamiltonian to define a set of differential equations governing the dynamics of an imaginary particle on some high dimensional surface. The differential equations are</p>
<script type="math/tex; mode=display">\dfrac{d\theta}{dt} = p</script>
<script type="math/tex; mode=display">\dfrac{dp}{dt} = -\dfrac{\partial V}{\partial \theta}</script>
<p>Here, $V = -\log(\pi(\theta))$, the negative log density of our parameters. Given some initial random momentum, the ODEs above say that the particle’s rate of change per unit time is equivalent to the momentum, but the momentum’s rate of change depends on the shape of the surface. <em>So the momentum governs the trajectory in time over the surface, and the surface governs how the momentum will change in time</em>. Yes, that certainly sounds like a differential equation problem.</p>
<p>I won’t get into how we integrate this system. I think <a href="https://colindcarroll.com/">Colin Carroll</a> has done a very good job boiling down the essence of HMC into something more digestible, but if you want the <em>whole shebang</em> on HMC, please read Michael Betancourt’s “<em>A Conceptual Introduction to Hamiltonian Monte Carlo</em>”.</p>
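<p>For intuition only, here is a bare-bones sketch of the leapfrog scheme those references describe, applied to a one-dimensional standard Gaussian target (so $V(\theta) = \theta^2/2$). It is illustrative, not how PyMC3 actually implements things:</p>

```python
import numpy as np

def leapfrog(theta, p, grad_V, step, n_steps):
    # Simulate Hamiltonian dynamics: dtheta/dt = p, dp/dt = -dV/dtheta
    p = p - 0.5 * step * grad_V(theta)       # initial half step for momentum
    for _ in range(n_steps - 1):
        theta = theta + step * p             # full step for position
        p = p - step * grad_V(theta)         # full step for momentum
    theta = theta + step * p
    p = p - 0.5 * step * grad_V(theta)       # final half step for momentum
    return theta, p

grad_V = lambda th: th                            # V(theta) = theta^2 / 2
H = lambda th, p: 0.5 * th ** 2 + 0.5 * p ** 2    # potential plus kinetic energy

theta1, p1 = leapfrog(1.0, 0.5, grad_V, step=0.1, n_steps=50)
# Leapfrog nearly conserves H, which is why HMC proposals are accepted so often
print(abs(H(theta1, p1) - H(1.0, 0.5)))
```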
<h2 id="we-solve-odes-to-compute-gradients-of-odes-to-use-in-other-odes">We Solve ODEs to Compute Gradients of ODEs to Use in Other ODEs</h2>
<p>So HMC is just (ok not JUST, but for the purposes of this blog post) a set of differential equations which can be cleverly integrated to get samples from a posterior. There is a bit of a hiccup when we do this with ODEs. $dp / dt$ is a vector in which each component is $\partial V / \partial \phi$. Here, $\phi$ is a particular parameter of our model, and so is an element of $\theta$. If $\phi$ is used to parameterize a differential equation for $y$, then computing gradients is especially difficult because we usually solve differential equations numerically, not analytically. This prevents us from using the chain rule to compute $\partial V / \partial \phi = \partial V/ \partial y \times \partial y / \partial \phi$.</p>
<p>Woe is me, I can’t compute the gradient of my second differential equation to plug into my first differential equation. But, if I solve a <em>third</em> differential equation, then I can compute the gradient of my second differential equation to plug into my first differential equation without having to solve and differentiate the second differential equation! Here is where Xzibit would come out and whisper “Yo Dawg, we heard you like differential equations”.</p>
<p>I understand that last part might be hard to parse so let me explain. In late 2018, a team at The University of Toronto published a new neural network architecture known as the “Neural ODE”. The details of that are not important, but they essentially faced the same problems as we do: they needed gradients of an ODE’s solution with respect to the parameters. In that paper, they use <em>The Adjoint Method</em> to do so. Again, not important how this method works exactly (maybe the topic of a future post). The adjoint method allows for the computation of a functional’s gradient directly, which means instead of computing $\partial y / \partial \phi$ and then $\partial V/ \partial y$, we can compute $\partial V / \partial \phi$ directly. This computation is performed by solving the differential equation for $y$ as well as some other differential equations at the same time.</p>
<p>Go ahead and look at the paper. Appendix B.2 basically outlines a set of differential equations, one of which is the equation for which we want to do inference, and another is a differential equation for $d L / d \theta$ (which I called $d V / d \phi$).</p>
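<p>As a toy illustration of what the adjoint method buys, take the scalar ODE $y' = f(y, \theta) = -\theta y$ with loss $L = (y(T) - d)^2$. Solve $y$ forward, then solve the adjoint $a$ backward with $a(T) = \partial L / \partial y(T)$ and $da/dt = -a \, \partial f / \partial y$, and finally $dL/d\theta = \int_0^T a \, \partial f / \partial \theta \, dt$. This hand-rolled sketch is mine, not the paper’s code, and sign conventions vary between write-ups:</p>

```python
import numpy as np
from scipy.integrate import odeint

theta, y0, T, d = 1.5, 1.0, 2.0, 0.1
ts = np.linspace(0.0, T, 201)

# Forward pass: solve y' = -theta * y
y = odeint(lambda y, t: -theta * y, y0, ts)[:, 0]
yT = y[-1]

# Backward pass: da/dt = -a * (df/dy) = theta * a, started from a(T) = dL/dy(T)
a_T = 2.0 * (yT - d)
a = odeint(lambda a, t: theta * a, a_T, ts[::-1])[::-1, 0]

# Trapezoid quadrature for dL/dtheta = integral of a * (df/dtheta) dt, df/dtheta = -y
g = a * (-y)
grad = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(ts))

# Closed-form check via dy(T)/dtheta = -T * y(T)
exact = 2.0 * (yT - d) * (-T * yT)
print(grad, exact)
```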
<p>So, to recap, if we solve the adjoint differential equation, we can plug the result into the Hamiltonian differential equation without having to compute derivatives for the differential equation on which we intend to perform Bayesian analysis. <em>We solve ODEs to compute gradients of ODEs to use in other ODEs</em>.</p>
<h2 id="implementation">Implementation</h2>
<p>Here is where the wheels fall off. I actually tried implementing this for a very simple differential equation, but I don’t think I’ve really mastered the material. I’m going to lean on my mentor to help me understand how Theano does the AD while I bone up on some AD theory and make some toy models in python.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I think the path is a little clearer now. I can’t say the adjoint method is intuitive, and HMC isn’t either, but they are becoming less opaque black boxes than they previously were. I have a lot of doubts about whether this can be done by the end of summer (most of them have to do with my own ability), but damnit if I don’t give it my best shot.</p>Demetri Pananosapananos@uwo.caLet’s take stock of exactly where we are in this journey to implement HMC for differential equations.Gradient Descent for ODEs2019-05-21T00:00:00-07:002019-05-21T00:00:00-07:00https://dpananos.github.io/posts/2019/05/blog-post-14<p>Gradient descent usually isn’t used to fit Ordinary Differential Equations (ODEs) to data (at least, that isn’t how the Applied Mathematics departments of which I have been a part have done it). Nevertheless, that doesn’t mean that it can’t be done. For some of my recent GSoC work, I’ve been investigating how to compute gradients of solutions to ODEs without access to the solution’s analytical form. In this blog post, I describe how these gradients can be computed and how they can be used to fit ODEs to synchronous data with gradient descent.</p>
<h2 id="up-to-speed-with-odes">Up To Speed With ODEs</h2>
<p>I realize not everyone might have studied ODEs. Here is everything you need to know:</p>
<p>A differential equation relates an unknown function $y \in \mathbb{R}^n$ to its own derivative through a function $f: \mathbb{R}^n \times \mathbb{R} \times \mathbb{R}^m \rightarrow \mathbb{R}^n$, which also depends on time $t \in \mathbb{R}$ and possibly a set of parameters $\theta \in \mathbb{R}^m$. We usually write ODEs as</p>
<script type="math/tex; mode=display">y' = f(y,t,\theta) \quad y(t_0) = y_0</script>
<p>Here, we refer to the vector $y$ as “the system”, since the ODE above really defines a system of equations. The problem is usually equipped with an initial state of the system $y(t_0) = y_0$ from which the system evolves forward in $t$. Solutions to ODEs in analytic form are often <em>very hard</em>, if not impossible, to obtain, so most of the time we just numerically approximate the solution. It doesn’t matter how this is done because numerical integration is not the point of this post. If you’re interested, look up the class of <em>Runge-Kutta</em> methods.</p>
<h2 id="computing-gradients-for-odes">Computing Gradients for ODEs</h2>
<p>In this section, I’m going to be using derivative notation rather than $\nabla$ for gradients. I think it is less ambiguous.</p>
<p>If we want to fit an ODE model to data by minimizing some loss function $\mathcal{L}$, then gradient descent looks like</p>
<script type="math/tex; mode=display">\theta_{n+1} = \theta_n - \alpha \dfrac{\partial \mathcal{L}}{\partial \theta}</script>
<p>In order to compute the gradient of the loss, we need the gradient of the solution, $y$, with respect to $\theta$. The gradient of the solution is the hard part here because it cannot be computed (a) analytically (because analytic solutions are hard AF), or (b) through automatic differentiation without differentiating through the numerical integration of our ODE (which seems computationally wasteful).</p>
<p>Thankfully, years of research into ODEs yields a way to do this (that is not the adjoint method. Surprise! You thought I was going to say the adjoint method didn’t you?). Forward mode sensitivity analysis calculates gradients by extending the ODE system to include the following equations:</p>
<script type="math/tex; mode=display">\dfrac{d}{dt}\left( \dfrac{\partial y}{\partial \theta} \right) = \mathcal{J}_f \dfrac{\partial y}{\partial \theta} +\dfrac{\partial f}{\partial \theta}</script>
<p>Here, $\mathcal{J}_f$ is the Jacobian of $f$ with respect to $y$. The forward sensitivity analysis is <em>just another differential equation</em> (see how it relates the derivative of the unknown $\partial y / \partial \theta$ to itself?)! In order to compute the gradient of $y$ with respect to $\theta$ at time $t_i$, we compute</p>
<script type="math/tex; mode=display">\dfrac{\partial y}{\partial \theta} = \int_{t_0}^{t_i} \mathcal{J}_f \dfrac{\partial y}{\partial \theta} + \dfrac{\partial f}{\partial \theta} \, dt</script>
<p>I know this looks scary, but since forward mode sensitivities are just ODEs, we actually just get this from what we can consider to be a black box</p>
<script type="math/tex; mode=display">\dfrac{\partial y}{\partial \theta} = \operatorname{BlackBox}(f(y,t,\theta), t_0, y_0, \theta)</script>
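<p>As a quick sanity check on the scalar case (a worked example of my own, not from any reference): for $f(y, t, \theta) = \theta y$ with $y(0) = y_0$, we have $\mathcal{J}_f = \theta$ and $\partial f / \partial \theta = y$, so the sensitivity $s = \partial y / \partial \theta$ obeys</p>

```latex
\frac{ds}{dt} = \theta\, s + y, \qquad s(0) = 0
\quad\Longrightarrow\quad
s(t) = y_0\, t\, e^{\theta t}
```

<p>which is exactly what you get by differentiating the analytic solution $y(t) = y_0 e^{\theta t}$ with respect to $\theta$ directly.</p>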
<p>So now that we have our gradient in hand, we can use the chain rule to write</p>
<script type="math/tex; mode=display">\dfrac{\partial \mathcal{L}}{\partial \theta} =\dfrac{\partial \mathcal{L}}{\partial y} \dfrac{\partial y}{\partial \theta}</script>
<p>We can use automatic differentiation to compute $\dfrac{\partial \mathcal{L}}{\partial y}$.</p>
<p>OK, so that is some math (interesting to me, maybe not so much to you). Let’s actually implement this in Python.</p>
<h2 id="gradient-descent-for-the-sir-model">Gradient Descent for the SIR Model</h2>
<p>The SIR model is a set of differential equations which govern how a disease spreads through a homogeneously mixed, closed population. I could write an entire thesis on this model and its various extensions (in fact, I have), so I’ll let you read about those in your free time.</p>
<p>The system, shown below, is parameterized by a single parameter, $\theta$ (time has been scaled so that the recovery rate is 1):</p>
<script type="math/tex; mode=display">\dfrac{dS}{dt} = -\theta SI \quad S(0) = 0.99</script>
<script type="math/tex; mode=display">\dfrac{dI}{dt} = \theta SI - I \quad I(0) = 0.01</script>
<p>Let’s define the system and the appropriate derivatives, generate some observations, and fit $\theta$ using gradient descent. Here is what you’ll need to get started:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">autograd</span>
<span class="kn">from</span> <span class="nn">autograd.builtins</span> <span class="kn">import</span> <span class="nb">tuple</span>
<span class="kn">import</span> <span class="nn">autograd.numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c">#Import ode solver and rename as BlackBox for consistency with blog</span>
<span class="kn">from</span> <span class="nn">scipy.integrate</span> <span class="kn">import</span> <span class="n">odeint</span> <span class="k">as</span> <span class="n">BlackBox</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
</code></pre></div></div>
<p>Let’s then define the ODE system</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">f</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="n">t</span><span class="p">,</span><span class="n">theta</span><span class="p">):</span>
<span class="s">'''Function describing dynamics of the system'''</span>
<span class="n">S</span><span class="p">,</span><span class="n">I</span> <span class="o">=</span> <span class="n">y</span>
<span class="n">ds</span> <span class="o">=</span> <span class="o">-</span><span class="n">theta</span><span class="o">*</span><span class="n">S</span><span class="o">*</span><span class="n">I</span>
<span class="n">di</span> <span class="o">=</span> <span class="n">theta</span><span class="o">*</span><span class="n">S</span><span class="o">*</span><span class="n">I</span> <span class="o">-</span> <span class="n">I</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="n">ds</span><span class="p">,</span><span class="n">di</span><span class="p">])</span>
</code></pre></div></div>
<p>and take appropriate derivatives</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#Jacobian wrt y</span>
<span class="n">J</span> <span class="o">=</span> <span class="n">autograd</span><span class="o">.</span><span class="n">jacobian</span><span class="p">(</span><span class="n">f</span><span class="p">,</span><span class="n">argnum</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c">#Gradient wrt theta</span>
<span class="n">grad_f_theta</span> <span class="o">=</span> <span class="n">autograd</span><span class="o">.</span><span class="n">jacobian</span><span class="p">(</span><span class="n">f</span><span class="p">,</span><span class="n">argnum</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>
<p>Next, we’ll define the augmented system (that is, the ODE plus the sensitivities).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">ODESYS</span><span class="p">(</span><span class="n">Y</span><span class="p">,</span><span class="n">t</span><span class="p">,</span><span class="n">theta</span><span class="p">):</span>
<span class="c">#Y will be length 4.</span>
<span class="c">#Y[0], Y[1] are the ODEs</span>
<span class="c">#Y[2], Y[3] are the sensitivities</span>
<span class="c">#ODE</span>
<span class="n">dy_dt</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">Y</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">],</span><span class="n">t</span><span class="p">,</span><span class="n">theta</span><span class="p">)</span>
<span class="c">#Sensitivities</span>
<span class="n">grad_y_theta</span> <span class="o">=</span> <span class="n">J</span><span class="p">(</span><span class="n">Y</span><span class="p">[:</span><span class="mi">2</span><span class="p">],</span><span class="n">t</span><span class="p">,</span><span class="n">theta</span><span class="p">)</span><span class="nd">@Y</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">::]</span> <span class="o">+</span> <span class="n">grad_f_theta</span><span class="p">(</span><span class="n">Y</span><span class="p">[:</span><span class="mi">2</span><span class="p">],</span><span class="n">t</span><span class="p">,</span><span class="n">theta</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">concatenate</span><span class="p">([</span><span class="n">dy_dt</span><span class="p">,</span><span class="n">grad_y_theta</span><span class="p">])</span>
</code></pre></div></div>
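<p>Before trusting these sensitivities, it is worth checking them against a central finite difference in $\theta$ (this check is my own addition, not part of the original workflow). The snippet below writes out the Jacobian and $\partial f / \partial \theta$ by hand instead of via <code class="highlighter-rouge">autograd</code>, so it runs standalone:</p>

```python
import numpy as np
from scipy.integrate import odeint

# Standalone sanity check: forward sensitivities vs. a central finite
# difference in theta. For the SIR system above, by hand:
#   J_f = [[-theta*I, -theta*S], [theta*I, theta*S - 1]]
#   df/dtheta = [-S*I, S*I]
def ODESYS(Y, t, theta):
    S, I = Y[0], Y[1]
    dy_dt = np.array([-theta * S * I, theta * S * I - I])
    J = np.array([[-theta * I, -theta * S],
                  [ theta * I,  theta * S - 1.0]])
    df_dtheta = np.array([-S * I, S * I])
    d_sens = J @ Y[2:] + df_dtheta
    return np.concatenate([dy_dt, d_sens])

Y0 = np.array([0.99, 0.01, 0.0, 0.0])
t = np.linspace(0, 5, 101)
theta, h = 5.5, 1e-4
tol = dict(rtol=1e-10, atol=1e-10)

# Sensitivities from the augmented system...
sens = odeint(ODESYS, Y0, t, args=(theta,), **tol)[:, 2:]
# ...versus a central finite difference of the solution itself
hi = odeint(ODESYS, Y0, t, args=(theta + h,), **tol)[:, :2]
lo = odeint(ODESYS, Y0, t, args=(theta - h,), **tol)[:, :2]
finite_diff = (hi - lo) / (2 * h)

max_disagreement = np.abs(sens - finite_diff).max()
```

<p>The two ways of computing $\partial y / \partial \theta$ should agree to several decimal places, which they do.</p>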
<p>We’ll minimize the mean $L_2$ norm of the error</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">Cost</span><span class="p">(</span><span class="n">y_obs</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">cost</span><span class="p">(</span><span class="n">Y</span><span class="p">):</span>
<span class="s">'''Squared Error Loss'''</span>
<span class="n">n</span> <span class="o">=</span> <span class="n">y_obs</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
<span class="n">err</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">norm</span><span class="p">(</span><span class="n">y_obs</span> <span class="o">-</span> <span class="n">Y</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">axis</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">err</span><span class="p">)</span><span class="o">/</span><span class="n">n</span>
<span class="k">return</span> <span class="n">cost</span>
</code></pre></div></div>
<p>Create some observations from which to fit</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">seed</span><span class="p">(</span><span class="mi">19920908</span><span class="p">)</span>
<span class="c">## Generate Data</span>
<span class="c">#Initial Condition</span>
<span class="n">Y0</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mf">0.99</span><span class="p">,</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">])</span>
<span class="c">#Space to compute solutions</span>
<span class="n">t</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">101</span><span class="p">)</span>
<span class="c">#True param value</span>
<span class="n">theta</span> <span class="o">=</span> <span class="mf">5.5</span>
<span class="n">sol</span> <span class="o">=</span> <span class="n">BlackBox</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">Y0</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">theta</span><span class="p">]))</span>
<span class="c">#Corrupt the observations with noise</span>
<span class="n">y_obs</span> <span class="o">=</span> <span class="n">sol</span><span class="p">[:,:</span><span class="mi">2</span><span class="p">]</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mf">0.05</span><span class="p">,</span><span class="n">size</span> <span class="o">=</span> <span class="n">sol</span><span class="p">[:,:</span><span class="mi">2</span><span class="p">]</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">y_obs</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">marker</span> <span class="o">=</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'S'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">y_obs</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span> <span class="o">=</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'I'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<p>Perform Gradient Descent</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">theta_iter</span> <span class="o">=</span> <span class="mf">1.5</span>
<span class="n">cost</span> <span class="o">=</span> <span class="n">Cost</span><span class="p">(</span><span class="n">y_obs</span><span class="p">[:,:</span><span class="mi">2</span><span class="p">])</span>
<span class="n">grad_C</span> <span class="o">=</span> <span class="n">autograd</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">cost</span><span class="p">)</span>
<span class="n">maxiter</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">learning_rate</span> <span class="o">=</span> <span class="mi">1</span> <span class="c">#Big steps</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">maxiter</span><span class="p">):</span>
<span class="n">sol</span> <span class="o">=</span> <span class="n">BlackBox</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span><span class="n">y0</span> <span class="o">=</span> <span class="n">Y0</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">theta_iter</span><span class="p">]))</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">sol</span><span class="p">[:,:</span><span class="mi">2</span><span class="p">]</span>
<span class="n">theta_iter</span> <span class="o">-=</span><span class="n">learning_rate</span><span class="o">*</span><span class="p">(</span><span class="n">grad_C</span><span class="p">(</span><span class="n">Y</span><span class="p">)</span><span class="o">*</span><span class="n">sol</span><span class="p">[:,</span><span class="o">-</span><span class="mi">2</span><span class="p">:])</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span>
<span class="k">if</span> <span class="n">i</span><span class="o">%</span><span class="mi">10</span><span class="o">==</span><span class="mi">0</span><span class="p">:</span>
<span class="k">print</span><span class="p">(</span><span class="n">theta_iter</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'YYYEAAAAHHH!'</span><span class="p">)</span>
</code></pre></div></div>
<p>And lastly, compare our fitted curves to the true curves</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sol</span> <span class="o">=</span> <span class="n">BlackBox</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">Y0</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">theta_iter</span><span class="p">]))</span>
<span class="n">true_sol</span> <span class="o">=</span> <span class="n">BlackBox</span><span class="p">(</span><span class="n">ODESYS</span><span class="p">,</span> <span class="n">y0</span> <span class="o">=</span> <span class="n">Y0</span><span class="p">,</span> <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">,</span> <span class="n">args</span> <span class="o">=</span> <span class="nb">tuple</span><span class="p">([</span><span class="n">theta</span><span class="p">]))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">sol</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'S'</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'C0'</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">sol</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'I'</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'C1'</span><span class="p">,</span> <span class="n">linewidth</span> <span class="o">=</span> <span class="mi">5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">y_obs</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">marker</span> <span class="o">=</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">y_obs</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">marker</span> <span class="o">=</span> <span class="s">'.'</span><span class="p">,</span> <span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.5</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">true_sol</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">label</span> <span class="o">=</span> <span class="s">'True'</span><span class="p">,</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'k'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">t</span><span class="p">,</span><span class="n">true_sol</span><span class="p">[:,</span><span class="mi">1</span><span class="p">],</span> <span class="n">color</span> <span class="o">=</span> <span class="s">'k'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<p>Here is our result!</p>
<div style="text-align:center"><img src="/images/blog/final.png" /></div>
<h2 id="conclusions">Conclusions</h2>
<p>Fitting ODEs via gradient descent is possible, and not as complicated as I had initially thought. There are still some relaxations to be explored. Namely: what happens if we have observations at time $t_i$ for one part of the system but not the other? How does this scale as we add more parameters to the model? Can we speed up gradient descent somehow (as it stands, it takes too long to converge, hence the <code class="highlighter-rouge">maxiter</code> variable)? In any case, this was an interesting, yet related, diversion from my GSoC work. I hope you learned something.</p>