A Purrrfect Method for Simulating Data

4 minute read

Published:

When I was doing my Masters, I had to generate a lot of plots, which means I had to generate a lot of data. Usually, the data I would be generating would depend on a parameter (maybe something like the rolling window length, or maybe the bandwidth for some smoothing function) and I would have to try a whirlwind of combinations. In order to do this, I would end up doing is writing code to generate the data once, then just loop over that code for different values of the parameters.

It was extremely painful when I had to make a small change to code, because that means I would have to comb through lines of code to find the right place to edit, make the edit, ensure I was still collecting the right information, ensure my indents were in the right place, and finally ensure that the code still ran. It was a bit of a nightmare, and I winced every time my supervisor suggested collecting new information.

By means of example, suppose I wanted to investigate how the shape of the probability density function for the Beta distribution changes as I vary α and β.

Here is how I would have done it before learning about purrr.

library(dplyr)
library(magrittr)

#Generate the candidate parameterizations
alpha = c(2,4,6)
beta  =  c(8,10,12)

#Need this for later
x = seq(0,1,0.001)

#Need somewhere to store this
results = data.frame(alpha = NULL,
                     beta = NULL,
                     x = NULL,
                     y = NULL
                     )


for(A in alpha){
  #NO, NOT ANOTHER ONE, NO!
  for(B in beta){

    #Generate the candidate density
    density = dbeta(x,shape1 = A, shape2 = B)

    #Append it to the previous results
    new_results = data.frame(alpha = A, beta = B, x = x, y = density)
    results<- bind_rows(results, new_results)
  }
}

The code above is not so bad, but you can imagine how this could get hairy quickly. The code I wrote just didn’t feel efficient, or elegant, or tidy. After I started learning about the R, and in particular after I watched this video, I started to think how my problems could be solved using the tidyverse and purrr. I really liked the idea of a dataframe inside a dataframe, so I started to think of my simulations like a cupboard in a kitchen: keep similar things together, but in seperate containers. I keep my coffee and sugar on the same shelf, but in different containers, why couldn’t I do that with data from simulations?

Here is what I had in my mind:

Simulation Parameter 1Simulation Parameter 2Data From Simulation
Value 1Value 1Data
Value 2Value 2Data

And after learning about how purrr works, here is the implementation

library(tidyverse)
#Candidate parameterizations
a = c(2,4,6)
b = c(8,10,12)

#Put them in the shelf
params = expand.grid(a = a, b = b)

#This function does most of the heavy lifting
#Special thanks to mishabalyasin on Rstudio Community
gen_den = function(a,b,...){

  x = seq(0,1,0.001)
  den = dbeta(x = x, shape1 = a, shape2 = b)
  return(tibble(x = x, y = den))
}

#Voila
params %>%
    dplyr::mutate(data = pmap(., gen_den)) %>%
    head

First, the code is very neat and easy to read through. Second, I don’t have to worry about which loop I am in, or adding results to a dataframe, or if my indents are right.

While this particular problem is quite contrived, I think it serves to highlight how useful functional programming can be, especially for repetitive taks like simulation, or sensitivity analysis. For example, shown below are solutions to a differential equation for several values of the parameters. This is something I most certainly would have had to do in grad school, and not a single loop in this code.

library(deSolve)
#specify the parameters
params = expand.grid(mu = c(1/1000, 1/100, 1/10), gamma = c(5,25,50), beta = c(250) )

#specify the differential equation
 SIR<-function(t, state, parameters) {
  with(as.list(c(state, parameters)),{
     # rate of change
     dS = mu- beta*S*J
     dJ = beta*S*J - (gamma+mu)*J

     # return the rate of change
     list(c(dS,dJ))
   })   # end with(as.list ...
 }


 #Solve the ode with
solve_ode = function(mu, gamma, beta){

  #Usually, these are not changed
  t = seq(0,50,0.01)
  ICs = c(S = 0.99,J = 0.01)

  #The actual solving
  out <- ode(y = ICs, times = t, func = SIR, parms = c(mu = mu, gamma = gamma, beta = beta)) %>% as.tibble
}


#The good stuff

solutions = params %>%
            mutate(data = pmap(.,solve_ode)) %>%
            unnest()


#plot
solutions %>%
  mutate(mu = paste0('mu=',mu),
         gamma = paste0('gamma=',gamma)
         ) %>%
  ggplot(aes(time,S, color = gamma))+
  geom_line()+
  facet_wrap(~mu, nrow = 2, ncol = 2)+
  theme(aspect.ratio = 1)

Special thanks to Msha Balyasin for helping me out on this post on Rstudio Community!