A Purrrfect Method for Simulating Data
Published:
When I was doing my Masters, I had to generate a lot of plots, which means I had to generate a lot of data. Usually, the data I would be generating would depend on a parameter (maybe something like the rolling window length, or maybe the bandwidth for some smoothing function) and I would have to try a whirlwind of combinations. In order to do this, I would end up doing is writing code to generate the data once, then just loop over that code for different values of the parameters.
It was extremely painful when I had to make a small change to code, because that means I would have to comb through lines of code to find the right place to edit, make the edit, ensure I was still collecting the right information, ensure my indents were in the right place, and finally ensure that the code still ran. It was a bit of a nightmare, and I winced every time my supervisor suggested collecting new information.
By means of example, suppose I wanted to investigate how the shape of the probability density function for the Beta distribution changes as I vary α and β.
Here is how I would have done it before learning about purrr
.
library(dplyr)
library(magrittr)
#Generate the candidate parameterizations
alpha = c(2,4,6)
beta = c(8,10,12)
#Need this for later
x = seq(0,1,0.001)
#Need somewhere to store this
results = data.frame(alpha = NULL,
beta = NULL,
x = NULL,
y = NULL
)
for(A in alpha){
#NO, NOT ANOTHER ONE, NO!
for(B in beta){
#Generate the candidate density
density = dbeta(x,shape1 = A, shape2 = B)
#Append it to the previous results
new_results = data.frame(alpha = A, beta = B, x = x, y = density)
results<- bind_rows(results, new_results)
}
}
The code above is not so bad, but you can imagine how this could get hairy quickly. The code I wrote just didn’t feel efficient, or elegant, or tidy. After I started learning about the R, and in particular after I watched this video, I started to think how my problems could be solved using the tidyverse and purrr
. I really liked the idea of a dataframe inside a dataframe, so I started to think of my simulations like a cupboard in a kitchen: keep similar things together, but in seperate containers. I keep my coffee and sugar on the same shelf, but in different containers, why couldn’t I do that with data from simulations?
Here is what I had in my mind:
Simulation Parameter 1 | Simulation Parameter 2 | Data From Simulation |
---|---|---|
Value 1 | Value 1 | Data |
Value 2 | Value 2 | Data |
And after learning about how purrr
works, here is the implementation
library(tidyverse)
#Candidate parameterizations
a = c(2,4,6)
b = c(8,10,12)
#Put them in the shelf
params = expand.grid(a = a, b = b)
#This function does most of the heavy lifting
#Special thanks to mishabalyasin on Rstudio Community
gen_den = function(a,b,...){
x = seq(0,1,0.001)
den = dbeta(x = x, shape1 = a, shape2 = b)
return(tibble(x = x, y = den))
}
#Voila
params %>%
dplyr::mutate(data = pmap(., gen_den)) %>%
head
First, the code is very neat and easy to read through. Second, I don’t have to worry about which loop I am in, or adding results to a dataframe, or if my indents are right.
While this particular problem is quite contrived, I think it serves to highlight how useful functional programming can be, especially for repetitive taks like simulation, or sensitivity analysis. For example, shown below are solutions to a differential equation for several values of the parameters. This is something I most certainly would have had to do in grad school, and not a single loop in this code.
library(deSolve)
#specify the parameters
params = expand.grid(mu = c(1/1000, 1/100, 1/10), gamma = c(5,25,50), beta = c(250) )
#specify the differential equation
SIR<-function(t, state, parameters) {
with(as.list(c(state, parameters)),{
# rate of change
dS = mu- beta*S*J
dJ = beta*S*J - (gamma+mu)*J
# return the rate of change
list(c(dS,dJ))
}) # end with(as.list ...
}
#Solve the ode with
solve_ode = function(mu, gamma, beta){
#Usually, these are not changed
t = seq(0,50,0.01)
ICs = c(S = 0.99,J = 0.01)
#The actual solving
out <- ode(y = ICs, times = t, func = SIR, parms = c(mu = mu, gamma = gamma, beta = beta)) %>% as.tibble
}
#The good stuff
solutions = params %>%
mutate(data = pmap(.,solve_ode)) %>%
unnest()
#plot
solutions %>%
mutate(mu = paste0('mu=',mu),
gamma = paste0('gamma=',gamma)
) %>%
ggplot(aes(time,S, color = gamma))+
geom_line()+
facet_wrap(~mu, nrow = 2, ncol = 2)+
theme(aspect.ratio = 1)
Special thanks to Msha Balyasin for helping me out on this post on Rstudio Community!