Some Windows users had difficulties with PuTTY. Another way to access a terminal on greatlakes is to connect to https://greatlakes.arc-ts.umich.edu/ and follow the menu option to Clusters
\(\rightarrow\) Great Lakes login node
.
On my MacBook Air laptop, times from running Rscript test.R
, or equivalently source(test.R)
within an R session, are
## user.self sys.self elapsed user.child sys.child
## time0 6.424 0.231 6.716 0.000 0.000
## time1 0.390 0.874 6.468 8.271 1.239
## time2 0.425 0.916 7.678 0.000 0.000
## time3 0.493 0.779 5.715 13.356 1.629
## time4 1.680 0.922 8.053 14.133 2.185
sbatch test.sbat
, you get## user.self sys.self elapsed user.child sys.child
## time0 6.370 0.266 6.652 0.000 0.000
## time1 0.210 0.829 2.003 0.683 0.123
## time2 0.213 0.791 1.384 1.092 0.262
## time3 0.311 0.645 1.400 6.802 1.869
## time4 1.047 0.947 2.235 6.976 2.755
We follow the solutions from Simulation-Based Inference for Epidemiological Dynamics [1] Lesson 4 . We start by building a pomp object for the SEIR model from Homework 6, using as a starting point the SIR code from the notes.
library(pomp)
library(tidyverse)
library(doParallel)
library(doRNG)
source("https://kingaa.github.io/sbied/pfilter/model.R")
start_time <- Sys.time()
seir_step <- Csnippet("
double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
S -= dN_SE;
E += dN_SE - dN_EI;
I += dN_EI - dN_IR;
R += dN_IR;
H += dN_IR;
")
seir_rinit <- Csnippet("
S = nearbyint(eta*N);
E = 0;
I = 1;
R = nearbyint((1-eta)*N);
H = 0;
")
measSEIR <- pomp(measSIR,
rprocess=euler(seir_step,delta.t=1/7),
rinit=seir_rinit,
paramnames=c("N","Beta","mu_EI","mu_IR","eta","k","rho"),
partrans=parameter_trans(
log=c("Beta","mu_EI","mu_IR","k"),
logit=c("eta","rho")
),
statenames=c("S","E","I","R","H")
)
Now, we’ll start by considering the best parameters we’ve found so far, for the regime where \(\mu_{IR}=2\mathrm{wk}^{-1}\). We extract these from the database used for the notes.
read_csv("https://kingaa.github.io/sbied/mif/measles_params.csv") %>%
filter(
loglik==max(loglik),
abs(mu_IR-2)<0.001
) %>%
select(-loglik,-loglik.se) -> coef(measSEIR)
coef(measSEIR,"mu_EI") <- 0.8
## Warning: in 'coef<-': name 'mu_EI' refers to no existing parameter; it is being
## concatenated.
fixed_params <- coef(measSEIR,c("N","mu_IR","k"))
coef(measSEIR)
## Beta mu_IR rho k eta N
## 3.788053e+00 2.000000e+00 5.785164e-02 1.000000e+01 5.884615e-01 3.800000e+04
## mu_EI
## 8.000000e-01
The warning tells us that mu_EI
is a new parameter, which of course, we knew.
To debug the model and provide a sanity check on our parameter guesses, we first explore via simulation. Some simulations die out, but others lead to epidemics.
set.seed(1014406)
measSEIR %>%
simulate(nsim=20,format="data.frame",include=TRUE) %>%
ggplot(aes(x=week,y=reports,group=.id,color=(.id=="data")))+
geom_line()+
guides(color="none")+
theme_bw()
The next prerequisite is that we can successfully filter:
pf1 <- pfilter(measSEIR,1000)
plot(pf1)
logLik(pf1)
## [1] -249.9245
The minimum effective sample size is 5, which is not a complete disaster, and we should bear in mind that this is likely to improve when we fit the parameters.
We now carry out a local search, estimating only 4 parameters for simplicity. For a thorough scientific analysis, one would also want to consider the evidence in the data concerning the other parameters that are fixed here.
run_level <- 3
Np <- switch(run_level,100, 1e3, 2e3)
Nlocal <- switch(run_level, 2, 5, 20)
Nglobal <- switch(run_level, 2, 5, 100)
Npoints_profile <- switch(run_level, 4, 10, 50)
Nreps_profile <- switch(run_level, 2, 4, 15)
Nmif <- switch(run_level, 10, 50, 100)
Nreps_eval <- switch(run_level, 2, 5, 10)
library(doParallel)
cores <- as.numeric(Sys.getenv('SLURM_NTASKS_PER_NODE',unset=NA))
if(is.na(cores)) cores <- detectCores()
registerDoParallel(cores)
results_dir <- paste0("laptop_",run_level,"/")
#results_dir <- paste0("greatlakes_",run_level,"/")
if(!dir.exists(results_dir)) dir.create(results_dir)
bake(file=paste0(results_dir,"cores.rds"),cores) -> cores
library(doRNG)
registerDoRNG(482947940)
This consistently obtains log likelihoods around -104, similar to those found with the SIR model:
sapply(mifs_local,logLik)
## [1] -105.1628 -104.9758 -105.3915 -105.3748 -105.2248 -105.1515 -105.0409
## [8] -104.8968 -105.0645 -104.9695 -105.0639 -105.3850 -105.1000 -105.0275
## [15] -105.0221 -105.2937 -104.8534 -105.0570 -105.0831 -104.9307
As usual, we should evaluate the likelihoods using a particle filter, rather than relying on the likelihood from the last filtering iteration of the perturbed model used by mif2
.
registerDoRNG(900242057)
foreach(mf=mifs_local,.combine=rbind) %dopar% {
library(pomp)
library(tidyverse)
evals <- replicate(Nreps_eval, logLik(pfilter(mf,Np=Np)))
ll <- logmeanexp(evals,se=TRUE)
mf %>% coef() %>% bind_rows() %>%
bind_cols(loglik=ll[1],loglik.se=ll[2])
} -> local_logliks
In this case, there is not much discrepancy between the perturbed and unperturbed likelihoods. The small improvement (rather than disadvantage) from filtering with fixed parameters supports a hypothesis that the constant parameter model is reasonable here.
local_logliks$loglik
##
## -104.3372 -104.5065 -104.2713 -104.4094 -104.4142 -104.3839 -104.3983 -104.3503
##
## -104.4736 -104.4238 -104.3638 -104.3319 -104.3234 -104.4009 -104.4933 -104.2462
##
## -104.3965 -104.3881 -104.5120 -104.2379
set.seed(2062379496)
runif_design(
lower=c(Beta=5,rho=0.2,eta=0,mu_EI=1/3),
upper=c(Beta=80,rho=0.9,eta=1,mu_EI=3),
nseq=Nglobal
) -> guesses
mf1 <- mifs_local[[1]]
bake(file=paste0(results_dir,"global_search.rds"),{
registerDoRNG(1270401374)
foreach(guess=iter(guesses,"row"), .combine=rbind) %dopar% {
library(pomp)
library(tidyverse)
mf1 %>%
mif2(params=c(unlist(guess),fixed_params),Np=Np) %>%
mif2() -> mf
replicate(
Nreps_eval,
mf %>% pfilter(Np=Np) %>% logLik()
) %>%
logmeanexp(se=TRUE) -> ll
mf %>% coef() %>% bind_rows() %>%
bind_cols(loglik=ll[1],loglik.se=ll[2])
} -> results
}) %>%
filter(is.finite(loglik)) -> results
The maximum log likelihood discovered is -103.6. This small improvement of around one log unit is not compelling evidence by itself for the need of an extra parameter.
The fitted model has some interesting features, which can be seen from a scatter plot.
pairs(~loglik+Beta+eta+rho+mu_EI,
data=filter(results,loglik>max(loglik)-10))
When including an latent period, the MLE has intermediate values of \(\rho\) and \(\eta\) that match epidemiological expectations for endemic measles in the pre-vaccine era, while remaining consistent with a mean infectious period of 0.5 wk. This is substantially different from the results in Section 5 of Lesson 4. Thus, adding a latent period to the model can substantially change the interpretation of the fitted model without substantially changing the overall fit measured by maximized likelihood. The profile likelihood calculations below help to clarify this finding.
The likelihood surface here is fairly flat: the y-axis range is just 2 log units. Data on a single epidemic cannot readily distinguish whether the disease has a high susceptible fraction and low reporting rate, or low susceptible fraction and high reporting rate. Longer time series could resolve this question.
Here, we follow the profile code in Chapter 14 of the course notes [2] which draws on [1].
Recall that profiling means determining, for each value of \(\rho\), the best likelihood that the model can achieve.
To do this, we’ll first bound the uncertainty by putting a box around the highest-likelihood estimates we’ve found so far.
Within this box, we’ll choose some random starting points, for each of several values of \(\rho\).
filter(results,loglik>max(loglik)-20) %>%
sapply(range) -> box
box
## Beta rho eta mu_EI N mu_IR k loglik
## [1,] 1.734629 0.02936271 0.02233601 0.7094618 38000 2 10 -119.8806
## [2,] 168.559655 0.94014333 0.99671126 48.3440159 38000 2 10 -103.5724
## loglik.se
## [1,] 0.02837331
## [2,] 0.49851575
freeze(seed=1196696958,
profile_design(
rho =seq(0.01,0.95,length=Npoints_profile),
lower=box[1,c("Beta","eta","mu_EI")],
upper=box[2,c("Beta","eta","mu_EI")],
nprof=Nreps_profile, type="runif"
)) -> guesses
plot(guesses)
fixed_params <- c(N=38000, mu_IR=2, k=10)
bake(file=paste0(results_dir,"rho_profile.rds"),dependson=guesses,{
registerDoRNG(2105684752)
foreach(guess=iter(guesses,"row"), .combine=rbind) %dopar% {
library(pomp)
library(tidyverse)
mf1 %>% mif2(params=c(guess,fixed_params),Nmif=Nmif,
rw.sd=rw.sd(Beta=0.02,eta=ivp(0.02),mu_EI=0.02)) %>%
mif2(Nmif=Nmif,Np=Np,cooling.fraction.50=0.3) -> mf
replicate(
Nreps_eval,
mf %>% pfilter(Np=Np) %>% logLik()) %>%
logmeanexp(se=TRUE) -> ll
mf %>% coef() %>% bind_rows() %>%
bind_cols(loglik=ll[1],loglik.se=ll[2])
} -> prof_results
attr(prof_results,"ncpu") <- getDoParWorkers()
prof_results
}) -> profile_results
t_rho <- attr(profile_results,"system.time")
ncpu_rho <- attr(profile_results,"ncpu")
The profile took 1000 min to run on 4 cores.
profile_results %>%
filter(is.finite(loglik)) -> profile_results
pairs(~loglik+Beta+eta+rho+mu_EI,data=profile_results,pch=16)
profile_results %>%
filter(loglik>max(loglik)-10) %>%
group_by(round(rho,2)) %>%
filter(rank(-loglik)<3) %>%
ungroup() %>%
ggplot(aes(x=rho,y=loglik))+
geom_point()+
geom_hline(
color="red",
yintercept=max(results$loglik)-0.5*qchisq(df=1,p=0.95)
)
profile_results %>%
filter(loglik>max(loglik)-0.5*qchisq(df=1,p=0.95)) %>%
summarize(min=min(rho),max=max(rho)) -> rho_ci
According to this model, the data are consistent with reporting efficiencies in the 2.9 – 87 percent range (95% CI)`.
We can compare to the profile over \(\rho\) on [2] Chapter 14, slide 75, which obtains an approximate confidence interval of 3–17% for \(\rho\) when there is no latent period.
Including the latent period therefore allows us to fit the data with a model matching our anticipation that infectious period should be around 3.5 days and reporting rate around 60%.
end_time <- Sys.time()
bake(file=paste0(results_dir,"run_time.rds"),difftime(end_time,start_time,units="secs")) -> run_time
On a MacBook Air laptop, the SEIR computations took 6.3 min at run level 2, with 4 cores.
At run level 3, the laptop took 18 hr. This included a delay when the machine powered down over night! The actual calculation time was about 5 hr.
For running long jobs on a laptop, it can be useful to use nohup
and &
to run the job in the background, regardless of the fate of the terminal where it was started.
nohup Rscript --vanilla -e "rmarkdown::render(\"sol07.Rmd\")" &
However, you also have to stop your laptop powering down. Even then, it is not very practical. Time to use a cluster.
On greatlakes, the computation time was 57 sec at run level 2, and 22 min at run level 3, with 36 cores.
The sbat files used to run the code on greatlakes are run1.sbat
, run2.sbat
, run3.sbat
in the source directory. They compile the document using knitr::knit()
rather than rmarkdown::render()`` since the latter needs an additional program (
pandoc) to convert the results to HTML.
knit` will run the code and save the results of the computations, which can then be copied back to your laptop for compilation to HTML.
1. King, A.A., Ionides, E.L., and Lin, Q. (2021). Simulation-based inference for epidemiological dynamics. Summer Institute in Statistics and Modeling in Infectious Diseases. Available at: https://kingaa.github.io/sbied/.
2. Ionides, E. (2022). Notes for STATS/DATASCI 531, Modeling and Analysis of Time Series Data. Available at: https://ionides.github.io/531w22/.