Question 7.1. Introduction to the greatlakes cluster.

##       user.self sys.self elapsed user.child sys.child
## time0     6.424    0.231   6.716      0.000     0.000
## time1     0.390    0.874   6.468      8.271     1.239
## time2     0.425    0.916   7.678      0.000     0.000
## time3     0.493    0.779   5.715     13.356     1.629
## time4     1.680    0.922   8.053     14.133     2.185
##       user.self sys.self elapsed user.child sys.child
## time0     6.370    0.266   6.652      0.000     0.000
## time1     0.210    0.829   2.003      0.683     0.123
## time2     0.213    0.791   1.384      1.092     0.262
## time3     0.311    0.645   1.400      6.802     1.869
## time4     1.047    0.947   2.235      6.976     2.755
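
These tables have the format of R's system.time() output: user.self and sys.self measure CPU time in the main R process, while user.child and sys.child record CPU time spent in forked child processes, so rows with large child times correspond to computations that were farmed out to parallel workers. A minimal sketch of how a table of this form can be generated is below; the test task and core count are assumptions, not the actual benchmark code.

library(parallel)
cores <- 4                                  # assumed core count
task <- function(i) mean(rnorm(5e6))        # hypothetical test computation
time0 <- system.time(lapply(1:20, task))                    # serial baseline
time1 <- system.time(mclapply(1:20, task, mc.cores=cores))  # forked parallel run
rbind(time0=time0, time1=time1)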

Question 7.2. Likelihood maximization for the SEIR model.

(a). Developing the model and the likelihood maximization procedure

We follow the solutions from Simulation-Based Inference for Epidemiological Dynamics [1], Lesson 4. We start by building a pomp object for the SEIR model from Homework 6, using the SIR code from the notes as a starting point.

library(pomp)
library(tidyverse)
library(doParallel)
library(doRNG)
source("https://kingaa.github.io/sbied/pfilter/model.R")
start_time <- Sys.time()
seir_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));  // new exposures
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));     // progressions to infectious
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));     // recoveries
  S -= dN_SE;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  H += dN_IR;                                    // accumulator used for reported cases
")

seir_rinit <- Csnippet("
  S = nearbyint(eta*N);
  E = 0;
  I = 1;
  R = nearbyint((1-eta)*N);
  H = 0;
")

measSEIR <- pomp(measSIR,
  rprocess=euler(seir_step,delta.t=1/7),
  rinit=seir_rinit,
  paramnames=c("N","Beta","mu_EI","mu_IR","eta","k","rho"),
  partrans=parameter_trans(
        log=c("Beta","mu_EI","mu_IR","k"),
        logit=c("eta","rho")
  ),
  statenames=c("S","E","I","R","H")
)

Now, we start from the best parameters we’ve found so far, for the regime where \(\mu_{IR}=2\mathrm{wk}^{-1}\). We extract these from the database used in the notes.

read_csv("https://kingaa.github.io/sbied/mif/measles_params.csv") %>%
  filter(
    loglik==max(loglik),
    abs(mu_IR-2)<0.001
    ) %>%
  select(-loglik,-loglik.se) -> coef(measSEIR)

coef(measSEIR,"mu_EI") <- 0.8
## Warning: in 'coef<-': name 'mu_EI' refers to no existing parameter; it is being
## concatenated.
fixed_params <- coef(measSEIR,c("N","mu_IR","k"))
coef(measSEIR)
##         Beta        mu_IR          rho            k          eta            N 
## 3.788053e+00 2.000000e+00 5.785164e-02 1.000000e+01 5.884615e-01 3.800000e+04 
##        mu_EI 
## 8.000000e-01

The warning tells us that mu_EI is a new parameter, which, of course, we knew.

To debug the model and provide a sanity check on our parameter guesses, we first explore via simulation. Some simulations die out, but others lead to epidemics.

set.seed(1014406)
measSEIR %>%
  simulate(nsim=20,format="data.frame",include.data=TRUE) %>%
  ggplot(aes(x=week,y=reports,group=.id,color=(.id=="data")))+
  geom_line()+
  guides(color="none")+
  theme_bw()

The next prerequisite is that we can successfully filter:

pf1 <- pfilter(measSEIR,1000)
plot(pf1)

logLik(pf1)
## [1] -249.9245

The minimum effective sample size is 5, which is not a complete disaster, and we should bear in mind that this is likely to improve when we fit the parameters.
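
The effective sample size can be read directly from the pfilter object using pomp's eff_sample_size accessor; for example:

min(eff_sample_size(pf1))   # minimum effective sample size over the time series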

We now carry out a local search, estimating only 4 parameters for simplicity; the search code itself is sketched after the setup steps below. For a thorough scientific analysis, one would also want to consider the evidence in the data concerning the other parameters that are fixed here.

  • The computation time depends on how many profile points we calculate and how many replicated searches we carry out at each point. Here, we use the run-level approach described in Chapter 15 of [2].
  run_level <- 3
  Np <-              switch(run_level,100, 1e3, 2e3)
  Nlocal <-          switch(run_level,  2,   5,  20)
  Nglobal <-         switch(run_level,  2,   5, 100)
  Npoints_profile <- switch(run_level,  4,  10,  50)
  Nreps_profile   <- switch(run_level,  2,   4,  15)
  Nmif <-            switch(run_level, 10,  50, 100)
  Nreps_eval <-      switch(run_level,  2,   5,  10)
  • We set up a directory for saving cached results, and an environment for parallel computation that runs either on a single machine or on a single node of a Slurm cluster.
library(doParallel)
cores <- as.numeric(Sys.getenv('SLURM_NTASKS_PER_NODE',unset=NA))  
if(is.na(cores)) cores <- detectCores()  
registerDoParallel(cores)
results_dir <- paste0("laptop_",run_level,"/")
#results_dir <- paste0("greatlakes_",run_level,"/")
if(!dir.exists(results_dir)) dir.create(results_dir)
bake(file=paste0(results_dir,"cores.rds"),cores) -> cores
  • We set up a parallel random number generator and set a seed for reproducibility.
library(doRNG)
registerDoRNG(482947940)
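
The local search itself, whose results are used below as mifs_local, follows [1] Lesson 4. A sketch is given here; the random-walk standard deviations, cooling fraction, and cache filename are assumptions rather than a record of the exact settings used.

bake(file=paste0(results_dir,"local_search.rds"),{
  foreach(i=1:Nlocal,.combine=c) %dopar% {
    library(pomp)
    library(tidyverse)
    measSEIR %>%
      mif2(
        Np=Np, Nmif=Nmif,
        cooling.fraction.50=0.5,
        rw.sd=rw.sd(Beta=0.02,rho=0.02,eta=ivp(0.02),mu_EI=0.02)
      )
  }
}) -> mifs_local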

This consistently obtains log likelihoods around -104, similar to those found with the SIR model:

sapply(mifs_local,logLik)
##  [1] -105.1628 -104.9758 -105.3915 -105.3748 -105.2248 -105.1515 -105.0409
##  [8] -104.8968 -105.0645 -104.9695 -105.0639 -105.3850 -105.1000 -105.0275
## [15] -105.0221 -105.2937 -104.8534 -105.0570 -105.0831 -104.9307

As usual, we should evaluate the likelihoods using a particle filter, rather than relying on the likelihood from the last filtering iteration of the perturbed model used by mif2.

registerDoRNG(900242057)
foreach(mf=mifs_local,.combine=rbind) %dopar% {
  library(pomp)
  library(tidyverse)
  evals <- replicate(Nreps_eval, logLik(pfilter(mf,Np=Np)))
  ll <- logmeanexp(evals,se=TRUE)
  mf %>% coef() %>% bind_rows() %>%
    bind_cols(loglik=ll[1],loglik.se=ll[2])
} -> local_logliks

In this case, there is not much discrepancy between the perturbed and unperturbed likelihoods. The small improvement (rather than a deterioration) when filtering with fixed parameters supports the hypothesis that the constant-parameter model is reasonable here.

local_logliks$loglik
## -104.3372 -104.5065 -104.2713 -104.4094 -104.4142 -104.3839 -104.3983 -104.3503 
## -104.4736 -104.4238 -104.3638 -104.3319 -104.3234 -104.4009 -104.4933 -104.2462 
## -104.3965 -104.3881 -104.5120 -104.2379
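
For a direct side-by-side look at the perturbed and unperturbed likelihoods, one could tabulate them together; a small sketch, assuming mifs_local and local_logliks are as constructed above:

data.frame(
  mif2=sapply(mifs_local,logLik),   # likelihood from the last mif2 filtering iteration
  pfilter=local_logliks$loglik      # pfilter re-evaluation at the fitted parameters
)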

We now carry out a global search from multiple random starting points, drawn uniformly from a large box of parameter values:

set.seed(2062379496)

runif_design(
  lower=c(Beta=5,rho=0.2,eta=0,mu_EI=1/3),
  upper=c(Beta=80,rho=0.9,eta=1,mu_EI=3),
  nseq=Nglobal
) -> guesses
mf1 <- mifs_local[[1]]
bake(file=paste0(results_dir,"global_search.rds"),{
registerDoRNG(1270401374)
foreach(guess=iter(guesses,"row"), .combine=rbind) %dopar% {
  library(pomp)
  library(tidyverse)
  mf1 %>%
    mif2(params=c(unlist(guess),fixed_params),Np=Np) %>%
    mif2() -> mf
  replicate(
    Nreps_eval,
    mf %>% pfilter(Np=Np) %>% logLik()
  ) %>%
    logmeanexp(se=TRUE) -> ll
  mf %>% coef() %>% bind_rows() %>%
    bind_cols(loglik=ll[1],loglik.se=ll[2])
} -> results
}) %>%
  filter(is.finite(loglik)) -> results

(b). Comparison of SEIR and SIR likelihoods

The maximum log likelihood discovered is -103.6. This small improvement of around one log unit over the SIR model is not, by itself, compelling evidence of the need for an extra parameter.
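
The quoted maximum can be read off the global-search results constructed above; for instance:

results %>%
  filter(loglik==max(loglik)) %>%
  select(loglik,loglik.se)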

(c). Interpretation of the SEIR fitted model

The fitted model has some interesting features, which can be seen from a scatterplot matrix.

pairs(~loglik+Beta+eta+rho+mu_EI,
      data=filter(results,loglik>max(loglik)-10))

When a latent period is included, the MLE has intermediate values of \(\rho\) and \(\eta\) that match epidemiological expectations for endemic measles in the pre-vaccine era, while remaining consistent with a mean infectious period of 0.5 wk. This is substantially different from the results in Section 5 of Lesson 4 [1]. Thus, adding a latent period to the model can substantially change the interpretation of the fitted model without substantially changing the overall fit as measured by the maximized likelihood. The profile likelihood calculations below help to clarify this finding.

The likelihood surface here is fairly flat: the y-axis range is just 2 log units. Data on a single epidemic cannot readily distinguish whether the disease has a high susceptible fraction and low reporting rate, or low susceptible fraction and high reporting rate. Longer time series could resolve this question.
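
One way to visualize this trade-off is to plot the high-likelihood pairs of \(\eta\) and \(\rho\) from the global search; a small sketch, with the 5-log-unit cutoff chosen arbitrarily:

results %>%
  filter(loglik>max(loglik)-5) %>%
  ggplot(aes(x=eta,y=rho))+
  geom_point()+
  labs(x="susceptible fraction, eta",y="reporting rate, rho")+
  theme_bw()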

(d). Calculating and plotting a profile likelihood for reporting rate.

Here, we follow the profile code in Chapter 14 of the course notes [2], which draws on [1].

  • Recall that profiling means determining, for each value of \(\rho\), the best likelihood that the model can achieve.

  • To do this, we’ll first bound the uncertainty by putting a box around the highest-likelihood estimates we’ve found so far.

  • Within this box, we’ll choose some random starting points, for each of several values of \(\rho\).

  filter(results,loglik>max(loglik)-20) %>%
    sapply(range) -> box
  box
##            Beta        rho        eta      mu_EI     N mu_IR  k    loglik
## [1,]   1.734629 0.02936271 0.02233601  0.7094618 38000     2 10 -119.8806
## [2,] 168.559655 0.94014333 0.99671126 48.3440159 38000     2 10 -103.5724
##       loglik.se
## [1,] 0.02837331
## [2,] 0.49851575
  freeze(seed=1196696958,
    profile_design(
      rho =seq(0.01,0.95,length=Npoints_profile),
      lower=box[1,c("Beta","eta","mu_EI")],
      upper=box[2,c("Beta","eta","mu_EI")],
      nprof=Nreps_profile, type="runif"
    )) -> guesses
  plot(guesses)

  fixed_params <- c(N=38000, mu_IR=2, k=10)
bake(file=paste0(results_dir,"rho_profile.rds"),dependson=guesses,{
  registerDoRNG(2105684752)
  foreach(guess=iter(guesses,"row"), .combine=rbind) %dopar% {
    library(pomp)
    library(tidyverse)
    mf1 %>% mif2(params=c(unlist(guess),fixed_params),Nmif=Nmif,
      rw.sd=rw.sd(Beta=0.02,eta=ivp(0.02),mu_EI=0.02)) %>%
      mif2(Nmif=Nmif,Np=Np,cooling.fraction.50=0.3) -> mf
    replicate(
      Nreps_eval,
      mf %>% pfilter(Np=Np) %>% logLik()) %>%
      logmeanexp(se=TRUE) -> ll
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> prof_results
  attr(prof_results,"ncpu") <- getDoParWorkers()
  prof_results
}) -> profile_results

t_rho <- attr(profile_results,"system.time")
ncpu_rho <- attr(profile_results,"ncpu")

The profile took 1000 min to run on 4 cores.
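
The timing quoted here can be recovered from the attributes cached by bake and extracted above; a small sketch of the reporting step, with the rounding an arbitrary choice:

round(as.numeric(t_rho["elapsed"])/60)   # elapsed time in minutes
ncpu_rho                                 # number of workers recorded above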

  profile_results %>%
    filter(is.finite(loglik)) -> profile_results

  pairs(~loglik+Beta+eta+rho+mu_EI,data=profile_results,pch=16)

  profile_results %>%
    filter(loglik>max(loglik)-10) %>%
    group_by(round(rho,2)) %>%
    filter(rank(-loglik)<3) %>%
    ungroup() %>%
    ggplot(aes(x=rho,y=loglik))+
    geom_point()+
    geom_hline(
      color="red",
      yintercept=max(results$loglik)-0.5*qchisq(df=1,p=0.95)
    )

  profile_results %>%
    filter(loglik>max(loglik)-0.5*qchisq(df=1,p=0.95)) %>%
    summarize(min=min(rho),max=max(rho)) -> rho_ci

According to this model, the data are consistent with reporting efficiencies in the range 2.9 to 87 percent (95% confidence interval).
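
Expressed as percentages, the interval can be read from rho_ci computed above; for example:

signif(100*unlist(rho_ci),2)   # lower and upper ends of the 95% CI for rho, in percent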

We can compare to the profile over \(\rho\) on [2] Chapter 14, slide 75, which obtains an approximate confidence interval of 3–17% for \(\rho\) when there is no latent period.

Including the latent period therefore allows us to fit the data with a model matching our expectation that the infectious period should be around 3.5 days and the reporting rate around 60%.

(e). Run time.

end_time <- Sys.time()
bake(file=paste0(results_dir,"run_time.rds"),difftime(end_time,start_time,units="secs")) -> run_time
  • On a MacBook Air laptop, the SEIR computations took 6.3 min at run level 2, with 4 cores.

  • At run level 3, the laptop took 18 hr. This included a delay when the machine powered down overnight! The actual calculation time was about 5 hr.

  • For running long jobs on a laptop, it can be useful to use nohup and & to run the job in the background, regardless of the fate of the terminal where it was started.

nohup Rscript --vanilla -e "rmarkdown::render(\"sol07.Rmd\")" &
  • However, you also have to prevent your laptop from powering down. Even then, it is not very practical. Time to use a cluster.

  • On greatlakes, the computation time was 57 sec at run level 2, and 22 min at run level 3, with 36 cores.

  • The sbat files used to run the code on greatlakes are run1.sbat, run2.sbat, and run3.sbat in the source directory. They compile the document using knitr::knit() rather than rmarkdown::render(), since the latter requires an additional program (pandoc) to convert the results to HTML. knit() runs the code and saves the results of the computations, which can then be copied back to your laptop for compilation to HTML.
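
Inside those batch files, the R invocation presumably takes a form similar to the nohup command shown earlier, for example:

Rscript --vanilla -e "knitr::knit(\"sol07.Rmd\")"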


References

1. King, A.A., Ionides, E.L., and Lin, Q. (2021). Simulation-based inference for epidemiological dynamics. Summer Institute in Statistics and Modeling in Infectious Diseases. Available at: https://kingaa.github.io/sbied/.

2. Ionides, E. (2022). Notes for STATS/DATASCI 531, Modeling and Analysis of Time Series Data. Available at: https://ionides.github.io/531w22/.