Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.


Produced with R version 3.2.3 and pomp version 1.3.1.2.


Objectives

  1. To explain the simplest particle filter, allowing Monte Carlo solution of the POMP filtering and prediction recursions and therefore providing computation of the likelihood.

  2. To provide examples of visualizing and exploring likelihood surfaces using the particle filter for computation of the likelihood.

  3. To consider how to apply the fundamental tools of likelihood-based inference in situations where the likelihood cannot be written down explicitly but can be evaluated and maximized via Monte Carlo methods.

  4. To gain some experience at carrying out likelihood-based inferences for dynamic models using simulation-based statistical methodology in the R package pomp.

  5. To understand how iterated filtering algorithms carry out repeated particle filtering operations, with randomly perturbed parameter values, in order to maximize the likelihood.

  6. To gain experience carrying out statistical investigations using iterated filtering in a relatively simple situation (fitting an SIR model to a boarding school flu outbreak).

13.1 Theory of the particle filter

13.1.1 Indirect specification of the statistical model via a simulation procedure

  • For simple statistical models, we may describe the model by explicitly writing the density function \(f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\). One may then ask how to simulate a random variable \(Y_{1:N}\sim f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\).

  • For many dynamic models it is convenient to define the model via a procedure to simulate the random variable \(Y_{1:N}\). This implicitly defines the corresponding density \(f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\). For a complicated simulation procedure, it may be difficult or impossible to write down \(f_{Y_{1:N}}(y_{1:N}{\, ; \,}\theta)\) exactly.

  • It is important for us to bear in mind that the likelihood function exists even when we don’t know what it is! We can still talk about the likelihood function, and develop numerical methods that take advantage of its statistical properties.




13.1.2 Special case: a deterministic unobserved state process

  • Let’s begin with a special case where the unobserved state process is deterministic. That is, \(X_{n}=x_n(\theta)\) is a known function of \(\theta\) for each \(n\). What is the likelihood?

  • Since the distribution of the observable random variable, \(Y_n\), depends only on \(X_n\) and \(\theta\), and since, in particular, \(Y_{m}\) and \(Y_{n}\) are independent given \(X_{m}\) and \(X_{n}\), we have \[{\mathscr{L}}(\theta) = \prod_{n} f_{Y_n|X_n}(y_n^*{{\, | \,}}x_n(\theta){\, ; \,}\theta)\] or \[{\ell}(\theta) = \log{\mathscr{L}}(\theta) = \sum_{n} \log f_{Y_n|X_n}(y_n^*{{\, | \,}}x_n(\theta){\, ; \,}\theta).\]
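
For example, here is a minimal sketch for a toy model (not the boarding school flu model considered below) in which the state decays deterministically and the measurements are Poisson, so the log likelihood is just a sum of Poisson log densities along the deterministic trajectory:

## toy illustration only: deterministic state x_n(theta) = X_0*exp(-theta*n),
## Poisson measurements
loglik_det <- function(theta,y,X_0=100) {
  n <- seq_along(y)
  x <- X_0*exp(-theta*n)               # deterministic trajectory x_n(theta)
  sum(dpois(y,lambda=x,log=TRUE))      # sum_n log f_{Y_n|X_n}(y_n* | x_n(theta))
}
y_star <- rpois(20,lambda=100*exp(-0.1*(1:20)))  # simulated toy data
loglik_det(0.1,y_star)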




13.2 Likelihood for stochastic models by direct simulation

\[\begin{eqnarray} {\mathscr{L}}(\theta) &=& f_{Y_{1:N}}({y_{1:N}^*}{\, ; \,}\theta) \\ &=& \! \int_{x_{0:N}} \!\! f_{X_0}(x_0{\, ; \,}\theta)\prod_{n=1}^{N}\!f_{Y_n|X_n}({y_n^*}{{\, | \,}}x_n{\, ; \,}\theta)\, f_{X_n|X_{n-1}}(x_n|x_{n-1}{\, ; \,}\theta)\, dx_{0:N}. \end{eqnarray}\]

require(pomp)
bsflu <- read.table("bsflu_data.txt")

sir_step <- Csnippet("
  double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I,1-exp(-gamma*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

sir_init <- Csnippet("
  S = nearbyint(N)-1;
  I = 1;
  R = 0;
  H = 0;
")

dmeas <- Csnippet("lik = dbinom(B,H,rho,give_log);")
rmeas <- Csnippet("B = rbinom(H,rho);")

pomp(bsflu,times="day",t0=0,
     rprocess=euler.sim(sir_step,delta.t=1/5),
     initializer=sir_init,rmeasure=rmeas,dmeasure=dmeas,
     zeronames="H",statenames=c("H","S","I","R"),
     paramnames=c("Beta","gamma","rho","N")) -> sir
simulate(sir,params=c(Beta=2,gamma=1,rho=0.5,N=2600),
         nsim=10000,states=TRUE) -> x
matplot(time(sir),t(x["H",1:50,]),type='l',lty=1,
        xlab="time",ylab="H",bty='l',col='blue')
lines(time(sir),obs(sir,"B"),lwd=2,col='black')

ell <- dmeasure(sir,y=obs(sir),x=x,times=time(sir),log=TRUE,
                params=c(Beta=2,gamma=1,rho=0.5,N=2600))
dim(ell)
## [1] 10000    14
ell <- apply(ell,1,sum); summary(exp(ell)); logmeanexp(ell,se=TRUE)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##  0.00e+00  0.00e+00  0.00e+00 5.37e-106  0.00e+00 5.37e-102
##                    se 
## -242.39328   16.09139

13.3 The particle filter

\[\begin{eqnarray} {\mathscr{L}}(\theta)&=&f_{Y_{1:N}}(y^*_{1:N}{\, ; \,}\theta) \\ &=& \prod_{n}\,f_{Y_n|Y_{1:n-1}}(y^*_n|y^*_{1:n-1}{\, ; \,}\theta) \\ &=& \prod_{n}\,\int f_{Y_n|X_n}(y^*_n|x_n{\, ; \,}\theta)\,f_{X_n|Y_{1:n-1}}(x_n|y^*_{1:n-1}{\, ; \,}\theta)\, dx_{n}. \end{eqnarray}\]

\[\begin{eqnarray} &&f_{X_n|Y_{1:n-1}}(x_n|y^*_{1:n-1}{\, ; \,}\theta) \\ &&\quad = \int_{x_{n-1}} \! f_{X_n|X_{n-1}}(x_n|x_{n-1}{\, ; \,}\theta)\, f_{X_{n-1}|Y_{1:n-1}}(x_{n-1}{{\, | \,}}y^*_{1:n-1}{\, ; \,}\theta) \, dx_{n-1}. \end{eqnarray}\]

\[\begin{eqnarray} &&f_{X_n|Y_{1:n}}(x_n|y^*_{1:n}{\, ; \,}\theta) \\ &&\quad = f_{X_n|Y_n,Y_{1:n-1}}(x_n|y^*_n,y^*_{1:n-1}{\, ; \,}\theta) \\ &&\quad =\frac{f_{Y_n|X_n}(y^*_{n}|x_{n}{\, ; \,}\theta)\,f_{X_n|Y_{1:n-1}}(x_{n}|y^*_{1:n-1}{\, ; \,}\theta)}{\int f_{Y_n|X_n}(y^*_{n}|u_{n}{\, ; \,}\theta)\,f_{X_n|Y_{1:n-1}}(u_{n}|y^*_{1:n-1}{\, ; \,}\theta)\, du_n}. \end{eqnarray}\]

  1. Suppose \(X_{n-1,j}^{F}\), \(j=1,\dots,J\) is a set of \(J\) points drawn from the filtering distribution at time \(n-1\).

  2. We obtain a sample \(X_{n,j}^{P}\) of points drawn from the prediction distribution at time \(t_n\) by simply simulating the process model: \[X_{n,j}^{P} \sim \mathrm{process}(X_{n-1,j}^{F},\theta), \qquad j=1,\dots,J.\]

  3. Having obtained \(X_{n,j}^{P}\), we obtain a sample of points from the filtering distribution at time \(t_n\) by resampling from \(\big\{X_{n,j}^{P},j\in 1:J\big\}\) with weights \[w_{n,j}=f_{Y_n|X_n}(y^*_{n}|X^P_{n,j}{\, ; \,}\theta).\]

  4. The Monte Carlo principle tells us that the conditional likelihood \[\begin{eqnarray} {\mathscr{L}}_n(\theta) &=& f_{Y_n|Y_{1:n-1}}(y^*_n|y^*_{1:n-1}{\, ; \,}\theta) \\ &=& \int f_{Y_n|X_n}(y^*_{n}|x_{n}{\, ; \,}\theta)\,f_{X_n|Y_{1:n-1}}(x_{n}|y^*_{1:n-1}{\, ; \,}\theta)\, dx_n \end{eqnarray} \] is approximated by \[\hat{\mathscr{L}}_n(\theta) \approx \frac{1}{J}\,\sum_j\, f_{Y_n|X_n}(y^*_{n}|X_{n,j}^{P}{\, ; \,}\theta).\]

  5. We can iterate this procedure through the data, one step at a time, alternately simulating and resampling, until we reach \(n=N\). A minimal R sketch of this recursion is given after this list.

  6. The full log likelihood then has approximation \[\begin{eqnarray}{\ell}(\theta) &=& \log{{\mathscr{L}}(\theta)} \\ &=& \sum_n \log{{\mathscr{L}}_n(\theta)} \\ &\approx& \sum_n\log\hat{\mathscr{L}}_n(\theta). \end{eqnarray} \]
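
To make the algorithm concrete, here is a toy implementation in plain R for a hypothetical one-dimensional Gaussian random-walk state observed with Gaussian error (not the bsflu model); the pfilter function in pomp, used in the next section, implements the same logic efficiently:

## toy particle filter for a hypothetical random-walk model (illustration only)
set.seed(1)
y_star <- cumsum(rnorm(50)) + rnorm(50,sd=0.5)  # simulated data from the toy model
J <- 1000                                       # number of particles
xF <- rnorm(J)                                  # draws from the initial distribution
loglik <- 0
for (n in seq_along(y_star)) {
  xP <- xF + rnorm(J)                           # step 2: simulate the process model
  w <- dnorm(y_star[n],mean=xP,sd=0.5)          # step 3: measurement weights
  loglik <- loglik + log(mean(w))               # step 4: conditional log likelihood
  xF <- sample(xP,size=J,replace=TRUE,prob=w)   # resample to get the filtering sample
}
loglik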

13.4 Sequential Monte Carlo in pomp

sims <- simulate(sir,params=c(Beta=2,gamma=1,rho=0.8,N=2600),nsim=20,
                 as.data.frame=TRUE,include.data=TRUE)

require(ggplot2)
ggplot(sims,mapping=aes(x=time,y=B,group=sim,color=sim=="data"))+
  geom_line()+guides(color=FALSE)

pf <- pfilter(sir,Np=5000,params=c(Beta=2,gamma=1,rho=0.8,N=2600))
logLik(pf)
## [1] -87.82817
pf <- replicate(10,pfilter(sir,Np=5000,params=c(Beta=2,gamma=1,rho=0.8,N=2600)))
ll <- sapply(pf,logLik); ll
##  [1] -84.65377 -82.83597 -78.01854 -73.30744 -81.52721 -86.99616 -80.03618
##  [8] -75.22000 -81.15559 -84.17097
logmeanexp(ll,se=TRUE)
##                    se 
## -75.462762   1.778989




13.5 The graph of the likelihood function: The likelihood surface

sliceDesign(
  c(Beta=2,gamma=1,rho=0.8,N=2600),
  Beta=rep(seq(from=0.5,to=4,length=40),each=3),
  gamma=rep(seq(from=0.5,to=2,length=40),each=3)) -> p

require(foreach)
require(doMC)

registerDoMC(cores=5)   # set to the number of cores on your machine, or slightly smaller

set.seed(998468235L,kind="L'Ecuyer")
mcopts <- list(preschedule=FALSE,set.seed=TRUE)

foreach (theta=iter(p,"row"),.combine=rbind,
         .inorder=FALSE,.options.multicore=mcopts) %dopar% 
 {
   pfilter(sir,params=unlist(theta),Np=5000) -> pf
   theta$loglik <- logLik(pf)
   theta
 } -> p




13.5.1 Exercise: Write down the definition of a likelihood slice in mathematical notation.

13.5.2 Exercise: Note that a likelihood slice is different from a likelihood profile

  • What is the difference between a likelihood slice and a profile from a computational perspective?

  • What is the difference between the statistical interpretation of a likelihood slice and a profile? For example, compare their roles in constructing confidence intervals and hypothesis tests.

  • What is the purpose of computing a likelihood slice? What is the purpose of computing a likelihood profile?




  • Note that the slice computation used the foreach package with the multicore backend (doMC) to parallelize these computations.

  • To ensure that we have high-quality random numbers in each parallel R session, we use a parallel random number generator. Notice the specification kind="L'Ecuyer" when we set the seed for the random number generator, and the specification .options.multicore=mcopts for foreach, the parallel for-loop.

  • Now, we can plot the likelihood slices:

foreach (v=c("Beta","gamma")) %do% 
{
  x <- subset(p,slice==v)
  plot(x[[v]],x$loglik,xlab=v,ylab="loglik")
}


13.5.3 Exercise: Likelihood slices

Add likelihood slices along the \(\rho\) direction.


Slices offer a very limited perspective on the geometry of the likelihood surface. With just two parameters, we can evaluate the likelihood at a grid of points and visualize the surface directly.

expand.grid(Beta=seq(from=1,to=4,length=50),
            gamma=seq(from=0.7,to=3,length=50),
            rho=0.8,
            N=2600) -> p

foreach (theta=iter(p,"row"),.combine=rbind,
         .inorder=FALSE,.options.multicore=mcopts) %dopar% 
 {
   pfilter(sir,params=unlist(theta),Np=5000) -> pf
   theta$loglik <- logLik(pf)
   theta
 } -> p
require(dplyr)
pp <- mutate(p,loglik=ifelse(loglik>max(loglik)-100,loglik,NA))
ggplot(data=pp,mapping=aes(x=Beta,y=gamma,z=loglik,fill=loglik))+
  geom_tile(color=NA)+
  geom_contour(color='black',binwidth=3)+
  scale_fill_gradient()+
  labs(x=expression(beta),y=expression(gamma))


13.5.4 Exercise: 2D likelihood slice

Compute a slice of the likelihood in the \(\beta\)-\(N\) plane.


13.6 Maximizing the likelihood using the particle filter

13.6.1 Causal, mechanistic interpretation of parameter estimates

  • When we write down a mechanistic model for a system, we have some idea of what we intend parameters to mean. In epidemiology, for example, we interpret parameters as a reporting rate, a contact rate between individuals, an immigration rate, a duration of immunity, etc.

  • The data and the parameter estimation procedure do not know about our intended interpretation of the model. It can and does happen that parameter estimates that are statistically consistent with the data are scientifically absurd according to the reasoning that went into building the model.

  • This can arise as a consequence of weak identifiability.

  • It can also be a warning that the data do not agree that our model represents reality in the way we had hoped. Perhaps more work is needed on model development.

  • Scientifically unreasonable parameter estimates can sometimes be avoided by fixing some parameters at known, reasonable values. However, this risks suppressing the warning that the data were trying to give about weaknesses in the model, or in the biological interpretation of it.

  • This issue will be discussed further when it arises in case studies.




13.6.2 Likelihood maximization for the boarding school flu example

  • We saw above that the default parameter set for the ‘bsflu’ pomp object is not particularly close to the MLE.

  • One way to find the MLE is to try optimizing the estimated likelihood, computed by the particle filter, directly.

  • There are many optimization algorithms to choose from, and many implemented in R.

  • Three issues arise immediately.

  1. The particle filter gives us a stochastic estimate of the likelihood. We can reduce this variability by making the number of particles, Np, larger. However, we cannot make it go away. If we use a deterministic optimizer (i.e., one that assumes the objective function is evaluated deterministically), then we must control this variability somehow. For example, we can fix the seed of the pseudo-random number generator. A side effect will be that the objective function becomes jagged, marked by many small local knolls and pits. Alternatively, we can use a stochastic optimization algorithm, with which we will only be able to obtain estimates of the MLE. This is the trade-off between a noisy and a rough objective function.

  2. Because the particle filter gives us just an estimate of the likelihood and no information about the derivative, we must choose an algorithm that is “derivative-free”. There are many such algorithms, but we can expect less efficiency than would be possible with derivative information. Note that finite differencing is not an especially promising way of constructing derivatives: the price would be an \(n\)-fold increase in CPU time, where \(n\) is the dimension of the parameter space, and, since the likelihood is only estimated, we would expect the derivative estimates to be noisy.

  3. Finally, the parameters are constrained to be positive, and \(\rho < 1\). We must therefore select an optimizer that can solve this constrained maximization problem, or figure out some way of turning it into an unconstrained maximization problem. For the latter purpose, one can transform the parameters onto a scale on which there are no constraints; a small sketch of this idea is given after this list.
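
As a small illustration of the transformation approach (a sketch only, reusing the sir object built above; the function names and the optim call are illustrative and not part of the original analysis), one can evaluate the particle filter log likelihood on an unconstrained scale as follows:

## illustrative sketch: work with the sir parameters on an unconstrained scale
est_to_par <- function(est) {
  est <- unname(est)
  c(Beta=exp(est[1]),gamma=exp(est[2]),rho=plogis(est[3]),N=2600)  # back to natural scale
}
neg_loglik <- function(est) -logLik(pfilter(sir,params=est_to_par(est),Np=1000))
# a derivative-free search could then be run as, e.g.,
# optim(c(log(2),log(1),qlogis(0.8)),neg_loglik,method="Nelder-Mead",
#       control=list(maxit=50))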




13.7 An iterated filtering algorithm (IF2)

13.7.1 IF2 algorithm pseudocode

model input: Simulators for \(f_{X_0}(x_0;\theta)\) and \(f_{X_n|X_{n-1}}(x_n| x_{n-1}; \theta)\); evaluator for \(f_{Y_n|X_n}(y_n| x_n;\theta)\); data, \(y^*_{1:N}\)

algorithmic parameters: Number of iterations, \(M\); number of particles, \(J\); initial parameter swarm, \(\{\Theta^0_j, j=1,\dots,J\}\); perturbation density, \(h_n(\theta|\varphi;\sigma)\); perturbation scale, \(\sigma_{1{:}M}\)

output: Final parameter swarm, \(\{\Theta^M_j, j=1,\dots,J\}\)

  1. \(\quad\) For \(m\) in \(1{:} M\)
  2. \(\quad\quad\quad\) \(\Theta^{F,m}_{0,j}\sim h_0(\theta|\Theta^{m-1}_{j}; \sigma_m)\) for \(j\) in \(1{:} J\)
  3. \(\quad\quad\quad\) \(X_{0,j}^{F,m}\sim f_{X_0}(x_0 ; \Theta^{F,m}_{0,j})\) for \(j\) in \(1{:} J\)
  4. \(\quad\quad\quad\) For \(n\) in \(1{:} N\)
  5. \(\quad\quad\quad\quad\quad\) \(\Theta^{P,m}_{n,j}\sim h_n(\theta|\Theta^{F,m}_{n-1,j},\sigma_m)\) for \(j\) in \(1{:} J\)
  6. \(\quad\quad\quad\quad\quad\) \(X_{n,j}^{P,m}\sim f_{X_n|X_{n-1}}(x_n | X^{F,m}_{n-1,j}; \Theta^{P,m}_{n,j})\) for \(j\) in \(1{:} J\)
  7. \(\quad\quad\quad\quad\quad\) \(w_{n,j}^m = f_{Y_n|X_n}(y^*_n| X_{n,j}^{P,m} ; \Theta^{P,m}_{n,j})\) for \(j\) in \(1{:} J\)
  8. \(\quad\quad\quad\quad\quad\) Draw \(k_{1{:}J}\) with \(P[k_j=i]= w_{n,i}^m\Big/\sum_{u=1}^J w_{n,u}^m\)
  9. \(\quad\quad\quad\quad\quad\) \(\Theta^{F,m}_{n,j}=\Theta^{P,m}_{n,k_j}\) and \(X^{F,m}_{n,j}=X^{P,m}_{n,k_j}\) for \(j\) in \(1{:} J\)
  10. \(\quad\quad\quad\) End For
  11. \(\quad\quad\quad\) Set \(\Theta^{m}_{j}=\Theta^{F,m}_{N,j}\) for \(j\) in \(1{:} J\)
  12. \(\quad\) End For

comments:

  • The \(N\) loop (lines 4 through 10) is a basic particle filter applied to a model with stochastic perturbations to the parameters.

  • The \(M\) loop repeats this particle filter with decreasing perturbations.

  • The superscript \(F\) in \(\Theta^{F,m}_{n,j}\) and \(X^{F,m}_{n,j}\) denotes solutions to the filtering problem, with the particles \(j=1,\dots,J\) providing a Monte Carlo representation of the conditional distribution at time \(n\) given data \(y^*_{1:n}\) for filtering iteration \(m\).

  • The superscript \(P\) in \(\Theta^{P,m}_{n,j}\) and \(X^{P,m}_{n,j}\) denotes solutions to the prediction problem, with the particles \(j=1,\dots,J\) providing a Monte Carlo representation of the conditional distribution at time \(n\) given data \(y^*_{1:n-1}\) for filtering iteration \(m\).

  • The weight \(w^m_{n,j}\) gives the likelihood of the data at time \(n\) for particle \(j\) in filtering iteration \(m\).

13.7.2 Choosing the algorithmic settings for IF2

  • The initial parameter swarm, \(\{ \Theta^0_j, j=1,\dots,J\}\), usually consists of \(J\) identical replications of some starting parameter vector.

  • \(J\) is set to be sufficient for particle filtering. By the time of the last iteration (\(m=M\)) one should not have effective sample size close to 1.

  • Perturbations are usually chosen to be Gaussian, with \(\sigma_m\) being a scale factor for iteration \(m\): \[h_n(\theta|\varphi;\sigma) \sim N[\varphi, \sigma^2_m V_n].\]
  • \(V_n\) is usually taken to be diagonal, \[ V_n = \left( \begin{array}{ccccc} v_{1,n}^2 & 0 & 0 & \rightarrow & 0 \\ 0 & v_{2,n}^2 & 0 & \rightarrow & 0 \\ 0 & 0 & v_{3,n}^2 & \rightarrow & 0 \\ \downarrow & & & \searrow & \downarrow \\ 0 & 0 & 0 & \rightarrow & v_{p,n}^2 \end{array}\right).\]
  • If \(\theta_i\) is a parameter that affects the dynamics or observations throughout the time series, it is called a regular parameter, and it is often appropriate to specify \[ v_{i,n} = v_i,\]
  • If \(\theta_j\) is a parameter that affects only the initial conditions of the dynamic model, it is called an initial value parameter (IVP) and it is appropriate to specify \[ v_{j,n} = \left\{\begin{array}{ll} v_j & \mbox{if $n=0$} \\ 0 & \mbox{if $n>0$} \end{array}\right.\]
  • If \(\theta_k\) is a break-point parameter that models how the system changes at time \(t_q\), then \(\theta_k\) is like an IVP at time \(t_q\) and it is appropriate to specify \[ v_{k,n} = \left\{\begin{array}{ll} v_k & \mbox{if $n=q$} \\ 0 & \mbox{if $n\neq q$} \end{array}\right.\]

  • \(\sigma_{1:M}\) is called a cooling schedule, following a thermodynamic analogy popularized by simulated annealing. As \(\sigma_m\) becomes small, the system cools toward a “freezing point”. If the algorithm is working successfully, the freezing point should be close to the lowest-energy state of the system, i.e., the MLE.

  • It is generally helpful for optimization to provide transformations of the parameters so that (on the estimation scale) they are real-valued and have uncertainty on the order of 1 unit. For example, one typically takes a logarithmic transformation of positive parameters and a logistic transformation of \([0,1]\) valued parameters.
  • On this scale, it is surprisingly often effective to take \[ v_i = 0.02\] for regular parameters (RPs) and \[ v_j = 0.1\] for initial value parameters (IVPs).

  • We suppose that \(\sigma_1=1\), since the scale of the parameters is addressed by the matrix \(V_n\). Early on in an investigation, one might take \(M=100\) and \(\sigma_M=0.1\). Later on, consideration of diagnostic plots may suggest refinements.

  • It is surprising that useful general advice exists for these quantities that could in principle be highly model-specific. Here is one possible explanation: the precision of interest is often the second significant figure and there are often order 100 observations (10 monthly observations would be too few to fit a mechanistic model; 1000 would be unusual for an epidemiological system). A schematic of how these settings enter a call to mif2 is given below.
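
As a rough schematic (not evaluated here; it reuses the sir object from above with illustrative values following the rules of thumb in this section, and the initial-value-parameter line is hypothetical since that model has fixed initial conditions), these settings enter a mif2 call along the following lines:

# schematic only, not run in these notes:
# mif2(sir, start=c(Beta=2,gamma=1,rho=0.8,N=2600),
#      Np=1000, Nmif=100,
#      cooling.type="geometric", cooling.fraction.50=0.5,
#      rw.sd=rw.sd(Beta=0.02, gamma=0.02, rho=0.02))
# An initial-value parameter, say S_0, would instead get the rw.sd entry
# S_0=ivp(0.1), so that it is perturbed only at the initial time.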




13.8 Applying IF2 to a boarding school influenza outbreak

bsflu_data <- read.table("bsflu_data.txt")
bsflu_statenames <- c("S","I","R1","R2")
bsflu_paramnames <- c("Beta","mu_I","rho","mu_R1","mu_R2")

The observation names (\(B\), \(C\)) are the names of the data variables:

(bsflu_obsnames <- colnames(bsflu_data)[1:2])
## [1] "B" "C"

We do not need a representation of \(R_3\) since the total population size is fixed at \(P=763\) and hence \(R_3(t)=P-S(t)-I(t)-R_1(t)-R_2(t)\). Now, we write the model code:

bsflu_dmeasure <- "
  lik = dpois(B,rho*R1+1e-6,give_log);
"

bsflu_rmeasure <- "
  B = rpois(rho*R1+1e-6);
  C = rpois(rho*R2);
"

bsflu_rprocess <- "
  double t1 = rbinom(S,1-exp(-Beta*I*dt));
  double t2 = rbinom(I,1-exp(-dt*mu_I));
  double t3 = rbinom(R1,1-exp(-dt*mu_R1));
  double t4 = rbinom(R2,1-exp(-dt*mu_R2));
  S -= t1;
  I += t1 - t2;
  R1 += t2 - t3;
  R2 += t3 - t4;
"

bsflu_fromEstimationScale <- "
 TBeta = exp(Beta);
 Tmu_I = exp(mu_I);
 Trho = expit(rho);
"

bsflu_toEstimationScale <- "
 TBeta = log(Beta);
 Tmu_I = log(mu_I);
 Trho = logit(rho);
"

bsflu_initializer <- "
 S=762;
 I=1;
 R1=0;
 R2=0;
"

We can now build the new bsflu pomp object.

require(pomp)
stopifnot(packageVersion("pomp")>="0.75-1")
bsflu2 <- pomp(
  data=bsflu_data,
  times="day",
  t0=0,
  rprocess=euler.sim(
    step.fun=Csnippet(bsflu_rprocess),
    delta.t=1/12
  ),
  rmeasure=Csnippet(bsflu_rmeasure),
  dmeasure=Csnippet(bsflu_dmeasure),
  fromEstimationScale=Csnippet(bsflu_fromEstimationScale),
  toEstimationScale=Csnippet(bsflu_toEstimationScale),
  obsnames = bsflu_obsnames,
  statenames=bsflu_statenames,
  paramnames=bsflu_paramnames,
  initializer=Csnippet(bsflu_initializer)
)
plot(bsflu2)

run_level <- 3
switch(run_level,
       {bsflu_Np=100; bsflu_Nmif=10; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10}, 
       {bsflu_Np=20000; bsflu_Nmif=100; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10}, 
       {bsflu_Np=60000; bsflu_Nmif=300; bsflu_Neval=10; bsflu_Nglobal=100; bsflu_Nlocal=20}
)

13.8.1 Running a particle filter

Before engaging in iterated filtering, it is often a good idea to check that the basic particle filter is working, since iterated filtering builds on this technique. Here, carrying out slightly circular reasoning, we are going to test pfilter on a previously computed point estimate read in from mif_bsflu_params.csv:

bsflu_params <- data.matrix(read.table("mif_bsflu_params.csv",row.names=NULL,header=TRUE))
bsflu_mle <- bsflu_params[which.max(bsflu_params[,"logLik"]),][bsflu_paramnames]

We are going to treat \(\mu_{R_1}\) and \(\mu_{R_2}\) as known, fixing them so that \(1/\mu_{R_1}\) and \(1/\mu_{R_2}\) equal the empirical mean bed-confinement and convalescence times for symptomatic cases:

bsflu_fixed_params <- c(mu_R1=1/(sum(bsflu_data$B)/512),mu_R2=1/(sum(bsflu_data$C)/512))
  • It is convenient to do some parallelization to speed up the computations. Most machines are multi-core nowadays, and using this computational capacity involves only

    1. the following lines of code to let R know you plan to use multiple processors;

    2. using the parallel for loop provided by ‘foreach’.

require(doParallel)
cores <- 20  # The number of cores on this machine 
registerDoParallel(cores)
mcopts <- list(set.seed=TRUE)

set.seed(396658101,kind="L'Ecuyer")
  • Note that this code uses the doParallel library, whereas earlier we used doMC, which would be set up as follows:
require(doMC)
registerDoMC(cores=20) 
  • doParallel and doMC are functionally equivalent for our current purposes, but give you two options if you hit technical problems accessing the multiple cores on your machine.

  • We proceed to carry out replicated particle filters at this tentative MLE:

stew(file=sprintf("pf-%d.rda",run_level),{
  t_pf <- system.time(
    pf <- foreach(i=1:20,.packages='pomp',
                  .options.multicore=mcopts) %dopar% try(
                    pfilter(bsflu2,params=bsflu_mle,Np=bsflu_Np)
                  )
  )
  
},seed=1320290398,kind="L'Ecuyer")

(L_pf <- logmeanexp(sapply(pf,logLik),se=TRUE))
##                      se 
## -73.3241678   0.3197182
  • In 9.4 seconds, we obtain a log likelihood estimate of -73.32 with a Monte Carlo standard error of 0.32. (Recall that the particle filter provides an unbiased estimate of the likelihood itself.)

13.8.2 Caching computations in Rmarkdown

  • It is not unusual for computations in a POMP analysis to take hours to run on many cores.

  • The computations for a final version of a manuscript may take days.

  • Usually, we use some mechanism like the different values of run_level so that preliminary versions of the manuscript take less time to run.

  • However, when editing the text or working on a different part of the manuscript, we don’t want to re-run long pieces of code.

  • Saving results so that the code is only re-run when necessary is called caching.

  • You may already be familiar with Rmarkdown’s own version of caching.

  • In the notes, we tell Rmarkdown to cache. For example, in notes13.Rmd, the first R chunk, called knitr-opts, contains the following:

opts_chunk$set(
  cache=TRUE
)
  • Rmarkdown uses a library called knitr to process the Rmd file, so options for Rmarkdown are formally options for knitr.

  • Having set the option cache=TRUE, Rmarkdown caches every chunk, meaning that a chunk will only be re-run if code in that chunk is edited.

  • You can force Rmarkdown to recompute all the chunks by deleting the cache subdirectory.

  • What if changes elsewhere in the document affect the proper evaluation of your chunk, but you didn’t edit any of the code in the chunk itself?

    • Rmarkdown will get this wrong. It will not recompute the chunk.
  • A perfect caching system doesn’t exist. Always delete the entire cache and rebuild a fresh cache before finishing a manuscript.

  • Rmarkdown caching is good for relatively small computations, such as producing figures or things that may take a minute or two and are annoying if you have to recompute them every time you make any edits to the text.

  • For longer computations, it is good to have full manual control. In pomp, this is provided by two related functions, stew and bake.

13.8.2.1 stew and bake

  • Notice the function stew in the replicated particle filter code above.

  • Here, stew looks for a file called pf-[run_level].rda.

  • If it finds this file, it simply loads the contents of this file.

  • If the file doesn’t exist, it carries out the specified computation and saves it in a file of this name.

  • bake is similar to stew. The difference is that bake uses readRDS and saveRDS, whereas stew uses load and save.

  • Either way, the computation will not be re-run unless you manually delete pf-[run_level].rda.

  • stew and bake reset the seed appropriately whether or not the computation is recomputed. Otherwise, caching risks adverse consequences for reproducibility. A minimal bake example is shown below.
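
For example, a minimal use of bake (the file name here is arbitrary, and the cached computation simply re-runs a small particle filter on the sir object from section 13.2) looks like this:

## illustrative caching example only; delete "bake-example.rds" to force recomputation
toy_loglik <- bake(file="bake-example.rds",{
  logLik(pfilter(sir,Np=1000,params=c(Beta=2,gamma=1,rho=0.8,N=2600)))
})
toy_loglik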




13.9 A local search of the likelihood surface

bsflu_rw.sd <- 0.02
bsflu_cooling.fraction.50 <- 0.5

stew(file=sprintf("local_search-%d.rda",run_level),{
  
  t_local <- system.time({
    mifs_local <- foreach(i=1:bsflu_Nlocal,.packages='pomp', .combine=c, .options.multicore=mcopts) %dopar%  {
      mif2(
        bsflu2,
        start=bsflu_mle,
        Np=bsflu_Np,
        Nmif=bsflu_Nmif,
        cooling.type="geometric",
        cooling.fraction.50=bsflu_cooling.fraction.50,
        transform=TRUE,
        rw.sd=rw.sd(
          Beta=bsflu_rw.sd,
          mu_I=bsflu_rw.sd,
          rho=bsflu_rw.sd
        )
      )
      
    }
  })
  
},seed=900242057,kind="L'Ecuyer")
stew(file=sprintf("lik_local-%d.rda",run_level),{
    t_local_eval <- system.time({
    liks_local <- foreach(i=1:bsflu_Nlocal,.packages='pomp',.combine=rbind) %dopar% {
      evals <- replicate(bsflu_Neval, logLik(pfilter(bsflu2,params=coef(mifs_local[[i]]),Np=bsflu_Np)))
      logmeanexp(evals, se=TRUE)
    }
  })
},seed=900242057,kind="L'Ecuyer")

results_local <- data.frame(logLik=liks_local[,1],logLik_se=liks_local[,2],t(sapply(mifs_local,coef)))
summary(results_local$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -74.543 -74.243 -74.038 -74.010 -73.930 -73.301
pairs(~logLik+Beta+mu_I+rho,data=subset(results_local,logLik>max(logLik)-50))




13.10 A global search of the likelihood surface using randomized starting values

bsflu_box <- rbind(
  Beta=c(0.001,0.01),
  mu_I=c(0.5,2),
  rho = c(0.5,1)
)
stew(file=sprintf("box_eval-%d.rda",run_level),{
  
  t_global <- system.time({
    mifs_global <- foreach(i=1:bsflu_Nglobal,.packages='pomp', .combine=c, .options.multicore=mcopts) %dopar%  mif2(
      mifs_local[[1]],
      start=c(apply(bsflu_box,1,function(x)runif(1,x[1],x[2])),bsflu_fixed_params)
    )
  })
},seed=1270401374,kind="L'Ecuyer")
stew(file=sprintf("lik_global_eval-%d.rda",run_level),{
  t_global_eval <- system.time({
    liks_global <- foreach(i=1:bsflu_Nglobal,.packages='pomp',.combine=rbind, .options.multicore=mcopts) %dopar% {
      evals <- replicate(bsflu_Neval, logLik(pfilter(bsflu2,params=coef(mifs_global[[i]]),Np=bsflu_Np)))
      logmeanexp(evals, se=TRUE)
    }
  })
},seed=442141592,kind="L'Ecuyer")

results_global <- data.frame(logLik=liks_global[,1],logLik_se=liks_global[,2],t(sapply(mifs_global,coef)))
summary(results_global$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -75.537 -74.527 -74.299 -74.268 -74.030 -71.913
if (run_level>2) 
  write.table(rbind(results_local,results_global),
              file="mif_bsflu_params.csv",append=TRUE,col.names=FALSE,row.names=FALSE)
pairs(~logLik+Beta+mu_I+rho,data=subset(results_global,logLik>max(logLik)-250))




13.11 Diagnosing success or failure of the maximization procedure

class(mifs_global)
## [1] "mif2List"
## attr(,"package")
## [1] "pomp"
class(mifs_global[[1]])
## [1] "mif2d.pomp"
## attr(,"package")
## [1] "pomp"
class(c(mifs_global[[1]],mifs_global[[2]]))
## [1] "mif2List"
## attr(,"package")
## [1] "pomp"
plot(mifs_global)

13.11.1 Question: Interpret the diagnostics

  1. Do these plots suggest we have successfully maximized the likelihood, or not? Why?

  2. What would the convergence plots look like if we cooled too quickly? Or too slowly? Can you find evidence for either of these here? (The algorithmic parameter cooling.fraction.50 is the fraction by which we decrease the random walk standard deviation in 50 filtering iterations.)

  3. Here, we’ve done 300 mif iterations. Should we have done more? Could we have saved ourselves computational effort by doing fewer, without compromising our analysis?

  4. Some parameter estimates show strong agreement between the different mif runs from different starting values. Others less so. How do you interpret this? Diversity in parameter estimates could be a signal of poor numerical maximization. It could signal a multi-modal likelihood surface. Or, it could simply correspond to a flat likelihood surface where the maximum is not precisely identifiable. Can we tell from the diagnostic plots which of these is going on here?

  5. Maximization via particle filtering requires that the particle filter is working effectively. One way to monitor this is to pay attention to the effective sample size on the last filtering iteration. The effective sample size (ESS) is computed here as \[ \mathrm{ESS}_{n}= \frac{\left(\sum_{j=1}^J w_{n,j}\right)^2}{\sum_{j=1}^J w_{n,j}^2},\] where \(\{w_{n,j}\}\) are the weights defined in step 3 of the particle filter. The ESS approximates the number of independent, equally weighted samples from the filtering distribution that would be as informative as the weighted sample obtained by the particle filter (a one-line computation is sketched after this list). Do you have any concerns about the number of particles?
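
For reference, here is a one-line sketch of the ESS computation for a single vector of hypothetical weights:

w <- runif(5000)              # hypothetical weights w_{n,j}, j=1,...,J (illustration only)
(ess <- sum(w)^2/sum(w^2))    # effective sample size at time n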



13.12 Exercises

13.12.1 Exercise: Construct a profile likelihood

  • How strong is the evidence about the contact rate, \(\beta\), given this model and data? Use mif2 to construct a profile likelihood. Due to time constraints, you may be able to compute only a preliminary version.

  • It is also possible to profile over the basic reproduction number, \(R_0=\beta P/\mu_I\). Is this more or less well determined than \(\beta\) for this model and data?




13.12.2 Exercise: Check the model source code

  • Check the source code for the bsflu pomp object. Does the code implement the model described?

  • For various reasons, it can be surprisingly hard to make sure that the written equations and the code are perfectly matched. Here are some things to think about:

  1. Papers should be written to be readable to as broad a community as possible. Code must be written to run successfully. People do not want to clutter papers with numerical details which they hope and believe are scientifically irrelevant. What problems can arise due to this, and what solutions are available?

  2. Suppose that there is an error in the coding of rprocess. Suppose that plug-and-play statistical methodology is used to infer parameters. A conscientious researcher carries out a simulation study, using simulate to generate some realizations from the fitted model and checking that the inference methodology can successfully recover the known parameters for this model, up to some statistical error. Will this procedure help to identify the error in rprocess? If not, how does one debug rprocess? What research practices help minimize the risk of errors in simulation code?




13.12.3 Exercise: Assessing and improving algorithmic parameters

Develop your own heuristics to try to improve the performance of mif2 in the previous example. Specifically, for a global optimization procedure carried out using random starting values in the specified box, let \(\hat\Theta_\mathrm{max}\) be a random Monte Carlo estimate of the resulting MLE, and let \(\hat\theta\) be the true (unknown) MLE. We can define the maximization error in the log likelihood to be \[e = \ell(\hat\theta) - E[\ell(\hat\Theta_\mathrm{max})].\] We cannot directly evaluate \(e\), since there is also Monte Carlo error in our evaluation of \(\ell(\theta)\), but we can compute it up to a known precision. Plan some code to estimate \(e\) for a search procedure using a computational effort of \(JM=2\times 10^7\), comparable to that used for each mif computation in the global search. Discuss the strengths and weaknesses of this quantification of optimization success. See if you can choose \(J\) and \(M\) subject to this constraint, together with choices of rw.sd and the cooling rate, cooling.fraction.50, to arrive at a quantifiably better procedure. Computationally, you may not be readily able to run your full procedure, but you could run a quicker version of it.




13.12.4 Exercise: Finding sharp peaks in the likelihood surface

  • Even in this small, 3-parameter example, it takes a considerable amount of computation to find the global maximum (with values of \(\beta\) around 0.004) starting from uniform draws in the specified box. The problem is that, on the scale on which “uniform” is defined, the peak around \(\beta\approx 0.004\) is very narrow. Propose and test a more favorable way to draw starting parameters for the global search, with better scale invariance properties.




13.12.5 Exercise: Adding a latent class

  • Modify the model to include a latent period between becoming exposed and becoming infectious. See what effect this has on the maximized likelihood.




Acknowledgment

These notes draw on material developed for a short course on Simulation-based Inference for Epidemiological Dynamics by Aaron King and Edward Ionides, taught at the University of Washington Summer Institute in Statistics and Modeling in Infectious Diseases, 2015.

