Introduction


Data

Data source

  • The data is from https://www.tycho.pitt.edu/ which is an open resource for infectious disease time series data. The original data contains the population who got HAV in every week in every state from 1966 to 1987. I extract the data for michigan and then change them to monthly data with the startcode.py.
require(ggplot2)
## Loading required package: ggplot2
sample <- read.csv("hep-a-mi.csv",header = TRUE)
head(sample)
##      State      time population
## 1 MICHIGAN  1/2/1966         84
## 2 MICHIGAN  1/9/1966         47
## 3 MICHIGAN 1/16/1966         71
## 4 MICHIGAN 1/23/1966         71
## 5 MICHIGAN 1/30/1966         50
## 6 MICHIGAN  2/6/1966         46
hepa <- read.csv("hep-a-mi2.csv",header = TRUE)
head(hepa)
##   number population
## 1      1        273
## 2      2        241
## 3      3        325
## 4      4        240
## 5      5        194
## 6      6        160
hepa <- read.csv("hep-a-mi2.csv",header = TRUE)
hep <- hepa$population
num <- hepa$number
plot(hep~num,type='l',main="Monthly HAV population in Michigan from 1966 to 1987")

library(mFilter)
hp=hpfilter(hep, freq=100,type="lambda",drift=F)
trend=ts(hp$trend)
cycle=ts(hp$cycle)
plot(ts.union(trend,cycle),type="l",xlab="Date",ylab="", main='Decomposition of HAVcases as trend and cycle')

  • From the plot above, we can see there is a strong trend in the data, the peak may because of an outbreak of HAV which did not get controll around 1970. Then with the progress of modern treatment technology and sanitary condition, less and less people got HAV especially after about 1980.

Pomp analysis

SIR model

  • Usually, an infectious disease can be represented by three stages, including susceptible, infectious and recovered. As the introduction in the first part(introduction), SIR is suitable for HAV. So we first build the model same as note 11[2], we have four states S,I,R,H.(The final population result from a process with probability ρ, which we can think of as the probability that an infection is severe enough to be noticed by the authorities. Since confined cases have, presumably, a much lower transmission rate, we need a variable H to track this.)

  • The model is graphically represented as follows: SIR model diagram

hav <- read.csv("hep-a-mi2.csv")
hav<- subset(hav,select=c(number,population))
require(pomp)
## Loading required package: pomp
sir_step <- Csnippet("
  double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_IR = rbinom(I,1-exp(-gamma*dt));
  S -= dN_SI;
  I += dN_SI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

sir_init <- Csnippet("
  S = N-1;
  I = 1;
  R = 0;
  H = 0;
")

pomp(hav,,time="number",t0=0,rprocess=euler.sim(sir_step,delta.t=1/4),initializer=sir_init,
     paramnames=c("Beta","gamma","N"),statenames=c("S","I","R","H")) -> sir
pomp(sir,zeronames="H") -> sir
dmeas <- Csnippet("lik = dbinom(population,H,rho,give_log);")
rmeas <- Csnippet("population = rbinom(H,rho);")
sir <- pomp(sir,rmeasure=rmeas,dmeasure=dmeas,statenames="H",paramnames="rho")
sims <- simulate(sir,params=c(Beta=0.25,gamma=0.1,rho=0.9,N=15000),nsim=20,as=TRUE,include=TRUE)
require(ggplot2)
ggplot(sims,mapping=aes(x=time,y=population,group=sim,color=sim=="data"),main='SIR simulation for HAV cases')+
  geom_line()+guides(color=FALSE)

  • The red lines represent the SIR somulation and the green line represents the pomp data.And from the result we can tell that it fits really well which indicates that SIR is a suitable model for HAV case in Michigan and modifying the SIR is meaningful as next step.

  • The parameters including \(\beta\), \(\gamma\), \(\rho\) and \(N\) is set by trying different value and choosing the value used by the best simulation result plots. The value of \(\beta\), \(\gamma\), \(\rho\) is reasonable. \(\beta\) represent the rate of the transmission. \(\gamma\) means thereciprocal value of the time from when people get the virus to when they show symptoms and get treatment.In this cases its \(\delta\)\(t\) is \(1/4\)(1 week) and \(\gamma\) eauals 0.1 means 10 weeks. It accords with the introduction part.(A little bit large but reasoniable).\(N\) represents how many people may get affcted by the virus(who lives near the virus resource.).Since it is very likely to get infection when you have access to the virus, 15000 as start may be too large for this case. \(\rho\) means how many people will be report when they recover.90% can be a good explanation.

SEICR model

Model introduction

  • SEICR model is a modification of SIR by adding the states of carrier and latency, we have introduced the reason why we used it. This model was firstly published associating with HBV analysis[3]. HBV and HAV is very similar.So I try to fit it in this case by some modify related to the difference between HAV and HBV.

  • The model for HBV is graphically represented as follows[3]: SEICR for HBV model diagram

  • My modifications(simplifications) are as below:

** Delete birth rate and death rate part. Since HAV seldom results in death and larger model may cause more error resulted from too many parameters, I choose to ignore the death and birth in my case.

** HAV is not a disease with heredity. So the new born babies are less likely to have HAV virus or anti-HAV in their body.

** I combine the parameter of people recovered automatically and after treatment as one parameter recover rate(The effect of these 2 parameter can be studied in the future work)

** Also I change the meaning of some of the states(E: people who ever get access to the HAV source,I: people who really get the virus. C: Showing symptom and carrying the virus) and process to make it more related to HAV.

  • So the final model can be show as below: SEICR for HAV model diagram The equation is as below: fomula

  • The other setting is the same as SIR model(The reported data is a binomial process of R(people who recover from infection and who has never show symptom and never get vaccinated but show anti-HAV in their body which is very rare). Also model the rateof moving from one state to the next over a very short time interval as a binomial random variable)

options(
  keep.source=TRUE,
  stringsAsFactors=FALSE,
  encoding="UTF-8"
  )
set.seed(594709947L)
require(ggplot2)
theme_set(theme_bw())
require(plyr)
## Loading required package: plyr
require(reshape2)
## Loading required package: reshape2
require(magrittr)
## Loading required package: magrittr
require(foreach)
## Loading required package: foreach
require(pomp)
stopifnot(packageVersion("pomp")>="0.69-1")

Fitting pomp model

  • Now we can build our pomp model.
hep_data <- read.csv("hep-a-mi2.csv")
hep_statenames <- c("S","E","I","C","R")
hep_paramnames <- c("Beta","mu_I","rho","mu_R1","mu_R2","mu_R3","mu_R4","mu_R5")
hep_obsnames <- colnames(hep_data)[2]
hep_dmeasure <- "
  lik = dpois(population,rho*R+1e-6,give_log);
"

hep_rmeasure <- "
  population= rpois(rho*R+1e-6);

"

hep_rprocess <- "
  double t1 = rbinom(S,1-exp(-Beta*I*dt));
  double t2 = rbinom(E,1-exp(-dt*mu_I));
  double t3 = rbinom(I,1-exp(-dt*mu_R1));
  double t4 = rbinom(C,1-exp(-dt*mu_R2));
  double t5 = rbinom(R,1-exp(-dt*mu_R3));
  double t6 = rbinom(I,1-exp(-dt*mu_R4));
  double t7 = rbinom(S,1-exp(-dt*mu_R5));
  S -= t1-t5+t7;
  E += t1 - t2;
  I += t2 - t3-t6;
  C += t3 - t4;
  R += t4 - t5+t6+t7;
"

hep_fromEstimationScale <- "
 TBeta = exp(Beta);
 Tmu_I = exp(mu_I);
 Tmu_R1 = exp(mu_R1);
 Tmu_R2 = exp(mu_R2);
 Tmu_R3 = exp(mu_R3);
 Tmu_R4 = exp(mu_R4);
 Tmu_R5 = exp(mu_R5);
 Trho = expit(rho);
"

hep_toEstimationScale <- "
 TBeta = log(Beta);
 Tmu_I = log(mu_I);
 Tmu_R1 = log(mu_R1);
 Tmu_R2 = log(mu_R2);
 Tmu_R3 = log(mu_R3);
 Tmu_R4 = log(mu_R4);
 Tmu_R5 = log(mu_R5);
 Trho = logit(rho);
"

hep_initializer <- "
 S=5000;
 E=0;
 I=0;
 C=0;
 R=0;
"

require(pomp)
stopifnot(packageVersion("pomp")>="0.75-1")
hep2 <- pomp(
  data=hep_data,
  times="number",
  t0=0,
  rprocess=euler.sim(
    step.fun=Csnippet(hep_rprocess),
    delta.t=1
  ),
  rmeasure=Csnippet(hep_rmeasure),
  dmeasure=Csnippet(hep_dmeasure),
  fromEstimationScale=Csnippet(hep_fromEstimationScale),
  toEstimationScale=Csnippet(hep_toEstimationScale),
  obsnames = hep_obsnames,
  statenames=hep_statenames,
  paramnames=hep_paramnames,
  initializer=Csnippet(hep_initializer)
)
plot(hep2)

  • To understand the transmission dynamic of HAV, I tried to estimate those parameters in my model by global search.

  • IF2 estimation involves using different starting values for the parameters. We can construct a large box in parameter space containing all plaussible parameter vectors. If the estimation gives a stable performance with different starting values, we could be confidenct that it is an adequate global optimization.

run_level <- 3
switch(run_level,
       {bsflu_Np=200; bsflu_Nmif=10; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10},
       {bsflu_Np=2000; bsflu_Nmif=100; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10},
       {bsflu_Np=20000; bsflu_Nmif=300; bsflu_Neval=10; bsflu_Nglobal=10; bsflu_Nlocal=10},
       
)



require(doParallel)
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
cores <- 20  # The number of cores on this machine 
registerDoParallel(cores)
mcopts <- list(set.seed=TRUE)

set.seed(396658101,kind="L'Ecuyer")
bsflu_box <- rbind(
  Beta=c(0.001,0.01),
  mu_I=c(0.5,2),
  mu_R1=c(0.5,2),
  mu_R2=c(0.5,2),
  mu_R3=c(0.5,2),
  mu_R4=c(0.5,2),
  mu_R5=c(0.5,2),
  rho = c(0.5,1)
)

bsflu_rw.sd <- 0.02
bsflu_cooling.fraction.50 <- 0.5
hep_rw.sd <-0.2
stew(file=sprintf("hepeefin_eval-%d.rda",run_level),{
  
  t_global <- system.time({
    mifs_global <- foreach(i=1:bsflu_Nglobal,.packages='pomp', .combine=c, .options.multicore=mcopts) %dopar% 
      mif2(     
      hep2,
      start=c(apply(bsflu_box,1,function(x)runif(1,x[1],x[2]))),
      Np=bsflu_Np,
      Nmif=bsflu_Nmif,
      cooling.type="geometric",
      cooling.fraction.50=bsflu_cooling.fraction.50,
      transform=TRUE,
      rw.sd=rw.sd(
        Beta=bsflu_rw.sd,
        mu_I=bsflu_rw.sd,
        rho=bsflu_rw.sd,
        mu_R1=hep_rw.sd,
        mu_R2=hep_rw.sd,
        mu_R3=hep_rw.sd,
        mu_R4=hep_rw.sd,
        mu_R5=hep_rw.sd
      )
    )
  })
},seed=1270401374,kind="L'Ecuyer")
plot(mifs_global)
## Warning in xy.coords(x, y, xlabel, ylabel, log = log): 565 y values <= 0
## omitted from logarithmic plot
## Warning in xy.coords(x, y, xlabel, ylabel, log): 31 y values <= 0 omitted
## from logarithmic plot

  • Unfortunately, this more complex and seemingly more realistic model performs even worse than the first model. From the nfail and loglik plot, we can see that the result is going to coverage which means this model fits the data. Also all the parameter seem coverage to pretty small values. There are some extrems values appeared in the iteration of some parameters(like \(mu_R1\)) also this value affected the log likehood
stew(file=sprintf("hepeefin-lik_global_eval-%d.rda",run_level),{
  t_global_eval <- system.time({
    liks_global <- foreach(i=1:bsflu_Nglobal,.packages='pomp',.combine=rbind, .options.multicore=mcopts) %dopar% {
      evals <- replicate(bsflu_Neval, logLik(pfilter(bsflu2,params=coef(mifs_global[[i]]),Np=bsflu_Np)))
      logmeanexp(evals, se=TRUE)
    }
  })
},seed=442141592,kind="L'Ecuyer")

results_global <- data.frame(logLik=liks_global[,1],logLik_se=liks_global[,2],t(sapply(mifs_global,coef)))
summary(results_global$logLik,digits=5)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -4624   -4488   -4444   -4115   -3605   -3342
t_global
##      user    system   elapsed 
## 13279.163    69.337  1502.584
pairs(~logLik+Beta+rho+mu_I+mu_R1+mu_R2+mu_R3+mu_R4+mu_R5,data=subset(results_global,logLik>max(logLik)-50000))

idx <- which.max(results_global$logLik)
results_global[idx,]
##             logLik logLik_se         Beta      mu_I        mu_R1
## result.9 -3342.047  19.65909 0.0001875922 0.6226018 2.894849e-06
##                 mu_R2      mu_R3   mu_R4       mu_R5       rho
## result.9 3.557312e-09 0.05669999 6373976 0.008652752 0.9881627

parameter analysis

  • Since we have 8 parameters for the SEICR model, we can not guess the parameter as we did in the SIR analysis(although the result is very reasonable). So we use global likehood to search for the best parameter.

  • The maximum likelihood parameters shows out unreasonably low \(mu_R1\) and \(mu_R2\) and unreasonably large \(mu_R4\)

  • \(\beta\)\(I\) now denotes the rate of how many people will get access to the source of the HAV virus(either the people who got HAV or the food and other staff used by them ) in a time period(one week). The number of \(I\) may be about 100~1000 (total number in this case is 5000). If we set it as 500, this rate of exposed to virus is 0.09 which is reasonable.

  • \(mu_R3\) and \(mu_I\) all denote the reciprocal value of the time from 1 state to another(several weeks) which are lso reasonable.

  • \(\rho\) has the same meening as in the SIR model which also has almost the same value as in SIR.

  • All the value of the parameter related to other parameters. Since part of the parameters re too large or small. The other reasonable value may be suspectibal.

Conclusion

Future work

Reference