Introduction

Chickenpox is a highly contagious disease which many associate with childhood. The rash it comes with is unpleasant and disruptive to everyday life. The development of a chickenpox vaccine in the 1970s, with it becoming fully available to the public in 1984, led to a sharp decline in chickenpox cases around the world. It is an extremely effective vaccine which prevents 100% of severe cases of chickenpox with little to no side effects. If the entirety of a population that is able to be vaccinated is vaccinated against chickenpox, it is highly unlikely for there to be much transmission between community members [1].

In Hungary, chickenpox is not a required vaccine for children and is not reimbursed when patients opt to get it, which has caused chickenpox to be much more prevalent in Hungary than in other countries where vaccination against chickenpox is expected [2]. Thus, even with the development of a vaccine, chickenpox is still a disease that Hungarian citizens contract and pass to each other at alarming rates, considering how transmissible chickenpox is. Thus, modeling chickenpox in Hungary will be useful for health officials in the country to accurately track and predict outbreaks and examine how chickenpox spreads.

Using chickenpox case data from each of the counties in Hungary from 2005 to 2014, we will construct a SEIR model with an additional component consisting of individuals who are vaccinated, therefore going directly from the susceptible population to recovered. Our analysis seeks to answer the question: can we successfully model Hungarian chickenpox cases using an SEIR model?

Exploratory Data Analysis

The Hungarian Chickenpox data used in the analysis is available on Kaggle. The data spans across 10 years from 2005 to 2014, with cases reported on a weekly basis in 20 Hungarian counties. There are no missing values in the data.

The population and birthrate data are found on Macrotrends.

We first look at the time series plot of total Chickenpox cases in Hungary. We observe some clear seasonality of data on a yearly basis as the peak can be as high as over 2000 cases, and the lowest is around 100 cases. We notice that there are some inconsistent trends in the data as the number of cases suddenly skyrockets or dips. We attribute them to the data entry errors and remove them from the data as we construct models. These outliers are highlighted with red circles in the plot.

# load the data
cpox_original = read.csv("hungary_chickenpox.csv")

# rowsum and create date object
cpox = cpox_original %>% mutate(cases = rowSums(.[-1]),date = as.Date(Date,format="%d/%m/%Y"))


# separate month/day/year
cpox = cpox %>% select(date,Date,cases) %>% separate(Date,c("day","month","year"),sep = "/")

cpox %>% mutate( time=  julian(date, origin = as.Date("2005-01-03"))/365.25 + 2005) %>%
  filter(time>=2005 & time<2015) %>%
  select(time,cases) -> cpox


# Plot the data
plot(cpox$time,cpox$case,type='l',xlab = "time",ylab="case", 
                        main="Hungarian chickenpox outbreak")

# highlight outlier (possible data entry error)
cpox = cpox %>% mutate(row_idx = 1:522)
cpox_outlier = cpox %>% filter(row_idx %in% c(122,159,469,486,487,493)) %>% select(-row_idx)
points(x=cpox_outlier$time,y=cpox_outlier$cases,col="red",cex=2)

# remove outlier 
cpox = cpox %>% filter(!row_idx %in% c(122,159,469,486,487,493)) %>% select(-row_idx)

We plot the number of national Chickenpox cases against month. The cases are around 40,000-60,000 from January to May with peak of 60,000 cases in May. The number of cases decreases and reaches a low at 3,000 in September. It is on an increasing trend from October to December.

Below are the times series plot of Chickenpox cases for the 3 most densely populated counties and the 3 least densely populated counties in the data. The seasonality pattern seems identical across 6 different counties.

cpox_town = cpox_original %>% gather(key="town",value="case",2:21) %>% 
            mutate(town = as.factor(town))
cpox_town_summary = cpox_town %>% 
                    group_by(town) %>% 
                    summarise(cases = sum(case))

# largest and smallest
cpox_town_summary_LS = (cpox_town_summary %>% arrange(-cases))[c(1,2,3,18,19,20),]

#
cpox_LS = cpox_town %>% filter(town %in% c("BUDAPEST","PEST","BORSOD","NOGRAD","TOLNA","ZALA")) %>% mutate(Date = as.Date(Date,format="%d/%m/%Y"))


cpox_LS$town <- factor(cpox_LS$town , levels = c("BUDAPEST","PEST","BORSOD","NOGRAD","TOLNA","ZALA"))
ggplot(cpox_LS) + geom_line(aes(x = Date,y = case)) + facet_wrap(~town,nrow=2)

Covariates

Below are the scatterplots with smoothed line of Hungarian population and birthrate data from 2005 to 2014. We see that Hungarian population steadily decreases from about 10.5 million to 9.8 million in this time period. The number of newborns stays around 95,000 from 2006 to 2008 and takes a dip to 89,000 in 2013 and increases slightly back to over 90,000. The smoothed line captures the data well and allows us to measure the effect that these covariates have on our models.

pop = c(10085937,10055653,10024149,9991867,9959439,9927370,9895680,9864358,
                                                        9833923,9804991,9777923)

birthrate = c(9.437,9.470,9.502,9.534,9.443,9.353,9.262,9.172,9.081,9.163,9.245)
year = seq(2005,2015)

# convert birthrate to the number of newborns
hungary_demographic = data.frame(year,pop = pop, birthrate = birthrate,
                            num_newborn = ceiling(pop/1000 * birthrate))

# add smoothing line (Similar to the past project)
hungary_covar = hungary_demographic %>% summarise(
                          time = seq(2005,2014,by=1/52),
                          pop = predict(smooth.spline(x=year,y=pop),x=time)$y,
                          birthrate = predict(smooth.spline(x=year,y=num_newborn),x=time)$y
                    )

par(mfrow=c(1,2))
# population plot
plot(x=hungary_covar$time,y=hungary_covar$pop/1000000,type="l",
     xlab="year",ylab="population (million)",
     main = "Hungarian population")

points(pop/1000000~year, data=hungary_demographic,col="red")

# birthrate plot
plot(x=hungary_covar$time,y=hungary_covar$birthrate/1000,type="l"
     ,xlab="year",ylab="newborns (thousand)",
     main = "Hungarian number of newborns")
points(num_newborn/1000~year, data=hungary_demographic,col="red")

POMP Model

Model Set Up

For our analysis, we will use a modified SEIR model shown below. In a normal SEIR model, once an individual is born, they would enter the susceptible population and either become infected with chickenpox or would not. With the added component of a vaccine, individuals are able to move directly from susceptible to recovered without becoming infected with the disease.

Graphic inspired by [3].

Partially observed Markov process (POMP) models are especially useful in modeling diseases and can be helpful tools in epidemiology. In class, we examined a case study with measles using an SEIR model. Modeling chickenpox in a similar way, but including vaccination status within the model, could be an effective way to model chickenpox in Hungary. Our model was created by closely following the POMP model developed by Aaron King in his case study of measles [4].

The main difference between King’s POMP model and ours is the inclusion of the vaccination rate parameter. The vaccination rate is multiplied by the birth rate to give us the number individuals who will directly move from the susceptible group to the recovered group. This corresponds to them receiving the vaccine. This value is multiplied by 0.92, which represents the vaccine’s effectiveness. According to the CDC, the chickenpox vaccine was 92% effective in post-licensure studies [1]. When browsing past final projects, we found a project which also included vaccination in a SEIR POMP model and uses Aaaron King’s past measles POMP model as a guide [5]. This project was done on measles data in California. Our project differs from the past final project in many ways even with the similar analysis goal and similar source POMP model.

## Process Model --------------------------------------------------------------
# c-snippets - modeled after project

rproc <- Csnippet("
                  double beta, br, seas, foi, dw, births, vac;
                  double rate[6], trans[6];

                  // cohort effect
                  if (fabs(t-floor(t)-251.0/365.0) < 0.5*dt) 
                    br = cohort*birthrate/dt + (1-cohort)*birthrate;
                  else 
                    br = (1.0-cohort)*birthrate;

                  // term-time seasonality
                  t = (t-floor(t))*365.25;
                  if ((t>=7&&t<=100) || 
                    (t>=115&&t<=199) || 
                    (t>=252&&t<=300) || 
                    (t>=308&&t<=356))
                    seas = 1.0+amplitude*0.2411/0.7589;
                  else
                     seas = 1.0-amplitude;

                  // transmission rate
                  beta = R0*(gamma+mu)*seas;

                  // expected force of infection
                  foi = beta*pow(I+iota,alpha)/pop;

                  // white noise (extrademographic stochasticity)
                  dw = rgammawn(sigmaSE,dt);

                  rate[0] = foi*dw/dt;  // stochastic force of infection
                  rate[1] = mu;         // natural S death
                  rate[2] = sigma;      // rate of ending of latent stage
                  rate[3] = mu;         // natural E death
                  rate[4] = gamma;      // recovery
                  rate[5] = mu;         // natural I death

                  // Poisson births
                  births = rpois(br*dt);

                  // Vaccination
                  vac = nearbyint(vr*br*.92*dt);

                  // transitions between classes
                  reulermultinom(2,S,&rate[0],dt,&trans[0]);
                  reulermultinom(2,E,&rate[2],dt,&trans[2]);
                  reulermultinom(2,I,&rate[4],dt,&trans[4]);

                  if (vac > S - trans[0] - trans[1]){
                    vac = S - trans[0] - trans[1];
                  }

                  S += births   - trans[0] - trans[1] - vac;
                  E += trans[0] - trans[2] - trans[3];
                  I += trans[2] - trans[4] - trans[5];
                  R = pop - S - E - I + vac;
                  W += (dw - dt)/sigmaSE;  // standardized i.i.d. white noise
                  C += trans[4];           // true incidence
                  ")

rinit <- Csnippet("
  double m = pop/(S_0+E_0+I_0+R_0);
  S = nearbyint(m*S_0);
  E = nearbyint(m*E_0);
  I = nearbyint(m*I_0);
  R = nearbyint(m*R_0);
  W = 0;
  C = 0;
")

dmeas <- Csnippet("
  double m = rho*C;
  double v = m*(1.0-rho+psi*psi*m);
  double tol = 1.0e-18;
  if (cases > 0.0) {
    lik = pnorm(cases+0.5,m,sqrt(v)+tol,1,0)
            - pnorm(cases-0.5,m,sqrt(v)+tol,1,0)+tol;
  } else {
    lik = pnorm(cases+0.5,m,sqrt(v)+tol,1,0)+tol;
  }
  if (give_log) lik = log(lik);
  if (!R_FINITE(lik)) 
                 Rprintf(\"%lg %lg %lg %lg %lg %lg\\n\",rho,C,m,v,psi,lik);
")

rmeas <- Csnippet("
  double m = rho*C;
  double v = m*(1.0-rho+psi*psi*m);
  double tol = 1.0e-18;
  cases = rnorm(m,sqrt(v)+tol);
  if (cases > 0.0) {
    cases = nearbyint(cases);
  } else {
    cases = 0.0;
  }
")

## POMP Construction ----------------------------------------------------------
cpox %>% 
  pomp(t0=with(cpox, time[1]),
       time="time",
       rprocess=euler(rproc,delta.t=1/365.25),
       rinit=rinit,
       dmeasure=dmeas,
       rmeasure=rmeas,
       covar=covariate_table(hungary_covar,times="time"),
       accumvars=c("C","W"),
       statenames=c("S","E","I","R","C","W"),
       paramnames=c("R0","mu","sigma","gamma","alpha","iota",
                    "rho","sigmaSE","psi", "cohort", "amplitude", 
                    "S_0","E_0","I_0","R_0","vr")
  ) -> m1

Parameters

There are several parameters in our model that we aim to estimate using a local and a global search.

  • \(R0\)- basic reproductive number
  • \(\sigma\)- rate of ending of latent stage
  • \(\gamma\)- recovery rate
  • \(\alpha\)- mixing (how homogeneous the population is)
  • \(\iota\)- disease import rate
  • \(\rho\)- reporting rate
  • \(\sigma_{SE}\)- extrademographic stochasticity
  • \(\psi\)- overdispersion
  • cohort- cohort effect
  • amplitude- seasonality
  • vr- vaccination rate

Discussion

Although we effectively simulate the chickenpox data, we recognize that there are a few limitations in our analysis.

Initial Parameters from Measles

We borrowed many initial parameters from the measles analysis in England and Wales from the course lecture slides [3]. The reason for this is that some parameters specific to the Hungary Chickenpox outbreak are difficult to directly estimate without extensive research into the behavior of the Chickenpox disease in Hungary. For example, we use the same amplitude parameter from the measles analysis in our SEIR model. As shown in the exploratory data analysis, there is clear seasonality in the chickenpox data. This similar seasonality may be seen in the measles data, with cases increasing during the winter school months. Using the measle’s amplitude parameter provided a good starting point to conduct the local search with confidence that it is partially valid in another disease.

Estimating initial parameters from literature

In our SEIR model, we seek to estimate parameters specific to Hungary and chickenpox. We estimate \(\gamma\) (rate of recovery), \(\rho\) (reporting rate), \(R0\) and vaccination rate based on results from scientific literature. Many parameter values are applicable to chickenpox in general but might not reflect the reality of the chickenpox outbreak in Hungary. For example, based on research, \(R0\) values is approximately 7-12 [8]. In our analysis, \(R0\) came out to be extremely high. This is something that requires investigation. This value should not be affected by vaccination rates, so it is unclear why \(R0\) would vary so much in our model based on the addition of our vaccination rate parameter.

Runtime on Great Lakes

We used many different levels of iterative filtering when conducting the local search and global search. Unfortunately, as we increased the number of particles at the iterative filtering state, the computational complexity became too great. We present our findings using less particles and number of MIF iterations than we would like, but acknowledge these findings may be much more accurate given longer run times. We envision that if we increased the number of particles and iterations, we could expand the grid of candidate parameters and find the maximum likelihood estimate that would get us closer to the true model.

Conclusion

We fit a modified version of the SEIR model to Hungarian chickenpox cases from 2005 to 2014. By using a local search and performing global maximization with different levels of iterative filtering, we identified the parameters in the SEIR model that maximize the likelihood. Below are the main conclusions of our analysis:

We effectively model the chickenpox data and our simulations using maximum likelihood estimators follow closely with the true data. Both simulations from local search and global search capture the seasonality of chickenpox data. Specifically, we see that the simulated data has a dip in cases in late Winter every year. The pattern is also observed in the true data. THis is likely due to students being on break from school for the holidays. The small difference between both simulations is that local search simulation seems to peak and dip earlier than the true data, whereas the global search simulation overestimates the number of cases at certain time periods compared to the true data.

Based on our global search result, the vaccination rate has an effect on the SEIR model. However, when we look at the pair plot, we see that vaccination rate seems to be a weakly identified parameter in the model. Namely, there is a region of parameter space of vaccination rate that produces similar likelihood values. Nonetheless, the global search output suggests that the lower bound of the parameter space which gives similar likelihood values is 0.025, meaning that vaccination rate still plays a role in our SEIR model.

We observe convergence of some parameters in the local search. Specifically, \(\rho\) (reporting rate) converges to around 0.5 and alpha (mixing parameter) converges to around 1. In addition, \(S_0\) (proportion of the susceptible ), \(E_0\) (proportion of the exposed), \(I_0\) (proportion of the infected) , and \(R_0\) (proportion of the recovered) converge to 0.025, 0.0001, 0.00035, and 0.975 respectively. We observe convergence of other parameters, but many still need further exploration. It suggests that a more robust search is needed.