17.1 Some theory for forecasting
17.2 Some comments on “fitted values” for time series
17.3 Case study: An emerging infectious disease outbreak
17.4 Parameter estimates for Ebola
17.5 Diagnostics for the Ebola analysis
17.6 Forecasting Ebola

Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.
CC-BY_NC

Produced with R version 3.2.3 and pomp version 1.3.1.2.

Objectives

To discuss forecasting and fitted values, both as ways to assess model fit and as potentially the major goal of time series analysis.
To teach some forecasting methods based on POMP models.
To demonstrate a forecasting analysis using POMP methods and the pomp computational environment.

17.1 Some theory for forecasting

Given data up to time \(t_n\), and a fitted model, what is our prediction for the future values of a process \(X_{{n+1}}, X_{n+1},\dots\)?
For a POMP model, the answer is fairly clear. According to the structure of a POMP model, the data up to time \(t_n\) affect our estimate of the latent variable \(X_n\), but do not otherwise affect the future evolution of the system.
If we have constructed the filtering distribution, \[ f_{X_n|Y_{1:n}}(x_n|{y_{1:n}^*}{\, ; \,}\hat\theta),\] for some appropriate parameter estimat \(\hat\theta\), our forecast distribution for \(X_{n+p}\) given \({y_{1:n}^*}\) involves combining the filtering distribution with the transition density for \(X_{n+p}\) given \(X_n\). We can write this as

\[\begin{eqnarray} f_{X_{n+p}|Y_{1:n}}(x_{n+p}|{y_{1:n}^*}{\, ; \,}\hat\theta) &=& \int f_{X_n|Y_{1:n}}(x_n|{y_{1:n}^*}{\, ; \,}\hat\theta) f_{X_{n+p}|X_n}(x_{n+p}{{\, | \,}}x_n{\, ; \,}\hat\theta) \, dx_{n} \\ &=& \int f_{X_n|Y_{1:n}}(x_n|{y_{1:n}^*}{\, ; \,}\hat\theta) \prod_{k=1}^p f_{X_{n+k}|X_{n+k-1}}(x_{n+k}{{\, | \,}}x_{n+k-1}{\, ; \,}\hat\theta) \, dx_{n}\dots dx_{n+p-1} \end{eqnarray}\]

A closely related task is to forecast \(Y_{n+p}\) rather than \(X_{n+p}\).
Since an ARIMA model is a POMP model, this theory also applies to ARIMA analysis and can be implemented in R by applying predict to the output of an arima.
For a nonlinear POMP analysis in pomp, the Monte Carlo approach is to represent the filtering distribution by \(J\) particles produced from pfilter. Then, the Monte Carlo representation of \(f_{X_{n+p}|Y_{1:n}}(x_{n+p}|{y_{1:n}^*}{\, ; \,}\hat\theta)\) comes from applying rprocess to these filter particles.
We may wish to extend this approach to generate a forecast that takes into account uncertainty in \(\theta\). Weighting a set of candidate parameter values by their likelihood given the data provides one way to do that—an empirical Bayes approach.

17.2 Some comments on “fitted values” for time series

In regression analysis, fitted values often play an important role in model diagnostics. It is often useful to plot residuals against fitted values, or to see how much of the variation in the data is explained by the fitted values.
For time series models, the “fitted values” are usually defined to be the 1-step prediction mean. This corresponds to exactly the regression fitted value for an AR model. However, because these fitted values depend on nearby data points (unlike in most regression models) their interpretation needs additional care.
For example, consider the daily S&P 500 index that we saw earlier in Section 3.5 and Chapter 15.
In practice, we would usually start by transforming the data to the daily investment return, by taking differences of logs. To make the current point, let’s instead fit an ARMA(1,1) model to the logarithm of the index, for the most recent 1000 days.

sp500_table <- read.table("sp500.csv",sep=",",header=TRUE)
Nt <- 1000
sp500 <- sp500_table$Close[Nt:1] # data are in reverse order in sp500.csv
sp500_date <- strptime(sp500_table$Date[Nt:1],"%Y-%m-%d")
sp500_time <-  as.numeric(format(sp500_date, format="%Y")) + as.numeric(format(sp500_date, format="%j"))/365
sp500_arma <- arima(log(sp500),order=c(1,0,1))
sp500_arma_fitted <- log(sp500)-resid(sp500_arma)
plot(sp500_time,sp500,log="y",type="l",xlab="Date")
lines(sp500_time,exp(sp500_arma_fitted),col="red")

Naively, the fitted values might seem to be doing a remarkable job of explaining changes in the stock market! What is going on?
Let’s look closely at the fitted values:

plot(sp500_time,sp500,log="y",type="l",xlim=range(tail(sp500_time,100)),xlab="Date")
lines(sp500_time,exp(sp500_arma_fitted),col="red")

We see that the fitted values lag one day behind.
The explanation is that the fitted model is essentially a random walk, \[ Y_{n+1} = Y_n + \epsilon_{n+1} \]

sp500_arma

## 
## Call:
## arima(x = log(sp500), order = c(1, 0, 1))
## 
## Coefficients:
##          ar1     ma1  intercept
##       0.9989  0.0209     7.4622
## s.e.  0.0012  0.0324     0.1565
## 
## sigma^2 estimated as 6.703e-05:  log likelihood = 3383.15,  aic = -6758.29

The one-step prediction is therefore \[ y^P_{n+1} = {y_n^*}.\]
This is just telling us: a reasonable prediction of the S&P 500 index tomorrow is its value today. That is not particularly interesting!
However, let’s look at the correlation between the data and the fitted values:

cor(sp500,sp500_arma_fitted)

## [1] 0.9970619

In many linear regression analysis situations, a correlation of \(0.997\) between the data and the fitted values would be taken as strong evidence that you’ve found a good predictive model!
Here, we can see this is a fallacy. However, it is more generally true that this sort of correlation can be dangerous to interpret as a measure of modeling success.
It is better to ask: Does the fitted model do substantially better than a naive model? Here, we can compare the standard deviation of the prediction error from the AR(1) model with that from the simple rule of predicting tomorrow’s index by today’s value.

sd(log(sp500)[2:length(sp500)]-lag(log(sp500))[1:(length(sp500)-1)])

## [1] 0.008186035

sd(resid(sp500_arma))

## [1] 0.008183854

We see, as basic financial knowldge would lead us to suspect in this situation, that the fitted model is not substantially better than the trivial forecast.
If we computed the correlation between data and fitted values after differencing, we would obtain a correlation close to zero. We can see, in this case, that this would be a better reflection of the predictability of the system.
In other cases, when there is no theoretical principle to guide us to take difference of logs, we can follow the general principle of presenting the improvement in predictability for our favorite model relative to a sensible, simple benchmark model.
This is similar to the ARMA benchmark likelihoods we use to assess the success of nonlinear POMP models.

17.3 Case study: An emerging infectious disease outbreak

Let’s situate ourselves at the beginning of October 2014.
Ebola has quite suddenly turned into a threat to global health and global trade.
The WHO situation report contained data on the number of Ebola cases in each of Guinea, Sierra Leone, and Liberia. Key questions included:
1. How fast will the outbreak unfold?
2. How large will it ultimately prove?
3. What interventions will be most effective?
We carry out an investigation based on King et al. (2015), using data downloaded from the WHO Situation Report of 1 October 2014:

base_url <- "http://kingaa.github.io/sbied/"
read.csv(paste0(base_url,"data/ebola_data.csv"),stringsAsFactors=FALSE,
         colClasses=c(date="Date")) -> dat
sapply(dat,class)

##        week        date     country       cases 
##   "integer"      "Date" "character"   "numeric"

head(dat)

##   week       date country cases
## 1    1 2014-01-05  Guinea 2.244
## 2    2 2014-01-12  Guinea 2.244
## 3    3 2014-01-19  Guinea 0.073
## 4    4 2014-01-26  Guinea 5.717
## 5    5 2014-02-02  Guinea 3.954
## 6    6 2014-02-09  Guinea 5.444

Supplementing these data are population estimates for the three countries.

## Population sizes in Guinea, Liberia, and Sierra Leone (census 2014)
populations <- c(Guinea=10628972,Liberia=4092310,SierraLeone=6190280)

dat %>%
  ggplot(aes(x=date,y=cases,group=country,color=country))+
  geom_line()

17.3.1 An SEIR model with gamma-distributed latent and infectious periods

Many of the early modeling efforts used variants on the simple SEIR model.
Here, we’ll focus on a variant that attempts a more accurate description of the duration of the latent period.
Specifically, this model assumes that the amount of time an infection remains latent is \[\mathrm{LP} \sim \mathrm{Gamma}\left(m,\frac{1}{m\,\alpha}\right),\] where \(m\) is an integer.
This means that the latent period has expectation \(1/\alpha\) and variance \(1/(m\,\alpha)\). In this document, we’ll fix \(m=3\).
We implement Gamma distributions using the so-called linear chain trick.

17.3.2 Process model simulator

rSim <- Csnippet('
  double lambda, beta;
  double *E = &E1;
  beta = R0 * gamma; // Transmission rate
  lambda = beta * I / N; // Force of infection
  int i;

  // Transitions
  // From class S
  double transS = rbinom(S, 1.0 - exp(- lambda * dt)); // No of infections
  // From class E
  double transE[nstageE]; // No of transitions between classes E
  for(i = 0; i < nstageE; i++){
    transE[i] = rbinom(E[i], 1.0 - exp(-nstageE * alpha * dt));
  }
  // From class I
  double transI = rbinom(I, 1.0 - exp(-gamma * dt)); // No of transitions I->R

  // Balance the equations
  S -= transS;
  E[0] += transS - transE[0];
  for(i=1; i < nstageE; i++) {
    E[i] += transE[i-1] - transE[i];
  }
  I += transE[nstageE-1] - transI;
  R += transI;
  N_EI += transE[nstageE-1]; // No of transitions from E to I
  N_IR += transI; // No of transitions from I to R
')

17.3.3 Deterministic skeleton

The deterministic skeleton is an ODE.

skel <- Csnippet('
  double lambda, beta;
  const double *E = &E1;
  double *DE = &DE1;
  beta = R0 * gamma; // Transmission rate
  lambda = beta * I / N; // Force of infection
  int i;

  // Balance the equations
  DS = - lambda * S;
  DE[0] = lambda * S - nstageE * alpha * E[0];
  for (i=1; i < nstageE; i++)
    DE[i] = nstageE * alpha * (E[i-1]-E[i]);
  DI = nstageE * alpha * E[nstageE-1] - gamma * I;
  DR = gamma * I;
  DN_EI = nstageE * alpha * E[nstageE-1];
  DN_IR = gamma * I;
')

17.3.4 Measurement model: overdispersed count data

\(C_t | H_t\) is negative binomial with \({\mathbb{E}}[C_t|H_t] = \rho\,H_t\) and \({\mathrm{Var}}[C_t|H_t] = \rho\,H_t\,(1+k\,\rho\,H_t)\).

dObs <- Csnippet('
  double f;
  if (k > 0.0)
    f = dnbinom_mu(nearbyint(cases),1.0/k,rho*N_EI,1);
  else
    f = dpois(nearbyint(cases),rho*N_EI,1);
  lik = (give_log) ? f : exp(f);
')

rObs <- Csnippet('
  if (k > 0) {
    cases = rnbinom_mu(1.0/k,rho*N_EI);
  } else {
    cases = rpois(rho*N_EI);
  }')

17.3.5 Parameter transformations

toEst <- Csnippet('
  const double *IC = &S_0;
  double *TIC = &TS_0;
  TR0 = log(R0);
  Trho = logit(rho);
  Tk = log(k);
  to_log_barycentric(TIC,IC,4);
')

fromEst <- Csnippet('
  const double *IC = &S_0;
  double *TIC = &TS_0;
  TR0 = exp(R0);
  Trho = expit(rho);
  Tk = exp(k);
  from_log_barycentric(TIC,IC,4);
')

The following function constructs a pomp object to hold the data for any one of the countries.

ebolaModel <- function (country=c("Guinea", "SierraLeone", "Liberia"),
                        timestep = 0.1, nstageE = 3) {

  ctry <- match.arg(country)
  pop <- unname(populations[ctry])
  nstageE <- as.integer(nstageE)

  globs <- paste0("static int nstageE = ",nstageE,";")

  dat <- subset(dat,country==ctry,select=-country)

  ## Create the pomp object
  dat %>% 
    extract(c("week","cases")) %>%
    pomp(
      times="week",
      t0=min(dat$week)-1,
      globals=globs,
      statenames=c("S","E1","I","R","N_EI","N_IR"),
      zeronames=c("N_EI","N_IR"),
      paramnames=c("N","R0","alpha","gamma","rho","k",
                   "S_0","E_0","I_0","R_0"),
      nstageE=nstageE,
      dmeasure=dObs, rmeasure=rObs,
      rprocess=discrete.time.sim(step.fun=rSim, delta.t=timestep),
      skeleton=skel, skeleton.type="vectorfield",
      toEstimationScale=toEst,
      fromEstimationScale=fromEst,
      initializer=function (params, t0, nstageE, ...) {
        all.state.names <- c("S",paste0("E",1:nstageE),"I","R","N_EI","N_IR")
        comp.names <- c("S",paste0("E",1:nstageE),"I","R")
        x0 <- setNames(numeric(length(all.state.names)),all.state.names)
        frac <- c(params["S_0"],rep(params["E_0"]/nstageE,nstageE),params["I_0"],params["R_0"])
        x0[comp.names] <- round(params["N"]*frac/sum(frac))
        x0
      }
    ) -> po
}

ebolaModel("Guinea") -> gin
ebolaModel("SierraLeone") -> sle
ebolaModel("Liberia") -> lbr

17.4 Parameter estimates for Ebola

King et al. (2015) estimated parameters for this model for each country.
A large Latin hypercube design was used to initiate a large number of iterated filtering runs.
Profile likelihoods were computed for each country against the parameters \(k\) (the measurement model overdispersion) and \(R_0\) (the basic reproductive ratio).
Full details are given on the datadryad.org site. The following loads the results of these calculations.

options(stringsAsFactors=FALSE)
profs <- read.csv(paste0(base_url,"/ebola/ebola-profiles.csv"))

The following plots the profile likelihoods.
The horizontal line represents the critical value of the likelihood ratio test for \(p=0.01\).

require(reshape2)
require(plyr)
require(magrittr)
require(ggplot2)
theme_set(theme_bw())

profs %>% 
  melt(id=c("profile","country","loglik")) %>%
  subset(variable==profile) %>%
  ddply(~country,mutate,dll=loglik-max(loglik)) %>%
  ddply(~country+profile+value,subset,loglik==max(loglik)) %>% 
  ggplot(mapping=aes(x=value,y=dll))+
  geom_point(color='red')+
  geom_hline(yintercept=-0.5*qchisq(p=0.99,df=1))+
  facet_grid(country~profile,scales='free')+
  labs(y=expression(l))

17.5 Diagnostics for the Ebola analysis

Parameter estimation is the process of finding the parameters that are “best”, in some sense, for a given model, from among the set of those that make sense for that model.
Model selection, likewise, aims at identifying the “best” model, in some sense, from among a set of candidates.
One can do both of these things more or less well, but no matter how carefully they are done, the best of a bad set of models is still bad.
Let’s investigate the model here, at its maximum-likelihood parameters, to see if we can identify problems.
The guiding principle in this is that, if the model is “good”, then the data are a plausible realization of that model.
- Therefore, we can compare the data directly against model simulations.
- Moreover, we can quantify the agreement between simulations and data in any way we like.
- Any statistic, or set of statistics, that can be applied to the data can also be applied to simulations.
- Shortcomings of the model should manifest themselves as discrepancies between the model-predicted distribution of such statistics and their value on the data.
pomp provides tools to facilitate this process. Specifically, the probe function applies a set of user-specified probes or summary statistics, to the model and the data, and quantifies the degree of disagreement in several ways.
Let’s see how this is done using the model for the Guinean outbreak.

library(pomp)
library(plyr)
library(reshape2)
library(magrittr)
options(stringsAsFactors=FALSE)

profs %>%
  subset(country=="Guinea") %>%
  subset(loglik==max(loglik),
         select=-c(loglik,loglik.se,country,profile)) %>%
  unlist() -> coef(gin)

simulate(gin,nsim=20,as.data.frame=TRUE,include.data=TRUE) %>% 
  mutate(date=min(dat$date)+7*(time-1),
         is.data=ifelse(sim=="data","yes","no")) %>% 
  ggplot(aes(x=date,y=cases,group=sim,color=is.data,
         alpha=is.data))+
  geom_line()+
  guides(color=FALSE,alpha=FALSE)+
  scale_color_manual(values=c(no=gray(0.6),yes='red'))+
  scale_alpha_manual(values=c(no=0.5,yes=1))

The simulations appear to be growing a bit more quickly than the data.
Let’s try to quantify this.
First, we’ll write a function that estimates the exponential growth rate by linear regression. Then, we’ll apply it to the data and to 500 simulations.

growth.rate <- function (y) {
  cases <- y["cases",]
  fit <- lm(log1p(cases)~seq_along(cases))
  unname(coef(fit)[2])
}
probe(gin,probes=list(r=growth.rate),nsim=500) %>% plot()

Do these results bear out our suspicion that the model and data differ in terms of growth rate?
The simulations appear to be more highly variable around the trend than do the data.

growth.rate.plus <- function (y) {
  cases <- y["cases",]
  fit <- lm(log1p(cases)~seq_along(cases))
  c(r=unname(coef(fit)[2]),sd=sd(residuals(fit)))
}
probe(gin,probes=list(growth.rate.plus),
      nsim=500) %>% plot()

Let’s also look more carefully at the distribution of values about the trend using the 1st and 3rd quantiles.
Also, it looks like the data are less jagged than the simulations.
We can quantify this using the autocorrelation function (ACF).

log1p.detrend <- function (y) {
  cases <- y["cases",]
  y["cases",] <- as.numeric(residuals(lm(log1p(cases)~seq_along(cases))))
  y
}

probe(gin,probes=list(
  growth.rate.plus,
  probe.quantile(var="cases",prob=c(0.25,0.75)),
  probe.acf(var="cases",lags=c(1,2,3),type="correlation",
            transform=log1p.detrend)
),nsim=500) %>% plot()

17.5.1 Exercise: the SEIR model for the Sierra Leone outbreak

Apply probes to investigate the extent to which the model is an adequate description of the data from the Sierra Leone outbreak.
Have a look at the probes provided with pomp: ?basic.probes.
Try also to come up with some informative probes of your own. Discuss the implications of your findings.

17.6 Forecasting Ebola

Up to now, we’ve primarily focused on using POMP models to answer scientific questions.
We can also use them to make forecasts.
The key issues are to do with quantifying the forecast uncertainty. This arises from four sources:
1. measurement error
2. process noise
3. parametric uncertainty
4. structural uncertainty
Here, we’ll explore how we can account for the first three of these in making forecasts for the Sierra Leone outbreak.
We follow an empirical Bayes approach:
1. We set up a collection of parameter vectors in a neighborhood of the maximum likelihood estimate containing the region of high likelihood.
2. We carry out a particle filter at each parameter vector, which gives us estimates of both the likelihood and the filter distribution at that parameter value.
3. We simulate forward from each filter distribution, up to the desired forecast horizon, to give the prediction distribution for each parameter vector.
4. We sample from these prediction distributions with probability proportional to the estimated likelihood of the parameter vector.

require(pomp)
require(plyr)
require(reshape2)
require(magrittr)
options(stringsAsFactors=FALSE)

set.seed(988077383L)

## forecast horizon
horizon <- 13

profs %>%
  subset(country=="SierraLeone") %>%
  subset(loglik==max(loglik),
         select=-c(loglik,loglik.se,country,profile)) %>%
  unlist() -> mle

## Weighted quantile function
wquant <- function (x, weights, probs = c(0.025,0.5,0.975)) {
  idx <- order(x)
  x <- x[idx]
  weights <- weights[idx]
  w <- cumsum(weights)/sum(weights)
  rval <- approx(w,x,probs,rule=1)
  rval$y
}

profs %>% 
  subset(country=="SierraLeone",
         select=-c(country,profile,loglik.se)) %>%
  subset(loglik>max(loglik)-0.5*qchisq(df=1,p=0.99)) %>%
  melt(variable.name="parameter") %>%
  ddply(~parameter,summarize,
        min=min(value),max=max(value)) %>%
  subset(parameter!="loglik") %>%
  melt(measure=c("min","max")) %>%
  acast(parameter~variable) -> ranges

params <- sobolDesign(lower=ranges[,'min'],
                      upper=ranges[,'max'],
                      nseq=20)
plot(params)

require(foreach)
require(doMC)
require(iterators)

registerDoMC(cores=4)

set.seed(887851050L,kind="L'Ecuyer")

foreach(p=iter(params,by='row'),
        .inorder=FALSE,
        .combine=rbind,
        .options.multicore=list(preschedule=TRUE,set.seed=TRUE)
) %dopar%
{
  require(pomp)
  
  M1 <- ebolaModel("SierraLeone")
  pf <- pfilter(M1,params=unlist(p),Np=2000,save.states=TRUE)
  pf$saved.states %>% tail(1) %>% melt() %>% 
    dcast(rep~variable,value.var="value") %>%
    ddply(~rep,summarize,S_0=S,E_0=E1+E2+E3,I_0=I,R_0=R) %>%
    melt(id="rep") %>% acast(variable~rep) -> x
  
  pp <- parmat(unlist(p),ncol(x))
  
  simulate(M1,params=pp,obs=TRUE) %>%
    melt() %>%
    mutate(time=time(M1)[time],
           period="calibration",
           loglik=logLik(pf)) -> calib

    M2 <- M1
  time(M2) <- max(time(M1))+seq_len(horizon)
  timezero(M2) <- max(time(M1))
  
  pp[rownames(x),] <- x
  
  simulate(M2,params=pp,obs=TRUE) %>%
    melt() %>%
    mutate(time=time(M2)[time],
           period="projection",
           loglik=logLik(pf)) -> proj
  
  rbind(calib,proj)
} %>% subset(variable=="cases",select=-variable) %>%
  mutate(weight=exp(loglik-mean(loglik))) %>%
  arrange(time,rep) -> sims

ess <- with(subset(sims,time==max(time)),weight/sum(weight))
ess <- 1/sum(ess^2); ess

## [1] 10126.79

sims %>% ddply(~time+period,summarize,prob=c(0.025,0.5,0.975),
               quantile=wquant(value,weights=weight,probs=prob)) %>%
  mutate(prob=mapvalues(prob,from=c(0.025,0.5,0.975),
                        to=c("lower","median","upper"))) %>%
  dcast(period+time~prob,value.var='quantile') %>%
  mutate(date=min(dat$date)+7*(time-1)) -> simq

simq %>% ggplot(aes(x=date))+
  geom_ribbon(aes(ymin=lower,ymax=upper,fill=period),alpha=0.3,color=NA)+
  geom_line(aes(y=median,color=period))+
  geom_point(data=subset(dat,country=="SierraLeone"),
             mapping=aes(x=date,y=cases),color='black')+
  labs(y="cases")

Acknowledgment

These notes draw on material developed for a short course on Simulation-based Inference for Epidemiological Dynamics by Aaron King and Edward Ionides, taught at the University of Washington Summer Institute in Statistics and Modeling in Infectious Diseases, 2015.

References

King, A. A., M. Domenech de Cellès, F. M. G. Magpantay, and P. Rohani. 2015. Avoidable errors in the modelling of outbreaks of emerging pathogens, with special reference to ebola. Proc. R. Soc. Lond. B 282:20150347.

Forecasting and fitted values, with a case study of Ebola

Edward Ionides

04/07/2016