ARMA and POMP Analysis on COVID-19 Variants in the US

Introduction

As countries start to see the light at the end of the tunnel in the battle against the COVID-19 pandemic, we wanted to study the daily cases and, using a series of POMP models, each one being adjusted based on the information we now have present about the ground situation during each “segment” of time, attempt to model the number of cases. Our data comes from Our World in Data[1] and contains daily numbers for the count of new cases for the US from 2020-01-23 to 2022-04-01. After running some initial exploratory data analysis on the data set, we were able to divide the time series into 3 distinct groups, which can be characterised by the variant that was the main driver of cases at the time. We therefore had a pre-Delta variant domain, a Delta variant domain and an Omicron variant domain. The graphs, alongside the full view of the entire dataset is presented below.

Explore Data Analysis

##   cases avg_7       date
## 1     1     0 2020-01-23
## 2     1     0 2020-01-24
## 3     0     0 2020-01-25
## 4     1     0 2020-01-26
## 5     0     0 2020-01-27
## 6     0     0 2020-01-28

##      cases             avg_7             date           
##  Min.   :      0   Min.   :     0   Min.   :2020-01-23  
##  1st Qu.:  28183   1st Qu.: 28000   1st Qu.:2020-08-09  
##  Median :  56516   Median : 60380   Median :2021-02-25  
##  Mean   :  99342   Mean   : 99223   Mean   :2021-02-25  
##  3rd Qu.: 120985   3rd Qu.:120164   3rd Qu.:2021-09-13  
##  Max.   :1259946   Max.   :806571   Max.   :2022-04-01

##       date                cases            avg_7       
##  Min.   :2020-01-23   Min.   :     0   Min.   :     0  
##  1st Qu.:2020-05-17   1st Qu.: 26386   1st Qu.: 26706  
##  Median :2020-09-10   Median : 51150   Median : 52849  
##  Mean   :2020-09-10   Mean   : 69266   Mean   : 68924  
##  3rd Qu.:2021-01-04   3rd Qu.: 75440   3rd Qu.: 70492  
##  Max.   :2021-04-30   Max.   :293325   Max.   :250323

##       date                cases            avg_7       
##  Min.   :2021-05-01   Min.   :  8364   Min.   : 11596  
##  1st Qu.:2021-06-23   1st Qu.: 27602   1st Qu.: 26413  
##  Median :2021-08-15   Median : 72618   Median : 73728  
##  Mean   :2021-08-15   Mean   : 74987   Mean   : 74518  
##  3rd Qu.:2021-10-07   3rd Qu.:113520   3rd Qu.:110890  
##  Max.   :2021-11-30   Max.   :198857   Max.   :164521

##       date                cases             avg_7       
##  Min.   :2021-12-01   Min.   :   9124   Min.   : 25252  
##  1st Qu.:2021-12-31   1st Qu.:  42566   1st Qu.: 52433  
##  Median :2022-01-30   Median : 133912   Median :127134  
##  Mean   :2022-01-30   Mean   : 256449   Mean   :257795  
##  3rd Qu.:2022-03-01   3rd Qu.: 388731   3rd Qu.:419001  
##  Max.   :2022-04-01   Max.   :1259946   Max.   :806571

ARMA Modeling

We start our analysis by fitting a benchmark ARMA model. We produce an AIC table to determine the best model see that the best fit is the ARMA(4,4).

aic_table <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p+1,q+1] <- arima(data,order=c(p,0,q), method="ML")$aic
    }
  }
  dimnames(table) <- list(paste("AR",0:P, sep=""),paste("MA",0:Q,sep=""))
  table
}
cnt_aic_table <- aic_table(data$cases,4,4)
knitr::kable(cnt_aic_table,digits=2)

	MA0	MA1	MA2	MA3	MA4
AR0	21245.16	20535.77	20452.18	20059.06	19941.91
AR1	20007.74	19834.31	19811.78	19872.01	19918.81
AR2	19965.71	19965.97	19733.63	19589.33	19919.91
AR3	19772.79	19847.63	19754.16	19611.95	19549.93
AR4	19764.28	19746.47	19764.69	19589.80	19541.82

arima44 <- arima(x = data$cases, order = c(4, 0, 4))
arima44

## 
## Call:
## arima(x = data$cases, order = c(4, 0, 4))
## 
## Coefficients:
##          ar1      ar2     ar3     ar4     ma1     ma2     ma3     ma4
##       0.2438  -0.4574  0.7302  0.1946  0.5139  1.3157  0.3885  0.5522
## s.e.  0.0634   0.0405  0.0381  0.0656  0.0513  0.0381  0.0394  0.0494
##       intercept
##        97526.36
## s.e.   21858.28
## 
## sigma^2 estimated as 2.302e+09:  log likelihood = -9760.91,  aic = 19541.82

par(mfrow=c(1,2))
acf(arima44$residuals)
qqnorm(arima44$residuals)
qqline(arima44$residuals)

From the ACF and QQ-plot, we can tell that the residuals do not appear to be IID and normaly distributed. We then turn to POMP modeling analysis on the COVID-19 data.

POMP Modelling

Pre-Delta Domain

The pre-Delta data was defined as being within the timeframe of 2020-01-23 to April 30th 2021. Before the Delta variant, vaccinations had not yet been widely introduced to the US public and so therefore, we decided to implement an SEIR model, which studies the transitions from susceptible, to exposed, infected and then finally to the recovered patients. We found that the SEIR model was more appropriate than the SIR model due to the presence of this exposed state, as it is a situation we observe in many diseases. More specifically, the concept of “being exposed” to COVID-19 was specifically addressed by local, state and federal governments. The way that people were expected to behave changed depending on whether they had been exposed or not and so we believed this was an important section in our modelling process. The combination of parameter we settled on for this model was inspired by a previous project[2]. The parameters in the SEIR models are summarised as:

\(S_t\): the number of susceptible people at time t
\(E_t\): the number of exposed people at time t
\(I_t\): the number of infectious people at time t
\(R_t\): the number of recovered people at time t

And the model’s differential equations are given by following expressions:

\(\frac{dS_t}{dt} = -\beta/N S_t I_t\)

\(\frac{dE_t}{dt} = \beta/N S_t I_t -\mu_{EI} E_t\)

\(\frac{dI_t}{dt} = \mu_{EI} E_t -\mu_{IR} I_t\)

\(\frac{dR_t}{dt} = \mu_{IR} I_t\)

library(pomp)
library(doParallel)
library(doRNG)
library(iterators)
registerDoParallel()
registerDoRNG(5312022)

seir_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
  S -= dN_SE;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

seir_rinit <- Csnippet("
  S = nearbyint(eta*N);
  E = 0;
  I = 1;
  R = nearbyint((1-eta)*N);
  H = 0;
  ")

dmeas <- Csnippet("
double tol = 1.0e-25;
double mean_cases = rho*H;
double sd_cases = sqrt(mean_cases*mean_cases);

if(reports > 0.0){
lik = pnorm(reports,mean_cases,sd_cases,1,0)- pnorm(reports-0.5,mean_cases,sd_cases,1,0)
+ tol;
} else{
lik = pnorm(reports+0.5,mean_cases,sd_cases,1,0) +tol;
}
if (give_log) lik = log(lik);

")

rmeas <- Csnippet("

reports = rnorm(rho*H, sqrt(rho*H ) );
if (reports > 0.0) {
reports = nearbyint(reports);
} else {
reports = 0.0;
}
")

previous %>% mutate(days = c(1:464), reports=avg_7) -> df1

covid1 = df1 %>%
  pomp(
    times="days",t0=0,
    rprocess=euler(seir_step,delta.t=1),
    rinit=seir_rinit,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","tau","mu_EI","mu_IR"),
      logit=c("rho","eta")
    ),
    paramnames=c("N","Beta","mu_EI","mu_IR","rho","eta","tau"),
    statenames=c("S","E","I","R","H")
  )

We first do the simulation with our initially guessing parameters. Beta=12.9,mu_IR=1.15,mu_EI=0.08,rho=0.8,tau=1,eta=0.1,N=3e8

set.seed(531)
params <- c(Beta=12.9,mu_IR=1.15,mu_EI=0.08,rho=0.8,tau=1,eta=0.1,N=3e8)

covid1 %>%
  simulate(
    params=params,
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data")

params <- c(Beta=12.9,mu_IR=1.15,mu_EI=0.08,rho=0.8,tau=1,eta=0.1,N=3e8)

foreach(i=1:10,.combine=c) %dopar% {
  library(pomp)
  library(tidyverse)
  covid1 %>% pfilter(params=params,Np=100)
} -> pf
pf %>% logLik() %>% logmeanexp(se=TRUE) -> L_pf
L_pf

##                      se 
## -16744.2904    195.8473

Though the initial simulation seems to fit the data well, we can find that the log-likelihood behaves not good for the initial guessing parameters, we then do the local searching.

Local Search

library(doParallel)
registerDoParallel()
registerDoRNG(5312022)

# set the random walk parameters
covid_cooling.fraction.50 <- 0.5
covid_rw.sd <- rw.sd(
    Beta=0.002, rho=0.002,eta=ivp(0.002)
)

params <- c(Beta=12.9,mu_IR=1.15,mu_EI=0.08,rho=0.8,tau=1,eta=0.1,N=3e8)

bake(file="lik_local_1.rds",{
  foreach(i=1:20,.combine=c) %dopar% {
    library(pomp)
    library(tidyverse)
    mif2(covid1,
         params = params,
         Np=2000,
         Nmif=50,
         cooling.fraction.50=covid_cooling.fraction.50,
         rw.sd=covid_rw.sd) 
  } -> mifs_local 
  mifs_local
}) -> mifs_local

coefs_local <- coef(mifs_local)
max_coefs_local <- coefs_local[,which.max(logLik(mifs_local))]

bake(file="local_results_1.rds",{
  foreach(mf=mifs_local, .combine=rbind) %dopar% {
    library(pomp)
    library(tidyverse)
    evals <- replicate(10,logLik(pfilter(mf,Np=20000)))
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> local_results
  
  local_results
}) -> local_results

The top five best local searches:

bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 8) %>%
  arrange(-loglik) -> best_local_searches
head(best_local_searches,5)

## # A tibble: 5 × 9
##    Beta mu_IR mu_EI   rho   tau    eta         N  loglik loglik.se
##   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>   <dbl>     <dbl>
## 1  14.8  1.15  0.08 0.872     1 0.100  300000000 -16474.      5.07
## 2  15.5  1.15  0.08 0.807     1 0.102  300000000 -16678.      4.83
## 3  17.1  1.15  0.08 0.820     1 0.0983 300000000 -16898.      2.67
## 4  17.6  1.15  0.08 0.887     1 0.0994 300000000 -16940.      2.94
## 5  17.5  1.15  0.08 0.864     1 0.101  300000000 -16987.      5.07

bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 8) %>%
  filter(loglik>max(loglik)-10000) -> tmp1

pairs(~loglik+Beta+eta+rho,data=tmp1,pch=16)

We here do simulation for the best local searched parameters:

covid1 %>%
  simulate(
    params=unlist(best_local_searches[1,]),
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data with local parameters searching")

There is certain trend for the loglik value on parameters like Beta, but the loglik plots look so sparse that it does not give us a clear picture or hint of the ridge in likelihood surface. Thus, we move on to do global search.

We here use the best 5 value for local search as well as the initial parameters guessing to consider the parameter box for global searching[3].

Global search

covid_box1 <- rbind(
  Beta=c(10,20),
  mu_EI=c(0.07,0.09),
  rho=c(0.7,1),
  eta=c(0,0.09),
  N=c(3e8,3e8),
  mu_IR=c(1, 1.25),
  tau=c(0.85,1.1)
)

bake(file="mifs_global_1.rds",{
    foreach(i=1:10,.combine=c) %dopar% {
        library(pomp)
        library(tidyverse)
        mif2(covid1,
             params = c(apply(covid_box1,1,function(x)runif(1,x[1],x[2]))),
             Np=2500,
             Nmif=250,
             cooling.fraction.50=covid_cooling.fraction.50,
             rw.sd=covid_rw.sd) 
    } -> mifs_global 
    mifs_global
}) -> mifs_global

bake(file="global_search_1.rds",{
    foreach(mf=mifs_global, .combine=rbind) %dopar% {
        library(pomp)
        library(tidyverse)
        evals <- replicate(10,logLik(pfilter(mf,Np=50000)))
        ll <- logmeanexp(evals,se=TRUE)
        mf %>% coef() %>% bind_rows() %>%
            bind_cols(loglik=ll[1],loglik.se=ll[2])
    } -> global_results
    
    global_results
}) -> global_results1

bind_rows(global_results1) %>%
  filter(loglik.se < 10 ) %>%
  arrange(-loglik) -> best_global_results1
head(as.data.frame(best_global_results1),5)

##       Beta      mu_EI       rho        eta     N    mu_IR       tau    loglik
## 1 12.30763 0.07945842 0.9998831 0.08948002 3e+08 1.053337 0.9708503 -14148.69
## 2 17.28959 0.08106070 0.9093095 0.06954621 3e+08 1.245942 0.9445085 -16174.48
## 3 16.25388 0.07476923 0.8893382 0.07349088 3e+08 1.129485 1.0554523 -16199.03
## 4 16.09867 0.08932817 0.7614243 0.03955038 3e+08 1.233796 1.0900555 -24341.41
##   loglik.se
## 1 9.7205506
## 2 5.9764233
## 3 5.9866589
## 4 0.8067284

The best global search has a likelihood of -14148, which is significantly better than the log-likelihood from the local search. The pairs plot for the global search is as below.

bind_rows(global_results1) %>% 
bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 10) -> tmp
pairs(~loglik+Beta+rho+mu_IR+mu_EI+rho+eta+tau,data=tmp,pch=16, col="red")

set.seed(531)
covid1 %>%
  simulate(
    params=unlist(best_global_results1[1,]),
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data with global parameters searching")

We observe reasonable, but not ideal fits for our data in both the global and local searches. In particular, they both appear to miss the peak of cases, in timing and extent, which appears to occur roughly on day 355 at 250,000 cases. The local search overestimates the peak, whilst the global search underestimates it.

Delta

Vaccinations were a few months into their rollout and the effects were just beginning to become observable by the time the Delta variant started spreading in the US. The dates that we have defined to study the Delta variant is 2021-05-01 to 2021-11-30. For this second segment of our data, we define a SEIRV[4] model, the additional term V included to account for those that have been fully vaccinated.

\(S_t\): the number of susceptible people at time t
\(E_t\): the number of exposed people at time t
\(I_t\): the number of infectious people at time t
\(R_t\): the number of recovered people at time t
\(V_t\): the number of people vaccinated at time t

And the model’s differential equations are given by following expressions:

\(\frac{dS_t}{dt} = -\beta/N S_t I_t -\alpha/N S_t\)

\(\frac{dE_t}{dt} = \beta/N S_t I_t -\mu_{EI} E_t\)

\(\frac{dI_t}{dt} = \mu_{EI} E_t -\mu_{IR} I_t\)

\(\frac{dV_t}{dt} = \alpha/N S_t\)

\(\frac{dR_t}{dt} = \mu_{IR} I_t\)

To justify the parameter setting, H was initialised to the reported cases on 2021-5-1 and the exposed parameter was initialised to the reported cases on 2021-5-2. We set our infected parameter to double the reported cases on the date 2021-5-1. As a point of reference for modelling the vaccinated parameter, the percentage of those vaccinated in the US at the beginning of this time series (on 2021-3-30) was 30.54% and then on 2021-5-1 was 31.15%[1].

seirv_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
  double dN_SV = rbinom(S, 1-exp(-alpha/N*dt));
  S -= dN_SE + dN_SV;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  V += dN_SV;
  H += dN_IR;
")

seirv_rinit <- Csnippet("
  S = nearbyint(eta*N);
  E = 49087;
  I = 2*49743;
  R = nearbyint(N*(1-eta)-(49087+2*49743 + round(N*0.3*0.01) + 49743));
  V = nearbyint(N*0.3*0.01);
  H = 49743;
  ")

dmeas <- Csnippet("
double tol = 1.0e-25;
double mean_cases = rho*H;
double sd_cases = sqrt(mean_cases*mean_cases);

if(reports > 0.0){
lik = pnorm(reports,mean_cases,sd_cases,1,0)- pnorm(reports-0.5,mean_cases,sd_cases,1,0)
+ tol;
} else{
lik = pnorm(reports+0.5,mean_cases,sd_cases,1,0) +tol;
}
if (give_log) lik = log(lik);

")
rmeas <- Csnippet("

reports = rnorm(rho*H, sqrt(rho*H ) );
if (reports > 0.0) {
reports = nearbyint(reports);
} else {
reports = 0.0;
}

")

delta %>% mutate(days = c(1:214), reports=avg_7) -> df2

covid2 = df2 %>%
  pomp(
    times="days",t0=0,
    rprocess=euler(seirv_step,delta.t=1),
    rinit=seirv_rinit,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","alpha", "tau","mu_EI","mu_IR"),
      logit=c("rho","eta")
    ),
    paramnames=c("N","Beta","alpha","mu_EI","mu_IR","rho","eta","tau"),
    statenames=c("S","E","I","R","V", "H")
  )

set.seed(531)
params <- c(Beta=13,alpha = 0.05,mu_IR=1.8,mu_EI=0.09,rho=0.8,tau=1,eta=0.1,N=3e8)

covid2 %>%
  simulate(
    params=params,
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data")

library(doParallel)
library(doRNG)
library(iterators)
registerDoParallel()
registerDoRNG(5312022)

params <- c(Beta=13,alpha = 0.05,mu_IR=1.8,mu_EI=0.09,rho=0.8,tau=1,eta=0.1,N=3e8)

foreach(i=1:10,.combine=c) %dopar% {
  library(pomp)
  library(tidyverse)
  covid2 %>% pfilter(params=params,Np=100)
} -> pf
pf %>% logLik() %>% logmeanexp(se=TRUE) -> L_pf
L_pf

##                          se 
## -3240.0106209     0.6828551

Local Search

library(doParallel)
library(doRNG)
registerDoParallel()
registerDoRNG(5312022)

# set the random walk parameters
covid_cooling.fraction.50 <- 0.5
covid_rw.sd <- rw.sd(
    Beta=0.02, alpha =0.01, rho=0.02,mu_IR=0.01,mu_EI=0.01,eta=ivp(0.002)
)

params <- c(Beta=13,alpha = 0.05,mu_IR=1.8,mu_EI=0.09,rho=0.8,tau=1,eta=0.1,N=3e8)


bake(file="lik_local_2.rds",{
  foreach(i=1:20,.combine=c) %dopar% {
    library(pomp)
    library(tidyverse)
    mif2(covid2,
         params = params,
         Np=2000,
         Nmif=50,
         cooling.fraction.50=covid_cooling.fraction.50,
         rw.sd=covid_rw.sd) 
  } -> mifs_local 
  mifs_local
}) -> mifs_local

coefs_local <- coef(mifs_local)
max_coefs_local <- coefs_local[,which.max(logLik(mifs_local))]

bake(file="local_results_2.rds",{
  foreach(mf=mifs_local, .combine=rbind) %dopar% {
    library(pomp)
    library(tidyverse)
    evals <- replicate(5,logLik(pfilter(mf,Np=2000)))
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> local_results
  
  local_results
}) -> local_results

bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 10) %>%
  arrange(-loglik) -> best_local_searches
head(best_local_searches,5)

## # A tibble: 5 × 10
##    Beta  alpha mu_IR mu_EI   rho   tau    eta         N loglik loglik.se
##   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>     <dbl>  <dbl>     <dbl>
## 1 11.7  0.0674 1.75  0.170 0.898     1 0.0992 300000000 -5759.     1.00 
## 2  9.06 0.107  0.855 0.152 0.789     1 0.0989 300000000 -6159.     0.896
## 3  8.41 0.190  0.880 0.199 0.929     1 0.101  300000000 -6454.     9.95 
## 4 13.7  0.0658 2.63  0.215 0.959     1 0.0994 300000000 -7291.     0.164
## 5 10.6  0.0592 1.37  0.274 0.947     1 0.0999 300000000 -7481.     0.138

set.seed(531)
covid2 %>%
  simulate(
    params=unlist(best_local_searches[1,]),
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data with local parameters searching")

Global Search

covid_box2 <- rbind(
  Beta=c(7,15),
  mu_EI=c(0.07,0.09),
  alpha=c(0,0.2),
  rho=c(0.5,1),
  eta=c(0,0.5),
  N=c(3e8,3e8),
  mu_IR=c(0.7, 1.5),
  tau=c(0,2)
)

bake(file="mifs_global_2.rds",{
    foreach(i=1:20,.combine=c) %dopar% {
        library(pomp)
        library(tidyverse)
        mif2(covid2,
             params = c(apply(covid_box2,1,function(x)runif(1,x[1],x[2]))),
             Np=3000,
             Nmif=250,
             cooling.fraction.50=covid_cooling.fraction.50,
             rw.sd=covid_rw.sd) 
    } -> mifs_global2
    mifs_global2
}) -> mifs_global2

bake(file="global_search_2.rds",{
    foreach(mf=mifs_global2, .combine=rbind) %dopar% {
        library(pomp)
        library(tidyverse)
        evals <- replicate(10,logLik(pfilter(mf,Np=50000)))
        ll <- logmeanexp(evals,se=TRUE)
        mf %>% coef() %>% bind_rows() %>%
            bind_cols(loglik=ll[1],loglik.se=ll[2])
    } -> global_results2
    
    global_results2
}) -> global_results2

The best global search had the following coefficients and log likelihood and simulated results:

bind_rows(global_results2) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 5 ) %>%
  arrange(-loglik) -> best_global_results2
head(as.data.frame(best_global_results2),5)

##       Beta      mu_EI      alpha       rho        eta     N     mu_IR       tau
## 1 6.478088 0.09201846 0.10899263 0.9887020 0.11280416 3e+08 0.8590916 1.1137165
## 2 6.627273 0.07277982 0.06059614 0.9964879 0.07452394 3e+08 0.4207374 0.7006898
## 3 1.693379 0.75722150 0.06664166 0.9017059 0.43622504 3e+08 1.1279543 1.0546965
## 4 2.146755 1.24823145 0.22935993 0.7937392 0.34884011 3e+08 1.0262129 1.0916003
## 5 2.064472 1.83016286 0.01830383 0.8183197 0.36125617 3e+08 1.0295068 0.9732440
##      loglik    loglik.se
## 1 -2707.706 0.0007600655
## 2 -2709.161 0.0008548933
## 3 -2919.093 2.6375954870
## 4 -6378.334 3.4227820985
## 5 -6605.254 0.3478611999

The best global search has a likelihood of -2707 and 0.00076 se, which is significantly better than the log-likelihood from the local search. The pairs plot and simulation plot for the global search are as below.

bind_rows(global_results2) %>% 
bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 5)  -> temp
pairs(~loglik+Beta+rho+mu_IR+mu_EI+alpha,data=temp,pch=16, col="red")

covid2 %>%
  simulate(
    params=unlist(best_global_results2[1,]),
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims

sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data with global parameters searching")

Our global and local search return distinct simulation curves, with the local search providing a more accurate estimate. Both simulations appear to not align well with the data in terms of estimating the extent of the peak, whilst the global search attempts to model the timing (day 125) more accurately.

Omicron

We continue to use a SEIRV[4] model to fit the Omicron data, which is defined as the daily cases between the dates 2021-12-01 and 2022-04-01. Situationally, there was little change in the US that could be translated into our parameters, other than a change in the vaccination rate. We here consider people get fully vaccinated can also be infected by the virus.

\(S_t\): the number of susceptible people at time t
\(E_t\): the number of exposed people at time t
\(I_t\): the number of infectious people at time t
\(R_t\): the number of recovered people at time t
\(V_t\): the number of people vaccinated at time t

And the model’s differential equations are given by following expressions:

\(\frac{dS_t}{dt} = -\beta/N S_t I_t -\alpha/N S_t\)

\(\frac{dE_t}{dt} = \beta/N S_t I_t -\mu_{EI} E_t + \sigma \beta/N V_t I_t\)

\(\frac{dI_t}{dt} = \mu_{EI} E_t -\mu_{IR} I_t\)

\(\frac{dV_t}{dt} = \alpha/N S_t -\sigma \beta/N V_t I_t\)

\(\frac{dR_t}{dt} = \mu_{IR} I_t\)

Parameters were initialised using the same principles as with the Delta variant. The reported cases on 2021-12-1 was set as the initial value of H and the reported cases on 2021-12-2 was set as the initial value of E. We set our infected parameter to double the reported cases on the date 2021-12-01. We also observed that the percentage of the vaccinated population stood at 59.35% on 2021-11-30 and 59.45% on 2021-12-1.

seirv2_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
  double dN_SV = rbinom(S, 1-exp(-alpha/N*dt));
  double dN_VE = rbinom(V, 1-exp(-sigma*Beta*I/N*dt));
  S -= dN_SE + dN_SV;
  E += dN_SE + dN_VE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  V += dN_SV - dN_VE;
  H += dN_IR;
")

seirv2_rinit <- Csnippet("
    S = round(N*eta);
    E = 97550;
    I = 2*80513;
    R = round(N*(1-eta)-(97550+2*80513 + round(N*(0.5945-0.5935)) + 87248)); 
    V = round(N*(0.5945-0.5935));
    H = 87248;
  ")

dmeas <- Csnippet("
double tol = 1.0e-25;
double mean_cases = rho*H;
double sd_cases = sqrt(mean_cases*mean_cases);

if(reports > 0.0){
lik = pnorm(reports,mean_cases,sd_cases,1,0)- pnorm(reports-0.5,mean_cases,sd_cases,1,0)
+ tol;
} else{
lik = pnorm(reports+0.5,mean_cases,sd_cases,1,0) +tol;
}
if (give_log) lik = log(lik);

")
rmeas <- Csnippet("

reports = rnorm(rho*H, sqrt(rho*H ) );
if (reports > 0.0) {
reports = nearbyint(reports);
} else {
reports = 0.0;
}

")

omicron %>% mutate(days = c(1:122), reports=avg_7) -> df3

covid3 = df3 %>%
  pomp(
    times="days",t0=0,
    rprocess=euler(seirv2_step,delta.t=1),
    rinit=seirv2_rinit,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","alpha", "sigma","tau","mu_EI","mu_IR"),
      logit=c("rho","eta")
    ),
    paramnames=c("N","Beta","alpha","sigma","mu_EI","mu_IR","rho","eta","tau"),
    statenames=c("S","E","I","R","V", "H")
  )

set.seed(531)
params <- c(Beta=30,alpha =0.05,sigma=0.02,mu_IR=1.5,mu_EI=0.1,rho=0.8,tau=1,eta=0.1,N=3e8)

covid3 %>%
  simulate(
    params=params,
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data")

library(doParallel)
library(doRNG)
library(iterators)
registerDoParallel()
registerDoRNG(5312022)

params <- c(Beta=30,alpha =0.05,sigma=0.02,mu_IR=1.5,mu_EI=0.1,rho=0.8,tau=1,eta=0.1,N=3e8)

foreach(i=1:10,.combine=c) %dopar% {
  library(pomp)
  library(tidyverse)
  covid3 %>% pfilter(params=params,Np=100)
} -> pf
pf %>% logLik() %>% logmeanexp(se=TRUE) -> L_pf
L_pf

##                          se 
## -3603.1566874     0.1330475

Local Search

library(doParallel)
registerDoParallel()
registerDoRNG(5312022)

# set the random walk parameters
covid_cooling.fraction.50 <- 0.5
covid_rw.sd <- rw.sd(
    Beta=0.002, alpha =0.001,sigma =0.002, rho=0.002,mu_IR=0.01,mu_EI=0.01,eta=ivp(0.002)
)

params <- c(Beta=30,alpha =0.05,sigma=0.02,mu_IR=1.5,mu_EI=0.1,rho=0.8,tau=1,eta=0.1,N=3e8)


bake(file="lik_local_3.rds",{
  foreach(i=1:20,.combine=c) %dopar% {
    library(pomp)
    library(tidyverse)
    mif2(covid3,
         params = params,
         Np=2000,
         Nmif=50,
         cooling.fraction.50=covid_cooling.fraction.50,
         rw.sd=covid_rw.sd) 
  } -> mifs_local 
  mifs_local
}) -> mifs_local

coefs_local <- coef(mifs_local)
max_coefs_local <- coefs_local[,which.max(logLik(mifs_local))]

bake(file="local_results_3.rds",{
  foreach(mf=mifs_local, .combine=rbind) %dopar% {
    library(pomp)
    library(tidyverse)
    evals <- replicate(5,logLik(pfilter(mf,Np=2000)))
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> local_results
  
  local_results
}) -> local_results

bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < 0.5) %>%
  arrange(-loglik) -> best_local_searches
head(best_local_searches,5)

## # A tibble: 5 × 11
##    Beta  alpha  sigma mu_IR  mu_EI   rho   tau    eta         N loglik loglik.se
##   <dbl>  <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl>  <dbl>     <dbl>  <dbl>     <dbl>
## 1  24.6 0.0507 0.0166  1.23 0.0693 0.815     1 0.0975 300000000 -1645.  0.00127 
## 2  23.5 0.0460 0.0234  1.27 0.0723 0.817     1 0.0979 300000000 -1645.  0.00106 
## 3  24.5 0.0530 0.0163  1.11 0.0668 0.827     1 0.0964 300000000 -1645.  0.00161 
## 4  24.5 0.0517 0.0199  1.23 0.0691 0.800     1 0.0957 300000000 -1645.  0.00148 
## 5  24.8 0.0465 0.0224  1.16 0.0667 0.801     1 0.0983 300000000 -1645.  0.000788

covid3 %>%
  simulate(
    params=unlist(best_local_searches[1,]),
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data with local parameters searching")

Global Search

covid_box <- rbind(
  Beta=c(25,35),
  mu_EI=c(0,0.5),
  alpha=c(0,0.1),
  rho=c(0,1),
  eta=c(0,0.5),
  N=c(3e8,3e8),
  mu_IR=c(0, 5),
  tau=c(0,2),
  sigma=c(0,0.5)
)
#Beta=30,alpha =0.05,sigma=0.02,mu_IR=1.5,mu_EI=0.1,rho=0.8,tau=1,eta=0.1,N=3e8

bake(file="mifs_global3.rds",{
    foreach(i=1:20,.combine=c) %dopar% {
        library(pomp)
        library(tidyverse)
        mif2(covid3,
             params = c(apply(covid_box,1,function(x)runif(1,x[1],x[2]))),
             Np=20000,
             Nmif=250,
             cooling.fraction.50=covid_cooling.fraction.50,
             rw.sd=covid_rw.sd) 
    } -> mifs_global 
    mifs_global
}) -> mifs_global

bake(file="global_search3.rds",{
    foreach(mf=mifs_global, .combine=rbind) %dopar% {
        library(pomp)
        library(tidyverse)
        evals <- replicate(10,logLik(pfilter(mf,Np=50000)))
        ll <- logmeanexp(evals,se=TRUE)
        mf %>% coef() %>% bind_rows() %>%
            bind_cols(loglik=ll[1],loglik.se=ll[2])
    } -> global_results
    
    global_results
}) -> global_results3

bind_rows(global_results3) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < .5 ) %>%
  arrange(-loglik) -> best_global_results3
head(as.data.frame(best_global_results3),5)

##       Beta      mu_EI      alpha       rho        eta     N     mu_IR
## 1 14.49075 0.15716263 0.05318494 0.9235025 0.10309679 3e+08 2.0406757
## 2 22.96002 0.06955564 0.06941834 0.7861368 0.10874120 3e+08 1.2238263
## 3 33.97512 0.06085419 0.05191722 0.8679845 0.06227463 3e+08 0.9314555
## 4 25.25646 0.06298270 0.03346174 0.7359994 0.11114879 3e+08 1.1917544
## 5 23.59684 0.05886329 0.01069119 0.6823240 0.13707199 3e+08 1.0721675
##          tau      sigma    loglik    loglik.se
## 1 0.81494483 0.27376243 -1640.246 0.0003151663
## 2 0.62841287 0.30899137 -1648.572 0.0002551736
## 3 0.05934004 0.61653863 -1648.919 0.0003069671
## 4 0.77488153 0.32587294 -1649.572 0.0002221989
## 5 1.40495745 0.00282577 -1657.235 0.0001244582

bind_rows(global_results3) %>% 
bind_rows(local_results) %>%
  filter(is.finite(loglik)) %>%
  filter(loglik.se < .5)-> temp
pairs(~loglik+Beta+rho+mu_IR+mu_EI+alpha,data=temp,pch=16, col="red")

set.seed(531)
covid3 %>%
  simulate(
    params=unlist(best_global_results3[1,]),
    nsim=10,format="data.frame",include.data=TRUE
  ) -> sims
sims %>%
  ggplot(aes(x=days,y=reports,group=.id,color=.id=="data"))+
  geom_line()+
  guides(color="none")+labs(title="Simulation for data with global parameters searching")

Conclusion

Our most accurate simulations were for the SEIR model on our pre-Delta variant data. We tried a variety of parameters and iterations, but our simulations on our Delta and Omicron data did not return accurate fits. There are a few reasons as to why we might expect this to occur.

Our pre-Delta data is, relatively speaking, an easier dataset to model. We don’t have to incorporate any vaccination data since we are in the early stages of the disease spread and with this, we don’t need to worry about the rate of change of vaccination rates. The government response in the US was minimal other than encouraging people to stay at home and limit contact, but this proved to be useless given the contagiousness of the disease.
The US as a whole is a difficult country to model. State government mandates essentially ensured that each state operated as its own “mini-country” and attempting to model these as one, single entity might not make the most logical sense. The laws implemented to counteract COVID in states such as New York for example, were vastly different and far stricter than those implemented in Texas.
Vaccination rates aren’t uniform. At the beginning of the vaccine rollout the US did run into the issue of supply restrictions across the country. Differing attitudes towards the vaccine in different states also contributes towards varying rates across the country.

References

[1] https://ourworldindata.org/coronavirus

[2] https://ionides.github.io/531w21/final_project/project02/blinded.html

[3] https://ionides.github.io/531w21/final_project/project13/blinded.html

[4] https://ionides.github.io/531w21/final_project/project03/blinded.html

Scholarship Acknowledgement

The most analysis methods on COVID-19 dataset in US with POMP model are referenced from previous final projects in 2021, where we combined the advantages from different projects and also add our novel contributions based on our own data. As mentioned in above text, we inspired by [2] to use dataset from reported infectious number in US but with more time size (from 2020-1-23 to 2022-04-01), and use more smoothed data “7 days averagely report” to do the analysis, which may mitigate the backward of noise. After the EDA and ARMA modeling, we separate the date rather than simply the time but more according to different variants domain period to estimate different parameters for different models. For SEIRV model, we refer to [4] but separate Delta domain and Omicron domain by whether the fully vaccinated people can be infected. For local and global search’s format, we refer to [3].

ARMA and POMP Analysis on COVID-19 Variants in the US

19/04/2022

Introduction

Explore Data Analysis

ARMA Modeling

POMP Modelling

Pre-Delta Domain

Local Search

Global search

Delta

Local Search

Global Search

Omicron

Local Search

Global Search

Conclusion

References

Scholarship Acknowledgement