1 Introduction

What started as a few reported cases back in December 2019, the Coronavirus disease 2019 (COVID-19) rapidly turned into a pandemic that affected millions of people all across the world. Since COVID-19 was first declared a pandemic in March 2020, many countries have been implementing their own safety guidelines and regulations in hopes of decreasing the number of positive cases. As of 2022, there have been several variants of the Coronavirus disease discovered (i.e. Delta, Omicron), each with different transmission rates and reported symptoms. Although there has been a considerable amount of improvement in preventing the spread of COVID-19 through the development of vaccines and implementation of health guidelines since the start of the virus, COVID-19 remains a widely researched, yet difficult topic given its uncertainty and variability.

In our project, we focus on the COVID-19 cases in New York, which has one of the highest number of covid cases in the United States. More specifically, we focus on the number of COVID-19 cases in New York City, which has a population size of roughly 18 million[1], the highest number in the state of New York. Our objective is to gain understanding of the trend and spread of COVID-19 cases in a highly affected area. In our analysis, we fit the Susceptible-Infectious-Recovered (SIR) compartmental model, as well as other variations, which includes the Susceptible-Exposed-Infection-Recovered (SEIR) model and the Susceptible-Exposed-Infection-Quarantined-Recovered (SEIQR) models. Through the comparison of their log-likelihoods, we see which components of the model are most significant in explaining the behavior of COVID-19 in New York City.

2 Data

The dataset we used comes from the website of New York government[2], which we filtered to only the data from New York City. The data contains the number of positive cases in New York city and ranges from March 1, 2020 to April 14, 2022.

3 Exploratory Analysis

From the time series plot of COVID-19 cases in New York City, we see primarily three regions with peaks in the number of positive cases. We see a small peak around April 2020, soon after COVID-19 was declared a pandemic, and a small peak around January 2021, right before vaccines became readily available for everyone. The largest peak occured around January 2022, and this could have been due to several different factors, such as the presence of new variants, surge in the number of tourists and amount of travel, less strict regulations, and more.

For our project, we found it difficult to model and account for several peaks in the data. Therefore, we focus on only modeling the Omicron variant, which is present in the highest peak in the data from December 4, 2021 to February 1, 2022.

4 SIR Model

We first consider the SIR model, as it is a basic model and we want to see whether it can do a good job in fitting the data. The SIR model is consist of three stages:

S: susceptible (all individuals)

I: infected (symptomatic)

R: removed (recovered or deceased)

And the model is given by following expression:

\[\begin{eqnarray} \frac{dS(t)}{dt} &=& \frac{-\beta I(t) S(t)}{N}, \\ \frac{dI(t)}{dt} &=& \frac{\beta I(t) S(t)}{N} - \gamma I(t),\\ \frac{dR(t)}{dt} &=& \gamma I(t),\\ S(t) + &I(t)& + R(t) = N, \ \ \forall t \end{eqnarray}\]

The S(t) represent the susceptible population at time t, I(t) represents the infected population at time t, R(t) represents the recovered population at time t, and N represents the total population in this area. The transmission rate is \(\beta\), the recovery rate is \(\mu_{IR}\).

Also, the initial value of I will be the active population on the first day, due to the lack of relevant data. Considering that Omicron is highly infectious, but have milder symptoms and therefore may not be reported, we assume it to be 5000. This active population represents the confirmed cases that have not been resolved, which also means infectious population.The population of New York City, or N, is 1886700[3]. For the initial value of S, as we do not know the true value, we will use the fraction of total population, which is represented by \(\eta\), to predict the S(0). There is an accumulator variable H, which is not shown in the expression above, that is used to tally only the daily confirmed cases, such that \(\text{reports}_t \sim \text{Binomial}(H(t), \rho)\), where \(\rho\) is the reporting rate, and it will be reset to zero at the beginning of each day.[5]

Because the population of New York city changes relatively slowly[3], we will not consider the death rate and how the population of recovered individuals will be affected in this project.

4.1 Initial SIR Model

After many attempts on different parameters, we choose \(\beta=0.48\), \(\mu_{IR}=0.1\), \(\rho=0.74\), \(\eta=0.95\) as our initial parameters. We found that a larger value of \(\eta\) fits the data better, which may be because most of the population had no resistance to Omicron in its early stage[4].

sir_step <- Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt)); 
double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
")

sir_init <- Csnippet("
S = nearbyint(eta*N);
I = 5000;
R = nearbyint((1-eta)*N);
H = 0;
")

dmeas <- Csnippet("
      double ll = dbinom(pos,H,rho,give_log);
      lik =  (!isfinite(ll) ? -1000 : ll );
")

rmeas <- Csnippet("
pos = rbinom(H,rho);
")

# SIR Model
covidSIR = df %>%
  pomp(
    times="time",t0=0,
    rprocess=euler(sir_step,delta.t=1),
    rinit=sir_init,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","mu_IR"),
      logit=c("rho","eta")
    ),
    statenames=c("S","I","R","H"),
    paramnames=c("Beta","mu_IR","eta","rho","N")
  )

set.seed(20220419)

# Population size
pop = 1886700 

# Initial guesses of the parameters for SIR model
sir_params=c(Beta=0.48,mu_IR=0.27,rho=0.74,eta=0.95,N=pop)

# Simulation
covidSIR %>%
  simulate(params=sir_params, nsim=20,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

The likelihood estimate of the initial guess is -76015.4391 with a Monte Carlo standard error of 201.8824.

registerDoRNG(123294940)

# Calculate the likelihood of the initial guess
foreach(i=1:10,.combine=c) %dopar% {
  library(pomp)
  covidSIR %>% pfilter(params=sir_params,Np=5000)
} -> sir_pf

sir_pf %>% logLik() %>% logmeanexp(se=TRUE) -> sir_L_pf

sir_pf[[1]] %>% 
  coef() %>% 
  bind_rows() %>% 
  bind_cols(loglik=sir_L_pf[1],loglik.se=sir_L_pf[2]) %>%
  write_csv("sir_lik.csv")

sir_L_pf

                   se 
-74595.752   2961.375

The simulations based on the starting values are able to capture general trend of the data, but they seem a little too smooth and do not model the peaks well. Next, we will use iterative filtering to search for the maximum likelihood estimates (MLE).

4.2 SIR Local Search

We first run a local search that starts from our initial guess using iterated filtering with 50 iterations. From the trace plot we can see that the likelihood is increasing for some of the runs, while others are stuck in the local maxima and have trouble climbing up the likelihood surface.

registerDoRNG(20220419)

bake(file="local_search_sir.rds",{
  foreach(i=1:20,.combine=c) %dopar% { # 20 calculation in total
    library(pomp) # make sure the library is loaded
    library(tidyverse)
    covidSIR %>%
      mif2( #iterated filtering in pomp
        params=sir_params,
        Np=2000, Nmif=50, # use 2000 particles in filter, perform 50 iterations
        cooling.fraction.50=0.5,
        rw.sd=rw.sd(Beta=0.02,rho=0.02, eta=ivp(0.02), mu_IR=0.02) # random walk standard deviation; ivp: indicate value parameter
      )
  } -> sir_local
  sir_local
}) -> sir_local

sir_local %>%
  traces() %>%
  reshape2::melt() %>%
  ggplot(aes(x=iteration,y=value,group=L1,color=factor(L1)))+
  geom_line()+
  facet_wrap(~variable,scales="free_y")

4.2.1 Results

The search result table displays six groups of estimates with the maximum likelihood values. The estimated parameters and likelihood are both stable, and the maximum likelihood estimate (MLE) obtained a log likelihood of -50126.19 and a standard error of 9.831, which is an improvement from our initial guess.

4.2.1.1 Likelihood Surface

bake(file="local_search_lik_sir.rds",{
  foreach(mf=sir_local,.combine=rbind) %dopar% {
    library(pomp)
    library(tidyverse)
    evals <- replicate(10, logLik(pfilter(mf,Np=20000))) # for each 20 parameter estimates, run 10 pfilter with 20,000 particles
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } %>% filter(is.finite(loglik))  -> sir_lik_local
  sir_lik_local
}) -> sir_lik_local

read_csv("sir_lik.csv") %>%
  bind_rows(sir_lik_local) %>%
  arrange(-loglik) %>%
  filter(is.finite(loglik)) %>%
  write_csv("sir_lik.csv")

pairs(~loglik+Beta+mu_IR+eta+rho,data=sir_lik_local,pch=16)

4.2.1.2 Log-Likelihood

sir_lik_local %>% arrange(-loglik) %>% head %>% 
  knitr::kable(digits = 3, caption = "SIR Local Search Results")

SIR Local Search Results
Beta	mu_IR	rho	eta	N	loglik	loglik.se
3.558	0.006	0.528	0.943	1886700	-50126.19	9.831
3.164	0.006	0.561	0.948	1886700	-50402.28	17.095
3.178	0.006	0.562	0.950	1886700	-50424.36	5.478
3.720	0.006	0.550	0.952	1886700	-50430.15	68.424
4.140	0.006	0.542	0.946	1886700	-50552.04	73.219
2.784	0.006	0.576	0.959	1886700	-50634.28	65.415

4.3 SIR Global Search

Next, we will perform a global search from multiple starting points, using multiple stages of iterated filtering process with longer iterations and a gradually decreasing magnitude of perturbations. We randomly draw 100 sets of starting values from a multivariate uniform distribution where \(\beta\in[1,10]\), \(\mu_{IR}\in[0,7]\), \(\rho\in[0,0.4]\), \(\eta\in[0.4,0.6]\). The final MLE results are shown in the table below.

4.3.1 Results

From the simulations, we see that the fitting result we get is not as good as the initial value, therefore suggesting the need for further exploration. Perhaps adding another component in the model will improve the fit.

4.3.1.1 Log-Likelihood

set.seed(2062379496)

runif_design(
  lower=c(Beta=1,mu_IR=0, rho=0,eta=0.4),
  upper=c(Beta=10,mu_IR=7, rho=0.4,eta=0.6),
  nseq=100
) -> guesses

mf1 <- sir_local[[1]]

fixed_params <- c(N=1886700)

bake(file="global_search_sir.rds",{
  registerDoRNG(1270401374)
  foreach(guess=iter(guesses,"row"), .combine=rbind) %do% {
    mf1 %>%
      mif2(params=c(unlist(guess),fixed_params)) %>%
      mif2(Nmif=50) -> mf
    replicate(
      10,
      mf %>% pfilter(Np=20000) %>% logLik()
    ) %>%
      logmeanexp(se=TRUE) -> ll
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> results
  results
}) %>%
  filter(is.finite(loglik)) -> results

results %>% arrange(-loglik) %>% head %>% 
  knitr::kable(digits = 3, caption = "SIR Global Search Results")

SIR Global Search Results
Beta	mu_IR	rho	eta	N	loglik	loglik.se
2.120	2.787	0.330	0.578	1886700	-56543.52	20.710
3.763	22.616	0.301	0.352	1886700	-57164.20	11077.745
3.803	5.536	0.261	0.366	1886700	-60414.24	14131.677
4.208	21.184	0.289	0.324	1886700	-60660.95	10.370
2.812	13.152	0.306	0.475	1886700	-61194.44	285.154
2.763	20.329	0.293	0.489	1886700	-62105.64	122.515

4.3.1.2 Simulations with the MLE

# MLE result of the parameters for SIR model
opt_sir_params=results %>% arrange(-loglik) %>% slice(1) %>% select(-starts_with("loglik")) %>% unlist()

# Simulation
covidSIR %>% 
  simulate(params=opt_sir_params, nsim=20,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

5 SEIR Model

Compared to the SIR model, the SEIR model adds a stage “E” which means infected individuals must pass a period of latency before becoming infectious[6][7]. In practice, SEIR model is more adaptable than SIR model.

The basic model assumptions and settings are consistent with the SIR model. In addition, we set the initial value of E as 10000[8].

5.1 Initial SEIR Model

We set \(\beta=2.1\), \(\mu_{EI}=0.088\), \(\mu_{IR}=0.25\), \(\rho=0.8\), \(\eta=0.87\) as our initial values for the model. However, we see from the simulations that the curve is still too smooth to fully capture the peaks of the data.

seir_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));

  S -= dN_SE;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

seir_init <- Csnippet("
  S = nearbyint(eta*N);
  E = 10000;
  I = 7000;
  R = nearbyint((1-eta)*N);
  H = 0;
")

dmeas <- Csnippet("
  double ll = dbinom(pos,H,rho,give_log);
  lik =  (!isfinite(ll) ? -1000 : ll );
")

rmeas <- Csnippet("
  pos = rbinom(H,rho);
")

# SEIR Model
covidSEIR = df %>%
  pomp(
    times="time",t0=0,
    rprocess=euler(seir_step,delta.t=7),
    rinit=seir_init,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","mu_EI", "mu_IR"),
      logit=c("rho","eta")
    ),
    statenames=c("S","E","I","R","H"),
    paramnames=c("Beta","mu_EI","mu_IR","eta","rho","N")
  )

set.seed(5312021)

# Initial guesses of the parameters for SEIR model
seir_params=c(Beta=2.1,mu_EI=0.088,mu_IR=0.25,rho=0.8,eta=0.87,N=pop)

# Simulation
covidSEIR %>%
  simulate(params=seir_params, nsim=20,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

The likelihood estimate of the initial guess is -164056.1595 with a Monte Carlo standard error of 231.5666. Next, we will use iterative filtering to search for the MLE.

registerDoRNG(123294940)

# Calculate the likelihood of the initial guess
foreach(i=1:10,.combine=c) %dopar% {
  covidSEIR %>% pfilter(params=seir_params,Np=5000)
} -> seir_pf

seir_pf %>% logLik() %>% logmeanexp(se=TRUE) -> seir_L_pf

seir_pf[[1]] %>% 
  coef() %>% 
  bind_rows() %>% 
  bind_cols(loglik=seir_L_pf[1],loglik.se=seir_L_pf[2]) %>%
  write_csv("seir_lik.csv")

seir_L_pf

                       se 
-164056.1595     231.5666

5.2 SEIR Local Search

We conduct a local search that starts from our initial guess using iterated filtering with 20 iterations. From the trace plot we can see that the likelihood is increasing for most of the runs, \(\mu_{IR}\) decreases and converges, but other parameters cannot be determined.

registerDoRNG(482947940)

bake(file="local_search_seir.rds",{
  foreach(i=1:20,.combine=c) %dopar% {
    covidSEIR %>%
      mif2(
        params = seir_params,
        Np=5000, Nmif=20,
        cooling.fraction.50=0.5,
        rw.sd=rw.sd(Beta=0.02, rho=0.02, mu_EI=0.02, mu_IR=0.02, eta=ivp(0.02)),
        partrans=parameter_trans(log=c("Beta", "mu_EI"),logit=c("rho","eta")),
        paramnames=c("Beta","mu_EI", "rho","eta")
      )
  } -> seir_local
  seir_local
}) -> seir_local


seir_local %>%
  traces() %>%
  reshape2::melt() %>%
  ggplot(aes(x=iteration,y=value,group=L1,color=factor(L1)))+
  geom_line()+
  guides(color=FALSE)+
  facet_wrap(~variable,scales="free_y")

5.2.1 Results

The maximum likelihood estimate is -85890.82, which is higher than that of our initial guess. However, it is smaller than the MLE of the SIR model.

5.2.1.1 Likelihood Surface

registerDoRNG(900242057)

bake(file="local_search_lik_seir.rds",{
  foreach(mf=seir_local,.combine=rbind) %do% {
    evals <- replicate(10, logLik(pfilter(mf,Np=20000)))
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } %>% filter(is.finite(loglik))  -> seir_lik_local
  seir_lik_local
}) -> seir_lik_local

pairs(~loglik+Beta+mu_IR+eta+rho,data=sir_lik_local,pch=16)

5.2.1.2 Log-Likelihood

seir_lik_local %>% arrange(-loglik) %>% head %>% 
  knitr::kable(digits = 3, caption = "SEIR Local Search Results")

SEIR Local Search Results
Beta	mu_EI	mu_IR	rho	eta	N	loglik	loglik.se
4.069	0.215	0.017	0.743	0.861	1886700	-85890.82	244.144
4.136	0.299	0.014	0.768	0.858	1886700	-86506.55	262.503
5.420	0.204	0.018	0.732	0.879	1886700	-87686.85	59.507
2.564	0.374	0.018	0.730	0.871	1886700	-89448.59	31.457
6.958	0.148	0.011	0.814	0.860	1886700	-90701.56	32.706
5.065	0.238	0.018	0.728	0.876	1886700	-91102.38	161.419

5.3 Global Search

Again, we will perform a global search from multiple starting points, using multiple stages of iterated filtering process with longer iterations and gradually decreasing the magnitude of perturbations[9]. We randomly draw 100 sets of starting values from a multivariate uniform distribution where \(\beta\in[2,10]\), \(\mu_{EI}\in[0.1,0.35]\), \(\mu_{IR}\in[0,0.5]\), \(\rho\in[0.65,0.85]\), \(\eta\in[0.85,0.89]\). The final MLE results are shown in the table below.

5.3.1 Results

The likelihood estimate of global search is -85130.68, which is a little higher than that of the local search. However, it is still less than the MLE of the SIR model global search. It semes that the fitting of the simulations using the global estimates is also not as good as that of the simulations using our initial values.

5.3.1.1 Log-Likelihood

set.seed(2062379496)

runif_design(
  lower=c(Beta=2,mu_EI=0.1,mu_IR=0,rho=0.65,eta=0.85),
  upper=c(Beta=10,mu_EI=0.35,mu_IR=0.5,rho=0.85,eta=0.89),
  nseq=100
) -> guesses

mf1 <- seir_local[[1]]

fixed_params <- c(N=1886700)

bake(file="global_search_seir.rds",{
  registerDoRNG(1270401374)
  foreach(guess=iter(guesses,"row"), .combine=rbind) %do% {
    mf1 %>%
      mif2(params=c(unlist(guess),fixed_params)) %>%
      mif2(Nmif=20) -> mf
    replicate(
      10,
      mf %>% pfilter(Np=20000) %>% logLik()
    ) %>%
      logmeanexp(se=TRUE) -> ll
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> results
  results
}) %>%
  filter(is.finite(loglik)) -> results

results %>% arrange(-loglik) %>% head %>% 
  knitr::kable(digits = 3, caption = "SEIR Global Search Results")

SEIR Global Search Results
Beta	mu_EI	mu_IR	rho	eta	N	loglik	loglik.se
44.202	0.109	0.009	0.810	0.887	1886700	-85130.68	75.452
70.967	0.128	0.012	0.779	0.883	1886700	-87906.52	50.333
9.647	0.127	0.011	0.784	0.878	1886700	-88371.65	13.025
12.762	0.077	0.010	0.794	0.900	1886700	-89678.02	26.780
2.460	0.273	0.014	0.722	0.905	1886700	-89884.56	70.212
2.582	0.403	0.018	0.719	0.903	1886700	-90418.21	38.332

5.3.1.2 Simulations with the MLE

# MLE result of the parameters for SEIR model
opt_seir_params=results %>% arrange(-loglik) %>% slice(1) %>% select(-starts_with("loglik")) %>% unlist()

# Simulation
covidSEIR %>%
  simulate(params=opt_seir_params, nsim=20,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

6 SEIQR Model

The Susceptible-Exposed-Infectious-Quarantine-Removed and/or Recovered (SEIQR) model builds off of the SEIR model by including a “Q” term. Considering that isolation is mandatory when exposed to or infected with COVID-19 during the epidemic, we added this factor to the model to explore its effect. SEIQR is defined in a similar as the above models, with slight adjustments to the “E” stage, which now refers to the state where individuals are infected but not infectious, and the “Q” stage, which refers to the quarantine state[10].

While the isolation/quarantine policy in New York is enforced, we do not expect everyone to abide by this rule. Considering this fact, we set the initial value of Q to be 200.

6.1 Initial SEIQR Model

We set our initial values to be \(\beta=2\), \(\mu_I=0.07\), \(\mu_{R1}=0.08\), \(\rho=0.4\), \(\mu_{R2}=0.1\), \(\eta=0.05\), \(N=1886700\). From the overall trend, the model roughly describes the actual curve, and there is no excessive smoothness. However, it is evident that there is still room for improvement in the selecting the right parameters.

seiqr_step <- Csnippet("
  double t1 = rbinom(S,1-exp(-Beta*I*dt));
  double t2 = rbinom(E,1-exp(-dt*mu_I));
  double t3 = rbinom(I,1-exp(-dt*mu_R1));
  double t4 = rbinom(Q,1-exp(-dt*mu_R2));
  S -= t1;
  E += t1 - t2;
  I += t2 - t3;
  Q += t3 - t4;
  R += t4;
"
)

seiqr_dmea <- Csnippet("
  double ll = dnorm(pos,Q,rho*Q+1e-10,give_log);
  lik = (!isfinite(ll) ? -1000 : ll );
"
)

seiqr_rmea <- Csnippet("
  pos= rnorm(Q,rho*Q+1e-10);
"
)

seiqr_rinit <- Csnippet("
  S=nearbyint(eta*N);
  E=10000;
  I=7000;
  Q=200;
  R=nearbyint((1-eta)*N);
"
)

covid_statenames <- c("S","E","I","Q","R")
covid_paramnames <- c("Beta","rho","mu_I","mu_R1","mu_R2","eta","N")
fixed <- c(N=1886700)
covidSEIQR <- pomp(
  data=df,
  times="time",
  t0=0,
  rprocess=euler(
    step.fun=seiqr_step,
    delta.t=1),
  rmeasure=seiqr_rmea,
  dmeasure=seiqr_dmea,
  partrans=parameter_trans(
    log=c("Beta","mu_I","mu_R1","mu_R2","eta","rho")),
  statenames=covid_statenames,
  paramnames=covid_paramnames,
  rinit=seiqr_rinit
)

seiqr_params = c(Beta=2,mu_I=0.07,mu_R1=0.08,rho=0.4,mu_R2=0.1,eta=0.05,N=1886700)
sims <- simulate(covidSEIQR,params=seiqr_params,
                 nsim=5,format="data.frame",include=TRUE,seed = 20220419)

ggplot(sims,mapping=aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))+labs(x="Time",y="Number of Positive Cases")

The likelihood estimate of the initial guess is -6.088247e+02 with a Monte Carlo standard error of 5.111665e-03, which is much better than the estimates of the two previous models. Next, we will use iterative filtering to search for the MLE.

registerDoRNG(123294940)

# Calculate the likelihood of the initial guess
foreach(i=1:10,.combine=c) %dopar% {
  covidSEIQR %>% pfilter(params=seiqr_params,Np=5000)
} -> seiqr_pf

seiqr_pf %>% logLik() %>% logmeanexp(se=TRUE) -> seiqr_L_pf

seiqr_pf[[1]] %>% 
  coef() %>% 
  bind_rows() %>% 
  bind_cols(loglik=seiqr_L_pf[1],loglik.se=seiqr_L_pf[2]) %>%
  write_csv("seiqr_lik.csv")

print(seiqr_L_pf)

                         se 
-6.088247e+02  5.111665e-03

6.2 SEIQR Local Search

The plot of the log likelihood seems to fluctuate around a mean value, with no apparent convergence. Other parameters also fluctuate to a certain extent, with some showing more of a convergence than others. These results may be due to an inappropriate model specification, given that the log likelihood does not seem to converge well[11].

registerDoRNG(482947940)

bake(file="local_search_seiqr.rds",{
  foreach(i=1:20,.combine=c) %dopar% {
    covidSEIQR %>%
      mif2(
        params = seiqr_params,
        Np=5000, Nmif=20,
        cooling.fraction.50=0.5,
        rw.sd=rw.sd(Beta=0.02, rho=0.02, mu_I=0.02, mu_R1=0.02, mu_R2=0.02, eta=ivp(0.02)),
        partrans=parameter_trans(log=c("Beta", "mu_I", "mu_R1", "mu_R2"),logit=c("rho","eta")),
        paramnames=c("Beta","mu_I", "mu_R1", "mu_R2", "rho","eta")
      )
  } -> seiqr_local
  seiqr_local
}) -> seiqr_local

seiqr_local %>%
  traces() %>%
  melt() %>%
  ggplot(aes(x=iteration,y=value,group=L1,color=factor(L1)))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  guides(color=FALSE)+
  facet_wrap(~variable,scales="free_y")

6.2.1 Results

The likelihood estimate is -602.140, higher than that of the initial guess and much higher than that of the previous two models. We also see that the standard error is very small, with a value of 0.001.

6.2.1.1 Likelihood Surface

registerDoRNG(900242057)
bake(file="local_search_lik_seiqr.rds",{
  foreach(mf=seiqr_local,.combine=rbind) %do% {
    evals <- replicate(10, logLik(pfilter(mf,Np=20000)))
    ll <- logmeanexp(evals,se=TRUE)
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } %>% filter(is.finite(loglik))  -> seiqr_lik_local
  seiqr_lik_local
}) -> seiqr_lik_local

read_csv("seiqr_lik.csv") %>%
  bind_rows(seiqr_lik_local) %>%
  arrange(-loglik) %>%
  filter(is.finite(loglik)) %>%
  write_csv("seiqr_lik.csv")

pairs(~loglik+Beta+mu_I+mu_R1+mu_R2+eta+rho,data=seiqr_lik_local,pch=16)

6.2.1.2 Log-Likelihood

seiqr_lik_local %>% arrange(-loglik) %>% head %>% 
  knitr::kable(digits = 3, caption = "SEIQR Local Search Results")

SEIQR Local Search Results
Beta	mu_I	mu_R1	rho	mu_R2	eta	N	loglik	loglik.se
2.898	0.077	0.095	0.574	0.108	0.048	1886700	-602.140	0.001
1.206	0.070	0.103	0.532	0.106	0.050	1886700	-602.188	0.001
1.181	0.076	0.095	0.530	0.104	0.047	1886700	-602.290	0.001
1.756	0.079	0.093	0.546	0.100	0.043	1886700	-602.327	0.002
3.503	0.079	0.089	0.536	0.102	0.045	1886700	-602.445	0.001
3.249	0.062	0.107	0.536	0.098	0.045	1886700	-602.746	0.001

6.3 SEIQR Global Search

We now perform a global search from multiple starting points using iterative filtering. We randomly draw 100 sets of starting values from a multivariate uniform distribution where \(\beta\in[1,5]\), \(\mu_I\in[0.05,0.09]\), \(\mu_{R1}\in[0.08,0.12]\), \(\mu_{R2}\in[0.09,0.12]\), \(\rho\in[0.4,0.6]\), \(\eta\in[0.04,0.0525]\). The final MLE results are shown in the table below.

6.3.1 Results

The results obtained by global search and local search are similar. There is little difference between the simulations using the parameters selected through global search and the simulations using our initial values. In general, they are able to detect the trend, but evidently they are still different from the actual data.

6.3.1.1 Log-Likelihood

set.seed(2062379496)

runif_design(
  lower=c(Beta=1,mu_I=0.05,mu_R1=0.08,mu_R2=0.09,rho=0.4,eta=0.04),
  upper=c(Beta=5,mu_I=0.09,mu_R1=0.12,mu_R2=0.12,rho=0.6,eta=0.0525),
  nseq=100
) -> guesses

mf1 <- seiqr_local[[1]]

fixed_params <- c(N=1886700)

bake(file="global_search_seiqr.rds",{
  registerDoRNG(1270401374)
  foreach(guess=iter(guesses,"row"), .combine=rbind) %do% {
    mf1 %>%
      mif2(params=c(unlist(guess),fixed_params)) %>%
      mif2(Nmif=20) -> mf
    replicate(
      10,
      mf %>% pfilter(Np=20000) %>% logLik()
    ) %>%
      logmeanexp(se=TRUE) -> ll
    mf %>% coef() %>% bind_rows() %>%
      bind_cols(loglik=ll[1],loglik.se=ll[2])
  } -> results
  results
}) %>%
  filter(is.finite(loglik)) -> results

results %>% arrange(-loglik) %>% head %>% 
  knitr::kable(digits = 3, caption = "SEIQR Global Search Results")

SEIQR Global Search Results
Beta	mu_I	mu_R1	mu_R2	rho	eta	N	loglik	loglik.se
2.434	0.084	0.091	0.094	0.565	0.041	1886700	-601.821	0.001
4.552	0.076	0.095	0.093	0.534	0.042	1886700	-601.907	0.001
3.355	0.087	0.095	0.093	0.547	0.040	1886700	-601.919	0.001
5.039	0.073	0.096	0.104	0.538	0.048	1886700	-601.960	0.001
0.394	0.082	0.092	0.099	0.550	0.043	1886700	-601.960	0.001
8.573	0.072	0.093	0.101	0.539	0.046	1886700	-601.977	0.001

6.3.1.2 Simulations with the MLE

# MLE result of the parameters for SEIR model
opt_seiqr_params=results %>% arrange(-loglik) %>% slice(1) %>% select(-starts_with("loglik")) %>% unlist()

# Simulation
covidSEIQR %>%
  simulate(params=opt_seiqr_params, nsim=5,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

7 Conclusion

From our analysis, we see that the three POMP models are able to describe the actual curve of the data to a certain extent. With the SIR and SEIR models, we were able to find initial values that better fit the data; however, the fitting effect of the MLE estimation obtained through global search is worse. This is a result worthy of further study. The log likelihood value of the SEIQR model is the lowest, and the fitting effect of the MLE estimation is the best among the three models. From a statistical perspective, we would conclude this as the best fitting model for the data. However, it is noteworthy that the log likelihood for this model has strong fluctuations instead of a clear convergence.

In terms of the parameters between the three models, we see that the biggest difference is in \(\eta\), the initial susceptible fraction. The \(\eta\) values obtained by the SIR and SEIR models are very large, but the \(\eta\) values obtained by the SEIQR model are very small. Intuitively speaking, because the overall vaccination rate in New York is very high, most people should have a certain resistance to Omicron. Therefore, having a lower \(\eta\) would make more sense, which the SEIQR model was able to capture.

From both a statistical and intuitive standpoint, we conclude that the SEIQR model best models the number of positive COVID-19 cases in New York City from December 4, 2021 to February 1, 2022. This suggests the importance of considering “exposed” and “quarantine” stages to best understand the trend of the number of cases. While our analysis results indicate room for improvement in terms of model fitting and estimation, overall, they are able to depict the rise and fall of COVID-19 case numbers during the impact of the Omnicron variant.

8 Reference

[1]JHU COVID-19 Dashboard,https://gisanddata.maps.arcgis.com/apps/dashboards/index.html#/bda7594740fd40299423467b48e9ecf6
[2]https://coronavirus.health.ny.gov/positive-tests-over-time-region-and-county
[3]https://www.macrotrends.net/cities/23083/new-york-city/population#:~:text=The%20metro%20area%20population%20of,a%200.07%25%20decline%20from%202018
[4]https://statsandr.com/blog/covid-19-in-belgium/
[5]https://ionides.github.io/531w21/final_project/project03/blinded.html
[6]https://ionides.github.io/531w21/final_project/project15/blinded.html
[7]https://kingaa.github.io/sbied/stochsim/notes.pdf
[8]https://julia.quantecon.org/continuous_time/seir_model.html
[9]https://kingaa.github.io/sbied/pfilter/notes.pdf
[10]https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8053363/
[11]https://www.pnas.org/doi/10.1073/pnas.2006520117

STATS 531 Final Project

4/19/2022