1 Introduction

What started as a few reported cases back in December 2019, the Coronavirus disease 2019 (COVID-19) rapidly turned into a pandemic that affected millions of people all across the world. Since COVID-19 was first declared a pandemic in March 2020, many countries have been implementing their own safety guidelines and regulations in hopes of decreasing the number of positive cases. As of 2022, there have been several variants of the Coronavirus disease discovered (i.e. Delta, Omicron), each with different transmission rates and reported symptoms. Although there has been a considerable amount of improvement in preventing the spread of COVID-19 through the development of vaccines and implementation of health guidelines since the start of the virus, COVID-19 remains a widely researched, yet difficult topic given its uncertainty and variability.

In our project, we focus on the COVID-19 cases in New York, which has one of the highest number of covid cases in the United States. More specifically, we focus on the number of COVID-19 cases in New York City, which has a population size of roughly 18 million[1], the highest number in the state of New York. Our objective is to gain understanding of the trend and spread of COVID-19 cases in a highly affected area. In our analysis, we fit the Susceptible-Infectious-Recovered (SIR) compartmental model, as well as other variations, which includes the Susceptible-Exposed-Infection-Recovered (SEIR) model and the Susceptible-Exposed-Infection-Quarantined-Recovered (SEIQR) models. Through the comparison of their log-likelihoods, we see which components of the model are most significant in explaining the behavior of COVID-19 in New York City.

2 Data

The dataset we used comes from the website of New York government[2], which we filtered to only the data from New York City. The data contains the number of positive cases in New York city and ranges from March 1, 2020 to April 14, 2022.

3 Exploratory Analysis

From the time series plot of COVID-19 cases in New York City, we see primarily three regions with peaks in the number of positive cases. We see a small peak around April 2020, soon after COVID-19 was declared a pandemic, and a small peak around January 2021, right before vaccines became readily available for everyone. The largest peak occured around January 2022, and this could have been due to several different factors, such as the presence of new variants, surge in the number of tourists and amount of travel, less strict regulations, and more.

For our project, we found it difficult to model and account for several peaks in the data. Therefore, we focus on only modeling the Omicron variant, which is present in the highest peak in the data from December 4, 2021 to February 1, 2022.

4 SIR Model

We first consider the SIR model, as it is a basic model and we want to see whether it can do a good job in fitting the data. The SIR model is consist of three stages:

S: susceptible (all individuals)

I: infected (symptomatic)

R: removed (recovered or deceased)

And the model is given by following expression:

\[\begin{eqnarray} \frac{dS(t)}{dt} &=& \frac{-\beta I(t) S(t)}{N}, \\ \frac{dI(t)}{dt} &=& \frac{\beta I(t) S(t)}{N} - \gamma I(t),\\ \frac{dR(t)}{dt} &=& \gamma I(t),\\ S(t) + &I(t)& + R(t) = N, \ \ \forall t \end{eqnarray}\]

The S(t) represent the susceptible population at time t, I(t) represents the infected population at time t, R(t) represents the recovered population at time t, and N represents the total population in this area. The transmission rate is \(\beta\), the recovery rate is \(\mu_{IR}\).

Also, the initial value of I will be the active population on the first day, due to the lack of relevant data. Considering that Omicron is highly infectious, but have milder symptoms and therefore may not be reported, we assume it to be 5000. This active population represents the confirmed cases that have not been resolved, which also means infectious population.The population of New York City, or N, is 1886700[3]. For the initial value of S, as we do not know the true value, we will use the fraction of total population, which is represented by \(\eta\), to predict the S(0). There is an accumulator variable H, which is not shown in the expression above, that is used to tally only the daily confirmed cases, such that \(\text{reports}_t \sim \text{Binomial}(H(t), \rho)\), where \(\rho\) is the reporting rate, and it will be reset to zero at the beginning of each day.[5]

Because the population of New York city changes relatively slowly[3], we will not consider the death rate and how the population of recovered individuals will be affected in this project.

4.1 Initial SIR Model

After many attempts on different parameters, we choose \(\beta=0.48\), \(\mu_{IR}=0.1\), \(\rho=0.74\), \(\eta=0.95\) as our initial parameters. We found that a larger value of \(\eta\) fits the data better, which may be because most of the population had no resistance to Omicron in its early stage[4].

sir_step <- Csnippet("
double dN_SI = rbinom(S,1-exp(-Beta*I/N*dt)); 
double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
S -= dN_SI;
I += dN_SI - dN_IR;
R += dN_IR;
H += dN_IR;
")

sir_init <- Csnippet("
S = nearbyint(eta*N);
I = 5000;
R = nearbyint((1-eta)*N);
H = 0;
")

dmeas <- Csnippet("
      double ll = dbinom(pos,H,rho,give_log);
      lik =  (!isfinite(ll) ? -1000 : ll );
")

rmeas <- Csnippet("
pos = rbinom(H,rho);
")

# SIR Model
covidSIR = df %>%
  pomp(
    times="time",t0=0,
    rprocess=euler(sir_step,delta.t=1),
    rinit=sir_init,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","mu_IR"),
      logit=c("rho","eta")
    ),
    statenames=c("S","I","R","H"),
    paramnames=c("Beta","mu_IR","eta","rho","N")
  )

set.seed(20220419)

# Population size
pop = 1886700 

# Initial guesses of the parameters for SIR model
sir_params=c(Beta=0.48,mu_IR=0.27,rho=0.74,eta=0.95,N=pop)

# Simulation
covidSIR %>%
  simulate(params=sir_params, nsim=20,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

The likelihood estimate of the initial guess is -76015.4391 with a Monte Carlo standard error of 201.8824.

registerDoRNG(123294940)

# Calculate the likelihood of the initial guess
foreach(i=1:10,.combine=c) %dopar% {
  library(pomp)
  covidSIR %>% pfilter(params=sir_params,Np=5000)
} -> sir_pf

sir_pf %>% logLik() %>% logmeanexp(se=TRUE) -> sir_L_pf

sir_pf[[1]] %>% 
  coef() %>% 
  bind_rows() %>% 
  bind_cols(loglik=sir_L_pf[1],loglik.se=sir_L_pf[2]) %>%
  write_csv("sir_lik.csv")

sir_L_pf
                   se 
-74595.752   2961.375 

The simulations based on the starting values are able to capture general trend of the data, but they seem a little too smooth and do not model the peaks well. Next, we will use iterative filtering to search for the maximum likelihood estimates (MLE).

5 SEIR Model

Compared to the SIR model, the SEIR model adds a stage “E” which means infected individuals must pass a period of latency before becoming infectious[6][7]. In practice, SEIR model is more adaptable than SIR model.

The basic model assumptions and settings are consistent with the SIR model. In addition, we set the initial value of E as 10000[8].

5.1 Initial SEIR Model

We set \(\beta=2.1\), \(\mu_{EI}=0.088\), \(\mu_{IR}=0.25\), \(\rho=0.8\), \(\eta=0.87\) as our initial values for the model. However, we see from the simulations that the curve is still too smooth to fully capture the peaks of the data.

seir_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));

  S -= dN_SE;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

seir_init <- Csnippet("
  S = nearbyint(eta*N);
  E = 10000;
  I = 7000;
  R = nearbyint((1-eta)*N);
  H = 0;
")

dmeas <- Csnippet("
  double ll = dbinom(pos,H,rho,give_log);
  lik =  (!isfinite(ll) ? -1000 : ll );
")

rmeas <- Csnippet("
  pos = rbinom(H,rho);
")

# SEIR Model
covidSEIR = df %>%
  pomp(
    times="time",t0=0,
    rprocess=euler(seir_step,delta.t=7),
    rinit=seir_init,
    rmeasure=rmeas,
    dmeasure=dmeas,
    accumvars="H",
    partrans=parameter_trans(
      log=c("Beta","mu_EI", "mu_IR"),
      logit=c("rho","eta")
    ),
    statenames=c("S","E","I","R","H"),
    paramnames=c("Beta","mu_EI","mu_IR","eta","rho","N")
  )

set.seed(5312021)

# Initial guesses of the parameters for SEIR model
seir_params=c(Beta=2.1,mu_EI=0.088,mu_IR=0.25,rho=0.8,eta=0.87,N=pop)

# Simulation
covidSEIR %>%
  simulate(params=seir_params, nsim=20,format="data.frame", include.data=TRUE) %>%
  ggplot(aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+
  xlab("Time") + 
  ylab("Number of Positive Cases")+
  scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))

The likelihood estimate of the initial guess is -164056.1595 with a Monte Carlo standard error of 231.5666. Next, we will use iterative filtering to search for the MLE.

registerDoRNG(123294940)

# Calculate the likelihood of the initial guess
foreach(i=1:10,.combine=c) %dopar% {
  covidSEIR %>% pfilter(params=seir_params,Np=5000)
} -> seir_pf

seir_pf %>% logLik() %>% logmeanexp(se=TRUE) -> seir_L_pf

seir_pf[[1]] %>% 
  coef() %>% 
  bind_rows() %>% 
  bind_cols(loglik=seir_L_pf[1],loglik.se=seir_L_pf[2]) %>%
  write_csv("seir_lik.csv")

seir_L_pf
                       se 
-164056.1595     231.5666 

6 SEIQR Model

The Susceptible-Exposed-Infectious-Quarantine-Removed and/or Recovered (SEIQR) model builds off of the SEIR model by including a “Q” term. Considering that isolation is mandatory when exposed to or infected with COVID-19 during the epidemic, we added this factor to the model to explore its effect. SEIQR is defined in a similar as the above models, with slight adjustments to the “E” stage, which now refers to the state where individuals are infected but not infectious, and the “Q” stage, which refers to the quarantine state[10].

While the isolation/quarantine policy in New York is enforced, we do not expect everyone to abide by this rule. Considering this fact, we set the initial value of Q to be 200.

6.1 Initial SEIQR Model

We set our initial values to be \(\beta=2\), \(\mu_I=0.07\), \(\mu_{R1}=0.08\), \(\rho=0.4\), \(\mu_{R2}=0.1\), \(\eta=0.05\), \(N=1886700\). From the overall trend, the model roughly describes the actual curve, and there is no excessive smoothness. However, it is evident that there is still room for improvement in the selecting the right parameters.

seiqr_step <- Csnippet("
  double t1 = rbinom(S,1-exp(-Beta*I*dt));
  double t2 = rbinom(E,1-exp(-dt*mu_I));
  double t3 = rbinom(I,1-exp(-dt*mu_R1));
  double t4 = rbinom(Q,1-exp(-dt*mu_R2));
  S -= t1;
  E += t1 - t2;
  I += t2 - t3;
  Q += t3 - t4;
  R += t4;
"
)

seiqr_dmea <- Csnippet("
  double ll = dnorm(pos,Q,rho*Q+1e-10,give_log);
  lik = (!isfinite(ll) ? -1000 : ll );
"
)

seiqr_rmea <- Csnippet("
  pos= rnorm(Q,rho*Q+1e-10);
"
)

seiqr_rinit <- Csnippet("
  S=nearbyint(eta*N);
  E=10000;
  I=7000;
  Q=200;
  R=nearbyint((1-eta)*N);
"
)

covid_statenames <- c("S","E","I","Q","R")
covid_paramnames <- c("Beta","rho","mu_I","mu_R1","mu_R2","eta","N")
fixed <- c(N=1886700)
covidSEIQR <- pomp(
  data=df,
  times="time",
  t0=0,
  rprocess=euler(
    step.fun=seiqr_step,
    delta.t=1),
  rmeasure=seiqr_rmea,
  dmeasure=seiqr_dmea,
  partrans=parameter_trans(
    log=c("Beta","mu_I","mu_R1","mu_R2","eta","rho")),
  statenames=covid_statenames,
  paramnames=covid_paramnames,
  rinit=seiqr_rinit
)

seiqr_params = c(Beta=2,mu_I=0.07,mu_R1=0.08,rho=0.4,mu_R2=0.1,eta=0.05,N=1886700)
sims <- simulate(covidSEIQR,params=seiqr_params,
                 nsim=5,format="data.frame",include=TRUE,seed = 20220419)

ggplot(sims,mapping=aes(x=time,y=pos,group=.id,color=.id=="data"))+
  geom_line()+scale_color_hue("",breaks=c("FALSE","TRUE"),labels=c("estimated","observed"))+labs(x="Time",y="Number of Positive Cases")

The likelihood estimate of the initial guess is -6.088247e+02 with a Monte Carlo standard error of 5.111665e-03, which is much better than the estimates of the two previous models. Next, we will use iterative filtering to search for the MLE.

registerDoRNG(123294940)

# Calculate the likelihood of the initial guess
foreach(i=1:10,.combine=c) %dopar% {
  covidSEIQR %>% pfilter(params=seiqr_params,Np=5000)
} -> seiqr_pf

seiqr_pf %>% logLik() %>% logmeanexp(se=TRUE) -> seiqr_L_pf

seiqr_pf[[1]] %>% 
  coef() %>% 
  bind_rows() %>% 
  bind_cols(loglik=seiqr_L_pf[1],loglik.se=seiqr_L_pf[2]) %>%
  write_csv("seiqr_lik.csv")

print(seiqr_L_pf)
                         se 
-6.088247e+02  5.111665e-03 

7 Conclusion

From our analysis, we see that the three POMP models are able to describe the actual curve of the data to a certain extent. With the SIR and SEIR models, we were able to find initial values that better fit the data; however, the fitting effect of the MLE estimation obtained through global search is worse. This is a result worthy of further study. The log likelihood value of the SEIQR model is the lowest, and the fitting effect of the MLE estimation is the best among the three models. From a statistical perspective, we would conclude this as the best fitting model for the data. However, it is noteworthy that the log likelihood for this model has strong fluctuations instead of a clear convergence.

In terms of the parameters between the three models, we see that the biggest difference is in \(\eta\), the initial susceptible fraction. The \(\eta\) values obtained by the SIR and SEIR models are very large, but the \(\eta\) values obtained by the SEIQR model are very small. Intuitively speaking, because the overall vaccination rate in New York is very high, most people should have a certain resistance to Omicron. Therefore, having a lower \(\eta\) would make more sense, which the SEIQR model was able to capture.

From both a statistical and intuitive standpoint, we conclude that the SEIQR model best models the number of positive COVID-19 cases in New York City from December 4, 2021 to February 1, 2022. This suggests the importance of considering “exposed” and “quarantine” stages to best understand the trend of the number of cases. While our analysis results indicate room for improvement in terms of model fitting and estimation, overall, they are able to depict the rise and fall of COVID-19 case numbers during the impact of the Omnicron variant.