1 Introduction

COVID-19’s introduction to the United States has proven to be an incredible force which has shaped our lives throughout 2020 into 2021. This was acutely realized in the state of Michigan which experienced rapid surges, several rounds of restrictions and mask hesitancy. Because the virus can present itself differently based on the individual, it can lead to a variety of infectious periods. The CDC [1] reported that someone infected with a less severe case should no longer be infectious 10 days after the onset of symptoms, while those with more severe cases can be infectious for 20 days or longer. An article published by the National Institute of Health [2] said that symptoms don’t typically start until 5-6 days after infection, with some cases not seeing symptoms until day 14. The variability in the virus has caused severe underreporting, which has been estimated to be around 10% of actual cases according to one article published by researchers at MIT [3]. The aforementioned conditions along with a latent symptom period and asymptomatic spread has led to an interesting and challenging modeling experience. Past research [4] on modeling COVID-19 has been done with a specific location in mind; for our purposes, we work to study the state of Michigan and begin by applying an ARMA model and a POMP model and we then draw comparisons between both.

2 Exploratory Data Analysis (EDA)

The data is from the Michigan website [5] and available dates included reported positive COVID-19 cases from March 2020 to April 2021. The data is originally given by cases by county. For our purposes, we aggregate the data of all available counties by day. By assessing the plot we can see that there are a couple of large spikes; the largest spike of reported cases was over the winter months. For the purposes of our analysis, the data is subsetted to only analyze and model the “winter spike” from October 1st 2020 to February 1st 2021.

A line of best fit is included in the plot to visually assess the linearity and stationarity of the plot [6]. It can be noticed that in order to proceed with a more introductory time series method such as ARMA modeling, we may want to detrend the data. In addition the Augmented Dickey–Fuller (ADF) and Kwiatkowski–Phillips–Schmidt–Shin (KPSS) tests [7] are run on the data to further test whether it suggests a non-stationary trend.

Recall that the ADF tests the following [8],

\[H_{0}: \text{There is a unit root.} \quad \text{v.s.} \quad H_{A}: \text{The data is stationary.} \]

## 
##  Augmented Dickey-Fuller Test
## 
## data:  data_subset$Cases
## Dickey-Fuller = -1.9628, Lag order = 4, p-value = 0.592
## alternative hypothesis: stationary

And the KPSS tests the following [9],

\[H_{0}: \text{The data is trend-stationary.} \quad \text{v.s.} \quad H_{A}: \text{There is a unit root.} \]

## 
##  KPSS Test for Level Stationarity
## 
## data:  data_subset$Cases
## KPSS Level = 0.63409, Truncation lag parameter = 4, p-value = 0.01954

These tests further suggest that the data may not be stationary or trend-stationary. Therefore, to detrend the data, for the preparation of ARMA modeling, the Hodrick-Prescott (HP) [10] filter is used with a \(\lambda = 100\) to conform with standards in practice [11] . By rerunning the ADF and KPSS tests, we obtain results that suggest that the HP Filter applied to the data has detrended it. We then proceed with our ARMA modeling on the detrended Michigan COVID-19 cases data.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  cases_hp
## Dickey-Fuller = -12.532, Lag order = 4, p-value = 0.01
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  cases_hp
## KPSS Level = 0.015727, Truncation lag parameter = 4, p-value = 0.1

3 ARMA Modeling

We begin with a standard ARMA model and use the Akaike Information Criterion (AIC) [12] to select our model. In particular, we look for a model that produces the relatively smaller AIC value; however, this could also lead to picking a large and complex model that could lead to the issue of numerical instability [13]. Therefore, the choice of model is based on selecting the model with a balance of low AIC and parsimony.

Table 3.1: Table #
MA0 MA1 MA2 MA3 MA4 MA5
AR0 2063.11 2058.84 2021.03 2013.14 2002.94 1998.11
AR1 2061.58 2053.36 2019.88 2009.50 2007.73 1998.15
AR2 2052.03 2005.33 2001.59 2006.08 1988.18 1989.93
AR3 2046.55 2003.49 2003.54 2006.55 1967.99 1968.78
AR4 2033.01 2000.79 1997.78 1995.45 1969.63 1955.30
AR5 2000.23 1997.51 1970.48 1967.39 1971.78 1937.09
## 
## Call:
## arima(x = cases_hp, order = c(1, 0, 1))
## 
## Coefficients:
##           ar1     ma1  intercept
##       -0.7143  0.9583    -0.0510
## s.e.   0.0833  0.0389    94.3654
## 
## sigma^2 estimated as 847105:  log likelihood = -1022.68,  aic = 2053.36

We will consider the ARMA(2, 2) model for meeting the criteria for the goal of having a relatively small AIC and model simplicity and proceed with residual analysis.

3.1 Residual Analysis

The residual analysis of the ARMA(2,2) model raises concerns. The residuals do not appear to follow a nearly normal distribution, and by analyzing the Autocorrelation Function (ACF) plot we can see that there are some lines outside of the blue dashed lines which test the 95% confidence region that the residuals are independent and identically distributed under the null [14]. In addition there seems to be a larger spike on every seventh lag. Thus, rather than conforming with just an ARMA model of the data, we attempt to model it using a Partially Observed Markov Process (POMP) model to see if it results in an improved model.

4 POMP Modeling

We now attempt to model the COVID-19 cases in Michigan using the SEIR model. All analyses in this section works with the pomp [15] package in R. The population for the state of Michigan was estimated to be about 10 million. The number of infected individuals in the SEIR model is initialized by considering the number of cases up to October 1st 2020. We use the 7-day moving average of cases within the period of interest to smooth out the data set we are running the pomp model with.

4.1 SEIR Model [16] [17]

The SEIR model builds on the most basic susceptible-infected-recovered (SIR) model by adding a compartment to account for a latency period that infected individuals must pass through before becoming infectious as stated in the introduction. [18]

Flow diagram taken from Chp 12 notes [19]

For the SEIR model the compartments are [20] [21]

  • \(S\): susceptible (population vulnerable to infection)
  • \(E\): exposed (asymptomatic infected population)
  • \(I\): infected (symptomatic infected population)
  • \(R\): removed (recovered or dead population)
  • \(C\) denotes the reported cases

Associated with each arrow is the rate at which individuals move through the compartments [22] [23] [24] :

  • \(\mu_{\bullet S}\) represents the rate of births into \(S\)

  • \(\mu_{SE}\), \(\mu_{EI}\), and \(\mu_{IR}\) are the transition rates between \(S\), \(E\), \(I\), and \(R\)

    • \(\mu_{SE}\) is the force of infection and can be described as \(\mu_{SE}=\beta \frac{I}{N}\) where \(\beta\) is the contact rate (in contacts per person per time)

    • \(\frac{1}{\mu_{EI}}\) is the mean latency period of the disease (in days)

    • \(\frac{1}{\mu_{IR}}\) is the mean infectious period of the disease (in days)

  • \(\mu_{S \bullet }\), \(\mu_{E \bullet}\), \(\mu_{I \bullet}\), and \(\mu_{R \bullet}\) denote mortality rates at each compartment

  • and \(\rho\) denotes the reporting rate

To simplify the model we ignore demographic changes so that \(\mu_{ \bullet S} = \mu_{S \bullet } = \mu_{I \bullet} = \mu_{R \bullet } = 0\) and fix the population at \(N=10,000,000\), which is about the population of the state of Michigan. The total population can be described as \(S(t) + E(t)+I(t)+R(t)=N\) for any time \(t\).

To track flows of individuals through the compartments we use the notation \(N_{SE}(t)\) to count individuals who transition from \(S\) to \(E\) by time \(t\), and use a similar counting process for \(N_{EI}(t)\) and \(N_{IR}(t)\).

By setting initial conditions for \(S(0)\), \(E(0)\), \(I(0)\), and \(R(0)\) we can represent the model with the following ordinary differential equations [25] [26]:

  • \(\frac{dS}{dt} = -\mu_{SE}(t)\,S(t)\)

  • \(\frac{dE}{dt} = \mu_{SE}(t)\,S(t)-\mu_{EI}(t)\,E(t)\)

  • \(\frac{dI}{dt} = \mu_{EI}(t)\,E(t)-\mu_{IR}(t)\,I(t)\)

  • \(\frac{dR}{dt} = \mu_{IR}(t)\,I(t)\)

4.1.1 Implementing the SEIR model in pomp

As we are analyzing the Michigan “winter spike” of COVID-19 infections we must initialize our pomp model to the start of the spike on October 1st 2020. The parameter \(\eta\) is the fraction of people susceptible to the virus. Considering SARS-CoV-2 is a novel virus we can assume at the start of 2020 everyone is susceptible and that \(\eta = 1\). After some initial modeling we found that a reporting rate of \(\rho = 0.1\) tends to fit the data fairly well, so we hold that value constant in our model. This matches with what we found in the literature [27].

Taking into account reported cases up to October 2020 and the reporting rate, we estimate the remaining fraction of people susceptible to the virus in October as \(\eta = 0.84\). To approximate the initial number of individuals in compartments \(E\) and \(I\) we look at the cases reported a week before and after October 1st. Though symptoms don’t usually show until 5-6 days after infection, it can take longer before an individual feels the effects of COVID. [2] Due to this, we assumed the average latent period was around 7 days. The CDC reported that someone is typically no longer infectious by day 10 of symptoms, which is why we also chose a 7 day infectious period. [1]

As individuals in the initial latent population will be eventually tested and moved to the infected population, we approximate \(E(0)\) as the seven day sum of cases reported between October 1st - October 7th, divided by the reporting rate \(\rho\). Likewise, individuals currently infected and moving towards recovery would have tested positive in the days leading up to October 1st, so to approximate \(I(0)\) we take the seven day sum of cases reported between September 24th - September 30th and divide by the reporting rate \(\rho\).

In summary, to initialize our model we set:

  • \(S = N\eta\)
  • \(E = 90,000\)
  • \(I = 66,000\)
  • \(R = N(1-\eta)\)
library(pomp)

paramnames1=c("N","Beta","mu_EI","mu_IR","rho","eta")
statenames1=c("S","E","I","R","H")

seir_step <- Csnippet("
  double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
  double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
  S -= dN_SE;
  E += dN_SE - dN_EI;
  I += dN_EI - dN_IR;
  R += dN_IR;
  H += dN_IR;
")

seir_init <- Csnippet("
  S = nearbyint(eta*N);
  E = 90000;
  I = 66000;
  R = nearbyint((1-eta)*N);
  H = nearbyint((1-eta)*N);
")

seir_dmeas <- Csnippet("
  lik = dbinom(reports,H,rho,give_log);
")

seir_rmeas <- Csnippet("
  reports = rbinom(H,rho);
")

data_cov %>%
  pomp(
    times="day",t0=0,
    rprocess=euler(seir_step,delta.t=1/7),
    rinit=seir_init,
    rmeasure=seir_rmeas,
    dmeasure=seir_dmeas,
    accumvars="H",
    partrans=parameter_trans(
    logit=c("Beta","mu_EI","mu_IR")
    ),
    statenames=c("S","E","I","R","H"),
    paramnames=c("N","Beta","mu_EI","mu_IR","rho","eta")
  ) -> covSEIR

data_cov %>%
   pomp(
    times="day",t0=0,
    rprocess=euler(seir_step,delta.t=1/7),
    rinit=seir_init, rmeasure=seir_rmeas, dmeasure=seir_dmeas,
    partrans=parameter_trans(logit=c("Beta", "mu_EI","mu_IR")),
    accumvars="H", statenames=c("S","E","I","R","H"),
    paramnames=c("N","Beta","mu_EI","mu_IR","rho","eta"),
    cdir=".", cfile="covSEIR"
   ) -> covSEIR

4.1.2 Simulations [28] [29]

Based on our original findings, we set the parameters to represent what the conditions of COVID looked like at the time. Because the statistics around COVID are estimates and our POMP model isn’t perfectly representative, we manually varied the parameters \(\beta\), \(\mu_{EI}\), and \(\mu_{IR}\), and found that \(\beta=0.31\), \(\mu_{EI}=0.5\), and \(\mu_{IR}=0.19\) were good starting points for the model parameters.

4.1.3 SEIR Local Search [30] [31]

We ran a local search for the maximum log likelihood after setting \(N\) and \(\rho\) constant (as described above) and varying \(\beta\), \(\mu_{EI}\), and \(\mu_{IR}\) and \(\eta\).

params <- c(Beta=0.31, mu_EI=0.5,mu_IR=0.19,rho=0.1,eta=0.84, N = pop)

covSEIR %>%
  pfilter(Np=1000,params=params) -> pf
plot(pf, main= "Particle Filter Check")

##                        se 
## -54404.76088     41.45644

To run the local search in a reasonable amount of time on a laptop we set Np=1000 and Nmif=50. The results of the maximum log likelihood and corresponding model parameters are shown in the table below.

## # A tibble: 1 x 8
##    Beta  mu_EI mu_IR   rho   eta        N  loglik loglik.se
##   <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>   <dbl>     <dbl>
## 1 0.543 0.0618 0.139   0.1 0.703 10000000 -85119.      136.

By looking at the conditional log likelihood plot, we can see that the tails are leading to the lowest log likelihood as we found it difficult to match the rapid increase and decrease in cases. We saw \(\beta\) vary 0.3 to 0.7 between simulations and iterations. This seems logical due to the mask wearing, social distancing, state regulation and counts per day which can all be represented in this variable. These factors are also reflected in the \(\mu_{IR}\) which we saw vary, though it did somewhat converge around 0.09. Ideally there would be covariates representing the aforementioned factors, so that \(\beta\) and \(\mu_{IR}\) could represent their intended meaning. \(\eta\) varied between 0.7 and 0.9, though it’s average was around 0.83 which likely accounts for the percentage of the population which already had COVID in this period. \(\mu_{EI}\) gave a suspicious result as it had a steep drop and converged around 0.

4.1.4 SEIR Global Search [32] [33]

Based on this plot we see that the \(\mu_{EI}\) stays between 0.0 and 0.15. \(\eta\) appeared to be a little lower than our local search, around 0.6, which is interesting considering the novelness of the virus. We found \(\beta\) to be inconclusive which we also saw in the local search.

5 Conclusion and Discussion

Prediction from ARMA indicates the necessity of the POMP model. We had hoped that the SEIR results would have shown reasonable predicting power of the winter spike. However, the local and global search do not show reliable results as we had hoped. We did find that the \(\mu_{EI}\), \(\eta\) were consistent with a slight convergence. However, the \(\mu_{IR}\), \(\beta\) and log likelihood did not seem to be working well with the model and provided inconclusive results with poor convergence. We believe that our model could be improved via selecting a more adequate measurement model. One suggestion for a different measurement model would be to implement a Negative Binomial or a Poisson Model [34].

Dealing with an unprecedented pandemic in statistical models is a difficult challenge to carry through and model properly. Thus there is still a lot of work to do with handling COVID-19 data and other directions that can be taken that were not presented here. A suggestion for future work in the area of COVID-19 case modeling could be to create a slightly more complex model with a couple more compartments. The compartments that we had in mind for suggestion would be: Infected and symptomatic and infected and asymptomatic. Since the start of the pandemic, there have been several conjectures on whether asymptomatic individuals are able to expose and infect others, being able to account for that would be interesting in a POMP model.

Additional improvement could have been made by introducing covariates into the model; the covariates can include social distancing rate with regards to restrictions that could vary beta and eta values. For example, restriction policies for entertainment venues such as movie theaters were released from October 9, 2020, and the reported cases then surged [35]. Then, the restriction was again enforced on November 15, 2020 and cases were dramatically decreased [36]. Lastly, the vaccination effect was not considered in this modeling even though the first vaccine was done on December 8, 2020 because the actual immunization can present around 6 weeks after the first dose [37], and the vaccination status was slow at that time. Also, the variants were reported first on Jan 16,2021, so not considered as well [38]. Nevertheless, further studies on the case trends with regards to the increasing variants ratio and vaccination rate could allow more interesting results.


Annotated Timeline of MI COVID-19 Cases with State Restrictions \[\text{Plot created in Excel with information from Wikipedia [35]}\]

6 Division of Labor

For anonymity, description of individual contributions are removed.

7 References

[1] Information on disposition here

[2] Information on symptoms here.

[3] Information on reporting and underreporting here.

[4] Previous Stats 531 Final Project, Flattening the Curve: A Case Study of COVID-19 Spread in Seattle, WA here

[5] Source of data here.

[6] Method of assessing stationary and code for plots from previous Stats 531 Midterm Project here

[7] Idea for using ADF and KPSS tests from previous Stats 531 Midterm Project here

[8] Information and hypothesis tests on ADF test here

[9] Information and hypothesis tests for KPSS test here

[10] Information about the HP Filter here

[11] Information about using lambda equal to 100 for HP Filter here

[12] Information about ARMA model and use of Akaike Information Criterion (AIC) here

[13] Information about analyzing Akaike Information Criterion (AIC) here.

[14] Information about analyzing the Autocorrelation Function (ACF) plot here

[15] Information about POMP model here

[16] Example of simulation of stochastic dynamic models here

[17] Information about simulation of stochastic dynamic models here

[18] Information about COVID-19 epidemic model with latency period here

[19] Information about SEIR model here

[20] Information about compartments here

[21] Information about compartmental models in epidemiology here

[22] Information about compartment structure here

[23] Information about compartmental models in epidemiology here

[24] Example of simulation of stochastic dynamic models here

[25] Information about differential equations for modeling COVID 19 here

[26] Information about compartmental models in epidemiology here

[27] Introduction to reporting rate and susceptibility here.

[28] Information and code about building and simulating POMP model here.

[29] Information and code about building and simulating POMP model here.

[30] Information about SEIR local search here

[31] Information about SEIR local search here

[32] Information about SEIR global search here

[33] Information about SEIR global search here

[34] Discussion of using Negative Binomial and Poisson Model here

[35] Information about Michigan’s reported cases here

[36] Information about Michigan’s restriction policies here

[37] Information about Michigan’s vaccination trends here

[38] Information about Michigan’s variants reporting here

7.1 Additional Acknowledgements

We also acknowledge the discussions and guidance from the course Professor and GSI during weekly group meetings, and the occasional discussion posts that other groups had posted on Piazza.