In the landscape of the COVID-19 pandemic, people all over the world have been impacted in terms of their work and life. Many variants of COVID-19 have developed since its breakthrough in March 2020, which has resulted in the large number of infections and quick spread among close contacts of infected people. This pandemic has lasted for about 2 years, and it interrupted people’s normal life including travel and work.
In our analysis, we focus on the COVID-19 data in Washtenaw county, Michigan from July 1st, 2021 to April 6th, 2022. As University of Michigan students, we were particularly interested in how COVID-19 has spread in our county (Washtenaw). To get an understanding of how COVID-19 cases change over the time, we built a Susceptible-Exposed-Infection-Recovered (SEIR) compartment model. Through local search and global search, we aim to find how the SEIR model can adapt to the real COVID-19 data in Washtenaw County, MI. We also compare to a likelihood benchmark in the context of non-mechanistic models, including a negative binomial model and an ARMA model. This allows us to put the log-likelihoods we obtain through global search in the context of these non-mechanistic benchmarks.
The data was collected from the website of the Michigan state government [1], where the dataset has a time frame of March 2020 to April 2022. We chose to focus on just the time frame from July 2021 to April 2022, as a different variant (Omicron) has become the most prevalent and much previous work has been done using the data from March 2020 to 2021 [2]. We saw in particular that another STATS 531 final project from Winter 2021 used COVID-19 data from Washtenaw County [3], so we are curious if using a different time frame of more recent data will lead to different results than this previous work.
Figure 1 shows a time series plot smoothed by the Loess method, where the blue smoothed line displays an increasing trend before January 2022 and a decreasing trend after that. Figure 2 shows the average COVID-19 cases by month, where a peak in January 2022 is very obvious.
cap_fig1 =
"**Figure 1.** *Cases of COVID-19 in Washtenaw County from July 2021 to April 2022 (smoothed by loess method)*"
data = washtenaw
data %>% ggplot(aes(x = Date, y = Cases)) + geom_line() +
geom_smooth(method='loess') +
theme_bw() + labs(x = "Date", y = "Cases")
# monthly average cases
cap_fig2 =
"**Figure 2.** *The average COVID-19 cases by month*"
data %>% mutate(year_month = as.yearmon(Date, "%Y %m")) %>%
group_by(year_month) %>% summarize(avg_cases = mean(Cases)) %>% ungroup()%>%
ggplot +
geom_line(aes(x = year_month, y =avg_cases)) +
theme_bw() +
labs(x = "Month Year", y = "Monthly Average Cases")
Figure 1 shows a time series plot smoothed by the Loess method, where the blue smoothed line displays an increasing trend before January 2022 and a decreasing trend after that. Figure 2 shows the average COVID-19 cases by month, where a peak in January 2022 is very obvious.
cap_fig3 =
"**Figure 3.** *The ACF plot of COVID-19 cases in Washtenaw County*"
acf(data$Cases, main="ACF: Covid-19 Cases")
Figure 3 displays the auto-correlation plot of COVID-19 cases in Washtenaw County, MI. This plot motivates our exploration of the time-dependent relationship of the number of COVID-19 cases.
We chose to use the SEIR model which consists of four compartments including “S”, “E”, “I” and “R”. The “S” compartment represents the susceptible people; the “E” compartment represents the number of exposed individuals in a period of latency before becoming infectious; the “I” compartment represents the infectious people; the “R” represents those recovered or removed from the susceptible population.
Suppose the number of people in each compartment at time \(t\) is \(S(t), E(t), I(t), R(t)\), respectively. The model can be specified as follows:
\[ \begin{split} S(t)&=S(0)-N_{SE}(t)\\ E(t)&=E(0)+N_{SE}(t)-N_{EI}(t)\\ I(t)&=I(0)+N_{EI}(t)-N_{IR}(t)\\ R(t)&=R(0)+N_{IR}(t) \end{split} \]
where the number of people transiting from one compartment to another is given by:
\[ \begin{split} \Delta N_{SE}&\sim \mathrm{Binomial}(S,1-e^{\beta\frac{I}{N}\Delta t})\\ \Delta N_{EI}&\sim \mathrm{Binomial}(E,1-e^{-\mu_{EI}\Delta t})\\ \Delta N_{IR}&\sim \mathrm{Binomial}(I,1-e^{^{-\mu_{IR}\Delta t}}) \end{split} \]
As for the parameters, \(\beta\) is the contact rate with \(b_1\) when time is in the first half of time period and \(b_2\) when time is in the second half of time period ,and \(\mu_{SI}=\beta I(t)\) denotes the rate at which individuals in S transition to E, \(\mu_{EI}\) is the rate at which individuals in E transition to I and \(\mu_{IR}\) denotes the transition rate from I to R. The probability of a case being reported is \(\rho\).
Let’s get the initial values based on some statistics:
The population of Washtenaw county on the start date: 372,258 (Estimated, retrieved from [5])
Initial infected people: 30 (An intuitive value based on the confirmed cases on July 1st.)
We are going to divide the time series into two stages. According to the CDC, the time when Omicron became the primary variant in the US is around the beginning of December [2]. So, we set December 1st as the division of the two stages. Since the speeds of spreading of the Delta and Omicron variants are considered different, we set the parameters of \(\beta\) to be \(b_1\) and \(b_2\) respectively. We expect to see that the estimate of \(b_2\) will be larger than \(b_1\).
The code we have written for our project is based on/inspired by code we have written for previous homework assignments, course notes, and previous semesters’ projects for STATS 531 but modified for the purposes of our project [3] [4].
seir_step <- Csnippet("
double Beta = intervention ? b2 : b1;
double dN_SE = rbinom(S,1-exp(-Beta*I/N*dt));
double dN_EI = rbinom(E,1-exp(-mu_EI*dt));
double dN_IR = rbinom(I,1-exp(-mu_IR*dt));
S -= dN_SE;
E += dN_SE - dN_EI;
I += dN_EI - dN_IR;
R += dN_IR;
H += dN_IR;
")
seir_init <- Csnippet("
S = nearbyint(eta*N);
E = 30;
I = 30;
H = 0;
R = nearbyint((1-eta)*N-E-I) + 1;
")
seir_dmeas <- Csnippet("
double tol=1.0e-25;
double mean =rho*H;
double sd =sqrt(pow(tau*H,2)+rho*H);
if(Cases>0.0){
lik=pnorm(Cases+0.5,mean,sd,1,0)-pnorm(Cases-0.5,mean,sd,1,0)+tol;
} else {
lik=pnorm(Cases+0.5,mean,sd,1,0)+tol;
}
if(give_log) lik=log(lik);
")
seir_rmeas <- Csnippet("
Cases = rnorm(rho*H, sqrt(pow(tau*H,2)+rho*H));
if(Cases>0.0){
Cases=nearbyint(Cases);
} else {
Cases=0.0;
}
")
seir_covar <- covariate_table(
t = washtenaw$Time,
intervention= c(rep(0, 154),
rep(1, 125)),
times = "t")
washtenaw %>%
select(Time, Cases) %>%
pomp(
times="Time", t0=1,
rmeasure=seir_rmeas,
dmeasure=seir_dmeas,
accumvars="H",
rprocess=euler(seir_step,delta.t=1),
rinit=seir_init,
partrans=parameter_trans(
log = c("mu_EI", "mu_IR", "b1", "b2", "tau"),
logit = c("rho", "eta")
),
paramnames=c("N","b1", "b2", "rho", "mu_EI","mu_IR","eta", "tau"),
statenames=c("S","E","I","R","H"),
covar = seir_covar
) -> washtenawSEIR
First, we carry out replicated particle filters at our initial guess. We find the estimated log-likelihood below:
foreach(i=1:10,.combine=c) %dopar% {
library(pomp)
washtenawSEIR %>% pfilter(Np=5000)
} -> pf
pf %>% logLik() %>% logmeanexp(se=TRUE) -> L_pf
L_pf
## se
## -7213.183 1510.358
Then we can use simulation and visualization to see how good our initial guesses can fit:
Simulations based on our initial starting points are able to capture the general trend of our data as shown below, but improvements can still be made.
First, we specify our initial parameters for our local search starting point. We chose to fix \(N=372258\) and \(\mu_{IR}=0.2\).
fixed_params <- c(N=372258, mu_IR=0.2)
coef(washtenawSEIR, names(fixed_params)) <- fixed_params
foreach(i=1:20,.combine=c) %dopar% {
library(pomp)
library(tidyverse)
washtenawSEIR %>%
mif2(
Np=2000, Nmif=100,
cooling.fraction.50=0.5,
rw.sd=rw.sd(b1=0.02, b2=0.02, rho=0.02, tau=0.002, mu_EI=0.02, eta=ivp(0.02)),
)
} -> mifs_local
After conducting our search, we take a look at the diagnostic plots as follows:
There is fluctuation in the parameters that are not fixed. Additionally, our likelihood does not strictly increase as iterations proceed, which may indicate a problem.
Then we evaluate the likelihood using replicated particle filters at each point estimate.
The next step is conducting global search for the parameters \(b_1\), \(b_2\), \(\rho\), \(\mu_{EI}\) and \(\eta\) using the iterated filtering process. We randomly draw 500 sets of starting values from a multivariate uniform distribution where $b_1, b_2, , , , $ and \(\mu_{EI} \in [0.01, 0.5]\).
set.seed(0)
runif_design(
lower=c(b1=0, b2=0, rho=0.01, eta=0, tau=0.001, mu_EI=0.01),
upper=c(b1=5, b2=10, rho=0.9, eta=1, tau=0.2, mu_EI=0.5),
nseq=500
) -> guesses
mf1 <- mifs_local[[1]]
bake(file="global_search.rds",
dependson=guesses,{
registerDoRNG(1270401374)
foreach(guess=iter(guesses,"row"), .combine=rbind) %dopar% {
library(pomp)
library(tidyverse)
mf1 %>%
mif2(params=c(guess, fixed_params)) %>%
mif2(Nmif=100) -> mf
replicate(
10,
mf %>% pfilter(Np=5000) %>% logLik()
) %>%
logmeanexp(se=TRUE) -> ll
mf %>% coef() %>% bind_rows() %>%
bind_cols(loglik=ll[1],loglik.se=ll[2])
} -> results
attr(results,"ncpu") <- getDoParWorkers()
results
}) %>%
filter(is.finite(loglik)) -> results
t_global <- attr(results,"system.time")
ncpu_global <- attr(results,"ncpu")
global_results <- results
The parameters derived from the global search are:
b1 | b2 | rho | mu_IR | mu_EI | tau | eta | N | loglik | loglik.se |
---|---|---|---|---|---|---|---|---|---|
1.186655 | 1.5863 | 0.9694866 | 0.2 | 1.762525 | 0.4408428 | 0.1807037 | 372258 | -1547.52 | 1.3273 |
1.186655 | 1.5863 | 0.9694866 | 0.2 | 1.762525 | 0.4408428 | 0.1807037 | 372258 | -1547.52 | 1.3273 |
1.186655 | 1.5863 | 0.9694866 | 0.2 | 1.762525 | 0.4408428 | 0.1807037 | 372258 | -1547.52 | 1.3273 |
1.186655 | 1.5863 | 0.9694866 | 0.2 | 1.762525 | 0.4408428 | 0.1807037 | 372258 | -1547.52 | 1.3273 |
1.186655 | 1.5863 | 0.9694866 | 0.2 | 1.762525 | 0.4408428 | 0.1807037 | 372258 | -1547.52 | 1.3273 |
1.186655 | 1.5863 | 0.9694866 | 0.2 | 1.762525 | 0.4408428 | 0.1807037 | 372258 | -1547.52 | 1.3273 |
The result shows that \(b_2\) is bigger than \(b_1\), proving that the Omicron variant is more easily spread among the population.
We can see that the variance of our fitted time series is slightly large. It may be because of the fluctuation in the original data.
We have shown both local and global search of the mechanistic model above, but many of these parameters do not have meaningful scientific interpretation. In this section, we calculate the log-likelihood in the non-mechanistic models, including a negative binomial model and an ARMA model. This allows us to compare our global search results to a benchmark.
nb_lik = function(theta) {- sum(dnbinom(
data$Cases, size = exp(theta[1]), prob = exp(theta[2]), log = TRUE
))}
nb_mle = optim(c(0, -5), nb_lik)
-nb_mle$value
## [1] -1652.252
The negative binomial model gives a log-likelihood of \(-1652.252\). The log-likelihood of \(-1547\) attained by global search beats this non-mechanistic model.
In addition, a Gaussian auto-regressive moving-average (ARMA) model as a non-mechanistic model may also provide us with another perspective of log-likelihood. We made use of transformed cases \(log(y_n +1)\) in our ARMA model and then calculate the original log-likelihood in an appropriate way, where \(y_n\) is the number of cases.
log_cases = log(1 + data$Cases)
smoothed = spectrum(log_cases, spans=c(5, 5, 5), main="Periodogram (Smoothed)")
idx_max = which.max(smoothed$spec)
idx_max2 = which.max(smoothed$spec[41:length(smoothed$spec)]) + 40
abline(v=smoothed$freq[idx_max], lty=2, col='red') # freq = 0.003472222, 288 days
abline(v=smoothed$freq[idx_max2], lty=2, col='red') # freq = 0.1423611, 1 week
The smoothed periodogram of transformed cases above shows us that the first dominant frequency is around \(w_1 = 0.003472222\) which corresponds to the cycle of \(288\) days, and the second dominant frequency is around \(w_2 = 0.1423611\), which corresponds to a cycle of 1 week. Therefore, we explore the best possible seasonal ARMA models \(SARIMA(p, 0, q) \times (1, 0, 1)_7\) with the period of 7 days in terms of Akaike Information Criterion (AIC), where \(0\leq p\leq 5\), \(0\leq q\leq 5.\)
aic_table=function(data, P, Q){
table=matrix(NA, (P+1), (Q+1))
for(p in 0:P) {
for(q in 0:Q) {
model_aic = try(
arima(data, order = c(p, 0, q), seasonal = list(order = c(1, 0, 1), period = 7))$aic,
silent = TRUE
)
table[p + 1,q + 1] = ifelse(
inherits(model_aic, "try-error"), NA, model_aic
)
}
}
dimnames(table) = list(paste("AR", 0:P, sep=""), paste("MA", 0:Q, sep=""))
table
}
knitr::kable(aic_table(data = log_cases, P = 5, Q = 5))
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | 346.4929 | 275.1273 | 260.0006 | 251.1119 | 193.8270 | 194.1448 |
AR1 | 197.5747 | 177.2757 | 170.7478 | 163.6753 | 162.0773 | 163.0357 |
AR2 | 192.9134 | 176.3333 | 169.6617 | 161.8660 | 160.9721 | 162.7601 |
AR3 | 159.1812 | 159.8479 | 161.5929 | 163.3872 | 162.6681 | NA |
AR4 | 159.5692 | 160.0918 | 161.9805 | 163.8329 | 162.8044 | 167.3686 |
AR5 | 161.2516 | 161.9459 | 163.9343 | 165.9595 | 160.4336 | NA |
arma30= arima(log_cases, order = c(3, 0, 0),
seasonal = list(order = c(1, 0, 1), period = 7))
arma30$loglik - sum(log_cases)
## [1] -1308.984
The AIC table above shows that \(SARIMA(3, 0, 0) \times (1, 0, 1)_7\) is the best model to move forward with based on this metric. The corrected log-likelihood for the original data is \(-1308.984\), which is slightly higher than that of global search. The SARIMA model captures the periodic phenomenon which the SEIR model may not take into account, since our SEIR model mainly considers the flow among the compartments.
In this report, we conduct a time series analysis using an SEIR pomp model on COVID-19 cases from July 2021 to April 2022 in Washtenaw County, MI.
Due to the Omicron variant of the COVID-19 virus, we divide the time series into two stages, where the contact rate \(\beta\) has two different parameter values during these two time stages. The results of our model show that the contact rate in the second stage is larger than that in the first stage, which suggests the power of Omicron variant.
Although there is still room to improve the SEIR model, such as using the SEQIR model that takes a quarantine state into consideration, our SEIR model fits the data of COVID-19 cases in Washtenaw County well.