Introduction

Novel coronavirus disease (COVID-19), a serious contagious disease, was first identified in December 2019 and has since posed a major threat to global public health security. Many articles have been published on the transmission, treatment, and social and economic effects of the COVID-19 pandemic. Many final projects in STATS531 from 2020 and 2021 focus on COVID-19 datasets (“Project25” 2020), but at that time the Omicron variant had not yet been discovered. The Omicron variant has since become the dominant strain circulating in the U.S. and the rest of the world. It spreads more easily than previous strains because it carries more mutations than any other variant so far (“Omicron-Variant” 2022). The largest number of daily cases, about 1.4 million, was reported on January 10th (“NewYorkTime” 2022). The good news is that Omicron infection generally causes less severe disease than infection with prior variants (“CDC” 2022). Daily reported cases then dropped quickly to nearly ten thousand, and many states relaxed their COVID-19 restrictions. On March 21st, the University of Michigan changed its face covering policy for COVID-19 (“UMICH” 2022). However, the number of cases started to rise again in April and has reached 35 thousand. A public concern is that the daily reported counts may not be accurate, because many infected people who show no symptoms may never be reported. Based on this situation, our team raises two questions. First, can we fit a time series model to the data from the Omicron wave? Second, how much can we trust the daily reported numbers?

In this case study, we use the daily COVID-19 cases in Washtenaw County, MI, from December 1st, 2021 to April 6th, 2022 to understand the transmission dynamics of COVID-19. We focus on a single county because policies vary across places, and Washtenaw is where we live. The starting time point is based on the report from CDC (“Firstcase” 2021). We carry out the analysis with ARMA and POMP models, then compare the results and draw conclusions.

COVID-19 Data

The daily confirmed COVID-19 cases in Washtenaw County, MI were obtained from the Michigan.gov website. Below is the plot of COVID-19 cases from 2020-03-01 to 2022-04-06. At the beginning, the outbreak was not serious. From about time 25 in the plot, cases increase suddenly, reaching a peak at about time 35. After time 50 the situation improves steadily, and cases are now even lower than at the beginning.

The ACF plot of the original reported confirmed COVID-19 cases shows very high autocorrelation. Taking first differences therefore helps make the data stationary. As seen from the plot below, the differenced series appears much more stationary than the original one. Although some lags, such as lags 7 and 14, still show high correlation, this is an improvement over the ACF of the original data. The peak at lag 7 suggests that the data may contain weekly seasonality.
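The differencing step can be sketched in R as follows. This is a minimal illustration: a synthetic Poisson series with a weekly pattern stands in for the actual Washtenaw data, which is not reproduced here.

```r
# Synthetic daily counts with a period-7 component, standing in for the real series
set.seed(531)
t <- 1:120
cases <- rpois(120, lambda = 200 + 80 * sin(2 * pi * t / 7))

# First-order differencing: y_n - y_{n-1}, removes the slowly varying level
diff_cases <- diff(cases)

# Compare autocorrelation before and after differencing
acf(cases, lag.max = 28, main = "ACF of reported cases")
acf(diff_cases, lag.max = 28, main = "ACF of differenced cases")
```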

Check for Seasonality

In the previous ACF plot of the differenced data, the high correlations at lag 7 and lag 14 suggest some form of weekly seasonality in the reported confirmed cases. To further confirm this observation, a spectral density plot is shown below. It agrees that a seasonal pattern exists in the data, with a period of approximately seven days (1/0.13).
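The period can be read off the periodogram as the reciprocal of the frequency at the spectral peak. A sketch of that computation, again using a synthetic weekly-seasonal series rather than the actual data:

```r
# Dominant period = 1 / frequency at the spectral peak.
# Synthetic series with a true period of 7 days stands in for the real data.
set.seed(531)
y <- 200 + 80 * sin(2 * pi * (1:140) / 7) + rnorm(140, sd = 20)

# Smoothed periodogram of the differenced series
spec <- spectrum(diff(y), spans = c(3, 3), plot = FALSE)
peak_freq <- spec$freq[which.max(spec$spec)]
period <- 1 / peak_freq  # close to 7 days per cycle for this series
```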

Fitting SARIMA Model

Since we’ve seen that seasonality exists in the data, and that we are fitting over the differenced data, we will fit a SARIMA(p,1,q) model with a weekly period, with the equation \[\phi(B)\,\Phi(B^{7})\left[(1-B)(1-B^{7})^{D}Y_n-\mu\right]= \psi(B)\,\Psi(B^{7})\,\epsilon_n\] where \(\{\epsilon_n\}\) is a white noise process, \(\mu\) is the mean of the differenced process \((1-B)(1-B^{7})^{D}Y_n\), and \(\phi(B)\), \(\Phi(B^{7})\), \(\psi(B)\), \(\Psi(B^{7})\) are the non-seasonal and seasonal AR and MA polynomials, as stated in Lecture Notes Chapter 6.

We then proceed to model selection by computing AIC values for various possible values of p and q in the SARIMA(p,1,q) model. The values are presented in the table below.

##          MA0      MA1      MA2      MA3      MA4
## AR0 1437.999 1438.916 1429.261 1429.829 1431.339
## AR1 1439.484 1432.121 1430.224 1430.817 1432.807
## AR2 1433.903 1429.340 1420.611 1432.805 1423.163
## AR3 1433.136 1431.151 1433.100 1423.598 1424.586
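A table of this shape can be produced with a helper function along the following lines. This is a sketch: the synthetic series and the seasonal order \((1,0,1)_{7}\) are illustrative assumptions, not necessarily the exact specification used in the report.

```r
# Sketch of an AIC table for SARIMA(p,1,q) models with a weekly seasonal part
aic_table <- function(y, P, Q) {
  table <- matrix(NA, P + 1, Q + 1,
                  dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) for (q in 0:Q) {
    fit <- try(arima(y, order = c(p, 1, q),
                     seasonal = list(order = c(1, 0, 1), period = 7)),
               silent = TRUE)
    if (!inherits(fit, "try-error")) table[p + 1, q + 1] <- fit$aic
  }
  table
}

set.seed(531)
y <- 200 + 80 * sin(2 * pi * (1:140) / 7) + rnorm(140, sd = 20)
tab <- aic_table(y, 1, 1)  # small grid for illustration; the report scans 0..3 x 0..4
</imports_placeholder>
```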

From the AIC table above, SARIMA(2,1,2) has the lowest AIC value among all candidate models, so we choose it as our SARIMA model.

As seen from the plot of the inverse roots of the SARIMA model above, the fitted model is stationary and causal: all the inverse AR and MA roots lie inside the unit circle. Next, we use the residual plot and Q-Q plot to check how well our model fits the data. The figures are shown below. The residuals remain large around the peaks, and the Q-Q plot indicates that the residuals do not follow a normal distribution. Nevertheless, the data around the peaks are hard to fit, and the SARIMA model generally performs well. To achieve better results, we will use a POMP model.
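The inverse-root check described above amounts to verifying that the reciprocals of the roots of the AR and MA polynomials have modulus below one. A self-contained sketch with hypothetical AR(2) coefficients (not the fitted values):

```r
# Causality check for an AR(2) polynomial 1 - phi1*z - phi2*z^2:
# the model is causal when all inverse roots lie inside the unit circle.
phi <- c(phi1 = 1.2, phi2 = -0.5)   # hypothetical coefficients for illustration
roots <- polyroot(c(1, -phi))       # roots of the AR polynomial
inv_roots <- 1 / roots
causal <- all(Mod(inv_roots) < 1)   # TRUE => causal
```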

POMP: Recurrent-SEPIR Model

In the course, we learned the SIR (Susceptible-Infectious-Recovered) and SEIR (Susceptible-Exposed-Infectious-Recovered) models. These models are simple but powerful. However, the assumption that recovered people are no longer susceptible does not hold for COVID-19 (“Reinfection” 2022). Therefore, we think a recurrent model, in which people who have recovered from COVID-19 can be infected again, may be more suitable. Besides, it is possible for infected people to show no symptoms yet still spread the virus; such people do not know they are infected and will not get tested, so these cases are not reported (“Nosymptom” 2022). Based on this situation, we add one more branch after the “Exposed” state. We name it “Potential Infected” and use the letter “P” to represent the state. This means that, after the “Exposed” state, people can go to “I” or “P”, but only cases from state “I” can be reported. The model is shown in the figure below.

Model

Our model can be described as follows.

And the transition between different stages can be described by the following differential equations.
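As a sketch, the deterministic skeleton consistent with the process snippet given later (with \(\beta(t)\) denoting the stage-dependent contact rate) is:

\[
\begin{aligned}
\frac{dS}{dt} &= -\beta(t)\frac{I+P}{N}S + \mu_{RS}R, \\
\frac{dE}{dt} &= \beta(t)\frac{I+P}{N}S - \mu_{EPI}E, \\
\frac{dP}{dt} &= \alpha\,\mu_{EPI}E - \mu_{PR}P, \\
\frac{dI}{dt} &= (1-\alpha)\,\mu_{EPI}E - \mu_{IR}I, \\
\frac{dR}{dt} &= \mu_{PR}P + \mu_{IR}I - \mu_{RS}R.
\end{aligned}
\]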

The parameters in the model are defined as follows.

We observed that the data look quite different in different stages. In the first half, the number of reported cases is larger and there are several peaks; in the second half, the reported cases are fewer and generally remain at a certain level. To capture this, we let \(\beta\) vary across periods. The code to generate our POMP model is shown below.

#load the model
#intervention_indicator
intervention_indicator <- rep(0,length(dat$Time))
for(i in 1:length(dat$Time)){
  if (i<=24){
    intervention_indicator[i] <- 1
  }
  else if (i>24 & i<=28){
    intervention_indicator[i] <- 2
  }
  else if (i>28 & i<=34){
    intervention_indicator[i] <- 3
  }
  else if (i>34 & i<=41){
    intervention_indicator[i] <- 4
  }
  else if (i>41 & i<=51){
    intervention_indicator[i] <- 5
  }
  else {
    intervention_indicator[i] <- 6
  }
}

#covariate table
covar_ <- covariate_table(
  day=dat$Time,
  intervention = intervention_indicator,
  times="day"
)

sepir_step <- Csnippet("
  double Beta_intervention;
  
  if (intervention==1){
    Beta_intervention = Beta*b1;
  }
  else if (intervention==2){
    Beta_intervention = Beta*b2;
  }
  else if (intervention==3){
    Beta_intervention = Beta*b3;
  }
  else if (intervention==4){
    Beta_intervention = Beta*b4;
  }
  else if (intervention==5){
    Beta_intervention = Beta*b5;
  }
  else {
    Beta_intervention = Beta;
  }
  
  double dN_SE = rbinom(S,1-exp(-Beta_intervention*(I+P)/N*dt)); // S -> E
  double dN_EPI = rbinom(E,1-exp(-mu_EPI*dt));                   // E -> P or I
  double dN_PR = rbinom(P,1-exp(-mu_PR*dt));                     // P -> R
  double dN_IR = rbinom(I,1-exp(-mu_IR*dt));                     // I -> R
  double dN_RS = rbinom(R,1-exp(-mu_RS*dt));                     // R -> S (loss of immunity)
  S += dN_RS - dN_SE;
  E += dN_SE - dN_EPI;
  P += nearbyint(alpha*dN_EPI) - dN_PR;
  I += nearbyint((1-alpha)*dN_EPI) - dN_IR;
  R += dN_PR + dN_IR - dN_RS;
  H += dN_IR;
")

sepir_init <- Csnippet("
  S = nearbyint(eta*N);
  E = 100;
  I = 200;
  P = 50;
  R = nearbyint((1-eta)*N);
  H = 0;
")

dmeas <- Csnippet("
  double tol=1.0e-25;
  double mean =rho*H;
  double sd =sqrt(pow(tau*H,2)+rho*H);
  if(reports>0.0){
    lik=pnorm(reports+0.5,mean,sd,1,0)-pnorm(reports-0.5,mean,sd,1,0)+tol;
  } else {
    lik=pnorm(reports+0.5,mean,sd,1,0)+tol;
  }
  if(give_log) lik=log(lik);
")

rmeas <- Csnippet("
  reports = rnorm(rho*H, sqrt(pow(tau*H,2)+rho*H));
  if(reports>0.0){
    reports=nearbyint(reports);
  } else {
    reports=0.0;
  }
")
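In the measurement snippets above, the reported counts are modeled with a discretized normal approximation. Written out, the assumed measurement model is

\[
\text{reports}_n \mid H_n \sim \mathrm{Normal}\!\left(\rho H_n,\; \tau^2 H_n^2 + \rho H_n\right),
\]

rounded to the nearest nonnegative integer, so that \(\rho\) acts as the reporting rate and \(\tau\) governs overdispersion.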

Before we use local and global search to find the best parameters, we look for a set of parameters to serve as our starting point. According to the census (“Population” 2021), we fix \(N\) at 367,601 (the figure has since been updated). We tune the starting point following two criteria: first, the parameter values should be reasonable; second, the simulated results should be close to the real data. Our initial guess and the simulated result are shown below.

#set params
params_guess <- c(Beta=0.4,b1=2.2,b2=3,b3=2,b4=1.6,b5=1.2,mu_EPI=1.6,mu_PR=0.6,mu_IR=0.47,mu_RS=0.2,alpha=0.4,rho=0.95,eta=0.6,N=367601,tau=0.25) 

The log likelihood is -3611.64, with a standard error of 78.32. The result is far from satisfactory: in the simulated figure, the reported cases in our simulation decay to zero, whereas in reality the number of cases remains at a certain level. Nevertheless, this is just an initial guess.
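A log likelihood estimate and standard error of this kind come from combining replicated particle filters on the likelihood scale, as `pomp::logmeanexp` does. The combination step can be sketched in a self-contained way; the replicate values below are hypothetical, not our actual filter output.

```r
# Combine replicate log likelihoods on the likelihood scale, numerically stably
# (the same idea as pomp::logmeanexp).
log_mean_exp <- function(ll) {
  m <- max(ll)
  m + log(mean(exp(ll - m)))  # subtract the max to avoid overflow/underflow
}

ll <- c(-3615.2, -3611.1, -3612.8, -3610.5)  # hypothetical pfilter replicates
est <- log_mean_exp(ll)
```

Averaging on the likelihood scale (rather than averaging the log likelihoods) gives an estimate dominated by the best replicates, which is why the combined value sits near the top of the replicate range.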

Model Fit Analysis

The first plot shows that the effective sample size is about 5,000 and the maximization of the log likelihood works relatively well. Although the particle filter fails at some points between time 110 and time 120, it recovers soon. Based on the whole plot, the result is good.

In the second plot, one black curve behaves strangely for \(\tau\), \(\rho\), \(b1\), and several other parameters. We believe it is an outlier relative to the general convergence of these parameters. Looking more closely at the first diagnostic plot, that run's log likelihood trace differs only slightly from the other curves, and its other traces show no apparent difference. Therefore, it is probably not an abnormal situation, and we cannot say our filtering process is wrong; we choose to treat it as an outlier and ignore it. Overall, most parameters converge well. \(\mu_{EPI}\) may have some convergence problems, but the model fit is generally good enough to support interpretation.

The third plot shows the parameters we fixed, so each trace is constant.

Summary

In this project, we use a SARIMA model and a POMP model to fit the reported cases after the spread of the Omicron variant. For the POMP model, we design a new recurrent SEPIR model based on known facts about the Omicron variant. We find that both models can fit the data well. One reason may be that the dataset is not large and the values generally remain at a certain level despite several peaks, so the SARIMA model can also perform very well. However, the POMP model explains the data better. Based on its results, we still need to pay attention to the Omicron variant and protect ourselves from infection, because asymptomatic cases go unreported and can still infect others. Additionally, migration is not considered in our model, nor are births and deaths. Better results might be achieved if these factors were included.

References