Comparative Analysis of ARIMA and SEIR Models Using COVID-19 Data

Introduction

In the first half of the semester, we learned about ARIMA models, which are basic tools for fitting time series data. In the second half, we focused on models more suitable for epidemiology, particularly the POMP model and the fundamental mathematical model of infectious diseases known as the SIR model, which delineates the dynamics among Susceptible, Infectious, and Recovered individuals. Therefore, the main question we explore in this report is:

Can the SEIR model, which extends the SIR framework by including an Exposed (E) category, provide a better fit for the process of infectious disease transmission compared to the ARIMA model?

For this analysis, we have selected data on the spread of the coronavirus (COVID-19).

The COVID-19 outbreak began in the United States in early 2020, with the first confirmed case reported on January 20. This initial case involved a resident of Washington State who had returned from Wuhan, China. In the following weeks, the number of cases gradually increased across various states. By March, COVID-19 was rapidly spreading nationwide, prompting the implementation of travel restrictions and stay-at-home orders in numerous locations. As of today, the COVID-19 pandemic has not yet concluded. Although most countries have significantly reduced the severity and mortality of cases through extensive vaccination campaigns, the virus continues to spread globally. This persistence is particularly driven by the emergence and transmission of new variants of the virus.

The data we used were separately recorded for different regions, detailing the confirmed cases, deaths, and recoveries from COVID-19. In the Exploratory Data Analysis (EDA) section, we analyzed the progression of COVID-19 in three regions: Washington State, California, and New York. For subsequent modeling with ARIMA and SEIR models, we primarily used 1500 data points from Washington State to determine whether the SEIR model could more effectively capture the characteristics of the epidemic’s spread.

EDA

Given the distinct demographic and public health landscapes of these states, our analysis focuses on these regions to capture the unique aspects of COVID-19 transmission and control measures. Our aim is to apply a POMP model to these data to elucidate the transmission dynamics of the virus. This approach will not only provide insights into the pandemic’s trends but also serve as a foundational framework for more intricate derivative models in subsequent analyses.

The simulation charts provided exhibit several key characteristics:

Cyclical Fluctuations: There’s a distinct cyclical pattern observed in the number of susceptibles (S), infected (I), and recovered (R) individuals. This typically suggests periodic outbreaks and retreats of the disease within the population. Extreme Fluctuations: The number of susceptibles rapidly falls close to zero and then recovers just as quickly, which is not consistent with real-world scenarios. In reality, outbreaks tend to spread gradually rather than instantaneously.

Constant Number of Recovered: After each wave of infection, the number of recovered individuals seems to reach a new equilibrium without a significant decrease. Analysis of the Model

Model Assumptions: The model may be overly simplistic or may not capture key features of disease transmission. For example, if the model assumes that individuals become susceptible immediately after recovery, this could lead to cyclical fluctuations. This might be reasonable for certain diseases, but with COVID-19, recovered individuals typically have a period of immunity.

Parameter Settings: The value of Beta (transmission rate) may be set too high, or the recovery rate (mu_IR) may be set too low, leading to rapid spread and recovery of the disease within the population. Additionally, the initial proportion of susceptibles (eta) might need adjustment. Time Scale: The time step of the model may need to be adjusted to better capture the rates of infection and recovery.

Adjust Transmission Rate (Beta): Decrease the Beta value to slow down the spread of the virus.

Adjust Recovery Rate (mu_IR): Increase the mu_IR value to speed up the recovery rate of infected individuals, which will result in lower peaks of infection.

ARIMA model

Given time series data \(\{X_t\}_{t=1}^N\), an ARIMA(p,d,q) model is given by:

\[\phi (B) ((1 - B)^d X_t-\mu) = \psi (B) \varepsilon_t\]

Where:

\(\mu = E[X_t]\).
\(\phi (B) = 1 - \sum_{i=1}^{p} \phi_i B^i\) is the autoregressive part of the model.
\(\psi(B) = 1 + \sum_{i=1}^{q} \psi_i B^i\) is the moving average part of the model.
\(B\) is the lag operator. \(BY_n = Y_{n-1}\).
\(\varepsilon_t\) is the error term at time \(t\), and normally we assume \(\varepsilon_t\) \(iid \sim N(0,1)\).

Here \(d\) is the order of difference. When \(d = 1\), \((1 - B) X_t = X_t - X_{t-1}\), each time series observation is subtracted from its previous observation. By doing this calculation, linear trend in the original series will be removed.

Model selection

We use AIC criterion to choose the best model. After trying d = 0, 1, 2, we decided to find the model from different ARIMA(P,1,Q) model with parameters P and Q ranging from 0 to 3. The AIC table for each model are shown below.

AIC Values for ARIMA Models
	MA0	MA1	MA2	MA3
AR0	4052.772	4016.322	4006.013	3995.776
AR1	4006.580	4008.502	4004.311	3992.999
AR2	4008.430	4010.559	3992.193	3983.290
AR3	4000.668	3985.339	3987.248	3985.902

As we can see, ARIMA(2, 1, 3) achieve the smallest AIC value 3983.290. However, since AIC penalized less for complexity, we also compare this model with ARIMA(3,1,1). Since these two ARIMA models are not nested, it is inappropriate to use the likelihood ratio test for comparison. Therefore, we decided to choose the ARIMA(2, 1, 3) model after comparing their residual plots, autocorrelation function (ACF) graphs, and QQ plots.

From the residual plot, it can be observed that the residuals suddenly increase around 2022, but the QQ plot and ACF show good performance. We also plotted the fitted data as below, which generally appears to be well-fitted.

## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.

SEIR model

The SEIR model is consist of four stages:

S: susceptible population
E: exposed population, asymptomatic but won’t test positive yet
I: infected population, symptomatic/tests positive
R: removed population

Suppose the number of people in each compartment at time \(t\) is \(S(t)\), \(E(t)\), \(I(t)\), \(R(t)\), respectively. The model can be specified as follows:

\[ \begin{aligned} S(t) &= S(0) - \Delta N_{SE}(t) \\ E(t) &= E(0) + \Delta N_{SE}(t) - \Delta N_{EI}(t) \\ I(t) &= I(0) + \Delta N_{EI}(t) - \Delta N_{IR}(t) \\ R(t) &= R(0) + \Delta N_{IR}(t) \end{aligned} \]

where the number of people transiting from one compartment to another is given by:

\[ \begin{aligned} \Delta N_{SE} &\sim \text{Binomial}(S, 1 - e^{-\beta \frac{I}{N} \Delta t}) \\ \Delta N_{EI} &\sim \text{Binomial}(E, 1 - e^{-\mu_{EI} \Delta t}) \\ \Delta N_{IR} &\sim \text{Binomial}(I, 1 - e^{-\mu_{IR} \Delta t}) \end{aligned} \]

As for the parameters, \(\beta\) is the contact rate with \(b_1\) when time is in the first half of time period and \(b_2\) when time is in the second half of time period, and \(\mu_{SI} = \beta I(t)\) denotes the rate at which individuals in \(S\) transition to \(E\), \(\mu_{EI}\) is the rate at which individuals in \(E\) transition to \(I\) and \(\mu_{IR}\) denotes the transition rate from \(I\) to \(R\). The probability of a case being reported is \(\rho\).

Based on the accuracy of the prediction scenarios from the EDA portion of the pomp model, and when viewed in the context of covid19 at the time. Washington is the state where the first case of covid19 appeared in the United States, so it would be more convincing for us to choose its confirmation data to build the SEIR model. After the initial model building we use local optimization on local search and overall optimization on gloabl search to help the model adjust the parameters to the extent that the prediction results are more realistic.

Original model

The blue line in the figure represents real data on the number of confirmed diagnoses over time in Washington State, and the red line represents the SEIR model’s predictions of the number of confirmed diagnoses over time based on our initial parameters. We can see that the model under the initial parameters fails to make predictions, with predictions almost universally close to 0.

Local search and optimization

The five images as above show the convergence of local search for each parameter in the optimization. We can clearly see that except for the population represented by N, all the other parameters are converging to a positive number close to 0 eventually. And we also derive the final reference values of the five parameters given by the local optimization.

Based on the five parameters adjusted by the local optimization, we plot again to make predictions. Although the prediction is significantly better this time, the peak position still deviates from the real situation. In the real-world scenario, there is a distinct peak around day 100 that is not captured by the model, suggesting that the model may not have fully accounted for all the factors that influence transmission or case reporting. The mismatch of peaks may indicate that key dynamics or external factors were missed in the simulation, or that parameter values need to be adjusted to achieve a more accurate fit.

Global Search and Optimization

To find the optimal parameters in the global search, we used the GenSA software package, which defines lower and upper limits for each parameter and measures the fit of the SEIR simulation model to the actual case data using a cost function that calculates the sum of the squared differences between observed and simulated cases. To ensure an accurate and efficient process, we quantified the iteration limits and temperature decrements when running the GenSA function. In the end, we obtained the following optimal parameters.

##         beta        sigma        gamma            N          rho 
## 8.343802e-01 1.823573e-01 2.141743e-01 1.344720e+06 6.604048e-01

According to the image above,we can clearly see that global optimization is even less effective than local optimization, the fit between predicted and true values is too low.

Final SEIR Model

Based on the locally optimized parameter data, we again tuned and derived a SEIR model with predictions closer to the real situation.

Conclusion

In this project, we employed both ARIMA and SEIR models to fit the transmission dynamics of a particular virus in the Washington region.

Overall, both models performed commendably. The ARIMA model was able to capture every subtle trend in the variation, though it did not perform as well around the 2022 outbreak peak.

The SEIR model, grounded in the principles of epidemiological dynamics, appeared to more naturally capture the specific characteristics of infectious disease spread within populations, especially noticeable around the peak of the curve. The peak provided by the SEIR model aligns more closely with the actual data peak, suggesting that this model may more accurately depict the dynamics of epidemic outbreaks. However, our SEIR model still fell short of the ARIMA model in terms of fitting precision, a gap that could perhaps be bridged by incorporating variables such as Death into the model.

Reference

[1] Data sources: https://covid19datahub.io/articles/data.html

[2] Course notes: https://kingaa.github.io/sbied/stochsim/notes.pdf

[3] Course codes: https://kingaa.github.io/sbied/stochsim/main.R

[4] The SEIRS model for infectious disease dynamics, https://www.nature.com/articles/s41592-020-0856-2

[5] Final project modeling Covid 19 with multivariate POMP Model: https://ionides.github.io/531w22/final_project/project16/blinded.html

[6] Code optimization and error correction, https://chat.openai.com/