Introduction

The main goal of this project is to explore and model the temporal dynamics of the pandemic in Kerala using time series models. We apply ARIMA, VAR and POMP models to better understand the trends, patterns, and potential structure underlying the spread of COVID-19.

The original data were reported on a daily basis, spanning from January 31, 2020 to May 22, 2022. To enhance model interpretability and reduce daily reporting volatility, all models are trained on the weekly-aggregated dataset. Weekly data provide a smoother representation of the underlying trends, helping to reduce the impact of daily fluctuations and reporting delays, which are common in epidemiological data.

About the Datasets

This project uses dataset collected from the Kerala Government COVID-19 Dashboard [1].

COVID-19 Case Data

  • Time range: Feburary 2, 2020 – May 15, 2022 (Week 1 - Week 119)
  • Number of observations: 119 weekly records
  • Variables:
    • Date: The first date of the week
    • Confirmed: Number of newly confirmed COVID-19 cases reported within the week
    • Recovered: Number of individuals who recovered from COVID-19 within the week
    • Deceased: Number of reported deaths due to COVID-19 within the week

Exploratory Data Analysis

ACF and PACF of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 2: ACF and PACF of Weekly Confirmed, Recovered, Deceased COVID-19 Cases in Kerala, India

Figure 2: ACF and PACF of Weekly Confirmed, Recovered, Deceased COVID-19 Cases in Kerala, India

Figure 2 shows the ACF and PACF of weekly confirmed, recovered, deceased COVID-19 cases and vaccination administered in Kerala, India. The ACF and PACF plots for confirmed, recovered, and deceased cases show strong autocorrelation and slow decay, indicating non-stationarity and the need for differencing.

Smoothed periodogram of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 3: Smoothed periodogram of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 3: Smoothed periodogram of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 3 is the smoothed periodogram of weekly confirmed, recovered, deceased COVID-19 cases and vaccination administered. The smoothed periodograms show no strong seasonal peaks for any series, suggesting that trend, rather than seasonality, is the dominant feature across confirmed, recovered, and deceased cases.

Model 1: ARIMA

The first model we fit was an ARIMA model using the confirmed cases time series. This model focuses solely on the confirmed cases, and helps establish a baseline understanding of the pandemic progression before incorporating additional variables or building more complex models.

AIC- Based Model Selection

Firstly, we selected the best ARIMA model based on AIC values [2, 3].

AIC Table for ARIMA(p,1,q) Models
MA0 MA1 MA2 MA3 MA4 MA5
AR0 2768.886 2713.977 2682.276 2684.174 2680.517 2677.687
AR1 2718.279 2703.313 2684.218 2673.930 2673.850 2675.826
AR2 2688.899 2690.125 2682.829 2673.985 2675.799 2677.797
AR3 2689.952 2687.935 2684.444 2675.966 2677.788 2678.246
AR4 2690.873 2688.484 2683.913 2676.762 2678.758 2680.760
AR5 2681.468 2683.464 2684.146 2678.761 2680.629 2675.033

The AIC table shows the model fit quality for different combinations of ARIMA(p,1,q) models (where differencing order d = 1 is fixed according to the ACF/PACF in EDA).

  • The model with the lowest AIC value is ARIMA(5,1,5) (AIC ≈ 2675.033), followed closely by several other models like ARIMA(3,1,3) and ARIMA(5,1,2).

  • In general, lower AIC indicates a better model fit, balancing goodness of fit and model complexity.

  • Conclusion: Based on the AIC values, ARIMA(5,1,5) would be a strong candidate for modeling confirmed cases, although slightly simpler models (like ARIMA(3,1,3)) could also be considered if parsimony is preferred.

ARIMA(5,1,5) Model Specification

Let \(Y_t\) represent the original time series (weekly confirmed cases).

Since \(d = 1\), the series is differenced once:

\[ \nabla Y_t = Y_t - Y_{t-1} \]

The ARIMA(5,1,5) model for the differenced series is specified as:

\[ \nabla Y_t = \phi_1 \nabla Y_{t-1} + \phi_2 \nabla Y_{t-2} + \phi_3 \nabla Y_{t-3} + \phi_4 \nabla Y_{t-4} + \phi_5 \nabla Y_{t-5} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \theta_3 \epsilon_{t-3} + \theta_4 \epsilon_{t-4} + \theta_5 \epsilon_{t-5} \]

where:

  • \(\phi_1, \phi_2, \phi_3, \phi_4, \phi_5\) are the autoregressive (AR) coefficients,

  • \(\theta_1, \theta_2, \theta_3, \theta_4, \theta_5\) are the moving average (MA) coefficients,

  • \(\epsilon_t\) is white noise (i.i.d. with mean 0 and constant variance).

Model Fitting

## Series: ts_confirmed 
## ARIMA(5,1,5) 
## 
## Coefficients:
##           ar1     ar2     ar3      ar4      ar5     ma1     ma2      ma3
##       -0.7968  0.1543  0.6487  -0.0750  -0.0990  1.7467  0.9849  -0.9971
## s.e.   0.1227  0.1488  0.1450   0.1478   0.1168  0.1031  0.1792   0.1692
##           ma4      ma5
##       -1.4622  -0.7648
## s.e.   0.1695   0.0987
## 
## sigma^2 = 350631431:  log likelihood = -1326.52
## AIC=2675.03   AICc=2677.52   BIC=2705.51
## 
## Training set error measures:
##                    ME     RMSE      MAE MPE MAPE      MASE        ACF1
## Training set 954.8393 17838.73 9741.407 NaN  Inf 0.1749791 -0.01154425

The fitted ARIMA(5,1,5) model achieved a log-likelihood of -1326.52, indicating the overall goodness of fit. In general, a higher log-likelihood suggests a better model fit to the data.

Figure 4: ARIMA(5,1,5) Fitted vs. Actual plot

Figure 4: ARIMA(5,1,5) Fitted vs. Actual plot

The time series plot (Figure 5) comparing the actual confirmed cases (blue) and the ARIMA(5,1,5) fitted values (red) shows that the model tracks the major peaks and overall trends quite closely. Although there are some deviations at sharp turning points — especially during rapid spikes and declines — the model generally captures the overall behavior of the pandemic progression well.

Model Diagnostics

## 
##  Box-Ljung test
## 
## data:  res
## X-squared = 5.8596, df = 20, p-value = 0.9991
Figure 5: Model Diagnostics for ARIMA(5,1,5)

Figure 5: Model Diagnostics for ARIMA(5,1,5)

Residual Diagnostics

  • The residual time series appears centered around zero with no clear patterns, although some variance increases during peaks.

  • The ACF of residuals shows no significant autocorrelation, supporting the model adequacy [2, 3].

Normality Check

  • The Q-Q plot shows slight deviations from normality at the tails, but residuals are generally close to the theoretical normal line, which is acceptable for ARIMA models [2, 3].

Box-Ljung Test

  • The Box-Ljung test yields a p-value of 0.9991, suggesting no significant autocorrelation remaining in the residuals (fail to reject the null hypothesis of independence) [2, 3].
Figure 6: Inverse Roots of Both AR and MA Polynomials

Figure 6: Inverse Roots of Both AR and MA Polynomials

Stability Check

  • The AR inverse roots are all located within the unit circle, indicating that the model satisfies the stationarity condition. The MA inverse roots lie mostly within the unit circle, satisfying the invertibility condition. However, a few roots are located near or on the boundary, indicating that the model is close to non-invertibility. As a result, shocks may decay more slowly over time, although the process remains invertible. [4].

Overall, the ARIMA(5,1,5) model provides a reasonable fit to the confirmed cases data. The residuals show no significant autocorrelation, the model passes the Box-Ljung test, and the roots confirm stationarity and invertibility. While slight deviations from normality are observed, the model effectively captures the main trend and fluctuations, making it a suitable baseline for further analysis.

Model 2: VAR

The second model we aim to build is a Vector Autoregression (VAR) model, which incorporates confirmed, recovered, and deceased cases simultaneously. Unlike the ARIMA model that focuses on a single series, the VAR model allows us to capture the dynamic relationships between multiple time series. This modeling approach is inspired by the work presented in this study, where the authors successfully used VAR models to analyze the interactions among COVID-19 variables over time [5].

Model Selection

Select the VAR Model Using Optimal Lag

To build the VAR model, we included the confirmed, recovered, and deceased cases as jointly modeled time series. The optimal lag length was selected using information criteria (AIC, HQ, SC, and FPE). Based on the VARSelect output, most criteria suggested a higher lag order (around 20), but to balance model complexity and overfitting risk, we chose p = 9 for the final VAR model. This setup allows the model to capture dependencies across multiple past weeks while maintaining reasonable parsimony.

Model Specification

The VAR model captures the joint dynamics of confirmed (\(y_1\)), recovered (\(y_2\)), and deceased (\(y_3\)) cases.
The general form of a VAR(\(p\)) model is:

\[ \mathbf{Y}_t = \mathbf{B}_0 + \mathbf{B}_1 \mathbf{Y}_{t-1} + \mathbf{B}_2 \mathbf{Y}_{t-2} + \cdots + \mathbf{B}_p \mathbf{Y}_{t-p} + \boldsymbol{\epsilon}_t \]

where:

  • \(\mathbf{Y}_t\) is a \(3 \times 1\) vector of the three time series at time \(t\),

  • \(\mathbf{B}_0\) is a \(3 \times 1\) vector of intercepts,

  • \(\mathbf{B}_1, \mathbf{B}_2, \ldots, \mathbf{B}_p\) are \(3 \times 3\) coefficient matrices, - \(\boldsymbol{\epsilon}_t\) is a \(3 \times 1\) vector of error terms.

For example, a VAR(1) model (for illustration) would be written as:

\[ \begin{pmatrix} y_1(t) \\ y_2(t) \\ y_3(t) \end{pmatrix} = \begin{pmatrix} b_{10} \\ b_{20} \\ b_{30} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix} \begin{pmatrix} y_1(t-1) \\ y_2(t-1) \\ y_3(t-1) \end{pmatrix} + \begin{pmatrix} \epsilon_1(t) \\ \epsilon_2(t) \\ \epsilon_3(t) \end{pmatrix} \]

In our case, \(p = 9\), so the model includes lags up to nine weeks back for each series [5].

Model Fitting

## [1] -3229.098

The log-likelihood for the VAR(9) model was manually computed as -3229.10 due to the inclusion of a constant term, which prevented direct extraction from the model output. Given the relatively large sample size and the complexity of the data (confirmed, recovered, and deceased cases), this log-likelihood value suggests that the VAR(9) model provides a reasonable fit to the data, capturing the main dynamics without overfitting.

Figure 7: VAR(9) Model Fit to Confirmed, Recovered, and Deceased Cases

Figure 7: VAR(9) Model Fit to Confirmed, Recovered, and Deceased Cases

Figure 7 shows the VAR(9) model fit to the confirmed, recovered, and deceased cases over time. Across all three series, the fitted values (red) closely follow the actual observed data (blue), capturing the overall trends, peaks, and shifts quite well. Minor discrepancies appear around sharp transitions, particularly for deceased cases, where the fitted values slightly overshoot or lag behind the actual counts. Nevertheless, the model effectively tracks the general behavior of each time series, suggesting that the VAR(9) model provides a good approximation of the underlying dynamics.

Model Diagnostics

ACF

Figure 8: ACF of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

Figure 8: ACF of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

The residual ACF plots for confirmed, recovered, and deceased cases show no significant autocorrelation, suggesting that the VAR(9) model adequately captured the dynamics of the series.

QQ Plot

Figure 9: Q-Q Plot of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

Figure 9: Q-Q Plot of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

Q-Q plots indicate that residuals are approximately normally distributed, with slight deviations at the tails, which is acceptable for practical purposes.

Box- Ljung Test

## 
##  Box-Ljung test
## 
## data:  residuals_var[, "Confirmed"]
## X-squared = 1.9394, df = 10, p-value = 0.9968
## 
##  Box-Ljung test
## 
## data:  residuals_var[, "Recovered"]
## X-squared = 3.2267, df = 10, p-value = 0.9756
## 
##  Box-Ljung test
## 
## data:  residuals_var[, "Deceased"]
## X-squared = 6.2883, df = 10, p-value = 0.7905

The Box-Ljung tests for all three series yield high p-values (all > 0.79), meaning we fail to reject the null hypothesis of no autocorrelation, supporting model adequacy.

Stability Check

The OLS-CUSUM test checks the stability of a VAR model by examining whether the cumulative sum of residuals stays within confidence bounds. If the plot crosses the bounds, it indicates possible structural breaks, meaning the VAR model may need to be re-specified [6].

Figure 10: OLS-CUSUM plots of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

Figure 10: OLS-CUSUM plots of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

OLS-CUSUM plots for each equation show that the empirical fluctuation processes remain well within the boundaries, indicating model stability over time.

Overall, the VAR(9) model provides a good fit to the confirmed, recovered, and deceased case series. Model diagnostics indicate that the residuals are uncorrelated, approximately normally distributed, and stable over time. The fitted values closely capture the major trends and variations in the data, supporting the use of the VAR model for analyzing the dynamic relationships between these pandemic variables.

Model 3: SEIRS Model1

Introduction

Model Specification

We propose a SEIRS (Susceptible - Exposed - Infectious - Recovered - Susceptible) Model (Class notes 11-18 in STATS531 Modeling and Analysis of Time Series Data(2025) [7] and COVID_19 final project COVID-19 in Kent County, MI (2024) [3]), which is a compartmental epidemiological model that extends the classical SEIR model by including the transition from state \(R\) to \(S\), thus allowing re-infections.

SEIRS Diagram

SEIRS Diagram

The model can be characterized by the following system of differential equations,

\[\begin{align*} \frac{\mathrm{d}S}{\mathrm{d}t} &= -\mu_{SE} S(t) + \mu_{RS} R(t) \\ \frac{\mathrm{d}E}{\mathrm{d}t} &= \mu_{SE} S(t) - \mu_{EI} E(t) \\ \frac{\mathrm{d}I}{\mathrm{d}t} &= \mu_{EI} E(t) - \mu_{IR} I(t) \\ \frac{\mathrm{d}R}{\mathrm{d}t} &= \mu_{IR} I(t) - \mu_{RS} R(t) \\ \end{align*}\]

where

  • \(\mu_{SE} = \beta(t)\) is the transition rate from state \(S\) to state \(E\), also known as the transmission rate;
  • \(\mu_{EI}\) is the transition rate from state \(E\) to state \(I\);
  • \(\mu_{IR}\) is the transition rate from state \(I\) to state \(R\);
  • \(\mu_{RS}\) is the transition rate from state \(R\) to state \(S\);
  • \(N(t) = S(t) + E(t) + I(t) + R(t)\) is the population size, which is assumed to be constant throughout the epidemic.

In order to discretize the model, we assume that each transition between compartments is a binomial random variable, i.e. \[\begin{align*} \Delta N_{SE} &= Binomial(S, 1 - e^{-\beta(t) \frac{I \Delta t}{N}}) \\ \Delta N_{EI} &= Binomial(E, 1 - e^{-\mu_{EI} \Delta t}) \\ \Delta N_{IR} &= Binomial(I, 1 - e^{-\mu_{IR} \Delta t}) \\ \Delta N_{RS} &= Binomial(R, 1 - e^{-\mu_{RS} \Delta t}) \\ \end{align*}\]

As demonstrated in earlier sections, the sample period can be segmented into three distinct phases. Accordingly, we allow the transmission rate, \(\beta(t)\), the dispersion coefficient, \(k\),and the reporting rate \(\rho\), to vary across these phases to reflect changes in underlying epidemic dynamics. Specifically, let \(\beta(t), k(t), \rho(t)\) be piecewise functions s.t. \[\begin{equation*} \beta(t) = \left\{ \begin{matrix} b_1, & t \in [1, 61] \\ b_2, & t \in [62, 96] \\ b_3, & t \in [63, 119] \end{matrix} \right.\end{equation*}\] \[\begin{equation*} k(t) = \left\{ \begin{matrix} k_1, & t \in [1, 61] \\ k_2, & t \in [62, 96] \\ k_3, & t \in [63, 119] \end{matrix} \right.\end{equation*}\]

\[\begin{equation*} \rho(t) = \left\{ \begin{matrix} \rho_1, & t \in [1, 61] \\ \rho_2, & t \in [62, 96] \\ \rho_3, & t \in [63, 119] \end{matrix}\right.\end{equation*}\]

Note that \(k\) and \(\rho\) controls the negative binomial distribution for the number of reported cases \(Y(t)\), i.e. \[ Y(t) \sim NegBinom(\rho * H, \rho * H + \frac{(\rho H)^2}{k(t)}) \]

Why not SEIR

We do not start with the SEIR model since it assumes that recovered individuals gain permanent immunity. Consequently, the SEIR model would trap all individuals in \(R\) forever with enough time, leading to depletion of susceptibles and underestimation of future waves when multiple outbreaks are observed, which is our case.

Alternatively, we consider adding the \(R \rightarrow S\) route in order to allow recovered individuals to lose immunity over time and return to the susceptible state, thus enabling re-infections.

Why Time-varying \(k\) and \(\rho\)

The problematic models mentioned in this section can be found in the Appendix.

At the very beginning, we began our analysis with a SEIRS model featuring time-varying transmission rates, while initially holding the dispersion parameter \((k)\) and the reporting rate \((\rho)\) fixed. This setup was meant for simplicity. This model fails to account for any shifts in amplitudes after the first outbreak, and instead exhibits dynamics similar to a stationary process. Besides, some parameters including \(b_2, b_3\), \(\mu_{IR}\) are either unstable or unrealistic.

Our next step was to allow the dispersion parameter \(k\) in the negative binomial distribution to vary over time, in order to capture changes in the level of noise in reported case counts across different epidemic phases. Such variations may arise from factors like evolving testing policies, reporting delays, or improvements in data infrastructure.

When using a relatively narrow range for the global search, the parameter estimates remain stable, yet some values are still nonsensical, and the simulation exhibits an unexpected spike during the initial outbreak. Expanding the search to a wider range leads to increased parameter instability, and the profile likelihood plots for \(\eta\) and \(\rho\) become problematic. The plot for \(\eta\) shows no significant pattern or structure, indicating weak identifiability, while the other plot contains a gap between \(\rho = 0.4\) and \(\rho=0.45\), which is typically a red flag suggesting either numerical instability or convergence failure.

Finally, varying only \(k(t)\) allows the model to adjust the dispersion around the expected number of reported cases, but it alone does not affect the mean of the observation distribution. As a result, it cannot account for changes in the actual level of under reporting or detection bias over time.This is why we decided to introduce a time-varying reporting rate \(\rho(t)\), and hence derived our final model.

Initial Guesses

We begin with a population size of \(N = 34,530,000\), which is estimated based on the population size of 33,406,061 from the most previous Cencus of India [8]. For simplicity, \(N\) is assumed to be constant throughout the sample time period. The initial guess for the parameters are as follows. \[\begin{equation*}\left\{\begin{matrix} b_1 = 3, b_2 = 50, b_3 = 20, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.4, \mu_{RS} = 0.005, \\ \rho_1 = 0.12, \rho_2 =0.4, \rho_3 = 0.5 \\ k_1 = 2.5, k_2 = 4, k_3 = 0.5 \\ \eta = 0.1 \\ \end{matrix}\right.\end{equation*}\]

We expect \(b_2\) to be greater than \(b_1\) since the second outbreak features a larger magnitude, which suggest rising transmissibility. \(b_3\) is assumed to be moderately less than \(b_2\) to compensate for the heavy infections already caused by the 2nd outbreak, so that the third wave doesn’t unrealistically overshoot due to a reduced susceptible population.

Additionally, the median incubation period is roughly 4.5-5.8 days [9]. The infectious period is usually believed to be 1-2 days before and 8-10 days after symptoms begin [10] . To include a buffer for late recovery or reporting delays, we assume this period to be 2.5 weeks. Hence, we set \(\mu_{EI} = 1.67\), and \(\mu_{IR} = 0.5\). \(\mu_{RS}\) is set to 0.005 for numerical stability, as we examine significant divergence results and worse local search and global search results when we try to increase \(\mu_{RS}\).

Moreover, \(\eta\) is set to \(0.1\) considering the large population size in Kerala. Large \(N\) implies high absolute infection counts in spite of small proportions, which would possibly sufficient for the early seeding of the outbreaks.

Before starting the local search, it still remains to check the particle filters and plot the simulations based the guess.

ESS Check and Simulations, Initial GuessESS Check and Simulations, Initial Guess

ESS Check and Simulations, Initial Guess

With the initial parameters, the model can closely capture the timing and structure of the second and the third major outbreak. However, it overestimates the sharpness of the third peak. The first outbreak has been significantly underestimated in terms of magnitude. Besides, the effective sample size drops sharply during the first outbreak (week 10-30), indicating poor particle diversity and possibly degeneracy in the particle filter. Admittedly, the initial guesses are far from perfect, but we will see significantly improved model fit after the local search and global search.

Profile Likelihood

##           Lower      Upper
## Rho1 0.03805235 0.08148367
## Rho2 0.35667937 0.57063668
## Rho3 0.44796157 0.84804805

All three profile plots are consistent in the largest log-likelihood, which is between -1240 and -1239. The profile plots of \(\rho_1\) and \(\rho_3\) all have clear pattern and produce nice confidence interval without varying too much from our best global search model parameters; but the profile plot of \(\rho_2\) looks much more messy, a possible reason is that the log-likelihood is very hard to calculate for some points and lead to potential maximization errors, despite that, the profile plot of \(\rho_2\) has a very nice quadratic form in the top-left corner.

##           Lower     Upper
## Eta   0.4442709 0.9921901
## mu_IR 0.1106291 1.5182440

The profile of \(\eta\) has clear pattern despite several possible points where log-likelihood is hard to compute and lead to maximization errors, it produces a wide confidence interval on \(\eta\) which may coincide with our previous statement that added \(R \rightarrow S\) route can compensate our initial fraction \(\eta\) and makes initial fraction \(\eta\) not very sensitive. The profile of \(\mu_{IR}\) has a good quadratic from on the left side where \(\mu_{IR}\) is between 0 and 0.25, but contains several isolated points with large log-likelihood on the right side where \(\mu_{IR}\) is between 1.0 and 1.5 (one week recovery time), and 1.0 to 1.5 is a range we probably fail to incorporate during our previous global search, which further indicates another possible global optimum search range and another global optimum model that may improve our current global search best model by directly improving \(\mu_{IR}\).

Model 4: SEIRS Model2

As discussed above, the profile likelihood plots reveal a data point with a remarkably high log-likelihood, corresponding to a set of stable and plausible parameters. We therefore decided to proceed with a follow-up investigation. Specifically, we have replace the initial parameters for the local search with those from the data point, while keeping all other configurations:

\[\begin{equation*}\left\{\begin{matrix} b_1 = 2.33, b_2 = 3.68, b_3 = 11.40, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.4, \mu_{RS} = 0.005, \\ \rho_1 = 0.07, \rho_2 =0.51, \rho_3 = 0.11 \\ k_1 = 1.26, k_2 = 6, k_3 = 2.8 \\ \eta = 0.62 \\ \end{matrix}\right.\end{equation*}\]

Local Search 2

Let’s revisit the diagnostics. The log-likelihood reaches -1,240 after 200 iterations and still exibits an upward trend. Most parameter show strong signs of convergence except for \(\eta\) and \(\rho_1\). More importantly, the estimated parameters now have better intepretability, particularly \(\mu_{IR}\), which is now centered around 1.0 \(\mathrm{week}^{-1}\) rather than 0.15\(\mathrm{week}^{-1}\).

Local Search Diagnostics

Local Search Diagnostics

The simulations looks similar to the previous model, although there is still spike at the first outbreak which is not supposed to exist.

##         b1       b2       b3       rho1      rho2      rho3    mu_EI     mu_IR
## 1 1.733088 2.814144 9.765815 0.06187154 0.5179523 0.1040071 1.269608 0.9640089
##   mu_RS       k1       k2       k3       eta        N loglik.est loglik.se.se
## 1 0.005 1.371936 6.196671 2.694699 0.7374424 34530000  -1235.802  0.002025691
Simulations, Local Search

Simulations, Local Search

Global Search 2

Global search leads to slightly larger log-likelihood (2 log units) than local search with small Monte Carlo standard error. The sets of parameters corresponding to the top 10 log-likelihood stay relatively consistent. Therefore, we can just extract the model with the largest log-likelihood as our best model during current global search, and the best parameters is as follows,

We focus on the changes related to the previous global search, the log-likelihood becomes 6 log units larger and converges around -1233 with small Monte Carlo standard error. In terms of parameters interpretation, some of the parameters become better in this model, but new problems also appear.

For non-multi-stage parameters, to be specific \(\eta, \mu_{EI}, \mu_{IR}\):

\(\eta\) increase to 0.998, which is almost 1, coincides with our previous statement that global population was considered to be entirely susceptible to SARS-CoV-2, and indicates that the added \(R \rightarrow S\) in this model mostly indicates second/third or even multiple times infectious situation.

\(\mu_{EI} \approx1.1\), which corresponds to 6 days incubation period, which is pretty much close to our previous statement 5 days.

\(\mu_{IR}\approx 0.84\) is significantly improved from 0.15 in the previous global search, and corresponds to 8.3 days recovery time on average. We admit 8.3 days is slight shorter than 2 weeks recovery time suggested in the literature, but it is more reasonable than 7 weeks on average in the previous global search.

For multi-stage parameters, to be more specific \(\rho_1, \rho_2, \rho_3, b_3\) :

\(\rho_1, \rho_2\) stay roughly the same as previous global search candidate, but a new problematic parameter is \(\rho_3 \approx 0.09\) during the third(Omicron) outbreak, which is too small and beyond our expectation, we fail to explain what our best model suggests for \(\rho_3\).

\(b_3 \approx 9.14\), it is a coincident that \(b_3\) is solved by this model, which maintains a reasonable transmission rate order, smallest transmission rate \(b_1\) at the beginning of pandemic outbreak, slightly larger \(b_2\) in the second(Delta) outbreak, and significantly larger \(b_3\) during the third(Omicron) outbreak which coincides with the statement “Omicron has a basic reproduction number estimated to be 2–3 times higher than Delta, which itself was more transmissible than earlier variants” (Liu, Y., Rocklöv, J. , 2022) [19].

Similarly, we provide 5 simple random simulations based on our second global search candidate’s parameters.

The simulation results are slightly better than previous global search best model candidate, with the first and second outbreaks well captured as previous global search candidate, while solving the early appearance issue so that the third(Omicron) outbreak is roughly better captured. Then the best model in current global search 2 can be viewed as the second global search candidate

As the profile plot of \(\mu_{IR}\) shows some interesting results in the previous global search candidate, we want to dive deeper to check the profile plot of \(\mu_{IR}\) of our second global search candidate and try to combine two profile results to get a closer and nicer look of the full-range profile likelihood of \(\mu_{IR}\).

Profile Likelihood on \(\mu_{IR}\)

## Combining two profile dataset, the 95% confidence interval of mu_IR is 0.7576469 1.126986

The profile plot (on the left) coincides with our previous guess that there is another global optimum range where \(\mu_{IR}\) falls between 0.75 to 1.25 that we fail to explore during the first global search. After combining two profile datasets and make a profile plot (on the right), we have a very clear pattern which shows two quadratic-shape peaks with the first peak corresponds to our first global search and second peak corresponds to our second global search.

Conclusion & Limitations

Then we extract the log-likelihood for our benchmark \(ARIMA(5, 1, 5)\) and our two \(SEIRS\) global search candidates, though we believe compare log-likelihood is sufficient, we also provide \(AIC\) values comparison [2].

##                  Log-Likelihood      AIC
## ARIMA(5, 1, 5)        -1326.516 2675.033
## SEIRS_Candidate1      -1239.449 2502.898
## SEIRS_Candidate2      -1233.212 2490.423

Compared with our benchmark ARIMA(5, 1, 5), we can see that both our SEIRS global search candidates (12 free parameters) have significantly larger log-likelihood and smaller AIC values. This means that the SEIRS global search candidates are actually better fit in terms of both log-likelihood (AIC) and potential mechanism interpretation values. However, between two SEIRS global search candidates, though the first candidate has smaller log-likelihood and larger AIC, consider the different parameter values and their interpretation values, we prefer to leave all of they here as our best SEIRS model in this project, which can be further explored in the future.

Also, we provide another option VAR model using the confirmed, recovery and deceased data at the same time, which is also an advantage of our well recorded dataset. Compared to ARIMA, the VAR model has the advantage of handling multiple interrelated time series simultaneously, allowing it to capture cross-variable dynamics that ARIMA cannot. Compared to POMP models, VAR models are simpler and more transparent, requiring fewer assumptions about the underlying latent processes. This makes VAR easier to estimate and interpret. But as the log-likelihood and AIC for VAR model is based on all confirmed, recovered and deceased data, while ARIMA and SEIRS only incorporate confirmed data, we can not directly compared with ARIMA and SEIRS model, which can be further discuss in the future.

One of the main limitation of our project is the fixed setting on \(\mu_{RS} = 0.005\) for SEIRS model, corresponds to 200 weeks immunity and is generally too large. But as we examined significant divergence results and worse local search and global search results when we try to increase \(\mu_{RS}\), so we give up on that and leave for future exploration. Another possible limitation is that due to the complexity of our SEIRS model, we may not have a very nice-looking global log-likelihood surface and due to our limited time and searching setting on Greatlakes (NP = 5000, NMIF = 200, Guesses = 400/800), it is really tricky and hard to find the actual global optimum. Even though we add an extra round of global search after we examine some interesting points occur in the \(\mu_{IR}\) profile plot and our parameter and simulation results look nice, we may still get stuck into some local optimum, and fail to really touch the global optimum. We believe this a very common problem in carrying out maximization algorithm in complex models, which also is left for future explorations.

Appendix

SEIRS, with Constant \(k\) and \(\rho\)

Initial Guesses

\[\begin{equation*} \left\{ \begin{matrix} b_1 = 5, b_2 = 10, b_3 = 20, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.5, \mu_{RS} = 0.005, \\ \rho = 0.4, \\ k = 10, \\ \eta = 0.1 \\ \end{matrix} \right.\end{equation*}\]

With the initial guess, the model manages to capture the timeline for most outbreaks, but fails to estimate the magnitudes.

Local Search

The diagnostics plot is as follows. The log-likelihood shows rapid improvement in the early iterations, followed by clear stabilization around iteration 100, indicating good convergence of the local search. However, it eventually stays at roughly -1,800, which is lower than that of the benchmark by almost 500 units.

Most parameter trajectories exhibit some variability. Fixed parameters (\(k\), \(\mu_{RS}\), \(N\)) remain constant as expected.

The locally optimal parameters and the simulation based on them are as follows.

##         b1       b2       b3       rho    mu_EI     mu_IR mu_RS  k        eta
## 1 5.757514 48.26719 40.03208 0.2562741 1.411273 0.3811521 0.005 10 0.05511792
##          N loglik.est loglik.se.se
## 1 34530000  -1683.127   0.02625717

The model can closely capture the timing of the first outbreak, but it overestimates the amplitude. The second peak looks great. Unfortunately though, the model dies down and fails to capture any volatilities after that. Changes in model structure will be necessary if the global search cannot lead to significant improvements in model performance.

Global Search

SEIRS, with constant \(\rho\) and Time-varying \(k\)

Initial Guesses

\[\begin{equation*} \left\{ \begin{matrix} b_1 = 5, b_2 = 10, b_3 = 20, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.5, \mu_{RS} = 0.005, \\ \rho = 0.4, \\ k_1 = 8, k_2 = 5, k_3 = 2 \\ \eta = 0.1 \\ \end{matrix} \right.\end{equation*}\]

The timeframes of all three outbreaks are captured more accurately, while the magnitudes are still problematic.

Local Search

Compared to the previous plot, this diagnostics plot shows much faster and more stable convergence across both log likelihood and parameter trajectories. The log likelihood stabilizes quickly with minimal variability between chains, while improving to around -1,300, indicating better local optimization.

For parameters, the inclusion of \(k_1\), \(k_2\), and \(k_3\) leads to a slightly higher-dimensional parameter space, yet their trajectories converge sharply, especially \(k_1\) and \(k_3\), which stabilize early. In contrast to the earlier plot, this suggests improved identifiability and reduced noise, possibly due to better initialization or more informative structure of the model. Parameters like \(b_1\), \(b_2\), and \(\mu_{IR}\) show clearer and more uniform convergence trends across chains. However, the steady decline of \(\eta\) across iterations is concerning, as it suggests the model increasingly favors a smaller initial susceptible population, which may be unrealistic. This could indicate issues with parameter identifiability or compensatory behavior to fit later outbreak waves. Overall, the model appears more stable and well-calibrated than before.

The locally optimal parameters and the simulation based on them are as follows.

##            b1            b2            b3           rho         mu_EI 
##  8.482112e+00  4.713658e+01  3.429103e+01  3.482484e-01  9.876440e-01 
##         mu_IR         mu_RS            k1            k2            k3 
##  5.700558e-01  5.000000e-03  1.024699e+00  4.612992e+00  4.465041e-01 
##           eta             N    loglik.est  loglik.se.se 
##  4.362946e-02  3.453000e+07 -1.276941e+03  1.795350e-03

While most of the simulation aligns reasonably well with the timing and scale of observed waves, the first spike shows a clear anomaly from one of the simulations, in which more than 1.5 million cases are reported in a single week, and that is far above the observed data. This outlier suggests instability or overdispersion in that interval, likely driven by parameter uncertainty or an extreme realization, while the rest of the simulations remain within a plausible range.

Global Search

Wider Global Search Profile Likelihood

Acknowledgement

[1] Government of Kerala. (2022). COVID-19 Dashboard. Retrieved from https://dashboard.kerala.gov.in/covid/index.php

[2] Ionides, E. L. (2025). Modeling and Analysis of time series data: Chapter 5 - Parameter estimation and model identification for ARMA models. Retrieved from https://ionides.github.io/531w25/05/slides.pdf

[3] Final Project Example: Project 12. Retrieved from https://ionides.github.io/531w24/final_project/project12/blinded.html

[4] Ionides, E. L. (2025). Modeling and Analysis of Time Series Data Chapter 6 - Extending the ARMA model: Seasonality, integration and trend. Retrieved from https://ionides.github.io/531w25/06/slides.pdf

[5] Khan, F., Saeed, A., & Ali, S. (2020). Modelling and forecasting of new cases, deaths and recover cases of COVID-19 by using Vector Autoregressive model in Pakistan. Chaos, Solitons & Fractals, 140, 110189. https://doi.org/10.1016/j.chaos.2020.110189

[6] Markovska-Simoska, S., Pop-Jordanov, J., & Kapushevska, B. (2018). Meteorological multivariable approximation and prediction with classical VAR-DCC approach. ResearchGate. https://www.researchgate.net/publication/324153796_Meteorological_multivariable_approximation_and_prediction_with_classical_VAR-DCC_approach

[7] Ionides, E. L. (2025). Modeling and Analysis of time series data: Class notes 11:18. https://ionides.github.io/531w25/.

[8] Kerala State Planning Board. (2020). Economic Review 2020 – Volume 1. Government of Kerala. https://spb.kerala.gov.in/sites/default/files/2021-01/English-Vol-1_0.pdfLinks to an external site.

[9] Lauer, S. A., Grantz, K. H., Bi, Q., Jones, F. K., Zheng, Q., Meredith, H. R., Azman, A. S., Reich, N. G., & Lessler, J. (2020). The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Annals of internal medicine, 172(9), 577–582. https://doi.org/10.7326/M20-0504

[10] Centers for Disease Control and Prevention. (2024). COVID-19. In CDC Yellow Book: Health Information for International Travel 2024. https://wwwnc.cdc.gov/travel/yellowbook/2024/infections-diseases/covid-19

[11] World Health Organization (WHO) (2020).Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19). https://www.who.int/publications/i/item/report-of-the-who-china-joint-mission-on-coronavirus-disease-2019-(covid-19)

[12] Clarke, K. E. N., Jones, J. M., Deng, Y., Nycz, E., Lee, A., Iachan, R., … & MacNeil, A. (2022). Seroprevalence of Infection-Induced SARS-CoV-2 Antibodies — United States, September 2021–February 2022. Morbidity and Mortality Weekly Report (MMWR), 71(17), 606–608. https://doi.org/10.15585/mmwr.mm7117e3

[13] Pulliam, J. R. C., van Schalkwyk, C., Govender, N., von Gottberg, A., Cohen, C., Groome, M. J., … & Moultrie, H. (2022). Increased risk of SARS-CoV-2 reinfection associated with emergence of the Omicron variant in South Africa. Science, 376(6593), eabn4947. https://doi.org/10.1126/science.abn4947

[14] Lauer, S. A., et al. (2020). The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Annals of Internal Medicine, 172(9), 577–582. https://www.acpjournals.org/doi/10.7326/M20-0504

[15] World Health Organization. (2020). Q&A on coronaviruses (COVID-19). Retrieved from https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19

[16] Wu, S. L., Mertens, A. N., Crider, Y. S., et al. (2020). Substantial underestimation of SARS-CoV-2 infection in the United States. Nature Communications, 11, 4507. https://doi.org/10.1038/s41467-020-18272-4

[17] Nishiura, H., Kobayashi, T., Miyama, T., et al. (2020). Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19). International Journal of Infectious Diseases, 94, 154–155. https://doi.org/10.1016/j.ijid.2020.03.020

[18] Hellewell, J., Abbott, S., Gimma, A., et al. (2021). Estimating the effectiveness of routine asymptomatic PCR testing at different frequencies for the detection of SARS-CoV-2 infections. BMC Medicine, 19, 106. https://doi.org/10.1186/s12916-021-01982-x

[19] Liu, Y., Rocklöv, J. (2022). The reproductive number of the Omicron variant is several times relative to Delta. Journal of Travel Medicine, 29(3). https://doi.org/10.1093/jtm/taac037

[20] Chatgpt. Used for debugging and proof-reading.