Introduction

The main goal of this project is to explore and model the temporal dynamics of the pandemic in Kerala using time series models. We apply ARIMA, VAR and POMP models to better understand the trends, patterns, and potential structure underlying the spread of COVID-19.

The original data were reported on a daily basis, spanning from January 31, 2020 to May 22, 2022. To enhance model interpretability and reduce daily reporting volatility, all models are trained on the weekly-aggregated dataset. Weekly data provide a smoother representation of the underlying trends, helping to reduce the impact of daily fluctuations and reporting delays, which are common in epidemiological data.

About the Datasets

This project uses dataset collected from the Kerala Government COVID-19 Dashboard [1].

COVID-19 Case Data

Time range: Feburary 2, 2020 – May 15, 2022 (Week 1 - Week 119)
Number of observations: 119 weekly records
Variables:
- Date: The first date of the week
- Confirmed: Number of newly confirmed COVID-19 cases reported within the week
- Recovered: Number of individuals who recovered from COVID-19 within the week
- Deceased: Number of reported deaths due to COVID-19 within the week

Exploratory Data Analysis

Weekly Trends for Confirmed, Recovered, Deceased COVID-19 Cases and Vaccination Counts

Figure 1: Weekly COVID-19 case trends (Confirmed, Recovered, Deceased) in Kerala from January 2020 to March 2023

Figure 1 shows weekly trends in confirmed, recovered, and deceased COVID-19 cases in Kerala from January 2020 to May 2022. We can observe several key surges in confirmed and recovered cases.

The two vertical dashed lines in the figure represent key public health milestones:

The first line (March 22, 2021, last day of week 60) corresponds to the initial rollout of COVID-19 vaccines and spread of delta variant in the region.
The second line (January 2, 2022, last day of Week 100) marks the emergence and rapid spread of the Omicron variant.

These two events dramatically changed the trajectory and severity of the pandemic. Therefore, we divide the full timeline into three distinct stages.

ACF and PACF of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 2: ACF and PACF of Weekly Confirmed, Recovered, Deceased COVID-19 Cases in Kerala, India

Figure 2 shows the ACF and PACF of weekly confirmed, recovered, deceased COVID-19 cases and vaccination administered in Kerala, India. The ACF and PACF plots for confirmed, recovered, and deceased cases show strong autocorrelation and slow decay, indicating non-stationarity and the need for differencing.

Smoothed periodogram of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 3: Smoothed periodogram of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Figure 3 is the smoothed periodogram of weekly confirmed, recovered, deceased COVID-19 cases and vaccination administered. The smoothed periodograms show no strong seasonal peaks for any series, suggesting that trend, rather than seasonality, is the dominant feature across confirmed, recovered, and deceased cases.

Model 1: ARIMA

The first model we fit was an ARIMA model using the confirmed cases time series. This model focuses solely on the confirmed cases, and helps establish a baseline understanding of the pandemic progression before incorporating additional variables or building more complex models.

AIC- Based Model Selection

Firstly, we selected the best ARIMA model based on AIC values [2, 3].

AIC Table for ARIMA(p,1,q) Models
	MA0	MA1	MA2	MA3	MA4	MA5
AR0	2768.886	2713.977	2682.276	2684.174	2680.517	2677.687
AR1	2718.279	2703.313	2684.218	2673.930	2673.850	2675.826
AR2	2688.899	2690.125	2682.829	2673.985	2675.799	2677.797
AR3	2689.952	2687.935	2684.444	2675.966	2677.788	2678.246
AR4	2690.873	2688.484	2683.913	2676.762	2678.758	2680.760
AR5	2681.468	2683.464	2684.146	2678.761	2680.629	2675.033

The AIC table shows the model fit quality for different combinations of ARIMA(p,1,q) models (where differencing order d = 1 is fixed according to the ACF/PACF in EDA).

The model with the lowest AIC value is ARIMA(5,1,5) (AIC ≈ 2675.033), followed closely by several other models like ARIMA(3,1,3) and ARIMA(5,1,2).
In general, lower AIC indicates a better model fit, balancing goodness of fit and model complexity.
Conclusion: Based on the AIC values, ARIMA(5,1,5) would be a strong candidate for modeling confirmed cases, although slightly simpler models (like ARIMA(3,1,3)) could also be considered if parsimony is preferred.

ARIMA(5,1,5) Model Specification

Let $Y_t$ represent the original time series (weekly confirmed cases).

Since $d = 1$, the series is differenced once:

\[ \nabla Y_t = Y_t - Y_{t-1} \]

The ARIMA(5,1,5) model for the differenced series is specified as:

\[ \nabla Y_t = \phi_1 \nabla Y_{t-1} + \phi_2 \nabla Y_{t-2} + \phi_3 \nabla Y_{t-3} + \phi_4 \nabla Y_{t-4} + \phi_5 \nabla Y_{t-5} + \epsilon_t + \theta_1 \epsilon_{t-1} + \theta_2 \epsilon_{t-2} + \theta_3 \epsilon_{t-3} + \theta_4 \epsilon_{t-4} + \theta_5 \epsilon_{t-5} \]

where:

$\phi_1, \phi_2, \phi_3, \phi_4, \phi_5$ are the autoregressive (AR) coefficients,
$\theta_1, \theta_2, \theta_3, \theta_4, \theta_5$ are the moving average (MA) coefficients,
$\epsilon_t$ is white noise (i.i.d. with mean 0 and constant variance).

Model Fitting

## Series: ts_confirmed 
## ARIMA(5,1,5) 
## 
## Coefficients:
##           ar1     ar2     ar3      ar4      ar5     ma1     ma2      ma3
##       -0.7968  0.1543  0.6487  -0.0750  -0.0990  1.7467  0.9849  -0.9971
## s.e.   0.1227  0.1488  0.1450   0.1478   0.1168  0.1031  0.1792   0.1692
##           ma4      ma5
##       -1.4622  -0.7648
## s.e.   0.1695   0.0987
## 
## sigma^2 = 350631431:  log likelihood = -1326.52
## AIC=2675.03   AICc=2677.52   BIC=2705.51
## 
## Training set error measures:
##                    ME     RMSE      MAE MPE MAPE      MASE        ACF1
## Training set 954.8393 17838.73 9741.407 NaN  Inf 0.1749791 -0.01154425

The fitted ARIMA(5,1,5) model achieved a log-likelihood of -1326.52, indicating the overall goodness of fit. In general, a higher log-likelihood suggests a better model fit to the data.

Figure 4: ARIMA(5,1,5) Fitted vs. Actual plot

The time series plot (Figure 5) comparing the actual confirmed cases (blue) and the ARIMA(5,1,5) fitted values (red) shows that the model tracks the major peaks and overall trends quite closely. Although there are some deviations at sharp turning points — especially during rapid spikes and declines — the model generally captures the overall behavior of the pandemic progression well.

Model Diagnostics

## 
##  Box-Ljung test
## 
## data:  res
## X-squared = 5.8596, df = 20, p-value = 0.9991

$Figure 5: Model Diagnostics for ARIMA(5,1,5)$

Figure 5: Model Diagnostics for ARIMA(5,1,5)

Residual Diagnostics

The residual time series appears centered around zero with no clear patterns, although some variance increases during peaks.
The ACF of residuals shows no significant autocorrelation, supporting the model adequacy [2, 3].

Normality Check

The Q-Q plot shows slight deviations from normality at the tails, but residuals are generally close to the theoretical normal line, which is acceptable for ARIMA models [2, 3].

Box-Ljung Test

The Box-Ljung test yields a p-value of 0.9991, suggesting no significant autocorrelation remaining in the residuals (fail to reject the null hypothesis of independence) [2, 3].

Figure 6: Inverse Roots of Both AR and MA Polynomials

Stability Check

The AR inverse roots are all located within the unit circle, indicating that the model satisfies the stationarity condition. The MA inverse roots lie mostly within the unit circle, satisfying the invertibility condition. However, a few roots are located near or on the boundary, indicating that the model is close to non-invertibility. As a result, shocks may decay more slowly over time, although the process remains invertible. [4].

Overall, the ARIMA(5,1,5) model provides a reasonable fit to the confirmed cases data. The residuals show no significant autocorrelation, the model passes the Box-Ljung test, and the roots confirm stationarity and invertibility. While slight deviations from normality are observed, the model effectively captures the main trend and fluctuations, making it a suitable baseline for further analysis.

Model 2: VAR

The second model we aim to build is a Vector Autoregression (VAR) model, which incorporates confirmed, recovered, and deceased cases simultaneously. Unlike the ARIMA model that focuses on a single series, the VAR model allows us to capture the dynamic relationships between multiple time series. This modeling approach is inspired by the work presented in this study, where the authors successfully used VAR models to analyze the interactions among COVID-19 variables over time [5].

Model Selection

Select the VAR Model Using Optimal Lag

To build the VAR model, we included the confirmed, recovered, and deceased cases as jointly modeled time series. The optimal lag length was selected using information criteria (AIC, HQ, SC, and FPE). Based on the VARSelect output, most criteria suggested a higher lag order (around 20), but to balance model complexity and overfitting risk, we chose p = 9 for the final VAR model. This setup allows the model to capture dependencies across multiple past weeks while maintaining reasonable parsimony.

Model Specification

The VAR model captures the joint dynamics of confirmed ($y_1$), recovered ($y_2$), and deceased ($y_3$) cases.
The general form of a VAR($p$) model is:

\[ \mathbf{Y}_t = \mathbf{B}_0 + \mathbf{B}_1 \mathbf{Y}_{t-1} + \mathbf{B}_2 \mathbf{Y}_{t-2} + \cdots + \mathbf{B}_p \mathbf{Y}_{t-p} + \boldsymbol{\epsilon}_t \]

where:

$\mathbf{Y}_t$ is a $3 \times 1$ vector of the three time series at time $t$,
$\mathbf{B}_0$ is a $3 \times 1$ vector of intercepts,
$\mathbf{B}_1, \mathbf{B}_2, \ldots, \mathbf{B}_p$ are $3 \times 3$ coefficient matrices, - $\boldsymbol{\epsilon}_t$ is a $3 \times 1$ vector of error terms.

For example, a VAR(1) model (for illustration) would be written as:

\[ \begin{pmatrix} y_1(t) \\ y_2(t) \\ y_3(t) \end{pmatrix} = \begin{pmatrix} b_{10} \\ b_{20} \\ b_{30} \end{pmatrix} + \begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix} \begin{pmatrix} y_1(t-1) \\ y_2(t-1) \\ y_3(t-1) \end{pmatrix} + \begin{pmatrix} \epsilon_1(t) \\ \epsilon_2(t) \\ \epsilon_3(t) \end{pmatrix} \]

In our case, $p = 9$, so the model includes lags up to nine weeks back for each series [5].

Model Fitting

## [1] -3229.098

The log-likelihood for the VAR(9) model was manually computed as -3229.10 due to the inclusion of a constant term, which prevented direct extraction from the model output. Given the relatively large sample size and the complexity of the data (confirmed, recovered, and deceased cases), this log-likelihood value suggests that the VAR(9) model provides a reasonable fit to the data, capturing the main dynamics without overfitting.

Figure 7: VAR(9) Model Fit to Confirmed, Recovered, and Deceased Cases

Figure 7 shows the VAR(9) model fit to the confirmed, recovered, and deceased cases over time. Across all three series, the fitted values (red) closely follow the actual observed data (blue), capturing the overall trends, peaks, and shifts quite well. Minor discrepancies appear around sharp transitions, particularly for deceased cases, where the fitted values slightly overshoot or lag behind the actual counts. Nevertheless, the model effectively tracks the general behavior of each time series, suggesting that the VAR(9) model provides a good approximation of the underlying dynamics.

Model Diagnostics

ACF

Figure 8: ACF of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

The residual ACF plots for confirmed, recovered, and deceased cases show no significant autocorrelation, suggesting that the VAR(9) model adequately captured the dynamics of the series.

QQ Plot

Figure 9: Q-Q Plot of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

Q-Q plots indicate that residuals are approximately normally distributed, with slight deviations at the tails, which is acceptable for practical purposes.

Box- Ljung Test

## 
##  Box-Ljung test
## 
## data:  residuals_var[, "Confirmed"]
## X-squared = 1.9394, df = 10, p-value = 0.9968

## 
##  Box-Ljung test
## 
## data:  residuals_var[, "Recovered"]
## X-squared = 3.2267, df = 10, p-value = 0.9756

## 
##  Box-Ljung test
## 
## data:  residuals_var[, "Deceased"]
## X-squared = 6.2883, df = 10, p-value = 0.7905

The Box-Ljung tests for all three series yield high p-values (all > 0.79), meaning we fail to reject the null hypothesis of no autocorrelation, supporting model adequacy.

Stability Check

The OLS-CUSUM test checks the stability of a VAR model by examining whether the cumulative sum of residuals stays within confidence bounds. If the plot crosses the bounds, it indicates possible structural breaks, meaning the VAR model may need to be re-specified [6].

Figure 10: OLS-CUSUM plots of VAR(9) Model: Confirmed, Recovered, and Deceased Cases

OLS-CUSUM plots for each equation show that the empirical fluctuation processes remain well within the boundaries, indicating model stability over time.

Overall, the VAR(9) model provides a good fit to the confirmed, recovered, and deceased case series. Model diagnostics indicate that the residuals are uncorrelated, approximately normally distributed, and stable over time. The fitted values closely capture the major trends and variations in the data, supporting the use of the VAR model for analyzing the dynamic relationships between these pandemic variables.

Model 3: SEIRS Model1

Introduction

Model Specification

We propose a SEIRS (Susceptible - Exposed - Infectious - Recovered - Susceptible) Model (Class notes 11-18 in STATS531 Modeling and Analysis of Time Series Data(2025) [7] and COVID_19 final project COVID-19 in Kent County, MI (2024) [3]), which is a compartmental epidemiological model that extends the classical SEIR model by including the transition from state $R$ to $S$, thus allowing re-infections.

SEIRS Diagram

The model can be characterized by the following system of differential equations,

\[\begin{align*} \frac{\mathrm{d}S}{\mathrm{d}t} &= -\mu_{SE} S(t) + \mu_{RS} R(t) \\ \frac{\mathrm{d}E}{\mathrm{d}t} &= \mu_{SE} S(t) - \mu_{EI} E(t) \\ \frac{\mathrm{d}I}{\mathrm{d}t} &= \mu_{EI} E(t) - \mu_{IR} I(t) \\ \frac{\mathrm{d}R}{\mathrm{d}t} &= \mu_{IR} I(t) - \mu_{RS} R(t) \\ \end{align*}\]

where

$\mu_{SE} = \beta(t)$ is the transition rate from state $S$ to state $E$, also known as the transmission rate;
$\mu_{EI}$ is the transition rate from state $E$ to state $I$;
$\mu_{IR}$ is the transition rate from state $I$ to state $R$;
$\mu_{RS}$ is the transition rate from state $R$ to state $S$;
$N(t) = S(t) + E(t) + I(t) + R(t)$ is the population size, which is assumed to be constant throughout the epidemic.

In order to discretize the model, we assume that each transition between compartments is a binomial random variable, i.e. \[\begin{align*} \Delta N_{SE} &= Binomial(S, 1 - e^{-\beta(t) \frac{I \Delta t}{N}}) \\ \Delta N_{EI} &= Binomial(E, 1 - e^{-\mu_{EI} \Delta t}) \\ \Delta N_{IR} &= Binomial(I, 1 - e^{-\mu_{IR} \Delta t}) \\ \Delta N_{RS} &= Binomial(R, 1 - e^{-\mu_{RS} \Delta t}) \\ \end{align*}\]

As demonstrated in earlier sections, the sample period can be segmented into three distinct phases. Accordingly, we allow the transmission rate, $\beta(t)$, the dispersion coefficient, $k$,and the reporting rate $\rho$, to vary across these phases to reflect changes in underlying epidemic dynamics. Specifically, let $\beta(t), k(t), \rho(t)$ be piecewise functions s.t. \[\begin{equation*} \beta(t) = \left\{ \begin{matrix} b_1, & t \in [1, 61] \\ b_2, & t \in [62, 96] \\ b_3, & t \in [63, 119] \end{matrix} \right.\end{equation*}\] \[\begin{equation*} k(t) = \left\{ \begin{matrix} k_1, & t \in [1, 61] \\ k_2, & t \in [62, 96] \\ k_3, & t \in [63, 119] \end{matrix} \right.\end{equation*}\]

\[\begin{equation*} \rho(t) = \left\{ \begin{matrix} \rho_1, & t \in [1, 61] \\ \rho_2, & t \in [62, 96] \\ \rho_3, & t \in [63, 119] \end{matrix}\right.\end{equation*}\]

Note that $k$ and $\rho$ controls the negative binomial distribution for the number of reported cases $Y(t)$, i.e. \[ Y(t) \sim NegBinom(\rho * H, \rho * H + \frac{(\rho H)^2}{k(t)}) \]

Why not SEIR

We do not start with the SEIR model since it assumes that recovered individuals gain permanent immunity. Consequently, the SEIR model would trap all individuals in $R$ forever with enough time, leading to depletion of susceptibles and underestimation of future waves when multiple outbreaks are observed, which is our case.

Alternatively, we consider adding the $R \rightarrow S$ route in order to allow recovered individuals to lose immunity over time and return to the susceptible state, thus enabling re-infections.

Why Time-varying $k$ and $\rho$

The problematic models mentioned in this section can be found in the Appendix.

At the very beginning, we began our analysis with a SEIRS model featuring time-varying transmission rates, while initially holding the dispersion parameter $(k)$ and the reporting rate $(\rho)$ fixed. This setup was meant for simplicity. This model fails to account for any shifts in amplitudes after the first outbreak, and instead exhibits dynamics similar to a stationary process. Besides, some parameters including $b_2, b_3$, $\mu_{IR}$ are either unstable or unrealistic.

Our next step was to allow the dispersion parameter $k$ in the negative binomial distribution to vary over time, in order to capture changes in the level of noise in reported case counts across different epidemic phases. Such variations may arise from factors like evolving testing policies, reporting delays, or improvements in data infrastructure.

When using a relatively narrow range for the global search, the parameter estimates remain stable, yet some values are still nonsensical, and the simulation exhibits an unexpected spike during the initial outbreak. Expanding the search to a wider range leads to increased parameter instability, and the profile likelihood plots for $\eta$ and $\rho$ become problematic. The plot for $\eta$ shows no significant pattern or structure, indicating weak identifiability, while the other plot contains a gap between $\rho = 0.4$ and $\rho=0.45$, which is typically a red flag suggesting either numerical instability or convergence failure.

Finally, varying only $k(t)$ allows the model to adjust the dispersion around the expected number of reported cases, but it alone does not affect the mean of the observation distribution. As a result, it cannot account for changes in the actual level of under reporting or detection bias over time.This is why we decided to introduce a time-varying reporting rate $\rho(t)$, and hence derived our final model.

Initial Guesses

We begin with a population size of $N = 34,530,000$, which is estimated based on the population size of 33,406,061 from the most previous Cencus of India [8]. For simplicity, $N$ is assumed to be constant throughout the sample time period. The initial guess for the parameters are as follows. \[\begin{equation*}\left\{\begin{matrix} b_1 = 3, b_2 = 50, b_3 = 20, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.4, \mu_{RS} = 0.005, \\ \rho_1 = 0.12, \rho_2 =0.4, \rho_3 = 0.5 \\ k_1 = 2.5, k_2 = 4, k_3 = 0.5 \\ \eta = 0.1 \\ \end{matrix}\right.\end{equation*}\]

We expect $b_2$ to be greater than $b_1$ since the second outbreak features a larger magnitude, which suggest rising transmissibility. $b_3$ is assumed to be moderately less than $b_2$ to compensate for the heavy infections already caused by the 2nd outbreak, so that the third wave doesn’t unrealistically overshoot due to a reduced susceptible population.

Additionally, the median incubation period is roughly 4.5-5.8 days [9]. The infectious period is usually believed to be 1-2 days before and 8-10 days after symptoms begin [10] . To include a buffer for late recovery or reporting delays, we assume this period to be 2.5 weeks. Hence, we set $\mu_{EI} = 1.67$, and $\mu_{IR} = 0.5$. $\mu_{RS}$ is set to 0.005 for numerical stability, as we examine significant divergence results and worse local search and global search results when we try to increase $\mu_{RS}$.

Moreover, $\eta$ is set to $0.1$ considering the large population size in Kerala. Large $N$ implies high absolute infection counts in spite of small proportions, which would possibly sufficient for the early seeding of the outbreaks.

Before starting the local search, it still remains to check the particle filters and plot the simulations based the guess.

ESS Check and Simulations, Initial Guess

With the initial parameters, the model can closely capture the timing and structure of the second and the third major outbreak. However, it overestimates the sharpness of the third peak. The first outbreak has been significantly underestimated in terms of magnitude. Besides, the effective sample size drops sharply during the first outbreak (week 10-30), indicating poor particle diversity and possibly degeneracy in the particle filter. Admittedly, the initial guesses are far from perfect, but we will see significantly improved model fit after the local search and global search.

Local Search

We conducted the local search using 5,000 particle filters, 200 iterations. The diagnostics plot is as follows.

Local Search Diagnostics

Though we don’t include the effective sample size check here, but the ess close to 0 in the middle part issue of our initial guess is actually fixed after performing the global search. The log-likelihood trajectories here show rapid convergence towards roughly -1,250 and stabilization, indicating that the local search successfully identified a high-likelihood region even in the expanded parameter space with time-varying $k$ and $\rho$.

Parameter-wise, some estimates demonstrate stable convergence, including the time-varying dispersion terms $k_1, k_2, k_3$, reporting rates $\rho_1, \rho_2$, and the recover rate $\mu_{IR}$, while others may suffer poor convergence and weak identifiability, especially $b_2$, $\mu_{EI}$, and $\rho_3$.

In addition, unlike results from simpler models(they can be found in the appendix) where $\eta$ decreased steadily, here it increases over iterations, which may better reflect a realistic susceptible population size.

Finally, $b_3$ may still be problematic since it rapidly declines and converges toward zero, which could be the the model attempting to compensate for prior infection dynamics by downweighting $b_3$, despite the visible spike in observed cases.
Overall, despite the rich parameterization, the model remains well-behaved and converges stably.

The locally optimal parameters and the simulation based on them are as follows.

##         b1       b2          b3       rho1      rho2      rho3    mu_EI
## 1 2.334175 86.18567 0.006884004 0.09402705 0.4265065 0.5874697 1.515105
##       mu_IR mu_RS       k1       k2        k3       eta        N loglik.est
## 1 0.1439926 0.005 2.885612 4.302678 0.6978601 0.1569865 34530000  -1242.982
##   loglik.se.se
## 1  0.002233215

Simulations, Local Search

This simulation aligns closely with the observed data in terms of overall wave timing and peak locations, especially during the first two outbreak periods. Overall, the fit demonstrates strong agreement with the empirical trend while allowing for stochastic variation.

Global Search

Based on our local search convergence range, we give a wider global searching range with 800 random starting points as follows:

\[ b_1 \in(1, 5) \quad b_2 \in (5, 50) \quad b_3 \in (10, 50) \\ \rho_1 \in(0.05, 0.3) \quad \rho_2 \in (0.3, 0.6) \quad \rho_3 \in (0.4, 0.8)\\ k_1 \in(1, 5) \quad k_2\in(1, 5) \quad k_3\in(1, 5) \\ \mu_{EI} \in (1, 5) \quad \mu_{IR} \in (0.1, 0.5) \quad \eta \in(0, 1)\\ \]

Global search leads to slightly larger (around 3 log units) log-likelihood than local search with small Monte Carlo standard error. The sets of parameters corresponding to the top 10 log-likelihood stay relatively consistent. Therefore, we can just extract the model with the largest log-likelihood as our best model during current global search, and use the existing results in the literature to access the rationality of our best parameters.

We first focus on the non-multi-stage parameters, to be specific $\eta, \mu_{EI}, \mu_{IR}$:

At the onset of the COVID-19 pandemic in early 2020, the global population was considered to be entirely susceptible to SARS-CoV-2, as it was a novel virus with no prior immunity in humans (World Health Organization [WHO], 2020) [11], which suggests a very large initial fraction $\eta$ values 0.9 or even near 1.0. As the pandemic progressed, population-level immunity began to build through natural infections and widespread vaccination campaigns. By late 2021, seroprevalence data from the U.S. Centers for Disease Control and Prevention (CDC) indicated that over 90% of the U.S. population had detectable antibodies due to either infection or vaccination (Clarke et al., 2022) [12]. However, susceptibility to reinfection remained due to factors such as waning immunity and the emergence of immune-evasive variants like Omicron. Recent studies estimate that even with high levels of seroprevalence, approximately 30–50% of individuals may be susceptible to reinfection at any given time during the Omicron period, depending on prior exposure, vaccine status, and the circulating variant (Pulliam et al., 2022) [13]. Based on our data (from 2020 to 2022), in Kerala, the vaccination record started in 2021 (can be check in our data folder) and Omicron (the third peak) occurred in late 2021 and early 2022, so our best parameter initial fraction $\eta \approx 0.71$ is reasonable in the general setting, and our added $R \rightarrow S$ route can compensate our initial fraction $\eta$ by returning people from state $R$ to state $S$, which means over the entire time period, the proportion of people who once become susceptible or enter state $S$ is larger than initial fraction $\eta \approx 0.71$, may reach 0.9 or even near 1.0, but we fail to track the exact number in our model.

According to Lauer, S. A., et al. (2020) [14], in general, the median incubation period for COVID-19 is approximately 5 days, our best parameter $\mu_{EI} = 2.38$ corresponds to approximate 3 days incubation period, which is shorter than the 5 days, but is still a reasonable value considering the possible region difference (for example, health condition……) in Kerala, India.

According to World Health Organization (WHO) (2020) [15], people with mild infectious generally need 2 weeks to recover, and more severe cases may take 3-6 weeks or even more, our best parameter $\mu_{IR}$ falls around 0.15, corresponds to 7 weeks recovery time, we admit that 7 weeks is still longer than the actual COVID-19 recover time, but at least it becomes much better than our previous model results (See Appendix) where $\mu_{IR}$ actually converges towards 0.01 which corresponds to 100 weeks recovery time and probably it is the best our model can get during current global search.

Then we want to focus on our multi-stage parameters, to be specific $\rho_1, \rho_2, \rho_3, b_3$:

Early Phase (2020–early 2021) $\rho_1$: The first reporting rate, $\rho_1 \approx 0.05$, reflects the early months of the pandemic when both the public and governments were still coming to grips with the outbreak. Testing capacity was extremely limited, awareness was low, and many infections likely went undetected — particularly mild or asymptomatic cases. Studies suggest that early under-ascertainment was substantial, with some estimates indicating that fewer than 10% of infections were reported in early 2020 (Wu et al., 2020; Nishiura et al., 2020) [16] [17].

Middle Phase (mid-2021) $\rho_2$: The increase to $\rho_2 \approx 0.4$ coincides with expanded testing infrastructure, public health campaigns, and improved surveillance. During this period, rapid antigen tests and PCR testing became widely available in many regions, increasing the likelihood that symptomatic and even some asymptomatic cases were reported (Hellewell et al., 2021) [18].

Late Phase (2022 onward) $\rho_3, b_3$: The final reporting rate, $\rho_3 \approx 0.85$, reflects a period in which detection capacity was fully developed, testing became routine in many settings (e.g., workplaces, schools), and public awareness was high. Additionally, mandatory reporting systems and contact tracing efforts were more refined. While some under reporting likely persisted (especially for mild or asymptomatic cases), this value suggests a near-complete reporting of symptomatic infections during peak testing periods.

The final transmission rate $b_3 \approx 0.0024$ during Omicron outbreak is actually very small, this may can be explained by our model mechanism, but as the underlying model mechanism involves too many states and transmissions, we fail to track the number of people in each state at each time point in our model and conjecture reasonable explanations.

After analyzing the rationality of our best model’s parameters, we try use them to generate 5 simple random simulations.

Despite slight overestimation issue in the first outbreak and a bit early appearance of the third outbreak, our best global search model’s simulations suggest that our model fitting is pretty good and much better than previous models (See Appendix). It well captures three different outbreaks within three different time intervals and their relative magnitude from smallest (first outbreak) to largest(third outbreaks), which can be viewed as a good candidate for explaining our COVID_19 data in Kerala, India.

Then we want to dive deeper into the interpretation of the global search best parameter by looking into their profile likelihood results, especially $\rho, \eta,\mu_{IR}$.

Profile Likelihood

##           Lower      Upper
## Rho1 0.03805235 0.08148367
## Rho2 0.35667937 0.57063668
## Rho3 0.44796157 0.84804805

All three profile plots are consistent in the largest log-likelihood, which is between -1240 and -1239. The profile plots of $\rho_1$ and $\rho_3$ all have clear pattern and produce nice confidence interval without varying too much from our best global search model parameters; but the profile plot of $\rho_2$ looks much more messy, a possible reason is that the log-likelihood is very hard to calculate for some points and lead to potential maximization errors, despite that, the profile plot of $\rho_2$ has a very nice quadratic form in the top-left corner.

##           Lower     Upper
## Eta   0.4442709 0.9921901
## mu_IR 0.1106291 1.5182440

The profile of $\eta$ has clear pattern despite several possible points where log-likelihood is hard to compute and lead to maximization errors, it produces a wide confidence interval on $\eta$ which may coincide with our previous statement that added $R \rightarrow S$ route can compensate our initial fraction $\eta$ and makes initial fraction $\eta$ not very sensitive. The profile of $\mu_{IR}$ has a good quadratic from on the left side where $\mu_{IR}$ is between 0 and 0.25, but contains several isolated points with large log-likelihood on the right side where $\mu_{IR}$ is between 1.0 and 1.5 (one week recovery time), and 1.0 to 1.5 is a range we probably fail to incorporate during our previous global search, which further indicates another possible global optimum search range and another global optimum model that may improve our current global search best model by directly improving $\mu_{IR}$.

Model 4: SEIRS Model2

As discussed above, the profile likelihood plots reveal a data point with a remarkably high log-likelihood, corresponding to a set of stable and plausible parameters. We therefore decided to proceed with a follow-up investigation. Specifically, we have replace the initial parameters for the local search with those from the data point, while keeping all other configurations:

\[\begin{equation*}\left\{\begin{matrix} b_1 = 2.33, b_2 = 3.68, b_3 = 11.40, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.4, \mu_{RS} = 0.005, \\ \rho_1 = 0.07, \rho_2 =0.51, \rho_3 = 0.11 \\ k_1 = 1.26, k_2 = 6, k_3 = 2.8 \\ \eta = 0.62 \\ \end{matrix}\right.\end{equation*}\]

Local Search 2

Let’s revisit the diagnostics. The log-likelihood reaches -1,240 after 200 iterations and still exibits an upward trend. Most parameter show strong signs of convergence except for $\eta$ and $\rho_1$. More importantly, the estimated parameters now have better intepretability, particularly $\mu_{IR}$, which is now centered around 1.0 $\mathrm{week}^{-1}$ rather than 0.15$\mathrm{week}^{-1}$.

Local Search Diagnostics

The simulations looks similar to the previous model, although there is still spike at the first outbreak which is not supposed to exist.

##         b1       b2       b3       rho1      rho2      rho3    mu_EI     mu_IR
## 1 1.733088 2.814144 9.765815 0.06187154 0.5179523 0.1040071 1.269608 0.9640089
##   mu_RS       k1       k2       k3       eta        N loglik.est loglik.se.se
## 1 0.005 1.371936 6.196671 2.694699 0.7374424 34530000  -1235.802  0.002025691

Simulations, Local Search

Global Search 2

Global search leads to slightly larger log-likelihood (2 log units) than local search with small Monte Carlo standard error. The sets of parameters corresponding to the top 10 log-likelihood stay relatively consistent. Therefore, we can just extract the model with the largest log-likelihood as our best model during current global search, and the best parameters is as follows,

We focus on the changes related to the previous global search, the log-likelihood becomes 6 log units larger and converges around -1233 with small Monte Carlo standard error. In terms of parameters interpretation, some of the parameters become better in this model, but new problems also appear.

For non-multi-stage parameters, to be specific $\eta, \mu_{EI}, \mu_{IR}$:

$\eta$ increase to 0.998, which is almost 1, coincides with our previous statement that global population was considered to be entirely susceptible to SARS-CoV-2, and indicates that the added $R \rightarrow S$ in this model mostly indicates second/third or even multiple times infectious situation.

$\mu_{EI} \approx1.1$, which corresponds to 6 days incubation period, which is pretty much close to our previous statement 5 days.

$\mu_{IR}\approx 0.84$ is significantly improved from 0.15 in the previous global search, and corresponds to 8.3 days recovery time on average. We admit 8.3 days is slight shorter than 2 weeks recovery time suggested in the literature, but it is more reasonable than 7 weeks on average in the previous global search.

For multi-stage parameters, to be more specific $\rho_1, \rho_2, \rho_3, b_3$ :

$\rho_1, \rho_2$ stay roughly the same as previous global search candidate, but a new problematic parameter is $\rho_3 \approx 0.09$ during the third(Omicron) outbreak, which is too small and beyond our expectation, we fail to explain what our best model suggests for $\rho_3$.

$b_3 \approx 9.14$, it is a coincident that $b_3$ is solved by this model, which maintains a reasonable transmission rate order, smallest transmission rate $b_1$ at the beginning of pandemic outbreak, slightly larger $b_2$ in the second(Delta) outbreak, and significantly larger $b_3$ during the third(Omicron) outbreak which coincides with the statement “Omicron has a basic reproduction number estimated to be 2–3 times higher than Delta, which itself was more transmissible than earlier variants” (Liu, Y., Rocklöv, J. , 2022) [19].

Similarly, we provide 5 simple random simulations based on our second global search candidate’s parameters.

The simulation results are slightly better than previous global search best model candidate, with the first and second outbreaks well captured as previous global search candidate, while solving the early appearance issue so that the third(Omicron) outbreak is roughly better captured. Then the best model in current global search 2 can be viewed as the second global search candidate

As the profile plot of $\mu_{IR}$ shows some interesting results in the previous global search candidate, we want to dive deeper to check the profile plot of $\mu_{IR}$ of our second global search candidate and try to combine two profile results to get a closer and nicer look of the full-range profile likelihood of $\mu_{IR}$.

Profile Likelihood on $\mu_{IR}$

## Combining two profile dataset, the 95% confidence interval of mu_IR is 0.7576469 1.126986

The profile plot (on the left) coincides with our previous guess that there is another global optimum range where $\mu_{IR}$ falls between 0.75 to 1.25 that we fail to explore during the first global search. After combining two profile datasets and make a profile plot (on the right), we have a very clear pattern which shows two quadratic-shape peaks with the first peak corresponds to our first global search and second peak corresponds to our second global search.

Conclusion & Limitations

Then we extract the log-likelihood for our benchmark $ARIMA(5, 1, 5)$ and our two $SEIRS$ global search candidates, though we believe compare log-likelihood is sufficient, we also provide $AIC$ values comparison [2].

##                  Log-Likelihood      AIC
## ARIMA(5, 1, 5)        -1326.516 2675.033
## SEIRS_Candidate1      -1239.449 2502.898
## SEIRS_Candidate2      -1233.212 2490.423

Compared with our benchmark ARIMA(5, 1, 5), we can see that both our SEIRS global search candidates (12 free parameters) have significantly larger log-likelihood and smaller AIC values. This means that the SEIRS global search candidates are actually better fit in terms of both log-likelihood (AIC) and potential mechanism interpretation values. However, between two SEIRS global search candidates, though the first candidate has smaller log-likelihood and larger AIC, consider the different parameter values and their interpretation values, we prefer to leave all of they here as our best SEIRS model in this project, which can be further explored in the future.

Also, we provide another option VAR model using the confirmed, recovery and deceased data at the same time, which is also an advantage of our well recorded dataset. Compared to ARIMA, the VAR model has the advantage of handling multiple interrelated time series simultaneously, allowing it to capture cross-variable dynamics that ARIMA cannot. Compared to POMP models, VAR models are simpler and more transparent, requiring fewer assumptions about the underlying latent processes. This makes VAR easier to estimate and interpret. But as the log-likelihood and AIC for VAR model is based on all confirmed, recovered and deceased data, while ARIMA and SEIRS only incorporate confirmed data, we can not directly compared with ARIMA and SEIRS model, which can be further discuss in the future.

One of the main limitation of our project is the fixed setting on $\mu_{RS} = 0.005$ for SEIRS model, corresponds to 200 weeks immunity and is generally too large. But as we examined significant divergence results and worse local search and global search results when we try to increase $\mu_{RS}$, so we give up on that and leave for future exploration. Another possible limitation is that due to the complexity of our SEIRS model, we may not have a very nice-looking global log-likelihood surface and due to our limited time and searching setting on Greatlakes (NP = 5000, NMIF = 200, Guesses = 400/800), it is really tricky and hard to find the actual global optimum. Even though we add an extra round of global search after we examine some interesting points occur in the $\mu_{IR}$ profile plot and our parameter and simulation results look nice, we may still get stuck into some local optimum, and fail to really touch the global optimum. We believe this a very common problem in carrying out maximization algorithm in complex models, which also is left for future explorations.

Appendix

SEIRS, with Constant $k$ and $\rho$

Initial Guesses

\[\begin{equation*} \left\{ \begin{matrix} b_1 = 5, b_2 = 10, b_3 = 20, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.5, \mu_{RS} = 0.005, \\ \rho = 0.4, \\ k = 10, \\ \eta = 0.1 \\ \end{matrix} \right.\end{equation*}\]

With the initial guess, the model manages to capture the timeline for most outbreaks, but fails to estimate the magnitudes.

Local Search

The diagnostics plot is as follows. The log-likelihood shows rapid improvement in the early iterations, followed by clear stabilization around iteration 100, indicating good convergence of the local search. However, it eventually stays at roughly -1,800, which is lower than that of the benchmark by almost 500 units.

Most parameter trajectories exhibit some variability. Fixed parameters ($k$, $\mu_{RS}$, $N$) remain constant as expected.

The locally optimal parameters and the simulation based on them are as follows.

##         b1       b2       b3       rho    mu_EI     mu_IR mu_RS  k        eta
## 1 5.757514 48.26719 40.03208 0.2562741 1.411273 0.3811521 0.005 10 0.05511792
##          N loglik.est loglik.se.se
## 1 34530000  -1683.127   0.02625717

The model can closely capture the timing of the first outbreak, but it overestimates the amplitude. The second peak looks great. Unfortunately though, the model dies down and fails to capture any volatilities after that. Changes in model structure will be necessary if the global search cannot lead to significant improvements in model performance.

Global Search

SEIRS, with constant $\rho$ and Time-varying $k$

Initial Guesses

\[\begin{equation*} \left\{ \begin{matrix} b_1 = 5, b_2 = 10, b_3 = 20, \\ \mu_{EI} = 1.67, \mu_{IR} = 0.5, \mu_{RS} = 0.005, \\ \rho = 0.4, \\ k_1 = 8, k_2 = 5, k_3 = 2 \\ \eta = 0.1 \\ \end{matrix} \right.\end{equation*}\]

The timeframes of all three outbreaks are captured more accurately, while the magnitudes are still problematic.

Local Search

Compared to the previous plot, this diagnostics plot shows much faster and more stable convergence across both log likelihood and parameter trajectories. The log likelihood stabilizes quickly with minimal variability between chains, while improving to around -1,300, indicating better local optimization.

For parameters, the inclusion of $k_1$, $k_2$, and $k_3$ leads to a slightly higher-dimensional parameter space, yet their trajectories converge sharply, especially $k_1$ and $k_3$, which stabilize early. In contrast to the earlier plot, this suggests improved identifiability and reduced noise, possibly due to better initialization or more informative structure of the model. Parameters like $b_1$, $b_2$, and $\mu_{IR}$ show clearer and more uniform convergence trends across chains. However, the steady decline of $\eta$ across iterations is concerning, as it suggests the model increasingly favors a smaller initial susceptible population, which may be unrealistic. This could indicate issues with parameter identifiability or compensatory behavior to fit later outbreak waves. Overall, the model appears more stable and well-calibrated than before.

The locally optimal parameters and the simulation based on them are as follows.

##            b1            b2            b3           rho         mu_EI 
##  8.482112e+00  4.713658e+01  3.429103e+01  3.482484e-01  9.876440e-01 
##         mu_IR         mu_RS            k1            k2            k3 
##  5.700558e-01  5.000000e-03  1.024699e+00  4.612992e+00  4.465041e-01 
##           eta             N    loglik.est  loglik.se.se 
##  4.362946e-02  3.453000e+07 -1.276941e+03  1.795350e-03

While most of the simulation aligns reasonably well with the timing and scale of observed waves, the first spike shows a clear anomaly from one of the simulations, in which more than 1.5 million cases are reported in a single week, and that is far above the observed data. This outlier suggests instability or overdispersion in that interval, likely driven by parameter uncertainty or an extreme realization, while the rest of the simulations remain within a plausible range.

Global Search

Narrow Global Search

Wider Global Search

Wider Global Search Profile Likelihood

Acknowledgement

[1] Government of Kerala. (2022). COVID-19 Dashboard. Retrieved from https://dashboard.kerala.gov.in/covid/index.php

[2] Ionides, E. L. (2025). Modeling and Analysis of time series data: Chapter 5 - Parameter estimation and model identification for ARMA models. Retrieved from https://ionides.github.io/531w25/05/slides.pdf

[3] Final Project Example: Project 12. Retrieved from https://ionides.github.io/531w24/final_project/project12/blinded.html

[4] Ionides, E. L. (2025). Modeling and Analysis of Time Series Data Chapter 6 - Extending the ARMA model: Seasonality, integration and trend. Retrieved from https://ionides.github.io/531w25/06/slides.pdf

[5] Khan, F., Saeed, A., & Ali, S. (2020). Modelling and forecasting of new cases, deaths and recover cases of COVID-19 by using Vector Autoregressive model in Pakistan. Chaos, Solitons & Fractals, 140, 110189. https://doi.org/10.1016/j.chaos.2020.110189

[6] Markovska-Simoska, S., Pop-Jordanov, J., & Kapushevska, B. (2018). Meteorological multivariable approximation and prediction with classical VAR-DCC approach. ResearchGate. https://www.researchgate.net/publication/324153796_Meteorological_multivariable_approximation_and_prediction_with_classical_VAR-DCC_approach

[7] Ionides, E. L. (2025). Modeling and Analysis of time series data: Class notes 11:18. https://ionides.github.io/531w25/.

[8] Kerala State Planning Board. (2020). Economic Review 2020 – Volume 1. Government of Kerala. https://spb.kerala.gov.in/sites/default/files/2021-01/English-Vol-1_0.pdfLinks to an external site.

[9] Lauer, S. A., Grantz, K. H., Bi, Q., Jones, F. K., Zheng, Q., Meredith, H. R., Azman, A. S., Reich, N. G., & Lessler, J. (2020). The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Annals of internal medicine, 172(9), 577–582. https://doi.org/10.7326/M20-0504

[10] Centers for Disease Control and Prevention. (2024). COVID-19. In CDC Yellow Book: Health Information for International Travel 2024. https://wwwnc.cdc.gov/travel/yellowbook/2024/infections-diseases/covid-19

[11] World Health Organization (WHO) (2020).Report of the WHO-China Joint Mission on Coronavirus Disease 2019 (COVID-19). https://www.who.int/publications/i/item/report-of-the-who-china-joint-mission-on-coronavirus-disease-2019-(covid-19)

[12] Clarke, K. E. N., Jones, J. M., Deng, Y., Nycz, E., Lee, A., Iachan, R., … & MacNeil, A. (2022). Seroprevalence of Infection-Induced SARS-CoV-2 Antibodies — United States, September 2021–February 2022. Morbidity and Mortality Weekly Report (MMWR), 71(17), 606–608. https://doi.org/10.15585/mmwr.mm7117e3

[13] Pulliam, J. R. C., van Schalkwyk, C., Govender, N., von Gottberg, A., Cohen, C., Groome, M. J., … & Moultrie, H. (2022). Increased risk of SARS-CoV-2 reinfection associated with emergence of the Omicron variant in South Africa. Science, 376(6593), eabn4947. https://doi.org/10.1126/science.abn4947

[14] Lauer, S. A., et al. (2020). The Incubation Period of Coronavirus Disease 2019 (COVID-19) From Publicly Reported Confirmed Cases: Estimation and Application. Annals of Internal Medicine, 172(9), 577–582. https://www.acpjournals.org/doi/10.7326/M20-0504

[15] World Health Organization. (2020). Q&A on coronaviruses (COVID-19). Retrieved from https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19

[16] Wu, S. L., Mertens, A. N., Crider, Y. S., et al. (2020). Substantial underestimation of SARS-CoV-2 infection in the United States. Nature Communications, 11, 4507. https://doi.org/10.1038/s41467-020-18272-4

[17] Nishiura, H., Kobayashi, T., Miyama, T., et al. (2020). Estimation of the asymptomatic ratio of novel coronavirus infections (COVID-19). International Journal of Infectious Diseases, 94, 154–155. https://doi.org/10.1016/j.ijid.2020.03.020

[18] Hellewell, J., Abbott, S., Gimma, A., et al. (2021). Estimating the effectiveness of routine asymptomatic PCR testing at different frequencies for the detection of SARS-CoV-2 infections. BMC Medicine, 19, 106. https://doi.org/10.1186/s12916-021-01982-x

[19] Liu, Y., Rocklöv, J. (2022). The reproductive number of the Omicron variant is several times relative to Delta. Journal of Travel Medicine, 29(3). https://doi.org/10.1093/jtm/taac037

[20] Chatgpt. Used for debugging and proof-reading.

Final Project

2025-04-01

Introduction

About the Datasets

COVID-19 Case Data

Exploratory Data Analysis

Weekly Trends for Confirmed, Recovered, Deceased COVID-19 Cases and Vaccination Counts

ACF and PACF of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Smoothed periodogram of Weekly Confirmed, Recovered, Deceased COVID-19 Cases

Model 1: ARIMA

AIC- Based Model Selection

ARIMA(5,1,5) Model Specification

Model Fitting

Model Diagnostics

Residual Diagnostics

Normality Check

Box-Ljung Test

Stability Check

Model 2: VAR

Model Selection

Select the VAR Model Using Optimal Lag

Model Specification

Model Fitting

Model Diagnostics

ACF

QQ Plot

Box- Ljung Test

Stability Check

Model 3: SEIRS Model1

Introduction

Model Specification

Why not SEIR

Why Time-varying \(k\) and \(\rho\)

Initial Guesses

Local Search

Global Search

Profile Likelihood

Model 4: SEIRS Model2

Local Search 2

Global Search 2

Profile Likelihood on \(\mu_{IR}\)

Conclusion & Limitations

Appendix

SEIRS, with Constant \(k\) and \(\rho\)

Initial Guesses

Local Search

Global Search

SEIRS, with constant \(\rho\) and Time-varying \(k\)

Initial Guesses

Local Search

Global Search

Narrow Global Search

Wider Global Search

Wider Global Search Profile Likelihood

Acknowledgement