Middle-East respiratory syndrome (MERS-CoV, or MERS) is a virus which affects the respiratory system. It has a fatality rate of around 35%. The virus was first discovered in 2012 in Jeddah, Saudi Arabia. Since then, there have been several outbreaks of MERS in Saudi Arabia. Additionally, an outbreak took place in South Korea in 2015, which is believed to have been brought there by a traveler from Saudi Arabia. The virus is known to have crossed over into the human population from camels, among which the disease is endemic. It is clear that the virus is not particularly virulent among humans, as the total number of infections over more than a decade amounts to several thousands. However, the very high fatality rate and possibility of spreading to other places, as it did in 2015, motivate interest in modeling the spread of this virus. 1, 2
In this project, we perform analysis on MERS cases in Saudi Arabia from January 2014 to May 2016. We first analyze the data using ARIMA time series models, and then further explore the data with a Susceptible-Exposed-Infectious-Recovered-Susceptible (SEIRS) model to simulate the spreading of MERS among camels population.
The data, downloaded from Kaggle3, was originally gathered by the World Health Organization. The data contains, for each week, the number of reported MERS cases in each of three places: Saudi Arabia, South Korea, and other countries. The data begin in March 2012 and end in June 2019.
First, we examine the plot of the epidemic in both Saudi Arabia and South Korea.
Statistical summary of weekly cases in Saudi Arabia:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 5.468 6.000 93.000
Statistical summary of weekly cases in South Korea:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4947 0.0000 86.0000
As we can see in the plots and summaries, the epidemic in Saudi Arabia spans a number of years, with occasional spikes to over 25 new cases per week. Otherwise, the weekly count is relatively small. The epidemic in South Korea has only one peak, but the virus didn’t originate there and the cases shrunk to zero, thus the entire time frame of the virus in Korea is only several weeks.
For this project, we consider only the spread in Saudi Arabia. The time-series available from the spread in South Korea is too short to work with. However, we added the plot of its spread to give more context about the spread of MERS.
For the cases in Saudi Arabia, we decide to focus on the cases from January 2014 to May 2016 for the following reasons:
## The time series has 125 observations on a weekly level from 2014-01-08 to 2016-05-27. Over this period, a total of 1218 cases of MERS were reported in Saudi Arabia with a major peak in May 2014 followed by smaller peaks in February 2015, August 2015, and March 2016.
In the model selection process for ARIMA models (which we will be using as a benchmark in our analysis), we will need to provide some preliminary values of p and q. One way of doing that is to look at the sample ACF (autocorrelation function) and PACF (partial autocorrelation function) plots 4.
The ACF tails off/decays to 0 over several lags while the PACF cuts off sharply after lag 1. This indicates that the process underlying the data could be modeled as AR(1) 6.
From the plot above, we see no clear yearly pattern, and the peaks in different years do not coincide.
Any stationary time series \(Y_t\) can be approximated by linear combinations of sines and cosines of different frequencies with random coefficients. The goal of spectral analysis is to estimate the variance of these coefficients based on the observed data so as to understand which frequencies contribute more to the total variance in \(Y_t\) than the others 7. The spectral density of \(Y_t\) is the frequency domain representation of \(Y_t\) given by the discrete Fourier transform of its autocovariance function. The periodogram calculated from a sample is used as an estimator of the spectral density of the population.
The raw periodogram is a rough estimate of the population spectral density. The estimate is “rough” partly because the periodogram only uses the discrete fundamental harmonic frequencies whereas the spectral density is defined over a continuum of frequencies. One way to improve the periodogram estimate is by smoothing it using localized centered moving average windows 8. This is a non-parametric method for estimating the spectral density.
## Dominant frequency = 0.032 frequency (1/week)
## Estimated Period = 7.19 months
From the periodogram, we discovered a dominant frequency of 0.032, which represents a period of 7 months. We decide not to apply this period into our ARIMA model as it does not match any particular length of period (i.e. yearly, quarterly) and we find no evidence in the literature that supports this length of period 9.
Linear Gaussian Auto Regressive Moving Average (ARMA) models provide a flexible non-mechanistic benchmark comparison 10. So we start by fitting a stationary Gaussian ARMA(p,q) model under the null hypothesis that there is no trend (which is not entirely unreasonable from looking at the data). We choose not to add integration and seasonality into the ARMA model as the previous parts show that the data has no clear linear trend and no meaningful seasonality to explore.
Model: \(\phi(B)(Y_n - \mu) = \psi(B) \epsilon_n\) 11
Parameter vector: \(\theta = (\phi_{1:p}, \psi_{1:q}, \mu, \sigma^2) \in \mathrm{R^D}\)
We fit multiple models for different p and q using maximum likelihood estimation and calculate the Akaike’s Information Criterion in each case.
## MA0 MA1 MA2 MA3 MA4 MA5
## AR0 1048.9805 947.5611 907.8865 886.7075 867.4818 862.6919
## AR1 869.7191 867.0515 867.6190 861.3297 859.5403 861.5391
## AR2 865.4950 861.4974 870.9159 860.4493 861.5387 863.5394
## AR3 864.0045 863.1244 865.3067 862.0548 863.5137 862.6670
## AR4 860.4740 862.4692 861.8464 863.7968 863.6527 864.4805
## AR5 862.4637 864.2127 864.0372 864.7263 864.6998 866.2463
##
## Minimum AIC = 859.54 for AR1 MA4
Best Model:
##
## Call:
## arima(x = saudi$New.Cases, order = c(1, 0, 4))
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 intercept
## 0.5583 0.4358 0.4192 0.4516 0.2629 9.3686
## s.e. 0.1381 0.1498 0.1270 0.1160 0.1316 3.6002
##
## sigma^2 estimated as 49.91: log likelihood = -422.77, aic = 859.54
The best ARMA model based on the AIC criterion is
ARMA(1,4) with 5 parameters, log-likelihood of
-422.77, and AIC of 859.54. All the coefficients except
ma4
are significant at the 5% level (more than 2 Fisher
standard errors away from 0). We will use this as a benchmark to compare
our mechanistic SEIRS model against.
All the AR and MA roots lie outside the unit circle (since the inverse roots lie inside the unit circle). So the model is both causal and invertible.
We need to verify the model assumptions (\(\epsilon_n \sim \text{iid } N(0, \sigma^2)\)) to make sure the results of our analysis can be trusted.
We would like to consider a Partially Observed Markov Process (POMP) model to model the spread of MERS.
What is noticeable about the MERS virus is that isn’t very virulent 12 13. In Saudi Arabia it seems to have a static number of infections which generally number in the single or small double digits. This is punctuated by occasional incidents of greater scale which then decrease. Whereas measles, covid-19 and other diseases can be modeled such that they eventually spread throughout a population and are limited only by natural immunity of the population and the immunity of those already infected, MERS is clearly different. We need a model that accounts for these occasional spikes and drops.
After initially attempting to model the spread of MERS among humans as a POMP process, we were not able to find any model that satisfactorily simulated the actual observed spread. Essentially, where the rate of transmission - \(\beta\) - was sufficiently large, the disease spread throughout the population. Where it wasn’t sufficiently large, it didn’t really spread at all. The spread in Saudi Arabia, as shown above, persistently exists in very small case numbers, occasionally breaking out to numbers in the dozens. A model was proposed (Lin et al., 2018) 14 to model MERS as a POMP model which takes into account the primary spread of MERS among the camel population, treating that spread as the Markov process, and treats the human cases as a function of infectious camels. Using the camel population of Saudi Arabia which is known and assumed to be constant, the model uses a SEIRS model to model the spread of MERS among the camels and assumes that the human cases are from infectious camels at time \(t\). We based our work in this project on that model proposed by Lin et al. with some adjustments.
We have a SEIRS model where all camels are classified as either:
S : Susceptible to infection. Its initial value is estimated as the total population multiplied by the parameter \(\eta\).
E : Exposed to infection but not infectious. Its initial value is estimated as the total population multiplied by the parameter \(\eta_2\).
I : Infectious. Its initial value is estimated as the total population multiplied by the parameter \(\eta_2\). (we used the same parameter \(\eta_2\) for initial exposed and infectious camels. The values are relatively small, and in Lin et al, they assumed same initial values for E and I, which we find no reason to reject)
R : Recovered. Part of the recovered camels could turn back to being susceptible.
At all times, we have that \(S+E+I+R=N\) where \(N=270000\) is the total camel population of Saudi Arabia, which we assume to be constant.
Additionally, we have the following parameters:
The model can be exprssed as following:
\[\begin{align} \Delta N_{SE} &\sim \text{Binomial}(S, 1 − e^{−\beta \frac{I}{N} \Delta t}) \\ \Delta N_{EI} &\sim \text{Binomial}(E, 1 − e^{-\mu_{EI} \Delta t}) \\ \Delta N_{IR} &\sim \text{Binomial}(I, 1 − e^{-\mu_{IR} \Delta t}) \\ \Delta N_{RS} &\sim \text{Binomial}(R, 1 − e^{-\mu_{RS} \Delta t}) \\ \Delta N_{S} &\sim \text{Binomial}(S, 1 − e^{-\mu \Delta t}) \\ \Delta N_{E} &\sim \text{Binomial}(E, 1 − e^{-\mu \Delta t}) \\ \Delta N_{I} &\sim \text{Binomial}(I, 1 − e^{-\mu \Delta t}) \\ \Delta N_{R} &\sim \text{Binomial}(R, 1 − e^{-\mu \Delta t}) \\ \Delta N_{N} &\sim \text{Binomial}(N, 1 − e^{-\mu \Delta t}) \end{align}\]
\[\begin{align} S_{t+\Delta t} &= S_{t} - \Delta N_{SE} + \Delta N_{RS} - \Delta N_S + \Delta N_N \\ E_{t+\Delta t} &= E_{t} + \Delta N_{SE} - \Delta N_{EI} - \Delta N_E \\ I_{t+\Delta t} &= I_{t} - \Delta N_{IR} + \Delta N_{EI} - \Delta N_I \\ R_{t+\Delta t} &= R_{t} + \Delta N_{IR} - \Delta N_{RS} - \Delta N_R \end{align}\]
For primary cases, we have a compartment \(C\) which represents the number of primary cases caused by the infectious camels to human. It is modeled as:
\[\begin{align} Z_i &= \int_{\text{week}_i} \rho_{CH}\mu_{IR}I dt \\ C_i &= \text{Negative Binomial}(mean=Z_i\rho, variance=Z_i\rho+\frac{Z_i\rho}{k}) \end{align}\]
For total number of human cases, we have \(reports=4C_i\).
In our implementation, we include checks to make sure that the values of the different states do not become negative17.
## N k rho Beta eta eta2
## 2.700000e+05 1.000000e+01 1.000000e+00 3.000000e+00 9.000000e-01 3.000000e-04
## mu_IR mu_EI mu_RS mu rho_CH
## 1.750000e+00 1.750000e+00 3.101737e-02 1.373626e-03 6.000000e-04
We are able to simulate the first major outbreak and one of the later minor peaks. However, we are not able to simulate the peak around week 80. Hopefully, our search will lead us to better estimates of \(\mu_{RS}\) which will ensure that the pool of susceptibles is large enough to lead to another outbreak.
## Log-likelihood: -843.17
## SEIRS fixed parameters:
## N rho mu
## 2.700000e+05 1.000000e+00 1.373626e-03
Best parameters from the global search:
The best model from global search has a maximized log-likelihood of -378.33.
We can conduct a Likelihood Ratio Test to verify whether this is significantly higher than ARMA(1,4) with a log-likelihood of -422.77.
Assuming the Wilks approximation is valid, under \(H_0\), the likelihood ratio test statistic \(\ell_1 - \ell_0 \approx \frac{1}{2} \chi^2_{D_1 - D_0}\).
## p-value: 0
As the p-value < 0.05, we can reject \(H_0\) at the 5% significance level and conclude that the SEIRS model is significantly better than the ARMA(1,4) model.
Simulate the model at the best parameters:
The simulation fits the data relatively well. The first major peak and the peak around week 80 are captured.
The reproduction number (\(R_0\)) is defined as the average number of cases generated by a typical case and it quantifies the intensity of transmission (Lin et al., 2018). It can be calculated as:
\[ R_0 = \frac{\beta}{\mu_{IR}} \]
## Estimated reproduction number R0 = 2.6
Profile likelihood for \(\rho_{CH}\) is the likelihood maximized over all other parameters with \(\rho_{CH}\) held constant. This is repeated for multiple values of \(\rho_{CH}\) to get a profile likelihood plot.
## Approximate 95% C.I. for camel-to-human spill-over rate: 0.001 0.001
https://journals.sagepub.com/doi/pdf/10.1177/0962280217746442↩︎
https://www.kaggle.com/datasets/imdevskp/mers-outbreak-dataset-20122019↩︎
Shumway and Stoffer (2017), Section 3.7 (Building ARIMA Models)↩︎
Shumway and Stoffer (2017), Section 3.3 (Autocorrelation and Partial Autocorrelation)↩︎
Shumway and Stoffer (2017), Section 3.7 (Building ARIMA Models)↩︎
P.K. Bhattacharya and Prabir Burman, Chapter 13 - Time Series, Theory and Methods of Statistics (2016)↩︎
https://journals.sagepub.com/doi/pdf/10.1177/0962280217746442↩︎
Aaron A. King, Edward L. Ionides, Lesson 5. Case study: Measles in large and small towns↩︎
Edward L. Ionides, Chapter 5: Parameter estimation and model identification for ARMA models↩︎
https://www.sciencedirect.com/science/article/pii/S1755436514000607#fig0005↩︎
https://journals.sagepub.com/doi/pdf/10.1177/0962280217746442↩︎
Lin, Q., Chiu, A. P., Zhao, S., & He, D. (2018). Modeling the spread of Middle East respiratory syndrome coronavirus in Saudi Arabia. Statistical Methods in Medical Research, 27(7), 1968–1978. https://doi.org/10.1177/0962280217746442↩︎
Lin, Q., Chiu, A. P., Zhao, S., & He, D. (2018). Modeling the spread of Middle East respiratory syndrome coronavirus in Saudi Arabia. Statistical Methods in Medical Research, 27(7), 1968–1978. https://doi.org/10.1177/0962280217746442↩︎
Lin, Q., Chiu, A. P., Zhao, S., & He, D. (2018). Modeling the spread of Middle East respiratory syndrome coronavirus in Saudi Arabia. Statistical Methods in Medical Research, 27(7), 1968–1978. https://doi.org/10.1177/0962280217746442↩︎
Jesse’s Office Hours on Tuesday April 23↩︎