Analysis of Middle-East Respiratory Syndrome coronavirus in Saudi Arabia

Background

Middle-East respiratory syndrome (MERS-CoV, or MERS) is a virus which affects the respiratory system. It has a fatality rate of around 35%. The virus was first discovered in 2012 in Jeddah, Saudi Arabia. Since then, there have been several outbreaks of MERS in Saudi Arabia. Additionally, an outbreak took place in South Korea in 2015, which is believed to have been brought there by a traveler from Saudi Arabia. The virus is known to have crossed over into the human population from camels, among which the disease is endemic. It is clear that the virus is not particularly virulent among humans, as the total number of infections over more than a decade amounts to several thousands. However, the very high fatality rate and possibility of spreading to other places, as it did in 2015, motivate interest in modeling the spread of this virus. ¹, ²

In this project, we perform analysis on MERS cases in Saudi Arabia from January 2014 to May 2016. We first analyze the data using ARIMA time series models, and then further explore the data with a Susceptible-Exposed-Infectious-Recovered-Susceptible (SEIRS) model to simulate the spreading of MERS among camels population.

Data Source

The data, downloaded from Kaggle³, was originally gathered by the World Health Organization. The data contains, for each week, the number of reported MERS cases in each of three places: Saudi Arabia, South Korea, and other countries. The data begin in March 2012 and end in June 2019.

Exploratory Analysis

First, we examine the plot of the epidemic in both Saudi Arabia and South Korea.

Statistical summary of weekly cases in Saudi Arabia:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   5.468   6.000  93.000

Statistical summary of weekly cases in South Korea:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.4947  0.0000 86.0000

As we can see in the plots and summaries, the epidemic in Saudi Arabia spans a number of years, with occasional spikes to over 25 new cases per week. Otherwise, the weekly count is relatively small. The epidemic in South Korea has only one peak, but the virus didn’t originate there and the cases shrunk to zero, thus the entire time frame of the virus in Korea is only several weeks.

Data Selection

For this project, we consider only the spread in Saudi Arabia. The time-series available from the spread in South Korea is too short to work with. However, we added the plot of its spread to give more context about the spread of MERS.

For the cases in Saudi Arabia, we decide to focus on the cases from January 2014 to May 2016 for the following reasons:

The cases before January 2014 in Saudi Arabia are in small amount;
Saudi Arabia witnessed the first peak of reported cases during April 2014;
We intend to model this first great outbreak and the other smaller outbreaks in the next two years;
We discard the data after May 2016 to keep an appropriate length of data to analyze.

ARMA

Data Visualization

## The time series has 125 observations on a weekly level from 2014-01-08 to 2016-05-27. Over this period, a total of 1218 cases of MERS were reported in Saudi Arabia with a major peak in May 2014 followed by smaller peaks in February 2015, August 2015, and March 2016.

Auto-correlation

In the model selection process for ARIMA models (which we will be using as a benchmark in our analysis), we will need to provide some preliminary values of p and q. One way of doing that is to look at the sample ACF (autocorrelation function) and PACF (partial autocorrelation function) plots ⁴.

The ACF measures the correlation between \(y_{t+h}\) and \(y_t\).
The PACF measures the correlation between \(y_{t+h}\) and \(y_t\) with the linear dependence of the intermediate observations \(\{y_{t+1}, ... , y_{t+h-1}\}\) on each, removed ⁵.

The ACF tails off/decays to 0 over several lags while the PACF cuts off sharply after lag 1. This indicates that the process underlying the data could be modeled as AR(1) ⁶.

Seasonality

Seasonal Plot

From the plot above, we see no clear yearly pattern, and the peaks in different years do not coincide.

Spectral Analysis

Any stationary time series \(Y_t\) can be approximated by linear combinations of sines and cosines of different frequencies with random coefficients. The goal of spectral analysis is to estimate the variance of these coefficients based on the observed data so as to understand which frequencies contribute more to the total variance in \(Y_t\) than the others ⁷. The spectral density of \(Y_t\) is the frequency domain representation of \(Y_t\) given by the discrete Fourier transform of its autocovariance function. The periodogram calculated from a sample is used as an estimator of the spectral density of the population.

The raw periodogram is a rough estimate of the population spectral density. The estimate is “rough” partly because the periodogram only uses the discrete fundamental harmonic frequencies whereas the spectral density is defined over a continuum of frequencies. One way to improve the periodogram estimate is by smoothing it using localized centered moving average windows ⁸. This is a non-parametric method for estimating the spectral density.

## Dominant frequency = 0.032 frequency (1/week) 
## Estimated Period = 7.19 months

From the periodogram, we discovered a dominant frequency of 0.032, which represents a period of 7 months. We decide not to apply this period into our ARIMA model as it does not match any particular length of period (i.e. yearly, quarterly) and we find no evidence in the literature that supports this length of period ⁹.

ARMA Benchmark

Linear Gaussian Auto Regressive Moving Average (ARMA) models provide a flexible non-mechanistic benchmark comparison ¹⁰. So we start by fitting a stationary Gaussian ARMA(p,q) model under the null hypothesis that there is no trend (which is not entirely unreasonable from looking at the data). We choose not to add integration and seasonality into the ARMA model as the previous parts show that the data has no clear linear trend and no meaningful seasonality to explore.

Model: \(\phi(B)(Y_n - \mu) = \psi(B) \epsilon_n\) ¹¹

Parameter vector: \(\theta = (\phi_{1:p}, \psi_{1:q}, \mu, \sigma^2) \in \mathrm{R^D}\)

Mean function: \(E[Y_n] = \mu\)
AR(p) polynomial: \(\phi(x) = 1 - \phi_1 x - \phi_2 x^2 - ... - \phi_px^p\)
MA(q) polynomial: \(\psi(x) = 1 + \psi_1 x + \psi_2 x^2 + ... + \psi_qx^q\)
Gaussian white noise process: \(\epsilon_n \sim \text{iid } N(0, \sigma^2)\)

Model Selection

We fit multiple models for different p and q using maximum likelihood estimation and calculate the Akaike’s Information Criterion in each case.

\(AIC = -2 * \ell(\hat{\theta}_{MLE}) + 2 * D\) where \(l(\theta)\) is the log-likelihood and \(D\) is the no. of parameters.
This balances goodness of fit vs. model complexity, favoring models that are less prone to overfitting and have a lower prediction error.
The “best” model is the one with the lowest AIC.
NOTE: The “best” model chosen by AIC is not necessarily the best approximation to the true model. It is just the one with the best predictive power among a set of candidate ARMA(p,q) models.

##           MA0      MA1      MA2      MA3      MA4      MA5
## AR0 1048.9805 947.5611 907.8865 886.7075 867.4818 862.6919
## AR1  869.7191 867.0515 867.6190 861.3297 859.5403 861.5391
## AR2  865.4950 861.4974 870.9159 860.4493 861.5387 863.5394
## AR3  864.0045 863.1244 865.3067 862.0548 863.5137 862.6670
## AR4  860.4740 862.4692 861.8464 863.7968 863.6527 864.4805
## AR5  862.4637 864.2127 864.0372 864.7263 864.6998 866.2463

## 
## Minimum AIC = 859.54 for AR1 MA4

Best Model:

## 
## Call:
## arima(x = saudi$New.Cases, order = c(1, 0, 4))
## 
## Coefficients:
##          ar1     ma1     ma2     ma3     ma4  intercept
##       0.5583  0.4358  0.4192  0.4516  0.2629     9.3686
## s.e.  0.1381  0.1498  0.1270  0.1160  0.1316     3.6002
## 
## sigma^2 estimated as 49.91:  log likelihood = -422.77,  aic = 859.54

The best ARMA model based on the AIC criterion is ARMA(1,4) with 5 parameters, log-likelihood of -422.77, and AIC of 859.54. All the coefficients except ma4 are significant at the 5% level (more than 2 Fisher standard errors away from 0). We will use this as a benchmark to compare our mechanistic SEIRS model against.

Model Diagnostics

All the AR and MA roots lie outside the unit circle (since the inverse roots lie inside the unit circle). So the model is both causal and invertible.

Residual Analysis

We need to verify the model assumptions (\(\epsilon_n \sim \text{iid } N(0, \sigma^2)\)) to make sure the results of our analysis can be trusted.

Constant Mean & Variance

plot(saudi$date, saudiARMA$residuals)

The histogram suggests that the distribution of residuals could be non-Gaussian;
There is no trend in the residuals so the constant mean assumption is valid;
Further, there is some indication of non-constant variance as the residuals are scattered with different variance along the time frame (i.e. the residuals have larger variance in middle-2014, middle-2015, and early-2016).

Independence

As there are no significant correlations at any lags, our assumption of independent errors seems to be valid.

Normality

The bulk of the residuals mostly follow the normal line, but with large deviations at the tails;
This matches what we observe in the histogram. The Gaussian assumption may need further investigation.

SEIRS POMP model

Model Considerations and Assumptions

We would like to consider a Partially Observed Markov Process (POMP) model to model the spread of MERS.

What is noticeable about the MERS virus is that isn’t very virulent ¹² ¹³. In Saudi Arabia it seems to have a static number of infections which generally number in the single or small double digits. This is punctuated by occasional incidents of greater scale which then decrease. Whereas measles, covid-19 and other diseases can be modeled such that they eventually spread throughout a population and are limited only by natural immunity of the population and the immunity of those already infected, MERS is clearly different. We need a model that accounts for these occasional spikes and drops.

After initially attempting to model the spread of MERS among humans as a POMP process, we were not able to find any model that satisfactorily simulated the actual observed spread. Essentially, where the rate of transmission - \(\beta\) - was sufficiently large, the disease spread throughout the population. Where it wasn’t sufficiently large, it didn’t really spread at all. The spread in Saudi Arabia, as shown above, persistently exists in very small case numbers, occasionally breaking out to numbers in the dozens. A model was proposed (Lin et al., 2018) ¹⁴ to model MERS as a POMP model which takes into account the primary spread of MERS among the camel population, treating that spread as the Markov process, and treats the human cases as a function of infectious camels. Using the camel population of Saudi Arabia which is known and assumed to be constant, the model uses a SEIRS model to model the spread of MERS among the camels and assumes that the human cases are from infectious camels at time \(t\). We based our work in this project on that model proposed by Lin et al. with some adjustments.

The Model

We have a SEIRS model where all camels are classified as either:

S : Susceptible to infection. Its initial value is estimated as the total population multiplied by the parameter \(\eta\).
E : Exposed to infection but not infectious. Its initial value is estimated as the total population multiplied by the parameter \(\eta_2\).
I : Infectious. Its initial value is estimated as the total population multiplied by the parameter \(\eta_2\). (we used the same parameter \(\eta_2\) for initial exposed and infectious camels. The values are relatively small, and in Lin et al, they assumed same initial values for E and I, which we find no reason to reject)
R : Recovered. Part of the recovered camels could turn back to being susceptible.

At all times, we have that \(S+E+I+R=N\) where \(N=270000\) is the total camel population of Saudi Arabia, which we assume to be constant.

Additionally, we have the following parameters:

\(\rho_{CH}\), the spillover rate from camels to the human population. It was estimated ¹⁵ that each crossover primary infection results in four human infections, so the total number of infected humans is expected to be \(4\) times the number of primary-infected (from camels) humans.
\(\beta\), the transmission rate among camels.
\(\mu_{EI}\), the rate at which exposed camels become infectious.
\(\mu_{IR}\), the rate at which infectious camels recover.
\(\mu_{RS}\), the rate at which the recovered cohort loses its immunity due to evolution of the virus and becomes susceptible again.
\(\eta\), the proportion of the population that is susceptible. We expect this to be quite large as our data starts right before the first human outbreak, which implies a substantial amount of susceptible camels.
\(\eta_2\), the proportion of the population that is exposed and infectious.
\(N\), the regional population of camels, which is assumed to be 270,000 according to Lin et at.
\(\mu\), the birth and death rate of camels. As we assumed constant population, we also assume that the camels are being born and dying at the same rate. This will bring new-born susceptible camels into the process. The rate is fixed as \(\mu^{-1}=14years\) ¹⁶ This reflects that the older cohort is being replaced at a constant rate by newborn calves.
\(\rho\), the reporting rate for primary cases. We fix it as \(\rho=1\) as almost all camel-infected human cases are recorded.
\(k\), the over-dispersion rate of the measurement model.

The model can be exprssed as following:

\[\begin{align} \Delta N_{SE} &\sim \text{Binomial}(S, 1 − e^{−\beta \frac{I}{N} \Delta t}) \\ \Delta N_{EI} &\sim \text{Binomial}(E, 1 − e^{-\mu_{EI} \Delta t}) \\ \Delta N_{IR} &\sim \text{Binomial}(I, 1 − e^{-\mu_{IR} \Delta t}) \\ \Delta N_{RS} &\sim \text{Binomial}(R, 1 − e^{-\mu_{RS} \Delta t}) \\ \Delta N_{S} &\sim \text{Binomial}(S, 1 − e^{-\mu \Delta t}) \\ \Delta N_{E} &\sim \text{Binomial}(E, 1 − e^{-\mu \Delta t}) \\ \Delta N_{I} &\sim \text{Binomial}(I, 1 − e^{-\mu \Delta t}) \\ \Delta N_{R} &\sim \text{Binomial}(R, 1 − e^{-\mu \Delta t}) \\ \Delta N_{N} &\sim \text{Binomial}(N, 1 − e^{-\mu \Delta t}) \end{align}\]

\[\begin{align} S_{t+\Delta t} &= S_{t} - \Delta N_{SE} + \Delta N_{RS} - \Delta N_S + \Delta N_N \\ E_{t+\Delta t} &= E_{t} + \Delta N_{SE} - \Delta N_{EI} - \Delta N_E \\ I_{t+\Delta t} &= I_{t} - \Delta N_{IR} + \Delta N_{EI} - \Delta N_I \\ R_{t+\Delta t} &= R_{t} + \Delta N_{IR} - \Delta N_{RS} - \Delta N_R \end{align}\]

SEIRS Model Structure

For primary cases, we have a compartment \(C\) which represents the number of primary cases caused by the infectious camels to human. It is modeled as:

\[\begin{align} Z_i &= \int_{\text{week}_i} \rho_{CH}\mu_{IR}I dt \\ C_i &= \text{Negative Binomial}(mean=Z_i\rho, variance=Z_i\rho+\frac{Z_i\rho}{k}) \end{align}\]

For total number of human cases, we have \(reports=4C_i\).

Simulation

In our implementation, we include checks to make sure that the values of the different states do not become negative¹⁷.

##            N            k          rho         Beta          eta         eta2 
## 2.700000e+05 1.000000e+01 1.000000e+00 3.000000e+00 9.000000e-01 3.000000e-04 
##        mu_IR        mu_EI        mu_RS           mu       rho_CH 
## 1.750000e+00 1.750000e+00 3.101737e-02 1.373626e-03 6.000000e-04

We are able to simulate the first major outbreak and one of the later minor peaks. However, we are not able to simulate the peak around week 80. Hopefully, our search will lead us to better estimates of \(\mu_{RS}\) which will ensure that the pool of susceptibles is large enough to lead to another outbreak.

## Log-likelihood: -843.17

Compared to the best ARMA(1,4) model with log-likelihood of -422.77, the log-likelihood at the simulated model is -843.17, which is much lower.
From the ESS plot, we see that the effective sample size becomes low at time steps corresponding to some peaks. But the value is still quite high (>500), so this should not be of much concern.

Local Search

Likelihood Maximization

The log-likelihood converges to above -400, which is better than the ARMA(1,4) model.
Some parameters (\(\eta\), \(\eta_2\)) do not converge even after 100 iterations – this could be a sign of weak identifiability in the model. As the likelihood seems to be maximized correctly, this should not be too problematic.
Other parameters (\(\beta\), \(\mu_{EI}\), \(\mu_{IR}\), \(k\)) seem to be converging while \(\mu_{RS}\) continues to increase and could benefit from more iterations.

Likelihood Evaluation

Global Search

## SEIRS fixed parameters:

##            N          rho           mu 
## 2.700000e+05 1.000000e+00 1.373626e-03

Best parameters from the global search:

The best model from global search has a maximized log-likelihood of -378.33.

We can conduct a Likelihood Ratio Test to verify whether this is significantly higher than ARMA(1,4) with a log-likelihood of -422.77.

\(H_0\): ARMA(1,4), \(D_0 = 5\)
\(H_1\): SEIRS, \(D_1 = 8\)

Assuming the Wilks approximation is valid, under \(H_0\), the likelihood ratio test statistic \(\ell_1 - \ell_0 \approx \frac{1}{2} \chi^2_{D_1 - D_0}\).

## p-value: 0

As the p-value < 0.05, we can reject \(H_0\) at the 5% significance level and conclude that the SEIRS model is significantly better than the ARMA(1,4) model.

Simulate the model at the best parameters:

The simulation fits the data relatively well. The first major peak and the peak around week 80 are captured.

The reproduction number (\(R_0\)) is defined as the average number of cases generated by a typical case and it quantifies the intensity of transmission (Lin et al., 2018). It can be calculated as:

\[ R_0 = \frac{\beta}{\mu_{IR}} \]

## Estimated reproduction number R0 = 2.6

Profile Likelihood for Camel-to-human Spill-over Rate

Profile likelihood for \(\rho_{CH}\) is the likelihood maximized over all other parameters with \(\rho_{CH}\) held constant. This is repeated for multiple values of \(\rho_{CH}\) to get a profile likelihood plot.

## Approximate 95% C.I. for camel-to-human spill-over rate: 0.001 0.001

The \(\rho_{CH}\) with the largest log-likelihood is on the edge of the interval (0.001) we choose to construct the plot.
The 95% confidence interval is approximately around 0.001, which is quite narrow.
We would argue that we may choose a wider range of \(\rho_{CH}\) and increase the number of iterations to locate the best value.
The value of \(\rho_{CH}\) is the spill-over rate rate for camels infecting human. As we know that there are 270000 camels in Saudi Arabia, this \(\rho_{CH}\) would give us the idea that on average 1000 camels would create a primary human MERS case. In our data we have 1218 reported human cases in total, and without camels dying, being born, and turning back to susceptible, we would expect a total case of \(270000\times0.001\times4=1080\). The numbers are matching, which means our obtained best \(\rho_{CH}\) makes sense.

Conclusion

In this project, we performed analysis on MERS cases in Saudi Arabia from January 2014 to May 2016. We first analyzed the data using ARIMA time series models, and concluded that a ARMA(1,4) model fits the data well.
Using the best ARMA model as the benchmark, we further explored the data with a Susceptible-Exposed-Infectious-Recovered-Susceptible (SEIRS) model to simulate the spreading of MERS among camels population (Lin et al, 2018). The model treats the transition of stages in camels as the hidden Markov process, and reflects the spill-over of camels infecting human, which leads to primary human MERS cases, making up about a quarter of the total human cases.
The estimated basic reproduction number \(R_0\) is 2.6, which is close to the range estimated in Lin et al.
We were able to find parameters that simulate and capture most of the major outbreaks and peaks in the data.
We constructed profile likelihood for the spill-over rate \(\rho_{CH}\) with narrow confidence interval, and the value explained the overall number of the cases well.

References

https://journals.sagepub.com/doi/pdf/10.1177/0962280217746442 ↩︎
https://www.cdc.gov/coronavirus/mers/about/index.html ↩︎
https://www.kaggle.com/datasets/imdevskp/mers-outbreak-dataset-20122019 ↩︎
Shumway and Stoffer (2017), Section 3.7 (Building ARIMA Models)↩︎
Shumway and Stoffer (2017), Section 3.3 (Autocorrelation and Partial Autocorrelation)↩︎
Shumway and Stoffer (2017), Section 3.7 (Building ARIMA Models)↩︎
P.K. Bhattacharya and Prabir Burman, Chapter 13 - Time Series, Theory and Methods of Statistics (2016)↩︎
Estimating the Spectral Density ↩︎
https://journals.sagepub.com/doi/pdf/10.1177/0962280217746442 ↩︎
Aaron A. King, Edward L. Ionides, Lesson 5. Case study: Measles in large and small towns ↩︎
Edward L. Ionides, Chapter 5: Parameter estimation and model identification for ARMA models ↩︎
https://www.sciencedirect.com/science/article/pii/S1755436514000607#fig0005 ↩︎
https://journals.sagepub.com/doi/pdf/10.1177/0962280217746442 ↩︎
Lin, Q., Chiu, A. P., Zhao, S., & He, D. (2018). Modeling the spread of Middle East respiratory syndrome coronavirus in Saudi Arabia. Statistical Methods in Medical Research, 27(7), 1968–1978. https://doi.org/10.1177/0962280217746442 ↩︎
Lin, Q., Chiu, A. P., Zhao, S., & He, D. (2018). Modeling the spread of Middle East respiratory syndrome coronavirus in Saudi Arabia. Statistical Methods in Medical Research, 27(7), 1968–1978. https://doi.org/10.1177/0962280217746442 ↩︎
Lin, Q., Chiu, A. P., Zhao, S., & He, D. (2018). Modeling the spread of Middle East respiratory syndrome coronavirus in Saudi Arabia. Statistical Methods in Medical Research, 27(7), 1968–1978. https://doi.org/10.1177/0962280217746442 ↩︎
Jesse’s Office Hours on Tuesday April 23↩︎