Introduction

General Overview

From the beginning of January in 2020, several patients have been diagnosed as pneumonia in Hubei, China. Until January the 7th, people found out they actually got infected by a novel virus called COVID-19. By the time of January the 22ed, 2020, COVID-19 outbreak have become an epidemic event in China. And by March, it has evolved into a world-wide pandemic event. COVID-19 is also known as SARS-CoV-2, it belongs to the same family as SARS-CoV, which caused the 2003 SARS outbreak in China. These two viruses have very similar biological characteristics and infectious features. For example, they cause similar symptoms and have very similar spreading patterns, mortality rate and transmission rate. Another similarity of these two viruses is that they all started from China, and the Chinese government used similar strategies in disease control. Then due to these similarities, it is reasonable to study the pattern of SARS, how it started evolved and ended, and apply the knowledge on this novel COVID-19 virus outbreak.\(^{[3],[4]}\)

In this report, I seek to find a way to locate the stage of this COVID-19 outbreak we are at now to a specific time in the previous SARS outbreak. And then predict the end of this virus outbreak, assuming COVID-19 and SARS have similar epidemic patterns. This COVID-19 outbreak has already made great damage on economics and medical systems all around the world, therefore knowing an approximate end date is of great interest.

I first modeled the SARS outbreak with ARMA model with trend, to see whether ARMA model is a good fit to these epidemic data or not. And then I fitted many parts of SARS data to COVID-19 data, and found the part of SARS data, which has the best associate with COVID-19 data, using a Linear Regression with ARMA errors model. Then I used loess smoother to remove the noise and cycles from the data. Based on only the large patterns, I did the match again, and found something inconsistent with the result from Linear regression with ARMA errors model.

Data Introduction and preprocessing

I used the number of new diagnosed patients around the world each day as the time series data.

The cumulative diagnosed patient data for COVID-19 came originally from Johns Hopkins University, and I downloaded from Kaggle\(^{[1]}\). This dataset consists data from 22ed Jan to 7th Mar, 2020. It is well organized, and only need to convert from cumulative patient to new patients each day.

The cumulative diagnosed patient data for SARS came from World Health Organization\(^{[2]}\), from 17th Mar to 11th July, 2003. This dataset has missing data for Sunday and sometimes for Saturday as well, I filled the missing values with the average of their closest neighbors. This dataset sometimes have wrong data, for example, number of cumulative patient decreasing with time, I also corrected errors like these. And then transformed the data from cumulative to new patients each day.

Descriptive Analysis

Followings are the descriptive statistics for SARS and COVID-19 data after preprocessing:

## summary statistic for SARS:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    3.75   42.50   71.67   93.38  836.00
## summary statistic for COVID-19:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      98     809    2040    2340    2762   15148

Followings are the plots for SARS and COVID-19 time series data:

For the plot of SARS data, after around 80 days rarely did anyone get diagnosed with the disease, so in this study I will focus only on the data before 80 days. This data has two peaks one peak is around 10 days and the other one is around 40 days.

Based on the plot of COVID-19 data, there is one peak before 30 days, and a tendency to increase afterwards.

And we can see that the first peak for both of the virus outbreak are really extreme. The possible explanation is that, in the beginning the virus is really strong and the infection rate is high during this period. Or in the beginning the diagnostic method is poor and have a really low sensitivity, by the time the diagnostic method is improved many real patients will be found simultaneously. Based on this year’s situation the second explanation is more reasonable, in 13th Feb. the standard for COVID-19 diagnosis is changed in China, and this led to an increase of around 15000 new patients.\(^{[5]}\) This is consistent with the COVID-19 data, the largest peak indeed happened around 13th Feb. As a result, in the following model fitting, this phenomenon will cause some of the model assumptions being not hold. But this is more of a problem of the data rather than the model.

Followings are the sample Autocorrelation Function for SARS and COVID-19 data:
Based on the sample Autocorrelation Function and the plot, the data is obviously not mean stationary, mean is quadratic with t. But besides several extreme points, caused by data collection as discussed above, these two data look like variance stationary and an ARMA model with trend should be feasible for the data. And in the following section I will further look into the feasibility of ARMA model with trend on these data.

ARMA model

Firstly, I constructed a series of ARMA models on the SARS data, to see whether ARMA models are feasible for these epidemic data or not.

Model Construction

Based on the analysis from the previous chapter, besides several extreme point (four extreme points), other points look pretty variance stationary. Then an ARMA model with quadratic trend should be feasible for the data. ARMA(p,q) models with trend:
\[Y_n=\mu_n+\eta_n\]
Where:
\[\mu_n=\sum_{k=1}^{K}Z_{n,k}\beta_k\]
\[K=2\] \[Z_{n,1}\ is\ number\ of\ days\ after\ Mar.\ 17th,\ Z_{n,2}=Z_{n,1}^2,\ \beta_1\ \beta_2\ are\ the\ corresponding\ coefficient\]

\[\eta_{1:n}\ is\ stationary,\ causal,\ invertible\ ARMA(p,q)\ process\ defined\ as\ followings:\]
\[\eta_n = \phi_1\eta_{n-1}+\phi_2\eta_{n-2}+\dots+\phi_p\eta_{n-p} + \epsilon_n +\psi_1 \epsilon_{n-1} +\dots+\psi_q\epsilon_{n-q}\]
For the sake of simplicity I assume \(\epsilon_{1:n}\) is gaussian white noise process. And I will check this assumption in model diagnostic section.

Model Selection

After Constructing the framework of the model, in the following step I will use the Akaike Information Criterion(AIC) to select the best performing model.
AIC is defined as followings: \[AIC=-2\ell(\hat{\theta})+2D\]
\[D\ is\ the\ dimension\ of\ unknown\ parameter\]
Note that AIC is not a decisive criterion, but works as a suggestion.

## AIC for signal plus noise model:
MA0 MA1 MA2 MA3
AR0 982.40 984.31 986.31 988.30
AR1 984.31 986.31 988.31 989.75
AR2 986.31 988.31 987.69 989.20
AR3 988.31 989.75 989.20 988.74

The result suggests that an Linear Regression with ARMA(0,0) error model is the best under AIC, which is the ordinary Linear regression model. Since this is the simplest model and have the smallest AIC value, then I will choose this model. following is the model detail:

arima(daily_sars[1:80],order=c(0,0,0),method = "ML",xreg = x)
## 
## Call:
## arima(x = daily_sars[1:80], order = c(0, 0, 0), xreg = x, method = "ML")
## 
## Coefficients:
##       intercept     day         
##         61.2343  5.0937  -0.0757
## s.e.    36.7394  2.0934   0.0250
## 
## sigma^2 estimated as 11408:  log likelihood = -487.2,  aic = 982.4

Model diagnostics

Based on the above residual plot and ACF, the residuals are with mean zero, and besides the several extreme points, other points are pretty variance stationary. Then most part of the data is consistent with the assumption that errors are white noise. And For those data points, which don’t follow the white noise assumption, is a problem of the data collection rather than a problem of the model, as discussed in the first section.

The quantile- quantile plot showed similar results: besides the several extreme points, other points lies quite neatly on the line. This means most part of the data is consistent with the assumption that the error follows gaussian distribution.

In conclusion, most part of the data is concordant with the assumption that error follows gaussian white noise distribution. And the problem from data collection indeed undermines both of the model assumption that errors are white noise and follows gaussian distribution. Even though the model assumption is violated by these four extreme points, but since it’s a problem of the data collection rather than a problem of the model, ARMA model should still be feasible to carry out the following analysis.

Linear Regression with ARMA errors model

Because it is impossible to identify an epidemic process right from its beginning, so the start point of the SARS and COVID-19 data might not be at the same stage. Therefore, for meaningfully comparing these two time series, we need to adjust them to the same stage. In this section, I divided the SARS data into 36 continuously overlapping segments and fitted each with a Linear Regression with ARMA error model using COVID-19 data as covariate. And because of the model fitted above, we know the best performance model should be around p=0 and q=0. Then in this section I choose to fit models with p=0 q=0, p=0 q=1,p=1 q=0,p=1 q=1.

Model Construction

The Model Construction is mostly the same as the model Construction above. Linear regression with ARMA error model:
\[Y_n=\mu_n+\eta_n\]
Where:
\[\mu_n=\sum_{k=1}^{K}Z_{n,k}\beta_k\]
\[K=1\] \[Z_{n,1}\ is\ COVID-19\ data,\ \beta_1\ is\ the\ corresponding\ coefficient\]

\[\eta_{1:n}\ is\ stationary,\ causal,\ invertible\ ARMA(p,q)\ process\ defined\ as\ followings:\]
\[\eta_n = \phi_1\eta_{n-1}+\phi_2\eta_{n-2}+\dots+\phi_p\eta_{n-p} + \epsilon_n +\psi_1 \epsilon_{n-1} +\dots+\psi_q\epsilon_{n-q}\]
For the sake of simplicity I assume \(\epsilon_{1:n}\) is gaussian white noise process. And I will check this assumption in model diagnostic section.

Model Selection

## AIC for each model
1 2 3 4 5 6 7 8 9 10
ARMA(0,0) error 578.84 578.73 578.23 577.99 577.48 577.05 575.53 573.52 572.19 534.06
ARMA(0,1) error 580.83 580.73 580.23 579.99 579.47 578.96 577.05 574.31 574.00 534.92
ARMA(1,0) error 580.83 580.73 580.23 579.99 579.47 578.98 577.21 574.44 574.06 534.90
ARMA(1,1) error 582.83 582.73 582.23 581.99 581.20 580.61 578.38 575.86 575.27 535.53
11 12 13 14 15 16 17 18 19 20
ARMA(0,0) error 534.98 528.77 534.90 532.85 532.06 533.81 522.75 522.70 523.43 523.32
ARMA(0,1) error 536.01 528.93 535.73 534.11 532.81 534.23 520.22 520.60 521.50 521.18
ARMA(1,0) error 535.82 529.07 535.53 534.05 532.42 533.29 517.64 518.52 519.23 518.02
ARMA(1,1) error 536.38 530.91 536.17 534.51 532.34 531.90 513.09 513.67 513.43 509.90
21 22 23 24 25 26 27 28 29 30
ARMA(0,0) error 524.07 505.69 523.55 522.75 523.32 525.11 524.00 527.62 528.21 527.84
ARMA(0,1) error 522.32 500.38 522.30 521.07 521.98 523.59 521.95 525.14 525.40 524.54
ARMA(1,0) error 519.51 495.26 517.87 519.73 519.88 521.01 519.48 521.89 521.91 521.01
ARMA(1,1) error 513.71 489.39 513.48 514.13 514.01 514.24 513.14 514.44 514.40 513.08
31 32 33 34 35 36
ARMA(0,0) error 530.54 531.84 533.02 533.61 534.68 535.65
ARMA(0,1) error 526.86 527.41 526.74 525.94 526.90 526.02
ARMA(1,0) error 522.37 522.77 520.74 517.63 518.30 517.49
ARMA(1,1) error 513.65 513.48 513.02 506.62 506.98 506.91

Based on the above AIC table, the lowest AIC belongs to Linear Regression with ARMA(1,1) error fitted at 22 days of SARS data. So under AIC criteria the best association between SARS data and COVID-19 data are SARS data from 22 days to 66 days.

## sigma2 for each model
1 2 3 4 5 6 7 8 9 10
ARMA(0,0) error 19768.02 19720.99 19505.47 19400.58 19181.30 18998.94 18368.01 17566.57 17052.24 7308.60
ARMA(0,1) error 19765.43 19719.50 19504.38 19399.84 19177.30 18959.70 18167.41 17086.82 16975.59 7120.80
ARMA(1,0) error 19765.56 19719.59 19504.47 19399.90 19177.90 18966.94 18233.34 17138.12 17002.05 7117.50
ARMA(1,1) error 19765.49 19719.54 19504.41 19399.87 19055.47 18804.73 17883.32 16902.68 16682.65 6887.01
11 12 13 14 15 16 17 18 19 20
ARMA(0,0) error 7459.28 6497.98 7446.95 7115.32 6990.76 7268.60 5684.38 5677.88 5770.76 5756.58
ARMA(0,1) error 7295.80 6230.86 7250.12 6995.46 6795.60 7011.43 5129.89 5175.50 5280.24 5243.86
ARMA(1,0) error 7263.62 6251.77 7215.79 6986.27 6734.48 6858.23 4834.12 4934.18 5012.11 4876.26
ARMA(1,1) error 7020.20 6228.17 6987.98 6738.44 6415.90 6337.23 4145.45 4202.97 4176.25 3853.51
21 22 23 24 25 26 27 28 29 30
ARMA(0,0) error 5854.05 3890.50 5786.59 5684.13 5756.28 5989.77 5844.76 6334.29 6418.11 6364.37
ARMA(0,1) error 5375.99 3298.23 5371.79 5229.32 5337.93 5531.39 5333.40 5723.20 5755.31 5646.50
ARMA(1,0) error 5039.38 2933.55 4842.41 5069.74 5084.98 5211.06 5038.17 5309.90 5310.98 5206.43
ARMA(1,1) error 4200.81 2442.68 4179.59 4242.03 4229.27 4245.40 4141.07 4257.10 4255.26 4127.56
31 32 33 34 35 36
ARMA(0,0) error 6758.19 6956.65 7141.44 7236.07 7410.06 7571.68
ARMA(0,1) error 5944.37 6015.35 5922.50 5818.56 5943.64 5821.07
ARMA(1,0) error 5361.43 5407.88 5160.44 4809.06 4880.53 4788.87
ARMA(1,1) error 4178.41 4155.30 4112.94 3543.01 3564.39 3547.79

The \(\sigma^2\) table showed the same result, that the best model is Linear Regression with ARMA(1,1) error fitted at 22 days of SARS data.

In conclusion, since two criteria both points to the same best performance model, and these two criteria is outstandingly better for this model than the others. As a result, I will choose this p=q=1 model and test the association between COVID-19 data and SARS data from 22 days to 66 days.

## 
## Call:
## arima(x = daily_sars[22:66], order = c(1, 0, 1), xreg = daily_cov, method = "ML")
## 
## Coefficients:
##          ar1      ma1  intercept  daily_cov
##       0.9000  -0.5538    65.1284     0.0187
## s.e.  0.0827   0.1339    30.3957     0.0033
## 
## sigma^2 estimated as 2443:  log likelihood = -239.7,  aic = 489.39

Hypothesis test for association

\[H_0: \beta_{cov}=0, H_a:\beta_{cov}\neq0\] Construct T statistic: \[T=\frac{\hat{\beta_1}-0}{SE}\sim N(0,1)\]
\[Reject\ null\ hypothesis\ if\ T>1.96\ \] \[T=\frac{0.0187-0}{0.0033}=5.6667>1.96\]
Statistical conclusion: reject null. This means we have enough evidence to show that \(\beta_1\neq0\), the association between COVID-19 and SARS from 22 days to 66 days are significantly associated.

Model diagnostics

Based on the above plots, we can see except for 2 extreme points, the residuals are following a gaussian white noise distribution. This is consistent with original assumption. In conclusion, the best fit for COVID-19 data are from 22 days to 66 days in SARS data, as shown below:

We can see that the COVID-19 data is mapped to the second peak of SARS data

Spectrum analysis

Now I want to do the matching of COVID-19 and SARS again with the noise removed.

The above spectrum showed that the dominant frequencies for both COVID-19 and SARS are very low frequencies, then I used loess smoother to only extract the low frequencies from the data.

First check the loess smoother indeed extracted the low frequency components:

For the above plot, frequencies beyond 0.1 cycles per day have spectrum ratio smaller than 0.1, and only frequencies smaller than 0.1 cycles per day have spectrum ratio larger than 0.1, then as a result, loess smoother indeed only captured the low frequency component of the data.
Then based on the low frequency plot, it is obvious that the SARS data have two peaks and COVID-19 data have one peak and have a tendency to increase at the end. So based on this observation, COVID-19 data should be mapped to the first peak of SARS data, but this result is not consistent with the result from section 2. The reason behind the inconsistency should still be a problem of the diagnostic standard, because the late diagnostic tends to make a taller and skinnier peak than the true peak. Then It is less likely to map COVID-19 to this very skewed peak under the exact fitting method. But once we deleted the noises and only left with the larger trend, it is easier to see the true pattern of both of the data, and can make a better conclusion.

Conclusion

Based on the analysis above, the COVID-19 data should be mapped to the first peak of SARS data. And Mar. 7th, 2020 should correspond to around 40 days of SARS data. So if the general assumption, which is the epidemic pattern between SARS and COVID-19 are similar, is true. Then in the following 3 to 5 days we will experience an increase of infection diagnostics, after that it will decrease to zero and end in about 40 days, around the middle of April.

References

[1]. COVID-19 data from kaggle website. (https://towardsdatascience.com/why-quality-sense-checks-are-so-important-in-data-science-7ef80da760c3)
[2]. SARS data from WHO website. (https://www.who.int/csr/sars/country/en/)
[3]. General knowledge about COVID-19 from wikipedia. (https://en.wikipedia.org/wiki/Coronavirus)
[4]. General knowledge about SARS outbreak from wikipedia. (https://en.wikipedia.org/wiki/Severe_acute_respiratory_syndrome)
[5]. Feb. 12 situation from New York post (https://nypost.com/2020/02/12/coronavirus-242-deaths-and-nearly-15000-cases-reported-in-one-day-in-china/)
[6]. Annotated lecture notes from Stats531.