Dengue is recognized as a tropical disease, transmitted primarily by a specific mosquito species, Aedes aegypti, along with several related species. [1] Dengue is believed to have spread out of Africa between the 15th and 19th centuries, and the first documented outbreaks occurred around the 18th century in Asia, Africa, and North America. The disease has a long history, and its spread was accelerated by the slave trade and the Second World War. [2] It is now the second most commonly diagnosed cause of fever after malaria, which is also a mosquito-borne infection. To this day there is no specific cure for dengue, and it remains a major cause of child mortality [3][4]. Even today, severe dengue is still a major cause of death in many tropical countries. According to the WHO, dengue incidence can be influenced by many variables, such as temperature, humidity, and other as-yet-unknown factors. It is also a disease with relatively thorough documentation over a long span of time, which makes it possible for researchers to find patterns and correlations in the data and, ideally, to predict when incidence will peak so that local authorities can act to prevent its spread.
In this project, we use a classical ARIMA time-series analysis to capture much of the variation in our data. Since Aedes aegypti mosquitoes breed in stagnant water and are the major vector of dengue, the rainy season in a tropical country is a hazardous period. Therefore, in order to predict outbreaks and help prevent more people from contracting the disease, we believe it is essential to understand the correlation between precipitation, temperature, and the breeding level of the mosquitoes.
This project is inspired by the Dengue Forecasting project from the National Oceanic and Atmospheric Administration (NOAA). [5] Indeed, the two datasets we leverage are also provided by NOAA. The dengue data is historical surveillance data for Iquitos, Peru from July 1st, 2000 to June 25th, 2009. It provides weekly counts of laboratory-confirmed dengue cases and, conveniently, covers all four dengue serotypes. The other dataset is daily environmental data from a weather station, including temperature and precipitation for Iquitos, Peru from January 2nd, 1973 to March 14th, 2015. In this project, we restrict both datasets to the dates from January 1st, 2002 to June 29th, 2009. Since the environmental data is recorded daily, we can aggregate it to weekly values and then match the two datasets together. By using both datasets, we may be able to see a clear correlation between temperature, precipitation, and the case counts.
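The daily-to-weekly aggregation can be sketched as follows. The analysis itself is done in R; this is a minimal Python illustration using ISO week numbers, with hypothetical record and field names (`tavg`, `prcp`), averaging temperature and summing precipitation within each week:

```python
from datetime import date
from collections import defaultdict

def daily_to_weekly(records):
    """Aggregate daily (date, tavg, prcp) records into ISO weeks.

    Returns {(iso_year, iso_week): (mean_tavg, total_prcp)}.
    """
    buckets = defaultdict(list)
    for d, tavg, prcp in records:
        iso = d.isocalendar()  # (ISO year, ISO week, weekday)
        buckets[(iso[0], iso[1])].append((tavg, prcp))
    weekly = {}
    for key, vals in buckets.items():
        temps = [t for t, _ in vals]
        rains = [p for _, p in vals]
        # temperature is averaged over the week, precipitation is totaled
        weekly[key] = (sum(temps) / len(temps), sum(rains))
    return weekly

# Example: one synthetic week of daily data (Mon 2002-01-07 .. Sun 2002-01-13)
days = [(date(2002, 1, 7 + i), 25.0 + i, 1.0) for i in range(7)]
weekly = daily_to_weekly(days)
```

Averaging temperature but totaling precipitation matches how the two variables act on mosquito breeding: standing water accumulates, while temperature effects depend on typical conditions.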
## YYYY MM DD TMAX
## Min. :1973 Length:14554 Length:14554 Min. :-9999.0
## 1st Qu.:1983 Class :character Class :character 1st Qu.:-9999.0
## Median :1994 Mode :character Mode :character Median : 31.0
## Mean :1994 Mean :-4205.7
## 3rd Qu.:2004 3rd Qu.: 33.0
## Max. :2015 Max. : 42.2
## TMIN TAVG TDTR PRCP
## Min. :-9999.0 Min. :-9999.0 Min. :-9999.0 Min. :-9999.0
## 1st Qu.:-9999.0 1st Qu.:-9999.0 1st Qu.:-9999.0 1st Qu.: 0.0
## Median : 21.5 Median :-9999.0 Median :-9999.0 Median : 0.0
## Mean :-2526.4 Mean :-5840.1 Mean :-5847.0 Mean :-1720.4
## 3rd Qu.: 22.3 3rd Qu.: 27.1 3rd Qu.: 10.2 3rd Qu.: 6.1
## Max. : 25.3 Max. : 32.3 Max. : 20.0 Max. : 461.3
Based on the summary of the station data, we find that several temperature and precipitation values are missing and coded with the sentinel value -9999. To prevent these sentinels from distorting the analysis, we simply replace each missing value with the mean of the observed values.
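The imputation step can be sketched as below. This is an illustrative Python version of the R preprocessing, assuming the missing-data code -9999 seen in the summary above:

```python
def impute_sentinel(values, sentinel=-9999.0):
    """Replace sentinel-coded missing entries with the mean of observed values."""
    observed = [v for v in values if v != sentinel]
    mean = sum(observed) / len(observed)
    return [mean if v == sentinel else v for v in values]

# Example: a TMAX series with two missing entries
tmax = [31.0, -9999.0, 33.0, 35.0, -9999.0]
clean = impute_sentinel(tmax)  # missing entries become the observed mean, 33.0
```

Mean imputation keeps the series complete without shifting its overall level, though it does understate the variance slightly.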
The following plots show the relationships between the weekly weather data and the number of total cases. From the first plot, we can see a peak around the end of 2005, and two smaller peaks from 2008 to 2009. Since the original plot is dominated by occasional large peaks, we take the log transformation of the number of total cases to make the patterns more visible and plot its relationship with the weekly weather data.
Frequency variation is one of the most important things we would like to characterize in a time series analysis. For the dengue data, high-frequency variation can be treated as noise and low-frequency variation as trend, while a band of mid-range frequencies can be considered to correspond to the business cycle. We apply a local linear regression approach to estimate the high-frequency and low-frequency variation and then extract the business cycle.
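The decomposition idea can be sketched as follows. The analysis uses local linear regression (loess) in R; here, as a simplified stand-in, a centered moving average plays the role of the smoother, with hypothetical window sizes:

```python
def moving_average(x, window):
    """Centered moving average; windows shrink near the series edges."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        seg = x[lo:hi]
        out.append(sum(seg) / len(seg))
    return out

def decompose(x, low_window=105, high_window=9):
    """Split a series into trend (low freq), noise (high freq), and cycle.

    A wide smoother gives the trend; a narrow smoother removes only the
    high-frequency jitter; the mid-band 'business cycle' is what remains
    between the two smooths.
    """
    trend = moving_average(x, low_window)
    smooth = moving_average(x, high_window)
    noise = [xi - si for xi, si in zip(x, smooth)]
    cycle = [si - ti for si, ti in zip(smooth, trend)]
    return trend, noise, cycle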
Based on the plot below, there do not exist any obvious business cycles. We therefore apply additional approaches to study our data.
From the previous class, we know that in a time series the value of a variable observed in the current period is highly likely to be similar to its value in the previous period, or even the period before that. Thus, when fitting a regression model to time series data, it is common to find autocorrelation in the residuals.
In this case, the estimated model violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong”, but they will usually have larger prediction intervals than they need to. [6] Therefore we should always look at an ACF plot of the residuals. Based on the plot below, the first several lags indicate that we need to consider ARMA models and the seasonally oscillating autocorrelations indicate the suitability of a SARMA model.
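The sample autocorrelations behind such an ACF plot are straightforward to compute. A minimal Python sketch of the standard estimator (the plot itself comes from R's `acf`):

```python
def acf(x, max_lag):
    """Sample autocorrelation function r_0..r_max_lag of a series."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n  # sample autocovariance at lag 0
    out = []
    for k in range(max_lag + 1):
        ck = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / n
        out.append(ck / c0)
    return out

# A series repeating with period 5 shows strong autocorrelation at lag 5
x = [0.0, 1.0, 2.0, 3.0, 4.0] * 4
r = acf(x, 5)
```

Under the null of white noise, autocorrelations beyond lag 0 should mostly fall within roughly ±1.96/√n, which is what the blue dashed lines on the ACF plot mark.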
To estimate the spectral density of the dengue time series, we compute the periodogram, identify the dominant frequency, and convert it into a period of oscillation. The predominant frequency occurs at 0.02 cycles per week, which corresponds to a period of 50 weeks per cycle.
## [1] 0.02
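Picking the peak frequency from a periodogram can be sketched as below. This is an illustrative pure-Python version (the analysis uses R's spectrum estimate) applied to a synthetic series with a 50-week cycle like the one found above:

```python
import math

def periodogram_peak(x):
    """Return the Fourier frequency (cycles/sample) with the largest periodogram value."""
    n = len(x)
    mean = sum(x) / n
    best_freq, best_power = 0.0, -1.0
    for j in range(1, n // 2 + 1):  # skip the zero frequency (the mean)
        freq = j / n
        re = sum((x[t] - mean) * math.cos(2 * math.pi * freq * t) for t in range(n))
        im = sum((x[t] - mean) * math.sin(2 * math.pi * freq * t) for t in range(n))
        power = (re * re + im * im) / n
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq

# Synthetic weekly series oscillating with a 50-week period
n = 200
series = [math.sin(2 * math.pi * t / 50) for t in range(n)]
peak = periodogram_peak(series)
period = 1 / peak  # 0.02 cycles/week corresponds to a 50-week period
```

The conversion used in the text is exactly this reciprocal: a peak at 0.02 cycles per week means one full cycle every 1/0.02 = 50 weeks.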
Recall that a more general approach is to compare the likelihoods of different models by penalizing the likelihood of each model by a measure of its complexity; this is Akaike's information criterion (AIC). It is given by \[AIC=-2 \times l(\theta)+2D,\] i.e., minus twice the maximized log likelihood plus twice the number of parameters.
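The criterion is simple to compute once a model has been fit. A small sketch, where the parameter count D = 6 for an ARMA(2,2) with intercept and innovation variance is an assumed bookkeeping convention and the log likelihood value is chosen to be consistent with the AIC table below:

```python
def aic(log_lik, n_params):
    """Akaike's information criterion: -2 * maximized log likelihood + 2 * D."""
    return -2.0 * log_lik + 2.0 * n_params

# ARMA(2,2): 2 AR + 2 MA coefficients + intercept + innovation variance -> D = 6
# (assumed parameter count; the log likelihood here is illustrative)
example = aic(-347.605, 6)
```

Lower AIC is better: the -2l(θ) term rewards fit while the 2D term penalizes complexity, so a bigger model only wins if each extra parameter buys at least one unit of log likelihood.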
|     | MA0     | MA1    | MA2    | MA3    | MA4    | MA5    | MA6    |
|-----|---------|--------|--------|--------|--------|--------|--------|
| AR0 | 1060.89 | 899.68 | 819.32 | 785.80 | 761.06 | 743.15 | 741.32 |
| AR1 | 746.33  | 710.69 | 711.09 | 712.97 | 714.58 | 716.38 | 718.23 |
| AR2 | 711.43  | 711.22 | 707.21 | 708.46 | 710.46 | 712.43 | 714.38 |
| AR3 | 711.20  | 713.16 | 708.45 | 709.74 | 715.40 | 712.71 | 714.53 |
| AR4 | 713.14  | 715.14 | 707.33 | 715.86 | 716.94 | 719.11 | 721.10 |
| AR5 | 715.12  | 716.83 | 715.29 | 717.01 | 718.53 | 713.53 | 716.84 |
| AR6 | 715.59  | 717.44 | 717.06 | 712.63 | 722.09 | 716.38 | 718.34 |
Based on the above table of AIC values, we observe that \(ARMA(2, 2)\) has the lowest AIC. However, \(ARMA(4, 2)\) also gives a low AIC value, close to that of \(ARMA(2, 2)\). Since we do not want to make strong claims about having found the best model using AIC alone, we also examine these models in more detail. We apply a hypothesis test comparing \(ARMA(2, 2)\) and \(ARMA(4, 2)\), using Wilks' approximation, to see whether one is significantly better than the other.
## [1] 3.884555
## [1] 0.143377
For this hypothesis test, the null hypothesis corresponds to \(ARMA(2, 2)\) and the alternative hypothesis corresponds to \(ARMA(4, 2)\). We obtain \(\Lambda=3.88\); comparing this value to the cutoff value \(5.99\) for a \(95\%\) significance level with \(2\) degrees of freedom, we cannot reject the null hypothesis. Thus, the model \(ARMA(2, 2)\) is more appropriate for our data. In general, intentionally fitting a larger model than required risks capturing spurious dynamics, which is known as overfitting. Since the difference between \(ARMA(2, 2)\) and \(ARMA(4, 2)\) is not statistically significant, this is another reason for us to choose the model \(ARMA(2,2)\).
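The test statistic can be recovered directly from the AIC table, since both criteria are built from the same maximized log likelihoods. A sketch, using the table values above (up to their rounding); the closed-form p-value uses the fact that a chi-squared distribution with 2 degrees of freedom is exponential with mean 2:

```python
import math

def wilks_lambda_from_aic(aic_null, aic_alt, extra_params):
    """Likelihood-ratio statistic 2*(l_alt - l_null) recovered from AICs.

    AIC = -2*l + 2*D, so 2*(l_alt - l_null) = (aic_null - aic_alt) + 2*extra_params.
    """
    return (aic_null - aic_alt) + 2 * extra_params

# ARMA(2,2) vs ARMA(4,2): two extra AR coefficients (AIC values from the table)
lam = wilks_lambda_from_aic(707.21, 707.33, 2)
p_value = math.exp(-lam / 2)       # chi-squared survival function, 2 df
reject_null = lam > 5.99           # 95% chi-squared cutoff with 2 df
```

The recovered statistic (about 3.88) and p-value (about 0.14) match the R output above, confirming that the two extra AR parameters do not earn their keep.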
Furthermore, we fit a \(SARMA(2,0,2) × (1,0,1)\) model and a \(SARMA(2,0,2) × (2,0,1)\) model to investigate the parameter estimates.
##
## Call:
## arima(x = log(df$total_cases + 1), order = c(2, 0, 2), seasonal = list(order = c(1,
## 0, 1), period = 52), xreg = df[c("avg_temp", "total_precip")])
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ma1 ma2 sar1 sma1 intercept
## 0.8592 0.0243 -0.3792 0.0839 0.8697 -0.7736 0.2924
## s.e. 0.0072 NaN 0.0471 0.0508 NaN NaN 2.4083
## avg_temp total_precip
## 0.0574 2e-04
## s.e. 0.0872 4e-04
##
## sigma^2 estimated as 0.3366: log likelihood = -343.87, aic = 707.73
##
## Call:
## arima(x = log(df$total_cases + 1), order = c(2, 0, 2), seasonal = list(order = c(2,
## 0, 1), period = 52), xreg = df[c("avg_temp", "total_precip")])
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sar2 sma1 intercept
## 1.9137 -0.9174 -1.4329 0.4329 0.0417 0.1691 0.0303 0.2387
## s.e. 0.0240 0.0240 0.0551 0.0547 0.4591 0.0701 0.4668 2.4142
## avg_temp total_precip
## 0.0584 2e-04
## s.e. 0.0878 4e-04
##
## sigma^2 estimated as 0.3302: log likelihood = -340.44, aic = 702.87
Based on the results of \(SARMA(2,0,2) × (1,0,1)\), we find that the AR, seasonal AR, and MA parameters are all substantial. From the results of \(SARMA(2,0,2) × (2,0,1)\), we see that the leading AR coefficient is close to 2 and the sum of the AR coefficients is approximately 1, which suggests a near-unit root and that \(SARMA(2,0,2) × (2,0,1)\) is over-parameterized. Thus, we conclude that \(SARMA(2,0,2) × (1,0,1)\) is sufficient to capture the dependence. We then run some diagnostics on this SARMA model.
##
## Shapiro-Wilk normality test
##
## data: sarma22$residuals
## W = 0.99322, p-value = 0.07663
Based on the time plot, the residuals appear to be mean-zero and stationary. The QQ plot shows that the residuals are approximately normal. To verify this, we apply the Shapiro-Wilk test; with p-value = 0.07663 > 0.05, we fail to reject the null hypothesis, so the residuals are consistent with a normal distribution. Moreover, the residual ACF looks uncorrelated (we expect about 5 out of 100 lags to cross the blue lines, and 6 do, which is not too bad) and shows no trend, as our model requires.
In this project, we take temperature and precipitation as the signal and an ARMA process as the noise to build a signal-plus-noise model. The SARMA fit shows that our data contain a remarkable amount of seasonality in the noise process, because the relationship among the Aedes mosquitoes' lifecycle, seasonal weather patterns, and human-mosquito interaction is highly dynamic and nearly impossible to explain fully using weather variables alone.
Another factor that complicates the analysis and obscures the complex cyclical nature of the system is the existence of temporary cross-immunity. [1] Once this temporary cross-immunity fades, subsequent infections by other serotypes make individuals more susceptible to severe dengue, which means a recovered dengue patient is not "safe" enough to simply be documented as recovered. [1] Hospitals and governments should therefore design a careful system for documenting dengue patients.
Indeed, the mechanisms behind the spread of the virus are much more complex than they appear, and are certainly nonlinear. Therefore, it is necessary and important to include more variables and apply more detailed approaches to investigate the latent processes that generate the dengue data.