Dengue is recognized as a tropical disease, transmitted primarily by a specific mosquito species, Aedes aegypti, along with several related species. [1] Dengue is believed to have spread out of Africa between the 15th and 19th centuries, and the first documented outbreaks occurred around the 18th century in Asia, Africa, and North America. The disease has a long history, and its spread was accelerated by the slave trade and the Second World War. [2] It is now the second most commonly diagnosed cause of fever after malaria, which is also a mosquito-borne infection. To this day there is no specific cure for dengue, and it remains a major cause of child mortality [3][4]. Even today, severe dengue is still a major cause of death in many tropical countries. According to the WHO, dengue incidence can be influenced by many variables, such as temperature, humidity, and other as-yet-unknown factors. It is also a disease with relatively thorough documentation over a long span of time, which makes it possible for researchers to find patterns and correlations in the data and, ideally, to predict when incidence will peak so that local authorities can act to prevent its spread.
In this project, we use a classical ARIMA time-series analysis to capture much of the variation in our data. Since Aedes aegypti mosquitoes breed in stagnant water and are the major vector of dengue, the rainy season in a tropical country is a hazardous period. Therefore, in order to predict outbreaks and help prevent more people from contracting the disease, we believe it is essential to understand the correlation between precipitation, temperature, and the breeding level of the mosquitoes.
This project is inspired by the Dengue Forecasting project from the National Oceanic and Atmospheric Administration (NOAA). [5] Indeed, the two datasets we leverage are also provided by NOAA. The dengue data is historical surveillance data for Iquitos, Peru from July 1st, 2000 to June 25th, 2009. It provides weekly counts of laboratory-confirmed dengue cases and, conveniently, covers all four dengue serotypes. The other dataset is daily environmental data from a weather station, including temperature and precipitation for Iquitos, Peru from January 2nd, 1973 to March 14th, 2015. In this project, we restrict both datasets to the dates from January 1st, 2002 to June 29th, 2009. Since the environmental data is recorded daily, we can aggregate it to weekly values and then match the two datasets together. By using both datasets, we may be able to see a clear correlation between temperature, precipitation, and the case counts.
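The daily-to-weekly aggregation can be sketched as follows. The analysis itself is done in R; this is a minimal Python illustration using ISO week numbers, with hypothetical record and field names (`tavg`, `prcp`), averaging temperature and summing precipitation within each week:

```python
from datetime import date
from collections import defaultdict

def daily_to_weekly(records):
    """Aggregate daily (date, tavg, prcp) records into ISO weeks.

    Returns {(iso_year, iso_week): (mean_tavg, total_prcp)}.
    """
    buckets = defaultdict(list)
    for d, tavg, prcp in records:
        iso = d.isocalendar()  # (ISO year, ISO week, weekday)
        buckets[(iso[0], iso[1])].append((tavg, prcp))
    weekly = {}
    for key, vals in buckets.items():
        temps = [t for t, _ in vals]
        rains = [p for _, p in vals]
        # temperature is averaged over the week, precipitation is totaled
        weekly[key] = (sum(temps) / len(temps), sum(rains))
    return weekly

# Example: one synthetic week of daily data (Mon 2002-01-07 .. Sun 2002-01-13)
days = [(date(2002, 1, 7 + i), 25.0 + i, 1.0) for i in range(7)]
weekly = daily_to_weekly(days)
```

Averaging temperature but totaling precipitation matches how the two variables act on mosquito breeding: standing water accumulates, while temperature effects depend on typical conditions.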
## YYYY MM DD TMAX
## Min. :1973 Length:14554 Length:14554 Min. :-9999.0
## 1st Qu.:1983 Class :character Class :character 1st Qu.:-9999.0
## Median :1994 Mode :character Mode :character Median : 31.0
## Mean :1994 Mean :-4205.7
## 3rd Qu.:2004 3rd Qu.: 33.0
## Max. :2015 Max. : 42.2
## TMIN TAVG TDTR PRCP
## Min. :-9999.0 Min. :-9999.0 Min. :-9999.0 Min. :-9999.0
## 1st Qu.:-9999.0 1st Qu.:-9999.0 1st Qu.:-9999.0 1st Qu.: 0.0
## Median : 21.5 Median :-9999.0 Median :-9999.0 Median : 0.0
## Mean :-2526.4 Mean :-5840.1 Mean :-5847.0 Mean :-1720.4
## 3rd Qu.: 22.3 3rd Qu.: 27.1 3rd Qu.: 10.2 3rd Qu.: 6.1
## Max. : 25.3 Max. : 32.3 Max. : 20.0 Max. : 461.3
Based on the summary of the station data, we find that several temperature and precipitation values are missing and coded with the sentinel value -9999. To prevent these sentinels from distorting the analysis, we simply replace each missing value with the mean of the observed values.
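The imputation step can be sketched as below. This is an illustrative Python version of the R preprocessing, assuming the missing-data code -9999 seen in the summary above:

```python
def impute_sentinel(values, sentinel=-9999.0):
    """Replace sentinel-coded missing entries with the mean of observed values."""
    observed = [v for v in values if v != sentinel]
    mean = sum(observed) / len(observed)
    return [mean if v == sentinel else v for v in values]

# Example: a TMAX series with two missing entries
tmax = [31.0, -9999.0, 33.0, 35.0, -9999.0]
clean = impute_sentinel(tmax)  # missing entries become the observed mean, 33.0
```

Mean imputation keeps the series complete without shifting its overall level, though it does understate the variance slightly.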
The following plots show the relationships between the weekly weather data and the number of total cases. From the first plot, we can see a peak around the end of 2005, and two smaller peaks from 2008 to 2009. Since the original plot is dominated by occasional large peaks, we take the log transformation of the number of total cases to make the patterns more visible and plot its relationship with the weekly weather data.
Frequency variation is one of the most important things we would like to characterize in a time series analysis. For the dengue data, high-frequency variation can be treated as noise and low-frequency variation as trend, while a band of mid-range frequencies can be considered to correspond to the business cycle. We apply a local linear regression approach to estimate the high-frequency and low-frequency variation and then extract the business cycle.
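The decomposition idea can be sketched as follows. The analysis uses local linear regression (loess) in R; here, as a simplified stand-in, a centered moving average plays the role of the smoother, with hypothetical window sizes:

```python
def moving_average(x, window):
    """Centered moving average; windows shrink near the series edges."""
    half = window // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - half), min(len(x), i + half + 1)
        seg = x[lo:hi]
        out.append(sum(seg) / len(seg))
    return out

def decompose(x, low_window=105, high_window=9):
    """Split a series into trend (low freq), noise (high freq), and cycle.

    A wide smoother gives the trend; a narrow smoother removes only the
    high-frequency jitter; the mid-band 'business cycle' is what remains
    between the two smooths.
    """
    trend = moving_average(x, low_window)
    smooth = moving_average(x, high_window)
    noise = [xi - si for xi, si in zip(x, smooth)]
    cycle = [si - ti for si, ti in zip(smooth, trend)]
    return trend, noise, cycle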
Based on the plot below, there do not exist any obvious business cycles. We therefore apply additional approaches to study our data.
From the previous class, we know that in a time series the value of a variable observed in the current period is highly likely to be similar to its value in the previous period, or even the period before that. Thus, when fitting a regression model to time series data, it is common to find autocorrelation in the residuals.
In this case, the estimated model violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong”, but they will usually have larger prediction intervals than they need to. [6] Therefore we should always look at an ACF plot of the residuals. Based on the plot below, the first several lags indicate that we need to consider ARMA models and the seasonally oscillating autocorrelations indicate the suitability of a SARMA model.
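The sample autocorrelations behind such an ACF plot are straightforward to compute. A minimal Python sketch of the standard estimator (the plot itself comes from R's `acf`):

```python
def acf(x, max_lag):
    """Sample autocorrelation function r_0..r_max_lag of a series."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n  # sample autocovariance at lag 0
    out = []
    for k in range(max_lag + 1):
        ck = sum((x[i] - mean) * (x[i + k] - mean) for i in range(n - k)) / n
        out.append(ck / c0)
    return out

# A series repeating with period 5 shows strong autocorrelation at lag 5
x = [0.0, 1.0, 2.0, 3.0, 4.0] * 4
r = acf(x, 5)
```

Under the null of white noise, autocorrelations beyond lag 0 should mostly fall within roughly ±1.96/√n, which is what the blue dashed lines on the ACF plot mark.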
To estimate the spectral density of the dengue time series, we compute the periodogram, identify the dominant frequency, and convert it into a period of oscillation. The predominant frequency occurs at 0.02 cycles per week, which corresponds to a period of 50 weeks per cycle.
## [1] 0.02
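Picking the peak frequency from a periodogram can be sketched as below. This is an illustrative pure-Python version (the analysis uses R's spectrum estimate) applied to a synthetic series with a 50-week cycle like the one found above:

```python
import math

def periodogram_peak(x):
    """Return the Fourier frequency (cycles/sample) with the largest periodogram value."""
    n = len(x)
    mean = sum(x) / n
    best_freq, best_power = 0.0, -1.0
    for j in range(1, n // 2 + 1):  # skip the zero frequency (the mean)
        freq = j / n
        re = sum((x[t] - mean) * math.cos(2 * math.pi * freq * t) for t in range(n))
        im = sum((x[t] - mean) * math.sin(2 * math.pi * freq * t) for t in range(n))
        power = (re * re + im * im) / n
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq

# Synthetic weekly series oscillating with a 50-week period
n = 200
series = [math.sin(2 * math.pi * t / 50) for t in range(n)]
peak = periodogram_peak(series)
period = 1 / peak  # 0.02 cycles/week corresponds to a 50-week period
```

The conversion used in the text is exactly this reciprocal: a peak at 0.02 cycles per week means one full cycle every 1/0.02 = 50 weeks.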
Recall that a more general approach is to compare the likelihoods of different models by penalizing the likelihood of each model by a measure of its complexity; this is Akaike's information criterion (AIC). It is given by \[AIC=-2 \times l(\theta)+2D,\] i.e., minus twice the maximized log likelihood plus twice the number of parameters.
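The criterion is simple to compute once a model has been fit. A small sketch, where the parameter count D = 6 for an ARMA(2,2) with intercept and innovation variance is an assumed bookkeeping convention and the log likelihood value is chosen to be consistent with the AIC table below:

```python
def aic(log_lik, n_params):
    """Akaike's information criterion: -2 * maximized log likelihood + 2 * D."""
    return -2.0 * log_lik + 2.0 * n_params

# ARMA(2,2): 2 AR + 2 MA coefficients + intercept + innovation variance -> D = 6
# (assumed parameter count; the log likelihood here is illustrative)
example = aic(-347.605, 6)
```

Lower AIC is better: the -2l(θ) term rewards fit while the 2D term penalizes complexity, so a bigger model only wins if each extra parameter buys at least one unit of log likelihood.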
|     | MA0     | MA1    | MA2    | MA3    | MA4    | MA5    | MA6    |
|-----|---------|--------|--------|--------|--------|--------|--------|
| AR0 | 1060.89 | 899.68 | 819.32 | 785.80 | 761.06 | 743.15 | 741.32 |
| AR1 | 746.33  | 710.69 | 711.09 | 712.97 | 714.58 | 716.38 | 718.23 |
| AR2 | 711.43  | 711.22 | 707.21 | 708.46 | 710.46 | 712.43 | 714.38 |
| AR3 | 711.20  | 713.16 | 708.45 | 709.74 | 715.40 | 712.71 | 714.53 |
| AR4 | 713.14  | 715.14 | 707.33 | 715.86 | 716.94 | 719.11 | 721.10 |
| AR5 | 715.12  | 716.83 | 715.29 | 717.01 | 718.53 | 713.53 | 716.84 |
| AR6 | 715.59  | 717.44 | 717.06 | 712.63 | 722.09 | 716.38 | 718.34 |
Based on the above table of AIC values, we observe that \(ARMA(2, 2)\) has the lowest AIC. However, \(ARMA(4, 2)\) also gives a low AIC value, close to that of \(ARMA(2, 2)\). Since we do not want to make strong claims about having found the best model using AIC alone, we also examine these models in more detail. We apply a hypothesis test comparing \(ARMA(2, 2)\) and \(ARMA(4, 2)\), using Wilks' approximation, to see whether one is significantly better than the other.
## [1] 3.884555
## [1] 0.143377
For this hypothesis test, the null hypothesis corresponds to \(ARMA(2, 2)\) and the alternative hypothesis corresponds to \(ARMA(4, 2)\). We obtain \(\Lambda=3.88\); comparing this value to the cutoff value \(5.99\) for a \(95\%\) significance level with \(2\) degrees of freedom, we cannot reject the null hypothesis. Thus, the model \(ARMA(2, 2)\) is more appropriate for our data. In general, intentionally fitting a larger model than required risks capturing spurious dynamics, which is known as overfitting. Since the difference between \(ARMA(2, 2)\) and \(ARMA(4, 2)\) is not statistically significant, this is another reason for us to choose the model \(ARMA(2,2)\).
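The test statistic can be recovered directly from the AIC table, since both criteria are built from the same maximized log likelihoods. A sketch, using the table values above (up to their rounding); the closed-form p-value uses the fact that a chi-squared distribution with 2 degrees of freedom is exponential with mean 2:

```python
import math

def wilks_lambda_from_aic(aic_null, aic_alt, extra_params):
    """Likelihood-ratio statistic 2*(l_alt - l_null) recovered from AICs.

    AIC = -2*l + 2*D, so 2*(l_alt - l_null) = (aic_null - aic_alt) + 2*extra_params.
    """
    return (aic_null - aic_alt) + 2 * extra_params

# ARMA(2,2) vs ARMA(4,2): two extra AR coefficients (AIC values from the table)
lam = wilks_lambda_from_aic(707.21, 707.33, 2)
p_value = math.exp(-lam / 2)       # chi-squared survival function, 2 df
reject_null = lam > 5.99           # 95% chi-squared cutoff with 2 df
```

The recovered statistic (about 3.88) and p-value (about 0.14) match the R output above, confirming that the two extra AR parameters do not earn their keep.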
Furthermore, we fit a \(SARMA(2,0,2) × (1,0,1)\) model and a \(SARMA(2,0,2) × (2,0,1)\) model to investigate the parameter estimates.
##
## Call:
## arima(x = log(df$total_cases + 1), order = c(2, 0, 2), seasonal = list(order = c(1,
## 0, 1), period = 52), xreg = df[c("avg_temp", "total_precip")])
##
## Coefficients:
## Warning in sqrt(diag(x$var.coef)): NaNs produced
## ar1 ar2 ma1 ma2 sar1 sma1 intercept
## 0.8592 0.0243 -0.3792 0.0839 0.8697 -0.7736 0.2924
## s.e. 0.0072 NaN 0.0471 0.0508 NaN NaN 2.4083
## avg_temp total_precip
## 0.0574 2e-04
## s.e. 0.0872 4e-04
##
## sigma^2 estimated as 0.3366: log likelihood = -343.87, aic = 707.73
##
## Call:
## arima(x = log(df$total_cases + 1), order = c(2, 0, 2), seasonal = list(order = c(2,
## 0, 1), period = 52), xreg = df[c("avg_temp", "total_precip")])
##
## Coefficients:
## ar1 ar2 ma1 ma2 sar1 sar2 sma1 intercept
## 1.9137 -0.9174 -1.4329 0.4329 0.0417 0.1691 0.0303 0.2387
## s.e. 0.0240 0.0240 0.0551 0.0547 0.4591 0.0701 0.4668 2.4142
## avg_temp total_precip
## 0.0584 2e-04
## s.e. 0.0878 4e-04
##
## sigma^2 estimated as 0.3302: log likelihood = -340.44, aic = 702.87
Based on the results of \(SARMA(2,0,2) × (1,0,1)\), we find that the AR, seasonal AR, and MA parameters are all substantial. From the results of \(SARMA(2,0,2) × (2,0,1)\), we see that the leading AR coefficient is close to 2 and the sum of the AR coefficients is approximately 1, which suggests a near-unit root and that \(SARMA(2,0,2) × (2,0,1)\) is over-parameterized. Thus, we conclude that \(SARMA(2,0,2) × (1,0,1)\) is sufficient to capture the dependence. We then run some diagnostics on this SARMA model.
##
## Shapiro-Wilk normality test
##
## data: sarma22$residuals
## W = 0.99322, p-value = 0.07663
Based on the time plot, the residuals appear to be mean-zero and stationary. The QQ plot shows that the residuals are approximately normal. To verify this, we apply the Shapiro-Wilk test; with p-value = 0.07663 > 0.05, we fail to reject the null hypothesis, so the residuals are consistent with a normal distribution. Moreover, the residual ACF looks uncorrelated (we expect about 5 out of 100 lags to cross the blue lines, and 6 do, which is not too bad) and shows no trend, as our model requires.
In this project, we take temperature and precipitation as the signal and an ARMA process as the noise to build a signal-plus-noise model. The SARMA fit shows that our data contain a remarkable amount of seasonality in the noise process, because the relationship among the Aedes mosquitoes' lifecycle, seasonal weather patterns, and human-mosquito interaction is highly dynamic and nearly impossible to explain fully using weather variables alone.
Another factor that complicates the analysis and obscures the complex cyclical nature of the system is the existence of temporary cross-immunity. [1] Once this temporary cross-immunity fades, subsequent infections by other serotypes make individuals more susceptible to severe dengue, which means a recovered dengue patient is not "safe" enough to simply be documented as recovered. [1] Hospitals and governments should therefore design a careful system for documenting dengue patients.
Indeed, the mechanisms behind the spread of the virus are much more complex than they appear, and are certainly nonlinear. Therefore, it is necessary and important to include more variables and apply more detailed approaches to investigate the latent processes that generate the dengue data.