1 Introduction

1.1 Dengue fever

Dengue fever is a tropical disease that is spread by the bite of a species of mosquito called Aedes Aegypti. The cause of the disease is the dengue virus, of which there are four serotypes (sub-types).[1] The mosquitoes which act as a vector for the disease prefer to breed in stagnant water. [2]

1.2 Motivation for the work

In this project, we are going to try to model dengue case counts in San Juan, Puerto Rico using observed case counts between 1990 and 2009 as well as weather data for the corresponding time period. One may wonder that even though there may be much more sophisticated ways to model such case counts, a classical time-series ARIMA analysis could capture much of the variation in our data. For instance, we know that the Aedes Aegypti mosquitoes breed in stagnant water. Since we expect to see stagnant water after rainy periods, we may think that temperature and rainfall could be used as predictors of the breeding levels of the mosquitoes, which could be predictive of dengue case counts. This is a similar approach to Zhang et al.[3] where minimum temperature, relative humidity and rainfall were used to model dengue case counts in Zhongshan, China. A key difference between that approach and the following is that the former uses a GLM, which is an extension of independent errors assumption in OLS. In our case, we allow our errors to be correlated. The motivation for this project is, therefore, to explore a time-series model where we consider temperature and rainfall as our signal and an ARMA process as our noise process. Our case counts (log-transformed, as we’ll see later) will then be the response variable.

2 Data Preparation

The two datasets we’ll be using were obtained from the National Oceanic and Atmospheric Administration. [4] One of the datasets includes Dengue case counts in San Juan, Puerto Rico, between April 30, 1990 and April 23, 2009. Although this dataset provides weekly counts for each serotype, we will, for simplicity’s sake, consider totalized counts per week (summing up the counts for each serotype) as our variable of interest.

The other dataset includes temperature and precipitation data from a station in the same city. This dataset has daily data. In order to have the same frequency of measurement as our weekly case counts, we use average weekly temperature and total weekly precipitation as our weekly weather measurements. There were three days of precipitation data that were missing and we imputed the precipitation average for the entire dataset for these days.

For all future analysis we will refer to the variable \(\texttt{full_dengue}\) as a dataset that has combined weekly weather and dengue data.

3 Preliminary analysis

Now that we have weekly weather and disease data, let’s start by fitting an OLS regression (with independent errors) with temperature and precipitation as predictors. In preparation for ARMA modeling, I consider the log-transformed total case count as my response variable. This is because, as we saw in the midterm exams of 2016 and 2018, when our data have occasional large spikes (as is clear in our first figure), the log-transformed data is better modeled using ARMA models than is the original-scale data.

cases.lm = lm(log(total_cases + 1) ~ avg_temp + total_precip, data = full_dengue)
cases.lm.resid = resid(cases.lm)
par(mfrow=c(1,3))
plot(full_dengue$avg_temp, cases.lm.resid,
     ylab="Residuals", xlab="Average temperature", 
     main="Residuals vs. Avg. temperature")
abline(0,0)
plot(full_dengue$total_precip, cases.lm.resid,
     ylab="Residuals", xlab="Weekly precipitation", 
     main="Residuals vs. Weekly precipitation")
abline(0,0)
# autocorrelation plot
acf(cases.lm.resid, lag = 100, main = 'Autocorrelation of residuals')

Unsurprisingly, we have found that the data are not modeled well by a linear regression with independent gaussian errors. We notice two things from the plots above: first, we notice some small heteroskedasticity in the residuals with respect to the precipitation variable. Second, the residuals show evidence of significant oscillating autocorrelation.

The autocorrelations with the first 17 lags suggests that we should consider ARMA models and the seasonally oscillating autocorrelations indicate the suitability of a SARMA model. We shall use an AIC table to help us get started with parameters of an ARMA model. But first, let’s look at a periodogram of the log-transformed case counts.

Note that I’ve restricted the x-axis limits so that I could focus on the portion of the plot with non-negligible power. The predominant frequencies occur at 0.019 cycles per week and 0.007 cycles per week, which translate to about 52 weeks per cycle and 2.75 years per cycle.

4 A signal-plus-noise model with SARMA errors

Let us construct an AIC table with possible ARMA models (leaving out seasonality for now):

MA0 MA1 MA2 MA3 MA4 MA5 MA6 MA7 MA8 MA9 MA10
AR0 2850.24 2071.62 1732.35 1511.68 1385.43 1297.37 1190.36 1136.55 1113.43 1063.11 1023.48
AR1 1059.39 943.45 944.96 919.98 909.98 905.57 899.58 901.44 900.95 889.88 891.76
AR2 969.12 945.19 945.31 872.17 872.08 874.08 875.79 901.87 903.81 876.50 893.80
AR3 933.02 934.99 874.06 932.80 874.10 876.00 901.23 895.52 893.54 900.00 892.30
AR4 934.97 936.61 938.61 877.19 874.43 866.45 879.81 897.65 899.19 895.53 899.65
AR5 932.49 886.03 904.05 876.10 869.31 876.88 867.50 868.51 875.27 908.53 876.87

Let’s put aside seasonality for a moment. If we use AIC as our criterion, it looks like ARMA(4,5) is the model that we are invited to pick. Consider, however, ARMA(3,2). Although ARMA(4,5) is better in terms of prediction accuracy, we know that larger ARMA models are prone to numerical problems. We can, therefore, perform a hypothesis test on whether ARMA(4,5) is significantly better than ARMA(3,2) by comparing the log-likelihoods under the two models.

arma45_ll = arima(log(full_dengue$total_cases+1),order=c(4,0,5), xreg = full_dengue[c('avg_temp','total_precip')])$loglik

arma32_ll = arima(log(full_dengue$total_cases+1),order=c(3,0,2), xreg = full_dengue[c('avg_temp','total_precip')])$loglik

chi_sq_diff = 2*(arma45_ll - arma32_ll)
1 - pchisq(chi_sq_diff, df = 4)
## [1] 0.003600093

The difference between ARMA(4,5) and ARMA(3,2) is statistically significant. Given the seasonality we’ve seen, and the suitability of ARMA(4,5), let us fit a SARMA(4,0,5) \(\times\) (1,0,1) (we set our seasonal AR and MA parameter counts to 1 so as to capture both forms of dependence without adding many more parameters to an already large ARMA model) and investigate our parameter estimates. Note that I’ve supplied a period of 52 for the seasonality since this is the predominant period according to our periodogram.

# estimate ARMA(4,5) and beta parameters
sarma45 <- arima(log(full_dengue$total_cases + 1),order=c(4,0,5), seasonal=list(order=c(1,0,1),period=52), xreg = full_dengue[c('avg_temp', 'total_precip')])
sarma45
## 
## Call:
## arima(x = log(full_dengue$total_cases + 1), order = c(4, 0, 5), seasonal = list(order = c(1, 
##     0, 1), period = 52), xreg = full_dengue[c("avg_temp", "total_precip")])
## 
## Coefficients:
##          ar1     ar2     ar3      ar4     ma1      ma2      ma3     ma4
##       0.1541  1.3302  0.3317  -0.8505  0.3483  -0.9913  -0.5731  0.5657
## s.e.  0.0513  0.0394  0.0363   0.0482  0.0597   0.0611   0.0426  0.0314
##          ma5    sar1     sma1  intercept  avg_temp  total_precip
##       0.0883  0.9925  -0.9625     0.7984    0.0812        -4e-04
## s.e.  0.0360  0.0582   0.1558     0.5918    0.0200         4e-04
## 
## sigma^2 estimated as 0.1301:  log likelihood = -409.21,  aic = 848.42

4.1 Discussion of our SARMA parameter estimates

It seems that our seasonal AR and MA parameters (\(\texttt{sar1}\) and \(\texttt{sma1}\)) are both significant. One might mistakenly think that there might not be seasonality in our noise because our regressors (temperature and rainfull) capture what we traditionally call “seasonal” behavior. However, it makes sense that there is still a remarkable amount of seasonality in our noise process because the relationship among the Aedes mosquitoes’ life-cycle, seasonal weather patterns and human-mosquito interaction is highly dynamic and difficult to explain using only weather variables.

As an example of complex cyclicality, consider the following. Recovery from infection by one serotype of the virus leads to lifelong immunity against that particular serotype[1]. Therefore, as a massive epidemic from a specific serotype of the virus ends, we expect the population to have herd immunity (a phenomenon in which the immunity of a large segment of the population prevents large breakouts of an epidemic) against that serotype. This immunity will protect the entire population from another massive epidemic even though we still expect annual flare-ups as seen in our first figure (these flare-ups could be caused by other serotypes of the virus). As the herd immunity wanes (due to death, migration, new births or evolution of the virus) we see the population becomes susceptible to massive epidemics again, leading to the massive spikes that we see occasionally in our first figure. This is an example of a cyclical component of the disease that is not captured by annual seasonal weather patterns.

Another factor that adds to the complex cyclical nature of the system is the existence of temporary cross-immunity (immunity of a person to a serotype of the virus that is different from the one that infected them).[1] Paradoxically, however, once this temporary cross-immunity fades away, subsequent infections from other serotypes make individuals more susceptible to severe dengue (a form of dengue fever associated with more severe symptoms and possibly death).

Finally, we also notice that our total precipitation covariate surprisingly has a coefficient that is not significant. This could be because the information contained in this variable is already encoded by the average temperature of the week.

4.2 Model Diagnostics

Below are the time plot, sample auto-correlation function and QQ plot for the residuals after fitting our SARMA model.

The residuals seem to be mean-zero, uncorrelated (we expect 5 out of 100 lines to cross the blue lines and 6 do, which isn’t too bad) and approximately Gaussian, as our model dictates.

5 Simulation vs. Observed Data

We got a hint from section 4.1 that the “signal” in our signal-plus-noise model failed to capture some large portion of the variability in our data. If we believe that most of the variation in the case count data is captured in the temperature and precipitation of the region, then we expect that using the weekly temperature and precipitation values as covariates will get us close to the observed data and the noise process will be small. Let’s visually inspect if a simulation from our model resembles our observed data.

The key takeaway from the plot above is that most of the variation in our data is not captured by temperature and rainfall. We had hoped that since these covariates can be used as a proxy to predict mosquito population (since the mosquitoes breed in stagnant waters), we could perhaps recover the variation in our case counts for dengue using these covariates as our regressors. However, as discussed in section 4.1, the systems and laws that dictate the interplay among the mosquitoes, the humans, the virus (which itself has four strains) and the weather are highly complex and not easily modeled by only looking at prevailing weather conditions. Therefore, we can say that while a signal-plus-noise model with temperature and rainfall as regressors and a SARMA(4,5) process as a noise process is a model whose assumptions seem to be met by the data, this model is still quite inadequate at accounting for the major sources of variation in the data.

6 Disentangling case counts as trend, noise and cycle

Despite not finding the holy grail of covariates that explain the variation in our data, we’ve been able to make important advances in understanding which ARMA model fits our data best, what covariates are significant, and how much seasonality is left after we take into account the seasonality in our covariates. Using the tools from smoothing, we can also try to tease out the trend, noise and cycle in our data.

Using the appropriate \(\texttt{span}\) window, we can see the trend of the log-transformed data (using the colloquial version of “trend” here):

library(lubridate)
full_dengue = full_dengue %>% mutate(wsd_decimal = decimal_date(week_start_date))
log_dengue_loess = loess(log(total_cases + 1)~wsd_decimal,data = full_dengue, span=0.4)
ggplot() +
  geom_line(mapping = aes(x = full_dengue$wsd_decimal, y = log(full_dengue$total_cases + 1), color = "Observed Log Cases")) +
  geom_line(mapping = aes(x = full_dengue$wsd_decimal, y = log_dengue_loess$fitted, color = "Trend")) + 
  scale_colour_manual("", 
                      breaks = c("Observed Log Cases", "Trend"),
                      values = c("red", "black")) +
  labs(x = 'Week start date', y = 'Log Case Counts')

Between the annual high-frequency seasonality we see in the observed case counts and the low-frequency long-term trend (in black above), we see that there is a mid-frequency seasonality corresponding to the cycle of large spikes in the log-tranformed data. Let’s try to disentagle this important piece from the rest of the data and get a sense of its frequency.

We can now plot a frequency response plot for these mid-range frequency cycles and pinpoint the frequencies that this decomposition identifies as spike cycles.

## [1] 0.048 0.144

Therefore, the work we’ve done by separating the mid-range frequency cycles has taught us that the spike cycles in our log-transformed data (corresponding to the business cycles example we saw in class) have frequency between 0.048 cycles per month and 0.144 cycles per month corresponding to 6.94-20.83 months per cycle. This range is, of course, sensitive to what value we use for our frequency ratio cutoff.

7 Summary and future works

8 References

  1. World Health Organization, Dengue and severe dengue, http://www.who.int/mediacentre/factsheets/fs117/en/, April 2017

  2. Wikipedia, Aedes Aegypti, https://en.wikipedia.org/wiki/Aedes_aegypti, February 2018

  3. Zhang Y., Wang T., et al., Developing a Time Series Predictive Model for Dengue in Zhongshan, China Based on Weather and Guangzhou Dengue Surveillance Data, PLOS, 2016

  4. National Oceanic and Atmospheric Administration, Dengue Forecasting, http://dengueforecasting.noaa.gov/