1 Introduction

In the previous project, I found that rainfall could be a potential driver for recent cholera epidemics in Hati using preliminary cross-correlation analysis. In order to obtain better insights into the interaction between rainfall and cholera transmission, in this study, I further examine the cross-correlation between transmission (using difference of log incidence as an approximate) and rainfall (smoothened by different functions). As follow-up, I implement a mechanistic model in an effort to explain the possible mechanism regarding how rainfall could drive cholera epidemics in Haiti.

2 Data

Case Data Cholera daily incidence by departments is available on Ministere de la Sante Publique et de la Population (MSPP) website. Report is published everyday in pdf formate. From 2012 to 2015, There are 120 dates missing in the database. Half of the data points appear to be missing at random; however, some of the data missing throughout large segments in time, particularly in October 2013 and June 2014. These missing data may bias the outcomes given the non-systematic distribution of the size of the missing data in time and the fact that the missing points located when the cholera outbreak usually occurs. Data crawling and parsing are managed using python 2.7.10.

Rainfall Data Corresponding daily precipitation estimates are pulled from National Aeronautics and Space Administration (NASA) Tropical Rainfall Measuring Mission (TRMM). The coordinate of geographical coverage is longitude [-73.120633, -72.226589] and latitude [18.838958, 19.805066] with resolutions of 0.25° x 0.25°. Fig 1 below shows rainfall and incidence data in recent outbreaks (01/01/2012-12/31/2015).

Fig 1: daily cholera incidence and rainfall time series

Fig 1: daily cholera incidence and rainfall time series


3 Analysis

3.1 Cross-correlation analysis

Transmission approximate

  • Difference of natural log on incidence is taken as an approximate of cholera transmission (Fig 2) \[ ln(t_{n+1}) - ln(t_n), n=1,2,3,…\]
  • Unlike the incidence, we can see there are very sharp and narrow peaks from the plot. Most of the peaks occur during summer time/raining season, which is consistent with our hypothesis. However, the abrupt peaks might be owing to some artificial effect of reporting behavior or the setup of the surveillance system. For example, hospitals might tend to report cases once every week; even though the incidence reports publish daily.
  • Fig 3 is the zoomed plot by cutting off those huge peaks. It is still hard to tell if there is any pattern given all the noises.


Fig 2: difference of log(incidence)

Fig 2: difference of log(incidence)


Rainfall data smoothing

I have tried three smoothing methods here:

  • Fourier Filtering (FFT)
    • I first use Fast Fourier Transform fft in order to see the magnitude of frequency in the rainfall data (Re(fft_rain), Fig 3). We can sort of see that there are some relative stronger signals before y = 20; however, it is very hard to determine the cutoff of noise. Therefore, I just decide the range by visualizing various cutoff, and here is the smoothing I come out with (Fig4) which will be used in the following analysis.
Fig 3: daily cholera incidence and rainfall time series

Fig 3: daily cholera incidence and rainfall time series


Fig 4: FFT-smoothed rainfall

Fig 4: FFT-smoothed rainfall


  • Linear interpolation (LI)
    • I sum up the rainfall data by week and then use linear interpolate function approxfun to interpolate the daily points (Fig5).
Fig 5: linear-interpolation-smoothed rainfall

Fig 5: linear-interpolation-smoothed rainfall


  • Cubic spline interpolation (SP)
    • The procedure is the same as previous one, but I use sp_fun to interpolate instead. (Fig6)
Fig 6: spline-interpolation-smoothed rainfall

Fig 6: spline-interpolation-smoothed rainfall


  • Both linear interpolation and cubic spline interpolation give very similar outcomes. FFT-processed rainfall is smoother given the cutoff point I choose.


CCF and pre-whitening CCF

  • Cholera transmission approximates and three smoothed rainfall sereis are used for following cross-correlation analysis. As the first step, I plot the rainfall seires against cholera transmission series (here i only show the one with FFT-smoothing Fig 7). It seems that the peaks of transmission coincide pretty well with peaks in FFT-smoothed rainfall.
Fig 7: cholera transmission and smoothed rainfall

Fig 7: cholera transmission and smoothed rainfall


  • I then run ccf on all three rainfall-transmission pairs (Fig 8). The CCFs of all three pairs look very similar, and all show very obvious sinusoidal pattern. However the strongest signals from the three plots all suggest that future rainfall (~45 days) is positively correlated with cholera transmission, which is contradicted against what we thought earlier.
Fig 8: cross-correlation between smoothed rainfall and cholera transmission

Fig 8: cross-correlation between smoothed rainfall and cholera transmission


  • Pre-whitening CCF is also applied to all the rainfall and cholera transmission series (Fig 9). All pre-whitening CCF of three smoothed rainfall series and cholera transmission again demonstrate very similar pattern. We can see there are strong (both positive and negative) correlation peaks every other ~100 days among three plots, even thought the the pattern of the strength of the peaks are different (eg. for FFT, the peaks get bigger when lag is longer; however for LI, the peaks are smaller when the lag gets bigger.)
Fig 9: pre-whitening cross-correlation between smoothed rainfall and cholera transmission

Fig 9: pre-whitening cross-correlation between smoothed rainfall and cholera transmission


  • I also checked the ACF of residual for three rainfall series from pre-whitening CCF (Fig 10). Although FFT and SP both have less than 5% of ACFs outside the horizontal lines (LI is around 5%), we can see there are very obvious sinusoidal pattern in all three plots indicating the AR model used in pre-whitening is not very appropriate for modeling the rainfall series. It thus could be misleading to interpret the pre-whitening CCFs.
Fig 10: AR model diagnostics (Residual ACF)

Fig 10: AR model diagnostics (Residual ACF)


3.2 Partially observed Markov process models (POMP) analysis

SIRW model overview

  • Cross-correlation analysis suggests that rainfall has a time-dependent association with cholera transmission, however, it does not give as details about how and in what stage rainfall affects cholera transmission. In order to better understand the process, I adapted a SIRW model from (Tien & Earn 2010; Eisenberg et al. 2013) to illustrate the transmission process of cholera. As the model diagram shows,cholera can either be transmitted directly from one individual to another or through contaminated water source. Since increasing rainfall is usually related to more frequent contact with contaminated water, and therefore increase the transmission of cholera, I include rainfall as a forcing function to drive the force of infection (\(\beta_w I\)) in waterborne transmission pathway. In order to do so, I make force of infection proportional to rainfall by directly multiplying the them together (\(\beta_w I R(t_i)\).All the three smoothed rainfall series from previous section have been visually evaluated by fitting each rainfall driven SIWR model into incidence data. It seems the one with LI series fits better, so I only show the results using linear interoperation rainfall function here. Below are the deterministic ODE model equations and SIWR model diagram:SIRW model


\[\frac{dS}{dt}=-\beta _ISW-\beta _WSW\] \[\frac{dI}{dt}=\beta _ISW+\beta _WSW-\gamma I\] \[\frac{dR}{dt}=\gamma I\] \[\frac{dW}{dt}=\xi I-\xi W\]


  • The SIRW model is then implemented into a stochastic model using pomp with daily incidence as output. In order to account for measurement error, we assume each data point is drawn based on poisson random variables with mean = \(Ik\), where \(k\) is reporting rate.

  • The initial condition of each state is: \[ S(0)=1000000 \] \[ I(0)=100\] \[W=5\]

  • \(S(0)\) is about 2/3 of Artibonite’s total population (~1,727,524), and I(0) is based on the incidence data with the estimated reporting rate from WHO (5%~10%)(‘WHO | the global burden of cholera’ 2016). W(0) is rather arbitrary, which is decided by fitting the model several times. I do not take R into account here since I assume there is no lose of immunity. \(/gamma\) is the recovery rate which is fixed at 0.25 given the estimates from WHO (~4 days)(2016).

  • I play with other parameter values using deterministic model to have better picture of the model’s behavior. The SIRW model with rainfall forcing function appears to be able to capture the seasonality of choler incidence with proper \(\beta_I\), \(\beta_W\) ratio and low \(\xi\) (to make sure the rainfall effect from waterborne pathway plays a role) as long as the early epidemics do not burn out the susceptible populations (therefore low \(\beta_I\) and \(\beta_W\) too). I then fit the deterministic model to daily incidence data from MSPP using Nelder-Mead method with traj.match, Using the fitted parameter values (\(\beta_I=2.28e-07\)s,\(\beta_W=3.23e-07\), \(\xi=6.30e-07\),$ k= 3.02e-02$), Fig 11, and Fig 12 show the results of trajectory and simulation produced by deterministic and stochastic process respectively.

Fig 11: SIRW trajectory

Fig 11: SIRW trajectory

Fig 12: SIRW simulation

Fig 12: SIRW simulation


  • The fits look good (but may have a little bit lag) except for where the data is missing. Especially, SIRW model without forcing function or introduced cases is not able to simulate this kind of seasonality (not shown here but can be verified with embedded code) or multiple peaks. To better evaluate the model fitting, I also estimate the MLE using pfilter with the same fitted parameter set. The result of likelihood is shown below:



##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  52.833  52.977  53.018  53.068  53.171  53.411


Likelihood estimation

  • I also run both local and global optimization to see if i can find better model and meanwhile to check the likelihood surface and parameter behaviors around MLE. However, The optimization haven’t finished until now (I did not run it on Flux–my bad–but ran it on one of the machines in our lab using 12 cores.I will send the results as soon as it finishes.) Nevertheless, I provide a toy example by running only {sirw_Np=200; sirw_Nmif=5; sirw_Neval=10; sirw_Nglobal=10; sirw_Nlocal=10} to just get a sense what might look like.


Optimization and parameter estimation

  • Local optimization
    • The likelihood has been improved here (~55, it was around 53 before optimizing with iterated filtering algorithm IF2 mifs (Ionides et al. 2015). However, while look at the likelihood plot, it is hard to tell if the optimization is successful or the likelihood surface given the small number of iterations.


##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  52.988  54.905  55.793  55.416  56.293  56.808
Fig 13: Loglike and parameters (local optimization)

Fig 13: Loglike and parameters (local optimization)


  • Global optimization
    • it seems that from the likelihood plot, the \(\beta_I\) has two very close local optimums very (we can see two clusters of dots in those plots), but still, since the iteration and the particle number are too small (also the filter diagnostics is terrible), we still need to wait for the actual job to finish.
Fig 14: Loglike and parameters (global optimization)

Fig 14: Loglike and parameters (global optimization)


Fig 15: Filtering diagnostics (global optimization)

Fig 15: Filtering diagnostics (global optimization)

Fig 15: Filtering diagnostics (global optimization)

Fig 15: Filtering diagnostics (global optimization)


4 Conclusions


5 Packages

  1. knitcitations (Boettiger 2015)
  2. RCurl (Temple Lang & CRAN team 2016)
  3. xts (Ryan & Ulrich 2014)
  4. TSA (Chan & Ripley 2012)
  5. bibtex (Francois 2014)
  6. pomp (King et al. 2015, 2016)
  7. doMC (Analytics & Weston 2015a)
  8. foreach (Analytics & Weston 2015b)
  9. ggplot2 (Wickham 2009)

References

Analytics, R. & Weston, S. (2015a). DoMC: Foreach parallel adaptor for ’parallel’. Retrieved from https://CRAN.R-project.org/package=doMC

Analytics, R. & Weston, S. (2015b). Foreach: Provides foreach looping construct for r. Retrieved from https://CRAN.R-project.org/package=foreach

Boettiger, C. (2015). Knitcitations: Citations for ’knitr’ markdown files. Retrieved from https://CRAN.R-project.org/package=knitcitations

Chan, K.-S. & Ripley, B. (2012). TSA: Time series analysis. Retrieved from https://CRAN.R-project.org/package=TSA

Eisenberg, M.C., Kujbida, G., Tuite, A.R., Fisman, D.N. & Tien, J.H. (2013). Examining rainfall and cholera dynamics in haiti using statistical and dynamic modeling approaches. Epidemics, 5, 197–207. Retrieved from http://dx.doi.org/10.1016/j.epidem.2013.09.004

Francois, R. (2014). Bibtex: Bibtex parser. Retrieved from https://CRAN.R-project.org/package=bibtex

Ionides, E.L., Nguyen, D., ’e, Y.A., Stoev, S. & King, A.A. (2015). Inference for dynamic and latent variable models via iterated, perturbed bayes maps. Proceedings of the National Academy of Sciences, 112, 719–724. Retrieved from http://dx.doi.org/10.1073/pnas.1410597112

King, A.A., Ionides, E.L., Bretó, C.M., Ellner, S.P., Ferrari, M.J., Kendall, B.E., Lavine, M., Nguyen, D., Reuman, D.C., Wearing, H. & Wood, S.N. (2016). pomp: Statistical inference for partially observed Markov processes. Retrieved from http://kingaa.github.io/pomp

King, A.A., Nguyen, D. & Ionides, E.L. (2015). Statistical inference for partially observed Markov processes via the R package pomp. Journal of Statistical Software, in press.

R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Retrieved from https://www.R-project.org/

Ryan, J.A. & Ulrich, J.M. (2014). Xts: EXtensible time series. Retrieved from https://CRAN.R-project.org/package=xts

Temple Lang, D. & CRAN team. (2016). RCurl: General network (hTTP/FTP/.) client interface for r. Retrieved from https://CRAN.R-project.org/package=RCurl

Tien, J.H. & Earn, D.J.D. (2010). Multiple transmission pathways and disease dynamics in a waterborne pathogen model. Bulletin of Mathematical Biology, 72, 1506–1533. Retrieved from http://dx.doi.org/10.1007/s11538-010-9507-6

WHO | the global burden of cholera. (2016).WHO. Retrieved from http://www.who.int/bulletin/volumes/90/3/11-093427/en/

Wickham, H. (2009). Ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from http://ggplot2.org

(2016). Retrieved from http://www.who.int/mediacentre/factsheets/fs107/en/