\[ \ \ \ \]
Early prediction and prevention is important for influenza (flu for short), a common epidemic disease. When people find some symptoms of influenze-like illness (ILI for short) on themselves or people around them, they will probabily go the the internet for help before they go to a clinic. Therefore, we can use the proportion of people searching on the internet about flu to predict how many people are getting flu, and can use it to predict the number of people that go to a clinic and report a flu.
A widely used source of internet searching data is the Google Flu Trends [1]. The Google Flue Trends is no longer publishing current data now, but they are putting some data of the past years on the website. The Google Flu Trends data is the number of flu-related search queries in 50 million search queries every week [2].
The number of reports from clinics is tracked by the Centers for Disease Control and Prevention (CDC for short) [3]. It publishes the data collected from clinics across the United States every week. One of the data published is the Influenza-like illnesses, which includes all types of flu, as well as influenza on animals (like H1N1, poultry flu).
A lot of studies have been done to analyze the two time series. Some studies focus on either the Google Trend data or the CDC visitor data and fitted Autoregressive Moving Average (ARMA) Models to them. Some studies try to predict number of visitors to clinics using the Google Flu Trends data with General Linear Models. In this analysis, we are going to try all the three approaches, and compare our results with the already published ones.
The data set used in out analysis comes from Data Dryad [4]. The data was from a published study by Preis T and Moat HS (2014) [5][6]. The data set contains three variables, which we name week, visitors and google. The variable week is the starting Sunday of the week when the data point is collected, from January 03 2010 to September 15 2013. The variable visitors is the percentage of influenza-like illness reports out of all the reports in the week in the United States. The variable google is the number of queries about flu in this week on the Google search engine. The histograms of the variable visitor and google are shown below.
We can see from the plots that the two time series are similar, except for some peaks. The Pearson correlation coefficient between the two series is 0.883, which is even higher than in Ortiz et al., 2011 [7]. It is reasonable that one can predict the other.
The autocorrelation function of the visitor and google series is plotted below. We can see from the plot that both the series are highly autocorrelated. So it is reasonable that we fit the ARMA models to the series.
The Autoregressive Moving Average model is represented as: \[ Y_n = \phi_1 Y_{n-1}+\phi_2Y_{n-2}+\dots+\phi_pY_{n-p} + \epsilon_n +\psi_1 \epsilon_{n-1} +\dots+\psi_q\epsilon_{n-q}. \]
We fit ARMA models with different \(p\) and \(q\) to the visitor series, and select one with the smallest AIC. From the AIC table shown below, we can see that the ARMA(3,2) and ARMA(4,5) models have the smallest AIC. However, for a simpler model, we can consider the ARMA(3,2) model the best fit for the CDC ILI visitor data, which repeats the result by Preis T and Moat HS (2014) [5], and is similar to Dugas et al., 2013 [8], which proposed a GARMA(3,0) model to fit the CDC ILI data.
## MA0 MA1 MA2 MA3 MA4 MA5
## AR0 540.965518 304.2246264 221.243441 92.17457 74.33809 43.98809
## AR1 23.416645 0.9254922 2.642721 -11.54413 -10.16287 -21.12926
## AR2 -2.604057 -3.2514975 -5.716416 -13.10560 -11.16092 -19.20715
## AR3 -2.018566 -9.8044431 -23.582312 -13.92051 -12.00221 -17.62094
## AR4 -6.864013 -17.8221615 -23.215239 -22.10131 -11.63324 -25.66386
The coefficient of the ARMA(3,2) model is shown below:
## ar1 ar2 ar3 ma1 ma2 intercept
## 1.3240791 0.2559128 -0.5977508 -0.0871465 -0.9128491 1.6550799
We can do the same analysis to fit the Google Flu Trends data. From the AIC table we can see that ARMA(3,0) and ARMA(2,1) have the smallest AIC.
## Warning in arima(data, order = c(p, 0, q)): possible convergence problem:
## optim gave code = 1
## MA0 MA1 MA2 MA3 MA4 MA5
## AR0 3416.758 3173.789 2974.430 2875.558 2791.676 2743.560
## AR1 2827.618 2714.561 2679.520 2674.127 2659.566 2661.564
## AR2 2662.618 2660.147 2662.135 2663.966 2661.561 2663.989
## AR3 2660.358 2662.132 2663.494 2664.511 2661.849 2665.253
## AR4 2662.259 2663.758 2660.901 2662.592 2663.589 2665.110
The coefficient of the ARMA(3,0) model is shown below.
## ar1 ar2 ar3 intercept
## 1.8265380 -1.0099211 0.1467419 1977.4845478
However, when we look at the descriptive plot of the two time series, the value can only be positive. The CDC ILI visitor series can only have values from 0 to 100, and the Google Flu Trends series can only have values from 0 to 50000000. In addition, there are some peaks in the plot. Therefore, we can perform a logit transformation on these two time series.
For the CDC ILI visitor data, the logit transformation is \(log(\frac{p}{100-p})\). Fitting an ARMA model, we can find that still an ARMA(3,2) model can be preferred. However, the AIC of the model on logit scale is much smaller than on the original scale, which is proposed by Preis T and Moat HS (2014) [5]. The coefficients of the model are shown below
## MA0 MA1 MA2 MA3 MA4 MA5
## AR0 277.7698 46.48985 -71.34551 -202.2739 -225.6581 -271.3577
## AR1 -326.0820 -339.73114 -345.97231 -349.0496 -347.0658 -373.8613
## AR2 -346.5220 -333.33290 -376.25343 -350.7875 -372.9929 -371.8672
## AR3 -354.3606 -376.27713 -377.74230 -359.3542 -358.6299 -370.1279
## AR4 -359.8451 -360.18260 -375.81823 -387.1836 -331.7121 -387.5884
## ar1 ar2 ar3 ma1 ma2 intercept
## 1.15194062 0.61171091 -0.78750349 -0.04112428 -0.83863395 -4.20642316
For the Google Flu Trends data, the logit transformation is \(log(\frac{q}{50000000-q})\). Fitting an ARMA model, we can find that an ARMA(3,1) model can be preferred. Still, the AIC of the model on logit scale is much smaller than on the original scale,
## MA0 MA1 MA2 MA3 MA4 MA5
## AR0 351.7622 108.0024 -83.38861 -201.4664 -273.0291 -333.1074
## AR1 -362.1269 -432.7016 -458.62643 -467.1181 -484.6928 -487.4992
## AR2 -487.8798 -498.9681 -497.19881 -495.8553 -495.6063 -500.0141
## AR3 -496.0681 -499.3451 -499.34393 -497.4240 -491.5949 -498.0758
## AR4 -496.9691 -499.2473 -497.39092 -495.4020 -493.4244 -493.9565
## ar1 ar2 ar3 ma1 intercept
## 2.3672392 -1.7789320 0.4042649 -0.8689690 -10.3139055
Since there is some evidence showing that the Google Flu Trend is correlated with the CDC ILI visitor data, and we can it is reasonable that we fit an ARMA model with trend. The model is represented as:
## Warning in arima(data, order = c(p, 0, q), xreg = data2): possible
## convergence problem: optim gave code = 1
## MA0 MA1 MA2 MA3 MA4 MA5
## <b> AR0</b> 249.55330 60.93842 16.28810 -54.12505 -53.23770 -60.86297
## <b> AR1</b> -53.43269 -61.10140 -59.76052 -60.72735 -72.45898 -77.28245
## <b> AR2</b> -60.22442 -68.27132 -66.62270 -67.10153 -74.55785 -75.28584
## <b> AR3</b> -59.07898 -66.74478 -59.34749 -78.21495 -78.13270 -77.19327
## <b> AR4</b> -57.55068 -65.71965 -59.59983 -77.20832 -76.78991 -75.72089
By looking at the AIC table, we can suggested the ARMA(3,3) model with trend is the best. The coefficients of the ARMA(3,3) model with trend are:
## ar1 ar2 ar3 ma1 ma2
## -0.3750662286 0.2513401019 0.5646704928 1.6095731056 1.3235750453
## ma3 intercept dat$google
## 0.5233259515 0.7324757636 0.0004590002
Ginsberg et al. (2009) [9] brought up a logit regression model to describe the relationship between CDC ILI visitor data and the Google Flu Trends search query data. The model is represented as:
\[ log(\frac{p}{1-p})=\beta_0+\beta_1log(\frac{q}{1-q})+\epsilon,\ \epsilon\sim N(0,\sigma^2) \] In the model, \(p\) denotes the proportion of influenza-related reports in all clinical reports, which is visitors/100 in our data. \(q\) denotes the proportion of influenza-related internet search queries in all queries, which is google/50000000 in out data.
In this analysis we want to look at this model in a time series way. We want to fit an ARMA model with trend to mimic this model. Therefore, we calculat \(P=log(\frac{p}{1-p})\) and \(Q=log(\frac{q}{1-q})\) and fit an ARMA model with trend.
## MA0 MA1 MA2 MA3 MA4 MA5
## <b> AR0</b> -119.0395 -267.8854 -326.2992 -365.3667 -366.8642 -391.6460
## <b> AR1</b> -414.4684 -413.8740 -411.9045 -410.0045 -408.0817 -419.4044
## <b> AR2</b> -413.8171 -411.8957 -411.6011 -409.9493 -409.1113 -417.9724
## <b> AR3</b> -411.9216 -411.9807 -421.9834 -411.2964 -418.4268 -417.2049
## <b> AR4</b> -409.9660 -410.5469 -411.3655 -410.2523 -418.1743 -420.8305
By looking at the AIC table, we can suggest an ARMA(3,2) model with trend. The error of the linear regression model is autocorrelated, unlike in the model Ginsberg proposed. Fitting the model we proposed we can get the coefficients:
##
## Call:
## arima(x = datp$vis, order = c(3, 0, 2), xreg = datp$pop)
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 intercept datp$pop
## -0.3881 0.1894 0.8868 1.2606 1.0000 2.8226 0.6812
## s.e. 0.0354 0.0426 0.0345 0.0250 0.0314 0.5797 0.0558
##
## sigma^2 estimated as 0.005926: log likelihood = 218.99, aic = -421.98
This model has a much smaller AIC than the ARMA(3,3) model with trend in original scale. Therefore, an ARMA(3,2) model with trend in logit scale is the better model to fit the relationship between CDC ILI data and the Google Flu Trend data.
Partially observed Markov process (POMP for short) is a process with a latent or hidden markov process, and an observation process [10]. The correlation between the Google Flu Trend data and CDC ILI visitor data shows that the two have some relationship. The regression model tells us that you can even well predict the visitor data by Google Flu Trend data. However, obviously the relationship between the two variables are not causal. People going to clinic for flu cannot directly cause people to search about flu on the internet, and people searching about flu on the internet cannot directly cause people to go to clinic with flu. A better interpretation for the covariation of the two variable is there is a third variable, such as the number of people getting flu, causing both CDC ILI visitor number and Google Flu Trend number to change.
Therefore, we can propose a POMP model, with the Google Flu Trend data as observation process, and the number of people getting flu as the hidden markov process. The hidden process of number of people getting flu is not available from the data. However, we can probably use the CDC ILI visitor number to reflect it. Since it is not easy to fit an ARMA(3,2) model in POMP analysis, we would suggest that the number of people getting flu is appropriate for a Ricker model. The Ricker model can be represented as:
\[ P_{n+1}=rP_ne^{(-P_n+\epsilon_n)},\ \epsilon_n\sim N(0,\sigma^2) \]
In our POMP model, we will use the ricker model as the skeleton for the hidden markov process, and the logit regression function as our measurement model:
\[ Q=\frac{P-\beta_0-\phi}{\beta1}, \phi\sim N(0,\tau^2) \]
In the measurement model, \(Q\) denotes the logit of the Google Flue Trend \(log(\frac{q}{1-q})\), and \(P\) denotes the number of people getting flu from the hidden markov process.
In the POMP model, there are 5 parameters \(\Theta=(\beta_0,\beta_1,\sigma,\tau,r)\). We will run a program with the POMP package in R to simulate to find the parameters giving us the maximum likelihood.
## Loading required package: pomp
## Warning: in 'mif2.pfilter': 5 filtering failures occurred.
## b0 b1 sigma tau r
## 17.561956 1.563190 1.088736 1.518067 9.274160
The coefficient of the parameters are shown above. Looking at the diagnostic plots below, the parameter estimates have not come to convergence yet. Because in the measurement process, the normal distribution density is not reliable, we cannot run more iterations.
The two time series, CDC ILI visitor data and the Google Flu Trends data are highly correlated. The CDC ILI visitor data can be appropriately fitted by an ARMA(3,2) model, while the Google Flu Trends data can be appropriately fiited by an ARMA(3,0) or an ARMA(2,1) model. We can use a logit regression to fit the relationship between the two series, using the Google Flu Trends data to predict the CDC ILI visitor data. However, when we take the variable time into account, the residuals are autocorrelated. Instead, the CDC ILI visitor data can be represented as an ARMA(3,2) model with the Google Flu Trends data as trend.
For the POMP model, we propose a model with 5 parameters: 2 for the Ricker model as skeleton for the hidden markov process, and 3 for the logit regression for the measurement model. The model is not very stable, and can be adjusted to run more iterations. After 200 iterations should the parameter estimates become stable, and we can estimate the likelihood better.
[1]https://www.google.org/flutrends/about/
[2]https://en.wikipedia.org/wiki/Google_Flu_Trends
[3]https://www.cdc.gov/flu/weekly/fluactivitysurv.htm
[4]https://datadryad.org/resource/doi:10.5061/dryad.r06h2
[5]Preis T, Moat HS (2014) Adaptive nowcasting of influenza outbreaks using Google searches. Royal Society Open Science 1: 140095. https://doi.org/10.1098/rsos.140095
[6]Preis T, Moat HS (2014) Data from: Adaptive nowcasting of influenza outbreaks using Google searches. Dryad Digital Repository. https://doi.org/10.5061/dryad.r06h2
[7]Ortiz, J. R., Zhou, H., Shay, D. K., Neuzil, K. M., Fowlkes, A. L., & Goss, C. H. (2011). Monitoring influenza activity in the United States: a comparison of traditional surveillance systems with Google Flu Trends. PloS one, 6(4), e18687.
[8]Dugas, A. F., Jalalpour, M., Gel, Y., Levin, S., Torcaso, F., Igusa, T., & Rothman, R. E. (2013). Influenza forecasting with Google flu trends. PloS one, 8(2), e56176.
[9]Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., & Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232), 1012.
[10]Class notes, https://ionides.github.io/531w18/09/notes09.html