This year has vitnessed a much more severe influenza outbreak than previous years. Influenza has always been the most common and most deadly disease of all times. It caused considerable economic burdens of medical visits and absence from school and work. Moreover, influenza is also a great threat to human lives, especially for the old and children. It is reported that pneumonia and influenza altogether ranked as the sixth leading cause of death in the United States. [1] Besides seasonal outbreak, there are also 3 influenza pandemics (marked by excess mortality) in the 20th century (Spanish influenza in 1918, Asian influenza in 1957, and Hong Kong influenza in 1968)[2]. The 1918 Spanish influenza alone caused about 50 million death, much greater than the total death of 18 million in World War I. [3] Therefore it is of extreme interest to study the temporal behavior of influenza mortality, so that we could get prepared for the next influenza epidemic / pandemic.
Except for the sporadic pandemic, the outbreak of influenza epidemic is highly periodic. Influenza attack rate is especially high during the cold winter months, which typically lasts around 9 weeks, the so-called “flu season”. [4] The reason of the highly seasonal behavior of influenza epidemics is still not quite clear but it was speculated that the cold weather provides a beneficial environment for influenza virus survival, including appropraite temperature, humudity, ultraviolet radiation, etc. [5]
In this project, we want to describe the temporal behavior of influenza mortality (find the typical peak time, flu season duration etc) and explore the suitable statistical models to describe this time series. Understanding the evolvement of influenza mortality might help with the possible reasons and maybe even better predictions for influenza outbreak.
The dataset for influenza mortality is acquired from National Center for Health Statistics Pneumonia and Influenza (P&I) Mortality Surveillance[6].
This dataset contains a continuous 436 weeks of influenza deaths in the US from the 40th week in 2009 to the 6th week in 2018, covering more than 8 years.
Prior to use any analysis method, let’s take a look at the visualization of the time series of influenza deaths.
From the figure, we could see that influenza deaths are fairly low and smooth in non-epidemic seasons, and that influenza epidemic seasons appear with high evidence of periodicity. But the peak deaths differ year from year. It is especially low in year 2012, and we are indeed experiencing a bad year now in 2018. But good news is that we seem to be getting out of the influenza season soon.
Another figure we want to check before analysis the the kernal density of the death rate.
We can see from the above figure that it has most part below 100 (non-epidemic deaths) and a long tail corresponding to the highly varied epidemic deaths. This density plot is also consistent with the observations in time series figure above.
This analysis require the following packages: “lomb”, “depmixS4”, “corrplot”, “astsa”.
We will utilize the sample autocorrelation and the periodogram to explore the seasonality of the data.
The sample autocorrelation is shown below: From the autocorrelation, we can see that there are a series of peaks around 50, which is almost one year and this fits our expectation.
The smoothed periodogram is plotted below: The dashed verticle line in the plot corresponds to 1/50, which is the maximum in all tested frequencies. It means that this data has a period of 50 weeks. It is not exactly 52 weeks in a year is because the default frequencies here are differed by 1/450 and 1/50 is the nearest frequency to 1/52.
From the theory of time series data analysis, the high frenquency could be regarded as noise, while the low frequency can approximate the trend. The middle frenquency could be examined for seasonality.
Therefore a local polynomial regression fitting with different choice of degree of smoothing could be used to estimate the noise, trend and cycles. This analysis shows an up-going trend with some cycles we want to examine and compre with original signal in the next part.
Lomb-Scargle periodogram is a efficient and widely used algorithm to detect seasonality and examine the significance. This algorithm can tolerate the uneven sampling with least-squares fitting of sinusoids.
The two figures below show the Lomb-Scargle Periodogram, the distribution of spectrual power and p-value of peak power for raw influenza deaths data and cycles after removal of trend and noise, respectively.
The peak frequency for raw data is 0.0183908, and the corresponding period is 54.375.
The peak frequency for cycle data after removal of trend and noise from raw data is 0.0183908, and the corresponding period is 54.375.
The two results agree with each other, and both are very significant (empirical p-value = 0 with 1000 randomization). The purified cycle data has a larger statistic, which could lead to a more significant p-value if sufficient permutation were performed.
After the identification of epidemic seasons, we are interested in the correlation of the seasons.
The features we extracted from the 7 complete influenza epidemics are:
feature name | feature explanation |
---|---|
start | start time (yearly) |
end | end time (yearly) |
max_time | peak weekly deaths time (yearly) |
duration | span |
deaths | total deaths |
max | peak weekly deaths |
mean_epi | mean weekly deaths |
mean_nonepi | mean weekly deaths of the lastest non-epidemic season |
mean_3 | mean weekly deaths of the first three weeks during the epidemic season |
The features marked with yearly in the table is the week of the year. Because epidemic seasons are in winter, I take the remainder of 30 to make the weeks to the end of each year comparable to the beginning weeks of the next year.(e.g. 30 -> 52, 35 -> 5, 52 -> 22, 1 -> 23) Without further explanation, the features in the above table refers to those of each epidemic season.
The features are designed this way: start, end, and max_time are attributes about the epidemic season; deaths, duration, max and mean are features measuring the severity of the epidemic; mean_nonepi and mean_3 are features that could have potential predictive power to suggest some information about the severity features of the epidemic. These features are not indepent in that:
Spearman correlation was used to capture the non-linear relationship between variables. The correlation plot of the above features is shown below:
From the figure, we can see that the four features (duration, deaths, max, mean_epi) to describe the severity of each epidemic are positively related with each other. (the middle \(4 \times 4\) red squares). Maximum time is highly related to epidemic start time, but not end time. Epidemic start time and mean_3 (mean weekly deadths in the first three weeks) are respectively positively and negatively related to all four features related to severity, indicating that earlier onset of epidemic and more deaths during the first three weeks of epidemic suggest a more exacerbated epidemic season.
To be more specific, spearman correlation of start time and epidemic season total death is -0.847 with p-value 0.0161971.
The scatter plot of aformentioned features are shown below:
SARIMA (seasonal autoregressive integrated moving average) model is a very flexible statistical model for explore the different dependency structure in time series data.
A SARIMA \((p,d,q) \times (P,D,Q)_{S}\) for a time series model \(Y_{1:N}\) is formulated as follows:
\({\phi}(B){\Phi}(B^{S}) \big( (1-B)^d(1-B^{S})^D Y_n-\mu \big) = {\psi}(B){\Psi}(B^{12}) \epsilon_n\),
where
Data were log-transformed prior to SARIMA fitting. The function sarima in R package astsa were used as the implementation of SARIMA. Combinations of parameters are tested for \(p,d,q,P,D,Q \in \{0,1,2\}\) and \(S = 52\) and the model with best AIC is the paramter set \(p = 1, d = 1, q = 0, P = 1, D = 1, Q = 1\), corresponding to a SARIMA model \((1,1,0) \times (1,1,1)_{52}\).
The estimated values and statistical tests of SARIMA model coefficients with above parameter set are shown below:
## Estimate SE t.value p.value
## ar1 -0.3894 0.0473 -8.2345 0.0000
## sar1 0.1276 0.0603 2.1149 0.0351
## sma1 -0.9997 0.2119 -4.7179 0.0000
All the coefficients are significant. Insterestingly, the seasonal MA coefficient is almost \(-1\).
The model residuals and the corresponding ACF are exhibited below.
Although the autocorrelation of residuals exceeds the dashed line (presumed 95% CI) at some lags, in general there is no strong autocorrelation in the residuals.
A comparison of fitted value and original value of weekly influenza deaths can be found below.
From the figure, we can see that the fitted value is quite close to the original value. The fitting of non-epidemic season is worse compared to fitting of epidemic seasons. It could be because of the randomness of non-epidemic influenza infections.
Influenza is a great threat to human health and understanding its temporal mortality is crutial. In this project I explored the behavior of time course of weekly influenza deaths over 8 years, revealing the following interesting conclusions:
Possible future works include comparing the shapes of epidemic seasons and test the predictive power of risk factors and statistical models with more data and cross validation.
[1] Centers for Disease Control and Prevention (CDC. “Pneumonia and influenza death rates–United States, 1979-1994.” MMWR. Morbidity and mortality weekly report 44.28 (1995): 535. https://jamanetwork-com.proxy.lib.umich.edu/journals/jama/fullarticle/389423
[2] Wikipedia: Influenza. https://en.wikipedia.org/wiki/Influenza
[3] Wikipedia: World War I casualties. https://en.wikipedia.org/wiki/World_War_I_casualties
[4] Paget, John, et al. “Influenza activity in Europe during eight seasons (1999–2007): an evaluation of the indicators used to measure activity and an assessment of the timing, length and course of peak activity (spread) across Europe.” BMC infectious diseases 7.1 (2007): 141.
[5] Schaffer, F. L., M. E. Soergel, and D. C. Straube. “Survival of airborne influenza virus: effects of propagating host, relative humidity, and composition of spray fluids.” Archives of virology 51.4 (1976): 263-273.
[6] Data source: https://www.cdc.gov/flu/weekly/weeklyarchives2017-2018/data/NCHSData08.csv