Influenza, commonly known as "the flu", is a contagious disease caused by an influenza virus. Although many people catch the flu during so-called "flu seasons" and most recover completely in about one to two weeks, others develop life-threatening complications such as pneumonia [2]. According to the Centers for Disease Control and Prevention (CDC), influenza and respiratory syncytial virus (RSV) are common causes of viral pneumonia in the United States, and by convention the CDC reports the mortality of the two diseases together [3]. In this study, I conduct a time series analysis of pneumonia mortality in the state of Michigan.
The dataset was acquired from https://gis.cdc.gov/grasp/fluview/mortality.html, where region-specific datasets can be downloaded. It consists of 229 weekly records for Michigan, from week 40 of 2015 through week 8 of 2020, giving the number of deaths caused by influenza, the number caused by pneumonia, and the proportion of all deaths attributable to the two diseases. Here we analyze only the pneumonia data.
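The download can be loaded along the following lines; this is a sketch, and the file name `State_Custom_Data.csv` is an assumption (adjust it to match your export), while the column names follow the table below.

```r
# Load the CDC FluView mortality export for Michigan.
# File name is an assumption; column names match the dataset preview below.
flu <- read.csv("State_Custom_Data.csv")
deaths <- flu$NUM.PNEUMONIA.DEATHS  # 229 weekly pneumonia death counts
head(flu)
```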
SEASON | WEEK | PERCENT.P.I | NUM.INFLUENZA.DEATHS | NUM.PNEUMONIA.DEATHS | TOTAL.DEATHS |
---|---|---|---|---|---|
2015-16 | 40 | 6.3 | 0 | 116 | 1,840 |
2015-16 | 41 | 6.2 | 0 | 108 | 1,742 |
2015-16 | 42 | 6.4 | 0 | 118 | 1,840 |
2015-16 | 43 | 6.7 | 0 | 121 | 1,801 |
2015-16 | 44 | 6.9 | 1 | 125 | 1,815 |
2015-16 | 45 | 5.6 | 0 | 100 | 1,773 |
Below is the time series plot of the number of deaths caused by influenza (black) and by pneumonia (red).
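A minimal base-R sketch of this plot, assuming the data frame `flu` holds the 229 weekly records in time order:

```r
# Weekly deaths from influenza (black) and pneumonia (red),
# week 40 of 2015 through week 8 of 2020.
plot(flu$NUM.PNEUMONIA.DEATHS, type = "l", col = "red",
     xlab = "Week (from week 40, 2015)", ylab = "Number of deaths")
lines(flu$NUM.INFLUENZA.DEATHS, col = "black")
legend("topright", legend = c("Influenza", "Pneumonia"),
       col = c("black", "red"), lty = 1)
```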
Although many people believe pneumonia has no season, the plot shows that every year, when there is an influenza outbreak, the number of deaths caused by pneumonia also reaches its peak. It is therefore reasonable to assume a period of about one year, or 52 weeks. To confirm this seasonality, we make a few additional plots.
## Period of a cycle: 48
First, we can make a spectral density plot. Reading off the peak in both the unsmoothed and the smoothed periodogram, one finds the dominant frequency at 0.02083333 cycles per week, suggesting a period of 48 weeks. The ACF plot of the data shows a cycle roughly every 50 lags, which also supports the seasonality.
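A sketch of both periodograms and the extraction of the dominant frequency; the smoothing spans are an assumption:

```r
# Unsmoothed and smoothed periodograms of the weekly pneumonia series.
# `deaths` is assumed to be the vector flu$NUM.PNEUMONIA.DEATHS.
raw      <- spectrum(deaths, main = "Unsmoothed periodogram")
smoothed <- spectrum(deaths, spans = c(3, 5, 3),
                     main = "Smoothed periodogram")
# Dominant frequency (cycles per week) and implied period (weeks):
f <- smoothed$freq[which.max(smoothed$spec)]
c(frequency = f, period = 1 / f)
```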
Decomposing the time series into three components (trend + seasonal + random) reveals, besides the obvious seasonality, a slight trend in the data.
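A classical decomposition can be sketched as follows; the 52-week frequency is an assumption based on the apparent annual cycle:

```r
# Decompose the weekly series into trend + seasonal + random components.
# `deaths` is assumed to be the vector flu$NUM.PNEUMONIA.DEATHS.
deaths_ts <- ts(deaths, start = c(2015, 40), frequency = 52)
plot(decompose(deaths_ts))
```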
The plot of the original data suggests deaths could be modeled by a random process whose expected value \(\mu_n\) is relatively stable with time. The variation around the mean can then be considered as a random process, and thus we can suppose \(x_n-\mu_n\) is stationary, where \(\mu_n\) is the expected value of the number of deaths. We can use local regression to estimate \(\mu_n\).
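A local-regression sketch of this estimate, named `u1_loess` to match the model output later in the report; the span of 0.2 is an assumption, as the exact smoothing parameter is not stated:

```r
# Estimate the slowly varying mean mu_n by local regression (loess);
# the span controls how much of the annual cycle is absorbed into the trend.
week <- seq_along(deaths)
u1_loess <- loess(deaths ~ week, span = 0.2)
detrended <- u1_loess$residuals  # x_n - mu_hat_n, treated as stationary
```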
The autocorrelation plot and the periodogram of the residuals from the local regression then confirm that the residuals exhibit neither trend nor seasonality.
Now we can fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\) given by \[ \phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n} \] where
\[ \begin{aligned} Y_n &= x_n-\mu_n \\ \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \epsilon_{n} & \sim \operatorname{iid} N\left[0, \sigma^{2}\right] \end{aligned} \]
 | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
---|---|---|---|---|---|---|
AR0 | 1775.537 | 1776.543 | 1777.735 | 1779.001 | 1780.932 | 1782.650 |
AR1 | 1776.671 | 1778.207 | 1763.493 | 1765.207 | 1767.068 | 1767.875 |
AR2 | 1777.575 | 1763.146 | 1760.002 | 1760.485 | 1761.616 | 1762.233 |
AR3 | 1779.227 | 1764.969 | 1767.142 | 1761.993 | 1766.366 | 1764.005 |
AR4 | 1781.222 | 1766.209 | 1777.931 | 1758.208 | 1765.414 | 1768.081 |
AR5 | 1783.027 | 1784.950 | 1782.220 | 1768.365 | 1782.293 | 1775.342 |
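The table above can be tabulated with a simple grid of `arima` fits over the loess residuals; this is a sketch of the standard AIC-table loop, guarded against fits that fail to converge:

```r
# Tabulate AIC over a grid of ARMA(p, 0, q) fits to the detrended data.
aic_table <- function(data, P, Q) {
  tab <- matrix(NA, P + 1, Q + 1,
                dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) for (q in 0:Q) {
    fit <- try(arima(data, order = c(p, 0, q)), silent = TRUE)
    if (!inherits(fit, "try-error")) tab[p + 1, q + 1] <- fit$aic
  }
  tab
}
aic_table(u1_loess$residuals, 5, 5)
```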
From the AIC table, ARMA(4,3) attains the smallest AIC value (1758.2), but ARMA(2,2) is very close (1760.0) and considerably more parsimonious, so we adopt ARMA(2,2). Here is the output of the ARMA(2,2) fit:
##
## Call:
## arima(x = u1_loess$residuals, order = c(2, 0, 2))
##
## Coefficients:
## ar1 ar2 ma1 ma2 intercept
## 1.8166 -0.8485 -1.9189 0.9206 0.0009
## s.e. 0.1299 0.0891 0.2276 0.2245 0.0476
##
## sigma^2 estimated as 119.4: log likelihood = -874, aic = 1760
Now we can do some model diagnostics.
## AR Roots: 1.070555+0.180336i 1.070555-0.180336i
## MA Roots: 1.040568+0i 1.043918-0i
Here we check for parameter redundancy: since the AR roots are not very close to the MA roots, there is no evidence of redundancy. All roots lie outside the unit circle, so the fitted model is causal and invertible.
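The roots above can be computed from the fitted coefficients with `polyroot`; this sketch assumes the ARMA(2,2) fit stored as `arma_22`:

```r
# Roots of the AR polynomial phi(x) = 1 - phi1*x - phi2*x^2
# and the MA polynomial psi(x) = 1 + psi1*x + psi2*x^2.
arma_22 <- arima(u1_loess$residuals, order = c(2, 0, 2))
ar_roots <- polyroot(c(1, -coef(arma_22)[c("ar1", "ar2")]))
ma_roots <- polyroot(c(1,  coef(arma_22)[c("ma1", "ma2")]))
abs(ar_roots)  # moduli should all exceed 1 (causality)
abs(ma_roots)  # moduli should all exceed 1 (invertibility)
```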
##
## Shapiro-Wilk normality test
##
## data: arma_22$residuals
## W = 0.99574, p-value = 0.7808
Based on the QQ-plot and the Shapiro-Wilk test, we conclude that the distribution of the residuals is close to normal, supporting the normality assumption on \(\epsilon_n\). Moreover, the ACF plot of the residuals shows that they appear uncorrelated, although lags 10 and 83 slightly exceed the confidence band. Hence the Gaussian iid assumption is reasonable.
One can also plot the fitted values against the real data: the mean + ARMA(2,2) model fits the data well, and the residuals vs. fitted plot corroborates this conclusion.
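A sketch of this overlay, assuming `deaths`, `u1_loess`, and the ARMA(2,2) fit `arma_22` as defined earlier in the analysis:

```r
# Overlay fitted values (loess trend + ARMA(2,2) one-step fits) on the data.
arma_fitted <- u1_loess$residuals - arma_22$residuals  # ARMA fitted noise
plot(deaths, type = "l", ylab = "Weekly pneumonia deaths")
lines(fitted(u1_loess) + arma_fitted, col = "blue")
```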
One can also fit a SARMA model to the detrended data to capture the seasonality of the deaths, since the trend alone can be modeled by a local regression with a larger span (0.5). The \(SARMA(p,q)\times(P,Q)_{52}\) model is \(\phi(B) \Phi\left(B^{52}\right)\left(Y_{n}-\mu\right)=\psi(B) \Psi\left(B^{52}\right) \epsilon_{n}\), where \(\epsilon_{n}\) is a white noise process and \[ \begin{aligned} \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \Phi(x) &=1-\Phi_{1} x-\cdots-\Phi_{P} x^{P} \\ \Psi(x) &=1+\Psi_{1} x+\cdots+\Psi_{Q} x^{Q} \end{aligned} \]
This time, the local regression models only the general trend rather than the seasonality. Accordingly, the ACF plot of the residuals shows strong seasonality, which we capture with a SARMA model.
 | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | MA6 | MA7 | MA8 |
---|---|---|---|---|---|---|---|---|---|
AR0 | 1988.799 | 1923.397 | 1899.419 | 1890.523 | 1879.087 | 1871.869 | 1869.789 | 1855.871 | 1848.777 |
AR1 | 1875.966 | 1847.426 | 1849.366 | 1847.953 | 1844.640 | 1844.303 | 1846.245 | 1845.694 | 1847.580 |
AR2 | 1859.329 | 1849.384 | 1851.299 | 1809.872 | 1844.229 | 1846.091 | 1848.229 | 1847.617 | 1847.455 |
AR3 | 1851.742 | 1849.629 | 1842.027 | 1811.626 | 1842.962 | 1847.785 | 1849.740 | 1846.397 | 1849.024 |
AR4 | 1848.304 | 1850.033 | 1841.824 | 1842.973 | 1850.798 | 1815.122 | 1817.226 | 1847.695 | 1848.713 |
AR5 | 1849.746 | 1850.723 | 1851.847 | 1852.283 | 1844.475 | 1807.545 | 1816.573 | 1840.524 | 1847.804 |
AR6 | 1850.372 | 1852.228 | 1814.652 | 1815.658 | 1814.474 | 1810.012 | 1813.243 | 1813.604 | 1849.742 |
AR7 | 1852.372 | 1854.219 | 1816.130 | 1828.183 | 1846.236 | 1840.831 | 1846.520 | 1848.458 | 1850.385 |
AR8 | 1851.597 | 1818.416 | 1818.995 | 1817.416 | 1817.696 | 1842.817 | 1848.558 | 1848.398 | 1850.399 |
From the table, ARMA(5,5) attains the smallest AIC (1807.5), but the table also shows jumps of far more than 2 units when a single parameter is added, which cannot happen under perfect likelihood maximization and indicates numerical optimization problems at large orders. ARMA(3,3) offers a comparable AIC with far fewer parameters, so we compare different \(SARMA(3,0,3)\times(P,0,Q)_{52}\) models in the next step to capture the seasonality. Below are the outputs of \(SARMA(3,0,3)\times(1,0,0)_{52}\), \(SARMA(3,0,3)\times(0,0,1)_{52}\), and \(SARMA(3,0,3)\times(1,0,1)_{52}\).
##
## Call:
## arima(x = u1_loess_0.5$residuals, order = c(3, 0, 3), seasonal = list(order = c(1,
## 0, 0), period = 52))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 sar1 intercept
## 1.9875 -1.0093 0.0076 -1.8375 0.7257 0.118 0.0610 -0.5504
## s.e. NaN NaN NaN NaN NaN NaN 0.0067 NaN
##
## sigma^2 estimated as 144.3: log likelihood = -897.02, aic = 1812.03
##
## Call:
## arima(x = u1_loess_0.5$residuals, order = c(3, 0, 3), seasonal = list(order = c(0,
## 0, 1), period = 52))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 sma1 intercept
## 1.3752 0.1857 -0.5846 -1.2275 -0.3364 0.5640 0.0564 -0.2937
## s.e. 0.5461 1.0741 0.5364 0.5225 0.9592 0.4393 0.0700 0.1176
##
## sigma^2 estimated as 144.1: log likelihood = -897.49, aic = 1812.98
##
## Call:
## arima(x = u1_loess_0.5$residuals, order = c(3, 0, 3), seasonal = list(order = c(1,
## 0, 1), period = 52))
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 ma3 sar1 sma1
## 1.5335 -0.1191 -0.4357 -1.3886 -0.0701 0.4665 0.6638 -0.5687
## s.e. 0.0066 NaN NaN 0.0458 NaN NaN 0.0276 0.0422
## intercept
## -0.4632
## s.e. 0.3612
##
## sigma^2 estimated as 143.8: log likelihood = -896.59, aic = 1813.18
Since \(SARMA(3,0,3)\times(1,0,0)_{52}\) has the lowest AIC value of the three, we use it to model the local regression residuals. Note, however, that several of its standard errors are reported as NaN, so the fit should be interpreted with some caution.
##
## Shapiro-Wilk normality test
##
## data: sarma_303_100$residuals
## W = 0.9931, p-value = 0.3669
As above, both the autocorrelation and the normality diagnostics suggest that our new mean + SARMA model also fits the data well, despite a few mild violations in the ACF plot. The two plots below confirm this finding.
In this report, we analyzed the Michigan weekly pneumonia deaths time series. After exploratory data analysis, model fitting, and diagnostic analysis, we reach two main conclusions:

1. Though pneumonia can occur at any time of year, it shows a strong seasonality, with a cycle of roughly one year.
2. Both the mean + ARMA and the mean + SARMA (signal-plus-noise) models fit the time series well, and the residuals appear to be Gaussian.

In the future, we may dig deeper into the relationship between influenza and pneumonia to see how they are correlated, and whether deaths caused by pneumonia can be predicted from influenza data.
[1] Data source: https://gis.cdc.gov/grasp/fluview/mortality.html
[2] Wikipedia: Pneumonia. https://en.wikipedia.org/wiki/Pneumonia
[3] Centers for Disease Control and Prevention (CDC): Causes of Pneumonia. https://www.cdc.gov/pneumonia/causes.html
[4] All theoretical background can be found in the course notes at https://ionides.github.io/531w20/