Introduction

Influenza, commonly known as “the flu”, is a contagious disease caused by an influenza virus. Although the flu may seem routine, since many people catch it during so-called “flu seasons” and most recover completely within one to two weeks, others develop life-threatening complications such as pneumonia [2]. According to the Centers for Disease Control and Prevention (CDC), the common causes of viral pneumonia in the United States are influenza and respiratory syncytial virus (RSV), and by convention the CDC reports the mortality of the two diseases together [3]. In this study, I conduct a time series analysis of the mortality caused by pneumonia in the state of Michigan.

The dataset was acquired from https://gis.cdc.gov/grasp/fluview/mortality.html, where region-specific datasets can be downloaded. The Michigan dataset consists of 229 weekly records, from week 40 of 2015 through week 8 of 2020, of the numbers of deaths caused by influenza and by pneumonia, together with the proportion of all deaths attributable to the two diseases. Here we analyze only the pneumonia data.

SEASON   WEEK  PERCENT.P.I  NUM.INFLUENZA.DEATHS  NUM.PNEUMONIA.DEATHS  TOTAL.DEATHS
2015-16    40          6.3                     0                   116         1,840
2015-16    41          6.2                     0                   108         1,742
2015-16    42          6.4                     0                   118         1,840
2015-16    43          6.7                     0                   121         1,801
2015-16    44          6.9                     1                   125         1,815
2015-16    45          5.6                     0                   100         1,773

Below is the time series plot of the number of deaths caused by flu (black) and by pneumonia (red).

Exploratory Data Analysis

Though many people believe that pneumonia has no season, the plot shows that the number of deaths caused by pneumonia peaks every year during the influenza outbreak. It is therefore reasonable to assume a period of about one year, or 52 weeks. To confirm this seasonality, we can make a few more plots.

## Period of a cycle:  48

First, we can make a spectral density plot. Moving the crossbar to each peak in both the unsmoothed and smoothed periodograms, one finds a dominant frequency of 0.02083333 cycles per week, suggesting a period of 48 weeks. The ACF plot of the data shows a cycle roughly every 50 lags, which also confirms the seasonality.
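The frequency-domain reasoning above can be sketched as follows. This is Python rather than the report's R, and it uses a synthetic series with a known 48-week cycle as a stand-in for the pneumonia data; the point is only to show how the dominant frequency is read off a raw periodogram.

```python
import numpy as np

# Synthetic weekly series with a 48-week cycle plus noise (illustrative).
n = 240  # five 48-week cycles
t = np.arange(n)
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * t / 48) + 0.3 * rng.standard_normal(n)

# Raw periodogram: squared modulus of the FFT of the demeaned series.
freqs = np.fft.rfftfreq(n, d=1.0)                # cycles per week
power = np.abs(np.fft.rfft(x - x.mean())) ** 2

# Dominant frequency (skipping the zero frequency) and implied period.
dom = freqs[1:][np.argmax(power[1:])]
print(dom, 1 / dom)  # ~0.0208 cycles/week, i.e. a 48-week period
```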

Decomposing the time series into three components, trend + seasonal + random, we see, besides the obvious seasonality, a slight trend in the data.
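A classical additive decomposition of this kind can be sketched in a few lines. This is a Python stand-in for R's `decompose`, not the report's code; the 48-week period and the synthetic series are illustrative.

```python
import numpy as np

period = 48
n = 240
t = np.arange(n)
rng = np.random.default_rng(2)
# Synthetic series: slight trend + seasonal cycle + noise (illustrative).
x = 0.01 * t + np.sin(2 * np.pi * t / period) + 0.2 * rng.standard_normal(n)

# Trend: moving average over one full period (edge effects ignored here).
kernel = np.ones(period) / period
trend = np.convolve(x, kernel, mode="same")

# Seasonal: average the detrended values at each position within the cycle,
# then center so the seasonal component sums to zero over a period.
detrended = x - trend
cycle = np.array([detrended[i::period].mean() for i in range(period)])
seasonal = np.tile(cycle - cycle.mean(), n // period)

# Random: whatever trend and seasonal do not explain.
random = x - trend - seasonal
```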

Fitting a Model

Mean + ARMA Model (signal plus noise)

The plot of the original data suggests that the deaths can be modeled as a random process whose expected value \(\mu_n\) varies slowly with time. The variation around the mean can then be treated as a random process, so we suppose \(x_n-\mu_n\) is stationary, where \(\mu_n\) is the expected number of deaths. We use local regression (loess) to estimate \(\mu_n\).
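Local regression fits a weighted linear regression in a moving window around each point. A minimal Python stand-in for R's `loess` (illustrative, not the report's code) looks like this:

```python
import numpy as np

def local_linear(x, y, span=0.2):
    """Minimal local linear regression (a rough stand-in for R's loess):
    for each point, fit a tricube-weighted line to its k nearest neighbours."""
    n = len(x)
    k = max(3, int(span * n))
    fit = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]                      # k nearest neighbours
        w = (1 - (d[idx] / d[idx].max()) ** 3) ** 3  # tricube weights
        X = np.column_stack([np.ones(k), x[idx]])
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y[idx]))
        fit[i] = beta[0] + beta[1] * x[i]
    return fit

# Sanity check: a local linear fit reproduces exactly linear data.
t = np.arange(100, dtype=float)
mu_hat = local_linear(t, 2.0 * t + 1.0)
```

The `span` parameter plays the same role as in `loess`: a larger span averages over more neighbours and hence gives a smoother estimate of \(\mu_n\).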

The autocorrelation plot and the periodogram of the local regression residuals then help confirm that the residuals show neither trend nor seasonality.

Model Assumption

Now we can fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=\left(\phi_{1: p}, \psi_{1: q}, \mu, \sigma^{2}\right)\) given by \[ \phi(B)\left(Y_{n}-\mu\right)=\psi(B) \epsilon_{n} \] where

\[ \begin{aligned} Y_n &= x_n-\mu_n \\ \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \epsilon_{n} & \sim \operatorname{iid} N\left[0, \sigma^{2}\right] \end{aligned} \]
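The model above can be simulated directly from its defining recursion. The following Python sketch (illustrative, not part of the original analysis) generates one Gaussian ARMA(p, q) path, initializing the burn-in values at \(\mu\):

```python
import numpy as np

def simulate_arma(phi, psi, mu, sigma, n, rng):
    """Simulate (Y_t - mu) = sum_j phi_j (Y_{t-1-j} - mu)
                             + eps_t + sum_j psi_j eps_{t-1-j},
    with eps_t ~ iid N(0, sigma^2). Burn-in values are fixed at mu."""
    p, q = len(phi), len(psi)
    m = max(p, q, 1)
    eps = rng.normal(0.0, sigma, n + m)
    y = np.full(n + m, float(mu))
    for t in range(m, n + m):
        ar = sum(phi[j] * (y[t - 1 - j] - mu) for j in range(p))
        ma = sum(psi[j] * eps[t - 1 - j] for j in range(q))
        y[t] = mu + ar + eps[t] + ma
    return y[m:]

# Example with arbitrary (illustrative) coefficients, 229 weeks long.
rng = np.random.default_rng(0)
path = simulate_arma([0.5], [0.3], 0.0, 10.0, 229, rng)
```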

Model Fitting

          MA0       MA1       MA2       MA3       MA4       MA5
AR0  1775.537  1776.543  1777.735  1779.001  1780.932  1782.650
AR1  1776.671  1778.207  1763.493  1765.207  1767.068  1767.875
AR2  1777.575  1763.146  1760.002  1760.485  1761.616  1762.233
AR3  1779.227  1764.969  1767.142  1761.993  1766.366  1764.005
AR4  1781.222  1766.209  1777.931  1758.208  1765.414  1768.081
AR5  1783.027  1784.950  1782.220  1768.365  1782.293  1775.342

From the AIC table, ARMA(2,2) yields the lowest AIC value among the smaller models (ARMA(4,3) has a slightly lower AIC, but such larger models are more prone to parameter redundancy and numerical instability). Here is the output of ARMA(2,2):

## 
## Call:
## arima(x = u1_loess$residuals, order = c(2, 0, 2))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  intercept
##       1.8166  -0.8485  -1.9189  0.9206     0.0009
## s.e.  0.1299   0.0891   0.2276  0.2245     0.0476
## 
## sigma^2 estimated as 119.4:  log likelihood = -874,  aic = 1760
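As a quick sanity check on the printed numbers, R's `arima` computes AIC as \(-2\log L + 2k\), where \(k\) counts the estimated coefficients plus the innovation variance. The arithmetic (a sketch using the values reported above) works out:

```python
# For the ARMA(2,2)-with-intercept fit above:
# k = 2 (AR) + 2 (MA) + 1 (intercept) + 1 (sigma^2) = 6.
log_lik = -874.0
k = 6
aic = -2 * log_lik + 2 * k
print(aic)  # 1760.0, matching the reported aic
```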

Model Diagnostics

Now we can do some model diagnostics.

Parameter Redundancy Checking

## AR Roots:  1.070555+0.180336i 1.070555-0.180336i
## MA Roots:  1.040568+0i 1.043918-0i

Here we check that there is no parameter redundancy, since the AR roots are not close to the MA roots; all roots also lie outside the unit circle.
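The roots above can be reproduced from the fitted coefficients with `numpy.roots` (a sketch; the coefficients are the rounded values from the ARMA(2,2) output above, so the MA roots come out as a near-double root rather than exactly the printed pair):

```python
import numpy as np

# phi(x) = 1 - 1.8166 x + 0.8485 x^2, psi(x) = 1 - 1.9189 x + 0.9206 x^2;
# numpy.roots expects coefficients from the highest degree down.
ar_roots = np.roots([0.8485, -1.8166, 1.0])
ma_roots = np.roots([0.9206, -1.9189, 1.0])
print(ar_roots)  # ~ 1.0705 +/- 0.1806i
print(ma_roots)  # ~ 1.042, nearly a double real root
```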

Residual ACF Plot and Normality Checking

## 
##  Shapiro-Wilk normality test
## 
## data:  arma_22$residuals
## W = 0.99574, p-value = 0.7808

Based on the QQ-plot as well as the result of the Shapiro–Wilk test, we conclude that the distribution of the residuals is close to normal, which supports the normality assumption on \(\epsilon_n\). Besides, the ACF plot of the residuals suggests that they are uncorrelated, though lags 10 and 83 slightly exceed the confidence band. Hence, the Gaussian iid assumption is met.
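The normality check can be sketched as follows, with Python's `scipy.stats.shapiro` in place of R's `shapiro.test`; seeded Gaussian noise with the fitted innovation variance stands in for `arma_22$residuals`.

```python
import numpy as np
from scipy.stats import shapiro

# Stand-in residuals: 229 draws with the fitted innovation variance 119.4.
rng = np.random.default_rng(531)
residuals = rng.normal(0.0, np.sqrt(119.4), size=229)

stat, p = shapiro(residuals)
# A large p-value means normality of the residuals is not rejected.
print(stat, p)
```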

Fitted Values vs. Real Data and Residuals against Fitted Values Plots

One can also plot the fitted values against the real data; the mean + ARMA(2,2) model fits the data well, and the residuals vs. fitted plot corroborates this conclusion.

Mean + SARMA Model (signal plus noise)

Model Assumption

One can also seek to fit a SARMA model to the detrended data to capture the seasonality of the deaths, with the trend modeled by local regression with a larger span (0.5). The \(SARMA(p,q)\times(P,Q)_{52}\) model is \(\phi(B) \Phi\left(B^{52}\right)\left(Y_{n}-\mu\right)=\psi(B) \Psi\left(B^{52}\right) \epsilon_{n}\), where \(\epsilon_{n}\) is a white noise process and \[ \begin{aligned} \mu &=\mathbb{E}\left[Y_{n}\right] \\ \phi(x) &=1-\phi_{1} x-\cdots-\phi_{p} x^{p} \\ \psi(x) &=1+\psi_{1} x+\cdots+\psi_{q} x^{q} \\ \Phi(x) &=1-\Phi_{1} x-\cdots-\Phi_{P} x^{P} \\ \Psi(x) &=1+\Psi_{1} x+\cdots+\Psi_{Q} x^{Q} \end{aligned} \]

Model Fitting

This time, we used local regression to model only the general trend rather than the seasonality. Hence, the ACF plot of the residuals shows a strong seasonality, which we capture with a SARMA model.

          MA0       MA1       MA2       MA3       MA4       MA5       MA6       MA7       MA8
AR0  1988.799  1923.397  1899.419  1890.523  1879.087  1871.869  1869.789  1855.871  1848.777
AR1  1875.966  1847.426  1849.366  1847.953  1844.640  1844.303  1846.245  1845.694  1847.580
AR2  1859.329  1849.384  1851.299  1809.872  1844.229  1846.091  1848.229  1847.617  1847.455
AR3  1851.742  1849.629  1842.027  1811.626  1842.962  1847.785  1849.740  1846.397  1849.024
AR4  1848.304  1850.033  1841.824  1842.973  1850.798  1815.122  1817.226  1847.695  1848.713
AR5  1849.746  1850.723  1851.847  1852.283  1844.475  1807.545  1816.573  1840.524  1847.804
AR6  1850.372  1852.228  1814.652  1815.658  1814.474  1810.012  1813.243  1813.604  1849.742
AR7  1852.372  1854.219  1816.130  1828.183  1846.236  1840.831  1846.520  1848.458  1850.385
AR8  1851.597  1818.416  1818.995  1817.416  1817.696  1842.817  1848.558  1848.398  1850.399

As we see, ARMA(3,3) yields a low AIC value while remaining reasonably small (a few larger models, such as ARMA(5,5), have slightly lower AIC but are more prone to instability and redundancy). Hence, we compare different \(SARMA(3,0,3)\times(P,0,Q)_{52}\) models in the next step to capture the seasonality. Below are the outputs of \(SARMA(3,0,3)\times(1,0,0)_{52}\), \(SARMA(3,0,3)\times(0,0,1)_{52}\), and \(SARMA(3,0,3)\times(1,0,1)_{52}\).

## 
## Call:
## arima(x = u1_loess_0.5$residuals, order = c(3, 0, 3), seasonal = list(order = c(1, 
##     0, 0), period = 52))
## 
## Coefficients:
##          ar1      ar2     ar3      ma1     ma2    ma3    sar1  intercept
##       1.9875  -1.0093  0.0076  -1.8375  0.7257  0.118  0.0610    -0.5504
## s.e.     NaN      NaN     NaN      NaN     NaN    NaN  0.0067        NaN
## 
## sigma^2 estimated as 144.3:  log likelihood = -897.02,  aic = 1812.03
## 
## Call:
## arima(x = u1_loess_0.5$residuals, order = c(3, 0, 3), seasonal = list(order = c(0, 
##     0, 1), period = 52))
## 
## Coefficients:
##          ar1     ar2      ar3      ma1      ma2     ma3    sma1  intercept
##       1.3752  0.1857  -0.5846  -1.2275  -0.3364  0.5640  0.0564    -0.2937
## s.e.  0.5461  1.0741   0.5364   0.5225   0.9592  0.4393  0.0700     0.1176
## 
## sigma^2 estimated as 144.1:  log likelihood = -897.49,  aic = 1812.98
## 
## Call:
## arima(x = u1_loess_0.5$residuals, order = c(3, 0, 3), seasonal = list(order = c(1, 
##     0, 1), period = 52))
## 
## Coefficients:
##          ar1      ar2      ar3      ma1      ma2     ma3    sar1     sma1
##       1.5335  -0.1191  -0.4357  -1.3886  -0.0701  0.4665  0.6638  -0.5687
## s.e.  0.0066      NaN      NaN   0.0458      NaN     NaN  0.0276   0.0422
##       intercept
##         -0.4632
## s.e.     0.3612
## 
## sigma^2 estimated as 143.8:  log likelihood = -896.59,  aic = 1813.18

Since \(SARMA(3,0,3)\times(1,0,0)_{52}\) has the lowest AIC value, we will use it to model the local regression residuals.

Model Diagnostics

Residual ACF Plot and Normality Checking

## 
##  Shapiro-Wilk normality test
## 
## data:  sarma_303_100$residuals
## W = 0.9931, p-value = 0.3669

As above, both the autocorrelation and normality diagnostics suggest that our new mean + SARMA model also fits the data well (although there are a few mild violations in the ACF plot). The two plots below confirm this finding.

Fitted Values vs. Real Data and Residuals against Fitted Values Plots

Conclusion

In this report, we analyzed the Michigan weekly pneumonia deaths time series. After the exploratory data analysis, model fitting, and diagnostic analysis, we reach a few main conclusions:

- The number of pneumonia deaths in Michigan shows clear annual seasonality, peaking each year during the influenza outbreak.
- A mean (local regression) + ARMA(2,2) model fits the data well and passes the residual diagnostics.
- A mean + \(SARMA(3,0,3)\times(1,0,0)_{52}\) model, which uses a larger loess span for the trend and models the seasonality explicitly, also fits the data well.

In the future, we may dig deeper into the relationship between flu and pneumonia, to see how they are correlated and whether deaths caused by pneumonia can be predicted from flu data.

References:

[1]. Data source: https://gis.cdc.gov/grasp/fluview/mortality.html

[2]. Wikipedia: Pneumonia. https://en.wikipedia.org/wiki/Pneumonia

[3]. Centers for Disease Control and Prevention (CDC): Causes of Pneumonia. https://www.cdc.gov/pneumonia/causes.html

[4]. All theoretical background can be found in the notes at https://ionides.github.io/531w20/