Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.
Objectives
Discuss a data analysis comparable to your midterm projects and the midterm exam.
Review ARMA methods.
Consider how one can build causal arguments based on time series associations.
Ogburn and Thomas (1922) were among the first to report pro-cyclical mortality. This is a phenomenon when mortality tends to be above trend during periods when economic activity is above trend. Procyclical mortality, if it is present, indicates that one key measure of population health is worse in economic booms than in economic recessions.
No one disputes that both the economy and life expectancy have grown over the last century. However, these phenomena have not always come together. For example, 1950–1980 saw rapid growth in life expectancy in India and China, periods of relatively slow economic growth for these countries. The rate of improvement in life expectancies has tapered off during the recent economic surges in India and China.
The link between economic growth and health improvement is controversial, since it has political implications.
If our goal is population health and happiness, how much should our policies focus on gross domestic product (GDP) growth?
If you have evidence supporting the view that economic growth is the critical engine for other improvements in living conditions, the answer is: a lot.
If you have evidence that there are other major factors involved, GDP growth becomes only one consideration, and not necessary always a top political priority.
Economists and epidemiologists have argued both sides of this debate, using time series methods.
e_data <- read.table(file="life_expectancy_usa.txt",header=TRUE)
head(e_data)
## Obs Yr e0F e0M e0
## 1 1 1933 62.78 59.17 60.88
## 2 2 1934 62.34 58.34 60.23
## 3 3 1935 63.04 58.96 60.89
## 4 4 1936 62.60 58.35 60.35
## 5 5 1937 63.37 59.00 61.05
## 6 6 1938 64.54 60.45 62.39
The variable e0
here is civilian life expectancy at birth. This is an actuarial calculation based on a fictitious individual having mortality rates at each age matching the census age-specific mortality rates for the current year. It is a standard way to combine all the age-specific mortality rates into a single number.
One can also break down life expectancy by gender, race, geography, etc. In our data, e0F
and e0M
are female and male life expectancy at birth. It is interesting and relevant to investigate the consistency of any established pattern across sub-populations, but here we’ll focus on a single, national analysis, combining both genders.
u_data <- read.table(file="unadjusted_unemployment.csv",sep=",",header=TRUE)
head(u_data)
## Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 1948 4.0 4.7 4.5 4.0 3.4 3.9 3.9 3.6 3.4 2.9 3.3 3.6
## 2 1949 5.0 5.8 5.6 5.4 5.7 6.4 7.0 6.3 5.9 6.1 5.7 6.0
## 3 1950 7.6 7.9 7.1 6.0 5.3 5.6 5.3 4.1 4.0 3.3 3.8 3.9
## 4 1951 4.4 4.2 3.8 3.2 2.9 3.4 3.3 2.9 3.0 2.8 3.2 2.9
## 5 1952 3.7 3.8 3.3 3.0 2.9 3.2 3.3 3.1 2.7 2.4 2.5 2.5
## 6 1953 3.4 3.2 2.9 2.8 2.5 2.7 2.7 2.4 2.6 2.5 3.2 4.2
These are the unemployment data that we looked at in Chapter 8.
Write \({e_n^*}\) for life expectancy in year \(t_n=1947+n\).
Write \({u_n^*}\) for mean unemployment in year \(t_n\).
These variables are coded as follows.
t <- intersect(e_data$Yr,u_data$Year)
e <- e_data$e0[e_data$Yr %in% t]
u <- apply(u_data[u_data$Year %in% t, 2:13],1,mean)
plot(ts(cbind(e,u),start=1948),main="Percent unemployment (u) and life expectancy (e) for USA",xlab="Year")
We are interested in changes over business cycle timescales, once trends have been removed.
To extract the cyclical component, we use an econometric method called the Hodrick-Prescott (HP) filter .
The HP filter is exactly a smoothing spline (see Section 9.6) with a particular choice of smoothing parameter.
Specifically, for a time series \({y_{1:N}^*}\), the HP filter is the time series \({s_{1:N}^*}\) constructed as
\[{s_{1:N}^*} = \arg\min_{s_{1:N}} \left\{ \sum^{N}_{n=1}\big({y_n^*}-s_{n}\big)^2 + \lambda\sum^{N-1}_{n=2}\big(s_{n+1}-2s_{n}+s_{n-1}\big)^2 \right\}.\]
A standard econometric choice of \(\lambda\) for removing nonlinear trend, and therefore extracting the business cycle component, in annual data is \(\lambda=100\).
An R implementation of the Hodrick-Prescott filter is hpfilter
in the R package mFilter
.
We use this to define the HP-detrended life expectancy, \(e^{HP*}_{1:N}\), and unemployment, \(u^{HP*}_{1:N}\).
require(mFilter)
e_hp <- hpfilter(e, freq=100,type="lambda",drift=F)$cycle
u_hp <- hpfilter(u, freq=100,type="lambda",drift=F)$cycle
plot(t,u_hp,type="l",xlab="Year",ylab="")
par(new=TRUE)
plot(t,e_hp,col="red",type="l",axes=FALSE,xlab="",ylab="")
axis(side=4, col="red")
Looking at this figure may suggest that detrended life expectancy and detrended unemployment cycle together.
We want to make a test to check that.
For example, we can analyze \(e^{HP*}_{1:N}\) using a regression with ARMA errors model, \[ E^{HP}_n = \alpha + \beta u^{HP*}_n + \epsilon_n,\] where \(\{\epsilon_n\}\) is a Gaussian ARMA process. We use an ARMA(1,0) model, as discussed in the supplementary analysis.
arima(e_hp,xreg=u_hp,order=c(1,0,0))
##
## Call:
## arima(x = e_hp, order = c(1, 0, 0), xreg = u_hp)
##
## Coefficients:
## ar1 intercept u_hp
## 0.4816 -0.0038 0.0678
## s.e. 0.1082 0.0320 0.0186
##
## sigma^2 estimated as 0.01859: log likelihood = 37.72, aic = -67.44
The standard errors, computed from the observed Fisher information approximation, suggest a statistically significant association between cyclical variation in unemployment and mortality.
We can also compute a p-value from a likelihood ratio test,
log_lik_ratio <- as.numeric(
logLik(arima(e_hp,xreg=u_hp,order=c(1,0,0))) -
logLik(arima(e_hp,order=c(1,0,0)))
)
LRT_pval <- 1-pchisq(2*log_lik_ratio,df=1)
This gives a p-value of \(5.1\times 10^{-4}\).
We may also notice from the plot that the relationship seems clearer before the mid 1990s, say in the first 45 years of the time series.
tt <- 1:45
arima(e_hp[tt],xreg=u_hp[tt],order=c(1,0,0))
##
## Call:
## arima(x = e_hp[tt], order = c(1, 0, 0), xreg = u_hp[tt])
##
## Coefficients:
## ar1 intercept u_hp[tt]
## 0.4927 0.0027 0.0809
## s.e. 0.1291 0.0417 0.0229
##
## sigma^2 estimated as 0.02103: log likelihood = 22.9, aic = -37.8
There is some suggestion that the association is stronger in the time period 1948–1992, but the difference is not large compared to the standard error on the coefficient.
It is quite plausible that the relationship between health, mortality, and the economy is changing over time. Why?
There is clear evidence of pro-cyclical mortality at a national level in the USA from 1948 to 2013.
For example, the Great Recession led to high unemployment in 2009-2010, but these two years had above-trend values of life expectancy at birth.
More data, perhaps a state-level or international panel analysis combining many time series, might be able to improve the signal to noise ratio and lead to clearer results. This might give us the statistical precision to compare time periods and sub-populations more accurately than can be done with just one national-level dataset.
We have been careful to talk about association, since observational data giving firm statistical evidence of an assocation between \(X\) and \(Y\) cannot readily distinguish between three possibilities:
\(X\) causes \(Y\).
\(Y\) causes \(X\).
Both \(X\) and \(Y\) are caused by a third variable \(Z\) that is unmeasured or has been omitted from the analysis. In this case, \(Z\) is called a confounding variable.
Here, it is not considered plausible that mortality fluctations drive economic fluctuations (the reverse causation possibility).
We think of unemployment as a proxy variable for economic fluctuations. We do not claim that increased unemployment itself is necessarily directly causing reduced mortality. Any other omitted variable that fluctuates with levels of economic activity should show similar associations.
A more problematic potential confounding variable is lagged economic activity. One could potentially find a pattern where the reduction in mortality for the current economic down-turn is actually caused by the previous economic boom.
It is hard to entirely dismiss this possibility. However, the association we have found is clearest with no time lag, and (as we have seen previously) economic fluctuations between periods of boom and bust have historically had quite variable duration. A stable lagged relationship between economic activity and life expectancy has not yet been discovered.
We have found empirical evidence to support a claim that above-trend economic growth CAUSES above-trend mortality.
aic_table <- function(data,P,Q,xreg=NULL){
table <- matrix(NA,(P+1),(Q+1))
for(p in 0:P) {
for(q in 0:Q) {
table[p+1,q+1] <- arima(data,order=c(p,0,q),xreg=xreg)$aic
}
}
dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
table
}
e_aic_table <- aic_table(e_hp,4,5,xreg=u_hp)
require(knitr)
kable(e_aic_table,digits=2)
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | -52.36 | -64.26 | -65.64 | -66.97 | -66.60 | -64.78 |
AR1 | -67.44 | -65.44 | -64.78 | -65.96 | -71.30 | -69.30 |
AR2 | -65.44 | -63.50 | -64.53 | -64.69 | -69.30 | -69.34 |
AR3 | -64.75 | -71.32 | -63.31 | -63.22 | -67.56 | -71.26 |
AR4 | -67.00 | -66.28 | -65.10 | -71.19 | -68.32 | -69.85 |
This suggests that the model with ARMA(1,0) errors is the best small model.
There are some larger models with impressive AIC values. For example, let’s look at the fitted model with ARMA(3,1) errors.
arima(e_hp,xreg=u_hp,order=c(3,0,1))
##
## Call:
## arima(x = e_hp, order = c(3, 0, 1), xreg = u_hp)
##
## Coefficients:
## ar1 ar2 ar3 ma1 intercept u_hp
## 1.3092 -0.3168 -0.1674 -1.0000 -0.0011 0.0764
## s.e. 0.1203 0.1981 0.1229 0.0406 0.0043 0.0187
##
## sigma^2 estimated as 0.0153: log likelihood = 42.66, aic = -71.32
We have a non-invertible solution, with a regression coefficient for \(U_{1:N}\) that is apparently statistically significant (according to the Fisher information standard errors).
Probably, the ARMA(3,1) analysis is not very stable, meaning that we might not expect to find a similar fit for similar data.
Note the inconsistencies in the AIC table.
r <- resid(arima(e_hp,xreg=u_hp,order=c(1,0,0)))
plot(r)
acf(r)
It is not disasterous to have one out of 18 lags narrowly outside the dashed lines showing pointwise acceptance regions at the 5% level under a null hypothesis of Gaussian white noise.
The presence of some small amount of sample autocorrelation is consistent with the AIC table, which finds the possibility of small gains by fitting some larger models to the regression errors.
One might model annual changes in life expectancy, rather than difference from a trend. In this case, we consider the variable \[ \Delta {e_n^*} = {e_n^*} - {e_{n-1}^*}.\]
We compute this as
delta_e <- e - e_data$e0[e_data$Yr %in% (t-1)]
plot(t,u,type="l",xlab="Year",ylab="")
par(new=TRUE)
plot(t,delta_e,col="red",type="l",axes=FALSE,xlab="",ylab="")
axis(side=4,col="red")
The relationship between unemployment and differenced life expectancy is harder to see than when HP-detrended.
The relationship is also harder to find by statistical methods. For example,
arima(delta_e,xreg=u,order=c(1,0,1))
##
## Call:
## arima(x = delta_e, order = c(1, 0, 1), xreg = u)
##
## Coefficients:
## ar1 ma1 intercept u
## 0.746 -0.6354 0.0804 0.0187
## s.e. 0.232 0.2451 0.1119 0.0184
##
## sigma^2 estimated as 0.03895: log likelihood = 13.43, aic = -16.86
Here, we do not see evidence for the relationship.
A scientific principle for interpreting experimental results is as follows: An experiment which finds evidence of an effect is usually a better foundation for future investigations than one which fails to find evidence. The experiment which found no evidence of an effect might have been a bad choice of experiment, or might have been carried out poorly.
The principle of giving preference to methods giving positive results must be balanced with consideration of multiple testing. There is a danger in trying many analyses and settling on the first that claims statistical significance. The generalizability of any result is tentative until confirmed in other studies.
It appears that temporal differencing has destroyed too much of the evidence we were hoping it would clarify.
Ogburn, W. F., and D. S. Thomas. 1922. The influence of the business cycle on certain social conditions. Journal of the American Statistical Association 18:324–340.