STATS 531 Midterm Project

Introduction
Dataset
Detrend the data using Hodrick-Prescott(HP) Filter
Model Selection
Fitted Model
Conclusions
References

Introduction

There were approximately 804,000 suicides globally in 2012, which is an age-standardized suicide rate of 11.4 per 100,000 people. The WHO estimates every 40 seconds someone dies of suicide. It is the second leading cause of death for people aged 15-29 globally.
Pinning down the exact causes of suicide is extremely difficult and controversial. It’s seems that the suicide rate has strong correlations with many social factors like economics, education, Drug and Alcohol use, gun prevalence and divorce law.
Here, we are interested in finding the correlation between the suicide rate and unemployment in U.S..

Dataset

First, let’s look at the historical suicide data in U.S..

s_data = read.csv(file="us_suicide.csv", header=T, sep=",")
head(s_data)

##    Year      U.S.
## 1 X1950 11.221060
## 2 X1951 10.248620
## 3 X1952  9.890234
## 4 X1953  9.985363
## 5 X1954 10.051450
## 6 X1955 10.125830

The suicide rate is 100,000 per year. The rate is adjusted for the changing age structure of the population.
One can also break down suicide rate by gender, race, geography, etc. The data can be found online. It is interesting and relevant to investigate the consistency of any established pattern across sub-populations, but here we’ll focus on a single, national analysis, combining both genders.

u_data <- read.table(file="adjusted_unemployment.csv",sep=",",header=TRUE)
head(u_data)

##   Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 1948 3.4 3.8 4.0 3.9 3.5 3.6 3.6 3.9 3.8 3.7 3.8 4.0
## 2 1949 4.3 4.7 5.0 5.3 6.1 6.2 6.7 6.8 6.6 7.9 6.4 6.6
## 3 1950 6.5 6.4 6.3 5.8 5.5 5.4 5.0 4.5 4.4 4.2 4.2 4.3
## 4 1951 3.7 3.4 3.4 3.1 3.0 3.2 3.1 3.1 3.3 3.5 3.5 3.1
## 5 1952 3.2 3.1 2.9 2.9 3.0 3.0 3.2 3.4 3.1 3.0 2.8 2.7
## 6 1953 2.9 2.6 2.6 2.7 2.5 2.5 2.6 2.7 2.9 3.1 3.5 4.5

These are the unemployment data that we looked at in Chapter 8.
Denote the time \(t_n = 1949+n\), \(t_n=1,...,56\).
Write \({s_n^*}\) for suicide rate in year \(t_n\).
Write \({u_n^*}\) for mean unemployment in year \(t_n\).

t <- c(1950:2005)
s <- s_data$U.S.
u <- apply(u_data[u_data$Year %in% t, 2:13],1,mean)

To get a first feeling about the data, we plot them together in one diagram.

plot(ts(cbind(s,u),start=1950),main="Percent unemployment(u) and suicide rate(s) for U.S.",xlab="Year")

Detrend the data using Hodrick-Prescott(HP) Filter

We are interested in changes over business cycle timescales, once trends have been removed.
To extract the cyclical component, we use an econometric method called the Hodrick-Prescott(HP) filter.
Specifically, for a time series \({y_{1:N}^*}\), the HP filter is the time series \({s_{1:N}^*}\) constructed as

\[{s_{1:N}^*} = \arg\min_{s_{1:N}} \left\{ \sum^{N}_{n=1}\big({y_n^*}-s_{n}\big)^2 + \lambda\sum^{N-1}_{n=2}\big(s_{n+1}-2s_{n}+s_{n-1}\big)^2 \right\}.\]

A standard econometric choice of \(\lambda\) for removing nonlinear trend, and therefore extracting the business cycle component, in annual data is \(\lambda=100\).
We use this to define the HP-detrended suicide rate, \(s^{HP*}_{1:N}\), and unemployment, \(u^{HP*}_{1:N}\).

require(mFilter)
s_hp <- hpfilter(s, freq=100,type="lambda",drift=F)$cycle
u_hp <- hpfilter(u, freq=100,type="lambda",drift=F)$cycle

Now we plot the detrended data together.

plot(t,u_hp,type="l",xlab="Year",ylab="", main="Detrended unemployment and detrended suicide rate")
par(new=TRUE)
plot(t,s_hp,col="red",type="l",axes=FALSE,xlab="",ylab="")
axis(side=4, col="red")
legend(2000,0.85, c("u_hp","s_hp"),lwd=c(1,1),col=c("black","red"),cex=0.5)

Looking at this figure may suggest that detrended suicide rate and detrended unemployment cycle together.

Model Selection

AIC Table

We can make a test to check that the detrended suicide rate and unemployment are with the same cycle.
For example, we can analyze \(s^{HP*}_{1:N}\) using a regression with ARMA errors model, \[ S^{HP}_n = \alpha + \beta u^{HP}_n + \epsilon_n,\] where \(\{\epsilon_n\}\) is a Gaussian ARMA process.
To choose the parameters \(p\), \(q\) of ARMA(p,q) model, we give out the AIC table as below.

aic_table <- function(data,P,Q,xreg=NULL){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
       table[p+1,q+1] <- arima(data,order=c(p,0,q),xreg=xreg)$aic
    }
  }
  dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
  table
}
e_aic_table <- aic_table(s_hp,4,4,xreg=u_hp)

## Warning in arima(data, order = c(p, 0, q), xreg = xreg): possible
## convergence problem: optim gave code = 1

require(knitr)
kable(e_aic_table,digits=2)

	MA0	MA1	MA2	MA3	MA4
AR0	13.73	9.32	10.67	11.45	13.35
AR1	8.62	10.61	12.56	13.38	15.35
AR2	10.61	12.62	-1.40	9.89	8.24
AR3	12.34	2.30	-1.00	10.12	0.71
AR4	12.17	1.92	2.51	0.56	2.07

We can see from the warning messages that we have numerical difficulties in optimizing the logliklihood function.
The AIC table suggests that the model with ARMA(2,2) erroes is the best small model.

arima(s_hp,xreg=u_hp,order=c(2,0,2))

## 
## Call:
## arima(x = s_hp, order = c(2, 0, 2), xreg = u_hp)
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  intercept    u_hp
##       1.7669  -0.9033  -1.9978  0.9997     0.0056  0.0920
## s.e.  0.0525   0.0519   0.1444  0.1445     0.0014  0.0427
## 
## sigma^2 estimated as 0.03703:  log likelihood = 7.7,  aic = -1.4

To chcek the causality and invertibility of ARMA(2,2) model, we check the roots of the polynomials using polyroot function.

AR_root = polyroot(c(1,-1.7669,0.9033))
abs(AR_root)

## [1] 1.052165 1.052165

MA_root = polyroot(c(1,-1.9978,0.9997))
abs(MA_root)

## [1] 1.00015 1.00015

Finding the absolute value shows that we have the roots outside the unit circle, so this ARMA(2,2) model is causal and invertible.

Residual Analysis

Now, let’s check the residuals for the fitted model, and look at their sample autocorrelation.

r = resid(arima(s_hp, xreg=u_hp, order=c(2,0,2)))
plot(r, type ="p")
abline(h=0, lty=2)

We can see two extreme outliers in year 1950 and 1977.
There is a little evidence for fluctuations decreasing in amplitude over time. But not serious here.

qqnorm(r)

The QQ-plot shows a few heavy right tails.
We may want to check what happened in 1950 and 1977.

acf(r)

The residuals appear to be uncorrealated, since ACF at each lag falls betwwen the dotted lines.
We may conclude the residuals are Gaussian white noise.

Analysis of Temporal Differences

We may think of an ARIMA(p,1,q) model to fit the annual changes in suicide rate, rather than difference from a trend.
In this case, we consider the variable \[ \Delta {s_n^*} = {s_n^*} - {s_{n-1}^*}.\]

delta_s <- s[2:56] - s_data$U.S.[1:55]
delta_t <- c(1951:2005)

plot(t,u,type="l",xlab="Year",ylab="")
par(new=TRUE)
plot(delta_t,delta_s,col="red",type="l",axes=FALSE,xlab="",ylab="")
axis(side=4,col="red")

The relationship between unemployment and differenced life expectancy is harder to see than when HP-detrended.
The relationship is also harder to find by statistical methods.

arima(s, xreg=u, order=c(2,1,2))

## 
## Call:
## arima(x = s, order = c(2, 1, 2), xreg = u)
## 
## Coefficients:
##          ar1      ar2      ma1     ma2       u
##       0.4928  -0.6413  -0.7707  1.0000  0.1624
## s.e.  0.1515   0.1264   0.0762  0.0955  0.0202
## 
## sigma^2 estimated as 0.07786:  log likelihood = -9.9,  aic = 31.8

Here, we see no evidence for the relationship.
An experiment found no evidence of an effect might have been a bad choice of experiment, or might have been carried out poorly.
It appears that temporal differencing has destroyed too much of the evidence we were hoping it would clarify.

Fitted Model

Now, we use an ARMA(2,2) model to fit the detrended data.

arima(s_hp, xreg=u_hp, order=c(2,0,2))

## 
## Call:
## arima(x = s_hp, order = c(2, 0, 2), xreg = u_hp)
## 
## Coefficients:
##          ar1      ar2      ma1     ma2  intercept    u_hp
##       1.7669  -0.9033  -1.9978  0.9997     0.0056  0.0920
## s.e.  0.0525   0.0519   0.1444  0.1445     0.0014  0.0427
## 
## sigma^2 estimated as 0.03703:  log likelihood = 7.7,  aic = -1.4

The fitted model here is \[s^{HP}_n-1.7669s^{HP}_{n-1}+0.9033s^{HP}_{n-2} = 0.0056 + 0.0920 u^{HP}_n + \epsilon_n-1.9978\epsilon_{n-1}+0.9997\epsilon_{n-2}\]
Although this model fits good, the coefficients are hard to interpret.
The standard errors, computed from the observed Fisher information approximation, suggset a statistically significant association between cyclical variation in unemployment and suicide rate.
We can also compute a p-value from a likelihood ratio test.

log_lik_ratio <- as.numeric(
   logLik(arima(s_hp,xreg=u_hp,order=c(2,0,2))) -
   logLik(arima(s_hp,order=c(2,0,2)))
)
LRT_pval <- 1-pchisq(2*log_lik_ratio,df=1)
cat(LRT_pval)

## 3.506149e-06

We may also notice from the plot that the relationship seems clearer befroe 1980s, say in the first 30 years of the time series.

tt = 1:30
arima(s_hp[tt], xreg=u_hp[tt], order=c(2,0,2))

## 
## Call:
## arima(x = s_hp[tt], order = c(2, 0, 2), xreg = u_hp[tt])
## 
## Coefficients:
##           ar1      ar2     ma1     ma2  intercept  u_hp[tt]
##       -1.4539  -0.6748  1.9948  0.9999     0.0439    0.2554
## s.e.   0.1496   0.1349  0.2926  0.2926     0.0470    0.0420
## 
## sigma^2 estimated as 0.04065:  log likelihood = 1.82,  aic = 10.37

There is some suggestion that the association is stronger in the time period 1950-1980, but difference is not large compared to the standard error on the coefficient.
It seems reasonable that the relationship between suicide rate and the economy is changing over time because of the changes of human beings’ values and life styles.
Since the ARMA(2,2) model is hard to interpret and seems to be unstable, we may want to choose another simpler model say ARMA(1,0) from the AIC table.

arima(s_hp, xreg=u_hp, order=c(1,0,0))

## 
## Call:
## arima(x = s_hp, order = c(1, 0, 0), xreg = u_hp)
## 
## Coefficients:
##          ar1  intercept    u_hp
##       0.3707     0.0097  0.1266
## s.e.  0.1340     0.0514  0.0386
## 
## sigma^2 estimated as 0.05905:  log likelihood = -0.31,  aic = 8.62

The fitted model is \[s^{HP}_n-0.3707s^{HP}_{n-1} = 0.0097 + 0.1266 u^{HP}_n\]
We can compute a p-value from a likelihood ratio test.

log_lik_ratio <- as.numeric(
   logLik(arima(s_hp,xreg=u_hp,order=c(1,0,0))) -
   logLik(arima(s_hp,order=c(1,0,0)))
)
LRT_pval <- 1-pchisq(2*log_lik_ratio,df=1)
cat(LRT_pval)

## 0.00173612

We can also see a stistically significant association between cyclical variation in unemployment and mortality from this model, although it doesn’t fit as good as ARMA(2,2) model.

Conclusions

We check that detrended suicide rate and detrended unemployment cycle together.
There is clear evidence of pro-cyclical suicide rate at a notional level in the US. For example, in 1981-1983 and 1991-1992, we have high unemployment but below-trend suicide rate.
More data, perhaps a state-level or panel analysisi combining many time series, might be able to improve the signal to noise ration and lead to clearer results.
To talk about the association, it is plausible that economic fluctuations drive the suicide fluctuations. (the reverse causation can not be true)
We regard unemployment as a proxy variable for economic fluctuations. We do not claim that increased unemployment itself is necessarily directly causing reduced mortality.
The association we have found is clearest with no time lag, and (as we have seen previously) economic fluctuations between periods of boom and bust have historically had quite variable duration. A stable lagged relationship between economic activity and suicide rate has not yet been discovered.
Beside economic recession, the suicide rate may has association with education, Drug and Alcohol use, gun prevalence and divorce law, which needs further discussions.

References

Lindsay Lee and Max Roser (2015) - ‘Suicide’. Published online at OurWorldInData.org. Retrieved from: http://ourworldindata.org/data/health/suicide/ [Online Resource]