1. Introduction

In this project, I study a time series of suicide deaths in Australia. Suicide is a major issue around the world; surprisingly, far more deaths are caused by suicide than by homicide. Many people are under high pressure for a variety of reasons, and even those who know them may not be aware of the risk that a loved one will attempt suicide. Looking at historical data is therefore a useful way for professionals to decide how to act: by making predictions based on the past, they can more efficiently help people at risk of suicide.

The data I used are from DataMarket [1]. They record yearly suicide deaths in Australia from 1915 to 2004, per 100,000 people.

2. Explore the data

We first take a look at the data:

# Read the data and extract the yearly suicide death rate
suicide=read.csv(file="deaths.csv",header=TRUE)
death=suicide$Suicide
summary(death)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.836   2.600   2.975   2.804   3.224   4.032
# Convert to a yearly time series starting in 1915 and plot it
Suicide=ts(death,start=1915)
plot(Suicide)

We can see from the plot that there is clearly a trend, so we will detrend the series later.

Then let us look at the spectrum:

spectrum(death,spans=c(2,2))

From the periodogram, there appear to be no dominant cycles.

3. Detrending

We first need to detrend the data. Here I use the cubic polynomial model \[X_n=\beta_0+\beta_1 n+\beta_2 n^2+\beta_3 n^3+\epsilon_n,\] where \(n\) is the time index and \(\epsilon_n\) is the error term modeled in the next section. The fitted model is as follows:

# Time index n = 1, ..., 90 (years 1915-2004)
t=1:90
# Fit a cubic polynomial trend to the data
dreg=lm(death~t+I(t^2)+I(t^3))
summary(dreg)
## 
## Call:
## lm(formula = death ~ t + I(t^2) + I(t^3))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.01515 -0.24198 -0.03677  0.32339  1.07636 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.026e+00  1.864e-01  21.596  < 2e-16 ***
## t           -1.341e-01  1.764e-02  -7.601 3.37e-11 ***
## I(t^2)       3.849e-03  4.492e-04   8.570 3.70e-13 ***
## I(t^3)      -3.052e-05  3.246e-06  -9.404 7.41e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4238 on 86 degrees of freedom
## Multiple R-squared:  0.6255, Adjusted R-squared:  0.6125 
## F-statistic: 47.88 on 3 and 86 DF,  p-value: < 2.2e-16
# Plot the data with the fitted trend overlaid
plot(death,type='l')
lines(dreg$fitted.values,col='red')

All of the coefficients are significant, and the R-squared value is around 62%, so this gives a reasonable estimate of the trend.

4. Fitting the error

Next, we remove the trend from the data and look at the errors.

A good starting point is the ACF and the spectrum of the residuals:
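A minimal sketch of this step, assuming the detrended series r is taken as the residuals of the cubic regression dreg fitted above:

# Detrended series: residuals of the cubic trend fit
r=dreg$residuals
# Sample autocorrelation of the detrended series
acf(r,main='ACF of detrended data')
# Smoothed periodogram of the detrended series
spectrum(r,spans=c(2,2),main='Spectrum of detrended data')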

From the periodogram, there is a high peak at a low frequency, around 0.033 cycles per year, which corresponds to a period of roughly 30 years. The ACF plot also supports this, showing a cycle of about 30 years.

However, since the data span only 90 years, it may not be a good idea to build this periodicity into the model; more data would be needed to study the periodic behavior.

We then choose an ARMA model for the errors based on AIC values; the table below gives the AIC for ARMA(p,q) fits with \(p,q \le 5\) (a sketch of the code used to produce it follows the table):

        MA0    MA1    MA2    MA3    MA4    MA5
AR0  100.80  67.21  61.48  57.90  58.86  58.45
AR1   50.94  52.24  53.90  55.34  56.65  58.10
AR2   52.42  54.10  48.09  46.48  48.23  57.46
AR3   53.57  55.54  57.46  49.70  50.40  60.65
AR4   55.53  57.36  59.36  57.46  59.73  51.96
AR5   57.46  50.14  57.48  60.00  56.34  56.73
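The AIC values above can be reproduced with a loop over the ARMA orders; a minimal sketch, assuming the detrended series r from above (some orders may produce convergence warnings):

# AIC for ARMA(p,q) models, p,q = 0,...,5, fitted to the detrended series
aic_table=matrix(NA,6,6,dimnames=list(paste0('AR',0:5),paste0('MA',0:5)))
for(p in 0:5){
  for(q in 0:5){
    aic_table[p+1,q+1]=arima(r,order=c(p,0,q))$aic
  }
}
round(aic_table,2)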

We see that ARMA(2,3), i.e. ARIMA(2,0,3), has the lowest AIC in the table, 46.48, so we choose to fit this model. The summary is below:

# Fit an ARMA(2,3) model (ARIMA(2,0,3)) to the detrended series
md=arima(r,order=c(2,0,3))
md
## 
## Call:
## arima(x = r, order = c(2, 0, 3))
## 
## Coefficients:
##          ar1      ar2      ma1     ma2     ma3  intercept
##       1.9253  -0.9838  -1.4675  0.2620  0.2930     0.0310
## s.e.  0.0323   0.0304   0.1216  0.1801  0.1201     0.0464
## 
## sigma^2 estimated as 0.082:  log likelihood = -16.24,  aic = 46.48

Next we look at the residuals of this model:

res=md$residuals
acf(res, main='ACF of residuals')

Box.test(res)
## 
##  Box-Pierce test
## 
## data:  res
## X-squared = 0.020187, df = 1, p-value = 0.887
qqnorm(res)
qqline(res)

shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.98841, p-value = 0.6141

The ACF plot shows that the residuals are uncorrelated, since all of the sample autocorrelations lie within the confidence band. The Box-Pierce test (p-value = 0.887) also gives no evidence against independence of the residuals.

The normal Q-Q plot shows that the residuals are close to normal overall, although there are slight deviations in both tails. The Shapiro-Wilk test (p-value = 0.6141) does not reject normality.

Hence, it is appropriate to consider the residuals as Gaussian white noise.

5. Summary of the model and prediction

After detrending the data and fitting an ARIMA(2,0,3) model to the errors, we have a good model for the suicide data in Australia from 1915 to 2004.

The model is \[X_n=4.03-0.13n+3.85\times 10^{-3}n^2-3.05\times 10^{-5}n^3+\epsilon_n,\] where \[\epsilon_n=1.93\epsilon_{n-1}-0.98\epsilon_{n-2}+\omega_n-1.47\omega_{n-1}+0.26\omega_{n-2}+0.29\omega_{n-3},\] and \(\{\omega_n\}\) is Gaussian white noise with \(\omega_n \sim N(0,\sigma^2)\), \(\sigma^2\approx 0.082\). Here \(n=1,\dots,90\) indexes the years 1915 to 2004.

Based on this model, we can make predictions. Here are the predictions for the next 50 years:
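A minimal sketch of how such a forecast can be made, assuming the trend fit dreg and the error model md from above, and combining the extrapolated cubic trend with the ARMA forecast of the errors:

# Forecast 50 years ahead: extrapolated trend + ARMA error forecast
h=50
t_new=91:(90+h)
trend_new=predict(dreg,newdata=data.frame(t=t_new))
err_new=predict(md,n.ahead=h)
pred=trend_new+err_new$pred
# Plot the observed series together with the forecast
plot(1915:2004,death,type='l',xlim=c(1915,2054),
     xlab='Year',ylab='Suicide deaths per 100,000')
lines(2005:2054,pred,col='red')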

6. Further discussion

Based on the 90 years of data, we have arrived at a cubic trend plus ARIMA(2,0,3) error model. As discussed before, the time span is too short to study the apparent 30-year periodicity; a potential next step would be to obtain more data and see whether this periodic behavior deserves attention.

7. References

[1] Data: Deaths from homicides and suicides in Australia, DataMarket. https://datamarket.com/data/set/238x/deaths-from-homicides-and-suicides-in-australia#!ds=238x!2jzl=1:2jzm&display=line

[2] STATS 531 (Winter 2018) course materials: https://ionides.github.io/531w18/
