In this project, I study the suicide time series in Austrilia. Suicide is such a huge issue around the world. Surprisingly, most of deaths are caused by suicides rather than homicides. Many people are under high pressure due to different reasons. Even people who know them may not aware the potential suicides of their beloved. Looking at historial data would be a great way for professions to take actions. By predicting based on the past, it is efficient for them to help more people suffering from potential suicides.
The data I used is from here. It records yearly suicide deaths in Austrilia from 1915 to 2004 in the unit of 100000 people.
We first take a look at the data:
suicide=read.csv(file="deaths.csv",header=TRUE)
death=suicide$Suicide
summary(death)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.836 2.600 2.975 2.804 3.224 4.032
Suicide=ts(death,start=1915)
plot(Suicide)
We can see from the plot there is obviously a trend, so we will detrending it later.
Then let us look at the spectrum:
spectrum(death,spans=c(2,2))
From the periodogram, there seems to no typical cycles.
We first need to detrend the data. Here I use the polynomial model \[X=Intercept+\beta_1 n+\beta_2 n^2+\beta_3 n^3\] The model is as follows:
t=seq(1:90)
dreg=lm(death~t+I(t^2)+I(t^3))
summary(dreg)
##
## Call:
## lm(formula = death ~ t + I(t^2) + I(t^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.01515 -0.24198 -0.03677 0.32339 1.07636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.026e+00 1.864e-01 21.596 < 2e-16 ***
## t -1.341e-01 1.764e-02 -7.601 3.37e-11 ***
## I(t^2) 3.849e-03 4.492e-04 8.570 3.70e-13 ***
## I(t^3) -3.052e-05 3.246e-06 -9.404 7.41e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4238 on 86 degrees of freedom
## Multiple R-squared: 0.6255, Adjusted R-squared: 0.6125
## F-statistic: 47.88 on 3 and 86 DF, p-value: < 2.2e-16
plot(death,type='l')
lines(dreg$fitted.values,col='red')
All of the coefficients are all sigificant, and the R_squared value is around 62%. Hence, this is a good estimation of the trend.
Next, we could remove the trend from data, and look at the error.
A good start is to look at the ACF and spectrum of the residuals:
From the periodogram, there is a high peak at frequency = 0.33, which suggests a period of 30 years. The ACF plot also supports this. Every about 30 years, there is a period.
However, since our data is about only 90 years, it may not be a good idea to consider seanality in this situation. More data is needed to study the periodic behaivor.
We then choose a ARMA model to fit the errors based on AIC values:
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | 100.80 | 67.21 | 61.48 | 57.90 | 58.86 | 58.45 |
AR1 | 50.94 | 52.24 | 53.90 | 55.34 | 56.65 | 58.10 |
AR2 | 52.42 | 54.10 | 48.09 | 46.48 | 48.23 | 57.46 |
AR3 | 53.57 | 55.54 | 57.46 | 49.70 | 50.40 | 60.65 |
AR4 | 55.53 | 57.36 | 59.36 | 57.46 | 59.73 | 51.96 |
AR5 | 57.46 | 50.14 | 57.48 | 60.00 | 56.34 | 56.73 |
We see ARIMA(2,0,3) has a low AIC = 46.48, we choose to fit this model. The summary is below:
md=arima(r,order=c(2,0,3))
md
##
## Call:
## arima(x = r, order = c(2, 0, 3))
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 intercept
## 1.9253 -0.9838 -1.4675 0.2620 0.2930 0.0310
## s.e. 0.0323 0.0304 0.1216 0.1801 0.1201 0.0464
##
## sigma^2 estimated as 0.082: log likelihood = -16.24, aic = 46.48
Next we look at the residuals of this model:
res=md$residuals
acf(res, main='ACF of residuals')
Box.test(res)
##
## Box-Pierce test
##
## data: res
## X-squared = 0.020187, df = 1, p-value = 0.887
qqnorm(res)
qqline(res)
shapiro.test(res)
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.98841, p-value = 0.6141
The ACF plot shows that the residuals are uncorrelated as they all lie in the confidence interval. The Box test indicates the residuals are independent.
The Normal QQ-plot shows overall a good behavior of normality, although there are tails at both end. The Shapiro-Wilk normality test supports this assumption.
Hence, it is appropriate to consider the residuals as Gaussian white noise.
After detrending, then fit an ARIMA(2,0,3) model to the errors, we have found a good model for suicide data in Austrilia from 1915 to 2004.
The model is: \[X_n=4.03-0.13n+3.85*10^{-3}n^2-3.05*10^{-5}n^3+\epsilon_n\] where \[\epsilon_n=1.93\epsilon_{n-1}-0.98\epsilon_{n-2}+\omega_n-1.47\omega_{n-1}+0.26\omega_{n-2}+0.29\omega_{n-3}\], and \(\{\omega_n\}:\omega_n \sim N(0,\sigma^2)\) is Gaussian white noise
Based on this model, we can make predictions. Here is the predictions for the next 50 years:
Based on the 90-year data, we’ve come up with a ARIMA(2,0,3) model. As discussed before, the data is too small for the study of seasonality. The potiential next step could be to get more data, and see if periodic behavior is worth awareness.