We collect monthly counts of fatal traffic crashes in Michigan (1994-2014) from http://www.michigantrafficcrashfacts.org and look for trends and patterns in the number of crashes. Traffic safety offices can use such patterns to guide the scheduling of their policies, and related industries such as insurance may also find them useful for predictive analysis.
First, we explore the data with a quantitative summary and a time plot. The plot shows an obvious seasonal pattern and a slow decreasing trend over the 20 years.
## Date FatalCrash year
## Min. :1994-01-31 00:00:00 Min. : 47.00 Min. :1994
## 1st Qu.:1999-04-22 12:15:00 1st Qu.: 80.00 1st Qu.:1999
## Median :2004-07-15 12:00:00 Median : 98.00 Median :2004
## Mean :2004-07-15 12:48:34 Mean : 98.52 Mean :2004
## 3rd Qu.:2009-10-07 18:00:00 3rd Qu.:118.50 3rd Qu.:2009
## Max. :2014-12-31 00:00:00 Max. :166.00 Max. :2014
## month
## Min. : 1.00
## 1st Qu.: 3.75
## Median : 6.50
## Mean : 6.50
## 3rd Qu.: 9.25
## Max. :12.00
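For reference, here is a minimal sketch of how the summary and exploratory plot above could be produced; the file name is hypothetical, and the column names are taken from the summary output.
# Hypothetical file name; the actual data come from michigantrafficcrashfacts.org
data <- read.csv("michigan_fatal_crashes_monthly.csv")
data$Date <- as.POSIXct(data$Date)   # monthly dates, 1994-2014
summary(data)                        # quantitative summary shown above
plot(FatalCrash ~ Date, data = data, type = "l",
     ylab = "Monthly fatal crashes",
     main = "Monthly Fatal Traffic Crashes in Michigan, 1994-2014")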
The ACF plot also suggests strong autocorrelation and a clear seasonal pattern.
acf(data$FatalCrash, main = 'ACF Plot of Monthly Fatal Crashes')
By plotting the smoothed spectral density, we find that a frequency near 0.08 dominates the spectrum. Since the x-axis is in cycles per observation (month), this corresponds to a period of about 12.5 months, approximately one year. The error bar also suggests that a period of about 2 years is significant, a subharmonic of the annual cycle.
spectrum(data$FatalCrash,spans=c(3,5,3), main="Smoothed periodogram")
Then we decompose the series into components at different frequency scales and try to extract the business cycles. The low-frequency component is a decreasing trend, which suggests that fatal crashes have gone down over the last 20 years. This is reasonable given improvements in road conditions, car safety features, and state policies. We can also find a business cycle with a one-year period.
# Low-frequency component (trend): loess with a wide span
crash.low <- ts(loess(FatalCrash ~ time, span = 0.5)$fitted, frequency = 12)
# High-frequency component (noise): what a narrow-span loess cannot capture
crash.hi <- ts(FatalCrash - loess(FatalCrash ~ time, span = 0.05)$fitted, frequency = 12)
# Mid-frequency component (business cycles): what remains after removing trend and noise
crash.cycles <- FatalCrash - crash.hi - crash.low
plot(ts.union(FatalCrash, crash.low, crash.hi, crash.cycles),
     main = "Decomposition of Fatal Crashes as trend + noise + cycles", lwd = 1.5)
Due to the trend, a stationary ARMA model will not fit well here. We therefore plot the first-order differenced data below; the differenced series is mean stationary but still shows seasonality.
plot(c(NA,diff(data$FatalCrash))~time, lwd = 1, type = 'l', ylab = 'difference of fatal crash', main = 'Plot of Differenced Series')
Therefore, we try to fit a \(SARIMA(p,1,q)×(0,1,0)_{12}\) model and tabulate the AIC values for a range of choices of p and q.
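Each value in the AIC table below comes from a single fit of this form; a minimal sketch for one cell, using forecast::Arima as in the BIC code further down (the loop over p and q is analogous to the BIC table function shown later):
library(forecast)
# AIC for SARIMA(0,1,1)x(0,1,0)_12 -- one cell of the table below
fit <- Arima(data$FatalCrash, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 0), period = 12))
fit$aic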
|     | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|-----|-----|-----|-----|-----|-----|-----|
| AR0 | 2187.09 | 2018.17 | 2019.71 | 2021.09 | 2022.59 | 2024.54 |
| AR1 | 2105.96 | 2019.67 | 2021.56 | 2022.20 | 2024.15 | 2026.15 |
| AR2 | 2073.14 | 2021.23 | 2014.98 | 2015.07 | 2002.26 | 2002.05 |
| AR3 | 2064.90 | 2022.34 | 2023.96 | 2016.55 | 2004.30 | 2006.20 |
| AR4 | 2050.39 | 2024.21 | 2016.76 | 2018.35 | 2006.34 | 2009.52 |
| AR5 | 2045.36 | 2024.50 | 2018.57 | 2000.13 | 2002.01 | 1995.47 |
The AIC values keep decreasing as the model becomes more complex and end up favoring the largest model we considered, so we also compute BIC values to see whether the preferred model changes.
# Tabulate BIC for SARIMA(p,1,q)x(0,1,0)_12 over p = 0..P and q = 0..Q
# (requires the forecast package for Arima)
Table_For_ARMA_BIC <- function(data, P, Q){
  table <- matrix(NA, (P + 1), (Q + 1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p + 1, q + 1] <- Arima(data, order = c(p, 1, q),
                                   seasonal = list(order = c(0, 1, 0), period = 12))$bic
    }
  }
  # Bold AR labels render in the HTML kable output
  dimnames(table) <- list(paste("<b> AR", 0:P, "</b>", sep = ""), paste("MA", 0:Q, sep = ""))
  table
}
Fatal_bic_table <- Table_For_ARMA_BIC(data$FatalCrash,5,5)
kable(Fatal_bic_table,digits=2)
|     | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|-----|-----|-----|-----|-----|-----|-----|
| AR0 | 2190.57 | 2025.13 | 2030.14 | 2035.00 | 2039.97 | 2045.40 |
| AR1 | 2112.91 | 2030.10 | 2035.46 | 2039.58 | 2045.01 | 2050.49 |
| AR2 | 2083.57 | 2035.13 | 2032.36 | 2035.93 | 2026.59 | 2029.86 |
| AR3 | 2078.80 | 2039.72 | 2044.82 | 2040.89 | 2032.11 | 2037.48 |
| AR4 | 2067.77 | 2045.07 | 2041.10 | 2046.16 | 2037.63 | 2044.28 |
| AR5 | 2066.21 | 2048.84 | 2046.39 | 2031.42 | 2036.77 | 2033.71 |
The BIC reaches its minimum (2025.13) at p = 0 and q = 1. This combination also gives a local minimum of the AIC (2018.17) among models with small p and q, so we choose the \(SARIMA(0,1,1)×(0,1,0)_{12}\) model.
Analyzing the ACF of the residuals after fitting this model, we still see high autocorrelation at lag 12.
result_ar01 = arima(data$FatalCrash,order=c(0,1,1),seasonal=list(order=c(0,1,0),period=12))
acf(result_ar01$residuals)
We therefore try a \(SARIMA(0,1,1)×(0,1,1)_{12}\) model to further improve the fit.
result_ma1sma1 = arima(data$FatalCrash,order=c(0,1,1),seasonal=list(order=c(0,1,1),period=12))
summary(result_ma1sma1)
##
## Call:
## arima(x = data$FatalCrash, order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.9156 -0.9141
## s.e. 0.0293 0.0589
##
## sigma^2 estimated as 141.6: log likelihood = -943, aic = 1892
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.09393 11.58909 8.860361 -1.978095 9.825396 0.6269948
## ACF1
## Training set -0.06262877
From the result above, comparing each coefficient with its standard error shows that both the ma1 and sma1 coefficients are significant; for example, an approximate 95% confidence interval for ma1 is \(-0.9156 \pm 1.96 \times 0.0293\), which is well away from zero. The AIC of this model is 1892, much smaller than before. Plotting the ACF of the residuals again, the residuals now look uncorrelated, and the Q-Q plot suggests they are approximately normal with slightly heavy tails.
acf(result_ma1sma1$residuals)
qqnorm(result_ma1sma1$residuals)
qqline(result_ma1sma1$residuals)
The model can be written as: \[ (1-B)(1-B^{12})X_n= (1-0.9156B)(1-0.9141B^{12}) \epsilon_n , \] where \(\epsilon_n\) is a white noise process. Comparing the fitted values with the observed fatal crash counts below, the SARIMA model performs well.
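A minimal sketch of how that comparison could be plotted, taking the fitted values as the observations minus the residuals of the fit above (result_ma1sma1):
fitted_sarima <- data$FatalCrash - result_ma1sma1$residuals   # fitted = observed - residuals
plot(data$Date, data$FatalCrash, type = "l",
     xlab = "Year", ylab = "Monthly fatal crashes",
     main = "Observed vs Fitted Fatal Crashes")
lines(data$Date, fitted_sarima, col = "red", lty = 2)
legend("topright", c("observed", "fitted"), col = c("black", "red"), lty = c(1, 2))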
The fatal crash series peaks in the summer. However, total traffic crashes happen much more often in the winter, which seems contradictory at first glance. On further investigation, we notice that crashes occurring in clear weather conditions (clear crashes) track fatal crashes closely, so we analyze fatal crashes with a regression on clear-weather crashes (years 2004 to 2014) with ARMA errors.
## Date Total.Crash Fatal.Crash Clear.Crash
## 10/31/04: 1 Min. :17752 Min. : 47.00 Min. : 7354
## 10/31/05: 1 1st Qu.:21740 1st Qu.: 68.00 1st Qu.:11103
## 10/31/06: 1 Median :24912 Median : 81.00 Median :13524
## 10/31/07: 1 Mean :25744 Mean : 82.54 Mean :13246
## 10/31/08: 1 3rd Qu.:28200 3rd Qu.: 96.00 3rd Qu.:15387
## 10/31/09: 1 Max. :41560 Max. :130.00 Max. :22939
## (Other) :126
Since we are interested in changes over business-cycle timescales, we use the Hodrick-Prescott (HP) filter to extract the cyclical component, and find that the detrended fatal crashes and detrended clear crashes cycle together.
# HP filter (mFilter package); freq is the smoothing parameter lambda
fatal_hp <- hpfilter(compare.data$Fatal.Crash, freq = 50, type = "lambda", drift = FALSE)$cycle
clear_hp <- hpfilter(compare.data$Clear.Crash, freq = 50, type = "lambda", drift = FALSE)$cycle
# Overlay the two detrended series, with a separate axis for each scale
plot(time1, fatal_hp, type = "l", xlab = "Year", ylab = "")
par(new = TRUE)
plot(time1, clear_hp, col = "red", type = "l", lty = 2, axes = FALSE, xlab = "", ylab = "",
     main = 'Detrended Fatal Crash and Clear Crash')
legend("topright", c('fatal_hp', 'clear_hp'), lwd = 1, lty = c(1, 2), col = c('black', 'red'))
axis(side = 4, col = "red")   # right-hand axis for the clear-crash scale
Therefore, we fit a regression of one detrended series on the other with ARMA errors, and tabulate AIC values over the orders of the error model. The AIC table below selects the ARMA(2,3) errors model as the best, with AIC = 937.44.
|     | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|-----|-----|-----|-----|-----|-----|-----|
| AR0 | 1040.26 | 1040.72 | 1041.73 | 1015.91 | 1015.78 | 1013.72 |
| AR1 | 1040.55 | 1042.52 | 1043.57 | 1015.67 | 1016.54 | 1015.71 |
| AR2 | 1042.43 | 1043.95 | 951.16 | 937.44 | 939.50 | 1014.41 |
| AR3 | 1041.73 | 994.14 | 940.67 | 940.38 | 940.72 | 1016.09 |
| AR4 | 1038.48 | 980.30 | 940.67 | 942.00 | 942.82 | 941.99 |
| AR5 | 1027.95 | 969.45 | 940.60 | 942.91 | 944.46 | 944.96 |
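A sketch of how this table could be built, analogous to the SARIMA BIC table above but using stats::arima with the clear-crash cycle as a regressor (the function name is an assumption; fatal_hp and clear_hp come from the HP filter code above):
# AIC for a regression of fatal_hp on clear_hp with ARMA(p,q) errors
Table_For_ARMAerr_AIC <- function(y, xreg, P, Q){
  table <- matrix(NA, (P + 1), (Q + 1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p + 1, q + 1] <- arima(y, order = c(p, 0, q), xreg = xreg)$aic
    }
  }
  dimnames(table) <- list(paste("AR", 0:P, sep = ""), paste("MA", 0:Q, sep = ""))
  table
}
kable(Table_For_ARMAerr_AIC(fatal_hp, clear_hp, 5, 5), digits = 2)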
Looking at the details of the ARMA(2,3) errors model, all coefficients are significant relative to their standard errors. The coefficient of clear_hp is 0.0014; since fatal crashes and clear crashes are on very different scales, this small magnitude still represents a significant linear relationship between the two series.
##
## Call:
## arima(x = fatal_hp, order = c(2, 0, 3), xreg = clear_hp)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 intercept clear_hp
## 1.6859 -0.9447 -2.4026 1.8173 -0.4144 -0.0057 0.0014
## s.e. 0.0274 0.0257 0.1108 0.2050 0.0998 0.0026 0.0006
##
## sigma^2 estimated as 56.24: log likelihood = -460.72, aic = 937.44
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.5323786 7.499009 6.104763 32.31389 130.0341 0.4699667
## ACF1
## Training set 0.01028354
The ACF plot and Q-Q plot of the residuals below also suggest that the residuals are approximately Gaussian white noise, indicating that the ARMA errors model fits well.
# Fit the regression with ARMA(2,3) errors (the model summarized above)
result_armaerr <- arima(fatal_hp, order = c(2, 0, 3), xreg = clear_hp)
acf(result_armaerr$residuals)
qqnorm(result_armaerr$residuals)
qqline(result_armaerr$residuals)
The model can be written as: \[ Fatal^{HP}_n = -0.0057 + 0.0014\, Clear^{HP}_n + \epsilon_n, \] \[ \epsilon_n = 1.6859\, \epsilon_{n-1} - 0.9447\, \epsilon_{n-2} + \omega_n - 2.4026\, \omega_{n-1} + 1.8173\, \omega_{n-2} - 0.4144\, \omega_{n-3}, \] where \(\omega_n\) is a white noise process and the coefficients are those reported in the model summary above. The fitted values are plotted below.
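A minimal sketch of that comparison, again taking the fitted values as observations minus residuals (result_armaerr is the fit above):
fitted_hp <- fatal_hp - result_armaerr$residuals   # fitted = observed - residuals
plot(time1, fatal_hp, type = "l", xlab = "Year", ylab = "Detrended fatal crashes",
     main = "Observed vs Fitted (HP-detrended) Fatal Crashes")
lines(time1, fitted_hp, col = "red", lty = 2)
legend("topright", c("observed", "fitted"), col = c("black", "red"), lty = c(1, 2))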