We collect monthly counts of fatal traffic crashes in Michigan (1994-2014) from http://www.michigantrafficcrashfacts.org and look for trends and patterns in the number of crashes. Traffic safety offices can use such patterns to guide the scheduling of their policies, and related industries such as insurance may also find them useful for predictive analysis.
First, we explore the data with a quantitative summary and a time plot. The plot shows an obvious seasonal pattern and a slow decreasing trend over the 20 years.
## Date FatalCrash year
## Min. :1994-01-31 00:00:00 Min. : 47.00 Min. :1994
## 1st Qu.:1999-04-22 12:15:00 1st Qu.: 80.00 1st Qu.:1999
## Median :2004-07-15 12:00:00 Median : 98.00 Median :2004
## Mean :2004-07-15 12:48:34 Mean : 98.52 Mean :2004
## 3rd Qu.:2009-10-07 18:00:00 3rd Qu.:118.50 3rd Qu.:2009
## Max. :2014-12-31 00:00:00 Max. :166.00 Max. :2014
## month
## Min. : 1.00
## 1st Qu.: 3.75
## Median : 6.50
## Mean : 6.50
## 3rd Qu.: 9.25
## Max. :12.00
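For reference, here is a minimal sketch of how the summary and exploratory plot above could be produced; the file name is hypothetical, and the column names are taken from the summary output.
# Hypothetical file name; the actual data come from michigantrafficcrashfacts.org
data <- read.csv("michigan_fatal_crashes_monthly.csv")
data$Date <- as.POSIXct(data$Date)   # monthly dates, 1994-2014
summary(data)                        # quantitative summary shown above
plot(FatalCrash ~ Date, data = data, type = "l",
     ylab = "Monthly fatal crashes",
     main = "Monthly Fatal Traffic Crashes in Michigan, 1994-2014")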
The ACF plot also suggests strong autocorrelation and a clear seasonal pattern.
acf(data$FatalCrash, main = 'ACF Plot of Monthly Fatal Crashes')
By plotting the smoothed spectral density, we find that a frequency near 0.08 dominates the spectrum. Since the x-axis is in cycles per observation (month), this corresponds to a period of about 12.5 months, approximately one year. The error bar also suggests that a period of about 2 years is significant, a subharmonic of the annual cycle.
spectrum(data$FatalCrash,spans=c(3,5,3), main="Smoothed periodogram")
Then we decompose the series into components at different frequency scales and try to extract the business cycles. The low-frequency component is a decreasing trend, which suggests that fatal crashes have gone down over the last 20 years. This is reasonable given improvements in road conditions, car safety features, and state policies. We can also find a business cycle with a one-year period.
# Low-frequency component (trend): loess with a wide span
crash.low <- ts(loess(FatalCrash ~ time, span = 0.5)$fitted, frequency = 12)
# High-frequency component (noise): what a narrow-span loess cannot capture
crash.hi <- ts(FatalCrash - loess(FatalCrash ~ time, span = 0.05)$fitted, frequency = 12)
# Mid-frequency component (business cycles): what remains after removing trend and noise
crash.cycles <- FatalCrash - crash.hi - crash.low
plot(ts.union(FatalCrash, crash.low, crash.hi, crash.cycles),
     main = "Decomposition of Fatal Crashes as trend + noise + cycles", lwd = 1.5)
Due to the trend, a stationary ARMA model will not fit well here. We therefore plot the first-order differenced data below; the differenced series is mean stationary but still shows seasonality.
plot(c(NA,diff(data$FatalCrash))~time, lwd = 1, type = 'l', ylab = 'difference of fatal crash', main = 'Plot of Differenced Series')
Therefore, we try to fit a \(SARIMA(p,1,q)×(0,1,0)_{12}\) model and tabulate the AIC values for a range of choices of p and q.
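Each value in the AIC table below comes from a single fit of this form; a minimal sketch for one cell, using forecast::Arima as in the BIC code further down (the loop over p and q is analogous to the BIC table function shown later):
library(forecast)
# AIC for SARIMA(0,1,1)x(0,1,0)_12 -- one cell of the table below
fit <- Arima(data$FatalCrash, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 0), period = 12))
fit$aic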
|     | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|-----|-----|-----|-----|-----|-----|-----|
| AR0 | 2187.09 | 2018.17 | 2019.71 | 2021.09 | 2022.59 | 2024.54 |
| AR1 | 2105.96 | 2019.67 | 2021.56 | 2022.20 | 2024.15 | 2026.15 |
| AR2 | 2073.14 | 2021.23 | 2014.98 | 2015.07 | 2002.26 | 2002.05 |
| AR3 | 2064.90 | 2022.34 | 2023.96 | 2016.55 | 2004.30 | 2006.20 |
| AR4 | 2050.39 | 2024.21 | 2016.76 | 2018.35 | 2006.34 | 2009.52 |
| AR5 | 2045.36 | 2024.50 | 2018.57 | 2000.13 | 2002.01 | 1995.47 |
The AIC values keep decreasing as the model becomes more complex and end up favoring the largest model we considered, so we also compute BIC values to see whether the preferred model changes.
# Tabulate BIC for SARIMA(p,1,q)x(0,1,0)_12 over p = 0..P and q = 0..Q
# (requires the forecast package for Arima)
Table_For_ARMA_BIC <- function(data, P, Q){
  table <- matrix(NA, (P + 1), (Q + 1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p + 1, q + 1] <- Arima(data, order = c(p, 1, q),
                                   seasonal = list(order = c(0, 1, 0), period = 12))$bic
    }
  }
  # Bold AR labels render in the HTML kable output
  dimnames(table) <- list(paste("<b> AR", 0:P, "</b>", sep = ""), paste("MA", 0:Q, sep = ""))
  table
}
Fatal_bic_table <- Table_For_ARMA_BIC(data$FatalCrash,5,5)
kable(Fatal_bic_table,digits=2)
|     | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|-----|-----|-----|-----|-----|-----|-----|
| AR0 | 2190.57 | 2025.13 | 2030.14 | 2035.00 | 2039.97 | 2045.40 |
| AR1 | 2112.91 | 2030.10 | 2035.46 | 2039.58 | 2045.01 | 2050.49 |
| AR2 | 2083.57 | 2035.13 | 2032.36 | 2035.93 | 2026.59 | 2029.86 |
| AR3 | 2078.80 | 2039.72 | 2044.82 | 2040.89 | 2032.11 | 2037.48 |
| AR4 | 2067.77 | 2045.07 | 2041.10 | 2046.16 | 2037.63 | 2044.28 |
| AR5 | 2066.21 | 2048.84 | 2046.39 | 2031.42 | 2036.77 | 2033.71 |
The BIC reaches its minimum (2025.13) at p = 0 and q = 1. This combination also gives a local minimum of the AIC (2018.17) among models with small p and q, so we choose the \(SARIMA(0,1,1)×(0,1,0)_{12}\) model.
Analyzing the ACF of the residuals after fitting this model, we still see high autocorrelation at lag 12.
result_ar01 = arima(data$FatalCrash,order=c(0,1,1),seasonal=list(order=c(0,1,0),period=12))
acf(result_ar01$residuals)
We therefore try a \(SARIMA(0,1,1)×(0,1,1)_{12}\) model to further improve the fit.
result_ma1sma1 = arima(data$FatalCrash,order=c(0,1,1),seasonal=list(order=c(0,1,1),period=12))
summary(result_ma1sma1)
##
## Call:
## arima(x = data$FatalCrash, order = c(0, 1, 1), seasonal = list(order = c(0,
## 1, 1), period = 12))
##
## Coefficients:
## ma1 sma1
## -0.9156 -0.9141
## s.e. 0.0293 0.0589
##
## sigma^2 estimated as 141.6: log likelihood = -943, aic = 1892
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -1.09393 11.58909 8.860361 -1.978095 9.825396 0.6269948
## ACF1
## Training set -0.06262877
From the result above, comparing each coefficient with its standard error shows that both the ma1 and sma1 coefficients are significant; for example, an approximate 95% confidence interval for ma1 is \(-0.9156 \pm 1.96 \times 0.0293\), which is well away from zero. The AIC of this model is 1892, much smaller than before. Plotting the ACF of the residuals again, the residuals now look uncorrelated, and the Q-Q plot suggests they are approximately normal with slightly heavy tails.
acf(result_ma1sma1$residuals)
qqnorm(result_ma1sma1$residuals)
qqline(result_ma1sma1$residuals)
The model can be written as: \[ (1-B)(1-B^{12})X_n= (1-0.9156B)(1-0.9141B^{12}) \epsilon_n , \] where \(\epsilon_n\) is a white noise process. Comparing the fitted values with the observed fatal crash counts below, the SARIMA model performs well.
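A minimal sketch of how that comparison could be plotted, taking the fitted values as the observations minus the residuals of the fit above (result_ma1sma1):
fitted_sarima <- data$FatalCrash - result_ma1sma1$residuals   # fitted = observed - residuals
plot(data$Date, data$FatalCrash, type = "l",
     xlab = "Year", ylab = "Monthly fatal crashes",
     main = "Observed vs Fitted Fatal Crashes")
lines(data$Date, fitted_sarima, col = "red", lty = 2)
legend("topright", c("observed", "fitted"), col = c("black", "red"), lty = c(1, 2))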
The fatal crash series peaks in the summer. However, total traffic crashes happen much more often in the winter, which seems contradictory at first glance. On further investigation, we notice that crashes occurring in clear weather conditions (clear crashes) track fatal crashes closely, so we analyze fatal crashes with a regression on clear-weather crashes (years 2004 to 2014) with ARMA errors.
## Date Total.Crash Fatal.Crash Clear.Crash
## 10/31/04: 1 Min. :17752 Min. : 47.00 Min. : 7354
## 10/31/05: 1 1st Qu.:21740 1st Qu.: 68.00 1st Qu.:11103
## 10/31/06: 1 Median :24912 Median : 81.00 Median :13524
## 10/31/07: 1 Mean :25744 Mean : 82.54 Mean :13246
## 10/31/08: 1 3rd Qu.:28200 3rd Qu.: 96.00 3rd Qu.:15387
## 10/31/09: 1 Max. :41560 Max. :130.00 Max. :22939
## (Other) :126
Since we are interested in changes over business-cycle timescales, we use the Hodrick-Prescott (HP) filter to extract the cyclical component, and find that the detrended fatal crashes and detrended clear crashes cycle together.
# HP filter (mFilter package); freq is the smoothing parameter lambda
fatal_hp <- hpfilter(compare.data$Fatal.Crash, freq = 50, type = "lambda", drift = FALSE)$cycle
clear_hp <- hpfilter(compare.data$Clear.Crash, freq = 50, type = "lambda", drift = FALSE)$cycle
# Overlay the two detrended series, with a separate axis for each scale
plot(time1, fatal_hp, type = "l", xlab = "Year", ylab = "")
par(new = TRUE)
plot(time1, clear_hp, col = "red", type = "l", lty = 2, axes = FALSE, xlab = "", ylab = "",
     main = 'Detrended Fatal Crash and Clear Crash')
legend("topright", c('fatal_hp', 'clear_hp'), lwd = 1, lty = c(1, 2), col = c('black', 'red'))
axis(side = 4, col = "red")   # right-hand axis for the clear-crash scale
Therefore, we fit a regression of one detrended series on the other with ARMA errors, and tabulate AIC values over the orders of the error model. The AIC table below selects the ARMA(2,3) errors model as the best, with AIC = 937.44.
|     | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|-----|-----|-----|-----|-----|-----|-----|
| AR0 | 1040.26 | 1040.72 | 1041.73 | 1015.91 | 1015.78 | 1013.72 |
| AR1 | 1040.55 | 1042.52 | 1043.57 | 1015.67 | 1016.54 | 1015.71 |
| AR2 | 1042.43 | 1043.95 | 951.16 | 937.44 | 939.50 | 1014.41 |
| AR3 | 1041.73 | 994.14 | 940.67 | 940.38 | 940.72 | 1016.09 |
| AR4 | 1038.48 | 980.30 | 940.67 | 942.00 | 942.82 | 941.99 |
| AR5 | 1027.95 | 969.45 | 940.60 | 942.91 | 944.46 | 944.96 |
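A sketch of how this table could be built, analogous to the SARIMA BIC table above but using stats::arima with the clear-crash cycle as a regressor (the function name is an assumption; fatal_hp and clear_hp come from the HP filter code above):
# AIC for a regression of fatal_hp on clear_hp with ARMA(p,q) errors
Table_For_ARMAerr_AIC <- function(y, xreg, P, Q){
  table <- matrix(NA, (P + 1), (Q + 1))
  for(p in 0:P) {
    for(q in 0:Q) {
      table[p + 1, q + 1] <- arima(y, order = c(p, 0, q), xreg = xreg)$aic
    }
  }
  dimnames(table) <- list(paste("AR", 0:P, sep = ""), paste("MA", 0:Q, sep = ""))
  table
}
kable(Table_For_ARMAerr_AIC(fatal_hp, clear_hp, 5, 5), digits = 2)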
Looking at the details of the ARMA(2,3) errors model, all coefficients are significant relative to their standard errors. The coefficient of clear_hp is 0.0014; since fatal crashes and clear crashes are on very different scales, this small magnitude still represents a significant linear relationship between the two series.
##
## Call:
## arima(x = fatal_hp, order = c(2, 0, 3), xreg = clear_hp)
##
## Coefficients:
## ar1 ar2 ma1 ma2 ma3 intercept clear_hp
## 1.6859 -0.9447 -2.4026 1.8173 -0.4144 -0.0057 0.0014
## s.e. 0.0274 0.0257 0.1108 0.2050 0.0998 0.0026 0.0006
##
## sigma^2 estimated as 56.24: log likelihood = -460.72, aic = 937.44
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.5323786 7.499009 6.104763 32.31389 130.0341 0.4699667
## ACF1
## Training set 0.01028354
The ACF plot and Q-Q plot of the residuals below also suggest that the residuals are approximately Gaussian white noise, indicating that the ARMA errors model fits well.
# Fit the regression with ARMA(2,3) errors (the model summarized above)
result_armaerr <- arima(fatal_hp, order = c(2, 0, 3), xreg = clear_hp)
acf(result_armaerr$residuals)
qqnorm(result_armaerr$residuals)
qqline(result_armaerr$residuals)
The model can be written as: \[ Fatal^{HP}_n = -0.0057 + 0.0014\, Clear^{HP}_n + \epsilon_n, \] \[ \epsilon_n = 1.6859\, \epsilon_{n-1} - 0.9447\, \epsilon_{n-2} + \omega_n - 2.4026\, \omega_{n-1} + 1.8173\, \omega_{n-2} - 0.4144\, \omega_{n-3}, \] where \(\omega_n\) is a white noise process and the coefficients are those reported in the model summary above. The fitted values are plotted below.
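A minimal sketch of that comparison, again taking the fitted values as observations minus residuals (result_armaerr is the fit above):
fitted_hp <- fatal_hp - result_armaerr$residuals   # fitted = observed - residuals
plot(time1, fatal_hp, type = "l", xlab = "Year", ylab = "Detrended fatal crashes",
     main = "Observed vs Fitted (HP-detrended) Fatal Crashes")
lines(time1, fitted_hp, col = "red", lty = 2)
legend("topright", c("observed", "fitted"), col = c("black", "red"), lty = c(1, 2))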