## year month crash snowfall
## Min. :2004 Min. : 1.00 Min. :17752 Min. : 0.00
## 1st Qu.:2007 1st Qu.: 3.75 1st Qu.:21926 1st Qu.: 0.00
## Median :2010 Median : 6.50 Median :24776 Median : 2.00
## Mean :2010 Mean : 6.50 Mean :25688 Mean : 15.71
## 3rd Qu.:2013 3rd Qu.: 9.25 3rd Qu.:28182 3rd Qu.: 25.56
## Max. :2016 Max. :12.00 Max. :41560 Max. :105.50
First, we will focus on the traffic crash data to find some evidence for seasonality.
From smoothed periodogram, we can observe 3 main peaks. From left to right, the first one is the main peak, which is quite significant with a corresponding period of 1.03 year, and the second and third peaks may be potential harmonics of the first one.
To the left of the first peak, we can observe a low frequency variation, or trend.
The results are quite consistent with what we observe above.
par(mfrow=c(2,1),cex=0.8)
spectrum(mydata$crash,sub="",main="Unsmoothed Periodogram")
smoothed_periodogram = spectrum(mydata$crash,method="pgram",spans=c(5,5),main="Smoothed Periodogram for Traffic Crash",sub="")
periodogram_freq = smoothed_periodogram$freq[which.max(smoothed_periodogram$spec)]
## The estimated frequency through AR fit is : 0.08125 cycles per month
## The corrsponding period is : 12.30769 months, or 1.025641 years
From part 1, we see that the crash number goes down from 2004 to 2012 and then goes up a little bit. However, we are not sure if we should take trend into consideration, Therefore, in order to find a better model, let’s decompose the data first.
We use a local linear regression approach to decompose the data; we treat the high frequency part as noise and low frequency part as trend.
n = seq(1,156)
Crash = mydata$crash
Trend = ts(loess(Crash~n,span=0.5)$fitted,start=2004,frequency=12)
Noise = ts(Crash-loess(Crash~n,span=0.1)$fitted,start=2004,frequency=12)
Cycles = Crash - Trend - Noise
plot(ts.union(Crash,Trend,Noise,Cycles),main="Decomposition of Traffic Crash as Trend + Noise + Cycles",xlab="Year")
The trend is quite clear, therefore a model with ARMA error may be appropriate.
We write \(y_{1:N}^{\star}\) as the values of car crash number, \(z_{1:N}\) for the corresponding values of snowfall. We model \(y_{1:N}^{\star}\) coditional on \(z_{1:N}\) as a realizeation of time series model \(Y_{1:N}\) defined by
\[ Y_{n} = \alpha +\beta z_{n}+\epsilon_{n} \] * \(\epsilon_{1:N}\) is a stationary, causal, invertable, Gaussian ARMA(p,q) model satisfying
\[ \phi(B)\epsilon_{n}=\psi(B)\omega_{n} \] * \(\omega_{n}\) a Gaussian white noise and
\[ \omega_{n} \sim N(0,\sigma^{2})\\ \phi(x) = 1 - \phi_{1}x -...- \phi_{p}x^{p}\\ \psi(x) = 1 + \psi_{1}x +...+ \psi_{q}x^{q} \]
We condiser a table of AIC values for different ARMA(p,q) to find the best fit.
MA0 | MA1 | MA2 | MA3 | MA4 | |
---|---|---|---|---|---|
AR0 | 3045.19 | 2995.15 | 2973.38 | 2974.91 | 2976.90 |
AR1 | 2978.26 | 2978.50 | 2974.91 | 2976.91 | 2978.91 |
AR2 | 2977.79 | 2979.14 | 2976.90 | 2978.81 | 2977.33 |
AR3 | 2978.25 | 2978.85 | 2962.74 | 2964.49 | 2977.99 |
AR4 | 2978.47 | 2980.17 | 2964.33 | 2965.16 | 2960.88 |
The AIC table suggests that ARMA(3,2) may be a potential choice. However the ACF of residual shows a clear deviation from wihte noise, especially on the lag-11 lag-12 and lag-13 terms, which suggests a model with annual seasonality SMA(1) term may be better.
Let’s try \(SARIMA(p,0,q)\times (0,0,1)_{12}\) model. Again we start by an AIC table and ACF plot.
MA0 | MA1 | MA2 | MA3 | MA4 | |
---|---|---|---|---|---|
AR0 | 2921.25 | 2905.21 | 2887.41 | 2888.96 | 2889.67 |
AR1 | 2891.76 | 2878.62 | 2879.65 | 2874.81 | 2875.39 |
AR2 | 2881.61 | 2883.61 | 2881.79 | 2875.75 | 2877.31 |
AR3 | 2883.61 | 2885.39 | 2877.72 | 2877.39 | 2878.60 |
AR4 | 2881.69 | 2876.28 | 2878.28 | 2875.86 | 2879.30 |
\(SARMA(1,1)\times(1,1)\) seems to be a good choice. Though it doesn’t have the lowest AIC value, it is a small model and is easy to analyze.
##
## Call:
## arima(x = mydata$crash, order = c(1, 0, 1), seasonal = list(order = c(1, 0,
## 1), period = 12), xreg = mydata$snowfall)
##
## Coefficients:
## ar1 ma1 sar1 sma1 intercept mydata$snowfall
## 0.9535 -0.7099 0.9903 -0.7896 25526.828 90.6088
## s.e. 0.0412 0.1144 0.0105 0.1027 5912.013 13.5145
##
## sigma^2 estimated as 4744590: log likelihood = -1432.31, aic = 2878.62
As we can see from the above regression summary, \(\beta\) term is quite significant with a value of 90.6 and standard error only 13. via t-test. Also, all the coefficients of SARMA model are significant, with zero out of confidence intervals.
The final model can be written as:
\[ Crash_{n} = 25526.8 + 90.6\times Snowfall_{n} + \epsilon_{n} \]
\[ (1-0.95B)(1-0.99B^{12})\epsilon_{n}=(1 - 0.71B)(1 - 0.79B^{12})\omega_{n} \]
\[ \omega_{n} \sim N(0,\sigma^{2})\\ \]
AR_roots = polyroot(c(1,-fit2$coef["ar1"],rep(0,10),-fit2$coef["sar1"],fit2$coef["sar1"]*fit2$coef["ar1"]))
MA_roots = polyroot(c(1,fit2$coef["ma1"],rep(0,10),fit2$coef["sma1"],fit2$coef["sma1"]*fit2$coef["ma1"]))
min(abs(AR_roots))
## [1] 1.000814
min(abs(MA_roots))
## [1] 1.019881
library(forecast)
plot(time,mydata$crash,type="l",ylab="Crash",xlab="Year",main="Traffic Crash of Original Value and Fitted Value")
lines(time,fitted(fit2),col="red",lty="dashed")
legend("topright",c("Original Value","Fitted Value"),lty=c(1,2),col=c("black","red"))
Michigan snowfall data. Retrieved from http://www.mtu.edu/alumni/favorites/snowfall/
Michigan car crash data. Retrieved from https://www.michigantrafficcrashfacts.org/
Some time series technic from https://ionides.github.io/531w18/
Seasonality term identification from https://onlinecourses.science.psu.edu/stat510/node/67