1 Introduction

This report analyzes the trend and the changing seasonal pattern of monthly traffic fatality counts in Ontario. By analyzing the fatality time series, we can learn something about driving behavior in Ontario and alert drivers and pedestrians to the periods in which fatal accidents are most likely. The data used in this report are provided by datamarket.com; the link is given in reference [1]. The dataset contains 180 values recording monthly traffic fatalities in Ontario from 1960 to 1974.

Section 2 gives a brief overview of the data. Section 3 fits two models to explain the behavior of the fatality counts. Section 4 summarizes the analysis and restates the most important conclusions from sections 2 and 3.



2 Exploratory Data Analysis

First, let’s look at the format of the data.

##     Month Monthly.traffic.fatalities.in.Ontario.1960.1974
## 1 1960-01                                              61
## 2 1960-02                                              65
## 3 1960-03                                              55
## 4 1960-04                                              56
## 5 1960-05                                              91
## 6 1960-06                                              80

The first column is the time stamp, and the second column is the number of traffic fatalities in that month. There are no missing values in the data.

I remap the first column into three columns, “time”, “year” and “month”, where “time” is the number of months elapsed since January 1960.

##   time year month fatality
## 1    1 1960     1       61
## 2    2 1960     2       65
## 3    3 1960     3       55
## 4    4 1960     4       56
## 5    5 1960     5       91
## 6    6 1960     6       80

Local Variance of the Original Fatality Time Series

                 Points 1 to 36   Points 37 to 72   Points 73 to 108   Points 109 to 144   Points 145 to 180
Local variance         976.6563          966.1206           1139.837            1370.999            1879.704

From the time plot we can see that the gap between the smallest and largest values within a year widens as time passes. In other words, the local variance of the time series is increasing, which conflicts with the stationarity assumption. I therefore apply the transformation \(f(x) = x^{0.2}\) to the original data, after which the local variance is approximately constant over time.
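A minimal sketch of this check, assuming `dat` holds the 180 monthly counts in a column named `fatality` as shown above (`window_id` is an illustrative name):

# Split the series into five non-overlapping 36-month windows
# and compute the variance within each window.
window_id <- rep(1:5, each = 36)
sapply(split(dat$fatality, window_id), var)

# Apply the variance-stabilizing transformation f(x) = x^0.2 and recheck.
dat$fatality <- dat$fatality^0.2
sapply(split(dat$fatality, window_id), var)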

Local Variance of the Transformed Fatality Time Series

                 Points 1 to 36   Points 37 to 72   Points 73 to 108   Points 109 to 144   Points 145 to 180
Local variance        0.0252381         0.0179159          0.0192592           0.0219338           0.0248845

The data also show strong seasonality, apparently with a 12-month cycle. The spectral density plot confirms this.

To get a better view of the data, I decompose the time series by frequency band. The lowest-frequency component is the trend, the highest-frequency component is the noise, and the mid-frequency component, which carries most of the power, is the main cycle. The decomposition plot also clearly shows the 12-month cycles.
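One way to produce such a decomposition is loess smoothing at two bandwidths; a hedged sketch is below, where the span values are illustrative choices rather than the report's exact settings:

# Low-frequency fit = trend; residual from a very local fit = noise;
# what remains in between = the seasonal cycles.
fatality_ts <- ts(dat$fatality, start = 1960, frequency = 12)
trend <- ts(loess(fatality ~ time, data = dat, span = 0.5)$fitted,
            start = 1960, frequency = 12)
noise <- ts(dat$fatality - loess(fatality ~ time, data = dat, span = 0.1)$fitted,
            start = 1960, frequency = 12)
cycles <- fatality_ts - trend - noise
plot(ts.union(fatality_ts, trend, noise, cycles),
     main = "Decomposition of fatalities as trend + noise + cycles")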



3 Time Series Analysis

Because of the strong seasonality found in section 2, I only consider models that can represent this cyclic behavior.

3.1 ARMA Errors Model

The first model I fit is an ARMA errors model with a seasonal component.

3.1.1 Detrending Data

I fit two regression models to detrend the data:

\[ X_n^{0.2} = \beta_0 + \beta_1\, t_n + \eta_n \] \[ X_n^{0.2} = \beta_0 + \beta_1\, t_n + \beta_2\, t_n^2 + \eta_n \]

Here \(t_n\) is the month index (“time”) and \(\eta_n\) is a stationary ARMA error process; both regressions are fitted to the transformed data \(X_n^{0.2}\).
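The two fits whose summaries follow can be reproduced as sketched here, recalling that `dat$fatality` now holds the transformed series (the object names `lm_quadratic` and `lm_linear` are illustrative):

# Quadratic and linear trend regressions on the transformed data.
lm_quadratic <- lm(fatality ~ time + I(time^2), data = dat)
lm_linear    <- lm(fatality ~ time, data = dat)
summary(lm_quadratic)
summary(lm_linear)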

## 
## Call:
## lm(formula = fatality ~ time + I(time^2), data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31389 -0.10689  0.02395  0.11603  0.30915 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.467e+00  3.238e-02  76.186  < 2e-16 ***
## time         2.775e-03  8.259e-04   3.360 0.000953 ***
## I(time^2)   -7.418e-06  4.420e-06  -1.678 0.095032 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1432 on 177 degrees of freedom
## Multiple R-squared:  0.2253, Adjusted R-squared:  0.2166 
## F-statistic: 25.74 on 2 and 177 DF,  p-value: 1.541e-10
## 
## Call:
## lm(formula = fatality ~ time, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.30793 -0.10638  0.02846  0.11703  0.28911 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.5073505  0.0215433  116.39  < 2e-16 ***
## time        0.0014328  0.0002064    6.94 7.02e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1439 on 178 degrees of freedom
## Multiple R-squared:  0.213,  Adjusted R-squared:  0.2086 
## F-statistic: 48.17 on 1 and 178 DF,  p-value: 7.019e-11

The summary results show that the \(time^2\) term is not significant, so the linear model \(X_n^{0.2} = \beta_0 + \beta_1 t_n + \eta_n\) is used below.

3.1.2 ARMA Model for Residuals

Next, we fit stationary Gaussian \(ARMA(p,0,q)\times(1,0,0)_{12}\) models to the residuals of the linear regression. It is natural to choose \((p,q)\) by AIC: a lower AIC reflects a higher log-likelihood penalized by the number of parameters.
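A sketch of the table computation follows; the function name `aic_table` and the residual vector `resid_lm` are illustrative choices:

# Fit a seasonal ARMA(p,d,q)x(1,0,0)_12 model for each (p,q) cell
# and record its AIC; keep NA when a fit fails to converge.
aic_table <- function(data, P, Q, d = 0) {
  table <- matrix(NA, P + 1, Q + 1,
                  dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) for (q in 0:Q) {
    table[p + 1, q + 1] <- tryCatch(
      arima(data, order = c(p, d, q),
            seasonal = list(order = c(1, 0, 0), period = 12),
            method = "ML")$aic,
      error = function(e) NA)
  }
  table
}
resid_lm <- resid(lm(fatality ~ time, data = dat))
aic_table(resid_lm, P = 4, Q = 5)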

AIC Values of ARMA Models for the Residuals

            MA0         MA1         MA2         MA3         MA4         MA5
AR0   -354.3365   -366.5590   -373.5274   -374.5916   -376.4186   -375.7235
AR1   -372.3934   -372.4104   -373.1932   -378.5928   -376.7593   -374.7732
AR2   -373.3752   -371.4579   -371.7231   -376.7469   -378.4876   -376.4879
AR3   -371.7141   -374.7299   -374.7566   -375.7795   -376.4879   -374.4886
AR4   -376.8714   -375.1923   -377.4941   -406.3402   -396.9142   -372.7726

The result suggests \(ARMA(1,0,3)\times(1,0,0)_{12}\) as the model to choose. Although some values in the table (such as that of \(ARMA(4,0,3)\times(1,0,0)_{12}\)) are smaller than -378.5928, moving from those models to a neighbor with one additional parameter raises the AIC by far more than the 2 units that are mathematically possible under exact maximization, which points to numerical difficulties in the optimization. So the simpler model \(ARMA(1,0,3)\times(1,0,0)_{12}\) is the best choice.

## 
## Call:
## arima(x = dat$fatality, order = c(1, 0, 3), seasonal = list(order = c(1, 0, 
##     0), period = 12), method = "ML")
## 
## Coefficients:
##           ar1     ma1     ma2     ma3    sar1  intercept
##       -0.6900  1.0479  0.5213  0.3406  0.7107    -0.0148
## s.e.   0.1338  0.1266  0.1138  0.0776  0.0598     0.0306
## 
## sigma^2 estimated as 0.006305:  log likelihood = 196.3,  aic = -378.59

Each parameter \(\phi_i\) has the approximate 95% confidence interval \([\hat{\phi}_i-1.96\,\hat{\sigma}_i,\ \hat{\phi}_i+1.96\,\hat{\sigma}_i]\), where \(\hat{\sigma}_i\) is its standard error. Every ARMA coefficient satisfies \(|\hat{\phi}_i| > 4\,\hat{\sigma}_i\) (the intercept is expected to be near zero because the series was detrended), so none of the coefficient confidence intervals contains 0: all ARMA parameters are significant.
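These intervals can be computed from the fit; a minimal sketch, refitting the model with the call shown above (the object name `SARMA` is taken from the Shapiro-Wilk output below):

# Approximate 95% confidence intervals: estimate +/- 1.96 standard errors,
# with standard errors from the estimated coefficient covariance matrix.
SARMA <- arima(dat$fatality, order = c(1, 0, 3),
               seasonal = list(order = c(1, 0, 0), period = 12),
               method = "ML")
se <- sqrt(diag(SARMA$var.coef))
cbind(estimate = coef(SARMA),
      lower    = coef(SARMA) - 1.96 * se,
      upper    = coef(SARMA) + 1.96 * se)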

The model is \[ X_n^{0.2} = Y_n + 0.00143\,t_n + 2.507 \] \[ (1-0.7107B^{12})(1+0.6900B)\,Y_n = \epsilon_n + 1.0479\,\epsilon_{n-1} + 0.5213\,\epsilon_{n-2} + 0.3406\,\epsilon_{n-3} \] \[ \epsilon_n \sim\ iid\ N(0,\,0.006305) \]

With this model, we can simulate the time series on a computer. In the simulation plot, the red line is the time series generated by the SARMA model and the blue line is the detrended original data (the ARMA errors).
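A minimal sketch of such a simulation, assuming the fitted coefficients above and reusing `resid_lm` from the sketch in section 3.1.2 (the seed is arbitrary). Since arima.sim() takes a plain ARMA specification, the seasonal AR factor is multiplied into the nonseasonal one: \((1+0.6900B)(1-0.7107B^{12}) = 1 + 0.69B - 0.7107B^{12} - 0.4904B^{13}\).

# Full AR polynomial coefficients in R's (1 - phi_1 B - ... ) convention.
set.seed(531)
ar_full <- c(-0.6900, rep(0, 10), 0.7107, 0.7107 * 0.6900)
sim_sarma <- arima.sim(model = list(ar = ar_full,
                                    ma = c(1.0479, 0.5213, 0.3406)),
                       n = 180, sd = sqrt(0.006305))
plot(sim_sarma, col = "red",
     main = "Simulated SARMA (red) vs detrended data (blue)")
lines(as.numeric(resid_lm), col = "blue")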

3.1.3 Diagnostic Analysis

Then we run some diagnostics on the residuals.

From the ACF plot we can see three violations (at lags 12, 19 and 20) of the pointwise confidence bands, which suggests the residuals deviate from the null hypothesis of independent and identically distributed noise. The spectral density plot also contains some peaks, but no dominant frequency can be identified. This indicates that some local patterns are not captured well by my model. Since the dataset is too small to support more complex models, this problem should be reconsidered if a much larger dataset becomes available.

The QQ plot shows that the residuals are very close to Gaussian. I then use the Shapiro-Wilk test to judge whether the residuals are approximately Gaussian.
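The diagnostics described above can be reproduced with a short sketch (the smoothing spans for the spectrum are illustrative):

# Sample ACF, smoothed spectral estimate, normal QQ plot,
# and the Shapiro-Wilk test reported below.
acf(SARMA$residuals, lag.max = 30)
spectrum(SARMA$residuals, spans = c(3, 5, 3))
qqnorm(SARMA$residuals); qqline(SARMA$residuals)
shapiro.test(SARMA$residuals)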

## 
##  Shapiro-Wilk normality test
## 
## data:  SARMA$residuals
## W = 0.99504, p-value = 0.8156

The p-value of the Shapiro-Wilk test is 0.8156, much larger than 0.05, so we fail to reject the null hypothesis of normality. The residuals of the ARMA errors model are consistent with Gaussian white noise. In conclusion, although this model misses some patterns in the time series, it performs very well overall.



3.2 SARIMA Model

From section 2 we know this time series has a trend. Perhaps the simplest way to detrend is differencing. In this section, I use a SARIMA model with first-order differencing to analyze the fatality numbers.

3.2.1 Fitting SARIMA Model

The models take the form \(ARIMA(p,1,q)\times(1,0,0)_{12}\), with \((p,q)\) again chosen by AIC. Because of the first-order differencing, I can use the transformed data directly without detrending. As before, a lower AIC reflects a higher log-likelihood penalized by the number of parameters.
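The `aic_table()` sketch from section 3.1.2 extends directly: passing `d = 1` fits \(ARIMA(p,1,q)\times(1,0,0)_{12}\) to the transformed series itself.

# Reuse the illustrative helper with first-order differencing.
aic_table(dat$fatality, P = 4, Q = 5, d = 1)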

AIC Values of SARIMA Models for the Transformed Data

            MA0         MA1         MA2         MA3         MA4         MA5
AR0   -313.8433   -352.8165   -358.7116   -364.7138   -365.6781   -367.3500
AR1   -336.6633   -363.6909   -364.0339   -364.3775   -369.8664   -368.0261
AR2   -338.9111   -364.9898   -363.0214   -362.7105   -368.0183   -366.0259
AR3   -337.4963   -363.1017   -366.4616   -377.5595   -367.9630   -366.4699
AR4   -347.9651   -366.5802   -366.9499   -375.6639   -366.8297   -368.1611

The result suggests \(ARIMA(1,1,4)\times(1,0,0)_{12}\) as the model to choose. Although some values in the table (such as that of \(ARIMA(3,1,3)\times(1,0,0)_{12}\)) are smaller than -369.8664, their neighboring entries again change by more than is mathematically possible when one parameter is added, indicating numerical difficulties in the optimization. So the simpler model \(ARIMA(1,1,4)\times(1,0,0)_{12}\) is the best choice.

## 
## Call:
## arima(x = dat$fatality, order = c(1, 1, 4), seasonal = list(order = c(1, 0, 
##     0), period = 12), method = "ML")
## 
## Coefficients:
##           ar1     ma1      ma2      ma3      ma4    sar1
##       -0.6968  0.0796  -0.5050  -0.1625  -0.3286  0.7303
## s.e.   0.1318  0.1255   0.1154   0.0781   0.0787  0.0593
## 
## sigma^2 estimated as 0.006499:  log likelihood = 191.93,  aic = -369.87

Each parameter again has the approximate 95% confidence interval \([\hat{\phi}_i-1.96\,\hat{\sigma}_i,\ \hat{\phi}_i+1.96\,\hat{\sigma}_i]\). For “ar1”, “ma2”, “ma3”, “ma4” and “sar1” we have \(|\hat{\phi}_i| > 1.96\,\hat{\sigma}_i\), so their confidence intervals exclude 0 and those parameters are significant. Only “ma1” has a confidence interval that contains zero.

The model is \[ Y_n = X_n^{0.2} - X_{n-1}^{0.2} \] \[ (1-0.7303B^{12})(1+0.6968B)\,Y_n = \epsilon_n + 0.0796\,\epsilon_{n-1} - 0.5050\,\epsilon_{n-2} - 0.1625\,\epsilon_{n-3} - 0.3286\,\epsilon_{n-4} \] \[ \epsilon_n \sim\ iid\ N(0,\,0.006499) \]

With this model, we can simulate the time series on a computer. In the simulation plot, the red line is the time series generated by the SARIMA model and the blue line is the original (transformed) fatality series.
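The same simulation trick as in section 3.1.2 applies; passing `order = c(13, 1, 4)` lets arima.sim() handle the first-order differencing, with the seasonal AR factor again folded into the AR polynomial. The seed and the level shift are illustrative choices.

# Simulate the fitted SARIMA model and overlay the transformed data.
set.seed(531)
ar_full <- c(-0.6968, rep(0, 10), 0.7303, 0.7303 * 0.6968)
sim_sarima <- arima.sim(model = list(order = c(13, 1, 4), ar = ar_full,
                                     ma = c(0.0796, -0.5050, -0.1625, -0.3286)),
                        n = 180, sd = sqrt(0.006499))
sim_sarima <- sim_sarima + dat$fatality[1]   # shift: the simulation starts at 0
plot(sim_sarima, col = "red",
     main = "Simulated SARIMA (red) vs transformed data (blue)")
lines(dat$fatality, col = "blue")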

3.2.2 Diagnostic Analysis

Then we run the same diagnostics on the residuals.

From the ACF plot we can see three violations (at lags 10, 19 and 20) of the pointwise confidence bands, so this model is also not complex enough to capture all the local structure. The spectral density plot contains several peaks with wide confidence intervals; no dominant frequency can be identified because the peaks have similar power. This again suggests that a larger dataset would be needed to fit more complex models.

The QQ plot shows that the residuals have slightly heavier tails than a Gaussian distribution. I then use the Shapiro-Wilk test to judge whether the residuals are approximately Gaussian.

## 
##  Shapiro-Wilk normality test
## 
## data:  SARIMA$residuals
## W = 0.99438, p-value = 0.7306

The p-value of the Shapiro-Wilk test is 0.7306, much larger than 0.05, so again we fail to reject the null hypothesis of normality: the residuals of the SARIMA model are also consistent with Gaussian white noise. Judging from the QQ plots, the ARMA errors model performs slightly better than the SARIMA model, but the SARIMA model also works well for this task.



4 Conclusion

In this report, we analyzed the Ontario traffic fatality time series. From the exploratory data analysis, model fitting, and diagnostic analysis, we reach five main conclusions:

1. The local variance of the raw counts grows over time, so the power transformation \(f(x) = x^{0.2}\) is needed to make the series approximately stationary in variance.
2. The series combines an increasing long-term trend with strong 12-month seasonality.
3. The ARMA errors model, a linear trend with \(ARMA(1,0,3)\times(1,0,0)_{12}\) errors, fits the transformed data well, and its residuals are consistent with Gaussian white noise.
4. The SARIMA model \(ARIMA(1,1,4)\times(1,0,0)_{12}\) handles the trend by differencing and also describes the data well, though its residual QQ plot is slightly worse.
5. Both models leave a few ACF violations uncaptured; fitting more complex models would require a larger dataset.



5 References

[1] Data source: https://datamarket.com/data/set/22ty/monthly-traffic-fatalities-in-ontario-1960-1974#!ds=22ty&display=line
[2] Shapiro-Wilk test: https://en.wikipedia.org/wiki/Shapiro-Wilk_test
[3] Class notes: https://ionides.github.io/531w18/04/notes04.html
[4] Class notes: https://ionides.github.io/531w18/05/notes05.html
[5] Class notes: https://ionides.github.io/531w18/06/notes06.html