This report analyses the trend and changing pattern of the number of traffic fatalities in Ontario. By studying the traffic fatality time series, we may learn something about people's driving habits and travel purposes in Ontario, which in turn can alert drivers and pedestrians to the conditions under which traffic fatalities are most likely. The data used in the report are provided by "datamarket.com" and are available here. The dataset contains 180 monthly counts of traffic fatalities in Ontario from 1960 to 1974.
Section 2 gives a brief overview of the data. Section 3 fits two models to explain the behavior of the fatality counts. Section 4 offers some conjectures drawn from everyday experience. Section 5 summarizes the analysis and restates the most important conclusions from sections 3 and 4.
First, let’s look at the format of the data.
## Month Monthly.traffic.fatalities.in.Ontario.1960.1974
## 1 1960-01 61
## 2 1960-02 65
## 3 1960-03 55
## 4 1960-04 56
## 5 1960-05 91
## 6 1960-06 80
The first column is the time stamp, and the second column is the number of traffic fatalities in that month. There are no missing values in the data.
I remap the first column into three columns, "time", "year", and "month", where "time" is the number of months elapsed since January 1960.
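Below is a minimal sketch of this remapping, assuming the raw data frame is called `raw` with a "Month" column formatted like "1960-01" and the fatality count in the second column.

```r
# Split the "1960-01"-style time stamp into year and month, and add a
# running month index starting at 1 for January 1960.
dat <- data.frame(
  time     = seq_len(nrow(raw)),                 # months since January 1960
  year     = as.integer(substr(raw$Month, 1, 4)),
  month    = as.integer(substr(raw$Month, 6, 7)),
  fatality = raw[[2]]
)
head(dat)
```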
## time year month fatality
## 1 1 1960 1 61
## 2 2 1960 2 65
## 3 3 1960 3 55
## 4 4 1960 4 56
## 5 5 1960 5 91
## 6 6 1960 6 80
|  | Points 1–36 | Points 37–72 | Points 73–108 | Points 109–144 | Points 145–180 |
|---|---|---|---|---|---|
| Local variance | 976.6563 | 966.1206 | 1139.837 | 1370.999 | 1879.704 |
From the time plot we can see that the gap between the smallest and the largest values within a year widens as time passes. In other words, the local variance of the time series is increasing, which conflicts with the definition of stationarity. I therefore apply the transformation \(f(x) = x^{0.2}\) to the original data. After the transformation, the variance of the time series is consistent with stationarity.
|  | Points 1–36 | Points 37–72 | Points 73–108 | Points 109–144 | Points 145–180 |
|---|---|---|---|---|---|
| Local variance | 0.0252381 | 0.0179159 | 0.0192592 | 0.0219338 | 0.0248845 |
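A sketch of how these local variances may have been computed, assuming five non-overlapping windows of 36 points each; the transform is applied to the `fatality` column in place, since the later model calls operate on the transformed values.

```r
# Local variance of the original counts in five windows of 36 points.
sapply(split(dat$fatality, rep(1:5, each = 36)), var)

# Variance-stabilizing transform, applied in place, then the same
# windowed variances on the transformed scale.
dat$fatality <- dat$fatality^0.2
sapply(split(dat$fatality, rep(1:5, each = 36)), var)
```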
The data also shows strong seasonality, apparently on a 12-month cycle. The spectral density plot confirms this.
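A sketch of the spectral density estimate; the smoothing spans are an assumption.

```r
# Smoothed periodogram of the transformed series; the dominant peak
# near frequency 1/12 (cycles per month) marks the 12-month cycle.
spectrum(dat$fatality, spans = c(3, 5, 3), main = "Smoothed periodogram")
```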
To get a better view of the data, I decompose the time series into components of different frequencies. The lowest-frequency component is the trend, the highest-frequency component is the noise, and the mid-frequency component carrying the main power is the principal cycle. The decomposition plot also clearly shows the 12-month cycles.
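The report's exact decomposition method is not shown; the sketch below uses `stl()` as one common way to obtain a comparable trend/seasonal/remainder split.

```r
# Loess-based decomposition into trend (low frequency), seasonal
# component (the 12-month cycle), and remainder (high-frequency noise).
fat_ts <- ts(dat$fatality, start = c(1960, 1), frequency = 12)
plot(stl(fat_ts, s.window = "periodic"))
```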
Because of the strong seasonality found in section 2, I only consider models that incorporate this cyclicity.
The first model I fit is a linear regression with seasonal ARMA errors.
I fit two regression models to detrend the data:
\[ X_n = \beta_0 + \beta_1 \cdot time + \text{ARMA errors} \] \[ X_n = \beta_0 + \beta_1 \cdot time + \beta_2 \cdot time^2 + \text{ARMA errors} \]
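A sketch of the two detrending regressions whose summaries are shown below; note that by this point `dat$fatality` holds the transformed series.

```r
# Quadratic and linear detrending regressions on the month index.
fit_quad <- lm(fatality ~ time + I(time^2), data = dat)
fit_lin  <- lm(fatality ~ time, data = dat)
summary(fit_quad)
summary(fit_lin)
```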
##
## Call:
## lm(formula = fatality ~ time + I(time^2), data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31389 -0.10689 0.02395 0.11603 0.30915
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.467e+00 3.238e-02 76.186 < 2e-16 ***
## time 2.775e-03 8.259e-04 3.360 0.000953 ***
## I(time^2) -7.418e-06 4.420e-06 -1.678 0.095032 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1432 on 177 degrees of freedom
## Multiple R-squared: 0.2253, Adjusted R-squared: 0.2166
## F-statistic: 25.74 on 2 and 177 DF, p-value: 1.541e-10
##
## Call:
## lm(formula = fatality ~ time, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30793 -0.10638 0.02846 0.11703 0.28911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.5073505 0.0215433 116.39 < 2e-16 ***
## time 0.0014328 0.0002064 6.94 7.02e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1439 on 178 degrees of freedom
## Multiple R-squared: 0.213, Adjusted R-squared: 0.2086
## F-statistic: 48.17 on 1 and 178 DF, p-value: 7.019e-11
The summary results show that the \(time^2\) term is not significant (p = 0.095), so \(X_n = \beta_0 + \beta_1 \cdot time\) is the model used below.
Next, we can fit a stationary Gaussian \(ARMA(p,0,q)\times(1,0,0)_{12}\) model to the detrended residuals. It is natural to choose (p,q) by AIC score: a lower AIC reflects a higher log-likelihood penalized by the number of parameters.
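A sketch of the AIC tabulation; fits that fail numerically are recorded as NA, and the `d` argument lets the same helper be reused with differencing in section 3.2. `fit_lin` refers to the linear regression sketched above.

```r
# Tabulate AIC over ARMA(p,d,q) x (1,0,0)_12 fits for p = 0..P, q = 0..Q.
aic_table <- function(x, P, Q, d = 0) {
  tab <- matrix(NA, P + 1, Q + 1,
                dimnames = list(paste0("AR", 0:P), paste0("MA", 0:Q)))
  for (p in 0:P) for (q in 0:Q) {
    tab[p + 1, q + 1] <- tryCatch(
      arima(x, order = c(p, d, q),
            seasonal = list(order = c(1, 0, 0), period = 12),
            method = "ML")$aic,
      error = function(e) NA)   # keep NA where the optimizer fails
  }
  tab
}
aic_table(resid(fit_lin), 4, 5)
```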
|  | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|
| AR0 | -354.3365 | -366.5590 | -373.5274 | -374.5916 | -376.4186 | -375.7235 |
| AR1 | -372.3934 | -372.4104 | -373.1932 | -378.5928 | -376.7593 | -374.7732 |
| AR2 | -373.3752 | -371.4579 | -371.7231 | -376.7469 | -378.4876 | -376.4879 |
| AR3 | -371.7141 | -374.7299 | -374.7566 | -375.7795 | -376.4879 | -374.4886 |
| AR4 | -376.8714 | -375.1923 | -377.4941 | -406.3402 | -396.9142 | -372.7726 |
The result suggests \(ARMA(1,0,3)\times(1,0,0)_{12}\) as the model of choice. Although a few values in the table are smaller than -378.5928 (such as \(ARMA(4,0,3)\times(1,0,0)_{12}\) at -406.3402), their inconsistency with neighboring entries points to numerical difficulties in the optimization rather than genuinely better fits. So the simpler model \(ARMA(1,0,3)\times(1,0,0)_{12}\) is the best choice.
##
## Call:
## arima(x = dat$fatality, order = c(1, 0, 3), seasonal = list(order = c(1, 0,
## 0), period = 12), method = "ML")
##
## Coefficients:
## ar1 ma1 ma2 ma3 sar1 intercept
## -0.6900 1.0479 0.5213 0.3406 0.7107 -0.0148
## s.e. 0.1338 0.1266 0.1138 0.0776 0.0598 0.0306
##
## sigma^2 estimated as 0.006305: log likelihood = 196.3, aic = -378.59
Each parameter \(\hat{\phi}_i\) has the 95% confidence interval \([\hat{\phi}_i-1.96\,\hat{\sigma}_i,\ \hat{\phi}_i+1.96\,\hat{\sigma}_i]\), where \(\hat{\sigma}_i\) is its standard error. For every ARMA and seasonal coefficient in this model, \(|\hat{\phi}_i|\) is more than four times \(\hat{\sigma}_i\), so none of their confidence intervals contains 0 and all of these parameters are significant. (The intercept is not significantly different from 0, as expected for detrended residuals.)
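These intervals can be computed directly from the fit; the sketch assumes the fitted model object is stored as `SARMA`, matching the residual output later in this section.

```r
# 95% confidence intervals for the fitted coefficients.
se <- sqrt(diag(SARMA$var.coef))
cbind(lower = SARMA$coef - 1.96 * se,
      upper = SARMA$coef + 1.96 * se)
```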
The model is \[ X_n^{0.2} = Y_n + 0.00143 \cdot time + 2.507 \] \[ (1-0.7107B^{12})(1+0.6900B)Y_n = \epsilon_n+1.0479\epsilon_{n-1}+0.5213\epsilon_{n-2}+0.3406\epsilon_{n-3} \] \[ \epsilon_n \overset{iid}{\sim} N(0,\,0.006305) \]
With this model, we can simulate the time series on the computer. In the simulation plot, the red line is the time series generated by the SARMA model and the blue line is the detrended original series (the ARMA errors).
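A sketch of the simulation: `arima.sim()` has no seasonal argument, so the seasonal AR(1) factor at lag 12 is expanded into an ordinary AR polynomial before simulating.

```r
# (1 + 0.6900 B)(1 - 0.7107 B^12) expands to AR coefficients at
# lags 1, 12, and 13 in the standard phi convention.
ar_poly <- c(-0.6900, rep(0, 10), 0.7107, 0.6900 * 0.7107)
set.seed(531)
sim_Y <- arima.sim(n = 180, sd = sqrt(0.006305),
                   model = list(ar = ar_poly,
                                ma = c(1.0479, 0.5213, 0.3406)))
sim_X <- 2.507 + 0.00143 * (1:180) + sim_Y   # add the fitted trend back
```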
Then we carry out some diagnostic analysis of the residuals.
The ACF plot shows three violations (lag = 12, 19, 20) of the hypothesis-testing bands, suggesting the residuals do not fully satisfy the null hypothesis of independent and identically distributed noise. The spectral density plot also contains some peaks, but no dominant frequency can be identified. This indicates that some local patterns are still not captured well by my model. Since the data at hand cannot support more complex models, this problem should be revisited when a much larger dataset becomes available.
The QQ plot suggests that the residuals fit a Gaussian distribution very well. I then use the Shapiro-Wilk test to judge whether the residuals are approximately Gaussian.
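A sketch of these diagnostics, again assuming the fitted model object is `SARMA`.

```r
# Sample ACF, smoothed periodogram, normal QQ plot, and normality test.
acf(SARMA$residuals, lag.max = 48)
spectrum(SARMA$residuals, spans = c(3, 5, 3))
qqnorm(SARMA$residuals); qqline(SARMA$residuals)
shapiro.test(SARMA$residuals)
```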
##
## Shapiro-Wilk normality test
##
## data: SARMA$residuals
## W = 0.99504, p-value = 0.8156
The p-value of the Shapiro-Wilk test is 0.8156, much larger than 0.05, so we fail to reject the null hypothesis of Gaussianity. The residuals of the ARMA-errors model are consistent with white noise. In conclusion, although this model misses some patterns in the time series, it performs very well overall.
From section 2 we know this time series has a trend. Perhaps the simplest way to detrend is differencing. In section 3.2, I use a SARIMA model with first-order differencing to analyze the fatality counts.
The models take the form \(ARIMA(p,1,q)\times(1,0,0)_{12}\), with (p,q) chosen by AIC score. Because of the first-order differencing, I can use the data directly without detrending. As before, a lower AIC reflects a higher log-likelihood penalized by the number of parameters.
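The `aic_table()` helper sketched in section 3.1 can be reused here with `d = 1`, now applied to the (transformed) series itself rather than to regression residuals.

```r
# AIC table for ARIMA(p,1,q) x (1,0,0)_12 fits.
aic_table(dat$fatality, 4, 5, d = 1)
```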
|  | MA0 | MA1 | MA2 | MA3 | MA4 | MA5 |
|---|---|---|---|---|---|---|
| AR0 | -313.8433 | -352.8165 | -358.7116 | -364.7138 | -365.6781 | -367.3500 |
| AR1 | -336.6633 | -363.6909 | -364.0339 | -364.3775 | -369.8664 | -368.0261 |
| AR2 | -338.9111 | -364.9898 | -363.0214 | -362.7105 | -368.0183 | -366.0259 |
| AR3 | -337.4963 | -363.1017 | -366.4616 | -377.5595 | -367.9630 | -366.4699 |
| AR4 | -347.9651 | -366.5802 | -366.9499 | -375.6639 | -366.8297 | -368.1611 |
The result suggests \(ARIMA(1,1,4)\times(1,0,0)_{12}\) as the model of choice. Although a few values in the table are smaller than -369.8664 (such as \(ARIMA(3,1,3)\times(1,0,0)_{12}\) at -377.5595), their inconsistency with neighboring entries points to numerical difficulties in the optimization. So the simpler model \(ARIMA(1,1,4)\times(1,0,0)_{12}\) is the best choice.
##
## Call:
## arima(x = dat$fatality, order = c(1, 1, 4), seasonal = list(order = c(1, 0,
## 0), period = 12), method = "ML")
##
## Coefficients:
## ar1 ma1 ma2 ma3 ma4 sar1
## -0.6968 0.0796 -0.5050 -0.1625 -0.3286 0.7303
## s.e. 0.1318 0.1255 0.1154 0.0781 0.0787 0.0593
##
## sigma^2 estimated as 0.006499: log likelihood = 191.93, aic = -369.87
Each parameter \(\hat{\phi}_i\) again has the 95% confidence interval \([\hat{\phi}_i-1.96\,\hat{\sigma}_i,\ \hat{\phi}_i+1.96\,\hat{\sigma}_i]\). In this model, \(|\hat{\phi}_i|\) exceeds \(1.96\,\hat{\sigma}_i\) for "ar1", "ma2", "ma3", "ma4", and "sar1", so none of their confidence intervals contains 0 and those parameters are significant. Only "ma1" has a confidence interval that overlaps zero.
The model is \[ Y_n = X_n - X_{n-1} \] \[ (1-0.7303B^{12})(1+0.6968B)Y_n = \epsilon_n+0.0796\epsilon_{n-1}-0.5050\epsilon_{n-2}-0.1625\epsilon_{n-3}-0.3286\epsilon_{n-4} \] \[ \epsilon_n \overset{iid}{\sim} N(0,\,0.006499) \]
With this model, we can simulate the time series on the computer. In the simulation plot, the red line is the time series generated by the SARIMA model and the blue line is the original fatality series.
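A sketch of the simulation: the differenced series is simulated first, with the seasonal AR(1) again expanded into an ordinary AR polynomial, and then integrated back by cumulative summation.

```r
# Simulate Y_n = X_n - X_{n-1}, then rebuild X_n starting from the
# first observed value.
ar_poly2 <- c(-0.6968, rep(0, 10), 0.7303, 0.6968 * 0.7303)
set.seed(531)
sim_dY <- arima.sim(n = 179, sd = sqrt(0.006499),
                    model = list(ar = ar_poly2,
                                 ma = c(0.0796, -0.5050, -0.1625, -0.3286)))
sim_X <- cumsum(c(dat$fatality[1], sim_dY))
```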
Then we carry out the same diagnostic analysis of the residuals.
The ACF plot shows three violations (lag = 10, 19, 20) of the hypothesis-testing bands, so this model is also not complex enough to capture all the local structure. The spectral density plot contains several peaks with wide confidence intervals; no dominant frequency can be identified because the peaks have similar power. I reach the same conclusion as before: a larger dataset would be needed to fit more complex models.
The QQ plot shows that the residuals have slightly heavy tails. I then use the Shapiro-Wilk test to judge whether the residuals are approximately Gaussian.
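The same diagnostic sketch applies, assuming the fitted model object is `SARIMA` as in the output below.

```r
# ACF, QQ plot, and normality test for the SARIMA residuals.
acf(SARIMA$residuals, lag.max = 48)
qqnorm(SARIMA$residuals); qqline(SARIMA$residuals)
shapiro.test(SARIMA$residuals)
```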
##
## Shapiro-Wilk normality test
##
## data: SARIMA$residuals
## W = 0.99438, p-value = 0.7306
The p-value of the Shapiro-Wilk test is 0.7306, much larger than 0.05, so we fail to reject the null hypothesis of Gaussianity. The residuals are again consistent with white noise. Judging by the QQ plots, the ARMA-errors model performs slightly better than the SARIMA model, though the SARIMA model also works well for this task.
In this report, we analyze the traffic fatality time series. After exploratory data analysis, model fitting, and diagnostic analysis, we reach five main conclusions:
1. This time series has a nearly linear trend and a variance that grows with time. Regression on time and first-order differencing both successfully detrend the data. Transforming the data with \(f(x) = x^{0.2}\) effectively solves the increasing-variance problem, after which the series satisfies the conditions of weak stationarity.
2. Both the minimum and the maximum number of traffic fatalities within a year increase over time. This may be caused by economic growth: there were more cars on the road in 1974 than in 1960, so the number of fatalities increased accordingly.
3. The fatality counts follow a 12-month cycle.
4. Both the ARMA-errors model ( \(ARMA(1,0,3)\times(1,0,0)_{12}\) ) and the SARIMA model ( \(ARIMA(1,1,4)\times(1,0,0)_{12}\) ) fit the time series well: they capture the trend and the main dependence between nearby time points, and they leave white-noise residuals. However, some diagnostic evidence suggests the models could be improved given a larger dataset.
5. There are more traffic fatalities in summer and far fewer in winter. I think this may have two causes. First, Ontario winters are extremely cold, so people tend to stay at home; with less traffic on the road, serious accidents become less likely. Second, winter brings snow and ice, so both drivers and pedestrians are more careful on the road, which reduces the number of fatalities.
[1] Data source: https://datamarket.com/data/set/22ty/monthly-traffic-fatalities-in-ontario-1960-1974#!ds=22ty&display=line
[2] Shapiro-Wilk test: https://en.wikipedia.org/wiki/Shapiro-Wilk_test
[3] Class notes: https://ionides.github.io/531w18/04/notes04.html
[4] Class notes: https://ionides.github.io/531w18/05/notes05.html
[5] Class notes: https://ionides.github.io/531w18/06/notes06.html