As the largest city in the United States, there are many car accident happening each day in New York City. Based on the common sense, there should be some connection between the number of car accident and the weather. Because as the temperature goes down, it more likely to have snowy weather and ice on roads, which may increase the chance of car accident.
Obviously I’m going to use two dataset in this study. The data of car accident is “NYPD Motor Vehicle Collisions” from data.gov(http://data.gov/). The data of temperature in New York is from NOAA https://www.ncdc.noaa.gov/cdo-web/search. Those two dataset would need pre-process before using, and I’m going to talk about it in section 2.
In this study I’m going to use time series models to fit the data of car accident and try to find the relationship between temperature and number of car accident. I’m going to make use of most of the materials learned in class, like spectrum analysis, AIC, likelihood ratio test, ARMA and SARMA model and diagnostic. This is a study based on my previous work in STAT 506, in which I used regression methods to check the relationship between temperature and car accident.
Here we first load the data of car accident. The data are collected with date, time, location and detail information about the car accident. In this study we only concern about the number of car accident on each day. So we group by the column “DATE” and drop columns we won’t need. Here we only take the data of 2013, which can provide us 365 data points, enough to identify pattern or trend.
library(data.table)
library(tidyr)
library(dplyr)
### The car accident data
data = read.csv("NYPD_Motor_Vehicle_Collisions-2.csv", header = TRUE, sep = ",")
data = as.data.table(data)
data = data[, .(DATE)]
data = data[, .N, by=DATE]
# specify the new column names:
vars <- c("month", "date", "year")
# then separate the column according to regex and drop extra columns:
data_split = separate(data, DATE, into = vars, sep = "/", extra = "drop")
# Sort data with year, month and date
data_split = data_split[order(data_split[,3], data_split[,1], data_split[,2]), ]
# Select only samples from 2013 and 2014
data_split = data_split[year == 2013]
data_split$time = paste(data_split$year,data_split$month,data_split$date, sep = "")
data_split = data_split[order(data_split$time)]
## STATION NAME DATE TAVG TMAX TMIN TOBS
## 1: USC00280907 BOONTON 1 SE, NJ US 2013-01-01 NA 38 29 38
## 2: USC00280907 BOONTON 1 SE, NJ US 2013-01-02 NA 39 22 23
## 3: USC00280907 BOONTON 1 SE, NJ US 2013-01-03 NA 32 21 21
## 4: USC00280907 BOONTON 1 SE, NJ US 2013-01-04 NA 34 21 33
## 5: USC00280907 BOONTON 1 SE, NJ US 2013-01-05 NA 37 30 32
## 6: USC00280907 BOONTON 1 SE, NJ US 2013-01-06 NA 41 30 35
total = merge(data_split, data_weather_split, by=c("year", "month", "date"))
total$time = paste(total$year, total$month, total$date, sep = "-")
total$month = NULL
total$date = NULL
total$year = NULL
data = total
data$time = as.Date(data$time)
data$time.x = NULL
data$time.y = NULL
We first use time plot to see whether there are significant seasonal pattern or trend in temperature and number of car accident, and whether one can be a good proxy for another.
# Time Plot
date = data$time
temperature = data$TMIN
traffic = data$N
par(mar=c(6, 5, 4, 5))
plot(date, traffic,xlim = c(as.Date("2013-01-01"), as.Date("2013-12-31")),
xlab = "Month", ylab = "Number of Car Accident", main = "Time Plot of Car Accident Number",
sub = "Fig 1", type = "l")
par(new = T)
plot(date, temperature,xlim = c(as.Date("2013-01-01"), as.Date("2013-12-31")),
xlab = "",ylab = "", col = "red", type = "l")
axis(side = 4)
mtext("Temperature (Fahrenheit)", side = 4, line = 3)
We perform the spectrum analysis of the temperature data, which can see a cycle pattern about 1 year from the time plot in Fig 1.
spec_temp = spectrum(temperature, main="Spectrum periodogram of Temperature", xlab = "Frequency \n Fig 2")
cycle = spec_temp$freq[which.max(spec_temp$spec)]
cat("The frequency domain is ", cycle, "cycle per observation (day).", "\n",
"The cycle period is ",1/cycle, "days.")
## The frequency domain is 0.002666667 cycle per observation (day).
## The cycle period is 375 days.
Then let’s see the spectrum analysis of car accident data.
# define a function to find the nth largest number in a array without changing the sequence of data.
find_nth =function(n,list){
temp = list
i = 1
for (i in 1:(n-1)){
temp[which.max(temp)] = 0
}
return(which.max(temp))
}
#####################
spec_traf = spectrum(traffic, main = "Smoothed periodogram", xlab = "Frequency \n Fig 3")
cycle = spec_traf$freq[which.max(spec_traf$spec)]
cat("The frequency domain is ", cycle, "cycle per observation (day).", "\n",
"The cycle period is ",1/cycle, "days.\n")
## The frequency domain is 0.2853333 cycle per observation (day).
## The cycle period is 3.504673 days.
cycle = spec_traf$freq[find_nth(2, spec_traf$spec)]
cat("The second largest peak", "The frequency domain is ", cycle, "cycle per observation (day).", "\n",
"The cycle period is ",1/cycle, "days.\n")
## The second largest peak The frequency domain is 0.144 cycle per observation (day).
## The cycle period is 6.944444 days.
cycle = spec_traf$freq[find_nth(3, spec_traf$spec)]
cat("The third largest peak", "The frequency domain is ", cycle, "cycle per observation (day).", "\n",
"The cycle period is ",1/cycle, "days.\n")
## The third largest peak The frequency domain is 0.005333333 cycle per observation (day).
## The cycle period is 187.5 days.
3.5 days per cycle, which is likely to be 2 cycle per week. This is far away from the cycle of temperature. Also there are two peaks other than the frequency domain that are statistically significant, the second largest has a cycle approximately 1 week and third largest peak has a cycle close to 0.5 year. The cycle of temperature is common multiple of cycle of car accident. However, it’s hardly to consider this as a prove of association between temperature and car accident.
So far we haven’t get any evidence that show a connection between temperature and number of car accident. Next we are going to fit signal plus ARMA noise model with the data.
We use AIC table to choose the best signal plus ARMA noise model. The formula of the model is Ref: \[ Y = Z\beta + \eta.\] Where \(\mu_n\) is a trend function that has a linear specification: \[\mu_n = \sum_{k=1}^K Z_{n,k}\beta_k,\] And \(\eta\) is a is a stationary, causal, invertible ARMA(p,q) process with mean zero.
As we know, the ARMA noise model can predict the trend with the data that is not fitted in ARMA model. If there is a connection between temperature and car accident, the model should fit nicely.
require(knitr)
aic_table <- function(data, P, Q, xreg = NULL){
table <- matrix(NA,(P+1),(Q+1))
for(p in 0:P) {
for(q in 0:Q) {
table[p+1,q+1] <- arima(data,order=c(p,0,q),xreg = xreg, method = "ML")$aic
}
}
dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
table
}
traffic_aic_table <- aic_table(traffic, 4, 4, xreg = temperature)
traffic_aic_table
## MA0 MA1 MA2 MA3 MA4
## <b> AR0</b> 4254.883 4220.569 4221.409 4222.950 4221.091
## <b> AR1</b> 4228.016 4221.362 4223.350 4214.133 4212.525
## <b> AR2</b> 4225.170 4223.322 4216.283 4152.886 4211.297
## <b> AR3</b> 4217.759 4213.796 4141.792 4160.490 4149.205
## <b> AR4</b> 4219.286 4220.274 4143.671 4145.806 4143.393
Here is one thing to mention. The default for arima function is using CSS (conditional sum of squares), it is possible for the autoregressive coefficients to be non-stationary, which may result an error. So here I force R to use MLE (maximum likelihood estimation) instead by using the argument method=“ML”. This is slower but gives better estimates and always returns a stationary model.
From the AIC table we can see the ARMA(4,4) have the lowest AIC value. However, here we have some problem about the AIC of ARMA(4,4). Indeed it is the lowest, but the difference between it and AIC of ARMA(3,4) is greater than 2. This means something wrong in the maximization of log-likelihood. Also ARMA(4,4) is a large model with many roots in AR and MA polynomial, so we need to check the roots to make sure it satisfied the causality and invertibility.
arma = arima(data$N, order = c(4,0,4), xreg = temperature, method = "ML")
AR_roots = polyroot(c(1,-coef(arma)[c("ar1", "ar2", "ar3","ar4")]))
MA_roots = polyroot(c(1, coef(arma)[c("ma1", "ma2", "ma3","ma4")]))
## Define a function to chekc the roots whether landed outside the unit circle
mod_check = function(roots){
for (i in roots){
if (Mod(i) - 1 >= 0.001){
cat("satisfied", "\n")
}
else{
cat("not satisfied", "\n")
}
}
}
mod_check(AR_roots)
## satisfied
## not satisfied
## not satisfied
## satisfied
mod_check(MA_roots)
## not satisfied
## not satisfied
## satisfied
## satisfied
Based on the AIC table and our preference for smaller model, we choose ARMA(2,2) model.
##
## Call:
## arima(x = traffic, order = c(2, 0, 2), xreg = temperature)
##
## Coefficients:
## ar1 ar2 ma1 ma2 intercept temperature
## 0.6638 0.2917 -0.3521 -0.5208 511.0807 0.9348
## s.e. 0.1633 0.1512 0.1461 0.1197 24.1619 0.4756
##
## sigma^2 estimated as 5851: log likelihood = -2101.14, aic = 4216.28
## The p-value is 0.06831255
We can fitting the temperature as explanatory variable and compare the result with signal plus white noise model.
a = seq(1,365, by = 1)
fit0 = lm(traffic ~ a)
fit1 = lm(traffic ~ temperature)
cat("The AIC of signal plus white noise model is ",AIC(fit0), "\n",
"The AIC of the model with temperature as explanatory variable is ",AIC(fit1), "\n")
## The AIC of signal plus white noise model is 4253.953
## The AIC of the model with temperature as explanatory variable is 4254.883
summary(fit1)
##
## Call:
## lm(formula = traffic ~ temperature)
##
## Residuals:
## Min 1Q Median 3Q Max
## -274.62 -56.10 -0.53 50.17 322.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 513.7223 12.5021 41.091 < 2e-16 ***
## temperature 0.9422 0.2492 3.781 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.79 on 363 degrees of freedom
## Multiple R-squared: 0.03789, Adjusted R-squared: 0.03524
## F-statistic: 14.3 on 1 and 363 DF, p-value: 0.0001824
The AIC of white noise model is even lower. From the summary, the R-squared is too small to consider there is a linear relationship between temperature and traffic.
So far we have confirmed there is no significant association between temperature and number of car accident in New York. However, we haven’t find the best arma model to fit the car accident data yet.
We can start by checking the ACF plot of ARMA(2,2) model. Here I remove the term “xreg” in arima function, since we have confirmed there is no need for fitting temperature in model.
residuals = resid(arima(traffic, order = c(2, 0, 2)))
par(mar=c(6, 5, 4, 5))
acf(residuals, xlab = "Lag \n Fig 4", main = "ACF plot of ARMA(2,2)")
The ACF plot of the residuals of SARMA(2,2)×(1,1) with period 7 is shown in Fig 5.
residuals = resid(arima(traffic, order = c(2, 0, 2), seasonal=list(order=c(1,0,1),period=7)))
par(mar=c(6, 5, 4, 5))
acf(residuals, xlab = "Lag \n Fig 5", main = "ACF plot of SARMA(2,2)×(1,1)" )
We can also construct a AIC table to check whether SARMA(2,2)×(1,1) has the lowest AIC value.
require(knitr)
aic_table <- function(data, P, Q){
table <- matrix(NA,(P+1),(Q+1))
for(p in 0:P) {
for(q in 0:Q) {
table[p+1,q+1] <- arima(data,order=c(p,0,q), method = "ML",seasonal=list(order=c(1,0,1),period=7))$aic
}
}
dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
table
}
traffic_aic_table <- aic_table(traffic, 4, 4)
traffic_aic_table
## MA0 MA1 MA2 MA3 MA4
## <b> AR0</b> 4133.816 4093.042 4087.296 4086.994 4087.883
## <b> AR1</b> 4082.281 4079.221 4073.118 4074.378 4076.400
## <b> AR2</b> 4081.995 4072.512 4074.475 4076.388 4078.368
## <b> AR3</b> 4081.954 4074.566 4076.501 4074.433 4080.108
## <b> AR4</b> 4080.559 4076.407 4078.476 4080.754 4074.380
par(mfrow=c(2,2))
par(mar=c(4, 4, 4, 4))
acf(residuals, xlab = "Lag ", main = "ACF of SARMA(2,2)×(1,1)" )
residuals = resid(arima(traffic, order = c(2, 0, 1), seasonal=list(order=c(1,0,1),period=7)))
acf(residuals, xlab = "Lag ", main = "ACF of SARMA(2,1)×(1,1)" )
plot(resid(arima(traffic, order = c(2, 0, 2), seasonal=list(order=c(1,0,1),period=7))),
main = "Residual plot of SARMA(2,2)×(1,1)", ylab = "Residuals")
plot(residuals, main = "Residual plot of SARMA(2,1)×(1,1)")
As we find the fitted model, it natural to compare the fitted values with the original data.
Based on the analysis above, temperature has no assoication with number of car accident in New York. In other word, it is as likely to have car accident in winter as in summer.
There is no clear trend in the car accident data.
The temperature data have only one statistically significant frequency domain, which is 375 days per cycle. This result is reasonable as it corresponding to the season changes in one year.
The car accident data have three statistically significant frequency, and the spectrum of those three are very close. We find the cycle of 7 days is easier for interpretation. As the traffic on Monday is heavier than other days.
The diagnositic on SARMA model confirmed the cycle of 7 days is reasonable and find the best fitted model is SARMA(2,1)×(1,1).
The result that temperature has no association with number of car accident in New York is quite surprising to me at first. As I seek for a better explanation,several ideas comes up:
The car accident data we used in this study measured the number of car accidents each day. But the amount of traffic on the road varies day by day, and a larger number of car accident may not necessarily means a higher chance of crash. Maybe we have a large number on Monday
just because it is most crowded on that day. So in order to get a more accurate result, we should instead use the percentage of car accident for analysis.
Drivers take more cautions in winter than in summer, as they know there might be snow and ice on the road. And they tend to drive slow in limited visibility condition (like snowy days). Therefore, they offset the bad effect of temperature.
As the largest city in America, New York has the best road cleaning team and police force to maintain the traffic all year round. This may also offset the effect of temperature.
NYPD Motor Vehicle Collisions - From DATA.GOV **https://catalog.data.gov/dataset/nypd-motor-vehicle-collisions-07420**
Daily Weather Report - From NOAA **https://www.ncdc.noaa.gov/cdo-web/search**
Lecture Note-6, Prof. Edward Ionides, University of Michigan