Fire alarm can be triggered by lots of reason. As a student living in student apartment, I sometimes can hear fire alarm several time a day. Most of them are false alarm, and people who hear these alarms everyday get tired and stop evacuating from the building in time. This can be dangerous if real fire breaks out. So I want to explore and fit a time series model of fire alarm emergency.
In this project, my goal is to explore the following questions:
What time series model fits fire alarm daily data well?
Is there any significant seasonality or cycle?
If we find a pattern of the fire alarms, we can be more prepared to the fire related emergency and prevent possible waste of public resource.
The data we are using is 2016-2018 daily Emergency (911) Calls dataset from Kaggle, including Fire, Traffic, EMS for Montgomery County, PA. The raw data includes 141 different kinds of emergency and more than 400,000 data points. So we first need to select the cases that relates to our topic. In the following analysis, we only focus on the category “Fire: FIRE ALARM”.
After selecting the data we need, there are 23945 cases of fire alarms in total, and we aggregate the datapoints by date.
First, we visualize the fire alarm time series data.
The blue line is the mean of all cases. The mean of our data is 22.78, which indicates that there are around 23 reported cases of fire alarm in average.
DF$date[which.max(DF$count)]
## [1] "2018-03-03"
max(DF$count)
## [1] 91
On 2018-03-03, there were 91 cases of fire alarm in Montgomery County, which is the maximum of the data we are using. Other data points are mostly under 60.
As we can see from this plot, there is no obvious violation of mean stationary. So it is possible to fit an AR model using the data.
We can plot the autocorrelation function (ACF) of the dataset.
As we can see, almost all the autocorrelation is greater than 0. It shows weakly decreasing trend and no sign of sinusoidal pattern, so AR(1) might be a good fit to this time series.
We assume our data can fit the model \(ARMA(p,q)\), which has the form: \[X_n=\mu+\phi_1 (X_{n-1}-\mu)+\dots+\phi_p (X_{n-p}-\mu)+\epsilon_{n}+\psi_1\epsilon_{n-1}+\dots+\psi_q\epsilon_{n-q}\] {\(X_n\)} is the time series data. \(\mu\) is the mean of all emergency calls (the intercept in model description). {\(\phi_p\)} and {\(\psi_q\)} are coefficients to be determined. {\(\epsilon_n\)} is Gaussian white noise with variance \(\sigma^2\). We need to choose the values of p and q. So we tabulate AIC values for a range of different choices of p and q.
MA0 | MA1 | MA2 | MA3 | |
---|---|---|---|---|
AR0 | 7308.87 | 7189.67 | 7173.46 | 7164.75 |
AR1 | 7161.80 | 7148.42 | 7139.36 | 7141.34 |
AR2 | 7156.67 | 7140.28 | 7141.35 | 7143.33 |
AR3 | 7149.16 | 7141.25 | 7143.27 | 7144.62 |
In this table, we can see that the AIC of ARMA(1,2), 7139.36 is the least of all. The differences of AIC among ARMA(1,3), ARMA(2,2) and ARMA(1,2) are no larger than 2, so we need to consider all these three models and choose one from them.
arma12=arima(DF$count,order=c(1,0,2))
arma12
##
## Call:
## arima(x = DF$count, order = c(1, 0, 2))
##
## Coefficients:
## ar1 ma1 ma2 intercept
## 0.8718 -0.5586 -0.1316 22.7654
## s.e. 0.0422 0.0537 0.0380 0.5340
##
## sigma^2 estimated as 51.69: log likelihood = -3564.68, aic = 7139.36
ARMA_roots <- polyroot(c(1,-coef(arma12)[c("ar1","ma1","ma2")]))
ARMA_roots
## [1] 0.7047401+0.920562i 0.7047401-0.920562i -5.6547738-0.000000i
arma13=arima(DF$count,order=c(1,0,3))
arma13
##
## Call:
## arima(x = DF$count, order = c(1, 0, 3))
##
## Coefficients:
## ar1 ma1 ma2 ma3 intercept
## 0.8688 -0.5562 -0.1325 0.0048 22.7659
## s.e. 0.0483 0.0571 0.0385 0.0344 0.5323
##
## sigma^2 estimated as 51.69: log likelihood = -3564.67, aic = 7141.34
ARMA_roots <- polyroot(c(1,-coef(arma13)[c("ar1","ma1","ma2","ma3")]))
ARMA_roots
## [1] 0.7025237+0.9267792i 0.7025237-0.9267792i -4.9443765-0.0000000i
## [4] 31.0807053+0.0000000i
arma22=arima(DF$count,order=c(2,0,2))
arma22
##
## Call:
## arima(x = DF$count, order = c(2, 0, 2))
##
## Coefficients:
## ar1 ar2 ma1 ma2 intercept
## 0.8272 0.0354 -0.5147 -0.1549 22.7527
## s.e. 0.2590 0.2043 0.2566 0.1402 0.5313
##
## sigma^2 estimated as 51.69: log likelihood = -3564.68, aic = 7141.35
ARMA_roots <- polyroot(c(1,-coef(arma22)[c("ar1","ar2","ma1","ma2")]))
ARMA_roots
## [1] 0.7872347+0.612690i 0.7872347-0.612690i -2.4480545-0.701687i
## [4] -2.4480545+0.701687i
The roots of all models are outside the unit circle and all of the coefficients are less than 1, suggesting we have stationary invertible causal fitted ARMA.
Since these three models do not have discriminative difference, we choose the simplest model \(ARMA(1,2)\) to avoid computational inefficiency and risk of overfitting.
So our model equation is \[X_n=\mu+\phi_1 (X_{n-1}-\mu)+\psi_1\epsilon_{n-1}+\psi_2\epsilon_{n-2}\] With \(\mu=22.7654, \phi_1=0.8718, \psi_1=-0.5586, \psi_2=-0.1316, \sigma^2=51.69\).
To further improve the model, we try to fit a SARMA model. The possible numbers of period are 7 (as a week) and 30 (as a month).
This is the model description of period = 7:
##
## Call:
## arima(x = DF$count, order = c(1, 0, 2), seasonal = list(order = c(1, 0, 0),
## period = 7))
##
## Coefficients:
## ar1 ma1 ma2 sar1 intercept
## 0.8505 -0.5381 -0.1229 0.0459 22.7627
## s.e. 0.0526 0.0623 0.0404 0.0330 0.5248
##
## sigma^2 estimated as 51.6: log likelihood = -3563.71, aic = 7139.41
This is the model description of period = 30:
##
## Call:
## arima(x = DF$count, order = c(1, 0, 2), seasonal = list(order = c(1, 0, 0),
## period = 30))
##
## Coefficients:
## ar1 ma1 ma2 sar1 intercept
## 0.8709 -0.5576 -0.1310 0.0147 22.7659
## s.e. 0.0422 0.0537 0.0379 0.0311 0.5405
##
## sigma^2 estimated as 51.68: log likelihood = -3564.57, aic = 7141.14
The ARMA(1,2) model without seasonality:
##
## Call:
## arima(x = DF$count, order = c(1, 0, 2))
##
## Coefficients:
## ar1 ma1 ma2 intercept
## 0.8718 -0.5586 -0.1316 22.7654
## s.e. 0.0422 0.0537 0.0380 0.5340
##
## sigma^2 estimated as 51.69: log likelihood = -3564.68, aic = 7139.36
We can see that the log likelihoods are improved by less than 1 in both SARMA. So there is no obvious seasonality in the fire alarm daily data.
We plot the spectral density plot:
s=h$freq[which.max(h$spec)]
s
## [1] 0.0009259259
1/s
## [1] 1080
The unit of periodogram frequency is cycles per day. The highest frequency is at 0.0009. This maximum point is significant, which can be seen by sliding the confidence interval bar on the right.
However, 0.0009 corresponds to a period of 1080 days. 1080 days is almost three years. We only have three-year data, which means we do not have enough data to discover the cycle.
r<-arma12$residuals
acf(r)
qqnorm(r)
qqline(r)
We can see the ACF plot is within the dashed lines, so we can say that the residuals are uncorrelated. But the Q-Q plot shows that the residual is not gaussian. It has a lone tail on the right. To have the residual act like normal, we can try log transformation.
our model is ARMA(1,2) \[X_n=\mu+\phi_1 (X_{n-1}-\mu)+\psi_1\epsilon_{n-1}+\psi_2\epsilon_{n-2}\] {\(\epsilon_n\)} is Gaussian white noise with variance \(\sigma^2\). \(\mu=22.7654, \phi_1=0.8718, \psi_1=-0.5586, \psi_2=-0.1316, \sigma^2=51.69\).
There is no obvious seasonality or cycle.
This result suggests that the fire alarm emergency can happen anytime in the year.
The residual is right-skewed, so we can try log transformation to the fire alarm data in the future. If we want to find a cycle, we will need to find more data. Ths data we are using is just data in one county. We may need dataset of more places to have a more general model.
data from: https://www.kaggle.com/mchirico/montcoalert
notes and codes from: https://ionides.github.io/531w20/