Distracted driving has become increasingly common in recent years. In 2013, Kilbey reported 195,723 casualties from road accidents in the UK alone. Human failure or error has been implicated as the cause of about 70% of accidents, while environmental and vehicular factors account for the remaining 30%. To date, researchers have tackled this issue using two approaches: one analyzes the driver’s physical behavior, while the other analyzes the vehicle’s real-time driving patterns.
Research based on analyzing the driver’s physical behavior primarily involves visual or auxiliary systems. For example, many approaches use camera-based sensors installed inside the vehicle to monitor the driver’s eyes and issue warnings if an irregularity in eye movement or body posture is detected. However, these approaches require the added cost of cameras, and the cameras themselves may distract the driver, creating new safety hazards. Thus, practical implementation of this approach is still very limited.
On the other hand, there is increasing research interest in analyzing real-time driving patterns using available built-in sensors that measure vehicle kinematic data such as speed, yaw/turn rate, and lane offset. These measured signals offer rich information for distinguishing between normal and irregular driving behaviors. It is often reasonable to assume that most drivers have their own normal driving patterns, and that a significant deviation from this normal behavior reflects a high probability of distracted driving.
In this study, we aim to analyze time series models that appropriately represent drivers driving straight. In particular, given kinematic data from different drivers operating under the driving pattern known as straight constant speed driving (as identified by an algorithm known as AprioriAll), can we find a single model that best describes straight constant speed driving? If so, future work could focus on distinguishing distracted driving data from non-distracted driving data. Additionally, I am interested in whether we can learn anything about how the AprioriAll algorithm categorized these instances of data as the so-called straight constant speed driving state.
The University of Michigan Transportation Research Institute (UMTRI) has gathered kinematic driving data from drivers around southeast Michigan. These drivers were given vehicles with sensors recording acceleration, yaw/turn rate, lane offset, and lateral speed at a sampling rate of 10 Hz. Drivers kept these vehicles for four weeks and were allowed to use them however they wished. At the end of the four weeks, the vehicles were returned to UMTRI, and the data was stored for researchers to use. Most drivers simply drove their normal daily routines, so UMTRI’s dataset contains many replications of realistic kinematic driving data on roads tied to those routines. Additionally, some drivers took similar highways as part of their routines, so UMTRI also has realistic kinematic driving data from multiple drivers on the same roads.
Data was selected from a particular highway that cannot be disclosed for privacy reasons. A sequential association rule-finding algorithm known as AprioriAll was used to find instances of drivers driving on this road via straight constant speed driving. The algorithm claims that it is able to find frequent patterns in a given dataset and cluster subsegments of data into those frequent patterns. With this driving dataset, each frequent pattern extracted can be thought of as representing a certain driving state that the vehicle may be in. One of these driving states is straight constant speed driving, which is the state this study will analyze.
Based on the algorithm’s results, straight constant speed driving entails that the acceleration, yaw/turn rate, lane offset, and lateral speed sensors all read exactly 0 within a window of 3.5 seconds. Of course, reality dictates that this can never be exactly true, but more often than not, a driver on this particular highway will have some period of time where the sensor values come very close to this definition. A total of 5 instances were selected from the dataset where drivers had sensor values most closely matching AprioriAll’s straight constant speed driving category. These 5 instances came from 3 different drivers: two drivers contributed 2 trips each, and the third contributed 1 trip. The drivers, trips, and number of data points associated with each file are summarized below.
Driver | Trip | NumObservations |
---|---|---|
4 | 148 | 321 |
4 | 186 | 461 |
25 | 173 | 702 |
103 | 33 | 375 |
103 | 118 | 211 |
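The selection criterion described above, all four signals staying near zero over a 3.5 s window at the 10 Hz sampling rate, can be sketched as follows. This is an illustrative reconstruction, not AprioriAll itself; the function name `near_constant_windows` and the tolerance `tol` are assumptions made for this sketch.

```python
import numpy as np

def near_constant_windows(signals, fs=10, win_sec=3.5, tol=0.05):
    """Flag windows in which every signal stays within `tol` of zero.

    `signals` is a dict of equal-length 1-D arrays (e.g. acceleration,
    yaw/turn rate, lane offset, lateral speed); `fs` is the 10 Hz
    sampling rate from the study. The tolerance `tol` is an
    illustrative choice, not AprioriAll's actual criterion.
    """
    win = int(fs * win_sec)                      # samples per 3.5 s window
    n = min(len(s) for s in signals.values())
    flags = []
    for start in range(0, n - win + 1, win):     # nonoverlapping windows
        ok = all(np.max(np.abs(s[start:start + win])) < tol
                 for s in signals.values())
        flags.append((start, ok))
    return flags
```

Windows flagged `True` are candidates for the straight constant speed driving state under this simplified criterion.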
In each of these instances, we have data from the acceleration, yaw/turn rate, lane offset, and lateral speed sensors; however, in this study we will only analyze the single signal of lane offset, as it appears to be the signal most closely related to determining whether a driver is driving straight.
It may be informative to view the lane offset signal from each of these instances before analyzing them, so we will begin with that:
Looking at these plots alone, it is not obvious what common structure led the AprioriAll algorithm to cluster these instances together. However, since each instance’s data is a time series, we may be able to better understand this clustering via time series analysis.
Time series models are often appropriate when the data appears to exhibit characteristics of weak stationarity. To be weakly stationary, a process must be both mean stationary and covariance stationary. Mean stationarity implies that the mean of the data is constant over time; we can check this by applying nonoverlapping 10-point windows to the data and calculating the mean within each window:
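The window-mean check just described can be computed as follows. The original analysis was done in R; this is a minimal NumPy sketch of the same idea, with the function name `window_means` being an assumption.

```python
import numpy as np

def window_means(x, width=10):
    """Mean of each nonoverlapping window of `width` samples."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // width) * width          # drop the ragged tail
    return x[:n].reshape(-1, width).mean(axis=1)
```

Plotting these window means against window index gives the mean-stationarity plots referred to above; roughly flat means support mean stationarity.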
It can be seen here that, for the most part, the mean appears to have low variability over time in each of these instances, so we have evidence of mean stationarity. Now all that remains is to check for covariance stationarity. To be covariance stationary, the covariance between two observations should depend only on the distance between them. We can check this by plotting the autocorrelation function over different time segments within each trip and seeing whether they display similar patterns:
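The sample autocorrelation function used for this check can be sketched as below. R’s built-in `acf()` was presumably used in the original analysis; this NumPy version implements the standard biased estimator.

```python
import numpy as np

def acf(x, max_lag=30):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = np.dot(x, x)                   # lag-0 sum of squares
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])
```

Comparing these ACF curves across segments of a trip (and across trips) is the informal covariance-stationarity check described above: similar decay patterns suggest the dependence structure depends only on lag.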
For the most part, the instances display similar autocorrelation patterns both between and within datasets, so it seems safe to say there is evidence of covariance stationarity as well.
Given evidence of both mean stationarity and covariance stationarity, we can now attempt to fit time series models to each of these datasets under the assumption of weak stationarity. One of the most basic yet well-developed time series models based on this assumption is the autoregressive moving average (ARMA) model. A time series \(\{x_t:t=0,\pm1,\pm2,\dots\}\) is ARMA(\(p,q\)) if it is weakly stationary and \[x_t=\phi_1x_{t-1}+\dots+\phi_px_{t-p}+w_t+\theta_1w_{t-1}+\dots+\theta_qw_{t-q}\] where \(\{w_t\}\) is Gaussian white noise with mean 0 and variance \(\sigma_w^2\).
The parameters \(p\) and \(q\) are called the autoregressive and moving average orders, respectively. If \(x_t\) has a nonzero mean \(\mu\), then we set \(\alpha = \mu(1-\phi_1-\dots-\phi_p)\) and the model becomes \[x_t=\alpha+\phi_1x_{t-1}+\dots+\phi_px_{t-p}+w_t+\theta_1w_{t-1}+\dots+\theta_qw_{t-q}.\]
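To make the recursion concrete, a series following this definition can be simulated directly. This is a sketch with illustrative coefficients, not a model fitted to the study’s data; the function name `simulate_arma` is an assumption.

```python
import numpy as np

def simulate_arma(phi, theta, n, burn=200, seed=0):
    """Simulate x_t = sum_i phi_i x_{t-i} + w_t + sum_j theta_j w_{t-j}
    with Gaussian white noise w_t, discarding a burn-in period."""
    rng = np.random.default_rng(seed)
    p, q = len(phi), len(theta)
    w = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(max(p, q), n + burn):
        x[t] = (sum(phi[i] * x[t - 1 - i] for i in range(p))
                + w[t]
                + sum(theta[j] * w[t - 1 - j] for j in range(q)))
    return x[burn:]                         # drop transient start-up values
```

For example, `simulate_arma([0.5], [0.3], 500)` produces a weakly stationary ARMA(1,1) path with visible positive autocorrelation at short lags.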
Our goal in this section is to fit ARMA models to each of the time series in the dataset and then compare the models to see if we can better understand why AprioriAll believes these time series should be clustered together. Given the arima() function in R, fitting ARMA models is easy: arima() runs an optimization procedure to estimate the parameters \(\phi_i\) for \(i=1,\dots,p\) and \(\theta_j\) for \(j = 1,\dots,q\) of the chosen ARMA model. The real challenge arises in deciding which ARMA orders (the \(p\) and \(q\) parameters) to use when fitting.
Looking at this problem from a predictability standpoint, Akaike’s Information Criterion (AIC) comes to mind as a model selection method. AIC compares the likelihoods of different models, penalizing each model’s likelihood by a measure of its complexity via the following equation:
\[AIC = -2\ell(\theta^*)+2D\] where \(D\) is the number of parameters, \(\ell(\cdot)\) is the log-likelihood of the model, and \(\theta^*\) is the estimate obtained by maximizing the log-likelihood. Essentially, the lower the AIC score, the better the model. We apply this methodology to the given instances’ lane offsets, considering AR orders \(p\) ranging from 0 to 10, as the autocorrelation plotted earlier seems to strongly support higher AR orders, along with MA orders \(q\) ranging from 0 to 3. Doing so yields the following results:
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -222 | -602 | -851 | -1015 |
AR1 | -1331 | -1330 | -1354 | -1371 |
AR2 | -1329 | -1330 | -1352 | -1350 |
AR3 | -1368 | -1375 | -1327 | -1452 |
AR4 | -1395 | -1398 | -1456 | -1336 |
AR5 | -1413 | -1471 | -1490 | -1454 |
AR6 | -1497 | -1503 | -1503 | -1504 |
AR7 | -1500 | -1493 | -1504 | -1502 |
AR8 | -1504 | -1507 | -1507 | -1505 |
AR9 | -1503 | -1505 | -1505 | -1504 |
AR10 | -1510 | -1508 | -1512 | -1510 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -569 | -1017 | -1332 | -1450 |
AR1 | -1842 | -1861 | -1863 | -1861 |
AR2 | -1855 | -1851 | -1861 | -1868 |
AR3 | -1860 | -1863 | -1861 | -1870 |
AR4 | -1876 | -1879 | -1888 | -1893 |
AR5 | -1884 | -1893 | -1893 | -1891 |
AR6 | -1891 | -1892 | -1891 | -1890 |
AR7 | -1894 | -1892 | -1892 | -1894 |
AR8 | -1892 | -1890 | -1889 | -1888 |
AR9 | -1891 | -1889 | -1888 | -1886 |
AR10 | -1889 | -1887 | -1889 | -1887 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -808 | -1532 | -1879 | -2159 |
AR1 | -2437 | -2435 | -2521 | -2521 |
AR2 | -2435 | -2444 | -2520 | -2521 |
AR3 | -2524 | -2571 | -2678 | -2691 |
AR4 | -2619 | -2618 | -2694 | -2692 |
AR5 | -2619 | -2643 | -2686 | -2701 |
AR6 | -2720 | -2723 | -2722 | -2720 |
AR7 | -2724 | -2722 | -2719 | -2718 |
AR8 | -2722 | -2720 | -2721 | -2724 |
AR9 | -2720 | -2718 | -2716 | -2719 |
AR10 | -2719 | -2717 | -2719 | -2720 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -546 | -958 | -1257 | -1411 |
AR1 | -1827 | -1827 | -1825 | -1823 |
AR2 | -1827 | -1827 | -1825 | -1827 |
AR3 | -1825 | -1825 | -1827 | -1829 |
AR4 | -1825 | -1828 | -1848 | -1853 |
AR5 | -1857 | -1907 | -1906 | -1909 |
AR6 | -1891 | -1906 | -1905 | -1908 |
AR7 | -1896 | -1906 | -1904 | -1910 |
AR8 | -1895 | -1897 | -1896 | -1903 |
AR9 | -1896 | -1895 | -1892 | -1908 |
AR10 | -1918 | -1917 | -1915 | -1914 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -310 | -552 | -708 | -807 |
AR1 | -1060 | -1058 | -1056 | -1058 |
AR2 | -1058 | -1056 | -1054 | -1099 |
AR3 | -1056 | -1055 | -1062 | -1099 |
AR4 | -1057 | -1085 | -1064 | -1064 |
AR5 | -1065 | -1092 | -1099 | -1054 |
AR6 | -1081 | -1092 | -1090 | -1088 |
AR7 | -1083 | -1090 | -1088 | -1088 |
AR8 | -1082 | -1089 | -1098 | -1088 |
AR9 | -1084 | -1080 | -1087 | -1091 |
AR10 | -1089 | -1091 | -1087 | -1089 |
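The AIC tables above were generated in R via arima(). As a rough illustration of the grid-search idea, the sketch below fits AR(\(p\)) models by conditional least squares in Python and computes their AIC. This is a simplified stand-in: it omits the MA component and conditions on the first \(p\) observations, so its values will not match arima()’s full maximum-likelihood AICs exactly; the function name `ar_aic` is an assumption.

```python
import numpy as np

def ar_aic(x, p):
    """AIC of an AR(p)-with-intercept fit by conditional least squares."""
    x = np.asarray(x, dtype=float)
    n = len(x) - p
    # Design matrix: intercept plus p lagged copies of the series.
    X = np.column_stack([np.ones(n)] +
                        [x[p - 1 - i:len(x) - 1 - i] for i in range(p)])
    y = x[p:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = np.mean(resid ** 2)
    # Gaussian log-likelihood at the conditional least-squares fit.
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    D = p + 2                    # AR coefficients + intercept + variance
    return -2 * loglik + 2 * D

# Grid over candidate AR orders; the smallest AIC is preferred, e.g.:
# aic_table = {p: ar_aic(lane_offset, p) for p in range(0, 11)}
```

Sweeping \(p\) (and, in the full R analysis, \(q\)) over a grid and tabulating the AICs is exactly how tables like those above are produced.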
Based on the AIC, these results recommend a wide variety of models.
To our dismay, there does not appear to be agreement on a single model. However, there are hints that the optimization procedure had trouble converging, as the AIC values occasionally change by more than 2 units when shifting over a row or column. Additionally, there are hints that the MA order should be 0, and we can notice that the favored ARMA models are the higher-order ones. This second observation, however, could simply be a consequence of the high autocorrelation seen in the lane offset variable.
The results from the ARMA analysis above did not seem to support AprioriAll’s conclusion. However, what if we looked at the data from a differenced perspective? Differenced data can be described as a transformation of the data \(x_{1:N}\) via the following formula: \[z_n = x_n-x_{n-1}, \quad n = 2,\dots,N.\]
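In code, this first-difference transformation is a one-liner; NumPy’s `diff` computes exactly \(z_n = x_n - x_{n-1}\). The toy values below are illustrative, not taken from the dataset.

```python
import numpy as np

x = np.array([1.0, 1.2, 1.1, 1.4])   # toy lane-offset values
z = np.diff(x)                        # z_n = x_n - x_{n-1}
# z is approximately [0.2, -0.1, 0.3] and is one sample shorter than x
```

Note that the differenced series always has one fewer observation than the original, which slightly shrinks the sample sizes reported earlier.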
This difference transformation can often be viewed as a transformation toward weak stationarity. We apply this transformation to our instances’ lane offsets:
After applying this difference transformation to the data, we can again check the weak stationarity conditions. Looking first at mean stationarity with a window size of 10 points, we see the following:
These results indicate very small variability in the mean, allowing us to claim that the differenced data displays a sense of mean stationarity. We can then check covariance stationarity by again looking at the autocorrelation function of the differenced data. (Note: this time the plots were enlarged so that we can better see the patterns, as the amount of autocorrelation changed dramatically from the previous results.)
Looking at these plots, we can see that the autocorrelation patterns are generally the same throughout all of the instances, with the exception of Driver 103; but even then, parts of that driver’s data show patterns similar to the rest of the instances. Hence we can claim at least some form of covariance stationarity, so we can fit ARMA models and perform the AIC analysis again:
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -1330 | -1330 | -1355 | -1353 |
AR1 | -1328 | -1339 | -1353 | -1358 |
AR2 | -1369 | -1377 | -1446 | -1453 |
AR3 | -1397 | -1400 | -1457 | -1457 |
AR4 | -1414 | -1466 | -1486 | -1489 |
AR5 | -1494 | -1499 | -1498 | -1500 |
AR6 | -1497 | -1497 | -1499 | -1498 |
AR7 | -1499 | -1502 | -1503 | -1501 |
AR8 | -1499 | -1501 | -1505 | -1505 |
AR9 | -1504 | -1502 | -1507 | -1506 |
AR10 | -1503 | -1508 | -1505 | -1505 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -1836 | -1857 | -1860 | -1859 |
AR1 | -1851 | -1859 | -1858 | -1866 |
AR2 | -1857 | -1860 | -1880 | -1883 |
AR3 | -1875 | -1877 | -1885 | -1890 |
AR4 | -1882 | -1888 | -1889 | -1888 |
AR5 | -1887 | -1888 | -1887 | -1887 |
AR6 | -1890 | -1888 | -1888 | -1889 |
AR7 | -1888 | -1890 | -1889 | -1888 |
AR8 | -1887 | -1885 | -1886 | -1884 |
AR9 | -1885 | -1883 | -1887 | -1885 |
AR10 | -1884 | -1883 | -1881 | -1882 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -2420 | -2420 | -2516 | -2516 |
AR1 | -2418 | -2469 | -2515 | -2517 |
AR2 | -2516 | -2567 | -2671 | -2686 |
AR3 | -2616 | -2614 | -2689 | -2687 |
AR4 | -2616 | -2640 | -2687 | -2696 |
AR5 | -2713 | -2714 | -2713 | -2711 |
AR6 | -2715 | -2713 | -2711 | -2709 |
AR7 | -2713 | -2711 | -2713 | -2714 |
AR8 | -2711 | -2709 | -2716 | -2713 |
AR9 | -2710 | -2708 | -2706 | -2714 |
AR10 | -2708 | -2706 | -2713 | -2711 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -1826 | -1827 | -1825 | -1824 |
AR1 | -1827 | -1825 | -1823 | -1828 |
AR2 | -1825 | -1823 | -1834 | -1848 |
AR3 | -1826 | -1829 | -1849 | -1855 |
AR4 | -1856 | -1891 | -1892 | -1895 |
AR5 | -1886 | -1893 | -1891 | -1893 |
AR6 | -1890 | -1891 | -1889 | -1896 |
AR7 | -1888 | -1889 | -1887 | -1891 |
AR8 | -1888 | -1891 | -1898 | -1896 |
AR9 | -1903 | -1901 | -1899 | -1899 |
AR10 | -1901 | -1912 | -1910 | -1908 |
| | MA0 | MA1 | MA2 | MA3 |
|---|---|---|---|---|
AR0 | -1059 | -1057 | -1055 | -1057 |
AR1 | -1057 | -1055 | -1054 | -1086 |
AR2 | -1055 | -1054 | -1065 | -1090 |
AR3 | -1056 | -1074 | -1090 | -1088 |
AR4 | -1063 | -1079 | -1094 | -1093 |
AR5 | -1077 | -1080 | -1078 | -1076 |
AR6 | -1077 | -1078 | -1076 | -1075 |
AR7 | -1076 | -1076 | -1080 | -1074 |
AR8 | -1076 | -1076 | -1080 | -1079 |
AR9 | -1077 | -1075 | -1089 | -1077 |
AR10 | -1077 | -1075 | -1074 | -1075 |
These results from the differenced data once again recommend a variety of models, much like the results from before we differenced the data.
In both cases we have the issue of imperfect maximization: the AIC values occasionally change by more than 2 when moving over a row or column. Additionally, there again does not appear to be a consensus on a single model size, but this time smaller models began to find favor under the AIC model selection procedure. Hence we have a hint that, if we continue to detrend the data, we may better understand why AprioriAll believes these five instances should be clustered together; however, one should keep in mind that this is simply a hypothesis suggested by these results (and not a very strong one at that).
Based on these results, we cannot make many strong inferences as to what kinds of information the AprioriAll algorithm extracts when finding patterns in a dataset. I believe this could be because the AprioriAll algorithm clustered all five of these instances based on four variables: acceleration, yaw/turn rate, lane offset, and lateral speed. This study analyzed only the single variable lane offset; hence we have reason to believe that a multivariate ARMA model may give better results than a univariate one. However, to analyze such a model we must include an analysis of cross-correlation (correlation between variables) in addition to our current results. Only then would I suggest further analysis.
Although we cannot make many strong inferences about the types of information AprioriAll extracts, this does not mean nothing was learned. One thing we did learn is that the AprioriAll algorithm appears to take autocorrelation (correlation within a variable over time) into account when applied to a dataset. This can be seen in the autocorrelation of the lane offset before differencing: the autocorrelation was high, and the ARMA model selection leaned toward higher-order models.
For future work on analyzing the AprioriAll algorithm, I would suggest looking into the times of day the data was recorded, as well as determining whether any of these drivers were distracted during these instances. AprioriAll may believe all of these drivers were driving in the same state, but if distractions were present, or if traffic conditions varied between drivers, that could explain why we could not find a single model agreed upon by all of the instances.