1. Introduction

Distracted driving has become increasingly common in recent years. In 2013, Kilbey reported 195,723 casualties from road accidents in the UK alone. Human failure or error has been identified as the cause of about 70% of accidents, with environmental and vehicular factors accounting for the remaining 30%. To date, researchers have tackled this issue from two directions: one analyzes the driver's physical behavior, while the other analyzes the vehicle's real-time driving patterns.

Research based on analyzing the driver's physical behavior primarily involves visual or auxiliary systems. For example, many approaches use camera-based sensors installed inside the vehicle to monitor the driver's eyes and issue warnings if an irregularity in eye movement or body posture is detected. However, these approaches incur additional hardware costs, and the camera itself may distract the driver, creating a safety hazard. As a result, practical deployment of this approach remains limited.

On the other hand, there is increasing research interest in analyzing real-time driving patterns using available built-in sensors that measure vehicle kinematic data such as speed, yaw/turn rate, and lane offset. These measured signals offer rich information for distinguishing between normal and irregular driving behaviors. It is often reasonable to assume that most drivers have their own normal driving patterns, and that a significant deviation from these patterns reflects a high probability of distracted behavior.

Objective:

In this study, we aim to analyze time series models that appropriately represent drivers driving straight. In particular, given kinematic data from different drivers operating under the driving pattern known as straight constant speed driving (as identified by an algorithm called AprioriAll), can we find a single model that best describes straight constant speed driving? If so, future work could build on this model to distinguish distracted from non-distracted driving data. Additionally, I am interested in whether we can learn anything about how the AprioriAll algorithm categorized these instances as the so-called straight constant speed driving state.


2. Data Overview

The University of Michigan Transportation Research Institute (UMTRI) has gathered kinematic driving data from drivers around southeast Michigan. These drivers were given vehicles with sensors recording acceleration, yaw/turn rate, lane offset, and lateral speed at a sampling rate of 10 Hz. Drivers kept the vehicles for four weeks and were free to use them however they wished. At the end of the four weeks, the vehicles were returned to UMTRI and the data were stored for researchers to use. Most drivers simply drove their normal daily routines, so UMTRI's dataset contains many replications of realistic kinematic driving data from the roads on those routines. Additionally, some drivers took the same highways as part of their routines, so UMTRI also has realistic kinematic data from multiple drivers on the same roads.

Data were selected from a particular highway that cannot be disclosed for privacy reasons. A sequential association rule-finding algorithm known as AprioriAll was used to find instances of drivers driving on this road via straight constant speed driving. The algorithm claims to be able to find frequent patterns in a given dataset and to cluster subsegments of the data into those patterns. For this driving dataset, each extracted frequent pattern can be thought of as representing a driving state the vehicle may be in. One of these states is straight constant speed driving; this is the state this study will analyze.

Based on the algorithm's results, straight constant speed driving entails that acceleration, yaw/turn rate, lane offset, and lateral speed are all exactly 0 within a 3.5-second window. In reality this can never be exactly true, but more often than not a driver on this particular highway will have some period in which the sensor values come very close to it. A total of 5 instances were selected from the dataset where drivers had sensor values most closely matching AprioriAll's straight constant speed driving category. These 5 instances come from 3 different drivers: two drivers contributed 2 trips each and the third contributed 1 trip. The drivers, trips, and number of data points in each file are summarized below.

Driver  Trip  NumObservations
4       148   321
4       186   461
25      173   702
103     33    375
103     118   211

In each of these instances we have data from the acceleration, yaw/turn rate, lane offset, and lateral speed sensors; however, this study will analyze only the lane offset signal, as it appears most closely related to determining whether a driver is driving straight.



3. Analysis

It may be helpful to view the lane offset signal from each of these instances before analyzing them, so we will begin with that:

Looking at these plots, it is not immediately obvious why the AprioriAll algorithm clustered these instances together. However, since each instance's data forms a time series, we may be able to better understand this clustering through time series analysis.

3.1 Checking for Weak Stationarity

Time series models are often appropriate when the data appear to exhibit characteristics of weak stationarity. A weakly stationary series must be both mean stationary and covariance stationary. Mean stationarity implies that the mean of the data is constant over time; we can check this by applying a 10-point non-overlapping sliding window to the data and calculating the mean within each window:
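The windowed-mean check described above can be sketched as follows. (The report's analysis is done in R; this is an illustrative Python version on a toy series, not the actual data.)

```python
# Mean-stationarity check: split a series into non-overlapping
# 10-point windows and compute each window's mean. If the series is
# mean stationary, these windowed means should vary little over time.

def windowed_means(x, width=10):
    """Means of consecutive non-overlapping windows of length `width`."""
    return [sum(x[i:i + width]) / width
            for i in range(0, len(x) - width + 1, width)]

# Toy example: a series oscillating around a constant mean
series = [0.01, -0.02, 0.03, 0.0, -0.01] * 20   # 100 points
means = windowed_means(series, width=10)
```

A plot of `means` against window index then gives the mean-stationarity diagnostic used below.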

It can be seen here that, for the most part, the mean has low variability over time in each of these instances, so we can say we have evidence of mean stationarity. What remains is to check covariance stationarity. To be covariance stationary, the covariance between two observations should depend only on the distance (lag) between them. We can check this by plotting the autocorrelation function over different time stretches within each trip and seeing whether they display similar patterns:

For the most part they display similar autocorrelation patterns both between and within datasets, so I believe it is safe to say that the data exhibit some form of covariance stationarity.
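The sample autocorrelation function behind these plots can be computed as below (an illustrative Python sketch of what R's acf() reports; toy data only):

```python
# Sample autocorrelation function (ACF). For a covariance-stationary
# series, ACFs computed over different stretches of the same trip
# should look similar.

def acf(x, max_lag):
    """Sample autocorrelation at lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n          # lag-0 autocovariance
    out = []
    for h in range(max_lag + 1):
        ch = sum((x[t] - mean) * (x[t + h] - mean)
                 for t in range(n - h)) / n           # lag-h autocovariance
        out.append(ch / c0)
    return out

# Toy example: a period-4 signal has strong lag-4 autocorrelation
r = acf([float(i % 4) for i in range(100)], max_lag=4)
```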

3.2 Fitting ARMA Models

Given evidence of both mean stationarity and covariance stationarity, we can now attempt to fit time series models to each of these datasets under the assumption of weak stationarity. One of the most basic yet well developed time series models based on this assumption is the Auto Regressive Moving Average (ARMA) model. A time series \(\{x_t:t=0,\pm1,\pm2,\dots\}\) is ARMA(\(p,q\)) if it is weakly stationary and \[x_t=\phi_1x_{t-1}+\dots+\phi_px_{t-p}+w_t+\theta_1w_{t-1}+\dots+\theta_qw_{t-q}\] where

  1. \(\phi_p\neq0\) and \(\theta_q\neq0\),
  2. \(\phi_i\) for \(i=1,\dots,p\) and \(\theta_j\) for \(j = 1,\dots,q\) are constants, and
  3. \(w_t\) is white noise with mean \(0\) and variance \(\sigma_w^2\) (denoted as \(w_t\sim wn(0,\sigma_w^2)\)).

The parameters \(p\) and \(q\) are called the autoregressive and moving average orders respectively. If \(x_t\) has a nonzero mean \(\mu\), then we set \(\alpha = \mu(1-\phi_1-\dots-\phi_p)\) and the model becomes \[x_t=\alpha+\phi_1x_{t-1}+\dots+\phi_px_{t-p}+w_t+\theta_1w_{t-1}+\dots+\theta_qw_{t-q}.\]
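To make the recursion concrete, here is a minimal simulation of an ARMA(1,1) process following the definition above (an illustrative Python sketch; the parameter values are arbitrary, not estimates from the data):

```python
import random

# Simulate x_t = alpha + phi*x_{t-1} + w_t + theta*w_{t-1}
# with white noise w_t ~ N(0, sigma_w^2), i.e. an ARMA(1,1) process.

def simulate_arma11(n, phi=0.7, theta=0.3, alpha=0.0, sigma_w=1.0, seed=0):
    rng = random.Random(seed)
    x, x_prev, w_prev = [], 0.0, 0.0
    for _ in range(n):
        w = rng.gauss(0.0, sigma_w)
        x_t = alpha + phi * x_prev + w + theta * w_prev
        x.append(x_t)
        x_prev, w_prev = x_t, w
    return x

series = simulate_arma11(500)
```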

Our goal in this section is to fit ARMA models to each of the time series in the dataset and then compare the models to see whether we can better understand why AprioriAll believes these time series should be clustered together. With the arima() function in R, fitting ARMA models is easy: arima() runs an optimization procedure to estimate the parameters \(\phi_i\) for \(i=1,\dots,p\) and \(\theta_j\) for \(j = 1,\dots,q\) of the chosen ARMA model. The real challenge is deciding which ARMA orders (the \(p\) and \(q\) parameters) to use when fitting.

Looking at this problem from a predictability standpoint, Akaike's Information Criterion (AIC) is a natural model selection method. AIC compares the likelihoods of different models by penalizing each model's likelihood by a measure of its complexity via the following equation:

\[AIC = -2\ell(\theta^*)+2D\] where \(D\) is the number of parameters, \(\ell(\cdot)\) is the log-likelihood of the model, and \(\theta^*\) is the estimate obtained by maximizing the log-likelihood. Essentially, the lower the AIC score, the better the model. We apply this criterion to each instance's lane offset, considering AR orders \(p\) from 0 to 10, as the autocorrelation plots shown earlier seem to support high-order autoregression, and MA orders \(q\) from 0 to 3. This gives the following results:

AIC Values for D4T148
MA0 MA1 MA2 MA3
AR0 -222 -602 -851 -1015
AR1 -1331 -1330 -1354 -1371
AR2 -1329 -1330 -1352 -1350
AR3 -1368 -1375 -1327 -1452
AR4 -1395 -1398 -1456 -1336
AR5 -1413 -1471 -1490 -1454
AR6 -1497 -1503 -1503 -1504
AR7 -1500 -1493 -1504 -1502
AR8 -1504 -1507 -1507 -1505
AR9 -1503 -1505 -1505 -1504
AR10 -1510 -1508 -1512 -1510
AIC Values for D4T186
MA0 MA1 MA2 MA3
AR0 -569 -1017 -1332 -1450
AR1 -1842 -1861 -1863 -1861
AR2 -1855 -1851 -1861 -1868
AR3 -1860 -1863 -1861 -1870
AR4 -1876 -1879 -1888 -1893
AR5 -1884 -1893 -1893 -1891
AR6 -1891 -1892 -1891 -1890
AR7 -1894 -1892 -1892 -1894
AR8 -1892 -1890 -1889 -1888
AR9 -1891 -1889 -1888 -1886
AR10 -1889 -1887 -1889 -1887
AIC Values for D25T173
MA0 MA1 MA2 MA3
AR0 -808 -1532 -1879 -2159
AR1 -2437 -2435 -2521 -2521
AR2 -2435 -2444 -2520 -2521
AR3 -2524 -2571 -2678 -2691
AR4 -2619 -2618 -2694 -2692
AR5 -2619 -2643 -2686 -2701
AR6 -2720 -2723 -2722 -2720
AR7 -2724 -2722 -2719 -2718
AR8 -2722 -2720 -2721 -2724
AR9 -2720 -2718 -2716 -2719
AR10 -2719 -2717 -2719 -2720
AIC Values for D103T33
MA0 MA1 MA2 MA3
AR0 -546 -958 -1257 -1411
AR1 -1827 -1827 -1825 -1823
AR2 -1827 -1827 -1825 -1827
AR3 -1825 -1825 -1827 -1829
AR4 -1825 -1828 -1848 -1853
AR5 -1857 -1907 -1906 -1909
AR6 -1891 -1906 -1905 -1908
AR7 -1896 -1906 -1904 -1910
AR8 -1895 -1897 -1896 -1903
AR9 -1896 -1895 -1892 -1908
AR10 -1918 -1917 -1915 -1914
AIC Values for D103T118
MA0 MA1 MA2 MA3
AR0 -310 -552 -708 -807
AR1 -1060 -1058 -1056 -1058
AR2 -1058 -1056 -1054 -1099
AR3 -1056 -1055 -1062 -1099
AR4 -1057 -1085 -1064 -1064
AR5 -1065 -1092 -1099 -1054
AR6 -1081 -1092 -1090 -1088
AR7 -1083 -1090 -1088 -1088
AR8 -1082 -1089 -1098 -1088
AR9 -1084 -1080 -1087 -1091
AR10 -1089 -1091 -1087 -1089
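AIC tables like those above come from a simple grid search over orders. The sketch below shows the shape of that loop in Python; the `toy_fit` function is a purely hypothetical stand-in for a real fitter such as R's arima() and returns made-up numbers, not values from the data:

```python
# Build an AIC table over a grid of ARMA orders:
# AIC(p, q) = -2 * loglik + 2 * D, lower is better.

def aic_table(fit, max_p=10, max_q=3):
    """Return {(p, q): AIC} for 0 <= p <= max_p, 0 <= q <= max_q."""
    table = {}
    for p in range(max_p + 1):
        for q in range(max_q + 1):
            loglik, n_params = fit(p, q)
            table[(p, q)] = -2.0 * loglik + 2.0 * n_params
    return table

# Hypothetical toy fitter: log-likelihood rises by 2 per extra
# coefficient, and D counts the p + q coefficients plus an intercept
# and the noise variance.
def toy_fit(p, q):
    return 100.0 + 2.0 * (p + q), p + q + 2

table = aic_table(toy_fit)
best_order = min(table, key=table.get)
```

Note the AIC logic this toy illustrates: an added parameter costs 2 AIC units, so it only pays off if it raises the maximized log-likelihood by more than 1.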

3.2.1 Discussion

These results appear to recommend a large variety of models based on the AIC:

  • D4T148 favors ARMA(10,2), ARMA(10,0), and ARMA(9,3)
  • D4T186 favors ARMA(7,0), ARMA(5,1), ARMA(5,2)
  • D25T173 favors ARMA(7,0), ARMA(6,1), and ARMA(6,2)
  • D103T33 favors ARMA(10,0), ARMA(10,1) and ARMA(10,2)
  • D103T118 favors ARMA(2,3), ARMA(5,2) and ARMA(5,1)

To our dismay, there does not appear to be agreement on a single model. There are also hints that the optimization procedure had trouble converging: the AIC values occasionally change by more than 2 units when shifting by one row or column, which cannot happen at a true optimum, since adding one parameter can never decrease the maximized log-likelihood and can therefore raise the AIC by at most 2. Additionally, there are hints that an MA order of 0 may suffice, and we can notice that the favored models tend to be the higher-order ones. This second observation, however, could simply reflect the strong autocorrelation seen in the lane offset variable.

3.3 ARIMA Analysis

The results from the previous ARMA analysis did not seem to support AprioriAll's conclusion. However, what if we looked at the data from a differenced perspective? Differenced data is a transformation of the data \(x_{1:N}\) via the following formula: \[z_n = x_n-x_{n-1}, \ \forall n.\]
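In code, the difference transformation is one line (an illustrative Python sketch; in R one would simply call diff()):

```python
# First differences: z[n] = x[n] - x[n-1] for n = 2..N.
# The differenced series has one fewer observation than the original.

def difference(x):
    return [x[n] - x[n - 1] for n in range(1, len(x))]

# Example: differencing removes a linear trend, leaving a constant
# series, which is why it can be viewed as a step toward stationarity.
trend = [2 * n for n in range(6)]     # 0, 2, 4, 6, 8, 10
flat = difference(trend)              # constant differences
```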

This difference transformation can be viewed as a transformation toward weak stationarity. We apply it to each instance's lane offset:

After applying the difference transformation to the data, we can again check the weak stationarity conditions. Looking first at mean stationarity with a window size of 10 points, we see the following:

These results indicate very small variability in the mean, allowing us to claim that the differenced data display a sense of mean stationarity. We can then check covariance stationarity by again looking at the autocorrelation function of the differenced data. (NOTE: this time the plots were enlarged so the patterns are easier to see, as the amount of autocorrelation changed dramatically from the previous results.)

Looking at these plots, we can see that the autocorrelation patterns are generally the same across all of the instances, with the exception of Driver 103; even then, parts of that driver's data show patterns similar to the rest. Hence we can claim at least some form of covariance stationarity, and we can fit ARMA models and perform the AIC analysis again:

AIC Values for D4T148
MA0 MA1 MA2 MA3
AR0 -1330 -1330 -1355 -1353
AR1 -1328 -1339 -1353 -1358
AR2 -1369 -1377 -1446 -1453
AR3 -1397 -1400 -1457 -1457
AR4 -1414 -1466 -1486 -1489
AR5 -1494 -1499 -1498 -1500
AR6 -1497 -1497 -1499 -1498
AR7 -1499 -1502 -1503 -1501
AR8 -1499 -1501 -1505 -1505
AR9 -1504 -1502 -1507 -1506
AR10 -1503 -1508 -1505 -1505
AIC Values for D4T186
MA0 MA1 MA2 MA3
AR0 -1836 -1857 -1860 -1859
AR1 -1851 -1859 -1858 -1866
AR2 -1857 -1860 -1880 -1883
AR3 -1875 -1877 -1885 -1890
AR4 -1882 -1888 -1889 -1888
AR5 -1887 -1888 -1887 -1887
AR6 -1890 -1888 -1888 -1889
AR7 -1888 -1890 -1889 -1888
AR8 -1887 -1885 -1886 -1884
AR9 -1885 -1883 -1887 -1885
AR10 -1884 -1883 -1881 -1882
AIC Values for D25T173
MA0 MA1 MA2 MA3
AR0 -2420 -2420 -2516 -2516
AR1 -2418 -2469 -2515 -2517
AR2 -2516 -2567 -2671 -2686
AR3 -2616 -2614 -2689 -2687
AR4 -2616 -2640 -2687 -2696
AR5 -2713 -2714 -2713 -2711
AR6 -2715 -2713 -2711 -2709
AR7 -2713 -2711 -2713 -2714
AR8 -2711 -2709 -2716 -2713
AR9 -2710 -2708 -2706 -2714
AR10 -2708 -2706 -2713 -2711
AIC Values for D103T33
MA0 MA1 MA2 MA3
AR0 -1826 -1827 -1825 -1824
AR1 -1827 -1825 -1823 -1828
AR2 -1825 -1823 -1834 -1848
AR3 -1826 -1829 -1849 -1855
AR4 -1856 -1891 -1892 -1895
AR5 -1886 -1893 -1891 -1893
AR6 -1890 -1891 -1889 -1896
AR7 -1888 -1889 -1887 -1891
AR8 -1888 -1891 -1898 -1896
AR9 -1903 -1901 -1899 -1899
AR10 -1901 -1912 -1910 -1908
AIC Values for D103T118
MA0 MA1 MA2 MA3
AR0 -1059 -1057 -1055 -1057
AR1 -1057 -1055 -1054 -1086
AR2 -1055 -1054 -1065 -1090
AR3 -1056 -1074 -1090 -1088
AR4 -1063 -1079 -1094 -1093
AR5 -1077 -1080 -1078 -1076
AR6 -1077 -1078 -1076 -1075
AR7 -1076 -1076 -1080 -1074
AR8 -1076 -1076 -1080 -1079
AR9 -1077 -1075 -1089 -1077
AR10 -1077 -1075 -1074 -1075

3.3.1 Discussion

Based on these results from the differenced data, we can see that:

  • D4T148 favors ARMA(10,1), ARMA(9,2), and ARMA(9,3)
  • D4T186 favors ARMA(6,0), ARMA(7,1), ARMA(3,3)
  • D25T173 favors ARMA(8,2), ARMA(6,0), and ARMA(5,1) and ARMA(7,3)
  • D103T33 favors ARMA(10,1), ARMA(10,2) and ARMA(10,3)
  • D103T118 favors ARMA(4,2), ARMA(4,3) and ARMA(2,3) and ARMA(3,2)

On the other hand, recalling the results from before we differenced the data:

  • D4T148 favored ARMA(10,2), ARMA(10,0), and ARMA(9,3)
  • D4T186 favored ARMA(7,0), ARMA(5,1), ARMA(5,2)
  • D25T173 favored ARMA(7,0), ARMA(6,1), and ARMA(6,2)
  • D103T33 favored ARMA(10,0), ARMA(10,1) and ARMA(10,2)
  • D103T118 favored ARMA(2,3), ARMA(5,2) and ARMA(5,1)

In both cases we have the issue of imperfect maximization: the AIC values occasionally change by more than 2 when moving over one row or column. Additionally, there is again no consensus on a single model size, but this time smaller models began to find favor under the AIC selection procedure. This hints that continuing to detrend the data might help us better understand why AprioriAll believes these five instances should be clustered together; one should keep in mind, however, that this is only a hypothesis suggested by the results, and not a very strong one at that.


4. Conclusions

Based on these results, we cannot make many strong inferences about what kinds of information the AprioriAll algorithm extracts when finding patterns in a dataset. I believe this could be because AprioriAll clustered these five instances based on all 4 variables (acceleration, yaw/turn rate, lane offset, and lateral speed), while this study analyzed only the single variable lane offset. Hence we have reason to believe a multivariate ARMA model may give better results than a univariate one. Analyzing such a model, however, would require an analysis of cross-correlation (correlation between variables) in addition to our current results; only then would I suggest further analysis.

Although I said we cannot make strong inferences about the types of information AprioriAll extracts, this does not imply nothing was learned. One thing we did learn is that AprioriAll appears to take autocorrelation (correlation within a variable over time) into account: before differencing, the lane offset showed high autocorrelation, and the ARMA models favored were correspondingly of high order.

For future work on analyzing the AprioriAll algorithm, I would suggest also looking into the times of day at which the data were recorded, as well as determining whether any of these drivers were distracted during these instances. AprioriAll may believe all of these drivers were driving in the same state, but if distractions were present or traffic conditions varied between drivers, that could explain why we could not find a single model agreed upon across all of the instances.



5. References