This final project is a replication of a previous research paper which took advantage of hidden markov models. The article’s title is “Analysis of animal accelerometer data using hidden Markov Models”[2]. Using accelerometer to capture animals’ motion is a prevalent way to keep track of animal behaviour. The measured data, usually called minimum specific acceleration(MSA), can directly indicate the instant activeness of the animal[2]. The larger the MSA is, the probable that the animal is in an active state[2]. I am going to use one of the dataset from this research article to experiment hidden Markov models and sketch out the activeness of an Verreaux eagle throughout a day.
According to the article, an Verreaux eagle(simplified as “eagle” below) was tracked for 9 days from 04/16 to 04/24 [1][2]. On each day data were collected from about 8am to 8pm, with a time interval of about one and a half minutes between two adjacent timepoints. Therefore each time series from different days were independent of each other[2].
According to the article, the authors presumed that there are two hidden states (active and inactive)[2]. Within the active state, the MSA follows a gamma distribution. Within the inactive state, the MSA follows a mixture of two gamma distributions[2]. Gamma distribution is a simple and ideal disribution to describe a set of points that are mostly small with a handful of extremely large points. The authors decided to use a mixture of gamma distributions for the inactive state because the eagle could be either gliding at a low speed or it could be roosting[2].
Since there are only two states, the state transition can be represented by a 2 by 2 probability matrix. In the dataset the authors also collected the wind speed and temperature for each time point. They believed that the transitional probability can be fitted by a logistic regression with wind speed, temperature and their interaction[2].
Because of the time constraint, I decide to carry out analysis only on data points collect on a single day. Since there are the most number of data points on 04/21 (235). My plan is to fit seven POMP models, compare their likelihood, then interpret the one with the best fit.
First I plan to just fit a gamma distribution on the 235 data points.
The second model I want to fit is the simplest POMP model. It has a transitional probability matrix which is a constant. \[ \begin{bmatrix} 1-p_{0} & 1-p_{1}\\ p_{0} & p_{1} \end{bmatrix} \] \(p_{0}\) represents the probability of the eagle changing from inactive state to active state \(p_{1}\) represents the probability of the eagle staying in the active state. Within each state the MSA follows its own gamma distribution \[ \begin{aligned} \text{MSA} \stackrel{\text{inactive}}{\sim} Gamma(\text{shape0}, \text{scale0})\\ \text{MSA} \stackrel{\text{active}}{\sim} Gamma(\text{shape1}, \text{scale1}) \end{aligned} \]
The third model is similar, but differs from the second model in that the measurement distribution of the inactive state is a mixture of two gamma distributions
\[ \text{MSA} \stackrel{\text{inactive}}{\sim} p_{mix}\times Gamma(\text{shape01}, \text{scale01})+(1-p_{mix})\times Gamma(\text{shape02}, \text{scale02}) \]
The fourth model adds to the third model by making the transitional probability follow a logistic regression with wind speed as a covariate, namely
\[ \begin{aligned} \text{logit}(p_{0})\sim \beta_{00}+\beta_{01}\text{wind}\\ \text{logit}(p_{1})\sim \beta_{10}+\beta_{11}\text{wind} \end{aligned} \]
The fifth model has the logistic regression based on temperature
\[ \begin{aligned} \text{logit}(p_{0})\sim \beta_{00}+\beta_{02}\text{temperature}\\ \text{logit}(p_{1})\sim \beta_{10}+\beta_{12}\text{temperature} \end{aligned} \]
The sixth model has the logistic regression based on the main effects of both wind speed and temperature
\[ \begin{aligned} \text{logit}(p_{0})\sim \beta_{00}+\beta_{01}\text{wind}+\beta_{02}\text{temperature}\\ \text{logit}(p_{1})\sim \beta_{10}+\beta_{11}\text{wind}+\beta_{12}\text{temperature} \end{aligned} \]
The seventh model is the full model with both covariates and their interaction term
\[ \begin{aligned} \text{logit}(p_{0})\sim \beta_{00}+\beta_{01}\text{wind}+\beta_{02}\text{temperature}+\beta_{03}\text{temperature}\times \text{wind}\\ \text{logit}(p_{1})\sim \beta_{10}+\beta_{11}\text{wind}+\beta_{12}\text{temperature}+\beta_{13}\text{temperature}\times \text{wind} \end{aligned} \]
The fitting of the POMP models are based on the pomp package for R[4][6]. Specifically I am using the iterated particle filtering algorithm to estimate the parameters[4][6]. Because of the limitation of time, after I fit the third model and get the parameters of the measurement model, I set them as constant when fitting further models. I take this approach because I am not particularly interested in the parameters of the gamma distributions. This approach can also increase the efficiency of the algorithm given that there are fewer parameters that it needs to optimize.
First I plot the raw data and its autocorrelation for different lags.
There is significant autocorrelation at small lags, which indicates that a POMP model could fit the data well.
I summarize the log likelihoods of all the models in the table below.
Number of Parameters | Log Likelihood | Results from Article | |
---|---|---|---|
Model 1 | 2 | 188.84 | NA |
Model 2 | 6 | 344.46 | NA |
Model 3 | 9 | 355.24 | 2000.2 |
Model 4 | 11 | 358.29 | 2001.9 |
Model 5 | 13 | 355.37 | 2010.4 |
Model 6 | 15 | 358.62 | 2011.6 |
Model 7 | 17 | 355.99 | 2017.0 |
It’s worth noting that the log likelihood from the article is calculated based on the data from all nine days, therefore it’s much larger than my log likelihood calculated from a single day. After comparing the log likelihoods, I conclude that model 4 is the optimal model. Model 4 assumes that the transitional probability between hidden states can be fitted by a logistic regression with wind speed as the only covariate.
To evaluate the model diagnostics, I decide to offer the convergence diagnostics plots for model 3 and model 4. Recall that I use model 3 to decide the parameters for the measurement model and I used model 4 to decide the rest of the parameters for the transitional probability matrix.
First I offer the diagnostics plots for model 3, which is the POMP model without covariates.
All the iterative filtering trials converge, yet they converge to several distinct parameter sets. Given the time constraint, I took a heuristic and decide to pick the parameter sets based on two standards. The first standard is that the mean of the gamma distribution corresponding to my presumed active state should be higher than the mean values of the gamma distributions corresponding to my presumed inactive states. The second standard is high log likelihood. Luckily after I investigate all the filtering trials, the one with the highest log likelihood meets both my standards. Therefore I go ahead with this set of parameters.
Next is the diagnostics plot for model 4, which is the optimal model that I select.
All the iterative filtering trials converge to a single set of parameters, indicating that the model is stable.
We can also visually checkout the goodness of fit by simulating several time series using the selected model.
Because of the unpredictability of the eagle’s activity, the shape of my time series cannot match the eagle’s exactly, but the overall trends are the same. Most of the time the acceleration is small, but occsasionally there are high peaks.
The selected parameter is
X | x |
---|---|
beta00 | -2.8769716 |
beta01 | 0.1420076 |
beta10 | -1.7332549 |
beta11 | 1.3247512 |
pmix | 0.4208009 |
shape01 | 20.5100964 |
scale01 | 0.0006964 |
shape02 | 2.9724050 |
scale02 | 0.0090519 |
shape1 | 1.0658164 |
scale1 | 0.3563128 |
This set of parameters tell us that \[ \text{log odds}(p_{\text{inactive}\rightarrow \text{active}})=-2.87+0.14\times \text{wind} \]
\[ \text{log odds}(p_{\text{active}\rightarrow \text{active}})=-1.73+1.32\times \text{wind} \]
Within the active state, the MSA follows \[ MSA\sim Gamma(\text{shape}=1.066, \text{scale}=0.356) \]
Within the inactive state, the MSA follows a mixture of
\[ \begin{aligned} MSA\sim Gamma(\text{shape}=20.51, \text{scale}=6.96\times 10^{-4})\\ MSA\sim Gamma(\text{shape}=2.972, \text{scale}=0.0091) \end{aligned} \]
It would be more illustrative with some plots. Inspired by the article, I plot the probability of the eagle staying in its original state against the wind speed.
We can see that as the wind speed increases, the probability of the eagle switching from inactive state to active state will gradually decrease from about 0.95. The eagle rarely stays in active state for a prolonged period of time, but the probability of staying in active state will drastically increase when the wind speed increases. It is encouraging to see that this plot is very similar to the one in the original article[2], which indicates that my model fitting is successful.
I can also calculate the stationary distribution of two states under wind speed. The algorithm can be found [7].
It’s not surprising that the eagle will mostly stay in inactive state when the wind speed is low and it will mostly stay in active state when the wind speed is high. This plot also resembles the one in the original article[2].
Recall that we learnt about ARIMA model in the first half of class[4]. I quickly test the validity of this approach. First collect the AIC value of different ARIMA model with different number of autoregressive components and moving average components.
MA0 | MA1 | MA2 | MA3 | MA4 | MA5 | |
---|---|---|---|---|---|---|
AR0 | 216.5828 | 192.3169 | 194.1782 | 183.5883 | 182.8010 | 182.8297 |
AR1 | 190.0190 | 177.4872 | 178.1269 | 171.5401 | 173.4630 | 174.1114 |
AR2 | 190.8779 | 178.7243 | 173.2757 | 173.2533 | 174.4307 | 176.1966 |
AR3 | 174.8193 | 172.8713 | 173.8913 | 170.4232 | 170.6101 | 160.7129 |
AR4 | 175.1443 | 174.2813 | 174.5854 | 176.5850 | 177.5627 | 179.1692 |
It looks like ARIMA(3,0,5) is the best. I fit the model and check out the log likelihood of the model.
## [1] -70.35643
This is way smaller than the log likelihood of my POMP model. This shows that partial Markov Process is superior to ARIMA in this specific task.
In this project I am able to replicate the data analysis of a past ecological article using the pomp
package in R. The algorithm is different from what the authors used, but I am able to achieve similar outcomes. Both my efforts and the article confirms that modelling an eagle’s daily activity into active/inactive states is feasible. The eagle is mostly in an inactive state and might burst into active state for a short period once in a while. It’s also noted that the probability of staying in /switching states varies with the wind speed of the eagle’s environment. The higher the wind speed is, the more probable that the eagle might stay in the active state.
Because of time constraints, I am not able to carry out extra computations that I would like to do. In this project I only took advantage of the data collected in a day while there are in total nine days of data. The heuristics that I take are potentially controversial. A typical example is how I select and fix the parameters for the measurement model. Profile likelihood methods should be introduced where necessary.
All the major computations are coded up in standalone R scripts and run on the greatlakes clusters offered by University of Michigan high performace computing service. The access to the cluster is provided by Dr Edward Ionides, the instructor of STATS 531 “Analysis of Time Series”.
For the details of my code please refer to the github repository.
[1]Dryad Digital Repository(how I access the data, link)
[2] Leos‐Barajas, V., Photopoulou, T., Langrock, R., Patterson, T.A., Watanabe, Y.Y., Murgatroyd, M. and Papastamatiou, Y.P. (2017), Analysis of animal accelerometer data using hidden Markov models. Methods Ecol Evol, 8: 161-173. doi:10.1111/2041-210X.12657
[3] Wiley online library(how I access the article and its appendices, link)
[4] STATS 531 WN2020 ( link)
[5] How to write R extensions from CRAN(link)
[6] pomp
package tutorial(link)
[7] How to calculate stationary distribution from a transitional probability matrix(link)