Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.
CC-BY_NC



Objectives

  1. Develop likelihood-based inference in the context of ARMA models.

  2. Discuss maximum likelihood parameter estimation and alternative methods.

  3. Investigate strategies for model selection, also known as model identification, in the context of ARMA models.

  4. Work on practical computational approaches for implementing these methods.




5.1 Background on likelihood-based inference

5.2 The maximum likelihood estimator (MLE)




5.2.1 Question: Why are \(\arg\max_\theta {\mathscr{L}}(\theta)\) and \(\arg\max_\theta {\ell}(\theta)\) the same?




  • We can write \(\hat\theta_{MLE}\) and \({{\hat\theta_{MLE}^*}}\) if we are considering various alternative estimation methods. However, in this course, we will most often be using maximum likelihood estimation so we let \(\hat\theta\) and \({{\hat\theta^*}}\) correspond to this approach.




5.3 Standard errors for the MLE

  1. The Fisher information. This is computationally quick, but works well only when \(\hat\theta(Y_{1:N})\) is well approximated by a normal distribution.

  2. Profile likelihood estimation. This is a bit more computational effort, but generally is preferable to the Fisher information.

  3. A simulation study, also known as a bootstrap.

    • If done carefully and well, this can be the best approach.

    • A confidence interval is a claim about reproducibility. You claim, so far as your model is correct, that on 95% of realizations from the model, a 95% confidence interval you have constructed will cover the true value of the parameter.

    • A simulation study can check this claim fairly directly, but requires the most effort.

    • The simulation study takes time for you to develop and debug, time for you to explain, and time for the reader to understand and check what you have done. We usually carry out simulation studies to check our main conclusions only.




5.3.1 Standard errors via the observed Fisher information

  • We suppose that \(\theta\in{\mathbb{R}}^D\) and so we can write \(\theta=\theta_{1:D}\).

  • The Hessian matrix of a function is the matrix of its second partial derivatives. We write the Hessian matrix of the log likelihood function as \(\nabla^2{\ell}(\theta)\), a \(D\times D\) matrix whose \((i,j)\) element is \[ \big[\nabla^2{\ell}(\theta)\big]_{ij} = \frac{\partial^2}{\partial\theta_i\partial\theta_j}{\ell}(\theta).\]

  • The observed Fisher information is \[ {{\hat{I}^*}} = - \nabla^2{\ell}({{\hat\theta^*}}).\]

  • A standard asymptotic approximation to the distribution of the MLE for large \(N\) is \[ \hat\theta(Y_{1:N}) \approx N\left[\theta, [{{\hat{I}^*}}]^{-1}\right], \] where \(\theta\) is the true parameter value. This asserts that the MLE is asymptotically unbiased, with variance asymptotically attaining the Cramer-Rao lower bound. Thus, we say the MLE is asymptotically efficient. Here, we interpret \(\approx\) to mean “one could write a limit statement formally justifying this approximation in a suitable limit.”

  • A corresponding approximate 95% confidence interval for \(\theta_d\) is \[ {{\hat\theta_d^*}} \pm 1.96 \big[{{{\hat{I}^*}}}^{-1}\big]_{dd}^{1/2}.\]

  • The R function arima computes standard errors for the MLE of an ARMA model in this way.

  • We usually only have one time series, with some fixed \(N\), and so we cannot in practice take \(N\to\infty\). When our time series model is non-stationary it may not even be clear what it would mean to take \(N\to\infty\). These asymptotic results should be viewed as nice mathematical reasons to consider computing an MLE, but not a substitute for checking how the MLE behaves for our model and data.


5.3.2 Confidence intervals via the profile likelihood

  • Let’s consider the problem of obtaining a confidence interval for \(\theta_d\), the \(d\)th component of \(\theta_{1:D}\).

  • The profile log likelihood function of \(\theta_d\) is defined to be \[ {\ell^\mathrm{profile}_d}(\theta_d) = \max_{\phi\in{\mathbb{R}}^D: \phi_d=\theta_d}{\ell}(\phi).\] In general, the profile likelihood of one parameter is constructed by maximizing the likelihood function over all other parameters.

  • Check that \(\max_{\theta_d}{\ell^\mathrm{profile}_d}(\theta_d) = \max_{\theta_{1:D}}{\ell}(\theta_{1:D})\). Maximizing the profile likelihood \({\ell^\mathrm{profile}_d}(\theta_d)\) gives the MLE, \({{\hat\theta_d^*}}\).

  • An approximate 95% confidence interval for \(\theta_d\) is given by \[ \big\{\theta_d : {\ell}({{\hat\theta^*}}) - {\ell^\mathrm{profile}_d}(\theta_d)< 1.92\big\}.\]

  • This is known as a profile likelihood confidence interval. The cutoff \(1.92\) is derived using Wilks’s theorem, which we will discuss in more detail when we develop likelihood ratio tests.

  • Although the asymptotic justification of Wilks’s theorem is the same limit that justifies the Fisher information standard errors, profile likelihood confidence intervals tend to work better than Fisher information confidence intervals when \(N\) is not so large—particularly when the log likelihood function is not close to quadratic near its maximum.




5.3.3 Bootstrap methods for constructing standard errors and confidence intervals

  • Suppose we want to know the statistical behavior of the estimator \(\hat\theta({y_{1:N}})\) for models in a neighborhood of the MLE, \({\theta^*}=\hat\theta({y_{1:N}^*})\).

  • In particular, let’s consider the problem of estimating uncertainty about \(\theta_1\). We want to assess the behavior of the maximum likelihood estimator, \(\hat\theta({y_{1:N}})\), and possibly the coverage of an associated confidence interval estimator, \(\big[\hat\theta_{1,\mathrm lo}({y_{1:N}}),\hat\theta_{1,\mathrm hi}({y_{1:N}})\big]\). The confidence interval estimator could be constructed using either the Fisher information method or the profile likelihood approach.

  • The following simulation study lets us address the following goals:

  1. Evaluate the coverage of a proposed confidence interval estimator, \([\hat\theta_{1,\mathrm lo},\hat\theta_{1,\mathrm hi}]\),

  2. Construct a standard error for \({{\hat\theta_1^*}}\),

  3. Construct a confidence interval for \(\theta_1\) with exact local coverage.
  1. Generate \(J\) independent Monte Carlo simulations, \[Y_{1:N}^{[j]} \sim f_{Y_{1:N}}(y_{1:N}{\, ; \,}{{\hat\theta^*}})\mbox{ for } j\in 1:J.\]

  2. For each simulation, evaluate the maximum likelihood estimator, \[ \theta^{[j]} = \hat\theta\big(Y_{1:N}^{[j]}\big)\mbox{ for } j\in 1:J,\] and, if desired, the confidence interval estimator, \[ \big[\theta^{[j]}_{1,\mathrm lo},\theta^{[j]}_{1,\mathrm hi}\big] = \big[\hat\theta_{1,\mathrm lo}({Y^{[j]}_{1:N}}),\hat\theta_{1,\mathrm hi}({Y^{[j]}_{1:N}})\big].\]

  3. We can use these simulations to obtain solutions to our goals for uncertainty assessment:

  1. For large \(J\), the coverage of the proposed confidence interval estimator is well approximated, for models in a neighborhood of \({{\hat\theta^*}}\), by the proportion of the intervals \(\big[\theta^{[j]}_{1,\mathrm lo},\theta^{[j]}_{1,\mathrm hi}\big]\) that include \({{\hat\theta_1^*}}\).

  2. The sample standard deviation of \(\{ \theta^{[j]}_1, j\in 1:J\}\) is a natural standard error to associate with \({{\hat \theta_1^*}}\).

  3. For large \(J\), one can empirically calibrate a 95% confidence interval for \(\theta_1\) with exactly the claimed coverage in a neighborhood of \({{\hat\theta^*}}\). For example, using profile methods, one could replace the cutoff 1.92 by a constant \(\alpha\) chosen such that 95% of the profile confidence intervals computed for the simulations cover \({{\hat\theta_1^*}}\).




5.3.4 Question: Local coverage as an approximation to actual coverage for a confidence interval

  • A true 95% confidence interval covers \(\theta\) with probability 0.95 whatever the value of \(\theta\).

  • The local coverage probability at a value \(\theta=\tilde\theta\) is the chance that the confidence interval covers \(\tilde\theta\) when the true parameter value is \(\tilde\theta\). Typically, we compute local coverage at \(\theta={{\hat\theta^*}}\).

  • Local coverage can be evaluated or calibrated via simulation; the actual (global) coverage is usually hard to work with.

  • What properties of the model and data make local coverage a good substitute for global coverage? How would you check whether or not these properties hold?




5.4 Likelihood-based model selection and model diagnostics

5.4.1 Likelihood ratio tests for nested hypotheses

  • The whole parameter space on which the model is defined is \(\Theta\subset{\mathbb{R}}^D\).

  • Suppose we have two nested hypotheses \[\begin{eqnarray} H^{\langle 0\rangle} &:& \theta\in \Theta^{\langle 0\rangle}, \\ H^{\langle 1\rangle} &:& \theta\in \Theta^{\langle 1\rangle}, \end{eqnarray}\] defined via two nested parameter subspaces, \(\Theta^{\langle 0\rangle}\subset \Theta^{\langle 1\rangle}\), with respective dimensions \(D^{\langle 0\rangle}< D^{\langle 1\rangle}\le D\).

  • We consider the log likelihood maximized over each of the hypotheses, \[\begin{eqnarray} \ell^{\langle 0\rangle} &=& \sup_{\theta\in \Theta^{\langle 0\rangle}} \ell(\theta), \\ \ell^{\langle 1\rangle} &=& \sup_{\theta\in \Theta^{\langle 1\rangle}} \ell(\theta). \end{eqnarray}\]

  • A useful approximation asserts that, under the hypothesis \(H^{\langle 0\rangle}\), \[ \ell^{\langle 1\rangle} - \ell^{\langle 0\rangle} \approx (1/2) \chi^2_{D^{\langle 1\rangle}- D^{\langle 0\rangle}}, \] where \(\chi^2_d\) is a chi-squared random variable on \(d\) degrees of freedom and \(\approx\) means “is approximately distributed as.”

  • We will call this the Wilks approximation.

  • The Wilks approximation can be used to construct a hypothesis test of the null hypothesis \(H^{\langle 0\rangle}\) against the alternative \(H^{\langle 1\rangle}\).

  • This is called a likelihood ratio test since a difference of log likelihoods corresponds to a ratio of likelihoods.

  • When the data are IID, \(N\to\infty\), and the hypotheses satisfy suitable regularity conditions, this approximation can be derived mathematically and is known as Wilks’s theorem.

  • We therefore have two different interpretations of \(\approx\). One is “one could write a limit statement formally justifying this approximation in a suitable limit” and another is “this approximation is useful in the finite sample situation at hand.” These interpretations may both be appropriate!

  • The chi-squared approximation to the likelihood ratio statistic may be useful, and can be assessed empirically by a simulation study, even in situations that do not formally satisfy any known theorem.




5.4.2 Exercise: Using a likelihood ratio test to construct profile likelihood confidence intervals

  • Recall the duality between hypothesis tests and confidence intervals:

    The estimated parameter \({\theta^*}\) does not lead us to reject a null hypothesis of \(\theta=\theta^{\langle 0\rangle}\) at the 5% level \[\Updownarrow\] \(\theta^{\langle 0\rangle}\) is in a 95% confidence interval for \(\theta\).

  • We can check what the 95% cutoff is for a chi-squared distribution with one degree of freedom,

qchisq(0.95,df=1)
## [1] 3.841459
  • We can now see how the Wilks approximation suggests a confidence interval constructed from parameter values having a profile likelihood within 1.92 log units of the maximum.

  • It is a exercise to write out more details (to your own satisfaction) on how to use the Wilks approximation, together with the duality between hypothesis tests and confidence intervals, to derive a profile likelihood confidence interval.




5.4.3 Akaike’s information criterion (AIC)

  • Likelihood ratio tests provide an approach to model selection for nested hypotheses, but what do we do when models are not nested?

  • A more general approach is to compare likelihoods of different models by penalizing the likelihood of each model by a measure of its complexity.

  • Akaike’s information criterion AIC is given by \[ AIC = -2 \times {\ell}({\theta^*}) + 2D\] “Minus twice the maximized log likelihood plus twice the number of parameters.”

  • We are invited to select the model with the lowest AIC score.

  • AIC was derived as an approach to minimizing prediction error. Increasing the number of parameters leads to additional overfitting which can decrease predictive skill of the fitted model.

  • Viewed as a hypothesis test, AIC may have weak statistical properties. It can be a mistake to interpret AIC by making a claim that the favored model has been shown to provides a superior explanation of the data. However, viewed as a way to select a model with reasonable predictive skill from a range of possibilities, it is often useful.




5.4.4 Exercise: Comparing AIC with likelihood ratio tests

  • Suppose we are in a situation in which we wish to choose between two nested hypotheses, with dimensions \(D^{\langle 0\rangle}< D^{\langle 1\rangle}\). Suppose the Wilks approximation is valid.

  • Consider the strategy of selecting the model with the lowest AIC value.

  • We can view this model selection approach as a formal statistical test.

  • Find an expression for the size of this AIC test (i.e, the probability of rejecting the null hypothesis, \(H^{\langle 0\rangle}\), when this null hypothesis is true).

  • Evaluate this expression for \(D^{\langle 1\rangle} - D^{\langle 0\rangle}=1\).




5.5 Implementing likelihood-based inference for ARMA models in R




5.5.1 Reading in the data

  • Here is the head of the file huron_depth.csv.

    # downloaded on 1/24/16 from
    # http://www.glerl.noaa.gov/data/dashboard/data/levels/mGauge/miHuronMog.csv
    # Lake Michigan-Huron:, Monthly Average Master Gauge Water Levels (1860-Present)
    # Source:, NOAA/NOS
    Date, Average
    01/01/1860,177.285
    02/01/1860,177.339
    03/01/1860,177.349
    04/01/1860,177.388
    05/01/1860,177.425
  • A bit of work has to be done manipulating the Date variable.

    • Moving between date formats is a necessary skill for time series analysis!

    • A standard representation of time is POSIXct, which is a signed real number representing the number of seconds since the beginning of 1970.

    • The raw data have a character string representing date. We convert this into the standard format using strptime. Than we can extract whatever we need. See ?DateTimeClasses for more on manipulating date and time formats in R.

dat <- read.table(file="huron_depth.csv",sep=",",header=TRUE)
dat$Date <- strptime(dat$Date,"%m/%d/%Y")
dat$year <- as.numeric(format(dat$Date, format="%Y"))
dat$month <- as.numeric(format(dat$Date, format="%m"))
head(dat)
##         Date Average year month
## 1 1860-01-01 177.285 1860     1
## 2 1860-02-01 177.339 1860     2
## 3 1860-03-01 177.349 1860     3
## 4 1860-04-01 177.388 1860     4
## 5 1860-05-01 177.425 1860     5
## 6 1860-06-01 177.461 1860     6
  • For now, let’s avoid monthly seasonal variation by considering an annual series of January depths. We will investigate seasonal variation later in the course, but sometimes it is best avoided.
dat <- subset(dat,month==1)
huron_depth <- dat$Average
year <- dat$year
plot(huron_depth~year,type="l")




5.5.2 Fitting an ARMA model

  • Later, we will consider hypotheses of trend. For now, let’s start by fitting a stationary ARMA\((p,q)\) model under the null hypothesis that there is no trend. This hypothesis, which asserts that nothing has substantially changed in this system over the last 150 years, is not entirely unreasonable from looking at the data.

  • We seek to fit a stationary Gaussian ARMA(p,q) model with parameter vector \(\theta=({\phi}_{1:p},{\psi}_{1:q},\mu,\sigma^2)\) given by \[ {\phi}(B)(Y_n-\mu) = {\psi}(B) \epsilon_n,\] where \[\begin{eqnarray} \mu &=& {\mathbb{E}}[Y_n] \\ {\phi}(x)&=&1-{\phi}_1 x-\dots -{\phi}_px^p, \\ {\psi}(x)&=&1+{\psi}_1 x+\dots +{\psi}_qx^q, \\ \epsilon_n&\sim&\mathrm{ iid }\, N[0,\sigma^2]. \end{eqnarray}\]

  • We need to decide where to start in terms of values of \(p\) and \(q\). Let’s tabulate some AIC values for a range of different choices of \(p\) and \(q\).

  • In the code below, note the use of kable for formatting HTML tables when using Rmarkdown. The "<b>" and ""</b>" tags in dimnames make the rownames boldface in HTML. By default, only column names are boldface in standard HTML.

aic_table <- function(data,P,Q){
  table <- matrix(NA,(P+1),(Q+1))
  for(p in 0:P) {
    for(q in 0:Q) {
       table[p+1,q+1] <- arima(data,order=c(p,0,q))$aic
    }
  }
  dimnames(table) <- list(paste("<b> AR",0:P, "</b>", sep=""),paste("MA",0:Q,sep=""))
  table
}
huron_aic_table <- aic_table(huron_depth,4,5)
require(knitr)
kable(huron_aic_table,digits=2)
MA0 MA1 MA2 MA3 MA4 MA5
AR0 166.75 46.60 7.28 -14.97 -18.64 -26.09
AR1 -38.00 -37.41 -35.46 -33.82 -34.13 -32.20
AR2 -37.33 -38.43 -36.90 -34.93 -34.35 -33.08
AR3 -35.52 -35.17 -32.71 -31.37 -33.21 -32.98
AR4 -33.94 -34.91 -34.43 -37.48 -31.31 -30.90


5.5.3 Question: What do we learn by interpreting the results in the above table of AIC values?


5.5.4 Question: In what ways might we have to be careful not to over-interpret the results of this table?




  • Let’s fit the ARMA(2,1) model recommended by consideration of AIC.
huron_arma21 <- arima(huron_depth,order=c(2,0,1))
huron_arma21
## 
## Call:
## arima(x = huron_depth, order = c(2, 0, 1))
## 
## Coefficients:
##           ar1     ar2     ma1  intercept
##       -0.0525  0.7910  1.0000   176.4603
## s.e.   0.0522  0.0526  0.0242     0.1210
## 
## sigma^2 estimated as 0.04188:  log likelihood = 24.21,  aic = -38.43
  • We can examine the roots of the AR polynomial,
AR_roots <- polyroot(c(1,-coef(huron_arma21)[c("ar1","ar2")]))
AR_roots
## [1]  1.158083-0i -1.091668+0i
  • These are just outside the unit circle, suggesting we have a stationary causal fitted ARMA.

  • However, the MA root is -1, showing that the fitted model is at the threshold of non-invertibility.

  • Is this non-invertibility a problem? Let’s investigate a little, using profile and bootstrap methods. The claimed standard error on the MA1 coefficient, from the Fisher information approach used by arima is small.

  • First, we can see if the approximate confidence interval constructed using profile likelihood is in agreement with the approximate confidence interval constructed using the observed Fisher information.

  • To do this, we need to maximize the ARMA likelihood while fixing the MA1 coefficient at a range of values. This can be done using arima as follows. Note that the fixed argument expects a vector of length \(p+q+1\) corresponding to a concatenated vector \(({\phi}_{1:p},{\psi}_{1:q}, \mu)\). Somehow, the Gaussian white noise variance, \(\sigma^2\), is not included in this representation. Parameters with NA entries in fixed are estimated.

K <- 500
ma1 <- seq(from=0.2,to=1.1,length=K)
profile_loglik <- rep(NA,K)
for(k in 1:K){
   profile_loglik[k] <- logLik(arima(huron_depth,order=c(2,0,1),
      fixed=c(NA,NA,ma1[k],NA)))
}
plot(profile_loglik~ma1,ty="l")

5.5.5 Question: Interpret the profile likelihood plot for \({\psi}_1\).

  • What do you conclude about the Fisher information confidence interval proposed by arima?

  • When do you think the Fisher information confidence interval may be reliable?

  • Is this profile likelihood plot, and its statistical interpretation, reliable? How do you support your opinion on this?




  • Let’s do a simulation study
set.seed(57892330)
J <- 1000
params <- coef(huron_arma21)
ar <- params[grep("^ar",names(params))]
ma <- params[grep("^ma",names(params))]
intercept <- params["intercept"]
sigma <- sqrt(huron_arma21$sigma2)
theta <- matrix(NA,nrow=J,ncol=length(params),dimnames=list(NULL,names(params)))
for(j in 1:J){
   Y_j <- arima.sim(
      list(ar=ar,ma=ma),
      n=length(huron_depth),
      sd=sigma
   )+intercept
   theta[j,] <- coef(arima(Y_j,order=c(2,0,1)))
}
hist(theta[,"ma1"],freq=FALSE) 

  • This seems consistent with the profile likelihood plot.

  • A density plot shows this similarity even more clearly.

plot(density(theta[,"ma1"],bw=0.05))

  • Here, I’m showing the raw plot for instructional purposes. For a report, one should improve the default axis labels and title.

  • Note that arima transforms the model to invertibility. Thus, the estimated value of \(\theta_1\) can only fall in the interval \((-1,1)\) but can be arbitrarily close to \(-1\) or \(1\).

range(theta[,"ma1"])
## [1] -1  1
+ Estimated densities outside $[-1,1]$ are artifacts of the density estimation procedure. 

+ How would you refine this procedure to get a density estimate respecting the range of the parameter estimation procedure?
  • To understand what is going on better, it is helpful to do another simulation study for which we fit ARMA(2,1) when the true model is AR(1).

  • When doing simulation studies, it is helpful to use multicore computing, which most of us have on our machines nowadays.

  • A basic approach to multicore statistical computing is to tell R you want it to look for available processors, using the doParallel package.

require(doParallel)
registerDoParallel()
  • Then, we can use foreach to carry out a parallel for loop where jobs are sent to different processors.
J <- 1000
huron_ar1 <- arima(huron_depth,order=c(1,0,0))
params <- coef(huron_ar1)
ar <- params[grep("^ar",names(params))]
intercept <- params["intercept"]
sigma <- sqrt(huron_ar1$sigma2)
t1 <- system.time(
  huron_sim <- foreach(j=1:J) %dopar% {
     Y_j <- arima.sim(list(ar=ar),n=length(huron_depth),sd=sigma)+intercept
     try(coef(arima(Y_j,order=c(2,0,1))))
  }
) 
  • Some of these arima calls did not successfully produce parameter estimates. The try function lets the simulation proceed despite these errors. Let’s see how many of them fail:
sum(sapply(huron_sim, function(x) inherits(x,"try-error"))) 
## [1] 4
  • Now, for the remaining ones, we can look at the resulting estimates of the MA1 component:
ma1 <- unlist(lapply(huron_sim,function(x) if(!inherits(x,"try-error"))x["ma1"] else NULL ))
hist(ma1,breaks=50)  

  • When the true model is AR1 and we fit ARMA(2,1), it seems that we often obtain a model with estimated MA1 coefficient on the boundary of invertibility.

  • It is clear from this that we cannot reject an AR1 hypothesis, even though the Fisher information based analysis appears to give strong evidence that the data should be modeled with a nonzero MA1 coefficient.

  • It may be sensible to avoid fitted models too close to the boundary of invertibility. This is a reason not to blindly accept whatever model AIC might suggest.




5.5.6 Question: what else could we look for to help diagnose, and understand, this kind of model fitting problem?

  • Hint: pay some more attention to the roots of the fitted ARMA(2,1) model.




5.6 Assessing the numerical correctness of evaluation and maximization of the likelihood function


MA0 MA1 MA2 MA3 MA4 MA5
AR0 166.75 46.60 7.28 -14.97 -18.64 -26.09
AR1 -38.00 -37.41 -35.46 -33.82 -34.13 -32.20
AR2 -37.33 -38.43 -36.90 -34.93 -34.35 -33.08
AR3 -35.52 -35.17 -32.71 -31.37 -33.21 -32.98
AR4 -33.94 -34.91 -34.43 -37.48 -31.31 -30.90


5.6.1 Question: How is this table inconsistent with perfect maximization?

  • Here are two hints:

    • Recall that, for nested hypotheses \(H^{\langle 0\rangle}\subset H^{\langle 1\rangle}\), the likelihood maximized over \(H^{\langle 1\rangle}\) cannot be less than the likelihood maximized over \(H^{\langle 0\rangle}\).

    • Recall also the definition of AIC,

      AIC = -2\(\times\) maximized log likelihood \(+\) 2\(\times\) number of parameters