Licensed under the Creative Commons attribution-noncommercial license, http://creativecommons.org/licenses/by-nc/3.0/. Please share and remix noncommercially, mentioning its origin.



Objectives




9.1 Partially observed Markov process (POMP) models




9.1.1 Discrete time Markov processes

  • A time series model \(X_{0:N}\) is a Markov process model if the conditional densities satisfy the Markov property that, for all \(n\in 1:N\),


[MP1] \({\quad\quad}f_{X_n|X_{1:n-1}}(x_n{{\, | \,}}x_{1:n-1}) = f_{X_n|X_{n-1}}(x_n{{\, | \,}}x_{n-1})\).


  • We suppose that \(X_n\) is the value of the random process at time \(t_n\), for \(n\in 0:N\), so the discrete-time process corresponds to a collection of time points in continuous time.

  • We have initialized the Markov process model at a time \(t_0\), although we will suppose that data are collected only at times \(t_{1:N}\).

    • The initialization model could be deterministic (a fixed value) or a random variable.

    • Formally, a fixed initial value is a special case of a discrete distribution having a point mass with probability one at the fixed value. Therefore, fixed initial values are covered in our framework since we use probability density functions to describe both discrete and continuous probability distributions.

    • Mathematically, a probability mass function (for discrete distributions) is a probability density with respect to a counting measure. We don’t have to get sidetracked on to that topic, but it is worth noting that there is a proper mathematical justification for treating a probability mass function as a type of probability density function.

    • It is not important whether we adopt the convention that the Markov process model is initialized at time \(t_1\) or at some previous time \(t_0\). Here, we follow the choice to use \(t_0\).

  • The probability density function \(f_{X_n|X_{n-1}}(x_n{{\, | \,}}x_{n-1})\) is called the one-step transition density of the Markov process.

  • In words, the Markov property says that the next step taken by a Markov process follows the one-step transition density based on the current state, whatever the previous history of the process.

  • For a POMP model, the full joint distribution of the latent process is entirely specified by the one-step transition densities, as we will show below. Therefore, we also call \(f_{X_n|X_{n-1}}(x_n{{\, | \,}}x_{n-1})\) the process model.




9.1.2 Question: Write the joint distribution in terms of the one-step transition densities

  • Use [MP1] to derive an expression for the joint distribution of a Markov process as a product of the one-step transition densities,


[MP2] \({\quad\quad}f_{X_{0:N}}(x_{0:N}) = f_{X_0}(x_0)\prod_{n=1}^N f_{X_n|X_{n-1}}(x_n{{\, | \,}}x_{n-1})\).




9.1.3 Question: Show that a causal Gaussian AR(1) process is a Markov process.




9.1.4 Time homogeneous transitions and stationarity

  • In general, the one-step transition probability density in a POMP model can depend on \(n\).

  • A latent process model \(X_{0:N}\) is time-homogeneous if the one-step transition probability density does not depend on \(n\), i.e., if there is a conditional density \(f(y{{\, | \,}}x)\) such that, for all \(n\in 1:N\), \[ f_{X_n|X_{n-1}}(x_n{{\, | \,}}x_{n-1})= f(x_n{{\, | \,}}x_{n-1}).\]

  • If \(X_{0:N}\) is stationary then it is time homogeneous.

    • Why? Go back to the definitions and show this from first principles.
  • Time homogeneity does not necessarily imply stationarity.

    • Find a counter-example.

    • What has to be added to time homogeneity to get stationarity?




9.1.5 The measurement model

  • We model the observation process random variables \(Y_{1:N}\).

  • For state space models, we will generally write the data as \({y_{1:N}^*}\).

  • We model the measurement at time \(t_n\) as depending only on the value of the latent process at time \(t_n\): given \(X_n\), the measurement \(Y_n\) is conditionally independent of all other latent process and observation process variables. Formally, this assumption is,


[MP3] \({\quad\quad}f_{Y_n|X_{0:N},Y_{1:n-1},Y_{n+1:N}}(y_n{{\, | \,}}x_{0:N},y_{1:n-1},y_{n+1:N}) = f_{Y_n|X_n}(y_n{{\, | \,}}x_{n})\).


  • We call \(f_{Y_n|X_n}(y_n{{\, | \,}}x_{n})\) the measurement model.

  • In general, the measurement model can depend on \(n\).

  • The measurement model is time-homogeneous if there is a conditional probability density function \(g(y{{\, | \,}}x)\) such that, for all \(n\in 1:N\), \[ f_{Y_n|X_n}(y_n{{\, | \,}}x_{n})= g(y_n{{\, | \,}}x_n).\]
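
  • As a concrete illustration, the following is a minimal simulation sketch of a time-homogeneous POMP, assuming a Gaussian random-walk process model \(f(x_n{{\, | \,}}x_{n-1})=N[x_{n-1},\sigma^2]\) and a Gaussian measurement model \(g(y_n{{\, | \,}}x_n)=N[x_n,\tau^2]\). The model, the parameter values and the function name are illustrative choices, not part of the notes.

```python
import numpy as np

def simulate_pomp(N, sigma=1.0, tau=0.5, x0=0.0, seed=1):
    """Simulate a toy time-homogeneous POMP.

    Process model:     X_n | X_{n-1} = x  ~  N[x, sigma^2]   (one-step transition density)
    Measurement model: Y_n | X_n = x      ~  N[x, tau^2]
    """
    rng = np.random.default_rng(seed)
    x = np.empty(N + 1)
    y = np.empty(N + 1)
    x[0] = x0                               # initialization at time t_0 (a fixed value here)
    for n in range(1, N + 1):
        x[n] = rng.normal(x[n - 1], sigma)  # draw from f(x_n | x_{n-1})
        y[n] = rng.normal(x[n], tau)        # draw from g(y_n | x_n)
    return x, y[1:]                         # data are collected only at times t_{1:N}

x, ystar = simulate_pomp(100)
```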




9.2 Four basic calculations for working with POMP models




9.2.1 Prediction

  • One-step prediction of the latent process at time \(t_{n+1}\) given data up to time \(t_n\) involves finding \[ f_{X_{n+1}|Y_{1:n}}(x_{n+1}{{\, | \,}}{y_{1:n}^*}).\]

  • We may want to carry out prediction (also called forecasting) more than one time step ahead. However, unless specified otherwise, the prediction calculation will be one-step prediction.

  • One-step prediction turns out to be closely related to computing the likelihood function, and therefore central to statistical inference.

  • We have required our prediction to be a conditional probability density, not a point estimate. In the context of forecasting, this is called a probabilistic forecast, and has advantages over a point estimate forecast. What are they? Are there any disadvantages to probabilistic forecasting?




9.2.2 Filtering

  • The filtering calculation at time \(t_n\) is to find the conditional distribution of the latent process \(X_n\) given currently available data, \({y_{1:n}^*}\).

  • Filtering therefore involves calculating \[f_{X_{n}|Y_{1:n}}(x_n{{\, | \,}}{y_{1:n}^*}).\]




9.2.3 Smoothing

  • In the context of a POMP model, smoothing involves finding the conditional distribution of \(X_n\) given all the data, \({y_{1:N}^*}\).

  • So, the smoothing calculation is \[f_{X_{n}|Y_{1:N}}(x_n{{\, | \,}}{y_{1:N}^*}).\]




9.2.4 The likelihood

  • The model may depend on a parameter vector \(\theta\).

  • With the dependence on \(\theta\) suppressed in our notation, the likelihood calculation is to evaluate the joint density of \(Y_{1:N}\) at the data, \[f_{Y_{1:N}}({y_{1:N}^*}).\]

  • If we can compute this for any value of \(\theta\), we can perform numerical optimization to get a maximum likelihood estimate, compute profile likelihood confidence intervals, carry out likelihood ratio tests, and make AIC comparisons.




9.2.5 The prediction and filtering formulas

  • One-step prediction of the latent process at time \(t_{n}\) given data up to time \(t_{n-1}\) can be computed in terms of the filtering problem at time \(t_{n-1}\), via the prediction formula for \(n\in 1:N\),


[MP4] \({\quad\quad}\displaystyle f_{X_{n}|Y_{1:n-1}}(x_{n}{{\, | \,}}{y_{1:n-1}^*})\)


\({\quad\quad}{\quad\quad}\displaystyle = \int f_{X_{n-1}|Y_{1:n-1}}(x_{n-1}{{\, | \,}}{y_{1:n-1}^*}) f_{X_{n}|X_{n-1}}(x_{n}{{\, | \,}}x_{n-1})\, dx_{n-1}\).


  • To make this formula work for \(n=1\), we need the convention that \(1:k\) is the empty set when \(k=0\). Conditioning on an empty collection of random variables is the same as not conditioning at all! In this case, we have by definition that \[ f_{X_{0}|Y_{1:0}}(x_{0}{{\, | \,}}{y_{1:0}^*}) = f_{X_0}(x_0).\] In other words, the filtering calculation at time \(t_0\) is the initial density for the latent process. This makes sense, since at time \(t_0\) we have no data to condition on.

  • To see why the prediction formula is true, we can view it as an application of a general identity for joint continuous random variables \(X\), \(Y\), \(Z\), \[ f_{X|Y}(x{{\, | \,}}y)= \int f_{XZ|Y}(x,z{{\, | \,}}y)\, dz,\] which is a conditional form of the basic identity that integrating out a joint distribution gives a marginal distribution.



  • Filtering at time \(t_n\) can be computed by combining the new information in the datapoint \({y_{n}^*}\) with the calculation of the one-step prediction of the latent process at time \(t_{n}\) given data up to time \(t_{n-1}\). This is carried out via the filtering formula for \(n\in 1:N\),


[MP5] \({\quad\quad}\displaystyle f_{X_{n}|Y_{1:n}}(x_{n}{{\, | \,}}{y_{1:n}^*}) = \frac{ f_{X_{n}|Y_{1:n-1}}(x_{n}{{\, | \,}}{y_{1:n-1}^*})\, f_{Y_n|X_n}({y_n^*}{{\, | \,}}x_n) }{ f_{Y_{n}|Y_{1:n-1}}({y_n^*}{{\, | \,}}{y_{1:n-1}^*}) }\).


  • The denominator in the filtering formula [MP5] is the conditional likelihood of \({y_{n}^*}\) given \({y_{1:n-1}^*}\). It can be computed in terms of the one-step prediction density, via the conditional likelihood formula,


[MP6] \({\quad\quad}\displaystyle f_{Y_{n}|Y_{1:n-1}}({y_n^*}{{\, | \,}}{y_{1:n-1}^*}) = \int f_{X_{n}|Y_{1:n-1}}(x_{n}{{\, | \,}}{y_{1:n-1}^*})\, f_{Y_n|X_n}({y_n^*}{{\, | \,}}x_n)\, dx_n\).


  • To make this formula work for \(n=1\), we again take advantage of the convention that \(1:k\) is the empty set when \(k=0\).

  • To see why the filtering formula is true, we can apply the identity for joint continuous random variables \(X\), \(Y\), \(Z\), \[f_{X|YZ}(x{{\, | \,}}y,z)= \frac{ f_{Y|XZ}(y{{\, | \,}}x,z)\, f_{X|Z}(x{{\, | \,}}z)}{f_{Y|Z}(y{{\, | \,}}z)}.\] This is a conditional form of the Bayes formula, which in turn is closely related to a conditional form of the basic definition of the conditional probability density function, \[f_{X|YZ}(x{{\, | \,}}y,z)= \frac{f_{XY|Z}(x,y{{\, | \,}}z)}{f_{Y|Z}(y{{\, | \,}}z)}.\]




  • The prediction and filtering formulas are recursive. If they can be computed for time \(t_n\) then they provide the foundation for the following computation at time \(t_{n+1}\).
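
  • For a scalar latent state, the recursions [MP4]–[MP6] can be evaluated numerically by replacing the integrals with sums over a grid. The sketch below does this for a toy Gaussian random-walk POMP with Gaussian measurements; the model, the grid and the parameter values are illustrative assumptions only.

```python
import numpy as np
from scipy.stats import norm

def grid_filter(ystar, x_grid, f_init, f_trans, g_meas):
    """Grid approximation of the prediction [MP4], filtering [MP5] and
    conditional likelihood [MP6] recursions for a scalar latent state.

    f_init(x):             initial density f_{X_0}(x)
    f_trans(x_new, x_old): one-step transition density f_{X_n|X_{n-1}}
    g_meas(y, x):          measurement density f_{Y_n|X_n}
    """
    dx = x_grid[1] - x_grid[0]
    filt = f_init(x_grid)                  # filtering density at time t_0
    cond_lik = np.empty(len(ystar))
    for n, y in enumerate(ystar):
        # [MP4]: integrate the filter density against the one-step transition density
        pred = (f_trans(x_grid[:, None], x_grid[None, :]) * filt).sum(axis=1) * dx
        # [MP6]: conditional likelihood of y*_n given y*_{1:n-1}
        cond_lik[n] = np.sum(pred * g_meas(y, x_grid)) * dx
        # [MP5]: filtering density at time t_n
        filt = pred * g_meas(y, x_grid) / cond_lik[n]
    return filt, cond_lik

# Toy model: Gaussian random walk observed with Gaussian noise.
sigma, tau = 1.0, 0.5
x_grid = np.linspace(-20, 20, 801)
filt, cond_lik = grid_filter(
    ystar=np.array([0.3, 1.1, 0.8]),                  # stand-in data
    x_grid=x_grid,
    f_init=lambda x: norm.pdf(x, 0.0, 1.0),           # X_0 ~ N[0, 1]
    f_trans=lambda xn, xo: norm.pdf(xn, xo, sigma),
    g_meas=lambda y, x: norm.pdf(y, x, tau),
)
```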




9.2.6 Question: Give a detailed derivation of [MP4], [MP5] and [MP6], being careful to note when you use the Markov property [MP1].



9.2.7 Computation of the likelihood

  • The likelihood of the entire dataset, \({y_{1:N}^*}\), can be found from [MP6], using the identity


[MP7] \({\quad\quad}\displaystyle f_{Y_{1:N}}({y_{1:N}^*}) = \prod_{n=1}^N f_{Y_{n}|Y_{1:n-1}}({y_n^*}{{\, | \,}}{y_{1:n-1}^*})\).


  • Yet again, this formula [MP7] requires the convention that \(1:k\) is the empty set when \(k=0\), so the first term in the product is \[f_{Y_{1}|Y_{1:0}}({y_1^*}{{\, | \,}}{y_{1:0}^*}) = f_{Y_{1}}({y_1^*}).\]

  • If our model has an unknown parameter \(\theta\), the likelihood identity [MP7] lets us evaluate the log likelihood function, \[{\ell}(\theta)=\log f_{Y_{1:N}}({y_{1:N}^*}{\, ; \,}\theta).\]
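
  • For example, in the grid-filter sketch at the end of 9.2.5, `np.sum(np.log(cond_lik))` evaluates this log likelihood numerically for the toy model there, a direct implementation of [MP7]; repeating the calculation at different parameter values traces out \({\ell}(\theta)\).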




9.2.8 The smoothing formulas

  • Smoothing is less fundamental for likelihood-based inference than filtering and one-step prediction.

  • Nevertheless, sometimes we want to compute the smoothing density, so let’s obtain some necessary formulas.

  • The filtering and prediction formulas are recursions forwards in time (we use the solution at time \(t_{n-1}\) to carry out the computation at time \(t_{n}\)).

  • There are similar backwards recursion formulas,


[MP8] \({\quad\quad}f_{Y_{n:N}|X_{n}}({y_{n:N}^*}{{\, | \,}}x_n)= f_{Y_n|X_n}({y_n^*}{{\, | \,}}x_n) f_{Y_{n+1:N}|X_{n}}({y_{n+1:N}^*}{{\, | \,}}x_n)\).


[MP9] \({\quad\quad}\displaystyle f_{Y_{n+1:N}|X_{n}}({y_{n+1:N}^*}{{\, | \,}}x_n)\)
\({\quad\quad}{\quad\quad}\displaystyle = \int f_{Y_{n+1:N}|X_{n+1}}({y_{n+1:N}^*}{{\, | \,}}x_{n+1}) \, f_{X_{n+1}|X_n}(x_{n+1}{{\, | \,}}x_n)\, dx_{n+1}\).


  • The forwards and backwards recursion formulas together allow us to compute the smoothing formula,


[MP10] \({{\quad\quad}\displaystyle f_{X_{n}|Y_{1:N}}(x_n{{\, | \,}}{y_{1:N}^*}) = \frac{ f_{X_{n}|Y_{1:n-1}}(x_{n}{{\, | \,}}{y_{1:n-1}^*}) \, f_{Y_{n:N}|X_{n}}({y_{n:N}^*}{{\, | \,}}x_n) }{ f_{Y_{n:N}|Y_{1:n-1}}({y_{n:N}^*}{{\, | \,}}{y_{1:n-1}^*}) } }\).


9.2.9 Question: Show how [MP8], [MP9] and [MP10] follow from the basic properties of conditional densities combined with the Markov property.




9.2.10 Un-normalized filtering and smoothing

  • Some common Monte Carlo algorithms (Markov chain Monte Carlo and self-normalized importance sampling) need probability density functions only up to an unknown constant factor. These algorithms depend on ratios of densities, for which a constant factor cancels out and so does not have to be computed.

  • In some analytic and numeric computations, it is helpful to avoid calculating a normalizing constant for a density, since it can be worked out later using the property that the probability density function must integrate to 1.

  • The denominators \(f_{Y_{n}|Y_{1:n-1}}({y_n^*}{{\, | \,}}{y_{1:n-1}^*})\) and \(f_{Y_{n:N}|Y_{1:n-1}}({y_{n:N}^*}{{\, | \,}}{y_{1:n-1}^*})\), in equations [MP5] and [MP10] respectively, may sometimes be hard to compute.

  • When we are only interested in computing the filtering and smoothing densities up to an unknown constant, we can simplify [MP5] and [MP10] using the proportionality relationship \(\propto\). This gives,


[MP5\(^\prime\)] \({{\quad\quad}\displaystyle f_{X_{n}|Y_{1:n}}(x_{n}{{\, | \,}}{y_{1:n}^*}) \propto f_{X_{n}|Y_{1:n-1}}(x_{n}{{\, | \,}}{y_{1:n-1}^*})\, f_{Y_n|X_n}({y_n^*}{{\, | \,}}x_n) }\),


[MP10\(^\prime\)] \({{\quad\quad}\displaystyle f_{X_{n}|Y_{1:N}}(x_n{{\, | \,}}{y_{1:N}^*}) \propto f_{X_{n}|Y_{1:n-1}}(x_{n}{{\, | \,}}{y_{1:n-1}^*}) \, f_{Y_{n:N}|X_{n}}({y_{n:N}^*}{{\, | \,}}x_n) }\).


  • Note that the normalizing “constant” in equations [MP5] and [MP10] does depend on \({y_{1:N}^*}\). However, the data are fixed constant values. The variable in both these equations is \(x_n\).
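
  • As a worked illustration, the smoothing recursions can also be evaluated on a grid for a scalar latent state, and the proportionality forms [MP5\(^\prime\)] and [MP10\(^\prime\)] are convenient because the normalizing constants can be recovered numerically at the end. The sketch below assumes the same toy Gaussian random-walk model as the grid-filter sketch at the end of 9.2.5; all names and parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

def grid_smoother(ystar, x_grid, f_init, f_trans, g_meas):
    """Grid evaluation of the smoothing densities f_{X_n|Y_{1:N}} using the
    forward recursion [MP4]-[MP5'] and the backward recursions [MP8]-[MP9],
    combined via [MP10'] and normalized numerically."""
    dx = x_grid[1] - x_grid[0]
    N = len(ystar)
    trans = f_trans(x_grid[:, None], x_grid[None, :])   # trans[i, j] = f(x_i | x_j)
    # Forward pass: store the one-step prediction densities f_{X_n|Y_{1:n-1}}.
    pred = np.empty((N, len(x_grid)))
    filt = f_init(x_grid)
    for n in range(N):
        pred[n] = trans @ filt * dx                      # [MP4]
        filt = pred[n] * g_meas(ystar[n], x_grid)        # [MP5'], then normalize on the grid
        filt /= filt.sum() * dx
    # Backward pass: beta[n] approximates f_{Y_{n:N}|X_n}(y*_{n:N} | x) on the grid.
    beta = np.empty((N, len(x_grid)))
    beta[N - 1] = g_meas(ystar[N - 1], x_grid)           # [MP8] with an empty future
    for n in range(N - 2, -1, -1):
        future = trans.T @ beta[n + 1] * dx              # [MP9]
        beta[n] = g_meas(ystar[n], x_grid) * future      # [MP8]
    # [MP10']: smoothing density proportional to prediction density times beta.
    smooth = pred * beta
    smooth /= smooth.sum(axis=1, keepdims=True) * dx
    return smooth

# Toy Gaussian random-walk model, as in the grid-filter sketch.
sigma, tau = 1.0, 0.5
x_grid = np.linspace(-20, 20, 801)
smooth = grid_smoother(
    ystar=np.array([0.3, 1.1, 0.8]),
    x_grid=x_grid,
    f_init=lambda x: norm.pdf(x, 0.0, 1.0),
    f_trans=lambda xn, xo: norm.pdf(xn, xo, sigma),
    g_meas=lambda y, x: norm.pdf(y, x, tau),
)
```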


9.3 Linear Gaussian POMP (LG-POMP) models




9.3.1 The general LG-POMP model

  • Suppose the latent process, \(X_{0:N}\), takes vector values with dimension \(d_X\).

  • Suppose the observation process, \(\{Y_n\}\), takes vector values with dimension \(d_Y\).

  • A general mean zero LG-POMP model is specified by

    • A sequence of \(d_X\times d_X\) matrices, \({\mathbb{A}}_{1:N}\),

    • A sequence of \(d_X\times d_X\) covariance matrices, \({\mathbb{U}}_{0:N}\),

    • A sequence of \(d_Y\times d_X\) matrices, \({\mathbb{B}}_{1:N}\),

    • A sequence of \(d_Y\times d_Y\) covariance matrices, \({\mathbb{V}}_{1:N}\).

  • We initialize with \(X_0\sim N[0,{\mathbb{U}}_0]\) and then define the entire LG-POMP model by a recursion for \(n\in 1:N\),


[LG1] \({\quad\quad}X_{n} = {\mathbb{A}}_n X_{n-1} + \epsilon_n\), \({\quad\quad}\epsilon_n\sim N[0,{\mathbb{U}}_n]\),


[LG2] \({\quad\quad}Y_{n} = {\mathbb{B}}_n X_n + \eta_n\), \({\quad\quad}\eta_n\sim N[0,{\mathbb{V}}_n]\).

  • Often, but not always, we will have a time-homogeneous LG-POMP model, with \({\mathbb{A}}_n={\mathbb{A}}\), \(\;{\mathbb{B}}_n={\mathbb{B}}\), \(\;{\mathbb{U}}_n={\mathbb{U}}\) and \({\mathbb{V}}_n={\mathbb{V}}\) for \(n\in 1:N\).
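
  • The following is a minimal simulation sketch for a time-homogeneous LG-POMP, drawn directly from [LG1]–[LG2]; the specific matrices at the end are placeholders chosen only to make the example runnable.

```python
import numpy as np

def simulate_lgpomp(A, B, U, V, U0, N, seed=1):
    """Simulate X_{0:N} and Y_{1:N} from the LG-POMP defined by [LG1]-[LG2],
    with X_0 ~ N[0, U0] and time-homogeneous matrices A, B, U, V."""
    rng = np.random.default_rng(seed)
    d_X, d_Y = A.shape[0], B.shape[0]
    x = rng.multivariate_normal(np.zeros(d_X), U0)              # X_0 ~ N[0, U_0]
    xs, ys = [x], []
    for n in range(1, N + 1):
        x = A @ x + rng.multivariate_normal(np.zeros(d_X), U)   # [LG1]
        y = B @ x + rng.multivariate_normal(np.zeros(d_Y), V)   # [LG2]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)

# Example: a 2-dimensional latent process observed in 1 dimension.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[1.0, 0.0]])
U = 0.5 * np.eye(2); V = np.array([[0.25]]); U0 = np.eye(2)
x, y = simulate_lgpomp(A, B, U, V, U0, N=100)
```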




9.4 The LG-POMP representation of a Gaussian ARMA


[LG3] \({\quad\quad}\displaystyle Y_n = \sum_{j=1}^p {\phi}_j Y_{n-j} + \omega_n + \sum_{k=1}^q {\psi}_k \omega_{n-k}.\)
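
Here, \(\omega_n\) is Gaussian white noise; write \(\sigma^2\) for its variance. One standard way to put [LG3] into the LG-POMP form [LG1]–[LG2] (several constructions exist; this sketch follows the widely used companion-matrix representation, and is not necessarily the construction intended in the original notes) is to set \(r=\max(p,q+1)\), define \(\phi_j=0\) for \(j>p\) and \(\psi_k=0\) for \(k>q\), and take

\[ X_n = \left(\begin{array}{ccccc} \phi_1 & \phi_2 & \cdots & \phi_{r-1} & \phi_r \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{array}\right) X_{n-1} + \left(\begin{array}{c} \omega_n \\ 0 \\ \vdots \\ 0 \end{array}\right), {\quad\quad} Y_n = (1, \psi_1, \dots, \psi_{r-1})\, X_n, \]

so that \({\mathbb{A}}\) is the \(r\times r\) companion matrix of the AR coefficients, \({\mathbb{U}}\) has \(\sigma^2\) in its top-left entry and zeros elsewhere, \({\mathbb{B}}=(1,\psi_1,\dots,\psi_{r-1})\), and \({\mathbb{V}}=0\), i.e., the ARMA observation is a noise-free linear function of the latent state.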





9.5 The basic structural model and its LG-POMP representation


\(\begin{array}{lrcl} \mbox{[BSM1]} {\quad\quad}& Y_n &=& L_n + S_n + \epsilon_n \\ \mbox{[BSM2]} {\quad\quad}& L_{n} &=& L_{n-1} + T_{n-1} + \xi_n \\ \mbox{[BSM3]} {\quad\quad}& T_{n} &=& T_{n-1} + \zeta_n \\ \mbox{[BSM4]} {\quad\quad}& S_{n} &=& -\sum_{k=1}^{11} S_{n-k} + \eta_n \end{array}\)



[BSM5] \({\quad\quad}X_n = (L_n,T_n,S_n, S_{n-1}, S_{n-2}, \dots,S_{n-10})^{\scriptsize{T}}\).



[BSM6] \({\quad\quad}Y_n = (1,0,1,0,0,\dots, 0) X_n + \epsilon_n\),


\(\displaystyle \left(\begin{array}{l} L_{n} \\ T_{n} \\ S_{n} \\ S_{n-1}\\ S_{n-2} \\ \vdots \\ S_{n-10} \end{array}\right) = \left(\begin{array}{ccccccc} 1 & 1 & 0 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & 0 & 0 & \ldots & 0 \\ 0 & 0 & -1 & -1 & -1 & \ldots & -1\\ 0 & 0 & 1 & 0 & 0 & \ldots & 0 \\ 0 & 0 & 0 & 1 & 0 & \ldots & 0 \\ \vdots & & &\ddots & \ddots & \ddots & \vdots \\ 0 & 0 & 0 &\ldots & 0 & 1 & 0 \end{array}\right) \left(\begin{array}{l} L_{n-1} \\ T_{n-1} \\ S_{n-1} \\ S_{n-2}\\ S_{n-3} \\ \vdots \\ S_{n-11} \end{array}\right) + \left(\begin{array}{l} \xi_n \\ \zeta_n \\ \eta_n \\ 0 \\ 0 \\ \vdots \\ 0 \end{array}\right).\)
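
To make the construction concrete, the following is a minimal numpy sketch assembling the LG-POMP matrices for this basic structural model with period-12 seasonality, following [BSM5], [BSM6] and the transition matrix displayed above. The variance parameter names (`sig_xi`, `sig_zeta`, `sig_eta`, `sig_eps`) are our own illustrative labels.

```python
import numpy as np

def bsm_matrices(sig_xi=1.0, sig_zeta=0.1, sig_eta=0.5, sig_eps=1.0):
    """LG-POMP matrices (A, B, U, V) for the basic structural model [BSM1-BSM6].
    State: X_n = (L_n, T_n, S_n, S_{n-1}, ..., S_{n-10}), so d_X = 13."""
    d_X = 13
    A = np.zeros((d_X, d_X))
    A[0, 0] = A[0, 1] = 1.0    # L_n = L_{n-1} + T_{n-1} + xi_n                [BSM2]
    A[1, 1] = 1.0              # T_n = T_{n-1} + zeta_n                        [BSM3]
    A[2, 2:] = -1.0            # S_n = -(S_{n-1} + ... + S_{n-11}) + eta_n     [BSM4]
    for i in range(3, d_X):
        A[i, i - 1] = 1.0      # shift the stored seasonal terms down by one lag
    U = np.zeros((d_X, d_X))   # only L, T and S receive process noise
    U[0, 0], U[1, 1], U[2, 2] = sig_xi**2, sig_zeta**2, sig_eta**2
    B = np.zeros((1, d_X))
    B[0, 0] = B[0, 2] = 1.0    # Y_n = L_n + S_n + epsilon_n            [BSM1], [BSM6]
    V = np.array([[sig_eps**2]])
    return A, B, U, V
```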





9.6 Spline smoothing and its LG-POMP representation


[SS1] \({{\quad\quad}\displaystyle \mathrm{PSS}(x_{1:N}{\, ; \,}\lambda) = \sum_{n=1}^N({y_n^*}-x_n)^2 + \lambda\sum_{n=3}^N(\Delta^2 x_n)^2 }\).



[SS2] \({\quad\quad}\begin{array}{rclcl} X_n &=& 2X_{n-1}-X_{n-2} + \epsilon_n, & & \epsilon_n\sim \mathrm{iid}\; N[0,\sigma^2/\lambda]\\ Y_n &=& X_n + \eta_n & & \eta_n\sim \mathrm{iid}\; N[0,\sigma^2]. \end{array}\)





9.6.1 Question: Constructing a linear Gaussian POMP (LG-POMP) model from [SS2].

  • Note that \(\{X_n,Y_n\}\) defined in [SS2] is not quite an LG-POMP model. However, we can use \(\{X_n\}\) and \(\{Y_n\}\) to build an LG-POMP model. How?




  • The joint density of \(X_{1:N}\) and \(Y_{1:N}\) in [SS2] can be written as \[ f_{X_{1:N}Y_{1:N}}(x_{1:N},y_{1:N})= f_{X_{1:N}}(x_{1:N}) \, f_{Y_{1:N}|X_{1:N}}(y_{1:N}{{\, | \,}}x_{1:N}).\]

  • Taking logs, \[ \log f_{X_{1:N}Y_{1:N}}(x_{1:N},y_{1:N})= \log f_{X_{1:N}}(x_{1:N}) + \log f_{Y_{1:N}|X_{1:N}}(y_{1:N}{{\, | \,}}x_{1:N}).\]

  • Suppose that initial conditions are irrelevant (they could be either unknown parameters or an improper Gaussian distribution with infinite variance). Then, noting that \(\{\Delta^2 X_{n}, n\in 3:N\}\) and \(\{Y_n-X_n, n\in 1:N\}\) are collections of independent Normal random variables with mean zero and variances \(\sigma^2/\lambda\) and \(\sigma^2\) respectively, we have


[SS3] \({{\quad\quad}\displaystyle \log f_{X_{1:N}Y_{1:N}}(x_{1:N},y_{1:N} {\, ; \,}\sigma,\lambda) = \frac{-1}{2\sigma^2} \sum_{n=1}^N(y_n-x_n)^2 +\frac{-\lambda}{2\sigma^2} \sum_{n=3}^N(\Delta^2 x_n)^2 + C }\).


  • In [SS3], \(C\) is a constant that depends on \(\sigma\) and \(\lambda\) but not on \(x_{1:N}\) or \(y_{1:N}\).

  • Comparing [SS3] with [SS1], we see that maximizing the density \(f_{X_{1:N}Y_{1:N}}(x_{1:N},{y_{1:N}^*} {\, ; \,}\sigma,\lambda)\) as a function of \(x_{1:N}\) is the same problem as finding the smoothing spline by minimizing the penalized sum of squares in [SS1].

  • For a Gaussian density, the mode (i.e., the maximum of the density) is equal to the expected value. Therefore, we have

\[\begin{eqnarray} \arg\min_{x_{1:N}} \mathrm{PSS}(x_{1:N}{\, ; \,}\lambda) &=& \arg\max_{x_{1:N}} f_{X_{1:N}Y_{1:N}}(x_{1:N},{y_{1:N}^*}{\, ; \,}\sigma,\lambda), \\ &=& \arg\max_{x_{1:N}} \frac{ f_{X_{1:N}Y_{1:N}}(x_{1:N},{y_{1:N}^*} {\, ; \,}\sigma,\lambda) }{ f_{Y_{1:N}}({y_{1:N}^*} {\, ; \,}\sigma,\lambda) }, \\ &=& \arg\max_{x_{1:N}} f_{X_{1:N}|Y_{1:N}}(x_{1:N}{{\, | \,}}{y_{1:N}^*}{\, ; \,}\sigma,\lambda), \\ &=& {\mathbb{E}}\big[X_{1:N}{{\, | \,}}Y_{1:N}={y_{1:N}^*} {\, ; \,}\sigma,\lambda\big]. \end{eqnarray}\]

  • The smoothing calculation for an LG-POMP model involves finding the mean and variance of \(X_{n}\) given \(Y_{1:N}={y_{1:N}^*}\).

  • We conclude that the smoothing problem for this LG-POMP model is the same as the spline smoothing problem defined by [SS1].

  • If you have experience using smoothing splines, this connection may help you transfer that experience to POMP models.

  • Once you have experience with POMP models, this connection helps you understand smoothers that are commonly used in many applications.

  • For example, we might propose that the smoothing parameter \(\lambda\) could be selected by maximum likelihood for the POMP model.
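
  • As a quick numerical check of this correspondence, the minimizer of [SS1] can be computed directly by linear algebra: writing the penalty as \(\lambda\) times the squared norm of the second differences of \(x_{1:N}\) leads to a linear system. The sketch below is illustrative; the data and the value of \(\lambda\) are arbitrary, and initial conditions are treated as irrelevant, matching the argument above.

```python
import numpy as np

def spline_smooth(ystar, lam):
    """Minimize the penalized sum of squares [SS1],
        PSS(x; lambda) = sum_n (y*_n - x_n)^2 + lambda * sum_n (Delta^2 x_n)^2,
    by solving (I + lambda * D'D) x = y*, where D is the (N-2) x N
    second-difference matrix."""
    N = len(ystar)
    D = np.zeros((N - 2, N))
    for i in range(N - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]   # row i gives Delta^2 x_{i+3} (1-based indexing)
    return np.linalg.solve(np.eye(N) + lam * D.T @ D, ystar)

# Larger lambda gives a smoother fit; lambda close to zero reproduces the data.
rng = np.random.default_rng(1)
ystar = np.sin(np.linspace(0, 6, 50)) + 0.3 * rng.normal(size=50)
xhat = spline_smooth(ystar, lam=100.0)
```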




9.6.2 Question: Why do we use \(\Delta^2 X_n=\epsilon_n\) for our smoothing model?

  • Seeing that the smoothing spline arises from the particular choice of LG-POMP model in equation [SS2] could make you wonder why we choose that model.

  • Any ideas?

  • Even if this LG-POMP model is sometimes reasonable, presumably there are other occasions when a different LG-POMP model would be a superior choice for smoothing.




9.7 The Kalman filter




9.7.1 Review of the multivariate normal distribution

  • A random variable \(X\) taking values in \({\mathbb{R}}^{d_X}\) is multivariate normal with mean \(\mu_X\) and variance \(\Sigma_X\) if we can write \[X = {\mathbb{H}}Z + \mu_X,\] where \(Z\) is a vector of \(d_X\) independent identically distributed \(N[0,1]\) random variables and \({\mathbb{H}}\) is a \(d_X\times d_X\) matrix square root of \(\Sigma_X\), i.e., \[{\mathbb{H}}{\mathbb{H}}^{\scriptsize{T}}= \Sigma_X.\]

  • The choice of \({\mathbb{H}}\) is not unique, and a matrix square root of this type exists for any covariance matrix. Mathematically, this is true because covariance matrices are positive semi-definite.

  • We write \(X\sim N\big[\mu_X,\Sigma_X\big]\).

  • \(X\sim N\big[\mu_X,\Sigma_X\big]\) has a probability density function if and only if \(\Sigma_X\) is invertible. This density is given by \[f_X(x) = \frac{1}{(2\pi)^{d_X/2}|\Sigma_X|^{1/2}} \exp \left\{- \frac{ (x - \mu_X)^{\scriptsize{T}}\, \big[\Sigma_X\big]^{-1} \, (x - \mu_X)}{2} \right\}. \]

  • \(X\) and \(Y\) are jointly multivariate normal if the combined vector \[W=\left(\begin{array}{l} X \\ Y \end{array}\right)\] is multivariate normal. In this case, we write \[ \mu_W = \left(\begin{array}{l} \mu_X \\ \mu_Y \end{array}\right), {\quad\quad}\Sigma_W = \left(\begin{array}{cc} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{array}\right),\] where \[\Sigma_{XY}= {\mathrm{Cov}}(X,Y) = {\mathbb{E}}\big[(X-\mu_X)(Y-\mu_Y)^{\scriptsize{T}}\big].\]

  • For jointly multivariate normal random variables \(X\) and \(Y\), we have the useful property that the conditional distribution of \(X\) given \(Y=y\) is multivariate normal, with conditional mean and variance


[KF1] \({\quad\quad}\begin{eqnarray} \mu_{X|Y}(y) &=& \mu_X + \Sigma_{XY}\Sigma_Y^{-1}\big(y-\mu_Y\big), \\ \Sigma_{X|Y} &=& \Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{YX}. \end{eqnarray}\)


  • We write this as \[ X{{\, | \,}}Y=y \sim N\big[\mu_{X|Y}(y),\Sigma_{X|Y}\big]. \]

  • In general, the conditional variance of \(X\) given \(Y=y\) will depend on \(y\) (remind yourself of the definition of conditional variance). In the special case where \(X\) and \(Y\) are jointly multivariate normal, this conditional variance happens not to depend on the value of \(y\).

  • If \(\Sigma_Y\) is not invertible, to make [KF1] work we have to interpret \(\Sigma_Y^{-1}\) as a generalized inverse.
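
  • The conditioning formula [KF1] is easy to check numerically. The following is a minimal sketch for a jointly multivariate normal pair; the particular dimensions and numbers are placeholders.

```python
import numpy as np

def mvn_condition(mu_X, mu_Y, Sigma_X, Sigma_Y, Sigma_XY, y):
    """Conditional mean and variance of X given Y = y for jointly
    multivariate normal (X, Y), as in [KF1]."""
    K = Sigma_XY @ np.linalg.inv(Sigma_Y)   # replace inv with pinv if Sigma_Y is singular
    mu_cond = mu_X + K @ (y - mu_Y)
    Sigma_cond = Sigma_X - K @ Sigma_XY.T   # Sigma_YX = Sigma_XY^T
    return mu_cond, Sigma_cond

mu_cond, Sigma_cond = mvn_condition(
    mu_X=np.zeros(2), mu_Y=np.zeros(1),
    Sigma_X=np.eye(2), Sigma_Y=np.array([[2.0]]),
    Sigma_XY=np.array([[0.5], [0.3]]), y=np.array([1.0]),
)
```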




  • To write the Kalman filter, we define the following notation,


[KF2] \({\quad\quad}\begin{array}{lcl} X_{n}{{\, | \,}}Y_{1:n-1}{{=\,}}y_{1:n-1} &\sim& N\big[ \mu_{n}^P(y_{1:n-1}),\Sigma_n^P\big], \\ X_n{{\, | \,}}Y_{1:n}{{=\,}}y_{1:n} &\sim& N\big[ \mu_n^F(y_{1:n}),\Sigma_n^F\big], \\ X_n{{\, | \,}}Y_{1:N}{{=\,}}y_{1:N} &\sim& N\big[ \mu_n^S(y_{1:N}),\Sigma_n^S\big]. \end{array}\)


  • To relate this notation to the general POMP recursion formulas, given data \({y_{1:N}^*}\), we define the following terminology:


\(\mu_n^P({y_{1:n-1}^*})={\mathbb{E}}\big[X_n{{\, | \,}}Y_{1:n-1}{{=\,}}{y_{1:n-1}^*}\big]\) is the one-step prediction mean for time \(t_n\). It is an arbitrary decision we have made to call this the prediction for time \(t_n\) (the time for which the prediction is being made) rather than for time \(t_{n-1}\) (the time at which the prediction for time \(t_n\) becomes available).


\(\Sigma_n^P({y_{1:n-1}^*})={\mathrm{Var}}\big(X_n{{\, | \,}}Y_{1:n-1}{{=\,}}{y_{1:n-1}^*}\big)\) is the one-step prediction variance for time \(t_n\). To make this terminology work for general POMP models as well as for LG-POMP models, we have included the dependence on \({y_{1:n-1}^*}\).


\(\mu_n^F({y_{1:n}^*})={\mathbb{E}}\big[X_n{{\, | \,}}Y_{1:n}{{=\,}}{y_{1:n}^*}\big]\) is the filter mean for time \(t_n\).


\(\Sigma_n^F({y_{1:n}^*})={\mathrm{Var}}\big(X_n{{\, | \,}}Y_{1:n}{{=\,}}{y_{1:n}^*}\big)\) is the filter variance for time \(t_n\).


\(\mu_n^S({y_{1:N}^*})={\mathbb{E}}\big[X_n{{\, | \,}}Y_{1:N}{{=\,}}{y_{1:N}^*}\big]\) is the smoothing mean for time \(t_n\).


\(\Sigma_n^S({y_{1:N}^*})={\mathrm{Var}}\big(X_n{{\, | \,}}Y_{1:N}{{=\,}}{y_{1:N}^*}\big)\) is the smoothing variance for time \(t_n\).

  • We have defined the above quantities as estimates rather than estimators. For example, we could define the filter mean estimator to be the function which is evaluated at the data to give the filter mean.

  • From the results for linear combinations of Normal random variables, we get the Kalman filter and prediction recursions:


[KF3] \({{\quad\quad}\displaystyle \mu_{n+1}^P(y_{1:n}) = {\mathbb{A}}_{n+1} \mu_{n}^F(y_{1:n}) }\),


[KF4] \({{\quad\quad}\displaystyle \Sigma_{n+1}^P = {\mathbb{A}}_{n+1} \Sigma_{n}^F {\mathbb{A}}_{n+1}^{\scriptsize{T}}+ {\mathbb{U}}_{n+1} }\).


[KF5] \({{\quad\quad}\displaystyle \Sigma_{n}^F = \big([\Sigma_n^P]^{-1} + {\mathbb{B}}_n^{\scriptsize{T}}{\mathbb{V}}_n^{-1}{\mathbb{B}}_n\big)^{-1} }\).


[KF6] \({{\quad\quad}\displaystyle \mu_{n}^F(y_{1:n}) = \mu_{n}^P(y_{1:n-1}) + \Sigma_n^F {\mathbb{B}}^{\scriptsize{T}}_n{\mathbb{V}}_n^{-1}\big\{y_n - {\mathbb{B}}_n\mu_n^P(y_{1:n-1})\big\} }\).


  • These equations are easy to code, and quick to compute unless the dimension of the latent space is very large. In numerical weather forecasting, with careful programming, they are solved with latent variables having dimension \(d_X\approx 10^7\).

  • A similar computation gives backward Kalman recursions. Putting the forward and backward Kalman recursions together, as in [MP10], is called Kalman smoothing.
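
  • To show how directly [KF3]–[KF6] translate into code, here is a minimal sketch for a time-homogeneous LG-POMP. The notation follows the definitions above, but the function itself is only an illustration; in particular, it uses the information form [KF5] exactly as written, which assumes \({\mathbb{V}}\) and \(\Sigma_n^P\) are invertible.

```python
import numpy as np

def kalman_filter(ystar, A, B, U, V, mu0, Sigma0):
    """Kalman prediction and filter recursions [KF3]-[KF6] for the
    time-homogeneous LG-POMP [LG1]-[LG2], initialized with X_0 ~ N[mu0, Sigma0].
    ystar: array of shape (N, d_Y) holding the data y*_{1:N}."""
    muF, SigmaF = mu0, Sigma0                      # filter quantities at time t_0
    out = {"muP": [], "SigmaP": [], "muF": [], "SigmaF": []}
    Vinv = np.linalg.inv(V)
    for y in ystar:
        muP = A @ muF                                                     # [KF3]
        SigmaP = A @ SigmaF @ A.T + U                                     # [KF4]
        SigmaF = np.linalg.inv(np.linalg.inv(SigmaP) + B.T @ Vinv @ B)    # [KF5]
        muF = muP + SigmaF @ B.T @ Vinv @ (y - B @ muP)                   # [KF6]
        for key, value in zip(out, (muP, SigmaP, muF, SigmaF)):
            out[key].append(value)
    return {key: np.array(value) for key, value in out.items()}

# Example: random walk plus noise, a 1-dimensional LG-POMP.
rng = np.random.default_rng(2)
ystar = np.cumsum(rng.normal(size=50))[:, None]
kf = kalman_filter(ystar, A=np.eye(1), B=np.eye(1), U=np.array([[1.0]]),
                   V=np.array([[0.25]]), mu0=np.zeros(1), Sigma0=np.eye(1))
```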



9.7.2 Question: Add details to the derivations of [KF3–6]

  • The prediction recursions [KF3–4] are relatively easy to demonstrate, but it is a good exercise to go through the algebra to your own satisfaction.

  • A useful trick for the algebra is to notice that the conditioning identities [KF1] for joint Gaussian random variables continue to hold if left and right are both conditioned on some additional jointly Gaussian variable, such as \(Y_{1:n-1}\).

  • [KF5–6] can be deduced by completing the square in an expression for the joint density, \(f_{X_nY_n|Y_{1:n-1}}(x_n,y_n{{\, | \,}}y_{1:n-1})\), and noticing that the conditional density of \(X_n\) given \(Y_{1:n}\) is proportional to this joint density, with a normalizing constant that must allow the conditional density to integrate to one.