5. Hidden Markov Models


Authors: Lars Peter Hansen (University of Chicago) and Thomas J. Sargent (NYU)

\(\newcommand{\eqdef}{\stackrel{\text{def}}{=}}\)


Fig. 5.1 Two images hidden behind paintings and discovered via x-rays: Vincent Van Gogh’s earliest self-portrait (left); a portrait, possibly of Joseph Bonaparte, by Francisco Goya (right).

“Truth is like the stars; it does not appear except from behind obscurity of the night.” Khalil Gibran

“If you torture the data long enough, it will confess to anything.” Ronald Coase

5.1. Introduction

Markov models provide a natural framework for studying how to learn about states that are hidden from a statistician or decision maker who observes only noisy signals of some or all states. When states evolve over time, the statistician faces the challenge of learning about a moving target.

This chapter presents Hidden Markov Models (HMMs). We begin with a joint probability distribution consisting of a Markov process and a vector of noisy signals about functions of the Markov state. The statistician observes a history of signals, but not the Markov state vector itself. Learning about the Markov state involves constructing a sequence of probability distributions for the state, conditional on the observed signal histories. Recursive representations of these conditional distributions form auxiliary Markov processes that fully summarize the information about the hidden state vector contained in the signal history. The state vector in this auxiliary process serves as a set of sufficient statistics for the probability distribution of the hidden Markov state, given the signal history. We describe how to construct this auxiliary Markov process of sufficient statistics sequentially.

We present four examples of Hidden Markov Models used to learn about:

  1. A continuously distributed hidden state vector in a linear state-space system

  2. A discrete hidden state vector

  3. Unknown parameters treated as hidden invariant states

  4. Multiple VAR regimes

5.2. Kalman Filter and Smoother

We assume that a Markov state vector \(X_t\) and a vector \(Z_{t+1}\) of observations are governed by a linear state space system

(5.1)#\[\begin{split}\begin{align} X_{t+1} & = {\mathbb A} X_t + {\mathbb B} W_{t+1} \\ Z_{t+1} & = {\mathbb N} + {\mathbb D} X_t + {\mathbb F} W_{t+1}, \end{align}\end{split}\]

where the matrix \({\mathbb F}{\mathbb F}' \) is nonsingular, \(X_t\) has dimension \(n\), \(Z_{t+1}\) has dimension \(m\) and is a signal observed at \(t+1\), \(W_{t+1}\) has dimension \(k\) and is a standard normally distributed random vector that is independent of \(X_t\), of \(Z^t = [Z_t, \ldots , Z_1]\), and of \(X_0.\) The initial state vector \(X_0 \sim Q_0\), where \(Q_0\) is a normal distribution with mean \(\overline X_0\) and covariance matrix \(\Sigma_0\).[1] To allow an unknown fixed parameter to be represented as an invariant state associated with a unit eigenvalue of \({\mathbb A}\), we do not require \({\mathbb A}\) to be a stable matrix.

Remark 5.1

To connect this discussion to the VAR example in Section 4.3.1 of Chapter 4, suppose that \(X_t\) is partitioned as:

\[X_{t} = \begin{bmatrix} X_t^1 \cr \eta \end{bmatrix}\]

where

\[\begin{split} {\mathbb A} & \eqdef \begin{bmatrix} {\mathbb A}_{11} & 0 \cr 0 & 1 \end{bmatrix} \cr {\mathbb B} & \eqdef \begin{bmatrix} {\mathbb B}_1 \cr 0 \end{bmatrix}, \end{split}\]

where \({\mathbb A}_{11}\) is a stable matrix. In addition,

\[Z_{t+1} = \begin{bmatrix} Y_{t+1} - Y_t \cr Z_{t+1}^2 \end{bmatrix},\]

so \(Y_{t+1} - Y_t\) becomes one of the date \(t+1\) observations. Since we may wish to estimate the trend growth rate \(\eta\), we include this as one of the hidden states. To avoid double counting, we let the first entry of the \({\mathbb N}\) vector be zero. While the previous chapter showed how to construct a martingale component when the state vector \(X_t\) is observable at date \(t\), the analysis of the Kalman filter provides us with a different decomposition in which \(X_t\) is not known and must be inferred.

Although \(\{(X_t, Z_t), t=0, 1, 2, \ldots \}\) is Markov, \(\{ Z_{t}, t=0, 1, 2, \ldots \}\) is typically not.[2] We want to construct an affiliated Markov process whose date \(t\) state is \(Q_t\), defined to be the probability distribution of the time \(t\) Markov state \(X_t\) conditional on history \(Z^t =Z_t, \dots , Z_1\) and \(Q_0\). The distribution \(Q_t\) summarizes information about \(X_t\) that is contained in the history \(Z^t\) and \(Q_0\). We sometimes use \(Q_t\) to indicate conditioning information that is “random” in the sense that it is constructed from a history of observable random vectors. Because the distribution \(Q_t\) is multivariate normal, it suffices to keep track only of the mean vector \({\overline X}_t\) and covariance matrix \(\Sigma_t\) of \(X_t\) conditioned on \(Q_0\) and \(Z^{t}\): \({\overline X}_t\) and \(\Sigma_t\) are sufficient statistics for the probability distribution of \(X_t\) conditional on the history \(Z^{t}\) and \(Q_0\). Conditioning on \(Q_t\) is equivalent to conditioning on these sufficient statistics.

We map sufficient statistics \((\overline X_{j-1}, \Sigma_{j-1})\) for \(Q_{j-1}\) into sufficient statistics \((\overline X_j, \Sigma_j)\) for \(Q_j\) by applying formulas for means and covariances of a conditional distribution associated with a multivariate normal distribution. This generates a recursion that maps \(Q_{j-1}\) and \(Z_{j}\) into \(Q_j\). It enables us to construct \(\{Q_t\}\) sequentially. Thus, consider the following three-step process.

  1. Express the joint distribution of \(X_{t+1}, Z_{t+1}\) conditional on \(X_t\) as

\[\begin{split}\begin{bmatrix} X_{t+1} \\ Z_{t+1} \end{bmatrix} \sim \textrm{Normal}\left(\begin{bmatrix} 0 \\ {\mathbb N} \end{bmatrix} + \begin{bmatrix} {\mathbb A} \\ {\mathbb D} \end{bmatrix} X_t, \begin{bmatrix} {\mathbb B} \\ {\mathbb F} \end{bmatrix} \begin{bmatrix} {\mathbb B}' & {\mathbb F}' \end{bmatrix} \right). \end{split}\]
  2. Suppose that the distribution \(Q_t\) of \(X_t\) conditioned on \(Z^t\) and \(Q_0\) is normal with mean \(\overline X_t\) and covariance matrix \(\Sigma_t\). Use the identity \( X_t = \overline X_t + (X_t - \overline X_t)\) to represent \(\begin{bmatrix} X_{t+1} \\ Z_{t+1} \end{bmatrix}\) as

\[\begin{split}\begin{bmatrix} X_{t+1} \\ Z_{t+1} \end{bmatrix} = \begin{bmatrix} 0 \\ {\mathbb N} \end{bmatrix} + \begin{bmatrix} {\mathbb A} \\ {\mathbb D} \end{bmatrix} {\overline X}_t + \begin{bmatrix} {\mathbb A} \\ {\mathbb D} \end{bmatrix} (X_t - {\overline X}_t) +\begin{bmatrix} {\mathbb B} \\ {\mathbb F} \end{bmatrix} W_{t+1},\end{split}\]

which is just another way of describing our original state-space system (5.1). It follows that the joint distribution of \(X_{t+1}, Z_{t+1}\) conditioned on \(Z^t\) and \(Q_0\), or equivalently on \((\overline X_t, \Sigma_t)\), is

\[\begin{split}\begin{bmatrix} X_{t+1} \\ Z_{t+1} \end{bmatrix} \sim \textrm{Normal}\left( \begin{bmatrix} 0 \\ {\mathbb N} \end{bmatrix} + \begin{bmatrix}{\mathbb A} \\ {\mathbb D} \end{bmatrix} \overline X_t, \begin{bmatrix} {\mathbb A} \\ {\mathbb D} \end{bmatrix} \Sigma_t \begin{bmatrix} {\mathbb A}' & {\mathbb D}' \end{bmatrix} +\begin{bmatrix} {\mathbb B} \\ {\mathbb F} \end{bmatrix}\begin{bmatrix} {\mathbb B}' & {\mathbb F}' \end{bmatrix} \right).\end{split}\]

Evidently the marginal distribution of \(Z_{t+1}\) conditional on \((\overline X_t, \Sigma_t)\) is

\[Z_{t+1} \sim \textrm{Normal} ({\mathbb N} + {\mathbb D} \overline X_t, {\mathbb D} \Sigma_t {\mathbb D}' + {\mathbb F}{\mathbb F}') .\]

The resulting density is called the predictive density \(\phi(z^*|Q_t)\), i.e., the density of \(Z_{t+1} \) conditional on history \(Z^t\) and the initial distribution \(Q_0\).

  3. Joint normality implies that the distribution of \(X_{t+1}\) conditional on \(Z_{t+1} \) and \((\overline X_t, \Sigma_t)\) is also normal and fully characterized by a conditional mean vector and a conditional covariance matrix. We can compute the conditional mean by running a population regression of \(X_{t+1} - {\mathbb A} \overline X_t \) on the surprise in \(Z_{t+1}\), defined as \(Z_{t+1} - {\mathbb N} - {\mathbb D} \overline X_t\).[3] Having thus transformed the random vectors on both sides of our regression to be independent of past observable information, we next compute the covariance matrices

\[\begin{split}\begin{align*} E \left[\left(Z_{t+1} - {\mathbb N} - {\mathbb D} \overline X_t\right) \left(Z_{t+1} - {\mathbb N} - {\mathbb D} \overline X_t\right) ' \mid {\overline X}_t, \Sigma_t \right] & = {\mathbb D} \Sigma_t {\mathbb D}' + {\mathbb F}{\mathbb F}' \equiv \Omega_t \\ E \left[(X_{t+1} - {\mathbb A} \overline X_{t})\left(Z_{t+1} - {\mathbb N} - {\mathbb D} \overline X_t\right) ' \mid {\overline X}_t , \Sigma_t \right] & = {\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}'. \end{align*}\end{split}\]

These provide what we need to compute the conditional expectation

\[E [(X_{t+1} - {\mathbb A} {\overline X}_t ) \mid Z_{t+1} - {\mathbb N} - {\mathbb D} {\overline X}_t,Q_t ] = {\mathcal K}(\Sigma_t) (Z_{t+1} - {\mathbb N} - {\mathbb D} {\overline X}_t) ,\]

where the matrix of regression coefficients \({\mathcal K}(\Sigma_t)\) is called the Kalman gain. To compute the Kalman gain, multiply both sides by \((Z_{t+1} - {\mathbb N} - {\mathbb D} {\overline X}_t)'\) and take expectations conditioned on \(({\overline X}_t, \Sigma_t):\)

\[{\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}' = {\mathcal K}(\Sigma_t) \left( {\mathbb D} \Sigma_t {\mathbb D}' + {\mathbb F}{\mathbb F}' \right) \]

Solving this equation for \( {\mathcal K}(\Sigma_t)\) gives

(5.2)#\[{\mathcal K}(\Sigma_t) = ({\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}' ) ({\mathbb D} \Sigma_t {\mathbb D}' + {\mathbb F} {\mathbb F}')^{-1} .\]

Recognize formula (5.2) as an application of the population least squares regression formula. Least squares delivers the conditional expectation here because the joint distribution involved is multivariate normal.[4] We compute \(\Sigma_{t+1}\) via the recursion

(5.3)#\[\Sigma_{t+1} = {\mathbb A} \Sigma_t {\mathbb A}' + {\mathbb B}{\mathbb B}' - ({\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}' ) ({\mathbb D} \Sigma_t {\mathbb D}' + {\mathbb F}{\mathbb F'})^{-1} ({\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}')' .\]

The right side of recursion (5.3) follows directly from substituting the appropriate formulas into the right side of \( \Sigma_{t+1} \equiv E ( X_{t+1} - \overline X_{t+1}) ( X_{t+1} - \overline X_{t+1})' \) and computing conditional expectations. The matrix \(\Sigma_{t+1}\) obeys the formula from standard regression theory for the population covariance matrix of the least squares residual \(X_{t+1} - {\mathbb A} \overline X_t \). The matrix \({\mathbb A} \Sigma_t {\mathbb A}' + {\mathbb B} {\mathbb B}'\) is the covariance matrix of \(X_{t+1} - {\mathbb A} \overline X_t\) and the remaining term describes the reduction in covariance associated with conditioning on \(Z_{t+1}\). Thus, the probability distribution \(Q_{t+1}\) is

\[X_{t+1} \mid Z_{t+1}, \overline X_t , \Sigma_t \sim \textrm{Normal} ({\overline X}_{t+1}, \Sigma_{t+1}),\]

where

(5.4)#\[\overline X_{t+1} = {\mathbb A} \overline X_t + {\mathcal K}(\Sigma_t) (Z_{t+1} - {\mathbb N} - {\mathbb D} \overline X_t )\]

Equations (5.2), (5.3), and (5.4) constitute the Kalman filter. They provide a recursion that describes \(Q_{t+1}\) as an exact function of \(Z_{t+1} \) and \(Q_t\).
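These three equations can be collected into a single update map. The sketch below is our own illustration, not part of the text: the function name `kalman_step` is a stand-in, and the code simply transcribes (5.2)–(5.4) into NumPy.

```python
import numpy as np

def kalman_step(A, B, D, F, N, x_bar, Sigma, z_next):
    """One Kalman filter update, following (5.2)-(5.4): map the sufficient
    statistics (x_bar, Sigma) and a new signal z_next into their successors."""
    Omega = D @ Sigma @ D.T + F @ F.T           # covariance of the innovation
    C = A @ Sigma @ D.T + B @ F.T               # cross-covariance term
    K = C @ np.linalg.inv(Omega)                # Kalman gain, equation (5.2)
    u = z_next - N - D @ x_bar                  # innovation U_{t+1}
    x_bar_next = A @ x_bar + K @ u              # equation (5.4)
    Sigma_next = A @ Sigma @ A.T + B @ B.T - K @ C.T   # equation (5.3)
    return x_bar_next, Sigma_next
```

Applied to the moving-average model of Example 5.1 below (\({\mathbb A} = 0\), \({\mathbb B} = 1\), \({\mathbb D} = \lambda\), \({\mathbb F} = 1\), \({\mathbb N} = 0\)), this recursion reproduces the covariance update (5.10).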

We may view the outcome of the Kalman filter as taking an original process \(X\) with unobservable states and replacing it with one that lines up with the relevant conditioning information based on sequentially observed signals. This is reflected in the system:

(5.5)#\[\begin{split}\begin{split} \overline X_{t+1} & = {\mathbb A} \overline X_t + {\mathcal K}(\Sigma_t)U_{t+1} \\ Z_{t+1} & = {\mathbb N} + {\mathbb D} \overline X_t + U_{t+1} , \\ \Sigma_{t+1} & = {\mathbb A} \Sigma_t {\mathbb A}' + {\mathbb B}{\mathbb B}' - ({\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}' ) ({\mathbb D} \Sigma_t {\mathbb D}' + {\mathbb F}{\mathbb F'})^{-1} ({\mathbb A} \Sigma_t {\mathbb D}' + {\mathbb B}{\mathbb F}')' . \end{split}\end{split}\]

where \(\{ (\overline X_{t}, Z_t, \Sigma_t): t \ge 0 \}\) is itself a Markov process. The shock process \(W\) used in the original specification is replaced by what is often called an innovation process, \(U\), given by:

\[U_{t+1} \eqdef Z_{t+1} - {\mathbb N} - {\mathbb D} \overline X_t.\]

The \(U\) process is itself normally distributed and temporally independent and provides new information pertinent for predicting current and future states. Specifically, after we condition on initial inputs \((\overline X_0, \Sigma_0)\), \(U_t,U_{t-1}, ..., U_1\) and \(Z_t, Z_{t-1}, ..., Z_1\) generate the same information.

Remark 5.2

The covariance matrix \(\Omega_{t}\) is presumed to be nonsingular, but it is not necessarily diagonal so that components of the innovation vector \(U_{t+1}\) are possibly correlated. We can transform the innovation vector \(U_{t+1}\) to produce a new shock process \({\overline W}_{t+1}\) that has the identity as its covariance matrix. To do so, construct a matrix \(\overline {\mathbb F}(\Sigma_t)\) that satisfies

\[\Omega_{t} = \overline {\mathbb F}(\Sigma_t) [\overline {\mathbb F}(\Sigma_t)]' .\]

The factorization on the right side is not unique. For instance, we could find solutions under which \(\overline {\mathbb F}(\Sigma_t)\) is either lower or upper triangular, but there are other possibilities as well. In what follows we use one such factorization, while the analysis allows any of the possibilities. Construct:

\[\overline W_{t+1} \eqdef \left[ \overline {\mathbb F}(\Sigma_t) \right]^{-1} U_{t+1} \]

Then

(5.6)#\[\begin{split}\overline X_{t+1} & = {\mathbb A} \overline X_t + {\overline {\mathbb B}}(\Sigma_t)\overline W_{t+1} \\ Z_{t+1} & = {\mathbb N} + {\mathbb D} \overline X_t + \overline {\mathbb F}(\Sigma_t) \overline W_{t+1} \end{split}\]

where \({\overline {\mathbb B}}(\Sigma_t) \eqdef {\mathcal K}(\Sigma_t) {\overline {\mathbb F}}(\Sigma_t).\)
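For instance, a lower-triangular \(\overline {\mathbb F}(\Sigma_t)\) can be obtained from a Cholesky factorization. A minimal sketch (our own, with stand-in numbers; any other factorization of \(\Omega_t\) would serve equally well):

```python
import numpy as np

# One admissible choice of F_bar(Sigma_t): the lower-triangular Cholesky
# factor of Omega_t.  Upper-triangular or rotated factors would serve
# equally well.  The numerical values below are stand-ins of our own.
Omega = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # a stand-in innovation covariance
F_bar = np.linalg.cholesky(Omega)         # lower triangular, F_bar F_bar' = Omega

U = np.array([0.3, -1.1])                 # a stand-in innovation draw
W_bar = np.linalg.solve(F_bar, U)         # normalized shock, identity covariance
```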

While we may think of \(\{ ({\overline X}_t, Z_t, \Sigma_t) : t \ge 0\}\) as a Markov process, it will typically not be stationary. Sometimes a stationary counterpart does exist, however. A necessary requirement is that \(\Sigma_t\) be invariant, because it does not depend on the underlying shocks. To construct such a process, find a positive semidefinite fixed point to the recursion (5.3) and initialize \(\Sigma_0 = \overline \Sigma\). Then \(\Sigma_t = \overline \Sigma\) for all \(t \ge 0,\) and

\[\begin{split}&{\mathcal K}(\Sigma_t) = {\mathcal K}(\overline \Sigma) \\ &\Omega_t = {\mathbb D} \overline \Sigma {\mathbb D}' + {\mathbb F} {\mathbb F}' \eqdef \overline \Omega\end{split}\]

for all \(t\ge 1.\) This simplifies recursive representation (5.6) by making \({\overline {\mathbb B}},\) \({\overline {\mathbb F}}\) and \(\Omega\) all time-invariant. One way to obtain such a construction of \({\overline \Sigma}\) is to iterate on equation (5.3) to convergence, assuming that such a limit does exist. With this construction, we have a Markov process representation

(5.7)#\[\begin{split}\overline X_{t+1} & = {\mathbb A} \overline X_t + {\overline {\mathbb B}} \hspace{.1cm} \overline W_{t+1} \\ Z_{t+1} & = {\mathbb N} + {\mathbb D} \overline X_t + { \overline {\mathbb F}}\hspace{.1cm}\overline W_{t+1} \end{split}\]

Compare this to the original state space system (5.1). Key differences are

  1. In the original system (5.1), the shock vector \(W_{t+1}\) can be of much larger dimension than the time \(t+1\) observation vector \(Z_{t+1}\), while in (5.7), the dimension of \(\overline W_{t+1}\) equals that of the observation vector.

  2. The state vector \(X_t\) in the original system (5.1) is not observed while in the representation (5.7), the state vector \(\overline X_t\) is observed.
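The iterate-to-convergence construction of \({\overline \Sigma}\) described above can be sketched as follows (our own illustration; the function name `steady_state_sigma` is a stand-in, and, as the text notes, a limit need not exist in general):

```python
import numpy as np

def steady_state_sigma(A, B, D, F, Sigma0, tol=1e-12, max_iter=10_000):
    """Iterate the Riccati recursion (5.3) from Sigma0 until successive
    iterates agree to within tol.  Convergence is assumed, not guaranteed."""
    Sigma = Sigma0
    for _ in range(max_iter):
        Omega = D @ Sigma @ D.T + F @ F.T
        C = A @ Sigma @ D.T + B @ F.T
        Sigma_new = A @ Sigma @ A.T + B @ B.T - C @ np.linalg.solve(Omega, C.T)
        if np.max(np.abs(Sigma_new - Sigma)) < tol:
            return Sigma_new
        Sigma = Sigma_new
    raise RuntimeError("recursion (5.3) did not converge")
```

Initializing \(\Sigma_0 = \overline \Sigma\) then makes the Kalman gain and innovation covariance time-invariant, as in (5.7).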

Remark 5.3

For some applications, the stationary specification may not be the most interesting one. Our decomposition in the previous chapter did not require that we initialize the (now hidden state) Markov process \(X\) in its stationary distribution. It was a forward-looking construction conditioning on the current Markov state. We now describe how we might use that apparatus when the Markov state is hidden, drawing on the setup in Remark 5.1. To proceed, we partition the constructed hidden state solution (5.6) as

\[\begin{split} {\overline X}_{t+1}^1 & = {\mathbb A}_{11}{\overline X}_t^1 + {\overline {\mathbb B}}_1({\Sigma_t}){\overline W}_{t+1} \cr {\overline X}_{t+1}^2 & = {\overline X}_t^2 + {\overline {\mathbb B}}_2({\Sigma_t}){\overline W}_{t+1} \cr Y_{t+1} - Y_t & = {\mathbb D}_{11} {\overline X}_t^1 + {\mathbb D}_{12} {\overline X}_t^2 + {\overline {\mathbb F}}_1({\Sigma_t}){\overline W}_{t+1}. \end{split}\]

Suppose initially that \({\mathbb D}_{12}\) is zero. Since \({\mathbb A}_{11}\) is a stable matrix, we can imitate the first-order VAR calculation from the previous chapter and construct a martingale component with date \(t+1\) increment:

(5.8)#\[G_{t+1} = \left[ {\overline {\mathbb F}}_1({\Sigma_t}) + {\mathbb D}_{11} \left({\mathbb I} - {\mathbb A}_{11}\right)^{-1} {\overline {\mathbb B}}_1(\Sigma_t) \right]{\overline W}_{t+1}. \]

Next we include the estimation of the second state. By design, this uncertain state is used to introduce an unknown trend coefficient into the analysis. This leads us to modify the computation in a more substantive way. We compute the impact of \({\overline W}_{t+1}\) on \(Y_{t+\tau}\) as:

(5.9)#\[\begin{split} & \left[{\overline {\mathbb F}}_1({\Sigma_t}) + {\mathbb D}_{11} \left[ {\mathbb I} + {\mathbb A}_{11} + ... + \left({\mathbb A}_{11}\right)^{\tau-1} \right]{\overline {\mathbb B}}_1(\Sigma_t) \right] {\overline W}_{t+1} \cr & \hspace{1cm} + \tau {\mathbb D}_{12} {\overline {\mathbb B}}_2(\Sigma_t) {\overline W}_{t+1}. \end{split}\]

Notice that the term on the first line of (5.9) converges to the right side of (5.8) as \(\tau\) gets arbitrarily large. The term on the second line of (5.9), however, diverges for large \(\tau\) when the trend is to be estimated. This occurs because the second component of the state \(X_t\) is invariant over time. A stationary initialization of the constructed state \({\overline X}_0\) would necessarily reflect full knowledge of the trend coefficient, with a zero in the lower-right entry of \(\Sigma_0\).

Example 5.1

Consider an initial first-order moving-average process \(\{Z_{t}\}\) that obeys:

\[Z_{t} = W_{t} + \lambda W_{t-1}, \]

where \(\{W_{t}\}\) is a univariate i.i.d. process of standardized normal random variables. We represent this process by letting \(X_t = W_t\) and write

\[\begin{split} X_{t+1} & = W_{t+1} \cr Z_{t+1} & = \lambda X_t + W_{t+1} = \lambda {\overline X}_t + \lambda\left( X_t - {\overline X}_t \right) + W_{t+1} \end{split}\]

Then \(Z_{t+1}\) has mean \(\lambda {\overline X}_t\) and variance \(\Omega_t = \lambda^2 \Sigma_t + 1\) conditioned on \({\overline X}_t\). The covariance of \(Z_{t+1}\) with \(X_{t+1}\) conditioned on \({\overline X_t}\) is one. Thus the Kalman gain is

\[{\mathcal K}(\Sigma_t) = \frac 1 {\lambda^2 \Sigma_t + 1},\]

and

(5.10)#\[\Sigma_{t+1} = 1 - \frac 1 {\lambda^2 \Sigma_t + 1}.\]

An invariant solution satisfies:

\[\Sigma = 1 - \frac 1 {\lambda^2 \Sigma + 1},\]

or equivalently the quadratic equation:

\[\lambda^2 \Sigma^2 + (1 - \lambda^2) \Sigma = 0.\]

Thus there are two possible solutions:

\[\Sigma = 0 \textrm{ or } \Sigma = \frac {\lambda^2 - 1}{\lambda^2}.\]

The second solution is only of interest when \(|\lambda| > 1.\)

For each of these solutions, form

\[\begin{align} \Omega & = \lambda^2 \Sigma + 1 & = 1 \textrm{ or } \lambda^2 \cr {\mathcal K}(\Sigma) & = \frac 1 {\lambda^2 \Sigma + 1} & = 1 \textrm{ or } \frac 1 {\lambda^2} \cr {\overline F} & = \sqrt{\lambda^2 \Sigma + 1} & = 1 \textrm{ or } |\lambda|\cr {\overline B} & = \frac 1 {\sqrt{\lambda^2 \Sigma + 1}} & = 1 \textrm{ or } \frac 1 {|\lambda|}. \end{align}\]

where the time-invariant recursive solution satisfies:

(5.11)#\[\begin{split} {\overline X}_{t+1} &= {\overline B} \hspace{.1cm} {\overline W}_{t+1} \cr Z_{t+1} & = \lambda {\overline X}_t + {\overline F} \hspace{.1cm} {\overline W}_{t+1} = \lambda {\overline B} \hspace{.1cm} {\overline W}_{t} + {\overline F} \hspace{.1cm} {\overline W}_{t+1}. \end{split}\]

Notice that the solution for the observation process is itself a moving-average representation expressed in terms of the shock process \(\overline W\).

Return now to the updating equation (5.10). One can verify that \(\Sigma = 0\) is the locally stable solution when \(| \lambda | < 1\) and \( \Sigma = ({\lambda^2 - 1})/\lambda^2\) is the stable solution when \(|\lambda| > 1\). Under the \(\Sigma = 0\) solution, \({\overline X}_t = X_t\) and the recursive solution (5.11) just recovers the original moving-average representation. When \(|\lambda| > 1\), this is no longer the case; the two moving-average representations differ. To understand the finding when \(|\lambda| < 1\), solve the equation

\[W_t + \lambda W_{t-1} = Z_t \]

backwards to obtain:

\[W_{t} = \sum_{j=0}^\infty (-\lambda)^j Z_{t-j},\]

showing that the current shock \(W_t\) can be recovered from the current and infinite past of the \(Z\)’s. The infinite sum is well defined since \(|\lambda| < 1.\) This same strategy fails when \(|\lambda| > 1\), and the Kalman filter reveals an alternative moving-average representation in which the prediction \({\overline X}_t\) differs from the hidden state \(X_t.\) Other informationally equivalent representations can be obtained by replacing the \({\overline W}\) process with its negative counterpart.
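The stability claims about the two fixed points can be checked numerically by iterating (5.10) directly (our own sketch; `iterate_sigma` is a stand-in helper name):

```python
def iterate_sigma(lam, sigma0=1.0, n=500):
    """Iterate the updating equation (5.10) n times starting from sigma0."""
    s = sigma0
    for _ in range(n):
        s = 1.0 - 1.0 / (lam**2 * s + 1.0)
    return s
```

For \(|\lambda| < 1\) the iterates approach \(0\), while for \(|\lambda| > 1\) they approach \((\lambda^2 - 1)/\lambda^2\), consistent with which solution is locally stable in each case.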

5.2.1. Kalman smoother

The Kalman filter provides recursive formulas for computing the distribution of a hidden state vector \(X_t\) conditional on a signal history \(\{Z_\tau : \tau = 1,2, ..., t\}\) and an initial distribution \(Q_0\) for \(X_0\). This conditional distribution has the form \(X_t \sim \textrm{Normal}(\overline{X}_t, \Sigma_t)\); the Kalman filtering equations provide recursive formulas for the conditional mean \(\overline{X}_t\) and the conditional covariance matrix \(\Sigma_t\).

Knowing outcomes \(\{\overline{X}_\tau, \Sigma_\tau\}_{\tau =1}^T\) from the Kalman filter provides the foundation for the Kalman smoother. The Kalman smoother uses past, present, and future values of \(Z_\tau\) to learn about current values of the state \(X_{\tau}\). The Kalman smoother is a recursive algorithm that computes sufficient statistics for the distribution of \(X_t\) conditional on the entire sample \(\{Z_t\}_{t=1}^T\), namely, a mean vector and covariance matrix pair \(\widehat{X}_t, \widehat{\Sigma}_t\). The Kalman smoother takes outputs \(\{\overline{X}_t, \Sigma_t\}_{t=0}^T\) from the Kalman filter as inputs and then works backwards through the following steps starting from \(t = T\).

  • Reversed time regression. Write the joint distribution of \((X_t, X_{t+1}, Z_{t+1})\) conditioned on \(\left( \overline{X}_t , \Sigma_t \right)\) as

\[\begin{split}\begin{bmatrix} X_t \\ X_{t+1} \\ Z_{t+1} \end{bmatrix} \sim \textrm{Normal}\left( \begin{bmatrix} \overline{X}_t \\ \mathbb{A} \overline{X}_t \\ \mathbb{N} +\mathbb{D} \overline{X}_t \end{bmatrix}, \begin{bmatrix} \Sigma_t & \Sigma_t \mathbb{A}' & \Sigma_t \mathbb{D}' \\ \mathbb{A} \Sigma_t & \mathbb{A} \Sigma_t \mathbb{A}' + \mathbb{B} \mathbb{B}' & \mathbb{A} \Sigma_t\mathbb{D}' + \mathbb{B} \mathbb{F}' \\ \mathbb{D} \Sigma_t & \mathbb{D} \Sigma_t \mathbb{A}' + \mathbb{F}\mathbb{B}' & \mathbb{D} \Sigma_t \mathbb{D}' + \mathbb{F}\mathbb{F}' \end{bmatrix} \right)\end{split}\]

From this joint distribution, construct the conditional distribution for \(X_t\), given \(X_{t+1}, Z_{t+1}\) and \(\left( \overline{X}_t , \Sigma_t \right)\). Compute the conditional mean of \(X_t - \overline{X}_t\) by using the population least squares formula

(5.12)#\[\widehat{\mathbb{K}}_1 \left( X_{t+1} - \mathbb{A}\overline{X}_t\right) + \widehat{\mathbb{K}}_2 \left(Z_{t+1} - \mathbb{N}- \mathbb{D} \overline{X}_t \right) \]

where the regression coefficient matrix is

\[\begin{bmatrix} \widehat{\mathbb{K}}_1 & \widehat{\mathbb{K}}_2 \end{bmatrix} = \widehat{\mathbb{K}} \doteq \begin{bmatrix} \Sigma_t \mathbb{A}' & \Sigma_t \mathbb{D}' \end{bmatrix} \begin{bmatrix} \mathbb{A} \Sigma_t \mathbb{A}' + \mathbb{B} \mathbb{B}' & \mathbb{A} \Sigma_t\mathbb{D}' + \mathbb{B} \mathbb{F}' \cr \mathbb{D} \Sigma_t \mathbb{A}' + \mathbb{F}\mathbb{B}' & \mathbb{D} \Sigma_t \mathbb{D}' + \mathbb{F}\mathbb{F}' \end{bmatrix} ^{-1}\]

and the residual covariance matrix equals

(5.13)#\[\Sigma_t - \begin{bmatrix} \Sigma_t \mathbb{A}' & \Sigma_t \mathbb{D}' \end{bmatrix} \begin{bmatrix} \mathbb{A} \Sigma_t \mathbb{A}' + \mathbb{B} \mathbb{B}' & \mathbb{A} \Sigma_t\mathbb{D}' + \mathbb{B} \mathbb{F}' \cr \mathbb{D} \Sigma_t \mathbb{A}' + \mathbb{F}\mathbb{B}' & \mathbb{D} \Sigma_t \mathbb{D}' + \mathbb{F}\mathbb{F}' \end{bmatrix}^{-1} \begin{bmatrix} \mathbb{A} \Sigma_t \cr \mathbb{D} \Sigma_t \end{bmatrix}\]
  • Iterated expectations. Notice that the above reverse regression includes \(X_{t+1} - \mathbb{A} \overline{X}_t\) among the regressors. Because \(X_{t+1}\) is hidden, that is more information than we have. We can condition down to information that we actually have by instead using \({\widehat{X}}_{t+1} - \mathbb{A} \overline{X}_t\) as the regressor where \({\widehat{X}}_{t+1}\) is the conditional expectation of \(X_{t+1}\) given the full sample of data \(\{Z_{t}\}_{t=1}^T\) and \(\widehat{\Sigma}_{t+1}\) is the corresponding conditional covariance matrix. This gives us a backwards recursion for \({\widehat{X}}_t\):

\[{\widehat{X}}_t - \overline{X}_t = \widehat{\mathbb{K}}_1 \left( {\widehat{X}}_{t+1} - \mathbb{A}\overline{X}_t\right) + \widehat{\mathbb{K}}_2 \left(Z_{t+1} - \mathbb{N} - \mathbb{D} \overline{X}_t \right) \]

The law of iterated expectations implies that the regression coefficient matrices \(\widehat{\mathbb{K}}_1, \widehat{\mathbb{K}}_2\) equal the ones we have already computed. But since we are using less information, the conditional covariance matrix increases by \(\widehat{\mathbb{K}}_1 \widehat{\Sigma}_{t+1}\widehat{\mathbb{K}}_1'\). This implies the backwards recursion:

\[{\widehat{\Sigma}}_t = \Sigma_t - \begin{bmatrix} \Sigma_t \mathbb{A}' & \Sigma_t \mathbb{D}' \end{bmatrix} \begin{bmatrix} \mathbb{A} \Sigma_t \mathbb{A}' + \mathbb{B} \mathbb{B}' & \mathbb{A} \Sigma_t\mathbb{D}' + \mathbb{B} \mathbb{F}' \cr \mathbb{D} \Sigma_t \mathbb{A}' + \mathbb{F}\mathbb{B}' & \mathbb{D} \Sigma_t \mathbb{D}' + \mathbb{F}\mathbb{F}' \end{bmatrix}^{-1} \begin{bmatrix} \mathbb{A} \Sigma_t \cr \mathbb{D} \Sigma_t \end{bmatrix} + \widehat{\mathbb{K}}_1 \widehat{\Sigma}_{t+1}\widehat{\mathbb{K}}_1' \]
  • Take \({\widehat{\Sigma}}_T = \Sigma_T\) and \({\widehat{X}}_T = \overline{X}_T\) as terminal conditions.
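One backward step of this smoother can be transcribed directly from (5.12)–(5.13) and the iterated-expectations adjustment (an illustrative sketch of our own; the function name `smoother_step` and variable names are stand-ins):

```python
import numpy as np

def smoother_step(A, B, D, F, N, x_bar, Sigma, x_hat_next, Sigma_hat_next, z_next):
    """One backward smoother step: combine the filtered statistics
    (x_bar, Sigma) at date t with the smoothed statistics at t+1 to
    produce the smoothed statistics at t, following (5.12)-(5.13)."""
    n = A.shape[0]
    # Covariance matrix of (X_{t+1}, Z_{t+1}) conditioned on (x_bar, Sigma)
    M = np.block([
        [A @ Sigma @ A.T + B @ B.T, A @ Sigma @ D.T + B @ F.T],
        [D @ Sigma @ A.T + F @ B.T, D @ Sigma @ D.T + F @ F.T],
    ])
    Cov = np.hstack([Sigma @ A.T, Sigma @ D.T])  # cov of X_t with (X_{t+1}, Z_{t+1})
    Minv = np.linalg.inv(M)
    K_hat = Cov @ Minv                           # regression coefficients in (5.12)
    K1, K2 = K_hat[:, :n], K_hat[:, n:]
    x_hat = x_bar + K1 @ (x_hat_next - A @ x_bar) \
                  + K2 @ (z_next - N - D @ x_bar)
    # Residual covariance plus the penalty for using x_hat_next instead of X_{t+1}
    Sigma_hat = Sigma - Cov @ Minv @ Cov.T + K1 @ Sigma_hat_next @ K1.T
    return x_hat, Sigma_hat
```

Starting from the terminal conditions \({\widehat X}_T = \overline X_T\), \({\widehat \Sigma}_T = \Sigma_T\), repeated calls work backwards through the sample.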

5.3. Mixtures

Suppose now that \(\{ X_t : t \ge 0\}\) evolves as an \(n\)-state Markov process with transition probability matrix \(\mathbb{P}\).
A date \(t+1\) vector of signals \(Z_{t+1}\) has density \(\psi_i(z^*)\) if hidden state \(i\) is realized, meaning that \(X_t\) is the \(i\)th coordinate vector. We want to compute the probability that \(X_t\) is in state \(i\) conditional on the signal history. The vector of conditional probabilities equals \(Q_t = E[X_t | Z^t, Q_0]\), where \(Q_0\) is a vector of initial probabilities and \(Z^t\) is the available signal history up to date \(t\). We construct \(\{Q_t : t\ge 1\}\) recursively:

  1. Find the joint distribution of \((X_{t+1},Z_{t+1} )\) conditional on \(X_t\). Conditional on \(X_t\), the distributions of \(Z_{t+1}\) and \(X_{t+1}\) are statistically independent by assumption. Write the joint density conditioned on \(X_t\) as:

(5.14)#\[\begin{split}\begin{matrix} \left(\mathbb{P}'X_t \right) & \times & (X_t)' \text{vec} \left\{ \psi_i(z^*) \right\} \\ \uparrow & & \uparrow \\ X_{t+1} \ \ \text{density} & & Z_{t+1} \ \ \text{density} \end{matrix}\end{split}\]

where \(\text{vec}(r_i)\) is a column vector with \(r_i\) in the \(i\)th component. We have expressed conditional independence by forming a joint conditional distribution as a product of two conditional densities, one for \(X_{t+1}\) and one for \(Z_{t+1}\).

  2. Find the joint distribution of \(X_{t+1},Z_{t+1}\) conditioned on \(Q_t\). Since \(X_t\) is not observed, we form the appropriate average of (5.14) conditioned on \(Z^t, Q_0\):

    (5.15)#\[\mathbb{P}' \text{diag}\{Q_t\} \text{vec} \left\{ \psi_i(z^*) \right\},\]

    where \(\text{diag}(Q_t)\) is a diagonal matrix with the entries of \(Q_t\) on the diagonal. Thus, \(Q_t\) encodes all pertinent information about \(X_t\) that is contained in the history of signals. Conditional on \(Q_t\), \(X_{t+1}\) and \(Z_{t+1}\) are not statistically independent.

  3. Find the distribution of \(Z_{t+1}\) conditional on \(Q_t\). Summing (5.15) over the hidden states gives

    \[(\mathbf{1}_n)'\mathbb{P}' \text{diag}\{Q_t\} \text{vec} \left\{ \psi_i(z^*) \right\} = Q_t\cdot \text{vec} \left\{ \psi_i(z^*) \right\}.\]

    Thus, \(Q_t\) is a vector of weights used to form a mixture distribution. Suppose, for instance, that \(\psi_i\) is a normal distribution with mean \(\mu_i\) and covariance matrix \(\Sigma_i\). Then the distribution of \(Z_{t+1}\) conditioned on \(Q_t\) is a mixture of normals with mixing probabilities given by entries of \(Q_t\).

  4. Obtain \(Q_{t+1}\) by dividing the joint density of \((Z_{t+1} ,X_{t+1})\) conditional on \(Q_t\) by the marginal density for \(Z_{t+1}\) conditioned on \(Q_t\) and then evaluating this ratio at \(Z_{t+1}\). In this way, we construct the density for \(X_{t+1}\) conditioned on \((Q_t,Z_{t+1})\). It takes the form of a vector \(Q_{t+1}\) of conditional probabilities. Thus, we are led to

    (5.16)#\[Q_{t+1} = \left( \frac{1}{Q_t\cdot \text{vec} \left\{ \psi_i(Z_{t+1} ) \right\}}\right)\mathbb{P}' \text{diag}(Q_t) \text{vec} \left\{ \psi_i(Z_{t+1}) \right\}\]

Together, steps 3 and 4 define a Markov process for \(Q_{t+1}\). As indicated in step 3, \(Z_{t+1}\) is drawn from a (history-dependent) mixture of densities \(\psi_i\). As indicated in step 4, the vector \(Q_{t+1}\) equals the exact function of \(Z_{t+1}\) and \(Q_t\) described in (5.16).
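The update (5.16) is a one-line computation. The sketch below is our own illustration: `filter_step` is a stand-in name, and `lik` stands for the vector of densities \(\psi_i(Z_{t+1})\), however they are computed.

```python
import numpy as np

def filter_step(P, Q, lik):
    """One update of (5.16).  P is the n-by-n transition matrix (row i
    holds the next-state probabilities given state i), Q the current
    vector of state probabilities, and lik the vector of densities
    psi_i evaluated at the realized signal."""
    joint = P.T @ np.diag(Q) @ lik      # unnormalized probabilities over states
    return joint / (Q @ lik)            # divide by the predictive density
```

Because the rows of \(\mathbb{P}\) sum to one, the denominator \(Q_t \cdot \text{vec}\{\psi_i(Z_{t+1})\}\) is exactly the normalizing constant, so the output sums to one.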

5.4. Recursive Regression

A statistician wants to infer unknown parameters of a linear regression model. By treating regression coefficients as hidden states that are constant over time, we can cast this problem in terms of a hidden Markov model. By assigning a prior probability distribution to statistical models that are indexed by parameter values, the statistician can construct a stationary stochastic process as a mixture of statistical models.[5] From increments to a data history, the statistician learns about parameters sequentially. By assuming that the statistician adopts a conjugate prior à la [Luce and Raiffa, 1957], we can construct explicit updating formulas.

Consider the first-order vector autoregressive model

(5.17)#\[\begin{split}X_{t+1} = {\mathbb A} X_t + {\mathbb B} W_{t+1} \\ Z_{t+1} = {\mathbb N} + {\mathbb D} X_{t} + {\mathbb F} W_{t+1}\end{split}\]

where \(W_{t+1}\) is an i.i.d. normal random vector with mean vector \(0\) and covariance matrix \(I\), \(X_t\) is an observable state vector, and \({\mathbb A}, {\mathbb B}, {\mathbb D}, {\mathbb F}, {\mathbb N}\) are matrices containing unknown coefficients. When \({\mathbb A}\) is a stable matrix, the vector \({\mathbb N}\) is interpretable as the vector of means of the observation vector \(Z_{t+1}\) (conditioned on invariant events).

We now construct a special case of this representation. Suppose that \(Z_{t+1}\) and \(W_{t+1}\) share the same dimensions, that \({\mathbb F}\) is nonsingular, and that \(X_t\) consists of:

\[X_t = \begin{bmatrix} Z_t - {\mathbb N} \cr Z_{t-1} - {\mathbb N} \cr ... \cr Z_{t-\ell +1} - {\mathbb N} \end{bmatrix}\]

We obtain a finite-order vector autoregression by writing:

(5.18)#\[\begin{split} Z_{t+1} & = {\mathbb N} + {\mathbb D} X_t + {\mathbb F} W_{t+1} \cr \cr & = {\widetilde {\mathbb N}} + {\mathbb D} \begin{bmatrix} Z_t \cr Z_{t-1} \cr ... \cr Z_{t-\ell +1} \end{bmatrix} + {\mathbb F} W_{t+1} \end{split}\]

where

\[{\widetilde {\mathbb N}} \eqdef {\mathbb N} - {\mathbb D} \begin{bmatrix} {\mathbb N} \cr {\mathbb N} \cr ... \cr {\mathbb N} \end{bmatrix} \]

Our plan is to estimate the coefficients of the matrices \({\mathbb N} , {\mathbb D},\) and \({\mathbb F}\). Notice that \({\mathbb N}\) can potentially be recovered from \({\widetilde {\mathbb N}}\) and \({\mathbb D}\). The matrix \({\mathbb F}\) is not fully identified without further a priori restrictions; what is identified is \({\mathbb F}{\mathbb F}'\). This identification challenge is the topic of the literature on so-called “structural vector autoregressions,” which seeks to motivate restrictions on the matrix \({\mathbb F}\) sufficient to achieve unique identification. Our previous construction of a permanent shock is one way to contribute to this identification. As with most of the VAR literature, we do not consider the case of over-identification in what follows. With exact identification, we impose a convenient normalization on \({\mathbb F}\); other observationally equivalent and substantively motivated \({\mathbb F}\)’s can be constructed from our estimates.

5.4.1. Conjugate prior updating#

In practice, priors are often selected as a matter of convenience; they are constructed with tractability in mind. Conjugate priors are an example of this convenience. In a later chapter, we will use this as one motivation for exploring prior sensitivity. In this section, we proceed recursively, using the insight that “today’s posterior is tomorrow’s prior.” Following suggestions offered by [Zellner, 1962], [Box and Tiao, 1992], [Sims and Zha, 1999], and especially [Zha, 1999], we transform system (5.18) in a way that justifies estimating the unknown coefficients by applying least squares equation-by-equation.

Start by factoring the matrix

(5.19)#\[{\mathbb F}{\mathbb F}' = {\mathbb J} \Delta {\mathbb J}'\]

where \({\mathbb J}\) is lower triangular with ones on the diagonal and \(\Delta\) is diagonal.[6] The inverse \({\mathbb J}^{-1}\) is also lower triangular with ones on the diagonal. We choose a lower triangular \({\mathbb J}\) as a matter of convenience, not necessarily for reasons of economic interpretation. The literature on structural VARs explores the identification of the matrix \({\mathbb F}\) from knowledge of the covariance matrix of \(U_{t+1}\) given by

\[{\mathbb F}{\mathbb F}'\]

using a variety of different approaches. Constructing a matrix \({\mathbb F}\) amounts to identifying a vector of substantively interesting shocks. The construction of a common permanent shock for a vector time series gives a way to identify one of the shocks of interest, but there are many other choices in the literature, including different configurations of zeros in the matrix \({\mathbb F}\). Some researchers stop with only partial identification of the matrix \({\mathbb F}\). For instance, see [Uhlig, 2005] for a comprehensive discussion of how to use sign restrictions to achieve partial identification. Our strategy going forward is to impose a particularly convenient initial identification. Posterior probabilities for other, potentially more interesting, forms of identification can then be inferred from our computations.
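Factorization (5.19) can be obtained from a Cholesky factor by rescaling its columns. The sketch below is our own illustration in NumPy; it recovers a unit-lower-triangular \({\mathbb J}\) and a diagonal \(\Delta\) from a given covariance matrix \({\mathbb F}{\mathbb F}'\).

```python
import numpy as np

def ldl_from_cov(sigma):
    """Factor sigma = J Delta J' with J unit lower triangular, Delta diagonal."""
    L = np.linalg.cholesky(sigma)   # sigma = L L', L lower triangular
    d = np.diag(L)
    J = L / d                       # divide column j by L[j, j]
    Delta = np.diag(d ** 2)
    return J, Delta

J, Delta = ldl_from_cov(np.array([[4.0, 2.0], [2.0, 3.0]]))
```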

Construct

(5.20)#\[{\mathbb J}^{-1} Z_{t+1} = {\mathbb J}^{-1}{\widetilde {\mathbb N}} + {\mathbb J}^{-1} {\mathbb D} \begin{bmatrix} Z_t \cr Z_{t-1} \cr ... \cr Z_{t-\ell +1} \end{bmatrix} + U_{t+1}\]

where

\[U_{t+1} = {\mathbb J}^{-1} {\mathbb F} W_{t+1}\]

so that \(E U_{t+1} U_{t+1}' = \Delta\). The \(i^{th}\) entry of \(U_{t+1}\) is uncorrelated with, and consequently statistically independent of, the \(j\)th components of \(Z_{t+1}\) for \(j = 1, 2, \ldots ,i-1\). As a consequence, each equation in system (5.20) can be interpreted as a regression equation in which the left-hand side variable in equation \(i\) is the \(i^{th}\) component of \(Z_{t+1}\). The regressors are a constant, \(Z_t, Z_{t-1} ..., Z_{t-\ell +1} \), and the \(j^{th}\) components of \(Z_{t+1}\) for \(j = 1, \ldots, i-1\). The \(i\)th equation is an unrestricted regression with a disturbance term \(U_{t+1,i}\) that is uncorrelated with disturbances \(U_{t+1,j}\) to all other equations \(j \neq i\). The system of equations (5.20) is thus recursive. The first equation determines the first entry of \(Z_{t+1}\), the second equation determines the second entry of \(Z_{t+1}\) given the first entry, and so forth.

We construct estimates of the coefficient matrices \({\mathbb A},{\mathbb B},{\mathbb D},{\mathbb F}, {\mathbb N}\) and the covariance matrix \(\Delta = E U_{t+1} U_{t+1}'\) from these regression equations, with the reminder that knowledge of \({\mathbb J}\) and \(\Delta\) may not directly reveal the matrix \({\mathbb F}\) that is of substantive interest.

Consider, in particular, the \(i\)th regression formed in this way and express it as the scalar regression model:

\[Z_{t+1}^{[i]} = {R_{t+1}^{[i]}}'\beta^{[i]} + U_{t+1}^{[i]}\]

where \({R_{t+1}^{[i]}}\) is the appropriate vector of regressors in the \(i\)th equation of system (5.20). To simplify notation, we omit superscripts and understand that we are estimating one equation at a time. To avoid notational confusion, we let the left side variable of the regression be denoted \(Y_{t+1}\). The disturbance \(U_{t+1}\) is a normally distributed random variable with mean zero and variance \(\sigma^2\). Furthermore, \(U_{t+1}\) is statistically independent of \(R_{t+1}\). Information observed as of date \(t\) consists of \(X_0\) and \(Z^{t} = [Z_t', \ldots, Z_1']'\). Suppose that in addition \(Z_{t+1}\) and \(R_{t+1}\) are also observed at date \(t+1\) but that \(\beta\) and \(\sigma^2\) are unknown.

Let the distribution of \(\beta\) conditioned on \(Z^t\), \(X_0\), and \(\sigma^2\) be normal with mean \(b_t\) and precision matrix \(\zeta \Lambda_t\) where \(\zeta = \frac{1}{\sigma^2}\). Here the precision matrix equals the inverse of a conditional covariance matrix of the unknown parameters. At date \(t+1\), we add \(Z_{t+1}\) to the conditioning set. So we want the distribution of \(\beta\) conditioned on \(Z^{t+1}\), \(X_0\), and \(\sigma^2\). It is also normal but now has precision \(\zeta \Lambda_{t+1}.\)

To deduce the recursive updating conditioned on \(\zeta\), we observe that both the date \(t\) prior distribution and the date \(t+1\) conditional density for \(Y_{t+1}\) are normal. After multiplication, the terms inside the exponential involving \(\beta\) are

\[\begin{split} &\exp\left( \zeta Y_{t+1}{R_{t+1}}' \beta - {\frac \zeta 2} \beta'{R_{t+1}}{R_{t+1}}'\beta + \zeta {b_{t}}'\Lambda_t \beta - \frac \zeta 2 {\beta}' \Lambda_t \beta \right) \cr &\propto \exp \left[ - {\frac \zeta 2} \left( \beta - b_{t+1}\right)'\Lambda_{t+1}\left( \beta - b_{t+1} \right) \right] \end{split}\]

where

(5.21)#\[\Lambda_{t+1} = {R_{t+1}}{R_{t+1}}' + \Lambda_t \]

and

(5.22)#\[\Lambda_{t+1} b_{t+1} = \left[\Lambda_t b_t + {R_{t+1}}Y_{t+1} \right] .\]

Recursion (5.21) implies that \(\Lambda_{t+1} - \Lambda_t\) is a positive semidefinite matrix, which confirms that additional information improves estimation accuracy. Evidently from recursion (5.21), \(\Lambda_{t+1}\) cumulates cross-products of the regressors and adds them to an initial \(\Lambda_0\). The updated conditional mean \(b_{t+1}\) for the normal distribution of unknown coefficients can be deduced from \(\Lambda_{t+1}\) via the updating equation (5.22). Solving difference equation (5.22) backwards shows how \(\Lambda_{t+1} b_{t+1}\) cumulates cross-products of \(R_{t+1}\) and \(Y_{t+1}\) and adds the outcome to an initial condition \(\Lambda_0 b_0\).
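A minimal sketch of the updating equations (5.21) and (5.22) in NumPy (function and variable names are ours):

```python
import numpy as np

def bayes_regression_update(b, Lam, R, y):
    """Update (b_t, Lambda_t) to (b_{t+1}, Lambda_{t+1}).

    R : regressor vector R_{t+1};  y : scalar observation Y_{t+1}.
    """
    Lam_new = np.outer(R, R) + Lam                      # (5.21)
    b_new = np.linalg.solve(Lam_new, Lam @ b + R * y)   # (5.22)
    return b_new, Lam_new
```

Starting from a nearly flat prior, repeated application reproduces the least squares fit of the accumulated data.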

So far we pretended that we know \(\sigma^2\) by conditioning on \(\sigma^2\), which is equivalent to conditioning on its inverse \(\zeta\). Assume now that we don’t know \(\sigma^2\) but instead summarize our uncertainty about it with a date \(t\) gamma density for \(\zeta\) conditioned on \(Z^t\), \(X_0\) so that it is proportional to

\[(\zeta)^{\frac{c_{t}}{2}} \exp(- d_t \zeta /2),\]

where the density is expressed as a function of \(\zeta\), so that \(d_t \zeta\) has a chi-square density with \(c_t + 2\) degrees of freedom. The implied density for \(\zeta\) conditioned on time \(t+1\) information is also a gamma density with updated parameters:

\[\begin{split}\begin{align*} c_{t+1} & = c_t + 1 \\ d_{t+1} & = (Y_{t+1})^2 - (b_{t+1})'\Lambda_{t+1}b_{t+1} + (b_{t})'\Lambda_t b_{t} + d_t. \end{align*}\end{split}\]

The distribution of \(\beta\) conditioned on \(Z^{t+1}\), \(X_0\), and \(\zeta\) is normal with mean \(b_{t+1}\) and precision matrix \(\zeta \Lambda_{t+1}\). The distribution of \(\zeta\) conditioned on \(Z^{t+1}\), \(X_0\) has a gamma density, so that it is proportional to[7]

\[(\zeta)^{\frac{c_{t + 1}}{2}} \exp(- d_{t + 1} \zeta /2).\]

Standard least squares regression statistics can be rationalized by positing a prior that is not informative. This is commonly done by using an “improper” prior that does not integrate to unity. [8] Setting \(\Lambda_0 = 0\) effectively imposes a uniform but improper prior over \(\beta\). Although the \(\Lambda_t\)’s early in the sequence are singular, we can still update \(\Lambda_{t+1} b_{t+1}\) via (5.22); \(b_{t+1}\) is not uniquely determined until \(\Lambda_{t+1}\) becomes nonsingular. After enough observations have accumulated to make \(\Lambda_{t+1}\) nonsingular, the implied normal distributions for the unknown parameters become proper. When \(\Lambda_0 = 0\), the specification of \(b_0\) is inconsequential and \(b_{t+1}\) becomes a standard least squares estimator. An “improper gamma” prior over \(\sigma\) that is often paired with an improper normal prior over \(\beta\) sets \(c_0\) to minus two and \(d_0\) to zero. This is accomplished by assuming a uniform prior distribution for the logarithm of the precision \(\zeta\), or equivalently for the logarithm of \(\sigma^2\). With this combination of priors, \(d_{t+1}\) becomes a sum of squared regression residuals. [9]
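To illustrate the improper-prior case, the following sketch (names ours) runs the recursions with \(\Lambda_0 = 0\), \(b_0 = 0\), \(c_0 = -2\), \(d_0 = 0\); the final \(b\) coincides with the ordinary least squares estimate, and the final \(d\) equals the sum of squared residuals.

```python
import numpy as np

def recursive_ols(X, y):
    """Conjugate updates (5.21)-(5.22) under the improper prior."""
    k = X.shape[1]
    Lam = np.zeros((k, k))
    Lb = np.zeros(k)            # stores Lambda_t b_t
    c, d = -2.0, 0.0
    bqf_old = 0.0               # b_t' Lambda_t b_t
    b = np.zeros(k)
    for R, Y in zip(X, y):
        Lam = Lam + np.outer(R, R)                   # (5.21)
        Lb = Lb + R * Y                              # (5.22)
        b = np.linalg.lstsq(Lam, Lb, rcond=None)[0]  # min-norm while singular
        bqf_new = b @ Lam @ b
        d = d + Y ** 2 - bqf_new + bqf_old           # gamma-parameter update
        c = c + 1.0
        bqf_old = bqf_new
    return b, c, d
```

The update for \(d\) telescopes, so the final value is \(\sum_t Y_t^2 - b'\Lambda b\), which equals the residual sum of squares at the least squares solution.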

From the posterior of the coefficients of this transformed system we can compute posteriors of nonlinear functions of those coefficients. We do so by repeatedly taking pseudo-random draws from the posterior distribution of the coefficients, forming the nonlinear functions of interest, and using the resulting histograms to approximate the posterior distributions of those functions. For example, many applied macroeconomic papers report impulse responses as a way to summarize model features; impulse responses are nonlinear functions of \(({\mathbb A}, {\mathbb B})\).

The conjugate prior approach described above does not guarantee that either the prior or the implied posterior assigns probability one to the matrix \({\mathbb A}\) having stable eigenvalues. We therefore modify that approach to impose that \({\mathbb A}\) is a stable matrix. We do this by rescaling the posterior probability so that it integrates to one over the region of the parameter space in which \({\mathbb A}\) is stable; in effect, we condition on \({\mathbb A}\) being stable. This is easy to implement by rejection sampling.[10]
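A sketch of the rejection sampler follows; the `draw_posterior` function is a hypothetical placeholder for a sampler from the unrestricted posterior.

```python
import numpy as np

def sample_stable_A(draw_posterior, n_keep, seed=0):
    """Keep only posterior draws of A whose eigenvalues lie strictly
    inside the unit circle (rejection sampling)."""
    rng = np.random.default_rng(seed)
    keepers = []
    while len(keepers) < n_keep:
        A = draw_posterior(rng)
        if np.max(np.abs(np.linalg.eigvals(A))) < 1.0:
            keepers.append(A)
    return keepers

# hypothetical stand-in for an actual posterior sampler
draws = sample_stable_A(lambda rng: rng.normal(scale=0.5, size=(2, 2)), 10)
```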

The standard deviation of the martingale increment is a nonlinear function of parameters in \(({\mathbb A}, {\mathbb B})\). We construct a posterior distribution via Monte Carlo simulation. We draw from the posterior of the multivariate regression system and, after conditioning on stability of the \({\mathbb A}\) matrix, compute the nonlinear functions of interest. From the simulation, we construct joint histograms to approximate posterior distributions of functions of interest.[11]

As an illustration, Fig. 5.2 and Fig. 5.3 show posterior histograms for the standard deviations of shocks to short-term consumption growth and of the martingale increment to consumption, based on the application described in Section An economic rationale of Chapter 4. The standard deviation of the short-term shock contribution is about one-half the standard deviation of the martingale increment. The figures tell us that short-term risk can be inferred with much more accuracy than long-term risk. This evidence says that while there could be a long-run risk component to consumption, it is poorly measured. The fat right tail of the distribution of the long-run standard deviation is induced by Monte Carlo draws for which some eigenvalues of \(\mathbb A\) have absolute values very close to unity. [12]

../_images/VAR_sigma_c.png

Fig. 5.2 Posterior density for conditional standard deviation of consumption growth.#

../_images/VAR_sigma_z.png

Fig. 5.3 Posterior distribution for the standard deviation of the martingale increment.#

Remark 5.4

Following [Carter and Kohn, 1994], we consider a Gibbs sampling approach for making inferences about a linear state-space model with hidden states and unknown parameters. Let \(\theta\) denote a stand-in for the unknown parameters of the state-space model, and let \(\vartheta\) denote a stand-in for the entire collection of unknown states. All of the computations that follow condition on a time series of observations, \(Z_1, Z_2, ...., Z_T,\) although we suppress this conditioning in the notation that follows. The Kalman smoother gives a conditional distribution, \(P(d\vartheta \mid \theta).\) Given the observations of the composite state vector, \(\vartheta,\) we extend the conjugate prior method to construct \({\widetilde P}(d\theta \mid \vartheta).\) Finally, let \(Q(d \vartheta, d\theta)\) denote the joint probability for \((\vartheta, \theta)\), which is the target of computation.

To compute the joint distribution, form a Markov process with a one-period transition from \((\vartheta, \theta)\) to \((\vartheta^+, \theta^+)\) by following the two-step construction for use in simulation:

  • generate \(\vartheta^+\) using \(P(d\vartheta^+ \mid \theta)\);

  • generate \(\theta^+\) using \({\widetilde P}(d\theta^+ \mid \vartheta^+).\)

The transition distribution for the Markov process is:

\[{\widetilde P}(d\theta^+ \mid \vartheta^+)P(d\vartheta^+ \mid \theta). \]

We verify that \(Q\) is a stationary distribution by first computing

\[\begin{split} & \int_{\vartheta, \theta} {\widetilde P}(d\theta^+ \mid \vartheta^+) P(d\vartheta^+ \mid \theta) Q( d\vartheta, d\theta) \cr &= {\widetilde P}(d\theta^+ \mid \vartheta^+) \int_\theta P(d\vartheta^+ \mid \theta) \int_\vartheta Q( d\vartheta, d\theta). \end{split}\]

Note that \(\int_\vartheta Q( d\vartheta, d\theta)\) is the marginal distribution over \(\theta\) and thus

\[P(d\vartheta^+ \mid \theta) \int_\vartheta Q( d\vartheta, d\theta) = Q(d \vartheta^+, d \theta).\]

It follows that

\[\int_\theta P(d\vartheta^+ \mid \theta) \int_\vartheta Q( d\vartheta, d\theta) = \int_\theta Q(d \vartheta^+, d \theta).\]

The integral on the right side is the marginal distribution over \(\vartheta^+.\) Finally, note that

\[{\widetilde P}(d\theta^+ \mid \vartheta^+)\int_\theta Q(d \vartheta^+, d \theta) = Q( d\vartheta^+, d\theta^+) ,\]

showing that \(Q\) is indeed the stationary distribution of this constructed Markov process. This allows us to approximate \(Q\) by simulating the Markov process. This approach is an example of a numerical method called Gibbs sampling.
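The two-step construction can be illustrated on a toy example: a Gibbs sampler for a bivariate normal with correlation \(\rho\), where the two conditional draws play the roles of \(P(d\vartheta^+ \mid \theta)\) and \({\widetilde P}(d\theta^+ \mid \vartheta^+)\), and the joint normal is the stationary distribution. (This sketch is ours, for illustration only.)

```python
import numpy as np

def gibbs_bivariate_normal(rho, n, seed=0):
    """Alternate draws from the two conditionals of a standard bivariate
    normal with correlation rho; the joint normal is stationary."""
    rng = np.random.default_rng(seed)
    x = y = 0.0
    s = np.sqrt(1.0 - rho ** 2)
    draws = np.empty((n, 2))
    for t in range(n):
        x = rng.normal(rho * y, s)   # analogue of P(d vartheta^+ | theta)
        y = rng.normal(rho * x, s)   # analogue of P~(d theta^+ | vartheta^+)
        draws[t] = x, y
    return draws

sample = gibbs_bivariate_normal(0.8, 20000)[200:]   # drop a short burn-in
```

The empirical correlation of the simulated pairs approximates \(\rho\), confirming that the chain settles on the target joint distribution.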

5.5. VAR Regimes#

Following [Sclove, 1983] and [Hamilton, 1989], suppose that there are multiple VAR regimes \(({\mathbb A}_i, {\mathbb B}_i, {\mathbb D}_i, {\mathbb F}_i)\) for \(i=1,2,...,n\), where indices \(i\) are governed by a Markov process with transition matrix \(\mathbb{P}\). In regime \(i\) we have

(5.23)#\[\begin{split}X_{t+1} = {\mathbb A}_i X_t + {\mathbb B}_i W_{t+1} \\ Y_{t+1} - Y_t = {\mathbb D}_i X_{t} + {\mathbb F}_i W_{t+1},\end{split}\]

where \(\{W_{t+1}\}_{t=0}^\infty\) is an i.i.d. sequence of \(\textrm{Normal} (0,I)\) random vectors conditioned on \(X_0\), and \(\mathbb{F}_i\) is nonsingular.

We can think of \(X_t\) and a regime indicator \(Z_t\) jointly as forming a Markov process. When regime \(i\) is realized, \(Z_t\) equals a coordinate vector with one in the \(i^{th}\) coordinate and zeros at other coordinates. We study a situation in which regime indicator \(Z_t\) is not observed. Let \(Q_t\) denote an \(n\)-dimensional vector of probabilities over the hidden states \(Z_t\) conditioned on \(Y^t\), \(X_0\), and \(Q_0\), where \(Q_0\) is the date zero vector of initial probabilities for \(Z_0\). Equivalently, \(Q_t\) is \(E(Z_t \vert Y^t, X_0, Q_0)\).

The vector of conditional probabilities \(Q_t\) solves a filtering problem. We describe the solution of this problem by representing \((X_t,Q_t)\) as a Markov process via the following four steps.

  1. Find the joint distribution for \((Z_{t+1},Y_{t+1} - Y_{t})\) conditioned on \((Z_t,X_t)\). Conditional distributions of \(Z_{t+1}\) and \(Y_{t+1} - Y_t\) are statistically independent by assumption. Conditioned on \(Z_t\), \(X_t\) conveys no information about \(Z_{t+1}\), and thus the conditional density of \(Z_{t+1}\) is given by entries of \(\mathbb{P}'Z_t\). Conditioned on \(Z_t = i\), \(Y_{t+1}-Y_t\) is normal with mean \({\mathbb D}_i X_t\) and covariance matrix \({\mathbb F}_i({\mathbb F}_i)'\). Let \(\psi_i(y^*,X_t)\) be the normal density function for \(Y_{t+1}-Y_t\) conditioned on \(X_t\) when \(Z_t\) is in regime \(i\). We can write the joint density conditioned on \((Z_t, X_t)\) as:

\[\begin{split}\begin{matrix} \underbrace{(\mathbb{P}'Z_t)} & \times & \underbrace{(Z_t)' \text{vec} \left\{ \psi_i(y^*,X_t) \right\}} \\ \uparrow & & \uparrow \\ Z_{t+1} \ \ \text{density} & & Y_{t+1} -Y_t \ \ \text{density} \end{matrix}\end{split}\]

where \(\text{vec}(r_i)\) is a column vector with \(r_i\) in the \(i^{th}\) entry. We have imposed conditional independence by forming a joint conditional distribution as a product of two conditional densities, one for \(Z_{t+1}\) and one for \(Y_{t+1}-Y_t\).

  1. Find the joint distribution of \((Z_{t+1}, Y_{t+1}-Y_t)\) conditioned on \((X_t,Q_t)\). Since \(Z_t\) is not observed, we form the appropriate average of the above, conditioned on \(Y^t\), \(X_0\), and \(Q_0\):

\[\mathbb{P}' \text{diag}\{Q_t\} \text{vec} \left\{ \psi_i(y^*,X_t) \right\}\]

where \(\text{diag}\{Q_t\}\) is a diagonal matrix with components of \(Q_t\) on the diagonal. Thus, \(Q_t\) encodes all pertinent information about the time \(t\) regime \(Z_t\) that is contained in \(Y^t\), \(X_0\) and \(Q_0\). Notice that conditional on \((X_t,Q_t)\), random vectors \(Y_{t+1}-Y_t\) and \(Z_{t+1}\) are not statistically independent.

  1. Find the distribution of \(Y_{t+1}-Y_t\) conditioned on \((X_t,Q_t)\). Summing the above over hidden states gives

\[(\mathbf{1}_n)'\mathbb{P}' \text{diag}\{Q_t\} \text{vec} \left\{ \psi_i(y^*,X_t) \right\} = Q_t\cdot \text{vec} \left\{ \psi_i(y^*,X_t) \right\}.\]

Thus, the distribution for \(Y_{t+1}-Y_t\) conditioned on \((X_t, Q_t)\) is a mixture of normals in which, with probability given by the \(i^{th}\) entry of \(Q_t\), \(Y_{t+1}-Y_t\) is normal with mean \({\mathbb D}_i X_t\) and covariance matrix \({\mathbb F}_i({\mathbb F}_i)'\). Similarly, the conditional distribution of \(X_{t+1}\) is a mixture of normals.

  1. Obtain \(Q_{t+1}\) by dividing the joint density for \((Y_{t+1}-Y_t,Z_{t+1})\) conditioned on \((X_t,Q_t)\) by the marginal density for \(Y_{t+1}-Y_t\) conditioned on \((X_t, Q_t)\). Division gives the density for \(Z_{t+1}\) conditioned on \((Y_{t+1}-Y_t, X_t,Q_t)\), which in this case is just a vector \(Q_{t+1}\) of conditional probabilities. Thus, we are led to the recursion

(5.24)#\[Q_{t+1} = \left( \frac{1}{Q_t\cdot \text{vec} \left\{ \psi_i(Y_{t+1}, X_t) \right\}}\right) \mathbb{P}' \text{diag}(Q_t) \text{vec} \left\{ \psi_i(Y_{t+1}, X_t) \right\}.\]

Taken together, steps (3) and (4) provide the one-step transition equation for the Markov state \((X_{t+1}, Q_{t+1})\). As indicated in step (3), \(Y_{t+1}\) is a mixture of normally distributed random variables. As argued in step (4), the vector \(Q_{t+1}\) is an exact function of \(Y_{t+1}\), \(Q_t\), and \(X_t\) given by formula (5.24).
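A sketch of one step of recursion (5.24), with each \(\psi_i\) computed as a multivariate normal density with mean \({\mathbb D}_i X_t\) and covariance \({\mathbb F}_i {\mathbb F}_i'\) (all names are ours, chosen for illustration):

```python
import numpy as np

def regime_filter_step(q, P, dy, X, D_list, F_list):
    """Update regime probabilities Q_t after observing dy = Y_{t+1} - Y_t."""
    m = len(dy)
    ell = np.empty(len(D_list))
    for i, (D, F) in enumerate(zip(D_list, F_list)):
        e = dy - D @ X                      # forecast error in regime i
        S = F @ F.T                         # regime-i covariance
        ell[i] = np.exp(-0.5 * e @ np.linalg.solve(S, e)) \
                 / np.sqrt((2.0 * np.pi) ** m * np.linalg.det(S))
    numer = P.T @ (q * ell)                 # P' diag(Q_t) vec{psi_i}
    return numer / (q @ ell)

# two persistent regimes with means 0 and 1; an observation of 1 favors regime 2
q_next = regime_filter_step(np.array([0.5, 0.5]), np.eye(2),
                            np.array([1.0]), np.array([1.0]),
                            [np.array([[0.0]]), np.array([[1.0]])],
                            [np.array([[1.0]]), np.array([[1.0]])])
```

An observation near the regime-2 mean shifts conditional probability toward regime 2, as the filter prescribes.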

Example 5.2

Let \(X_t = [\log{(B_t/C_t)}, \log{(D_t/C_t)}]\) and \(Y_t=\log C_t\), where \(C_t\), \(B_t\) and \(D_t\) are consumption, business income and dividends, respectively. We estimate (5.23) using Gibbs sampling with 2 regimes.

../_images/log_differences.png

Fig. 5.4 Log business income less consumption \(\log{(B_t/C_t)}\) (purple) and log dividends less consumption \(\log{(D_t/C_t)}\) (green).#

../_images/three_panel_mean.png

Fig. 5.5 The log growth rates of consumption, business income and dividends are shown as solid lines in each panel. The stationary means (\(1.276\%\) and \( 0.667\%\)) are shown as dotted lines in black (high) and red (low). The median smoothed probability of being in the low-mean regime is shown in shaded green. The stationary mean is \(1.2480\%\) (the observed stationary means are \(0.8185\%\) for consumption; \(0.7336\%\) for business income and \(1.0921\%\) for dividends). The half-life for the high mean state is 46 quarters; for the low mean state it is 2 quarters. The correlation matrix for \(\Delta \log C_t\), \(\Delta \log B_t\) and \(\Delta \log D_t\) is#

\[\begin{split} \begin{pmatrix} 1.000 & 0.158 & -0.173 \\[6pt] & 1.000 & 0.013 \\[6pt] & & 1.000 \end{pmatrix} \end{split}\]
../_images/three_panel_vol_ma_2Q.png

Fig. 5.6 The residual squares of log growth rates of consumption, business income and dividends are shown as solid lines in each panel. Five-quarter window moving average is applied. The median smoothed probability of being in the high-volatility regime is shown in shaded green. The half-life for the low volatility state is 26 quarters; for the high volatility state it is 2 quarters.#

../_images/mean_vol_smooth_probs.png

Fig. 5.7 Smoothed probabilities of being in the high-mean and low-volatility regimes.#

../_images/posteriors.png

Fig. 5.8 The posterior distributions for the unconditional means and volatilities for \(\Delta \log C_t\) across the two states, based on 20,000 posterior draws with the first 5,000 discarded as burn-in.#