GMM Estimation

13. GMM Estimation

Authors: Lars Peter Hansen (University of Chicago) and Thomas Sargent (NYU)

“How do you eat an elephant? One bite at a time.”

– African proverb

\(\newcommand{\eqdef}{\stackrel{\text{def}}{=}}\)

13.1. Introduction

Generalized Method of Moments (GMM) estimation studies a family of estimators constructed from partially specified or misspecified models. Since direct application of likelihood methods sometimes can be computationally challenging, GMM methods may provide tractable alternatives. They take a different starting point than a parameterized likelihood function, but they are structured to allow for the simultaneous study of a family of estimators. With this approach, we can make relative accuracy comparisons among members of the entire family and establish an efficiency bound for the class of estimators.

This chapter takes statistical consistency as given. Supporting arguments for this chapter can be obtained with direct extensions of the Law of Large Numbers as described in Chapter 1. Such extensions often entail Laws of Large Numbers applied to so-called random functions (processes whose values are functions of the parameter vector of interest) instead of to a random vector.

Throughout this chapter, we will condition on invariant events even though we will suppress this dependence when we write expectations. Given the partially specified or misspecified nature of the model, this conditioning covers much more than a simple parameter vector.[1] The inferences we draw will be conditioned on a parameter vector as is common in so-called classical methods. On the other hand, given the incomplete or partial nature of our starting point, we will not provide a corresponding Bayesian, robust Bayesian, or approximate Bayesian approach to inference.

13.2. Formulation

We study a family of GMM estimators of an unknown parameter vector \(\beta\) constructed from theoretical restrictions on conditional or unconditional moments of functions \(\phi\) that depend on \(\beta\) and on a random vector \(X_t\) that is observable to an econometrician.

As a starting point, we consider a class of restrictions large enough to include examples of both conditional and unconditional moment restrictions. Members of this class take the form

\[E \left[ {A_t}' \phi(X_t, b) \right] = 0 \textrm{ if and only if } b = \beta \tag{13.1}\]

for all sequences of selection matrices \(A \in \mathcal A\) where \(A = \{A_t : t \ge 1\} \) and where

the vector of functions \(\phi\) is \(r\) dimensional.
the unknown parameter vector \(\beta\) is \(k\) dimensional and in a parameter space \({\mathcal P}\).
\(A_t\) denotes a time \(t\) selection matrix for a subset of the valid moment restrictions that is used to construct a particular statistical estimator \(b\) of \(\beta\).
\({\mathcal A}\) is a collection of sequences of (possibly random) selection matrices that characterize valid moment restrictions.
the mathematical expectation is taken with respect to a statistical model that generates the \(\{X_t : t \ge 1 \}\) process.

A sample counterpart of the population moment conditions (13.1) is

\[\frac 1 N \sum_{t=1}^N {A_t}' \phi(X_t, b_N) = 0 . \tag{13.2}\]

Applying a Law of Large Numbers to (13.2) motivates a generalized method of moments estimator \(b_N\) of the \(k \times 1\) vector \(\beta\).

Different sequences of selection matrices \(\{A_t : t \ge 1 \}\) and \(\{\widetilde A_t: t \ge 1\}\) generally give rise to different properties for the estimator \(b_N\). An exception is when

\[{\widetilde A}_t = A_t{\mathbb L} \]

for some \(k \times k\) nonsingular matrix \({\mathbb L}\). In this latter case, the same moment conditions are used for estimation and hence will give rise to the same GMM estimator \(b_N\).
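To make the sample moment conditions (13.2) concrete, here is a minimal sketch under assumptions that are ours, not the chapter's: a linear instrumental-variables specification \(\phi(X_t, b) = Z_t (y_t - x_t' b)\), simulated data, and a constant selection matrix. The names `gmm_linear_iv`, `Pi`, and the parameter values are hypothetical. Because \(\phi\) is linear in \(b\), (13.2) can be solved in closed form, and the sketch checks numerically that replacing \(A\) with \(A {\mathbb L}\) leaves \(b_N\) unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, k = 5_000, 4, 2
beta = np.array([1.0, -0.5])            # "true" parameter, assumed for the simulation

# Simulated data: r instruments Z_t, k endogenous regressors x_t, disturbance e_t.
Z = rng.normal(size=(N, r))
Pi = np.array([[1.0, 0.0], [0.5, 1.0], [0.0, 0.5], [-0.5, 0.5]])
e = rng.normal(size=N)
x = Z @ Pi + rng.normal(size=(N, k)) + 0.5 * e[:, None]
y = x @ beta + e


def gmm_linear_iv(A, Z, x, y):
    """Solve (1/N) sum_t A' Z_t (y_t - x_t' b) = 0 for b (A is r x k, constant)."""
    n = len(y)
    S_zx = Z.T @ x / n                  # sample mean of Z_t x_t'  (r x k)
    S_zy = Z.T @ y / n                  # sample mean of Z_t y_t   (r,)
    return np.linalg.solve(A.T @ S_zx, A.T @ S_zy)


A = Pi + 0.1 * rng.normal(size=(r, k))  # one ad hoc selection matrix
L = np.array([[2.0, 1.0], [0.5, 3.0]])  # nonsingular k x k matrix
b_A = gmm_linear_iv(A, Z, x, y)
b_AL = gmm_linear_iv(A @ L, Z, x, y)

# Sample moment conditions evaluated at the estimate (numerically zero).
moments_at_b = A.T @ (Z.T @ (y - x @ b_A)) / N
print(b_A, b_AL, moments_at_b)
```

The two estimates agree up to rounding error because \(A\) and \(A{\mathbb L}\) select the same linear combinations of moment conditions.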

We study limiting properties of estimator \(b_N\) conditioned on a statistical model. In many settings, the parameter vector \(\beta\) only incompletely characterizes the statistical model. In such settings, we are led in effect to implement a version of what is known as semi-parametric estimation: while \(\beta\) is the finite-dimensional parameter vector that we want to estimate, we acknowledge that, in addition to \(\beta\), a potentially infinite-dimensional nuisance parameter vector pins down the complete statistical model on which we condition when we apply the law of large numbers and other limit theorems.

Example 13.1

Unconditional moment restrictions

Suppose that

\[E \left[ \phi(X_t, \beta) \right] = 0\]

where \(r \ge k\). Let \(\mathcal{A}_t\) be the set of all \(r \times k\) matrices \(\mathbb{A}\) of constants. Rewrite the restrictions as:

\[{\mathbb{A}}' E \left[ \phi(X_t, \beta) \right] = 0\]

for all \(r \times k\) matrices \(\mathbb{A}\). [Sargan, 1958] and [Hansen, 1982] assumed moment restrictions like these. Two rather different applications follow.

Nonlinear instrumental variables

Let

\[\phi(X_t,\beta) = Z_t \eta(Y_t, \beta )\]

where \(Z_t\) is an \(r\) dimensional vector of instrumental variables and \(\eta(Y_t, \beta)\) is a scalar disturbance term in an equation of interest. The vector of instrumental variables is presumed to be orthogonal to \(\eta(Y_t, \beta )\), which gives rise to a vector of moment conditions. When there are more instrumental variables than parameters, we are led to study a family of estimators rather than a single one. This has a direct extension to the case where there are multiple equations and \(\eta(Y_t, \beta)\) is a random vector.

Moment matching

Moment matching is an approach that has or at least should have close ties to calibration as is done often in economic dynamics. See [Hansen and Heckman, 1996] for a discussion of the merits of this link.

Suppose that

\[\phi(X_t, b) = \psi(X_t) - \overline{\psi}(b)\]

where

\[E\left[\psi(X_t) \right] = \overline{\psi}(\beta).\]

The expectation \(E \left[\psi(X_t) \right]\) defines the moments to be matched, and \(\overline{\psi}(\beta)\) gives the population values of those moments under a statistical model with parameter vector \(\beta\). Often that statistical model is a “structural” economic model with nonlinearities and other complications that, for a given \(\beta\), make it challenging to compute the moments \(\overline{\psi}\) analytically. To proceed, the proposal is to approximate those moments for a given \(b\) by computing a sample mean from a long simulation of the statistical model at parameter vector \(b\). By running simulations and computing associated sample means for many alternative \(b\) vectors, we can assemble an approximation to the function \(\overline{\psi}(b)\). [Lee and Ingram, 1991] and [Duffie and Singleton, 1993] used versions of this approach.

Notice that, in contrast to some other applications of GMM estimation that allow the appearance of unknown nuisance parameters in the statistical model assumed to generate the data, this approach assumes that, given \(b\), the model completely determines a sample path that we can at least simulate. This method is used in macroeconomics, corporate finance, and asset pricing, sometimes formally and sometimes informally. While the model may be misspecified along some dimensions, with this method the target moments are presumed to be correctly specified. Later in this chapter, we will describe a way to assess potential model misspecification in this setup.
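The simulation step can be sketched as follows. Everything here is an illustrative assumption of ours: the model is a Gaussian AR(1) with unit innovation variance, \(\psi(X_t) = (X_t^2, X_t X_{t-1})\), the two moment gaps are weighted equally (one ad hoc member of the family of estimators), and the minimization is a simple grid search. The helper names `simulate_ar1` and `sample_psi_mean` are hypothetical.

```python
import numpy as np


def simulate_ar1(rho, shocks):
    """Simulate X_t = rho * X_{t-1} + shock_t, starting from X_0 = 0."""
    x = np.empty(len(shocks))
    prev = 0.0
    for t, eps in enumerate(shocks):
        prev = rho * prev + eps
        x[t] = prev
    return x


def sample_psi_mean(x):
    """Sample mean of psi(X_t) = (X_t^2, X_t X_{t-1})."""
    return np.array([np.mean(x[1:] ** 2), np.mean(x[1:] * x[:-1])])


rng = np.random.default_rng(1)
beta = 0.6                                   # value used to generate the "data"
data = simulate_ar1(beta, rng.normal(size=4_000))
target = sample_psi_mean(data)               # sample moments to be matched

sim_shocks = rng.normal(size=20_000)         # held fixed across candidate b values
grid = np.linspace(0.30, 0.90, 61)
gap = [np.sum((target - sample_psi_mean(simulate_ar1(b, sim_shocks))) ** 2)
       for b in grid]
b_hat = float(grid[int(np.argmin(gap))])
print(b_hat)
```

Reusing the same simulated shocks for every candidate \(b\) keeps the approximation to \(\overline{\psi}(b)\) smooth in \(b\), which matters when a numerical optimizer replaces the grid.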

Example 13.2

Conditional moment restrictions

Assume the conditional moment restrictions

\[E\left[\phi(X_{t}, \beta) \mid {\mathfrak A}_{t-\ell} \right] = 0\]

for a particular \(\ell \ge 1\) and \(Y_t = X_t\). Let \(\mathcal A_{t}\) be the set of all \(r \times k\) matrices, \(A_t\), of bounded random variables that are \({\mathfrak A}_{t-\ell}\) measurable. Then the preceding conditional moment restrictions are mathematically equivalent to the unconditional moment restrictions

\[E\left[{A_t}'\phi(Y_{t}, \beta) \right] = 0\]

for all random matrix processes \(\{A_t : t \ge 0\} \in {\mathcal A}\). This formulation is due to [Hansen, 1985]. Also see a closely related analysis of [Chamberlain, 1987] with a formal link to semiparametric efficiency bounds.

The following example gives a substantive illustration.

Asset pricing

A common way to construct conditional moment conditions in macro-asset pricing is to construct an \(\ell\) period scalar “stochastic discount factor” as a function of data and an unknown parameter vector. See [Hansen and Singleton, 1982] and [Hansen and Richard, 1987] for initial applications of this formulation and, for more extensive discussions, the books [Cochrane, 2001] and [Singleton, 2006]. A stochastic discount factor discounts the future in a state-dependent way to capture compensations for exposures to uncertainty. Denote the \(\ell\)-period stochastic discount factor by \(\psi(X_t,\beta).\) This discount factor may be used to represent asset prices:

\[E\left[ R_t \psi(X_t, \beta) \mid {\mathfrak A}_{t-\ell} \right] = {\bf 1}_r\]

where \(R_t\) is an \(r\)-dimensional vector of \(\ell\)-period gross returns and \({\bf 1}_r\) is an \(r\) dimensional vector of ones. Let

\[\phi(X_t, b) = R_t \psi(X_t, b) - {\bf 1}_r.\]

Collections \(\mathcal A\) of selection processes for both of these examples satisfy the following “linearity” restriction.

Restriction 13.1. If \(A^1\) and \(A^2\) are both in \(\mathcal A\) and \(\mathbb{L}_1\) and \(\mathbb{L}_2\) are \(k \times k\) matrices of real numbers, then \(A^1 \mathbb{L}_1 + A^2\mathbb{L}_2\) is in \(\mathcal A\).

A common practice is to use the approach provided in Example 13.2 while substantially restricting the set of moment conditions used for parameter estimation. One possibility is to create unconditional moment restrictions like those in Example 13.1 from a collection of conditional moment restrictions, and thereby reduce the class of GMM estimators under consideration. For instance, let \(A_t^1\) and \(A_t^2\) be two ad hoc choices of selection matrices where no linear combination of columns of \(A_t^2\) duplicates those of \(A_t^1.\) Form

\[\begin{split}\phi^+(X_t, b) = \begin{bmatrix} {A_t^1}' \\ {A_t^2}' \end{bmatrix} \phi(X_t, b)\end{split}\]

where \(X_t\) now includes variables used to construct \(A_t^1\) and \(A_t^2\). We use \(2 k \times k\) selection matrices \(\mathbb A\) to form moment conditions

\[{\mathbb A}'E\left[ \phi^+(X_t, b)\right] = 0,\]

and study an associated family of GMM estimators. This strategy reduces an infinite number of moment conditions to a finite number. There are extensions of this approach. For instance, we could use more than two \(A_t^j\)’s to construct \(\phi^+\), or we could just augment \(A_t^1\) with a subset of columns of \(A_t^2\) when forming \(\phi^+\).

13.3. Central limit approximation

The process

\[\left\{ \sum_{t=1}^N{A_t}' \phi(X_t, \beta) : N \ge 1\right\}\]

can be verified to have stationary and ergodic increments conditioned on the statistical model. So there exists a Proposition 3.1 decomposition of the process. Provided that

\[E\left[ {A_t}' \phi(X_t, \beta)\right] =0 \]

under the statistical model that generates the data, the trend term in the decomposition of Proposition 3.1 is zero, implying that the martingale dominates the behavior of sample averages for large \(N\). In particular, Proposition 3.2 gives a central limit approximation for

\[\frac 1 {\sqrt{N}} \sum_{t=1}^N{A_t}' \phi(X_t, \beta) \]

Let \(A = \{ A_t : t \ge 0 \}\) and suppose that

\[\sum_{j=0}^\infty E\left[ {A_{t+j}}' \phi(X_{t+j}, \beta) \mid {\mathfrak A}_t \right] \]

converges in mean square. Define the one-step-ahead forecast error:

\[G_t(A) = \sum_{j=0}^\infty E \left[ {A_{t+j}}' \phi(X_{t+j}, \beta) \mid {\mathfrak A}_t \right] - \sum_{j=0}^\infty E \left[ {A_{t+j}}' \phi(X_{t+j}, \beta) \mid {\mathfrak A}_{t-1} \right] \]

Paralleling the construction of the martingale increment in Proposition 3.2,

\[\frac 1 {\sqrt{N}} \sum_{t=1}^N{A_t}' \phi(X_t, \beta) \approx {\frac 1 {\sqrt{N} }} \sum_{t=1}^N G_t(A) \]

where by the approximation sign \(\approx\) we intend to assert that the difference between the right side and left side converges in mean square to zero as \(N \rightarrow \infty\). Consequently, the covariance matrix in the central limit approximation is

\[E \left[ G_t(A) G_t(A)' \right].\]

Recall Restriction 13.1. For the preceding construction of the martingale increment, it is straightforward to verify that

\[G_t( A^1 {\mathbb L}_1 + A^2{\mathbb L}_2) = ({\mathbb L}_1)' G_t(A^1) + ({\mathbb L}_2)'G_t(A^2)\]

This identity follows from the linearity of conditional expectations.

Example 13.1 (cont.)

Consider again Example 13.1 in which \(A_t = \mathbb{A}\) for all \(t\ge 0\) and

\[G_t(A) = \mathbb{A}' F_t\]

where

\[F_t = \sum_{j=0}^\infty E \left[\phi(X_{t+j}, \beta) \mid \mathfrak{A}_t \right] -\sum_{j=0}^\infty E \left[ \phi(X_{t+j}, \beta) \mid \mathfrak{A}_{t-1} \right]. \]

Define the covariance matrix

\[\mathbb{V} = E \left( F_t F_t ' \right)\]

and note that

\[E\left[ G_t(A)G_t(A)' \right] = \mathbb{A}' \mathbb{V} \mathbb{A} .\]

Example 13.2 (cont.)

In Example 13.2

\[E\left[\phi(X_{t}, \beta) \mid {\mathfrak A}_{t-\ell} \right] = 0\]

and hence

\[E\left[{A_t}' \phi(X_{t}, \beta) \mid {\mathfrak A}_{t-\ell} \right] = 0\]

whenever entries of \(A_t\) are restricted to be \({\mathfrak A}_{t-\ell}\) measurable.
Consequently

\[E\left[{A_{t+j}}' \phi(X_{t+j}, \beta) \mid {\mathfrak A}_{t} \right] = 0\]

for \(j \ge \ell\) so that the infinite sums used to construct \(G_t(A)\) simplify to finite sums.

While martingale approximation provides a way to establish a central limit approximation, especially for partially specified models, a formal construction of \(G_t(A)\) is challenging in practice. There is another limiting representation commonly appealed to for implementation:

\[E \left[G_t(A)G_t(A)'\right] = \lim_{N \rightarrow \infty} \left( \frac 1 N \right) E\left( \left[ \sum_{t=1}^N {A_{t}}' \phi(X_{t}, \beta)\right]\left[ \sum_{t=1}^N {A_{t}}' \phi(X_{t}, \beta)\right]' \right) \]

Importantly, this computation is the covariance matrix of a scaled (by \(1/\sqrt{N}\)) partial sum and not the covariance matrix of the separate contributions to the sum. Example 13.2 provides further simplification as many of the expected cross terms are zero.
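A small worked case may help separate the two covariance notions. Assume, purely for illustration, a scalar moment function \(\phi_t = e_t + \theta e_{t-1}\) with i.i.d. unit-variance \(e_t\) and a constant selection \(A_t = 1\). Then the martingale increment is \(G_t = (1+\theta) e_t\), so \(E[G_t^2] = (1+\theta)^2\), while the variance of a single term is \(1 + \theta^2\). The sketch computes the scaled partial-sum variance exactly from the autocovariances; the function name is hypothetical.

```python
import numpy as np

theta = 0.5
gamma0 = 1.0 + theta**2      # variance of phi_t
gamma1 = theta               # first-order autocovariance; higher orders are zero


def scaled_partial_sum_variance(N):
    """(1/N) * E[(sum_{t=1}^N phi_t)^2], computed from the autocovariances."""
    return gamma0 + 2.0 * (N - 1) / N * gamma1


long_run = (1.0 + theta) ** 2          # E[G_t^2] with G_t = (1 + theta) e_t
for N in (10, 100, 10_000):
    print(N, scaled_partial_sum_variance(N), long_run, gamma0)
```

As \(N\) grows, the scaled partial-sum variance approaches \((1+\theta)^2\), not the variance \(1+\theta^2\) of an individual contribution.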

Remark 13.1

When applications of GMM methods call for general (but weak) forms of temporal dependence, the reliable estimation of the covariance matrix \({\mathbb V}\) needed for central limit approximation can be very challenging. Various researchers have proposed methods that come from what is called “spectral density” estimation. Within this latter literature, the matrix \({\mathbb V}\) is proportional to the spectral density matrix at frequency zero for the process \(\{ {A_t}'\phi(X_t, \beta) : t \ge 0\}.\) One common approach, proposed by [Bartlett, 1950] and popularized in economics by [Newey and West, 1987], starts from a finite number of sample autocovariances and then downweights them.[2] The sample autocovariance of order \(j\) is

\[C_N(j) \eqdef \frac 1 N \sum_{t=j+1}^N {A_{t}}' \phi(X_{t}, \beta) \phi(X_{t-j}, \beta)' {A_{t-j}}\]

For \({\underline N} \ll N,\) form

\[C_N(0) + \frac {{\underline N}-1}{{\underline N}} \left[ C_N(1) + C_N(1)' \right] + \cdots + \frac 1 {{\underline N}} \left[ C_{N}({\underline N} -1) + C_N({\underline N} -1)' \right]. \]

Large sample justifications entail arguments in which \({\underline N}\) and \(N\) both go to infinity, but \({\underline N}\) at a much slower rate. Such arguments are of limited value for how to pick \({\underline N}\) in practice. Exploring sensitivity in the choice of \({\underline N}\) is prudent practice. Spectral methods can be notoriously unreliable for high dimensions (large values of \(r\)).[3]
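A minimal sketch of the downweighted estimator just described follows. Here `u` is assumed to be an \(N \times m\) array whose rows stand in for \({A_t}'\phi(X_t, b)\) evaluated at some preliminary estimate (in practice \(\beta\) is unknown), and `bandwidth` plays the role of \({\underline N}\). The function names and the simulated bivariate MA(1) process are our own illustrative choices.

```python
import numpy as np


def sample_autocov(u, j):
    """C_N(j) = (1/N) sum_{t=j+1}^N u_t u_{t-j}' for an (N x m) array u."""
    N = u.shape[0]
    return u[j:].T @ u[: N - j] / N


def bartlett_cov(u, bandwidth):
    """C_N(0) + sum_{j=1}^{bandwidth-1} (1 - j/bandwidth) [C_N(j) + C_N(j)']."""
    V = sample_autocov(u, 0)
    for j in range(1, bandwidth):
        C = sample_autocov(u, j)
        V = V + (bandwidth - j) / bandwidth * (C + C.T)
    return V


# Illustration with a simulated bivariate MA(1): u_t = e_t + 0.5 e_{t-1}.
# Its population long-run covariance matrix is 2.25 times the identity.
rng = np.random.default_rng(2)
e = rng.normal(size=(20_001, 2))
u = e[1:] + 0.5 * e[:-1]
for bandwidth in (5, 10, 20, 40):
    print(bandwidth, np.round(bartlett_cov(u, bandwidth), 3))
```

Looping over several bandwidths is a cheap way to carry out the sensitivity check recommended above.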

Consider next the case in which the restriction in Example 13.2 is satisfied for some \(\ell\). Then one can justify using

\[C_N(0) + \left[ C_N(1) + C_N(1)' \right] + \cdots + \left[ C_{N}(\ell-1) + C_N(\ell -1)' \right].\]

This estimator of the limiting covariance matrix may turn out not to be positive definite. As an alternative, [Eichenbaum et al., 1988] build on a proposal due to [Durbin, 1959] and propose an improved estimator that is guaranteed to be positive definite.

In the case of moment matching, assuming a correct specification, one could use very large sample simulations conditioned on each of the hypothetical parameters to approximate the construction of \({\mathbb V}(b)\). Remarkably, this is not often done in economics, even though it could improve the quality of the inferences.

13.4. Mean value approximation

Write

\[\begin{split}\frac{1}{\sqrt{N}} \sum_{t=1}^NA_t' \phi(X_t, b_N) & \approx \frac{1}{\sqrt{N}} \sum_{t=1}^NA_t' \phi(X_t, \beta) + \frac{1}{N} \sum_{t=1}^NA_t' \left[\frac{\partial \phi}{\partial b'} (X_t, \beta)\right] \sqrt{N} (b_N - \beta) \\ & \approx \frac{1}{\sqrt{N}} \sum_{t=1}^NA_t' \phi(X_t, \beta) + \nabla(A)'\sqrt{N} (b_N - \beta) \end{split}\]

where

\[\nabla(A) \overset{\text{def}}{=} E\left(\left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right]' A_t \right).\]

Since

\[\frac{1}{\sqrt{N}} \sum_{t=1}^NA_t' \phi(X_t, b_N) \approx 0, \]

\[\nabla(A)' \sqrt{N} (b_N - \beta) \approx - \frac{1}{\sqrt{N}} \sum_{t=1}^NA_t' \phi(X_t, \beta) .\]

So long as \(\nabla(A)\) is nonsingular,

\[\sqrt{N} (b_N - \beta) \approx - \left[\nabla(A)'\right]^{-1} \frac{1}{\sqrt{N}} \sum_{t=1}^NA_t' \phi(X_t, \beta) .\]

This approximation provides an essential input into finding an “efficiency bound” for GMM estimation. Notice that the covariance matrix for such an approximation is:

\[\textrm{cov}(A) = \left[\nabla(A)'\right]^{-1} E\left[ G_t(A) G_t(A)' \right] \left[\nabla(A) \right]^{-1}\]

We want to know how small we can make this matrix by choosing a selection process. The answer to this question gives what we call a GMM efficiency bound. We show how to characterize a greatest lower bound in a later section.

Example 13.1 (cont.)

Consider again Example 13.1. In this case \(A_t = {\mathbb A}\) for all \(t \ge 0\) and

\[\nabla(A) = {\mathbb D}' {\mathbb A} \]

where

\[{\mathbb D} \overset{\text{def}}{=} E\left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right] \]

and

\[\textrm{ cov}({\mathbb A} ) = \left({\mathbb A}' {\mathbb D}\right)^{-1} {\mathbb A}' {\mathbb V} {\mathbb A} \left({\mathbb D}' {\mathbb A}\right)^{-1}\]
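The covariance formula for Example 13.1 translates directly into code. The sketch below uses hypothetical numerical values for \({\mathbb D}\), \({\mathbb V}\), and \({\mathbb A}\), and a hypothetical sample size; it also confirms that post-multiplying \({\mathbb A}\) by a nonsingular matrix leaves the covariance unchanged, as it must since the estimator is unchanged.

```python
import numpy as np


def gmm_cov(A, D, V):
    """cov(A) = (A'D)^{-1} A'VA (D'A)^{-1} for a constant r x k selection matrix A."""
    AD_inv = np.linalg.inv(A.T @ D)          # (A'D)^{-1}; its transpose is (D'A)^{-1}
    return AD_inv @ (A.T @ V @ A) @ AD_inv.T


# Hypothetical inputs with r = 3 moment conditions and k = 2 parameters.
D = np.array([[1.0, 0.0], [0.5, 1.0], [0.2, -0.3]])
V = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]])
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
L = np.array([[2.0, 1.0], [0.0, 0.5]])       # nonsingular

cov_A = gmm_cov(A, D, V)
cov_AL = gmm_cov(A @ L, D, V)                # same estimator, so same covariance
N = 1_000                                    # hypothetical sample size
std_errors = np.sqrt(np.diag(cov_A) / N)     # approximate standard errors of b_N
print(cov_A, std_errors)
```

Dividing by \(N\) before taking square roots reflects that \(\textrm{cov}({\mathbb A})\) is the covariance of \(\sqrt{N}(b_N - \beta)\), not of \(b_N\) itself.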

13.5. GMM Efficiency Bound

Recall

\[\textrm{cov}(A) = \left[\nabla(A)'\right]^{-1} E\left[ G_t(A) G_t(A)' \right] \left[\nabla (A) \right]^{-1}\]

We seek a greatest lower bound on the covariance matrix on the right.

Suppose that \(\nabla(A)\) is nonsingular and impose that

\[\left[\nabla(A)\right] = {\mathbb I}\]

If not, post multiply \(A\) by a nonsingular matrix \({\mathbb J}\). That leaves the GMM estimator unaltered. Thus, we have

\[\textrm{ cov}(A) = E\left[ G_t(A) G_t(A)' \right] \]

subject to \(\left[\nabla(A)\right] = {\mathbb I}\).

Find an \(A^d\) such that for all \(A \in {\mathcal A}\)

\[\nabla(A) = E\left[ G_t(A^d) G_t(A)' \right] . \tag{13.3}\]

We think of (13.3) as a set of first-order (necessary and sufficient) conditions for our constrained optimization. Rather than derive them as such, we essentially use “guess and verify” in what follows. Form

\[A_t^* \eqdef A^d_t \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} . \]

Then, by the linearity of \(G_t\) in the selection process,

\[G_t(A^*) = \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} G_t(A^d)\]

and

\[E \left[ G_t(A^*) G_t(A)'\right] = \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1}\]

provided that \(\left[\nabla(A)\right] = {\mathbb I}.\)

Therefore,

\[0 \leq E \left( \left[ G_t(A) - G_t(A^*) \right] \left[ G_t(A) - G_t(A^*) \right]' \right) = \textrm{cov}(A) - \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} .\]

Result 13.1. Given a solution to equation (13.3),

\[\inf_{A \in \mathcal A} \textrm{cov}(A) = \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} . \tag{13.4}\]

Remark 13.2

In Result 13.1, we might be tempted to think that \(G_t(A^d)\) plays the same role that the “score vector” increment does in maximum likelihood estimation. But because there is a possibly infinite dimensional vector of nuisance parameters here, a better analogy is that \(G_t(A^d)\) acts much like the residual vector in a regression of parameters of interest score increments on nuisance parameter score increments along the lines suggested in Section 6.6: Score processes. By undertaking to infer the parameter vector \(\beta\) from conditional or unconditional moment restrictions, we have purposefully pushed all nuisance parameters into the background.

Remark 13.3

The representation

\[E\left(\left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right]' A_t \right) = \nabla(A) = E\left[ G_t(A^d) G_t(A)' \right] \]

used to compute the efficiency bound is an application of the Riesz Representation Theorem. To understand this, introduce the \(k\)-dimensional coordinate vectors \({\sf u}_i\) for \(i=1,2,...,k\) and consider:

\[({\sf u}_i)'\nabla(A) {\sf u}_j = E\left(\left[\frac {\partial \phi}{\partial b_i} (X_t, \beta)\right] \cdot A_t {\sf u}_j\right) . \tag{13.5}\]

Note that

The integer \(i\) selects the coordinate of \(b\) with respect to which we are differentiating.
If \(A^1\) and \(A^2\) are both in \({\mathcal A}\), then so are linear combinations. Therefore \(({\sf u}_i)'\nabla(A) {\sf u}_j\) is a linear functional defined on a linear space of random variables of the form \(\phi(X_t, \beta)' A_t {\sf u}_j\) for a given \(i\).
The martingale approximation for the scalar process with stationary increments

\[\left\{ \sum_{t=1}^N \phi(X_t, \beta)' A_t {\sf u}_j : N \ge 1 \right\}\]

has martingale increment \(G_t(A)' {\sf u}_j\).
The Riesz Representation Theorem asserts that the linear functional \(({\sf u}_i)'\nabla(A) {\sf u}_j\) can be represented as an inner product

\[({\sf u}_i)'\nabla(A) {\sf u}_j = E\left[ \rho_t G_t(A)' {\sf u}_j \right]\]

where the scalar random variable \(\rho_t\) is in the mean square closure of

\[\left\{ G_t(A)' {\sf u}_j : A \in {\mathcal A} \right\}.\]
We can represent \(\rho_t\) as

\[\rho_t = G_t(A^d)'{\sf u}_i\]

for some selection process \(A^d \in {\mathcal A}\) or more generally as a limit point of a sequence of such selection processes.

The preceding construction pins down column \(i\) of \(A^d\). Repeating an analogous construction for each \(i = 1,2,...,k\) gives the selection matrix \(A^d\).

The GMM efficiency bound presumed that we could solve equation (13.3). The Riesz Representation Theorem requires that \(\rho_t\) be in a mean square closure of a linear space. Provided that the linear functionals \(({\sf u}_i)'\nabla(A) {\sf u}_j\) are mean square continuous, the efficiency bound can be represented in terms of the limit point of a sequence of GMM estimators associated with alternative selection processes even when the limit points are not attained.

Example 13.1 (cont.)

Consider Example 13.1 in which we assumed that \(A_t = \mathbb{A}\). Then

\[E\left[ G_t(A) G_t(A^d)' \right] = {\mathbb A}' E \left(F_t F_t'\right){\mathbb A}^d = \mathbb{A}' \mathbb{V} \mathbb{A}^d,\]

and the equation of interest is

\[\mathbb{A}'{\mathbb V} \mathbb{A}^d = \mathbb{A}'\mathbb{D} .\]

Since the selection matrix \(\mathbb{A}\) can be chosen freely,

\[\mathbb{A}^d = \mathbb{V}^{-1} \mathbb{D} \]

and the GMM efficiency bound is

\[\left(\mathbb{D}' \mathbb{V}^{-1} \mathbb{D}\right)^{-1} . \]
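The bound for Example 13.1 can be checked numerically. In the sketch below, \({\mathbb D}\) and \({\mathbb V}\) are randomly generated hypothetical inputs; we verify that \({\mathbb A}^d = {\mathbb V}^{-1}{\mathbb D}\) attains \(({\mathbb D}'{\mathbb V}^{-1}{\mathbb D})^{-1}\) and that \(\textrm{cov}({\mathbb A})\) minus the bound is positive semidefinite (up to rounding) for a handful of other selection matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
r, k = 5, 2
D = rng.normal(size=(r, k))                  # hypothetical Jacobian E[d phi / d b']
M = rng.normal(size=(r, r))
V = M @ M.T + r * np.eye(r)                  # hypothetical positive definite V


def gmm_cov(A):
    """cov(A) = (A'D)^{-1} A'VA (D'A)^{-1}."""
    AD_inv = np.linalg.inv(A.T @ D)
    return AD_inv @ (A.T @ V @ A) @ AD_inv.T


A_d = np.linalg.solve(V, D)                  # A^d = V^{-1} D
bound = np.linalg.inv(D.T @ A_d)             # (D' V^{-1} D)^{-1}

# Smallest eigenvalue of cov(A) - bound, relative to the size of cov(A),
# for a handful of randomly drawn selection matrices.
rel_gaps = []
for _ in range(20):
    cov_A = gmm_cov(rng.normal(size=(r, k)))
    gap = cov_A - bound
    gap = (gap + gap.T) / 2                  # symmetrize against rounding error
    rel_gaps.append(np.linalg.eigvalsh(gap).min() / np.linalg.norm(cov_A))
print(min(rel_gaps))
```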

Example 13.2 (cont.)

Consider again Example 13.2 in the special case in which \(\ell = 1\) so that the conditional moment condition of interest is:

\[E \left[ \phi(X_t, \beta) \mid {\mathfrak A}_{t-1} \right] = 0.\]

Let

\[E \left[ \phi(X_t, \beta) \phi(X_t, \beta)' \mid {\mathfrak A}_{t-1} \right] = V_{t-1} ,\]

which we assume to be nonsingular. To compute the efficiency bound, we wish to solve the following equation for \(A_t^d\)

\[E\left( {A_t^d}' V_{t-1} A_t \right) = \nabla(A) = E\left(\left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right]' A_t \right) . \tag{13.6}\]

Given the flexibility in the choice of the random \(A_t\) with entries that are \({\mathfrak A}_{t-1}\) measurable, this equation is equivalent to

\[V_{t-1} A_t^d = E\left( \left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right] \mid {\mathfrak A}_{t-1} \right)\]

where we have taken transposes of the expressions in (13.6). Thus

\[A_t^d = \left(V_{t-1}\right)^{-1} E\left( \left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right] \mid {\mathfrak A}_{t-1} \right) \]

and the efficiency bound is:

\[\left( E\left[ E\left( \left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right]' \mid {\mathfrak A}_{t-1} \right) \left(V_{t-1}\right)^{-1}E\left( \left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right] \mid {\mathfrak A}_{t-1} \right) \right] \right)^{-1}.\]

See [Robinson, 1987], [Newey, 1990], and [Ai and Chen, 2003] for prominent contributions to a literature that proposed and justified econometric methods for attaining this bound.
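To see how conditional heteroskedasticity enters the bound, consider a hypothetical scalar case that is our own illustration: \(\phi(X_t, b) = y_t - x_t b\) with \(x_t\) known at date \(t-1\) and an assumed conditional variance \(V_{t-1} = 0.5 + x_t^2\). The unweighted selection \(A_t = x_t\) has asymptotic variance \(E[x_t^2 V_{t-1}] / (E[x_t^2])^2\), while the efficient selection, proportional to \(x_t / V_{t-1}\), attains \(1 / E[x_t^2 / V_{t-1}]\). The sketch approximates both expectations with sample averages over simulated draws of \(x_t\).

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)                 # predetermined regressor, known at t-1
v = 0.5 + x**2                               # assumed conditional variance V_{t-1}

# Unweighted selection A_t = x_t: cov(A) = E[x^2 v] / (E[x^2])^2.
cov_unweighted = np.mean(x**2 * v) / np.mean(x**2) ** 2

# Efficient selection proportional to x_t / v_t: bound = 1 / E[x^2 / v].
bound = 1.0 / np.mean(x**2 / v)
print(cov_unweighted, bound)
```

By the Cauchy-Schwarz inequality the second number never exceeds the first, and the gap widens as the conditional variance becomes more variable.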

Next consider a conditional moment justification for two-stage least squares. Add the following special restrictions. Suppose that \(r=1\) and that \(V_{t-1} = \mathsf{v} > 0\) where \(\mathsf{v}\) is constant. Further suppose that

\[\phi(X_t, b) = Y_t^1 - Y_t^2 \cdot b\]

Finally, suppose that

\[E\left( Y_t^2 \mid \mathfrak{A}_{t-1} \right) = \Pi Z_{t-1}\]

where \(Z_{t-1}\) has more entries than \(Y_t^2\). Notice that \(\Pi\) can be computed as a least squares regression. Then

\[A_t^d = - \left(\frac{1}{\mathsf{v}}\right) {Z_{t-1}}' \Pi'\]

because \(\frac{\partial \phi}{\partial b'}(X_t, \beta) = -{Y_t^2}'\). The scaling by \(-\frac{1}{\mathsf{v}}\) is inconsequential to the construction of a selection process. The matrix of regression coefficients can be replaced by the finite sample least squares regression coefficients without altering the statistical efficiency.
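The two-stage construction can be sketched on simulated data. The data-generating process below (homoskedastic disturbance, linear conditional expectation, four instruments, two endogenous regressors) and all numerical values are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 5_000
beta = np.array([1.0, -0.5])                 # assumed true parameter
Pi = np.array([[1.0, 0.5, 0.0, -0.5],
               [0.0, 1.0, 0.5, 0.5]])        # E[Y2 | info] = Pi Z

Z = rng.normal(size=(N, 4))                  # instruments Z_{t-1}
U = rng.normal(size=(N, 2))                  # first-stage disturbances
Y2 = Z @ Pi.T + U                            # endogenous regressors
e = 0.8 * U[:, 0] + 0.6 * rng.normal(size=N) # homoskedastic, correlated with U
Y1 = Y2 @ beta + e

# First stage: least squares estimate of Pi, then fitted values Z Pi_hat'.
Pi_hat = np.linalg.lstsq(Z, Y2, rcond=None)[0].T
selection = Z @ Pi_hat.T                     # rows are the selection vectors up to scale

# Second stage: solve sum_t selection_t (Y1_t - Y2_t' b) = 0 for b.
b_2sls = np.linalg.solve(selection.T @ Y2, selection.T @ Y1)
b_ols = np.linalg.solve(Y2.T @ Y2, Y2.T @ Y1)
print(b_2sls, b_ols)
```

Ordinary least squares is included only as a contrast: it uses \(Y_t^2\) itself as the selection process, which violates the measurability requirement when \(Y_t^2\) is correlated with the disturbance.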

To obtain this rationale for two-stage least squares, we had to impose a special structure, one that does not prevail in many important applications. For instance, suppose that \(V_{t-1}\) depends on conditioning information so that a form of conditional heteroskedasticity is present. That dependence shows up in essential ways in how \(A^d_t\) should be constructed. Further, suppose that the expectation \(E\left[ Y_t^2 \mid {\mathfrak A}_{t-1} \right]\) depends nonlinearly on \(Z_{t-1}\). In that case, to attain or to approximate the efficiency bound, a least squares regression should account for potential nonlinearity.

Remark 13.4

To implement the conditional moment version of the GMM efficiency bound requires the estimation of conditional moments for which the model may not provide functional forms. Reliable nonparametric estimation becomes particularly challenging in high dimensions. Suppose instead that a researcher adopts a convenient parametric approximation to these conditional moments. While this will induce a form of misspecification, it will not necessarily undermine either the consistency of the resulting GMM estimator or its asymptotic distribution. The misspecification may only cause a reduction in the statistical efficiency of the estimator as measured by the asymptotic covariance matrix of the resulting GMM estimator.

Example 13.1 (cont.)

Suppose now that \(\ell = 2\) and consider an unconditional moment formulation. Then even if the covariance structure is homoskedastic and conditional expectations are linear, the two-stage least squares approach will no longer be statistically efficient. We illustrate why by mapping into the framework of Example 13.1.

Use as our \(\phi\) function

\[\phi(X_t, b) = Z_{t-2}\left[Y_t^1 - \left(Y_t^2 \right)' b \right]\]

in forming an unconditional moment restriction and constructing the efficient GMM estimator where the entries of \(Z_{t-2}\) are in the date \(t-2\) conditioning information set (\({\mathfrak A}_{t-2}\) measurable). Now form:

\[{\mathbb V} = E \left[ \phi(X_t, \beta) \phi(X_{t}, \beta)'\right] + E \left[ \phi(X_t, \beta) \phi(X_{t+1}, \beta)'\right] + E \left[ \phi(X_t, \beta) \phi(X_{t-1}, \beta)'\right] \]

pertinent to the central limit approximation. Then an efficient selection matrix \({\mathbb A}^d\) is given by:

\[{\mathbb A}^d = {\mathbb V}^{-1} E\left[Z_{t-2} (Y_t^2)'\right] \]

The matrix \({\mathbb A}^d\) is typically not proportional to a vector of regression coefficients of \(Y_t^2\) on \(Z_{t-2}\), as presumed in the justification for two-stage least squares. The temporal dependence removes this connection. Hence, a commonly-used measure of “instrument relevance” obtained by regressing the endogenous \(Y_t^2\) onto \(Z_{t-2}\) is no longer a valid diagnostic.

A special case of this analysis is when \(Y_t^2\) is in the date \(t-2\) conditioning information set. One estimator could be constructed by setting \(Z_{t-2} = Y_t^2\). The standard two-stage least squares estimator now is simply the ordinary least squares estimator. Next expand \(Z_{t-2}\) to be:

\[Z_{t-2} = \begin{bmatrix} Y_t^2 \cr Y_{t-1}^2 \end{bmatrix}.\]

The \({\mathbb A}^d\) selection matrix will typically include \(Y_{t-1}^2\) in the estimation in spite of the fact that this variable is not needed in an initial least squares regression of \(Y_t^2\) onto \(Z_{t-2}\).

Since there is great flexibility in the construction of \(Z_{t-\ell}\), there is further scope for efficiency gains attained by using the conditional moment formulation of the family of GMM estimators. [Hansen and Singleton, 1996] and [West, 2001] proposed specific approaches for constructing and approximating the efficiency bound in Example 13.2 for models that are linear in the variables in explicit time series settings.

13.6. Approximate inference for testing

Using an entirely analogous approach, we derive limit approximations geared to testing any “over-identifying restrictions.” Let \(B = \{ B_t : t \ge 0\}\) be an \(r \times {\tilde k}\) selection process constructed to test whether the following vector of \(\tilde{k}\) means is zero:

\[E \left[{B_t}' \phi(X_t, \beta) \right] = 0.\]

Restriction 13.2. For any \({\tilde k} \times k\) matrix of real numbers \({\mathbb K}\), \( B {\mathbb K} \in {\mathcal A}\).

Thus, we can build selection processes for testing equations from the columns of the process \(B\).

Suppose that

\[\sum_{j=0}^\infty E\left[ {B_{t+j}}' \phi(X_{t+j}, \beta) \mid {\mathfrak A}_t \right] \]

converges in mean square so that we can apply a central limit approximation.
Construct

\[\widetilde {\nabla}(B) \overset{\text{def}}{=} E\left( \left[\frac {\partial \phi}{\partial b'} (X_t, \beta)\right]' B_t \right) .\]

Since Restriction 13.2 is satisfied, notice that

\[\widetilde {\nabla}(B) {\mathbb K} = \nabla (B {\mathbb K} )\]

for all \({\tilde k} \times k\) matrices \({\mathbb K}\) of real numbers.

By imitating an earlier argument

\[\begin{split} \frac 1 {\sqrt{N}} \sum_{t=1}^N {B_t}' \phi(X_t, b_N) & \approx \frac 1 {\sqrt N} \sum_{t=1}^N {B_t}' \phi(X_t, \beta) + \widetilde {\nabla}(B)' \sqrt{N} (b_N - \beta) \\ & \approx \frac 1 {\sqrt N} \sum_{t=1}^N {B_t}' \phi(X_t, \beta) - \widetilde {\nabla}(B)' \left[\nabla(A)'\right]^{-1} \frac 1 {\sqrt N} \sum_{t=1}^N {A_t}' \phi(X_t, \beta) \\ & \approx \frac 1 {\sqrt N} \sum_{t=1}^N \left[{B_t}' - \widetilde {\nabla}(B) '\left[\nabla(A)'\right]^{-1} {A_t}'\right] \phi(X_t, \beta) \end{split}\]

This formula includes an explicit adjustment for estimation of \(\beta\). Notice that if \(A_t = B_t\), then the right side is zero and the limiting distribution is degenerate. This approximation is used to construct tests that account for having used GMM to estimate a parameter vector \(\beta\).

Example 13.1 (cont.)

Consider again unconditional moment restrictions specified in Example 13.1. Let the selection process for testing be constant over time so that \(B_t = {\mathbb B}\). Then

\[\frac 1 {\sqrt{N}} \sum_{t=1}^N {B_t}' \phi(X_t, b_N) \approx \frac 1 {\sqrt N} \sum_{t=1}^N \left[{\mathbb B}' - {\mathbb B}' {\mathbb D} \left({\mathbb A}' {\mathbb D} \right)^{-1} {\mathbb A}'\right]\phi(X_t, \beta) .\]
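The matrix multiplying \(\phi(X_t, \beta)\) in this approximation has two properties that are easy to verify numerically: it annihilates \({\mathbb D}\), so first-order estimation error in \(b_N\) has already been netted out, and it is identically zero when \({\mathbb B} = {\mathbb A}\). The sketch uses randomly generated hypothetical inputs, and `adjusted_loading` is a name we introduce for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
r, k, k_tilde = 5, 2, 3
D = rng.normal(size=(r, k))                  # hypothetical E[d phi / d b']
M = rng.normal(size=(r, r))
V = M @ M.T + r * np.eye(r)                  # hypothetical long-run covariance
A = D + 0.1 * rng.normal(size=(r, k))        # selection matrix used for estimation
B = rng.normal(size=(r, k_tilde))            # selection matrix used for testing


def adjusted_loading(B, A, D):
    """B' - B'D (A'D)^{-1} A', the matrix multiplying phi in the approximation."""
    return B.T - B.T @ D @ np.linalg.solve(A.T @ D, A.T)


P = adjusted_loading(B, A, D)
test_cov = P @ V @ P.T                       # limiting covariance of the test moments
P_same = adjusted_loading(A, A, D)           # testing with the estimation moments
print(np.abs(P @ D).max(), np.abs(P_same).max())
```

The matrix `test_cov` is the covariance to use when forming a test statistic from the scaled sample moments \(N^{-1/2}\sum_t {\mathbb B}'\phi(X_t, b_N)\).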

Remark 13.5

Moment-matching is also accompanied by some form of testing. This practice often occurs in the study of dynamic economic models, but it is also prevalent in other disciplines as the following quote illustrates:

Some hydrologists have suggested a two-step calibration scheme in which the available dependent data set is divided into two parts. In the first step, the independent parameters of the model are adjusted to reproduce the first part of the data. Then in the second step the model is run and the results are compared with the second part of the data. In this scheme, the first step is labeled “calibration,” and the second step is labeled “verification”. … The use of the term verification in this context is highly misleading … .[Oreskes et al., 1994]

[Oreskes et al., 1994] prefer the term “confirmation” for this second step. While not dismissing the second step, the authors are quick to remind readers of its limitations. The formulas just presented for testing support this step.

13.7. Statistical tests based on efficient GMM estimators#

First suppose that we have a statistically efficient selection process. Thus the selection matrix is some nonsingular matrix transformation of \(A^d\) where \(A^d\) satisfies the “first-order conditions” (13.3). Recall the approximation

\[\frac 1 {\sqrt{N}} \sum_{t=1}^N {B_t}' \phi(X_t, b_N) \approx \frac 1 {\sqrt N} \sum_{t=1}^N \left[{B_t}' - \widetilde {\nabla}(B) '\left[\nabla(A^d)'\right]^{-1} {A_t^d}'\right] \phi(X_t, \beta) , \]

which includes an adjustment for estimation. Let \({\widetilde G}_t(B)\) denote the increment in the martingale approximation for

\[\sum_{t=1}^N {B_t}' \phi(X_t, \beta) .\]

We use the representation implied by (13.3) to write:

\[\widetilde{\nabla}(B) = E\left[ G_t(A^d) {\widetilde G}_t(B)' \right] .\]

These constructions allow us to write:

(13.7)#\[\frac 1 {\sqrt{N}} \sum_{t=1}^N {B_t}' \phi(X_t, b_N) \approx \frac 1 {\sqrt N} \sum_{t=1}^N {\widehat G}_t(B)\]

where

\[{\widehat G}_t(B) \overset{\text{def}}{=} {\widetilde G}_t(B) - E\left[{\widetilde G}_t(B) G_t(A^d)'\right] \left( E\left[{ G}_t(A^d) G_t(A^d)'\right] \right)^{-1} G_t(A^d) \]

The term, \({\widehat G}_t(B)\), that appears inside the sum on the right side of (13.7) is the population least squares residual from regressing \({\widetilde G}_t(B)\) onto \(G_t(A^d)\). This regression residual can also be interpreted as a martingale increment for a stationary increments process.

Suppose that \({\widehat G}_t(B)\) has a nonsingular covariance matrix. Consider the quadratic form used for building a test:

\[{\frac 1 N} \left[\sum_{t=1}^N \phi(X_t, b_N)' B_t \right] \left(E \left[{\widehat G}_t(B){\widehat G}_t(B)' \right]\right)^{-1} \left[\sum_{t=1}^N {B_t}' \phi(X_t, b_N)\right] \Rightarrow \chi^2 ({\tilde k} ) .\]

Example 13.1 (cont.)

Consider Example 13.1 again. We have already shown that

\[{\mathbb A}^d = {\mathbb V}^{-1} {\mathbb D} .\]

Suppose we let \(B_t = {\mathbb I}\) as a special case. Then

(13.8)#\[{\frac 1 {\sqrt{N}}} \sum_{t=1}^N \phi(X_t, b_N) \approx \left({\mathbb I} - {\mathbb D}\left[{\mathbb D}'{\mathbb V}^{-1}{\mathbb D}\right]^{-1}{\mathbb D}'{\mathbb V}^{-1} \right) {\frac 1 {\sqrt{N}}} \sum_{t=1}^N \phi(X_t, \beta)\]

Results for other constant choices of \({\mathbb B}\) can be deduced by premultiplying both sides of (13.8) by \({\mathbb B}'\).
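As a numerical check on (13.8), the sketch below builds the matrix that multiplies the scaled sample moments and computes the covariance of the implied limit. The matrices \({\mathbb D}\) and \({\mathbb V}\) are hypothetical random stand-ins, and \({\mathbb V}\) is assumed symmetric and positive definite.

```python
import numpy as np

rng = np.random.default_rng(1)
r, k = 4, 2                                  # hypothetical dimensions
D = rng.normal(size=(r, k))                  # stand-in for D
L = rng.normal(size=(r, r))
V = L @ L.T + r * np.eye(r)                  # symmetric positive definite stand-in for V

A_d = np.linalg.solve(V, D)                  # A^d = V^{-1} D
middle = np.linalg.inv(D.T @ A_d)            # (D' V^{-1} D)^{-1}
P = np.eye(r) - D @ middle @ A_d.T           # matrix multiplying the sum in (13.8)

# Covariance of the limit in (13.8) when the scaled moments have covariance V.
limit_cov = P @ V @ P.T
rank = np.linalg.matrix_rank(limit_cov, tol=1e-8)
```

The limiting covariance here has rank \(r - k\), so with \(B_t = {\mathbb I}\) it is singular. A quadratic-form test of the kind displayed earlier would therefore need a selection \({\mathbb B}\) with at most \(r - k\) columns, or a generalized inverse.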

Remark 13.6

To continue our study of Example 13.1, form the population problem:

\[\min_b E\left[ \phi(X_t, b) \right]' {\mathbb V}^{-1} E \left[\phi(X_t,b)\right] .\]

This has a minimizer at \(b = \beta\) provided that the unconditional moment conditions are satisfied. If \(b = \beta\) is the only possible parameter vector that satisfies the population moment conditions, then \(b = \beta\) is the unique solution to the population minimization problem stated here. Suppose that we construct an estimator by solving a minimization problem:

(13.9)#\[\min_b {\frac 1 N} \sum_{t=1}^N \phi(X_t,b)' {\mathbb V}^{-1} \sum_{t=1}^N \phi(X_t,b) .\]

First-order necessary conditions are

\[{\frac 1 N} \sum_{t=1}^N \left[\frac {\partial \phi }{\partial b'}(X_t,b_N) \right]' {\mathbb V}^{-1} \sum_{t=1}^N \phi(X_t,b_N) = 0 .\]

Assume that we already know that the solution \(b_N\) of the above first-order conditions provides a consistent estimator of parameter vector \(\beta\). Then we can show that

\[{\frac 1 N} \sum_{t=1}^N \left[\frac {\partial \phi }{\partial b'} (X_t,b_N)\right] \rightarrow {\mathbb D}\]

where convergence is with probability one. Thus, in this case the implied selection matrix

\[{\mathbb A} = {\mathbb V}^{-1} {\mathbb D} \]

provides an estimator that attains the efficiency bound.
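A minimal sketch of this two-step procedure follows, under assumptions that are not part of the text: the moment function is linear, \(\phi(X_t, b) = Z_t (y_t - x_t b)\), the data are simulated and independent over time so that \({\mathbb V}\) can be estimated by the sample second-moment matrix of \(\phi\), and the helper name `linear_gmm` is hypothetical.

```python
import numpy as np

def linear_gmm(y, X, Z, V):
    """Minimize (13.9) for phi(X_t, b) = Z_t (y_t - X_t' b) and a given weighting matrix V."""
    ZX = Z.T @ X                       # sum_t Z_t X_t'  (r x k)
    Zy = Z.T @ y                       # sum_t Z_t y_t   (r,)
    V_inv_ZX = np.linalg.solve(V, ZX)  # V^{-1} ZX
    return np.linalg.solve(ZX.T @ V_inv_ZX, V_inv_ZX.T @ Zy)

rng = np.random.default_rng(2)
N, beta = 5000, 2.0
Z = rng.normal(size=(N, 3))                      # instruments
v = rng.normal(size=N)
x = Z @ np.array([1.0, 0.5, -0.5]) + v           # regressor correlated with the disturbance
y = beta * x + 0.5 * v + rng.normal(size=N)
X = x[:, None]

# Step 1: preliminary consistent estimate with an identity weighting matrix.
b_first = linear_gmm(y, X, Z, np.eye(3))
# Step 2: estimate V from phi(X_t, b_first), then minimize (13.9) again.
phi = Z * (y - X @ b_first)[:, None]
V_hat = phi.T @ phi / N
b_N = linear_gmm(y, X, Z, V_hat)

# Sample first-order conditions evaluated at b_N (numerically zero).
foc = (Z.T @ X).T @ np.linalg.solve(V_hat, Z.T @ (y - X @ b_N)) / N**2
```

Because the moment function is linear in \(b\), the minimizer has a closed form here; a nonlinear \(\phi\) would require a numerical optimizer instead.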

Remark 13.7

There is an interesting variation of the approach described in Remark 13.6. For any \(b\), let \(\mathbb{V}(b)\) be the population covariance matrix of the martingale increment used in the Central Limit approximation for the process

\[\frac{1}{\sqrt{N}} \sum_{t=1}^N \phi(X_t, b) .\]

Assume that \(\mathbb{V}(b)\) is nonsingular for every \(b\) in a parameter space. Form the population minimization problem:

\[\min_b E\left[ \phi(X_t, b) \right]' \left[\mathbb{V}(b)\right]^{-1} E \left[\phi(X_t,b)\right] .\]

If \(b = \beta\) is the only vector that satisfies the associated population first-order conditions, then \(b = \beta\) is again the unique solution to the above population minimization problem.

Now form sample counterparts of both \(E \left[\phi(X_t,b)\right]\) and \(\mathbb{V}(b)\) as functions of \(b\). Minimizing a sample counterpart of the above population minimization problem gives rise to a “continuously-updated GMM estimator”. See [Hansen et al., 1996]. The parameter vector and an appropriately scaled minimized objective function have the same limiting distributions as those described in Remark 13.6.
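The sketch below illustrates a continuously-updated objective for a scalar parameter. It assumes a hypothetical linear moment function \(\phi(X_t, b) = Z_t (y_t - x_t b)\), simulated data that are independent over time (so the sample covariance of \(\phi(X_t, b)\) serves as the counterpart of \(\mathbb{V}(b)\)), and a simple grid search in place of a numerical optimizer.

```python
import numpy as np

def cue_objective(b, y, x, Z):
    """Sample counterpart of E[phi]' V(b)^{-1} E[phi] with V(b) re-estimated at each b."""
    phi = Z * (y - x * b)[:, None]           # N x r matrix of moment functions
    g = phi.mean(axis=0)                     # sample counterpart of E[phi(X_t, b)]
    V_b = np.cov(phi, rowvar=False)          # sample counterpart of V(b)
    return g @ np.linalg.solve(V_b, g)

rng = np.random.default_rng(3)
N, beta = 5000, 2.0
Z = rng.normal(size=(N, 3))
v = rng.normal(size=N)
x = Z @ np.array([1.0, 0.5, -0.5]) + v
y = beta * x + 0.5 * v + rng.normal(size=N)

grid = np.linspace(1.5, 2.5, 201)            # hypothetical search interval
values = np.array([cue_objective(b, y, x, Z) for b in grid])
b_cue = grid[values.argmin()]
```

Unlike the two-step objective, the weighting matrix changes with \(b\), so the objective is no longer quadratic in \(b\) even when \(\phi\) is linear.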

Remark 13.8

It can be numerically challenging to find minimizers of the continuously-updated GMM objective function in high dimensions. Similarly, it can be difficult to construct confidence sets based on this objective. [Chernozhukov and Hong, 2003] and [Chen et al., 2018] devise and justify simulation-based methods for inference applicable to the continuously-updated GMM objective function. They do so by adapting insights from simulation-based approaches for Bayesian inference. The Bayesian-type calculations can be numerically more tractable. [Chernozhukov and Hong, 2003] treat the continuously-updated objective function as a “log-likelihood function”, and then use large sample approximations to justify the Bayesian-like calculations. They justify this approach even though the continuously-updated objective is not formally a log-likelihood function. [Chen et al., 2018] show how to modify this approach to attain additional robustness and reliability. An attractive feature of these methods is that they exploit the shape of the continuously-updated GMM objective function, but in so doing they impose a “prior” distribution to help guide this exploration.

13.8. Quantifying model misspecification and its ramifications.#

In our development of statistical tests of model misspecification, we deduced limiting distributions of test statistics when the model was correctly specified. We now conduct our investigation allowing for the model to be misspecified and deducing some of its consequences. In contrast to much of the econometrics literature, we presume that the misspecification is global (not local) in nature. We choose the global perspective because it better supports interpretations of the potential misspecifications. Specifically, we modify the target of estimation to include an estimate of the underlying probability distribution. At the same time as we estimate the parameter vector, we estimate the distribution of the observable data that supports this estimation. We relax the assumption that the moment conditions are satisfied under the data generating process and instead find distributions that are potentially statistically close to the data generating process. In many models of interest, the expectations used for the moment conditions could be the subjective beliefs of the agents “inside the model” with expectations that differ from the actual data generating process.

A substantial literature on what is called generalized empirical likelihood addresses this. Its primary motivation is to improve second-order efficiency, using constructs that we do not study here. Since this literature features a finite number of unconditional moment conditions, these second-order efficiency gains are applied to a second-best formulation of statistical efficiency.

Instead of focusing on refined inferences, we follow a suggestion in [Hansen, 2014] by thinking of the beliefs as those of economic agents within the models we build. This builds on an idea suggested by [Brown and Back, 1993] to use the implied probabilities from a GMM estimation to isolate the aspect of the empirical distribution of the data that is potentially most problematic from the perspective of the model that is being estimated.[4] While our suggestions for implementation differ from theirs, there is an overlapping aim. In what follows, we use two special cases of the generalized empirical likelihood methods as diagnostics that are informative about potential misspecification.

13.8.1. Relative entropy divergence#

We temporarily hold fixed a hypothetical parameter value \(b,\) and we allow the model to be misspecified. Consider the following population problem:

\[\begin{split} \min_{M \ge 0} & E \left[ M(\log M) \right] \cr & E \left[M \phi(X_t, b)\right] = 0\cr & E \left(M - 1 \right) = 0. \end{split}\]

Here the random variable \(M\) is used as a potential change in the probability measure. The minimization problem restricts \(M\) so that the moment conditions are satisfied under the change in probabilities. The outcome is a minimum divergence measure supported by an \(M^*\) that achieves this minimum divergence. Should the original model be correctly specified, this solution will be \(M^* = 1\); under misspecification, this will not be the case. Thus the minimized objective is a measure of model misspecification. With unknown parameters, the bound can be made smaller by searching over the parameter space.

To solve the minimization problem (holding fixed \(b\)), we introduce multipliers \(\lambda\) and \(\zeta\) on the two constraints. The Lagrangian minimization problem separates so that we may solve it without taking expectations. The first-order conditions for this Lagrangian are:

\[\lambda \cdot \phi(X_t, b) + \zeta + \log(M) + 1 = 0.\]

Thus

(13.10)#\[ \log M = - \lambda \cdot \phi(X_t, b) - \zeta -1. \]

Plugging this solution back into the Lagrangian gives

\[\begin{split} E \left( M\log M \right) + E\left[M \left(\lambda \cdot \phi(X_t, b) + \zeta\right) \right] - \zeta & = - E\left( M \right) - \zeta . \end{split}\]

First maximize over \(\zeta\) after substituting formula (13.10):

(13.11)#\[\max_{{\zeta}} - E\left( \exp\left[- \lambda \cdot \phi(X_t, b) - \zeta -1\right] \right) - \zeta. \]

The first-order condition is

\[E\left( \exp\left[- \lambda \cdot \phi(X_t, b) - \zeta -1\right] \right) - 1 = 0,\]

implying that

\[\exp(\zeta + 1) = E\left( \exp\left[- \lambda \cdot \phi(X_t, b) \right] \right) .\]

Substituting this back into the objective (13.11) gives the objective to be maximized with respect to \( \lambda\):

\[\max_{\lambda} - \log E\left( \exp\left[ - \lambda \cdot \phi(X_t, b) \right] \right).\]

Thus the counterpart to a misspecified population generalized moment problem solves the min-max problem:

\[ \min_{b \in {\mathcal P}} \max_{\lambda \in {\mathbb R}^r} - \log E\left( \exp\left[ - \lambda \cdot \phi(X_t, b) \right] \right).\]

Under a correct model specification, [Kitamura and Stutzer, 1997] establish that the minimizing solution to a sample counterpart to this population objective has the same first-order properties as an efficient GMM estimator when \(\{ \phi(X_t, \beta) : t \ge 0 \}\) is itself a martingale difference sequence. But our interest in these computations is to provide a measure of misspecification and an assessment of which aspects of the observations are most challenging to the model under investigation.
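For a fixed \(b\), the inner maximization over \(\lambda\) has a smooth concave sample counterpart. The sketch below solves it by Newton's method on simulated data and recovers the implied \(M\). The data, the function name `exponential_tilt`, and the use of an undamped Newton iteration are assumptions made for illustration; a harder problem may need a safeguarded optimizer.

```python
import numpy as np

def exponential_tilt(phi, tol=1e-10, max_iter=50):
    """Sample analogue of max over lam of -log mean exp(-lam . phi_t), by Newton's method."""
    N, r = phi.shape
    lam = np.zeros(r)
    for _ in range(max_iter):
        a = -phi @ lam
        w = np.exp(a - a.max())
        w /= w.sum()                       # tilted probabilities, proportional to exp(-lam . phi_t)
        tilted_mean = w @ phi              # minus the gradient of log mean exp(-lam . phi_t)
        if np.linalg.norm(tilted_mean) < tol:
            break
        c = phi - tilted_mean
        hess = (c * w[:, None]).T @ c      # tilted covariance matrix, the Hessian
        lam = lam + np.linalg.solve(hess, tilted_mean)
    a = -phi @ lam
    divergence = -(a.max() + np.log(np.mean(np.exp(a - a.max()))))
    M = np.exp(a) / np.mean(np.exp(a))     # implied change of measure, sample mean one
    return lam, M, divergence

rng = np.random.default_rng(4)
# Hypothetical phi(X_t, b) whose sample mean is not zero, so the model is misspecified at b.
phi = rng.normal(size=(2000, 2)) + np.array([0.2, -0.1])
lam, M, divergence = exponential_tilt(phi)
```

Observations with large values of \(M\) are those that the tilted distribution must overweight for the moment conditions to hold, which is the diagnostic use emphasized above.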

13.8.2. Quadratic divergence#

Consider a counterpart with a quadratic divergence to the problem that we just analyzed:

\[\begin{split} \min_{M \ge 0} & {\frac 1 2} E\left[ (M - 1)^2 \right] \cr & E \left[M \phi(X_t, b)\right] = 0\cr & E \left(M - 1 \right) = 0. \end{split}\]

Given multipliers, \(\lambda,\) and \(\zeta\) on the constraints, abstracting from the nonnegativity constraint on \(M,\) the first-order conditions for the inner minimization are:

\[\lambda \cdot \phi(X_t, b) + \zeta + M - 1 = 0\]

Thus

(13.12)#\[M(b, \lambda, \zeta) = 1 - \zeta - \lambda \cdot \phi(X_t, b) \]

provided that the right-hand side is nonnegative. Otherwise \(M(b, \lambda, \zeta)=0\). This leads us to write:

\[M(b, \lambda, \zeta) = \left[ 1 - \zeta - \lambda \cdot \phi(X_t, b) \right]_+\]

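where \([\cdot]_+\) imposes zero when the argument is negative. We then solve for the multipliers \(\lambda\) and \(\zeta\) by requiring that \(M(b, \lambda, \zeta)\) have mean one and that the moment conditions are satisfied under the implied change in probability measure. We do not have an analytical solution. Substituting \(M(b, \lambda, \zeta)\) into the Lagrangian gives the concave dual problem

\[ \max_{\lambda, \zeta} \ \frac 1 2 - \zeta - \frac 1 2 E\left(\left[M(b, \lambda, \zeta)\right]^2\right),\]

whose first-order conditions reproduce these two restrictions and whose maximized value equals \({\frac 1 2} E\left[ (M - 1)^2 \right]\) at the solution. We solve it using a numerical method. When there are unknown parameters, we include a minimization over \(b \in {\mathcal P}.\)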
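One possible numerical method is sketched below for a fixed \(b\). It works with the dual objective obtained by substituting \(M(b, \lambda, \zeta)\) into the Lagrangian, \(\frac 1 2 - \zeta - \frac 1 2 E\left[M(b,\lambda,\zeta)^2\right]\), whose stationarity conditions are exactly the mean-one and moment restrictions. The damped Newton iteration on the active set, the simulated data, and the name `quadratic_tilt` are assumptions of this sketch, not a method taken from the text.

```python
import numpy as np

def quadratic_tilt(phi, tol=1e-10, max_iter=100):
    """Solve for (zeta, lam) in M = [1 - zeta - lam . phi_t]_+ by damped Newton on the dual."""
    N, r = phi.shape
    z = np.column_stack([np.ones(N), phi])       # z_t = (1, phi_t), theta = (zeta, lam)
    e1 = np.zeros(r + 1)
    e1[0] = 1.0

    def M_of(theta):
        return np.maximum(1.0 - z @ theta, 0.0)

    def dual(theta):
        return 0.5 - theta[0] - 0.5 * np.mean(M_of(theta) ** 2)

    theta = np.zeros(r + 1)
    for _ in range(max_iter):
        M = M_of(theta)
        grad = z.T @ M / N - e1                  # (mean(M) - 1, mean(M phi))
        if np.linalg.norm(grad) < tol:
            break
        active = z[M > 0]                        # observations where M is not truncated
        H = active.T @ active / N                # minus the generalized Hessian of the dual
        step = np.linalg.solve(H, grad)
        t, d0, slope = 1.0, dual(theta), grad @ step
        # Backtrack until the dual objective has increased sufficiently.
        while dual(theta + t * step) < d0 + 1e-4 * t * slope - 1e-12 and t > 1e-8:
            t *= 0.5
        theta = theta + t * step
    M = M_of(theta)
    return theta[0], theta[1:], M, 0.5 * np.mean((M - 1.0) ** 2)

rng = np.random.default_rng(5)
# Hypothetical phi(X_t, b) with a mean large enough for the nonnegativity constraint to bind.
phi = rng.normal(size=(2000, 2)) + np.array([0.8, -0.5])
zeta, lam, M, divergence = quadratic_tilt(phi)

# Benchmark that ignores nonnegativity: (1/2) mean(phi)' W^{-1} mean(phi).
mu = phi.mean(axis=0)
W = np.cov(phi, rowvar=False, bias=True)
unconstrained = 0.5 * mu @ np.linalg.solve(W, mu)
```

Imposing nonnegativity can only raise the minimized divergence relative to the benchmark that ignores it, which gives a simple sanity check on the numerical solution.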

As an intellectual curiosity, suppose we dispense with the restriction that \(M(b, \lambda, \zeta)\) be nonnegative. Restricting it to have mean one implies that:

\[\zeta = - \lambda \cdot E \phi(X_t, b),\]

and thus

\[M(b, \lambda) - 1 = - \lambda \cdot \left[ \phi(X_t, b) - E \phi(X_t, b)\right] .\]

Let

\[\begin{align} {\mathbb W}(b) & \eqdef E\left(\left[ \phi(X_t, b) - E\left[ \phi(X_t, b)\right]\right] \phi(X_t, b)'\right) \cr & = E\left( \left[ \phi(X_t, b) - E\left[ \phi(X_t, b)\right]\right] \left[ \phi(X_t, b) - E\left[ \phi(X_t, b)\right]\right]'\right), \end{align}\]

which is the covariance matrix of the random vector \(\phi(X_t, b)\). Consistent with our previous analysis, we restrict the moment conditions not to be redundant so that \({\mathbb W}(b)\) is nonsingular. Note that \({\mathbb W}(b)\) is the pertinent asymptotic covariance matrix for a central limit approximation only when \(\{\phi(X_t, b)\}\) is the first-difference of a martingale. Imposing the constraint that \(\phi(X_t, b)\) has mean zero under the probability measure induced by \(M(b, \lambda)\) gives the formula:

\[{\mathbb W}(b) \lambda =E\left[ \phi(X_t, b)\right] .\]

Thus

(13.13)#\[\begin{align} M(b, \lambda) - 1 & = - E\left[ \phi(X_t, b)\right]' {\mathbb W}(b)^{-1} \left[ \phi(X_t, b) - E \phi(X_t, b)\right] \cr {\frac 1 2} E\left[ (M - 1)^2 \right] & = {\frac 1 2} E\left[ \phi(X_t, b)\right]' {\mathbb W}(b)^{-1} E\left[ \phi(X_t, b)\right], \end{align}\]

and the \(b\) of interest is determined by:

\[\min_{b \in {\mathcal P}} {\frac 1 2} E\left[ \phi(X_t, b)\right]' {\mathbb W}(b)^{-1} E\left[ \phi(X_t, b)\right].\]

This is recognizable as the population counterpart to the continuously-updated objective for GMM proposed by [Hansen et al., 1996] for the special case in which \({\mathbb W}(b) = {\mathbb V}(b).\)
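The closed-form expressions in (13.13) are easy to verify numerically for a fixed \(b\). The sketch below uses simulated values of \(\phi(X_t, b)\) (hypothetical data) and sample moments with a \(1/N\) normalization in place of population expectations.

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical phi(X_t, b) with a nonzero sample mean.
phi = rng.normal(size=(1000, 3)) + np.array([0.2, -0.1, 0.05])

mu = phi.mean(axis=0)                        # sample counterpart of E[phi(X_t, b)]
W = np.cov(phi, rowvar=False, bias=True)     # sample counterpart of W(b)
lam = np.linalg.solve(W, mu)                 # solves W(b) lam = E[phi(X_t, b)]
M = 1.0 - (phi - mu) @ lam                   # first line of (13.13)

divergence = 0.5 * np.mean((M - 1.0) ** 2)   # left side of the second line of (13.13)
closed_form = 0.5 * mu @ lam                 # right side of the second line of (13.13)
```

Some entries of `M` can be negative in this calculation, which is why it is only a benchmark for the problem that imposes \(M \ge 0\).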

Remark 13.9

As one way to confront, approximately, forms of weak dependence, we could apply the same population calculations but replace \(\phi(X_t, b)\) with

\[ \frac 1 {\sqrt{{\underline N}}} \sum_{t=1}^{\underline N} \phi(X_t, b),\]

and repeat the computations with the quadratic and relative entropy divergences. This extends the approach of [Bartlett, 1950] described in this remark to a measure of misspecification. In practice, it presumes that \({\underline N}\) be substantially less than the sample size and that the number of coordinates of \(\phi\) not be too large.

13.8.3. Bounding expectations under model misspecification#

So far, we have used a statistical divergence to identify an adjustment to the data generating process that allows for the moment conditions to be satisfied. Following a formulation of [Chen et al., 2020], we now make two changes. First we represent changes in the underlying probabilities as changes in expectations relative to the actual data generation. Second we replace the minimization with a weaker divergence inequality via an “ambiguity set” of expectations.

Consider the relative entropy divergence. A probability distribution is characterized by the expectations it assigns to a rich class of scalar functions, \({h},\) of the underlying random vector \(X_t\). When this class is sufficiently rich, the expectations determine the probabilities. We form a relative entropy bound by taking the minimum divergence outcome as a starting point and inflating it by some percentage. Call this outcome \(\kappa\). For a hypothetical parameter vector, \(b\), solve

\[\begin{split} \min_{M \ge 0} & E \left[ M h (X_t) \right] \cr &E \left[M \log M\right] \le \kappa \cr & E \left[M \phi(X_t, b)\right] = 0 \cr & E \left(M - 1 \right) = 0. \end{split}\]

This provides a sharp lower bound on \(E[M h(X_t) ]\). To get a sharp upper bound, repeat the same calculation by computing a sharp lower bound for \(- E[M h(X_t)]\) and multiply the outcome by minus one. Observe that by inflating \(\kappa\) we no longer identify a unique \(M\), as there is a convex set of \(M\)’s that satisfy the constraints. The outcomes of the computations are upper and lower bounds on expectations.

Remark 13.10

The minimization over alternative \(M\)’s for alternative \(h\)’s gives an example of what [Peng, 2004] calls a nonlinear expectation operator (with argument \(h\)). This expectation emerges under a variety of alternative specifications of ambiguity.

We leverage our previous calculations by proceeding differently. First solve the minimum divergence problem and compute the implied expectation \(E[M h (X_t)] = \underline{\sf{r}}.\) Then solve:

\[\begin{split} \min_{b \in {\mathcal P}} \min_{M \ge 0} & E\left[ M(\log M) \right] \cr & E \left[M \phi(X_t, b)\right] = 0\cr & E \left[M h(X_t) \right] - \sf{r} = 0 \cr & E \left(M - 1 \right) = 0. \end{split}\]

for \(\sf{r} > \underline{ \sf{r}}\). The added constraint will induce a larger divergence bound. By increasing \(\sf{r}\) we may attain a divergence of \(\kappa\).

With this formulation, we have an immediate extension of our previous analysis leading to the following problem

\[\min_{b \in {\mathcal P}} \max_{\lambda \in {\mathbb R}^r, \nu \in {\mathbb R} }- \log E\left( \exp\left[ - \lambda \cdot \phi(X_t, b) - {\nu} h (X_t, b) + {\nu} \sf{r} \right] \right)\]

where we now allow the function \(h\) to depend on the parameter vector, \(b\). We use \(\nu\) to denote the multiplier on the constraint:

\[E \left[M h(X_t,b) \right] - \sf{r} = 0.\]

As a special case, we may set \(h\) equal to one of the entries of \(b\) and deduce implied upper and lower bounds on the different parameter coefficients and hence deduce an ambiguity interval for one of the components of the parameter vector.
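The sketch below traces out an upper bound for a fixed \(b\). It treats \(h(X_t) - \sf{r}\) as one more moment function in the exponential-tilting dual, then raises \(\sf{r}\) above \(\underline{\sf{r}}\) by bisection until the divergence reaches \(\kappa\). The simulated data, the choice \(h(X_t) = X_{1,t}^2\), the 20 percent inflation used for \(\kappa\), the bracketing interval, and the helper names are all assumptions made for illustration; the minimization over \(b\) is omitted.

```python
import numpy as np

def tilt(g, tol=1e-9, max_iter=200):
    """Maximize -log mean exp(-theta . g_t) over theta by Newton's method with backtracking."""
    def f(theta):
        a = -g @ theta
        return a.max() + np.log(np.mean(np.exp(a - a.max())))

    def probs(theta):
        a = -g @ theta
        w = np.exp(a - a.max())
        return w / w.sum()

    theta = np.zeros(g.shape[1])
    for _ in range(max_iter):
        w = probs(theta)
        grad = -(w @ g)                          # gradient of f
        if np.linalg.norm(grad) < tol:
            break
        c = g + grad                             # g minus its tilted mean
        step = -np.linalg.solve((c * w[:, None]).T @ c, grad)
        t, f0, slope = 1.0, f(theta), grad @ step
        # Halve the step until f has decreased sufficiently.
        while f(theta + t * step) > f0 + 1e-4 * t * slope + 1e-12 and t > 1e-10:
            t *= 0.5
        theta = theta + t * step
    return theta, len(g) * probs(theta), -f(theta)

rng = np.random.default_rng(7)
x = rng.normal(size=(2000, 2)) + np.array([0.2, -0.1])
phi = x                      # hypothetical phi(X_t, b) = X_t - b evaluated at b = 0
h = x[:, 0] ** 2             # hypothetical function whose expectation is bounded

def divergence(r_val):
    """Minimum relative entropy subject to E[M phi] = 0 and E[M h] = r_val."""
    return tilt(np.column_stack([phi, h - r_val]))[2]

_, M_min, d_min = tilt(phi)
r_low = np.mean(M_min * h)   # implied expectation at the minimum-divergence solution
kappa = 1.2 * d_min          # inflate the minimum divergence by 20 percent

lo, hi = r_low, r_low + 0.5  # assumed bracket, with divergence(hi) above kappa
for _ in range(50):
    mid = 0.5 * (lo + hi)
    if divergence(mid) < kappa:
        lo = mid
    else:
        hi = mid
r_upper = 0.5 * (lo + hi)
```

The lower bound can be obtained in the same way by searching over values of \(\sf{r}\) below \(\underline{\sf{r}}\).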

[Chen et al., 2024] describe inferential methods that support such an analysis expressed in terms of large sample approximations.

Remark 13.11

So far, we have treated this as an unconditional problem including both the moment conditions and the divergence measures. [Chen et al., 2020] show how to extend this analysis to the case with conditional moment restrictions along with an intertemporal measure of statistical divergence.
