13. GMM Estimation
Authors: Lars Peter Hansen (University of Chicago) and Thomas Sargent (NYU)
“How do you eat an elephant? One bite at a time.”
– African proverb
\(\newcommand{\eqdef}{\stackrel{\text{def}}{=}}\)
13.1. Introduction
Generalized Method of Moments (GMM) estimation studies a family of estimators constructed from partially specified or misspecified models. Since direct application of likelihood methods sometimes can be computationally challenging, GMM methods may provide tractable alternatives. They take a different starting point than a parameterized likelihood function, but they are structured to allow for the simultaneous study of a family of estimators. With this approach, we can make relative accuracy comparisons among members of the entire family and establish an efficiency bound for the class of estimators.
This chapter takes statistical consistency as given. Supporting arguments for this chapter can be obtained with direct extensions of the Law of Large Numbers as described in Chapter 1. Such extensions often entail Laws of Large Numbers applied to so-called random functions (function-valued processes expressed as functions of a parameter vector of interest) instead of a random vector.
Throughout this chapter, we will condition on invariant events even though we will suppress this dependence when we write expectations. Given the partially specified or misspecified nature of the model, much more than a simple parameter vector is included in this conditioning.[1] The inferences we draw will be conditioned on a parameter vector as is common in so-called classical methods. On the other hand, given the incomplete or partial nature of our starting point, we will not provide a corresponding Bayesian, robust Bayesian, or approximate Bayesian approach to inference.
13.2. Formulation
We study a family of GMM estimators of an unknown parameter vector \(\beta\) constructed from theoretical restrictions on conditional or unconditional moments of functions \(\phi\) that depend on \(\beta\) and on a random vector \(X_t\) that is observable to an econometrician.
As a starting point, we consider a class of restrictions large enough to include examples of both conditional and unconditional moment restrictions. Members of this class take the form
(13.1) \[E\left[ A_t' \phi(X_t, \beta) \right] = 0\]
for all sequences of selection matrices \(A \in \mathcal A\) where \(A = \{A_t : t \ge 1\} \) and where
the vector of functions \(\phi\) is \(r\) dimensional.
the unknown parameter vector \(\beta\) is \(k\) dimensional and in a parameter space \({\mathcal P}\).
\(A_t\) denotes a time \(t\) selection matrix for a subset of the valid moment restrictions that is used to construct a particular statistical estimator \(b\) of \(\beta\).
\({\mathcal A}\) is a collection of sequences of (possibly random) selection matrices that characterize valid moment restrictions.
the mathematical expectation is taken with respect to a statistical model that generates the \(\{X_t : t \ge 1 \}\) process.
A sample counterpart of the population moment conditions (13.1) is
(13.2) \[\frac{1}{N} \sum_{t=1}^N A_t' \phi(X_t, b_N) = 0 .\]
Applying a Law of Large Numbers to (13.2) motivates a generalized method of moments estimator \(b_N\) of the \(k \times 1\) vector \(\beta\).
Different sequences of selection matrices \(\{A_t : t \ge 1 \}\) and \(\{\widetilde A_t: t \ge 1\}\) generally give rise to different properties for the estimator \(b_N\). An exception is when
\[\widetilde A_t = A_t {\mathbb L}\]
for some \(k \times k\) nonsingular matrix \({\mathbb L}\). In this latter case, the same moment conditions are used for estimation and hence will give rise to the same GMM estimator \(b_N\).
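The invariance just described can be checked numerically. The following is a minimal sketch for a hypothetical linear specification \(\phi(X_t, b) = Z_t (y_t - x_t' b)\) with constant selection matrices; the simulated data, the variable names, and the helper `gmm_linear` are illustrative assumptions and not part of the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
N, r, k = 500, 4, 2            # sample size, number of moments, number of parameters

# Hypothetical data: instruments Z, regressors X, outcome y
Z = rng.normal(size=(N, r))
X = Z @ rng.normal(size=(r, k)) + rng.normal(size=(N, k))
beta = np.array([1.0, -0.5])
y = X @ beta + rng.normal(size=N)

def gmm_linear(A):
    """Solve (1/N) sum_t A' z_t (y_t - x_t' b) = 0 for b, with A a constant r x k matrix."""
    ZX = Z.T @ X / N           # r x k
    Zy = Z.T @ y / N           # r
    return np.linalg.solve(A.T @ ZX, A.T @ Zy)

A = Z.T @ X / N                              # one ad hoc selection matrix
L = np.array([[2.0, 1.0], [0.5, 3.0]])       # nonsingular k x k matrix (determinant 5.5)

b_A = gmm_linear(A)
b_AL = gmm_linear(A @ L)                     # same moment conditions, rotated
print(b_A, b_AL)
```

In this linear case the equality is exact up to floating-point error, because \({\mathbb L}'\) cancels from both sides of the estimating equations.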
We study limiting properties of estimator \(b_N\) conditioned on a statistical model. In many settings, the parameter vector \(\beta\) only incompletely characterizes the statistical model. In such settings, we are led in effect to implement a version of what is known as semi-parametric estimation: while \(\beta\) is the finite-dimensional parameter vector that we want to estimate, we acknowledge that, in addition to \(\beta\), a potentially infinite-dimensional nuisance parameter vector pins down the complete statistical model on which we condition when we apply the law of large numbers and other limit theorems.
Example 13.1
Unconditional moment restrictions
Suppose that
where \(r \ge k\). Let \(\mathcal{A}_t\) be the set of all \(r \times k\) matrices \(\mathbb{A}\) of constants. Rewrite the restrictions as:
for all \(r \times k\) matrices \(\mathbb{A}\). [Sargan, 1958] and [Hansen, 1982] assumed moment restrictions like these. The following give two rather different applications.
Nonlinear instrumental variables
Let
where \(Z_t\) is an \(r\) dimensional vector of instrumental variables and \(\eta(Y_t, \beta)\) is a scalar disturbance term in an equation of interest. The vector of instrumental variables is presumed to be orthogonal to \(\eta(Y_t, \beta )\), which gives rise to a vector of moment conditions. When there are more instrumental variables than parameters, we are led to study a family of estimators rather than a single one. This has a direct extension to the case where there are multiple equations and \(\eta(Y_t, \beta)\) is a random vector.
Moment matching
Moment matching is an approach that has or at least should have close ties to calibration as is done often in economic dynamics. See [Hansen and Heckman, 1996] for a discussion of the merits of this link.
Suppose that
where
The random vector \(E \left[\psi(X_t) \right]\) defines moments to be matched and \(\overline{\psi}(\beta)\) gives the population values of those moments under a statistical model with parameter vector \(\beta\). Often that statistical model is a “structural” economic model with nonlinearities and other complications that, for a given \(\beta\), make it challenging to compute the moments \(\overline{\psi}\) analytically. To proceed, the proposal is to approximate those moments for a given \(b\) by computing a sample mean from a long simulation of the statistical model at parameter vector \(b\). By running simulations and computing associated sample means for many alternative \(b\) vectors, we can assemble an approximation to the function \(\overline{\psi}(b)\). [Lee and Ingram, 1991] and [Duffie and Singleton, 1993] used versions of this approach. Notice that in contrast to some other applications of GMM estimation that allow the appearance of unknown nuisance parameters in the statistical model assumed to generate the data, this approach assumes that, given \(b\), the model completely determines a sample path that we can at least simulate. This method is used in macroeconomics, corporate finance, and asset pricing, sometimes formally and sometimes informally. While the model may be misspecified along some dimensions, with this method the target moments are presumed to be correctly specified. Later in this chapter, we will describe a way to assess potential model misspecification in this setup.
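A minimal sketch of simulation-based moment matching follows. It assumes a hypothetical first-order autoregression as the "structural" model, takes \(\psi(X_t) = (X_t^2, X_t X_{t-1})\) as the moments to match, approximates \(\overline{\psi}(b)\) on a grid of candidate values, and uses identity weighting. All of these choices, and the function names, are illustrative assumptions rather than the chapter's method.

```python
import numpy as np

def simulate_ar1(rho, shocks):
    """Simulate x_t = rho * x_{t-1} + e_t, vectorized over the entries of rho."""
    rho = np.atleast_1d(np.asarray(rho, dtype=float))
    x = np.zeros((rho.size, shocks.size))
    for t in range(1, shocks.size):
        x[:, t] = rho * x[:, t - 1] + shocks[t]
    return x

def sample_moments(x):
    """Sample means of psi(X_t) = (X_t^2, X_t * X_{t-1}) along the time axis."""
    return np.stack([np.mean(x[..., 1:] ** 2, axis=-1),
                     np.mean(x[..., 1:] * x[..., :-1], axis=-1)], axis=-1)

T = 20_000
rho_true = 0.6                                       # hypothetical "true" parameter
data = simulate_ar1(rho_true, np.random.default_rng(0).normal(size=T))[0]
target = sample_moments(data)                        # sample analogue of E[psi(X_t)]

# One long simulation per candidate b, reusing the same shocks for every candidate
b_grid = np.linspace(0.0, 0.95, 96)
sims = simulate_ar1(b_grid, np.random.default_rng(1).normal(size=T))
psi_bar = sample_moments(sims)                       # approximates psi_bar(b) on the grid

distance = np.sum((psi_bar - target) ** 2, axis=1)   # identity weighting (a simplification)
b_hat = b_grid[np.argmin(distance)]
print(b_hat)
```

Reusing the same simulated shocks for every candidate \(b\) keeps the approximated function \(\overline{\psi}(b)\) smooth in \(b\), which matters when the grid search is replaced by a numerical optimizer.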
Example 13.2
Conditional moment restrictions
Assume the conditional moment restrictions
for a particular \(\ell \ge 1\) and \(Y_t = X_t\). Let \(\mathcal A_{t}\) be the set of all \(r \times k\) matrices, \(A_t\), of bounded random variables that are \({\mathfrak A}_{t-\ell}\) measurable. Then the preceding conditional moment restrictions are mathematically equivalent to the unconditional moment restrictions
for all random matrix processes \(\{A_t : t \ge 0\} \in {\mathcal A}\). This formulation is due to [Hansen, 1985]. Also see a closely related analysis of [Chamberlain, 1987] with a formal link to semiparametric efficiency bounds.
The following example gives a substantive illustration.
Asset pricing
A common way to construct conditional moment conditions in macro-asset pricing is to construct an \(\ell\) period scalar “stochastic discount factor” as a function of data and an unknown parameter vector. For instance, see [Hansen and Singleton, 1982] and [Hansen and Richard, 1987] for initial applications of this formulation, and for more extensive discussions, see the books [Cochrane, 2001] and [Singleton, 2006]. A stochastic discount factor discounts the future in a state-dependent way to capture compensations for exposures to uncertainty. Denote the \(\ell\)-period stochastic discount factor by \(\psi(X_t,\beta).\) This discount factor may be used to represent asset prices:
\[E\left[ \psi(X_t, \beta) R_t \mid {\mathfrak A}_{t-\ell} \right] = {\bf 1}_r\]
where \(R_t\) is an \(r\)-dimensional vector of \(\ell\)-period gross returns and \({\bf 1}_r\) is an \(r\) dimensional vector of ones. Let
Collections \(\mathcal A\) of selection processes for both of these examples satisfy the following “linearity” restriction.
Restriction 13.1. If \(A^1\) and \(A^2\) are both in \(\mathcal A\) and \(\mathbb{L}_1\) and \(\mathbb{L}_2\) are \(k \times k\) matrices of real numbers, then \(A^1 \mathbb{L}_1 + A^2\mathbb{L}_2\) is in \(\mathcal A\).
A common practice is to use the approach provided in Example 13.2 while substantially restricting the set of moment conditions used for parameter estimation. One possibility is to create unconditional moment restrictions like those in Example 13.1 from a collection of conditional moment restrictions, and thereby reduce the class of GMM estimators under consideration. For instance, let \(A_t^1\) and \(A_t^2\) be two ad hoc choices of selection matrices where no linear combination of columns of \(A_t^2\) duplicates those of \(A_t^1.\)
where \(X_t\) now includes variables used to construct \(A_t^1\) and \(A_t^2\). We use \(2 k \times k\) selection matrices \(\mathbb A\) to form moment conditions
and study an associated family of GMM estimators. This strategy reduces an infinite number of moment conditions to a finite number. There are extensions of this approach. For instance, we could use more than two \(A_t^j\)’s to construct \(\phi^+\), or we could just augment \(A_t^1\) with a subset of columns of \(A_t^2\) when forming \(\phi^+\).
13.3. Central limit approximation
The process
can be verified to have stationary and ergodic increments conditioned on the statistical model. So there exists a Proposition 3.1 decomposition of the process. Provided that
under the statistical model that generates the data, the trend term in the decomposition of Proposition 3.1 is zero, implying that the martingale dominates the behavior of sample averages for large \(N\). In particular, Proposition 3.2 gives a central limit approximation for
Let \(A = \{ A_t : t \ge 0 \}\) and suppose that
converges in mean square. Define the one-step-ahead forecast error:
Paralleling the construction of the martingale increment in Proposition 3.2,
where by the approximation sign \(\approx\) we intend to assert that the difference between the right side and left side converges in mean square to zero as \(N \rightarrow \infty\). Consequently, the covariance matrix in the central limit approximation is
Recall Restriction 13.1. For the preceding construction of the martingale increment, it is straightforward to verify that
follows from the linearity of conditional expectations.
Example 13.1 (cont.)
Consider again Example 13.1 in which \(A_t = \mathbb{A}\) for all \(t\ge 0\) and
where
Define the covariance matrix
and note that
Example 13.2 (cont.)
In Example 13.2
and hence
whenever entries of \(A_t\) are restricted to be \({\mathfrak A}_{t-\ell}\) measurable.
Consequently
for \(j \ge \ell\) so that the infinite sums used to construct \(G_t(A)\) simplify to finite sums.
While martingale approximation provides a way to establish a central limit approximation, especially for partially specified models, a formal construction of \(G_t(A)\) is challenging in practice. There is another limiting representation commonly appealed to for implementation:
Importantly, this computation is the covariance matrix of a scaled (by \(1/\sqrt{N}\)) partial sum and not the covariance matrix of the separate contributions to the sum. Example 13.2 provides further simplification as many of the expected cross terms are zero.
Remark 13.1
When applications of GMM methods call for general (but weak) forms of temporal dependence, the reliable estimation of the covariance matrix \({\mathbb V}\) needed for central limit approximation can be very challenging. Various researchers have proposed methods that come from what is called “spectral density” estimation. Within this latter literature, the matrix \({\mathbb V}\) is the spectral density matrix for the process \(\{ {A_t}'\phi(X_t, \beta) : t \ge 0\}.\) One common approach, proposed by [Bartlett, 1950] and popularized in economics by [Newey and West, 1987], starts from \(L\) sample autocovariances and then downweights them.[2] The autocovariance estimator of order \(\ell\) is
For \({\underline N} \ll N,\) form
Large sample justifications entail arguments in which \({\underline N}\) and \(N\) both go to infinity, but \({\underline N}\) at a much slower rate. Such arguments are of limited value for how to pick \({\underline N}\) in practice. Exploring sensitivity in the choice of \({\underline N}\) is prudent practice. Spectral methods can be notoriously unreliable for high dimensions (large values of \(r\)).[3]
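As an illustration of the downweighting idea, here is a minimal sketch of a Bartlett-weighted covariance estimator. It assumes the common triangular weights \(1 - j/(L+1)\) for autocovariances of order \(j = 1, \ldots, L\) and centers the series before forming autocovariances; the function name `newey_west`, the argument `n_lags`, and the simulated MA(1) input are hypothetical, and the exact weighting used in the text may differ.

```python
import numpy as np

def newey_west(f, n_lags):
    """Bartlett-weighted long-run covariance estimate for the rows of f (N x k)."""
    f = np.asarray(f, dtype=float)
    N = f.shape[0]
    f = f - f.mean(axis=0)                     # center the moment contributions
    V = f.T @ f / N                            # order-0 sample autocovariance
    for lag in range(1, n_lags + 1):
        gamma = f[lag:].T @ f[:-lag] / N       # order-`lag` sample autocovariance
        weight = 1.0 - lag / (n_lags + 1.0)    # Bartlett (triangular) weight
        V += weight * (gamma + gamma.T)
    return V

rng = np.random.default_rng(0)
e = rng.normal(size=(5001, 2))
f = e[1:] + 0.5 * e[:-1]                       # hypothetical MA(1) moment contributions

V0 = newey_west(f, 0)                          # ignores temporal dependence
V10 = newey_west(f, 10)                        # includes ten downweighted autocovariances
print(np.diag(V0), np.diag(V10))

# Sensitivity to the truncation choice, as recommended in the text
for n_lags in (2, 5, 10, 20):
    print(n_lags, np.diag(newey_west(f, n_lags)))
```

The triangular weights guarantee a positive semidefinite estimate, which is one reason this scheme is popular despite the difficulty of choosing the truncation point.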
Consider next the case in which the restriction in Example 13.2 is satisfied for some \(\ell\). Then one can justify using
This estimator of the limiting covariance matrix may turn out not to be positive definite. As an alternative, [Eichenbaum et al., 1988] build on a proposal due to [Durbin, 1959] and propose an improved estimator that is guaranteed to be positive definite.
In the case of moment matching, assuming a correct specification, one could use very large sample simulations conditioned on each of the hypothetical parameters to approximate the construction of \({\mathbb V}(b)\). Remarkably, this is not often done in economics, even though it could improve the quality of the inferences.
13.4. Mean value approximation
Write
where
Since
So long as \(\nabla(A)\) is nonsingular,
This approximation provides an essential input into finding an “efficiency bound” for GMM estimation. Notice that the covariance matrix for such an approximation is:
We want to know how small we can make this matrix by choosing a selection process. The answer to this question gives what we call a GMM efficiency bound. We show how to characterize a greatest lower bound in a later section.
Example 13.1 (cont.)
Consider again Example 13.1. In this case \(A_t = {\mathbb A}\) for all \(t \ge 0\) and
where
and
13.5. GMM Efficiency Bound
Recall
We seek a greatest lower bound on the covariance matrix on the right.
Suppose that \(\nabla(A)\) is nonsingular and impose that
\[\nabla(A) = {\mathbb I}.\]
If not, postmultiply \(A\) by a nonsingular matrix \({\mathbb J}\). That leaves the GMM estimator unaltered. Thus, we have
\[\textrm{cov}(A) = E\left[ G_t(A) G_t(A)' \right]\]
subject to \(\nabla(A) = {\mathbb I}\).
Find an \(A^d\) such that for all \(A \in {\mathcal A}\)
(13.3) \[\nabla(A) = E\left[ G_t(A^d) G_t(A)' \right] .\]
Form
\[A_t^* \eqdef A^d_t \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} \]
for all \(A \in \mathcal A\), where we think of (13.3) as a set of first-order (necessary and sufficient) conditions for our constrained optimization. Rather than derive them as such, we essentially use “guess and verify” in what follows.
and
provided that \(\left[\nabla(A)\right] = {\mathbb I}.\)
Therefore,
\[0 \leq E \left( \left[ G_t(A) - G_t(A^*) \right] \left[ G_t(A) - G_t(A^*) \right]' \right) = \textrm{cov}(A) - \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} .\]
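The inequality above can be checked numerically in the unconditional setting of Example 13.1 with constant selection matrices. The sketch below assumes the standard sandwich formula \(({\mathbb A}'D)^{-1} {\mathbb A}' {\mathbb V} {\mathbb A} (D'{\mathbb A})^{-1}\) for the limiting covariance matrix, where \(D\) stands for the expected derivative of \(\phi\) with respect to \(b'\) and \({\mathbb V}\) for the covariance matrix in the central limit approximation, and it assumes the familiar bound \((D'{\mathbb V}^{-1}D)^{-1}\). The matrices `D` and `V` are random draws, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
r, k = 5, 2

D = rng.normal(size=(r, k))                  # hypothetical E[d phi / d b'] (r x k)
S = rng.normal(size=(r, r))
V = S @ S.T + r * np.eye(r)                  # hypothetical positive definite covariance

def asy_cov(A):
    """Sandwich covariance (A'D)^{-1} A'VA (D'A)^{-1} for a constant r x k matrix A."""
    AD_inv = np.linalg.inv(A.T @ D)
    return AD_inv @ (A.T @ V @ A) @ AD_inv.T

A_d = np.linalg.solve(V, D)                  # candidate efficient selection matrix V^{-1} D
bound = np.linalg.inv(D.T @ A_d)             # (D' V^{-1} D)^{-1}

# For random selection matrices, cov(A) - bound should be positive semidefinite
rel_gaps = []
for _ in range(100):
    A = rng.normal(size=(r, k))
    gap = asy_cov(A) - bound
    eig = np.linalg.eigvalsh((gap + gap.T) / 2)
    rel_gaps.append(eig.min() / max(1.0, np.abs(eig).max()))
min_rel_gap = min(rel_gaps)
print(min_rel_gap)
```

The smallest eigenvalue is reported relative to the largest one so that a nearly singular draw of \({\mathbb A}'D\) does not turn rounding error into an apparent violation.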
Result 13.1. Given a solution to equation (13.3)
Remark 13.2
In Result 13.1, we might be tempted to think that \(G_t(A^d)\) plays the same role that the “score vector” increment does in maximum likelihood estimation. But because there is a possibly infinite dimensional vector of nuisance parameters here, a better analogy is that \(G_t(A^d)\) acts much like the residual vector in a regression of parameters of interest score increments on nuisance parameter score increments along the lines suggested in Section 6.6: Score processes. By undertaking to infer the parameter vector \(\beta\) from conditional or unconditional moment restrictions, we have purposefully pushed all nuisance parameters into the background.
Remark 13.3
The representation
used to compute the efficiency bound is an application of the Riesz Representation Theorem. To understand this, introduce the \(k\)-dimensional coordinate vectors \({\sf u}_i\) for \(i=1,2,...,k\) and consider:
Note that
The integer \(i\) selects the coordinate of \(b\) with respect to which we are differentiating.
If \(A^1\) and \(A^2\) are both in \({\mathcal A}\), then so are linear combinations. Therefore \(({\sf u}_i)'\nabla(A) {\sf u}_j\) is a linear functional defined on a linear space of random variables of the form \(\phi(X_t, \beta)' A_t {\sf u}_j\) for a given \(i\).
The martingale approximation for the scalar process with stationary increments
\[\left\{ \sum_{t=1}^N \phi(X_t, \beta)' A_t {\sf u}_j : N \ge 1 \right\}\]
has martingale increment \(G_t(A) {\sf u}_j\).
The Riesz Representation Theorem asserts that the linear functional \(({\sf u}_i)'\nabla(A) {\sf u}_j\) can be represented as an inner product
\[({\sf u}_i)'\nabla(A) {\sf u}_j = E\left[ \rho_t G_t(A) {\sf u}_j \right]\]
where the scalar random variable \(\rho_t\) is in the mean square closure of
\[\left\{ G_t(A) {\sf u}_j : A \in {\mathcal A} \right\}.\]
We can represent \(\rho_t\) as
\[\rho_t = G_t(A^d){\sf u}_j\]
for some selection process \(A^d \in {\mathcal A}\) or, more generally, as a limit point of a sequence of such selection processes.
The preceding construction pins down row \(j\) of \(A^d\). Repeating an analogous construction for each \(j = 1,2,...,k\) gives the selection matrix \(A^d\).
The GMM efficiency bound presumed that we could solve equation (13.5). The Riesz Representation Theorem requires that \(\rho_t\) be in a mean square closure of a linear space. Provided that the linear functionals \(({\sf u}_i)'\nabla(A) {\sf u}_j\) are mean square continuous, the efficiency bound can be represented in terms of the limit point of a sequence of GMM estimators associated with alternative selection processes even when the limit points are not attained.
Example 13.1 (cont.)
Consider Example 13.1 in which we assumed that \(A_t = \mathbb{A}\). Then
and the equation of interest is
Since the selection matrix \(\mathbb{A}\) can be chosen freely,
and the GMM efficiency bound is
Example 13.2 (cont.)
Consider again Example 13.2 in the special case in which \(\ell = 1\) so that the conditional moment condition of interest is:
Let
which we assume to be nonsingular. To compute the efficiency bound, we wish to solve the following equation for \(A_t^d\)
Given the flexibility in the choice of the random \(A_t\) with entries that are \({\mathfrak A}_{t-1}\) measurable, this equation is equivalent to
where we have taken transposes of the expressions in (13.6). Thus
and the efficiency bound is:
See [Robinson, 1987], [Newey, 1990], and [Ai and Chen, 2003] for prominent contributions to a literature that proposed and justified econometric methods for attaining this bound.
Next consider a conditional moment justification for two-stage least squares. Add the following special restrictions. Suppose that \(r=1\) and that \(V_{t-1} = \mathsf{v} > 0\) where \(\mathsf{v}\) is constant. Further suppose that
Finally, suppose that
where \(Z_{t-1}\) has more entries than \(Y_t^2\). Notice that \(\Pi\) can be computed as a least squares regression. Then
The scaling by \(\frac{1}{\mathsf{v}}\) is inconsequential to the construction of a selection process. The matrix of regression coefficients can be replaced by the finite sample least squares regression coefficients without altering the statistical efficiency.
To obtain this rationale for two-stage least squares, we had to impose a special structure, one that does not prevail in many important applications. For instance, suppose that \(V_{t-1}\) depends on conditioning information so that a form of conditional heteroskedasticity is present. That dependence shows up in essential ways in how \(A^d_t\) should be constructed. Further, suppose that the expectation \(E\left[ Y_t^2 \mid {\mathfrak A}_{t-1} \right]\) depends nonlinearly on \(Z_{t-1}\). In that case, to attain or to approximate the efficiency bound, a least squares regression should account for potential nonlinearity.
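The two-stage least squares construction can be sketched as a GMM estimator whose selection process is built from a first-stage regression. The example assumes a hypothetical homoskedastic linear model in which the regressors `X` play the role of \(Y_t^2\), the instruments `Z` play the role of \(Z_{t-1}\), and the outcome `y` plays the role of the remaining variable in the equation of interest. The simulated data and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, k = 2000, 2

Z = rng.normal(size=(N, 4))                              # instruments Z_{t-1}
Pi_true = np.array([[1.0, 0.0], [0.0, 1.0],
                    [0.5, 0.5], [0.5, -0.5]])            # hypothetical first-stage coefficients
U = rng.normal(size=(N, k))
X = Z @ Pi_true + U                                      # endogenous regressors Y_t^2
beta = np.array([0.7, -1.2])
e = U @ np.array([0.5, 0.5]) + rng.normal(size=N)        # disturbance correlated with X
y = X @ beta + e

# First stage: least squares regression of the regressors on the instruments
Pi_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)           # 4 x k

# Second stage as GMM: the fitted values Z @ Pi_hat serve as the k selected instruments,
# so b solves (1/N) sum_t (Pi_hat' z_t) (y_t - x_t' b) = 0
X_hat = Z @ Pi_hat
b_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

# Textbook projection formula for comparison
b_textbook = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

b_ols = np.linalg.solve(X.T @ X, X.T @ y)                # inconsistent here, for contrast
print(b_2sls, b_textbook, b_ols)
```

Rescaling the fitted values by a positive constant, such as \(1/\mathsf{v}\), leaves `b_2sls` unchanged, which is the sense in which the scaling is inconsequential.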
Remark 13.4
To implement the conditional moment version of the GMM efficiency bound requires the estimation of conditional moments for which the model may not provide functional forms. Reliable nonparametric estimation becomes particularly challenging in high dimensions. Suppose instead that a researcher adopts a convenient parametric approximation to these conditional moments. While this will induce a form of misspecification, it will not necessarily undermine either the consistency of the resulting GMM estimator or its asymptotic distribution. The misspecification may only cause a reduction in the statistical efficiency of the estimator as measured by the asymptotic covariance matrix of the resulting GMM estimator.
Example 13.1 (cont.)
Suppose now that \(\ell = 2\) and consider an unconditional moment formulation. Then even if the covariance structure is homoskedastic and conditional expectations are linear, the two-stage least squares approach will no longer be statistically efficient. We illustrate why by mapping into the framework of Example 13.1.
Use as our \(\phi\) function
in forming an unconditional moment restriction and constructing the efficient GMM estimator where the entries of \(Z_{t-2}\) are in the date \(t-2\) conditioning information set (\({\mathfrak A}_{t-2}\) measurable). Now form:
pertinent to the central limit approximation. Then an efficient selection matrix \({\mathbb A}^d\) is given by:
The matrix \({\mathbb A}^d\) is typically not proportional to a vector of regression coefficients of \(Y_t^2\) on \(Z_{t-2}\), as presumed in the justification for two-stage least squares. The temporal dependence removes this connection. Hence, a commonly-used measure of “instrument relevance” obtained by regressing the endogenous \(Y_t^2\) onto \(Z_{t-2}\) is no longer a valid diagnostic.
A special case of this analysis is when \(Y_t^2\) is in the date \(t-2\) conditioning information set. One estimator could be constructed by setting \(Z_{t-2} = Y_t^2\). The standard two-stage least squares estimator now is simply the ordinary least squares estimator. Next expand \(Z_{t-2}\) to be:
The \({\mathbb A}^d\) selection matrix will typically include \(Y_{t-1}^2\) in the estimation in spite of the fact that this variable is not needed in an initial least squares regression of \(Y_t^2\) onto \(Z_{t-2}\).
Since there is great flexibility in the construction of \(Z_{t-\ell}\), there is further scope for efficiency gains attained by using the conditional moment formulation of the family of GMM estimators. [Hansen and Singleton, 1996] and [West, 2001] proposed specific approaches for constructing and approximating the efficiency bound in Example 13.2 for models that are linear in the variables, in explicit time series settings.
13.6. Approximate inference for testing
Using an entirely analogous approach, we derive limit approximations geared to testing any “over-identifying restrictions.” Let \(B = \{ B_t : t \ge 0\}\) be an \(r \times {\tilde k}\) selection process constructed to test the following vector of \(\tilde{k}\) means.
Restriction 13.2. For any \({\tilde k} \times k\) matrix of real numbers \({\mathbb K}\), \( B {\mathbb K} \in {\mathcal A}\).
Thus, we can build selection processes for testing equations from the columns of the process \(B\).
Suppose that
converges in mean square so that we can apply a central limit approximation.
Construct
Since Restriction 13.2 is satisfied, notice that
for all \({\tilde k} \times k\) matrices \({\mathbb K}\) of real numbers.
By imitating an earlier argument
This formula includes an explicit adjustment for estimation of \(\beta\). Notice that if \(A_t = B_t\), then the right side is zero and the limiting distribution is degenerate. This approximation is used to construct tests that account for having used GMM to estimate a parameter vector \(\beta\).
Example 13.1 (cont.)
Consider again unconditional moment restrictions specified in Example 13.1. Let the selection process for testing be constant over time so that \(B_t = {\mathbb B}\). Then
Remark 13.5
Moment-matching is also accompanied by some form of testing. This practice often occurs in the study of dynamic economic models, but it is also prevalent in other disciplines as the following quote illustrates:
Some hydrologists have suggested a two-step calibration scheme in which the available dependent data set is divided into two parts. In the first step, the independent parameters of the model are adjusted to reproduce the first part of the data. Then in the second step the model is run and the results are compared with the second part of the data. In this scheme, the first step is labeled “calibration,” and the second step is labeled “verification”. … The use of the term verification in this context is highly misleading … .[Oreskes et al., 1994]
[Oreskes et al., 1994] prefer the term “confirmation” for this second step. While not dismissing the second step, the authors are quick to remind readers of its limitations. The formulas just presented for testing support this step.
13.7. Statistical tests based on efficient GMM estimators
First suppose that we have a statistically efficient selection process. Thus the selection matrix is some nonsingular matrix transformation of \(A^d\) where \(A^d\) satisfies the “first-order conditions” (13.3). Recall the approximation
which includes an adjustment for estimation. Let \({\widetilde G}_t(B)\) denote the increment in the martingale approximation for
We use the representation implied by (13.3) to write:
These constructions allow us to write:
where
The term, \({\widehat G}_t(B)\), that appears inside the sum on the right side of (13.7) is the population least squares residual from regressing \({\widetilde G}_t(B)\) onto \(G_t(A^d)\). This regression residual can also be interpreted as a martingale increment for a stationary increments process.
Suppose that \({\widehat G}_t(B)\) has a nonsingular covariance matrix. Consider the quadratic form used for building a test:
Example 13.1 (cont.)
Consider Example 13.1 again. We have already shown that
Suppose we let \(B_t = {\mathbb I}\) as a special case. Then
Other choices of \({\mathbb B}\) can be deduced through a premultiplication.
Remark 13.6
To continue our study of Example 13.1, form the population problem:
This has a minimizer at \(b = \beta\) provided that the unconditional moment conditions are satisfied. If \(b = \beta\) is the only possible parameter vector that satisfies the population moment conditions, then \(b = \beta\) is the unique solution to the population minimization problem stated here. Suppose that we construct an estimator by solving a minimization problem:
First-order necessary conditions are
Assume that we already know that the solution \(b_N\) of the above first-order conditions provides a consistent estimator of parameter vector \(\beta\). Then we can show that
where convergence is with probability one. Thus, in this case the implied selection matrix
provides an estimator that attains the efficiency bound.
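A minimal sketch of the resulting two-step procedure follows, for a hypothetical linear instrumental-variables model with serially uncorrelated, conditionally heteroskedastic disturbances. It assumes that the sample objective is the quadratic form \(g_N(b)' {\mathbb W} g_N(b)\) in the sample moments \(g_N(b)\), with the weighting matrix in the second step set to the inverse of an estimated covariance matrix. The simulated data and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, r = 4000, 3

Z = rng.normal(size=(N, r))                              # instruments
u = rng.normal(size=N)
x = Z @ np.array([1.0, 0.5, 0.5]) + u                    # endogenous regressor
beta = 1.5                                               # hypothetical true parameter
e = (0.5 * u + rng.normal(size=N)) * np.sqrt(0.5 + Z[:, 0] ** 2)   # heteroskedastic
y = beta * x + e
X = x[:, None]

def linear_gmm(W):
    """Minimize g_N(b)' W g_N(b) with g_N(b) = Z'(y - X b)/N (closed form when linear)."""
    ZX, Zy = Z.T @ X / N, Z.T @ y / N
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

# Step 1: a consistent but generally inefficient estimator
b1 = linear_gmm(np.linalg.inv(Z.T @ Z / N))

# Step 2: reweight with the inverse of the estimated covariance of phi(X_t, beta)
phi = Z * (y - X @ b1)[:, None]
V_hat = phi.T @ phi / N
W = np.linalg.inv(V_hat)
b2 = linear_gmm(W)

g = Z.T @ (y - X @ b2) / N                               # sample moments at b2
foc = (Z.T @ X / N).T @ W @ g                            # first-order conditions, near zero
J = N * g @ W @ g                                        # scaled minimized objective
print(b1, b2, J)
```

The implied selection matrix in the second step is `W @ (Z.T @ X / N)`, the sample analogue of the efficient choice, and `J` is the scaled minimized objective that underlies a test of the \(r - k\) over-identifying restrictions.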
Remark 13.7
There is an interesting variation of the approach described in Remark 13.6. For any \(b\), let \(\mathbb{V}(b)\) be the population covariance matrix in the martingale increment used in the Central Limit approximation for the process
Assume that \(\mathbb{V}(b)\) is nonsingular for every \(b\) in a parameter space. Form the population minimization problem:
If \(b = \beta\) is the only vector that satisfies the associated population first-order conditions, then \(b = \beta\) is again the unique solution to the above population minimization problem.
Now form sample counterparts of both \(E \left[\phi(X_t,b)\right]\) and \(\mathbb{V}(b)\) as functions of \(b\). Minimizing a sample counterpart of the above population minimization problem gives rise to a “continuously-updated GMM estimator”. See [Hansen et al., 1996]. The parameter vector and an appropriately scaled minimized objective function have the same limiting distributions as those described in Remark 13.6.
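The continuously-updated estimator can be sketched for a scalar parameter with a simple grid search. The example assumes a hypothetical linear instrumental-variables model with serially uncorrelated moment contributions, so that a sample covariance matrix of \(\phi(X_t, b)\) serves as the counterpart of \(\mathbb{V}(b)\); with temporal dependence, a long-run covariance estimate would be needed instead. The data and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 4000

Z = rng.normal(size=(N, 3))                      # instruments
u = rng.normal(size=N)
x = Z @ np.array([1.0, 0.5, 0.5]) + u            # endogenous regressor
y = 1.5 * x + 0.5 * u + rng.normal(size=N)       # hypothetical true parameter is 1.5

def cue_objective(b):
    """N * g_N(b)' V_N(b)^{-1} g_N(b), with the weighting re-estimated at each b."""
    phi = Z * (y - b * x)[:, None]               # N x r moment contributions at b
    g = phi.mean(axis=0)
    V_b = np.cov(phi, rowvar=False, bias=True)   # sample counterpart of V(b)
    return N * g @ np.linalg.solve(V_b, g)

grid = np.linspace(1.0, 2.0, 201)
values = np.array([cue_objective(b) for b in grid])
b_cue = grid[values.argmin()]
print(b_cue, values.min())
```

Plotting `values` against `grid` shows the shape of the objective, which the simulation-based methods discussed in the next remark exploit.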
Remark 13.8
It can be numerically challenging to find minimizers to the continuously-updated GMM estimator in high dimensions. Similarly, it can be difficult to construct confidence sets based on this objective. [Chernozhukov and Hong, 2003] and [Chen et al., 2018] devise and justify simulation-based methods for inference applicable to the continuously-updated GMM objective function. They do so by adapting insights from simulation-based approaches for Bayesian inference. The Bayesian-type calculations can be numerically more tractable. [Chernozhukov and Hong, 2003] treat the continuously-updated objective function as a “log-likelihood function”, and then use large sample approximations to justify the Bayesian-like calculations. They justify this approach even though the continuously-updated objective is not formally a log-likelihood function. [Chen et al., 2018] show how to modify this approach to attain additional robustness and reliability. An attractive feature of these methods is that they exploit the shape of the continuously-updated GMM objective function, but in so doing they impose a “prior” distribution to help guide this exploration.
13.8. Quantifying model misspecification and its ramifications.
In our development of statistical tests of model misspecification, we deduced limiting distributions of test statistics when the model was correctly specified. We now conduct our investigation allowing for the model to be misspecified and deducing some of its consequences. In contrast to much of the econometrics literature, we presume that the misspecification is global (not local) in nature. We choose the global perspective because it better supports interpretations of the potential misspecifications. Specifically, we modify the target of estimation to include an estimate of the underlying probability distribution. At the same time as we estimate the parameter vector, we estimate the distribution of the observable data that supports this estimation. We relax the assumption that the moment conditions are satisfied under the data generating process and instead find distributions that are potentially statistically close to the data generating process. In many models of interest, the expectations used for the moment conditions could be the subjective beliefs of the agents “inside the model” with expectations that differ from the actual data generating process.
There is a substantial literature on what is called generalized empirical likelihood that addresses this. Its primary motivation is to improve second-order efficiency, which we do not study here. Since this literature features a finite number of unconditional moment conditions, these second-order efficiency gains are applied to a second-best formulation of statistical efficiency.
Instead of focusing on refined inferences, we follow a suggestion in [Hansen, 2014] by thinking of the beliefs as those of economic agents within the models we build. This builds on an idea suggested by [Brown and Back, 1993] to use the implied probabilities from a GMM estimation to isolate the aspect of the empirical distribution of the data that is potentially most problematic from the perspective of the model that is being estimated.[4] While our suggestions for implementation differ from theirs, there is an overlapping aim. In what follows, we use two special cases of the generalized empirical likelihood methods as diagnostics that are informative about potential misspecification.
13.8.1. Relative entropy divergence
We temporarily hold fixed a hypothetical parameter value \(b,\) and we allow the model to be misspecified.
Consider the following population:
Here the random variable \(M\) is used as a potential change in the probability measure. The minimization problem restricts \(M\) so that the moment conditions are satisfied under the change in probabilities. The outcome is a minimum divergence measure supported by an \(M^*\) that achieves this minimum divergence. Should the original model be correctly specified, this solution will be \(M^* = 1\); under misspecification, however, this will not be the case. Thus the minimized objective is a measure of model misspecification. With unknown parameters, the bound can be made smaller by searching over the parameter space.
To solve the minimization problem (holding fixed \(b\)), we introduce multipliers \(\lambda\) and \(\zeta\) on the two constraints. The Lagrangian minimization problem separates so that we may solve it without taking expectations. The first-order conditions for this Lagrangian are:
Thus
Plugging this solution back into the Lagrangian gives
First, maximize over \(\zeta\) after substituting formula (13.10):
The first-order conditions are:
implying that
Substituting this back into the objective (13.11) gives the objective to be maximized with respect to \( \lambda\):
Thus the counterpart to a misspecified population generalized moment problem solves the min-max problem:
Under a correct model specification, [Kitamura and Stutzer, 1997] establish that the minimizing solution to a sample counterpart to this population objective has the same first-order properties as an efficient GMM estimator when \(\{ \phi(X_t, \beta) : t \ge 0 \}\) is itself a martingale difference sequence. But our interest in these computations is to provide a measure of misspecification and an assessment of which aspects of the observations are most challenging to the model under investigation.
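A sample analogue of the inner maximization over \(\lambda\) can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: it takes the divergence to be \(E[M \log M]\), takes the concentrated objective to be \(-\log E[\exp(-\lambda \cdot \phi(X_t, b))]\) (the sign convention for \(\lambda\) is our assumption), and replaces population expectations with sample averages over simulated draws. The function name `tilt_multiplier`, the simulated array `phi`, and the use of Newton's method are all hypothetical choices made for illustration.

```python
import numpy as np

def tilt_multiplier(phi, n_iter=50, tol=1e-12):
    """Minimize log mean(exp(-phi @ lam)) over lam by Newton's method.

    phi : (N, r) array of moment functions evaluated at a fixed b.
    Returns lam, the implied change of measure M (sample mean one),
    and the resulting relative entropy divergence.
    """
    n, r = phi.shape
    lam = np.zeros(r)
    for _ in range(n_iter):
        z = -phi @ lam
        w = np.exp(z - z.max())
        w /= w.sum()                      # exponentially tilted probabilities
        mean_w = w @ phi                  # mean of phi under the tilted weights
        grad = -mean_w
        hess = (phi * w[:, None]).T @ phi - np.outer(mean_w, mean_w)
        step = np.linalg.solve(hess, grad)
        lam = lam - step
        if np.max(np.abs(step)) < tol:
            break
    z = -phi @ lam
    m = np.exp(z - z.max())
    m /= m.mean()                         # normalize so that mean(M) = 1
    divergence = -(np.log(np.mean(np.exp(z - z.max()))) + z.max())
    return lam, m, divergence

rng = np.random.default_rng(0)
# Simulated "misspecified" moments: sample means are away from zero.
phi = rng.normal(loc=[0.3, -0.2], scale=1.0, size=(500, 2))
lam, M, divergence = tilt_multiplier(phi)
```

In this sketch the implied probabilities `M / len(M)` reweight the observations so that the sample moment conditions hold exactly. Observations that receive weights far from one are the ones the model finds most difficult to accommodate.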
13.8.2. Quadratic divergence#
Consider a counterpart with a quadratic divergence to the problem that we just analyzed:
Given multipliers \(\lambda\) and \(\zeta\) on the constraints, and abstracting from the nonnegativity constraint on \(M,\) the first-order conditions for the inner minimization are:
Thus
provided that the right-hand side is nonnegative. Otherwise \(M(b, \lambda, \zeta)=0\). This leads us to write:
where \([\cdot]_+\) imposes zero when the argument is negative. We then solve for the multipliers \(\lambda\) and \(\zeta\) by requiring that \(M(b, \lambda, \zeta)\) have mean one and that the moment conditions are satisfied under the implied change in probability measure. We do not have an analytical solution, but solve
using a numerical method. When there are unknown parameters, we include a minimization over \(b \in {\mathcal P}.\)
As an intellectual curiosity, suppose we dispense with the restriction that \(M(b, \lambda, \zeta)\) be nonnegative. Restricting it to have mean one implies that:
and thus
Let
which is the covariance matrix of the random vector \(\phi(X_t, b)\). Consistent with our previous analysis, we restrict the moment conditions not to be redundant so that \({\mathbb W}(b)\) is nonsingular. Note that \({\mathbb W}(b)\) is only the pertinent asymptotic covariance matrix for a central limit approximation when \(\{\phi(X_t, b)\}\) is the first-difference of a martingale. Imposing the constraint that, under the probability measure induced by \(M(b, \lambda)\), \(\phi(X_t, b)\) has mean zero gives the formula:
Thus
and the \(b\) of interest is determined by:
This is recognizable as the population counterpart to the continuously-updated objective for GMM proposed by [Hansen et al., 1996] for the special case in which \({\mathbb W}(b) = {\mathbb V}(b).\)
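The closed-form case without the nonnegativity restriction is easy to check numerically. The sketch below assumes the quadratic divergence is scaled as \(\frac{1}{2} E[(M-1)^2]\) and that \(M = 1 - \lambda \cdot (\phi - E\phi)\) with \(\lambda = {\mathbb W}(b)^{-1} E\phi\); both the scaling and the sign convention are our assumptions, and the simulated array `phi` stands in for \(\phi(X_t, b)\) at a fixed \(b\).

```python
import numpy as np

rng = np.random.default_rng(1)
phi = rng.normal(loc=[0.3, -0.2], scale=1.0, size=(400, 2))  # phi(X_t, b), fixed b

phi_bar = phi.mean(axis=0)                 # sample analogue of E[phi]
W = np.cov(phi, rowvar=False, bias=True)   # sample analogue of W(b)
lam = np.linalg.solve(W, phi_bar)          # multiplier on the moment conditions

# Change of measure without the nonnegativity restriction.
M = 1.0 - (phi - phi_bar) @ lam

divergence = 0.5 * np.mean((M - 1.0) ** 2)
cu_objective = 0.5 * phi_bar @ np.linalg.solve(W, phi_bar)
```

Under these assumptions `divergence` and `cu_objective` coincide, which is the sample version of the link to the continuously-updated criterion. Nothing in this construction prevents some entries of `M` from being negative, which is why the restricted problem requires a numerical solution.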
Remark 13.9
As one way to confront, approximately, forms of weak dependence, we could apply the same population calculations but replace \(\phi(X_t, b)\) with
and repeat the computations with the quadratic and relative entropy divergences. This extends the approach of [Bartlett, 1950] described in this remark to a measure of misspecification. In practice, it presumes that \({\underline N}\) be substantially less than the sample size and that the number of coordinates of \(\phi\) not be too large.
13.8.3. Bounding expectations under model misspecification#
So far, we have used a statistical divergence to identify an adjustment to the data generating process that allows for the moment conditions to be satisfied. Following a formulation of [Chen et al., 2020], we now make two changes. First we represent changes in the underlying probabilities as changes in expectations relative to the actual data generation. Second we replace the minimization with a weaker divergence inequality via an “ambiguity set” of expectations.
Consider the relative entropy divergence. A probability distribution is characterized by the expectations it assigns to a rich class of scalar functions, \({h},\) of the underlying random vector \(X_t\). When this class is sufficiently rich, the expectations determine the probabilities. We form a relative entropy bound by taking the minimum divergence outcome as a starting point and inflating it by some percentage. Call this outcome \(\kappa\). For a hypothetical parameter vector, \(b\), solve
This provides a sharp lower bound on \(E[M h(X_t) ]\). To get a sharp upper bound, repeat the same calculation by computing a sharp lower bound for \(- E[M h(X_t)]\) and multiply the outcome by minus one. Observe that by inflating \(\kappa\) we no longer identify a unique \(M\), as there is a convex set of \(M\)’s that satisfy the constraints. The outcomes of the computations are upper and lower bounds on expectations.
Remark 13.10
The minimization over alternative \(M\)’s for alternative \(h\)’s gives an example of what [Peng, 2004] calls a nonlinear expectation operator (with argument \(h\)). This expectation emerges under a variety of alternative specifications of ambiguity.
We leverage our previous calculations by proceeding differently. First solve the minimum divergence problem and compute the implied expectation \(E[M h (X_t)] = \underline{\sf{r}}.\) Then solve:
for \(\sf{r} > \underline{ \sf{r}}\). The added constraint will induce a larger divergence bound. By increasing \(\sf{r}\) we may attain a divergence of \(\kappa\).
With this formulation, we have an immediate extension of our previous analysis leading to the following problem
where we now allow the function \(h\) to depend on the parameter vector, \(b\). We use \(\nu\) to denote the multiplier on the constraint:
As a special case, we may set \(h\) equal to one of the entries of \(b\) and deduce implied upper and lower bounds on the different parameter coefficients and hence deduce an ambiguity interval for one of the components of the parameter vector.
[Chen et al., 2024] describe inferential methods that support such an analysis expressed in terms of large sample approximations.
Remark 13.11
So far, we have treated this as an unconditional problem including both the moment conditions and the divergence measures. [Chen et al., 2020] show how to extend this analysis to the case with conditional moment restrictions along with an intertemporal measure of statistical divergence.
13.9. Refinements and extensions#
In this final section, we provide an informal analysis of some related GMM applications and direct extensions with similar formulations.
13.9.1. Decomposing the GMM moment conditions#
We focus exclusively on the case of unconditional moment restrictions (see Example 13.1). Factor the inverse asymptotic covariance matrix:
and thus
As a consequence, under the Central Limit approximation
where \(\Rightarrow\) denotes convergence in distribution. We now return to the formula (13.8) pertinent for testing using an efficient GMM estimator. Premultiplying by \(\Lambda\) gives
Form
Using these constructions, note that the two matrices
are idempotent. The product of the two matrices is a matrix of zeros and the sum of the two is \({\mathbb I}\). Idempotent matrices have eigenvalues that are either zero or one. The first of the two matrices in (13.16) has rank \(k\) with \(r-k\) zero eigenvalues and the second one has rank \(r-k\) with \(k\) zero eigenvalues.
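These properties of the two matrices can be verified numerically. Since the displayed formulas are not reproduced here, the sketch below makes its own assumptions: \({\mathbb V}\) is an \(r \times r\) positive definite matrix, `D` is a hypothetical \(r \times k\) full-column-rank Jacobian of the moment conditions, \(\Lambda\) satisfies \(\Lambda'\Lambda = {\mathbb V}^{-1}\), and the first matrix is the projection onto the column space of \(\Lambda D\) while the second is its complement.

```python
import numpy as np

rng = np.random.default_rng(2)
r, k = 5, 2

# Hypothetical inputs: a positive definite covariance matrix V (r x r)
# and a full-column-rank Jacobian D (r x k).
A = rng.normal(size=(r, r))
V = A @ A.T + r * np.eye(r)
D = rng.normal(size=(r, k))

# Factor the inverse covariance matrix as Lambda' Lambda = V^{-1}.
L = np.linalg.cholesky(np.linalg.inv(V))   # L @ L.T = V^{-1}
Lam = L.T

LD = Lam @ D
P = LD @ np.linalg.solve(LD.T @ LD, LD.T)  # projection onto columns of Lambda D
Q = np.eye(r) - P                          # complementary projection

rank_P = np.linalg.matrix_rank(P)
rank_Q = np.linalg.matrix_rank(Q)
eigs_P = np.sort(np.linalg.eigvalsh(P))
```

With these inputs, `P` and `Q` are idempotent, multiply to zero, and sum to the identity. Their ranks are \(k\) and \(r-k\), which are the degrees of freedom of the two chi-square terms.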
Given this construction and factorization (13.14)
The term on the left-hand side of the equality is distributed asymptotically as a chi-square with \(r\) degrees of freedom. The first term on the right-hand side is distributed asymptotically as a chi-square with \(k\) degrees of freedom. This limiting distribution can be used to form confidence sets for the underlying parameter vector. The second term on the right-hand side adjusts the quadratic form minimization with weighting matrix \({\mathbb V}^{-1}\) for parameter estimation as we discussed previously.
To elaborate, consider the second term on the right-hand side of (13.17). Using the fact that
is idempotent along with approximation (13.15),
has an asymptotic chi-square distribution with \(r-k\) degrees of freedom.
Returning to the first term of (13.17), use the asymptotic approximation to the minimized quadratic form and construct the confidence set:
In this set construction, \(\kappa\) is a critical value determined from a chi-square distribution with \(k\) degrees of freedom. While the direct computation of such a set is challenging, [Chen et al., 2018] suggest a simulation-based alternative that is often easier to implement.
Previously, we observed that the minimizer using \({\mathbb V}^{-1}\) as a weighting matrix in a quadratic form minimization (13.9) yielded an estimator that attained the GMM efficiency bound. With the results just deduced, the limiting distribution of the minimized criterion in (13.9) is \(\chi^2\) with \(r - k\) degrees of freedom. Thus by solving this minimization problem we achieve both GMM efficiency and a test of the over-identifying restrictions.
While our derivations used the population covariance matrix \({\mathbb V}\), this is typically not feasible to obtain in practice. The analogous limit approximations work using a consistent estimator of \({\mathbb V}\), allowing this estimator to depend explicitly on the parameter vector \(b\) under consideration as is the case for the continuously-updated GMM approach.
13.9.2. Indirect inference#
We previously discussed moment-matching as an approach to calibration. Indirect inference is an extension of moment matching and was initially proposed by [Smith, 1993] and [Gourieroux et al., 1993]. It works with two probability models: (1) a “structural economic model” with a vector of parameters \(\beta\) that characterize preferences, technology, information flows, and other features of the theoretical economic model; and (2) an “auxiliary model” with no pretense of being “structural” in terms of economic theory and that depends on a vector of parameters \(\delta\). The parameter vector \(\delta\) is assumed to have \(r \ge k\) dimensions where \(k\) is the dimension of the structural parameters of interest. Although the structural model can be solved and simulated on a computer, it is presumed to be too complicated to allow for a tractable characterization of its likelihood process. In contrast, the auxiliary model is chosen in part so that the likelihood process for the auxiliary model can be calculated and maximized conveniently. Even though the auxiliary model gives an incorrect accounting for the data generation, the misspecified likelihood estimator has large sample properties that can be characterized. The vector \(\delta\) is formally the large-sample limit of the misspecified likelihood estimator. Given that the underlying model provides a correctly specified full accounting of the data, we may write \(\delta(\beta)\). [Gourieroux et al., 1993] define this to be the binding function, which they take to be tractable to compute via simulation methods. Indirect inference proceeds as follows:
Construct the \(r\)-dimensional vector \({\hat \delta}_N\) from data as the estimate from a misspecified likelihood implementation and treat this as the counterpart to a vector of empirically constructed target moments.
Construct a binding function \(\delta(\beta)\) using a numerical solution of the economic model of interest as a function of the unknown parameters.
Construct a family of estimators of \(\beta\) with alternative \(r \times k\) selection matrices, \({\mathbb A}\), and find an admissible parameter value, \(b_N,\) so that \({\mathbb A}'{\hat \delta}_N = {\mathbb A}' \delta(b_N)\).
Use the GMM counterpart with a choice of a selection matrix or weighting matrix that attains the efficiency bound for this class of estimators.
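The steps above can be illustrated with a deliberately simple pair of models. Everything specific in this sketch is an assumption made for illustration: the "structural" model is a first-order moving average with parameter \(\theta\), the auxiliary model is a first-order autoregression estimated by least squares, and \(r = k = 1\), so the selection matrix plays no role and the last step is trivial. The binding function is approximated by simulation on a grid of candidate parameter values.

```python
import numpy as np

def simulate_ma1(theta, shocks):
    """Structural model (illustrative): y_t = e_t + theta * e_{t-1}."""
    return shocks[1:] + theta * shocks[:-1]

def ar1_coefficient(y):
    """Auxiliary model: least-squares slope of y_t on y_{t-1} (no intercept)."""
    return (y[1:] @ y[:-1]) / (y[:-1] @ y[:-1])

rng = np.random.default_rng(3)
theta_true = 0.5

# Step 1: auxiliary estimate from the "observed" data.
data = simulate_ma1(theta_true, rng.normal(size=5_001))
delta_hat = ar1_coefficient(data)

# Step 2: binding function delta(b) approximated by one long simulation,
# reusing the same shocks for every b (common random numbers).
sim_shocks = rng.normal(size=50_001)
grid = np.linspace(0.0, 0.9, 91)
binding = np.array([ar1_coefficient(simulate_ma1(b, sim_shocks)) for b in grid])

# Step 3: with r = k = 1, pick b_N so that delta_hat matches delta(b_N).
b_N = grid[np.argmin((delta_hat - binding) ** 2)]
```

Reusing the same simulated shocks across grid points keeps the approximate binding function smooth in \(b\). In an over-identified application, the grid search would be replaced by minimizing a weighted quadratic form in \({\hat \delta}_N - \delta(b)\).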
As an initial example, [Smith, 1993] used a vector autoregression as an auxiliary model while the underlying structural model had a nonlinear time series evolution. [Gourieroux et al., 1993] provide other examples. The estimated value of the parameter vector \(\delta\) of the auxiliary model is used in place of the vector of the sample estimator of the target moments in the moment-matching approach. Simulation methods provide an approximation to the limiting distribution of the estimator of \(\delta\).
[Gallant and Tauchen, 1996] subsequently proposed a related approach. Instead of using the estimated values of \(\delta\) as the target moments, they suggest using the implied score process for the auxiliary model divided by the sample size as the counterpart to the sample moments. Again the large sample limiting properties can be computed via simulation. [Gallant and Tauchen, 1996] emphasize that if the auxiliary model is a good approximation to the “reduced-form” representation of the actual data generation, then the approach will be nearly statistically efficient in the sense of Fisher Information. See Chapter 6.6 for a discussion of score processes and Fisher Information.
13.9.3. Multiperiod forecasting#
Consider estimating a multiperiod forecasting equation:
where
One common approach is to let \(A_t = Z_{t - \ell}\). In this case, the implied estimator of \(\beta\) is just the least squares estimator:
While we may nest this estimator in the family of GMM estimators, the limiting distribution of the least squares estimator will typically not attain the GMM efficiency bound.
The more efficient GMM estimators bring in potential contributions to the moment conditions from \(Z_{t-i}\) for \(i > \ell\). Forward shifts in time such as \(Z_{t-i}\) for \(i < \ell\) cannot be used because they could be correlated with the forecast error:
Estimators such as generalized least squares that correct for serial correlation will often not even be statistically consistent, because they presume that the forecast error is uncorrelated with the \(Z\)’s at all leads and lags.
Example 13.3
In Chapter 1 (see Example 1.8), we introduced a class of moving-average models:
where \(\{W_t : -\infty < t < \infty \}\) is a vector stationary process for which
and
for all \(-\infty < t < +\infty\).
Following [Frisch, 1933] and [Sims, 1980], suppose we view the coefficients \(\alpha_j\) as a target of estimation. [Sims, 1980]’s suggestion is to infer these coefficients from a finite-order vector autoregression, inclusive of shock identification. A literature on local projections initially advocated by [Jordà, 2005] proceeds differently where \(\alpha_\ell\) is based on regressions of \(Y_{t+\ell}\) onto \(W_t\).
In what follows we sketch and contrast two approaches to implementation. For simplicity, we abstract from the actual construction of \(W_{t+1}\), but instead just presume it can be identified and inferred from data. Consider the forecasting equation:
to be estimated by least squares. We could use the methods described in this chapter to study that estimation problem by starting with the unconditional moment restriction:
Notice that we have not included information prior to date \(t\) in this moment implication, and thus we do not have the simplicity needed to view this as a multi-period conditional moment restriction.
[Montiel Olea and Plagborg-Møller, 2021] retrieved this simplicity and more by adding controls for past information. Imagine that we could construct:
and construct the \(\ell\)-period conditional moment condition:
where \(\alpha_\ell\) continues to be the target of estimation. This implies the least-squares unconditional moment restriction:
While this approach allows us to pose the estimation challenge as a multi-period forecasting problem, it is infeasible, and a bit nonsensical, because \({\overline Y}_{t-1}\) is clearly not known. Suppose, however, that we form a finite-dimensional proxy:
where \(Z_{t-1}\) is observable but \(\pi_\ell\) is unknown. We now treat \(\pi_\ell\) as an additional parameter to be estimated along with \(\alpha_\ell\) in a least squares regression. Moreover,
so when conceived of as regressors, \(W_t\) and \(Z_{t-1}\) are orthogonal. Thus we may proceed in two steps, which is useful pedagogically. First regress \(Y_{t+\ell}\) onto \(Z_{t-1}\) to estimate \(\pi_\ell\), and then regress the residual onto \(W_t\) to estimate the target of interest, \(\alpha_\ell\). By incorporating \(Z_{t-1}\) into the regression framework, we control for information embedded in shocks prior to date \(t\). As is standard in least-squares theory, this two-step approach is equivalent to running a multi-period regression of \(Y_{t+\ell}\) onto \(W_t\) and \(Z_{t-1}\). [Montiel Olea and Plagborg-Møller, 2021] showed that this formulation leads to a further simplification of the limiting distribution for the estimator of \(\alpha_\ell\) by strengthening the restrictions on the shock process \(\{W_{t+1}\}\).[5]
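The lag-augmented regression can be sketched with simulated data. The design here is hypothetical: a scalar shock \(W_t\) that is treated as observed, responses \(\alpha_j = 0.8^j\) truncated at 30 lags, horizon \(\ell = 2\), and a proxy \(Z_{t-1}\) made up of two lags of \(Y\). The orthogonality of \(W_t\) and \(Z_{t-1}\) holds in population, so in a finite sample the joint and two-step estimates are close but not identical.

```python
import numpy as np

rng = np.random.default_rng(4)
T, J, ell = 20_000, 30, 2
alpha = 0.8 ** np.arange(J + 1)          # hypothetical true responses alpha_j

# Scalar moving average Y_t = sum_j alpha_j W_{t-j}, truncated at J lags.
W = rng.normal(size=T + J)
Y = np.convolve(W, alpha, mode="valid")  # Y[i] pairs with the shock W[i + J]
W = W[J:]                                # align so that W[i] is the date-i shock

# Regression sample: dates t with two lags of Y available and t + ell in sample.
t = np.arange(2, T - ell)
y_lead = Y[t + ell]
shock = W[t]
Z = np.column_stack([Y[t - 1], Y[t - 2]])   # observable proxy Z_{t-1}

# Joint least squares of Y_{t+ell} on W_t and Z_{t-1}.
X = np.column_stack([shock, Z])
alpha_joint = np.linalg.lstsq(X, y_lead, rcond=None)[0][0]

# Two-step version: project on Z_{t-1} first, then regress the residual on W_t.
pi_hat = np.linalg.lstsq(Z, y_lead, rcond=None)[0]
resid = y_lead - Z @ pi_hat
alpha_two_step = (shock @ resid) / (shock @ shock)
```

Both `alpha_joint` and `alpha_two_step` should be near the assumed value `alpha[ell]`. The sketch says nothing about standard errors, which is where the stronger restrictions on the shock process matter.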
This approach has a direct extension to the case in which interest is in the responses for just one of the entries of \(W_{t}.\) Additional observable proxies need to be included as regressors that capture the same incremental information as the shocks that are not included in the regression.
13.9.4. Multiperiod conditional moment restrictions#
Consider a linear specification in which \(Y_t\) is a \(k + 1\) dimensional vector of variables, \(\alpha\) is a \(k + 1\) dimensional parameter vector, and the following conditional moment restriction holds:
Construct a vector of \(r\) unconditional moment restrictions as follows. Let \(Z_{t-\ell}\) be an \(r \times 1\) vector of variables available at date \(t - \ell\). Then
In other words, \(\alpha\) is in the null space of the \(r \times (k+1)\) matrix:
Moreover, \(\alpha\) is only identified up to scale. A common approach to identification is to impose a one on an entry in the \(\alpha\) vector. Notice that when \(r \ge k+1\), the model imposes a reduced-rank restriction on the matrix given in (13.18). We may test this using a minimized version of the continuously-updated weighting-matrix version of a GMM estimator. The continuously-updated objective will be homogeneous of degree zero in \(\alpha\) and insensitive to which coefficient is normalized to one. Since the identification is only up to scale, the resulting chi-square test will have \(r - k\) degrees of freedom.
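The homogeneity property is straightforward to confirm numerically. The sketch below assumes \(\ell = 1\), so that no serial-correlation correction is needed, and uses the centered sample covariance of \(Z_{t-1} (Y_t \cdot \alpha)\) as the weighting matrix. The function `cu_objective` and the simulated data are hypothetical and serve only to show that rescaling \(\alpha\) leaves the criterion unchanged.

```python
import numpy as np

def cu_objective(alpha, Y, Z):
    """Continuously-updated GMM criterion for E[Z_{t-1} (Y_t . alpha)] = 0.

    Y : (N, k+1) array of Y_t;  Z : (N, r) array of Z_{t-1}, row-aligned.
    The weighting matrix is the sample covariance of the moment functions
    evaluated at the same alpha (no serial-correlation correction).
    """
    f = Z * (Y @ alpha)[:, None]           # moment functions, one row per date
    g = f.mean(axis=0)
    V = np.cov(f, rowvar=False, bias=True)
    return len(f) * g @ np.linalg.solve(V, g)

rng = np.random.default_rng(5)
N, r = 1_000, 4
Z = rng.normal(size=(N, r))
Y = np.column_stack([Z @ rng.normal(size=r), Z @ rng.normal(size=r)])
Y += rng.normal(size=(N, 2))

alpha = np.array([1.0, -0.7])              # normalization on the first entry
rescaled = alpha / alpha[1]                # same direction, second entry set to one

J_first = cu_objective(alpha, Y, Z)
J_second = cu_objective(rescaled, Y, Z)
```

Rescaling \(\alpha\) by a constant \(c\) scales the sample moments by \(c\) and the weighting matrix by \(c^2\), so the two factors cancel in the quadratic form. A fixed weighting matrix would not have this property.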
Example 13.4
Suppose that \(\ell = 1\) and write:
where \(Y_t^1\) and \(Y_t^2\) are scalars. Let \(Y_t^1\) be the logarithm of the return on the wealth portfolio, and let \(Y_t^2\) be the first difference in the logarithm of consumption. Treat both as endogenous random variables. For risk aversion equal to one in a recursive utility specification as in representation homog1a in Chapter 11,
abstracting from a constant term. (See [Epstein and Zin, 1991] for an initial citation for this result.) The parameter of interest, \(\rho\), is the inverse of the intertemporal substitution elasticity. Notice that the normalization is imposed on the first element of \(\alpha\):
and \(\rho\) becomes the target of estimation.
Alternatively, provided that \(\rho \ne 1\), we could impose:
where the normalization is now imposed on the second entry of \(\alpha\) and \(1/\rho\) becomes the target of estimation. Estimates of \(\rho\) and \(1/\rho\) can be recovered with both approaches. When using the continuously-updated weighting matrix approach, estimation with each of the two normalizations will recover the same estimate of \(\rho\). This invariance property is not guaranteed for implementations of GMM that are extensions of two-stage least squares unless \(Z_{t-\ell}\) has only one component.
There is substantial research activity that investigates weak instruments, providing both tests and approximations for statistical inference. To explore the latter, econometricians often proceed “locally” by letting models change with sample size. Consider first testing whether the regression of \(Y_{t}^2\) onto \(Z_{t-1}\) has all zero coefficients, motivated, for instance, by the first normalization given in Example 13.4. Failure to reject this outcome is a sign of so-called “weak instruments”. Notice that this finding could well just suggest that \(\frac 1 {\rho - 1}\) is close to zero and hence \(\rho\) is very large. Had we adopted the second normalization instead, the weak instruments test would check if the regression coefficients of \(Y_t^1\) onto \(Z_{t-1}\) are zero. Now a failure to reject tells us that \(\rho\) could be very close to one. A weak instrument finding alone does not necessarily convey a lack of identifying information.
Following [Cragg and Donald, 1997] and [Arellano et al., 2012], we find it more fruitful to explore underidentification.[6] In terms of the example, this perspective suggests testing whether both sets of regression coefficients are zero. Under such a null hypothesis, there would be no restriction on \(\alpha\), for example. More generally, we can aim to identify two vectors \(\alpha_1\) and \(\alpha_2\), restricting them to be orthogonal, and take as our moment matrix:
subject to \(\alpha_1 \cdot \alpha_2 = 0\). A GMM-based test showing that it is hard to reject this specification is a warning that there might be underidentification.