12. GMM Estimation [1]#
Related papers:
Hurwicz (1966): On the Structural Form of Interdependent Systems
12.1. Introduction#
Generalized method of moments (GMM) estimation studies a family of estimators constructed from partially specified or partially misspecified models. Likelihood functions can sometimes be challenging to construct directly, and GMM methods may be tractable alternatives. By studying an entire family of estimators, we are able to make relative accuracy comparisons among its members.
This chapter takes statistical consistency as given. Supporting arguments can be obtained with direct extensions of the Law of Large Numbers. Such extensions often entail Laws of Large Numbers applied to random processes (indexed by, say, a parameter vector of interest) instead of to random vectors.
Throughout this chapter, we will condition on invariant events even though we will suppress this dependence when we write conditional expectations. Given the partially specified or misspecified nature of the model, much more than a simple parameter vector is reflected by this conditioning.
12.2. Formulation#
We study a family of GMM estimators of an unknown parameter vector \(\beta\) constructed from theoretical restrictions on conditional or unconditional moments of functions \(\phi\) that depend on \(\beta\) and on a random vector \(X_t\) that is observable to an econometrician.
As a starting point, we consider a class of restrictions large enough to include examples of both conditional and unconditional moment restrictions. Members of this class take the form
for all sequences of selection matrices \(A \in \mathcal A\) where \(A = \{A_t : t \ge 1\} \) and where
the vector of functions \(\phi\) is \(r\) dimensional.
the unknown parameter vector \(\beta\) is \(k\) dimensional.
\({\mathcal A}\) is a collection of sequences of (possibly random) selection matrices that characterize valid moment restrictions.
\(A_t\) denotes a time \(t\) selection matrix for a subset of the valid moment restrictions that is used to construct a particular statistical estimator \(b\) of \(\beta\).
the mathematical expectation is taken with respect to a statistical model that generates the \(\{X_t : t \ge 1 \}\) process (captured implicitly by conditioning on invariant events).
A sample counterpart of the population moment conditions (12.1) is
Applying a Law of Large Numbers to (12.2) motivates a “generalized method of moments” estimator \(b_N\) of the \(k \times 1\) vector \(\beta\).
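To fix ideas, the following sketch (not part of the text) computes such an estimator numerically for a toy just-identified case; the moment function `phi`, the simulated data, and the constant selection matrices are hypothetical placeholders.

```python
import numpy as np
from scipy.optimize import root

# Hypothetical moment function phi(x_t, b): returns an r-vector (here r = 2).
# In this toy model E[X_{t,1}] = beta and E[X_{t,1} X_{t,2}] = beta.
def phi(x_t, b):
    return np.array([x_t[0] - b[0], x_t[0] * x_t[1] - b[0]])

def selected_sample_moments(b, X, A):
    """(1/N) sum_t A_t' phi(X_t, b), the k-dimensional sample counterpart."""
    return np.mean([A_t.T @ phi(x_t, b) for A_t, x_t in zip(A, X)], axis=0)

rng = np.random.default_rng(0)
N, r, k = 500, 2, 1
X = rng.normal(size=(N, 2)) + 1.0          # placeholder data
A = [np.ones((r, k)) for _ in range(N)]    # a constant selection matrix for every t

# b_N solves the k sample moment equations built from the selected moments.
b_N = root(lambda b: selected_sample_moments(b, X, A), x0=np.zeros(k)).x
```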
Different sequences of selection matrices \(\{A_t : t \ge 1 \}\) and \(\{\widetilde A_t: t \ge 1\}\) generally give rise to different properties for the estimator \(b_N\). An exception is when
for some \(k \times k\) nonsingular matrix \({\mathbb L}\), in which case the distinct selection matrices \({\widetilde A}_t\) and \(A_t\) give rise to the same \(b_N\).
We study limiting properties of estimator \(b_N\) conditioned on a statistical model. In many settings, the parameter vector \(\beta\) only incompletely characterizes the statistical model. In such settings, we are led in effect to implement a version of what is known as semi-parametric estimation: while \(\beta\) is the finite-dimensional parameter vector that we want to estimate, we acknowledge that, in addition to \(\beta\), a potentially infinite dimensional nuisance parameter vector pins down the complete statistical model on which we condition when we apply the law of large numbers and other limit theorems.
Unconditional moment restrictions
Suppose that
where \(r \ge k\). Let \(\mathcal{A}_t\) be the set of all constant \(r \times k\) matrices \(\mathbb{A}\). Rewrite the restrictions as:
for all \(r \times k\) matrices \(\mathbb{A}\). [Sargan, 1958] and [Hansen, 1982] assumed moment restrictions like these. For instance,
where \(Z_t\) is an \(r\) dimensional vector of instrumental variables and \(\eta(Y_t, \beta)\) is a scalar disturbance term in an equation of interest. The instrumental variables are presumed to be uncorrelated with the disturbance term.
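For this linear instrumental-variables case, a sample counterpart can be written in closed form. The sketch below is an illustration, not part of the text; the simulated data, the identity weighting matrix, and the variable names are placeholders.

```python
import numpy as np

def linear_iv_gmm(y, Y, Z, W):
    """
    GMM estimator for y_t = Y_t' beta + eta_t under E[Z_t eta_t] = 0.
    y: (N,) left-hand-side variable; Y: (N, k) right-hand-side variables;
    Z: (N, r) instruments with r >= k; W: (r, r) positive definite weighting
    matrix.  Minimizing g_N(b)' W g_N(b) with g_N(b) = Z'(y - Y b) / N gives
    the closed form returned below.
    """
    ZY, Zy = Z.T @ Y / len(y), Z.T @ y / len(y)
    return np.linalg.solve(ZY.T @ W @ ZY, ZY.T @ W @ Zy)

# Hypothetical simulated data with an endogenous right-hand-side variable.
rng = np.random.default_rng(1)
N, r = 1_000, 3
Z = rng.normal(size=(N, r))
eta = rng.normal(size=N)
Y = Z @ np.array([[1.0], [0.5], [0.2]]) + 0.5 * eta[:, None]  # correlated with eta
y = 2.0 * Y[:, 0] + eta                                       # true beta = 2
b_N = linear_iv_gmm(y, Y, Z, np.eye(r))
```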
Conditional moment restrictions
Assume the conditional moment restrictions
for a particular \(\ell \ge 1\) and \(Y_t = X_t\). Let \(\mathcal A_{t}\) be the set of all \(r \times k\) matrices, \(A_t\), of bounded random variables that are \({\mathfrak A}_{t-\ell}\) measurable.
Then the preceding conditional moment restrictions are mathematically equivalent to the unconditional moment restrictions
for all random matrices \(A_t \in {\mathcal A}_t\). This formulation is due to [Hansen, 1985]. Also see the closely related analysis of [Chamberlain, 1987].
Collections \(\mathcal A\) of selection processes for both of these examples satisfy the following “linearity” restriction.
Restriction 9.1. If \(A^1\) and \(A^2\) are both in \(\mathcal A\) and \(\mathbb{L}_1\) and \(\mathbb{L}_2\) are \(k \times k\) matrices of real numbers, then \(A^1 \mathbb{L}_1 + A^2\mathbb{L}_2\) is in \(\mathcal A\).
A common practice is to use the idea provided in Example 12.2 while substantially restricting the set of moment conditions used for parameter estimation. Thus, from a collection of conditional moment restrictions, we can create unconditional moment restrictions like those in Example 12.1 and thereby reduce the class of GMM estimators under consideration. For instance, let \(A_t^1\) and \(A_t^2\) be two ad hoc choices of selection matrices. Form
where \(X_t\) now includes variables used to construct \(A_t^1\) and \(A_t^2\). We presume that no linear combination of the columns of \(A_t^2\) duplicates any column of \(A_t^1\); otherwise, we would omit such columns and adjust \(\phi^+\) accordingly. Let \(r^+ \ge r\) denote the number of remaining non-redundant columns. We use \(r^+ \times k\) selection matrices \(\mathbb A\) to form moment conditions
and study an associated family of GMM estimators. This strategy reduces an infinite number of moment conditions to a finite number. There are extensions of this approach. For instance, we could use more than two \(A_t^j\)’s to construct \(\phi^+\).
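As a concrete illustration (not from the text), the sketch below stacks the two selected moment vectors into \(\phi^+\); the moment value and the two selection matrices are hypothetical placeholders.

```python
import numpy as np

def phi_plus(phi_t, A1_t, A2_t):
    """
    Stack the two selected k-vectors A1_t' phi_t and A2_t' phi_t into a single
    vector of unconditional moment conditions; any redundant columns of
    [A1_t, A2_t] would be dropped before this step.
    """
    return np.concatenate([A1_t.T @ phi_t, A2_t.T @ phi_t])

# Example with r = 1, k = 2: phi_t is a scalar disturbance and each selection
# "matrix" is a 1 x 2 row built from variables known at an earlier date.
phi_t = np.array([0.3])
A1_t = np.array([[1.0, 2.0]])   # e.g., (1, z_{t-1})
A2_t = np.array([[0.5, 4.0]])   # e.g., (z_{t-2}, z_{t-1} ** 2)
print(phi_plus(phi_t, A1_t, A2_t))   # a 4-dimensional phi_t^+
```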
“Moment matching” estimators are another special case of Example 12.1. Suppose that
where
The random vector \(\psi(X_t)\) defines moments to be matched and \(\kappa(\beta)\) gives the population values of those moments under a statistical model with parameter vector \(\beta\). Often that statistical model is a “structural” economic model with nonlinearities and other complications that, for a given \(\beta\), make it challenging to compute the moments \(E\left[\psi(X_t) \right]\) analytically. To proceed, the proposal is to approximate those moments for a given \(b\) by computing a sample mean from a long simulation of the statistical model at parameter vector \(b\). By running simulations and computing associated sample means for many alternative \(b\) vectors, we can assemble an approximation to the function \(\kappa(b)\). [Lee and Ingram, 1991] and [Duffie and Singleton, 1993] used versions of this approach. Notice that, in contrast to some other applications of GMM estimation that allow unknown nuisance parameters to appear in the statistical model assumed to generate the data, this approach assumes that, given \(b\), the model completely determines a sample path that we can at least simulate.
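A minimal sketch of this simulation-based approach, under assumed ingredients that are not in the text: a hypothetical AR(1) structural model that can be simulated for a candidate \(b\), moments \(\psi\) consisting of the mean, variance, and first autocovariance, and an identity weighting matrix. A fixed simulation seed (common random numbers) keeps the objective well behaved in \(b\).

```python
import numpy as np
from scipy.optimize import minimize

def psi(x):
    """Moments to be matched: mean, variance, and first-order autocovariance."""
    xc = x - x.mean()
    return np.array([x.mean(), x.var(), np.mean(xc[1:] * xc[:-1])])

def simulate_model(b, T, rng):
    """Hypothetical structural model: an AR(1) with persistence b[0] and shock sd b[1]."""
    rho, sigma = b
    x = np.zeros(T)
    shocks = sigma * rng.standard_normal(T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + shocks[t]
    return x

def kappa_hat(b, T_sim=50_000, seed=42):
    """Approximate kappa(b) = E[psi(X_t)] under the model by a long simulation."""
    return psi(simulate_model(b, T_sim, np.random.default_rng(seed)))

def moment_matching_objective(b, data_moments):
    diff = data_moments - kappa_hat(b)
    return diff @ diff          # identity weighting matrix for simplicity

# "Observed" data are themselves simulated here, at a true parameter vector.
rng = np.random.default_rng(0)
data = simulate_model([0.8, 1.0], 2_000, rng)
res = minimize(moment_matching_objective, x0=[0.5, 0.5],
               args=(psi(data),), method="Nelder-Mead")
b_N = res.x
```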
“Indirect inference” works with two statistical models in hand: (1) a “structural economic model” with a vector of parameters \(\beta\) that characterize preferences, technology, information flows, and other features of the theoretical economic model; and (2) an “auxiliary model” with no pretense of being “structural” in terms of economic theory and having a vector of parameters \(\delta\) that lets the model fit well. Although the structural model can be solved and simulated on a computer, it is too complicated to allow writing down its likelihood process analytically. The likelihood process for the auxiliary model can be calculated analytically. “Moment matching” estimation in the style of [Gallant and Tauchen, 1996] proceeds in two steps: the first uses maximum likelihood estimation of the auxiliary model's parameters to prepare a random vector \(\psi(X_t)\) whose moments are to be matched; the second proceeds as in Remark 12.1 and uses simulations of the structural model to approximate the function \(\kappa (\beta)\). In the first step, the parameter vector \(\delta\) of the auxiliary model is estimated by maximum likelihood and the sample path of the associated score vector is evaluated at the maximum likelihood estimate \(\hat{\delta}\). As an input into the second step, the associated Fisher information matrix is computed. The second step forms a GMM criterion consisting of a quadratic form in the score vector with weighting matrix equal to the inverse of the Fisher information matrix computed in the first step. Repeated simulations of the structural model are used to search for a \(b_N\) that best matches the score criterion from the auxiliary model.
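The two-step logic can be illustrated with a deliberately simple auxiliary model. In the sketch below (an illustration only, not the procedure of any cited paper), the auxiliary model is a Gaussian AR(1) whose conditional scores have closed form, the structural model is the same hypothetical AR(1) simulator as in the preceding sketch, and the weighting matrix is the inverse of the sample information matrix of the auxiliary scores.

```python
import numpy as np
from scipy.optimize import minimize

def simulate_model(b, T, rng):
    """Hypothetical structural model: an AR(1) with persistence b[0] and shock sd b[1]."""
    rho, sigma = b
    x = np.zeros(T)
    shocks = sigma * rng.standard_normal(T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + shocks[t]
    return x

def auxiliary_score(x, delta):
    """Conditional scores of a Gaussian AR(1) auxiliary model x_t = c + a x_{t-1} + u_t."""
    c, a, s = delta                       # s is the innovation variance
    e = x[1:] - c - a * x[:-1]
    return np.column_stack([e / s, e * x[:-1] / s, (e ** 2 - s) / (2 * s ** 2)])

# Step 1: (quasi-) ML estimates of the auxiliary model and its sample information matrix.
rng = np.random.default_rng(0)
data = simulate_model([0.8, 1.0], 2_000, rng)     # "observed" data
lagged = np.column_stack([np.ones(len(data) - 1), data[:-1]])
coef, *_ = np.linalg.lstsq(lagged, data[1:], rcond=None)
resid = data[1:] - lagged @ coef
delta_hat = np.array([coef[0], coef[1], resid @ resid / len(resid)])
scores = auxiliary_score(data, delta_hat)
W = np.linalg.inv(scores.T @ scores / len(scores))

# Step 2: choose b so that a long simulation drives the average auxiliary score
# (evaluated at delta_hat) toward zero, in the metric given by W.
def score_matching_objective(b, T_sim=50_000, seed=42):
    sim = simulate_model(b, T_sim, np.random.default_rng(seed))
    g = auxiliary_score(sim, delta_hat).mean(axis=0)
    return g @ W @ g

b_N = minimize(score_matching_objective, x0=[0.5, 0.5], method="Nelder-Mead").x
```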
12.3. Central limit approximation#
The process
can be verified to have stationary and ergodic increments conditioned on the statistical model. So there exists a Proposition 2.2.2 decomposition of the process. Provided that
under the statistical model that generates the data, the trend term in the decomposition of Proposition 2.2.2 is zero, implying that the martingale dominates the behavior of sample averages for large \(N\). In particular, Proposition 2.3.1 in Chapter 2 gives a central limit approximation for
Let \(A = \{ A_t : t \ge 0 \}\) and suppose that
converges in mean square. Define the one-step-ahead forecast error:
Paralleling the construction of the martingale increment in Proposition 2.2.2,
where by the approximation sign \(\approx\) we intend to assert that the difference between the right side and left side converges in mean square to zero as \(N \rightarrow \infty\). Consequently, the covariance matrix in the central limit approximation is
Recall Restriction 9.1. For the preceding construction of the martingale increment, it is straightforward to verify that
follows from the linearity of conditional expectations.
Consider again Example 12.1 in which \(A_t = \mathbb{A}\) for all \(t\ge 0\) and
where
Define the covariance matrix
and note that
In Example 12.2
and hence
whenever entries of \(A_t\) are restricted to be \({\mathfrak A}_{t-\ell}\) measurable.
Consequently
for \(j \ge \ell\) so that the infinite sums used to construct \(G_t(A)\) simplify to finite sums.
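Because the autocovariances vanish beyond lag \(\ell - 1\) in this case, a sample counterpart of the covariance matrix in the central limit approximation can be formed with a truncated sum. The sketch below is illustrative under these assumptions; the array `f` of selected moment contributions \(A_t'\phi(X_t, b_N)\) is a hypothetical input.

```python
import numpy as np

def truncated_long_run_cov(f, ell):
    """
    Sample counterpart of the long-run covariance of the (N, k) array of
    selected moment contributions f_t = A_t' phi(X_t, b_N) when the
    autocovariances vanish at lags of ell or more:
        sum_{j = -(ell - 1)}^{ell - 1} E[f_t f_{t-j}'].
    """
    f = f - f.mean(axis=0)       # innocuous: the sample moments are ~ 0 at b_N
    N = f.shape[0]
    V = f.T @ f / N
    for j in range(1, ell):
        gamma_j = f[j:].T @ f[:-j] / N
        V += gamma_j + gamma_j.T
    return V
```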
12.4. Mean value approximation#
Write
where
Since
So long as \(\nabla(A)\) is nonsingular,
This approximation underlies an “efficiency bound” for GMM estimation. Notice that the covariance matrix in a central limit approximation is:
We want to know how small we can make this matrix by choosing a selection process.
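To make this covariance matrix operational, one can replace \(\nabla(A)\) and \(E\left[G_t(A) G_t(A)'\right]\) with sample counterparts. Below is a minimal sketch under simplifying assumptions that are not in the text: a constant selection matrix \({\mathbb A}\) as in Example 12.1 and serially uncorrelated moment contributions, so that the long-run covariance reduces to an ordinary sample covariance. The objects `phi`, `X`, `A`, and `b_N` are hypothetical inputs.

```python
import numpy as np

def gmm_sandwich_cov(phi, X, A, b_N, eps=1e-6):
    """
    Sample counterpart of the asymptotic covariance matrix
        [nabla(A)]^{-1} E[G_t(A) G_t(A)'] [nabla(A)']^{-1},
    divided by N to approximate the covariance of b_N.  nabla(A) is
    approximated by central finite differences of the sample moments, and
    E[G_t G_t'] by the sample covariance of A' phi(X_t, b_N), which is
    appropriate when those contributions are serially uncorrelated.
    """
    N, k = len(X), len(b_N)
    moments = lambda b: np.mean([A.T @ phi(x, b) for x in X], axis=0)
    nabla = np.zeros((k, k))
    for j in range(k):
        db = np.zeros(k)
        db[j] = eps
        nabla[:, j] = (moments(b_N + db) - moments(b_N - db)) / (2 * eps)
    G = np.array([A.T @ phi(x, b_N) for x in X])
    Gc = G - G.mean(axis=0)
    S = Gc.T @ Gc / N
    nabla_inv = np.linalg.inv(nabla)
    return nabla_inv @ S @ nabla_inv.T / N

# Standard errors for b_N: np.sqrt(np.diag(gmm_sandwich_cov(phi, X, A, b_N)))
```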
Consider again Example 12.1. In this case \(A_t = {\mathbb A}\) for all \(t \ge 0\) and
where
and
For purposes of devising a test of the “over-identifying restrictions,” let \(B = \{ B_t : t \ge 0\}\) be an \(r \times {\tilde k}\) matrix process constructed to verify
Restriction 9.2. For any \({\tilde k} \times k\) matrix of real numbers \({\mathbb K}\), \( B {\mathbb K} \in {\mathcal A}\).
Thus, we can build selection processes for estimation equations from the columns of the process \(B\).
Suppose that
converges in mean square so that we can apply a central limit approximation.
Construct
Since Restriction 9.2 is satisfied, notice that
for all \({\tilde k} \times k\) matrices \({\mathbb K}\) of real numbers.
By imitating an earlier argument
Notice that if \(A_t = B_t\), then the right side is zero and the limiting distribution is degenerate. This approximation is used to construct tests that account for having used GMM to estimate a parameter vector \(\beta\).
Consider again unconditional moment restrictions specified in Example 12.1. Let the selection process for testing be constant over time so that \(B_t = {\mathbb B}\). Then
12.5. GMM Efficiency Bound#
Recall
We seek a greatest lower bound on the covariance matrix on the right.
Suppose that \(\nabla(A)\) is nonsingular and impose the normalization
\[\nabla(A) = {\mathbb I} .\]
If not, post-multiply \(A\) by a nonsingular matrix \({\mathbb J}\); that leaves the GMM estimator unaltered. Thus, we seek to minimize
\[\textrm{cov}(A) = E\left[ G_t(A) G_t(A)' \right] \]
subject to \(\nabla(A) = {\mathbb I}\).
Find an \(A^d\) such that
(12.3)#\[\nabla(A) = E\left[ G_t(A^d) G_t(A)' \right] \]
for all \(A \in \mathcal A\). These equations form a set of first-order sufficient conditions for our constrained minimization problem. Form
\[A_t^* = A^d_t \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} .\]
Then
\[G_t(A^*) = \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} G_t(A^d)\]
and
\[E \left[ G_t(A^*) G_t(A)'\right] = \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1},\]
provided that \(\nabla(A) = {\mathbb I}.\)
Therefore,
\[0 \leq E \left( \left[ G_t(A) - G_t(A^*) \right] \left[ G_t(A) - G_t(A^*) \right]' \right) = \textrm{cov}(A) - \left( E\left[ G_t(A^d) G_t(A^d)' \right]\right)^{-1} .\]
Result 9.1. Given a solution to equation (12.3)
In the Result 9.1 efficiency bound, we might be tempted to think that \(G_t(A^d)\) plays the same role that the “score vector” increment does in maximum likelihood estimation. But because there is a possibly infinite dimensional vector of nuisance parameters here, a better analogy is that \(G_t(A^d)\) acts much like the residual vector from a regression of the score increments for the parameters of interest on the score increments for the nuisance parameters. By undertaking to infer the parameter vector \(\beta\) from conditional or unconditional moment restrictions, we have purposefully pushed all nuisance parameters into the background.
The representation
used to compute the efficiency bound is an application of the Riesz Representation Theorem. To understand this, introduce the \(k\)-dimensional coordinate vectors \({\sf u}_i\) for \(i=1,2,...,k\) and consider:
Note that
The integer \(i\) selects the coordinate of \(b\) with respect to which we are differentiating.
If \(A^1\) and \(A^2\) are both in \({\mathcal A}\), then so are linear combinations. Therefore, for given \(i\) and \(j\), \(({\sf u}_i)'\nabla(A) {\sf u}_j\) is a linear functional defined on a linear space of random variables of the form \(\phi(X_t, \beta)' A_t {\sf u}_j\).
The martingale approximation for the scalar process with stationary increments
\[\left\{ \sum_{t=1}^N \phi(X_t, \beta)' A_t {\sf u}_j : N \ge 1 \right\}\]
has martingale increment \(G_t(A) {\sf u}_j\).
The Riesz Representation Theorem asserts that the linear functional \(({\sf u}_i)'\nabla(A) {\sf u}_j\) can be represented as an inner product
\[({\sf u}_i)'\nabla(A) {\sf u}_j = E\left[ R_t G_t(A) {\sf u}_j \right],\]
where the scalar random variable \(R_t\) is in the mean square closure of
\[\left\{ G_t(A) {\sf u}_j : A \in {\mathcal A} \right\}.\]
We can represent \(R_t\) as
\[R_t = G_t(A^d){\sf u}_j\]
for some selection process \(A^d \in {\mathcal A}\), or more generally as a limit point of a sequence of such selection processes.
The preceding construction pins down column \(j\) of \(A^d\). Repeating an analogous construction for each \(j = 1,2,...,k\) gives the selection matrix \(A^d\).
The GMM efficiency bound presumed that we could solve equation (12.5). The Riesz Representation Theorem requires that \(R_t\) be in a mean square closure of a linear space. Provided that the linear functionals \(({\sf u}_i)'\nabla(A) {\sf u}_j\) are mean square continuous, the efficiency bound can be represented in terms of the limit point of a sequence of GMM estimators associated with alternative selection processes even when the limit points are not attained.
Consider Example 12.1 in which we assumed that \(A_t = \mathbb{A}\). Then
Therefore,
and the GMM efficiency bound is
Consider again Example 12.2 in the special case in which \(\ell = 1\).
Let
We wish to solve the following equation for \(A_t^d\):
Given the flexibility in the choice of the random \(A_t\) with entries that are \({\mathfrak A}_{t-1}\) measurable, this equation is equivalent to
where we have taken transposes of the expressions in (12.6). Thus
and the efficiency bound is:
Two-stage least squares. Add the following special restrictions to Example 12.8. Suppose that \(r=1\) and that \(V_{t-1} = \mathsf{v} > 0\) where \(\mathsf{v}\) is constant. Further suppose that
Finally, suppose that
where \(Z_{t-1}\) has more entries than \(Y_t^2\). Notice that \(\Pi\) can be computed as a least squares regression. Then
The scaling by \(\frac{1}{\mathsf{v}}\) is inconsequential to the construction of a selection process. The matrix of regression coefficients can be replaced by the finite sample least squares regression coefficients without altering the statistical efficiency.
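The two-stage least squares calculation itself can be summarized in a few lines. The sketch below is illustrative, with placeholder variable names.

```python
import numpy as np

def two_stage_least_squares(y, Y2, Z):
    """
    2SLS for y_t = (Y_t^2)' beta + eta_t with instruments Z_{t-1}.
    y: (N,); Y2: (N, k) right-hand-side variables; Z: (N, r) with r >= k.
    First stage: Pi_hat = least squares regression of Y2 on Z.
    Second stage: regress y on the fitted values Z @ Pi_hat.
    """
    Pi_hat, *_ = np.linalg.lstsq(Z, Y2, rcond=None)
    fitted = Z @ Pi_hat
    beta_hat, *_ = np.linalg.lstsq(fitted, y, rcond=None)
    return beta_hat
```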
Example 12.9 has a special structure that does not prevail in some important applications. For instance, suppose that \(V_{t-1}\) depends on conditioning information so that a form of conditional heteroskedasticity is present. That dependence shows up in essential ways in how \(A^d_t\) should be constructed. Further, suppose that the expectation \(E\left( X_t^2 \mid {\mathfrak A}_{t-1} \right)\) potentially depends nonlinearly on \(Z_{t-1}\). In that case, to attain or to approximate the efficiency bound, a least squares regression should account for that potential nonlinearity. Finally, suppose that \(\ell > 1\). Then even if the covariance structure is homoskedastic and conditional expectations are linear, the two-stage least squares approach will no longer be statistically efficient. We again have to deploy an appropriate martingale central limit approximation. In these circumstances, simply by mapping into the framework of Example 12.1, we can improve efficiency relative to least squares or two-stage least squares, for instance, by letting
[Hansen and Singleton, 1996] construct the efficiency bound in Example 12.2 for a linear data generating process.
12.6. Statistical tests#
First suppose that we have a statistically efficient selection process. Recall the approximation
Let \({\widetilde G}_t(B)\) denote the increment in the martingale approximation for
From the restrictions that we have imposed on the process \(B\) used for constructing tests
Using both of these representations:
where
The term \({\widehat G}_t(B)\) that appears inside the sum on the right side of (12.7) is the population least squares residual from regressing \({\widetilde G}_t(B)\) onto \(G_t(A^d)\). This regression residual can also be interpreted as a martingale increment for a stationary increments process.
Suppose that \({\widehat G}_t(B)\) has a nonsingular covariance matrix. Consider the quadratic form used for building a test:
This test can be implemented in practice by replacing \(E \left[{\widehat G}_t(B){\widehat G}_t(B)' \right]\) with a statistically consistent estimator of it. There is an equivalent way to represent this quadratic form:
This equivalence follows because the inverse of the covariance matrix for the regression error \({\widehat G}_t(B)\) is the upper left block of the inverse of the partitioned covariance matrix:
Consider Example 12.1 again. We have already shown that
Suppose that we choose \({\mathbb B}\) with dimension \(r \times (r-k)\) so that
has full rank. Then
If we replace \(b_N\) with \(\beta\) on the left side of the above limit we find
The difference in the resulting \(\chi^2\) distribution emerges because estimating \(k\) free parameters reduces degrees of freedom by \(k\). It is straightforward to show that
an approximation that is useful for constructing confidence sets for GMM estimates of parameter vector \(\beta\).
To continue our study of Example 12.1, form the population problem:
This has a minimizer at \(b = \beta\) provided that the unconditional moment conditions are satisfied. If \(b = \beta\) is the only possible parameter vector that satisfies the population moment conditions, then \(b = \beta\) is the unique solution to the population minimization problem stated here. Suppose that we construct an estimator by solving a minimization problem:
First-order necessary conditions are
Assume that we already know that the solution \(b_N\) of the above first-order conditions provides a consistent estimator of parameter vector \(\beta\). Then we can show that
where convergence is with probability one. Thus, in this case the selection matrix
provides an estimator that attains the efficiency bound. The limiting distribution of the appropriately scaled minimized value of criterion (12.8) is \(\chi^2\) with \(r - k\) degrees of freedom.
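A sketch of this two-step implementation for Example 12.1, together with the associated test of the over-identifying restrictions, follows. It is an illustration only: `phi` and the data `X` are hypothetical, and the moment contributions are assumed serially uncorrelated so that a simple sample covariance can stand in for the long-run covariance matrix.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def two_step_gmm(phi, X, b0, r):
    """
    Step 1: minimize the GMM criterion with an identity weighting matrix.
    Step 2: re-minimize with the inverse sample covariance of phi(X_t, b1) as
    weighting matrix.  N times the minimized step-2 criterion is approximately
    chi-square with r - k degrees of freedom when the moment conditions hold.
    """
    N, k = len(X), len(b0)
    gbar = lambda b: np.mean([phi(x, b) for x in X], axis=0)

    b1 = minimize(lambda b: gbar(b) @ gbar(b), b0, method="Nelder-Mead").x

    F = np.array([phi(x, b1) for x in X])
    Fc = F - F.mean(axis=0)
    W = np.linalg.inv(Fc.T @ Fc / N)

    res = minimize(lambda b: gbar(b) @ W @ gbar(b), b1, method="Nelder-Mead")
    b_N, J = res.x, N * res.fun
    return b_N, J, chi2.sf(J, df=r - k)   # estimate, J statistic, p-value
```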
There is an interesting variation of the approach described in Remark 12.6. For any \(b\), let \(\mathbb{V}(b)\) be the population covariance matrix of the martingale increment used in the central limit approximation for the process
Assume that \(\mathbb{V}(b)\) is nonsingular for every \(b\) in a parameter space. Form the population minimization problem:
If \(b = \beta\) is the only vector that satisfies the associated population first-order conditions, then \(b = \beta\) is again the unique solution to the above population minimization problem.
Now form sample counterparts of both \(E \left[\phi(X_t,b)\right]\) and \(\mathbb{V}(b)\) as functions of \(b\). Minimizing a sample counterpart of the above population minimization problem gives rise to a “continuously-updated GMM estimator”. See [Hansen et al., 1996]. The parameter vector and an appropriately scaled minimized objective function have the same limiting distributions as those described in Remark 12.6.[2]
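A minimal sketch of the continuously updated estimator, under the same simplifying assumptions as in the preceding sketch: the weighting matrix is recomputed at every candidate \(b\) as the inverse of the sample covariance of \(\phi(X_t, b)\), which serves here as a stand-in for a sample counterpart of \(\mathbb{V}(b)\).

```python
import numpy as np
from scipy.optimize import minimize

def continuously_updated_gmm(phi, X, b0):
    """Minimize N * gbar(b)' V_N(b)^{-1} gbar(b), recomputing V_N(b) at every b."""
    N = len(X)

    def objective(b):
        F = np.array([phi(x, b) for x in X])
        gbar = F.mean(axis=0)
        Fc = F - gbar
        V = Fc.T @ Fc / N
        return N * gbar @ np.linalg.solve(V, gbar)

    res = minimize(objective, b0, method="Nelder-Mead")
    return res.x, res.fun    # the estimator and the scaled minimized criterion
```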
Consider the following conditional moment restriction:
where the random vector \(Y_t\) and parameter vector \(\alpha\) are both \(k \times 1\). We want to know whether there is an \(\alpha \ne 0\) that satisfies the conditional moment conditions. Evidently the parameter vector \(\alpha\) is only identified up to scale so that the conditional moment restrictions at most identify a one-dimensional family of parameter vectors. In practice, researchers typically achieve identification by normalizing. This can be done, for example, by arbitrarily setting a particular component of \(\alpha\) to be unity or else by restricting the norm of \(\alpha\) to be unity. If one restricts the norm of \(\alpha\) in this way, at best there will be two solutions, say, \(\alpha\) and \(-\alpha\) with the same norms. Economic interpretations should guide a normalization.
By taking an \(r\) dimensional vector \(Z_{t-\ell}\) that is in the conditioning information set at date \(t-\ell\) and thus \({\mathfrak A}_{t-\ell}\) measurable, we can form an implied unconditional moment condition,
from which we deduce that \(\alpha\) must be in the null space of the matrix
To identify a one-dimensional null space of \(\alpha\) vectors, it is necessary that the matrix
have rank \(k\), a restriction that it is straightforward to test.
Two-stage least squares imposes a normalization that affects the estimated one-dimensional null space. The estimated null space is also affected when we use a fixed covariance matrix \({\mathbb V}\). In contrast, normalizations imposed in “continuously updated GMM” typically do not affect the estimated null space.