13 Asymptotic Theory
In the previous chapter, we introduced three classes of extremum estimators — maximum likelihood, GMM, and minimum distance — with examples from each of our prototype models. Now we turn to the statistical theory that governs these estimators. Recall the two key properties we set out to establish:
- Consistency: Does \(\hat{\theta}\rightarrow\theta_{0}\) as the sample grows?
- Inference: What is the sampling distribution of \(\hat{\theta}\) around \(\theta_{0}\)?
We first develop results that apply broadly to all extremum estimators, and then specialize to maximum likelihood, minimum distance, and GMM in turn. Throughout, we use our prototype models to illustrate how the theory translates into practice.
The results in this section follow Newey and McFadden (1994) very closely, and you can find a more thorough and precise treatment of the theory in that text.
13.1 Definitions
We begin by formally defining the classes of estimators we will study. The broadest class is the extremum estimator.
Definition 13.1 (Extremum Estimator) \(\hat{\theta}\) is an extremum estimator if: \[\hat{\theta} = \arg\max_{\theta\in\Theta}Q_{N}(\theta)\] where \(\Theta\subset\mathbb{R}^{p}\) and \(Q_{N}(\cdot)\) is some objective function that depends on the data.
This is a very broad definition. All of the estimators we encountered in the previous chapter fall into this class. What distinguishes them is the structure of \(Q_{N}\).
13.1.1 M-estimators
An important subclass of extremum estimators arises when the objective function is an average over the sample:
Definition 13.2 (M-estimator) \(\hat{\theta}\) is an M-estimator if: \[Q_{N}(\theta) = \frac{1}{N}\sum_{n=1}^{N}m(\mathbf{w}_{n},\theta)\] for some known function \(m\).
The “M” stands for “maximum” (or “minimum”). Two of our workhorse estimators are M-estimators:
- Maximum Likelihood: \(m(\mathbf{w}_{n},\theta) = \log f(y_{n}|\mathbf{x}_{n},\theta)\). This is the log-likelihood contribution of observation \(n\).
- Nonlinear Least Squares: \(m(\mathbf{w}_{n},\theta) = -(y_{n}-\varphi(\mathbf{x}_{n},\theta))^{2}\). Here \(\varphi(\mathbf{x},\theta)\) is a regression function and the objective penalizes deviations of \(y\) from its conditional mean.
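To make the definition concrete, here is a minimal sketch of nonlinear least squares as an M-estimator. The data-generating process, the exponential regression function \(\varphi(x,\theta)=\exp(\theta_{1}+\theta_{2}x)\), and the parameter values are all hypothetical, chosen only so the example runs end to end.

```python
import numpy as np
from scipy.optimize import minimize

# Simulated data for illustration (the regression function and parameter
# values are hypothetical, chosen only to make the sketch run).
rng = np.random.default_rng(0)
N = 500
x = rng.uniform(0, 2, size=N)
theta_true = np.array([0.5, 1.0])
y = np.exp(theta_true[0] + theta_true[1] * x) + rng.normal(0, 1, size=N)

def phi(x, theta):
    """Assumed regression function phi(x, theta) = exp(theta_1 + theta_2 * x)."""
    return np.exp(theta[0] + theta[1] * x)

def m(theta):
    """Per-observation contributions m(w_n, theta) = -(y_n - phi(x_n, theta))^2."""
    return -(y - phi(x, theta)) ** 2

def Q_N(theta):
    """Sample objective: the average of the contributions."""
    return np.mean(m(theta))

# Maximize Q_N (equivalently, minimize -Q_N).
res = minimize(lambda t: -Q_N(t), x0=np.zeros(2), method="BFGS")
theta_hat = res.x
print(theta_hat)  # close to theta_true in large samples
```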
13.1.2 GMM Estimator
The GMM estimator is defined by a set of moment conditions \(\mathbb{E}[g(\mathbf{w},\theta_{0})]=\mathbf{0}\):
Definition 13.3 (GMM Estimator) \[Q_{N}(\theta) = -\frac{1}{2}\mathbf{g}_{N}(\theta)'\hat{\mathbf{W}}\mathbf{g}_{N}(\theta),\qquad\mathbf{g}_{N}(\theta)=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n},\theta)\] where \(\hat{\mathbf{W}}\) is a positive definite weighting matrix.
Note that, despite appearances, GMM is not an M-estimator: \(Q_{N}\) is a quadratic form in the sample average \(\mathbf{g}_{N}(\theta)\), not a sample average of per-observation terms \(m(\mathbf{w}_{n},\theta)\), because expanding the quadratic form introduces cross-products between observations. Its asymptotic analysis nonetheless proceeds along very similar lines. We will return to the specific properties of GMM in a later section.
13.1.3 Minimum Distance Estimator
The minimum distance estimator works with a first-stage reduced-form estimate \(\hat{\pi}\) and model restrictions \(\psi(\pi,\theta)\):
Definition 13.4 (Minimum Distance Estimator) \[Q_{N}(\theta) = -\frac{1}{2}\psi(\hat{\pi}_{N},\theta)'\hat{\mathbf{W}}\psi(\hat{\pi}_{N},\theta)\] where \(\psi(\pi_{0},\theta_{0})=\mathbf{0}\) and \(\sqrt{N}(\hat{\pi}_{N}-\pi_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\).
The minimum distance estimator differs from GMM in that the objective depends on the data only through the first-stage statistic \(\hat{\pi}\), rather than through the individual observations directly. We will study its asymptotic properties in a dedicated section.
13.2 Consistency
An extremum estimator solves \(\hat{\theta} = \arg\max_{\theta\in\Theta}Q_{N}(\theta)\). Let \(Q_{0}(\theta)\) denote the population analogue: the probability limit of \(Q_{N}(\theta)\).
When can we guarantee that \(\hat{\theta}\rightarrow_{p}\theta_{0}\)? Intuitively, two conditions are needed:
- Identification: The population objective \(Q_{0}(\theta)\) must be uniquely maximized at \(\theta_{0}\). If there were multiple maximizers, convergence of \(Q_{N}\) to \(Q_{0}\) would not pin down which one \(\hat{\theta}\) approaches.
- Convergence: \(Q_{N}(\theta)\) must converge to \(Q_{0}(\theta)\) in a sufficiently strong sense that the maximizer of \(Q_{N}\) tracks the maximizer of \(Q_{0}\).
These two conditions are the backbone of every consistency argument. The precise form of the convergence condition depends on the structure of the problem.
Theorem 13.1 (Consistency with Compactness) Suppose the following conditions hold:
- \(\Theta\) is a compact subset of \(\mathbb{R}^{p}\)
- \(Q_{N}(\theta)\) is continuous in \(\theta\) for all realizations of the data
- \(Q_{N}(\theta)\) is a measurable function of the data for all \(\theta\in\Theta\)
and additionally:
- Identification: \(Q_{0}(\theta)\) is uniquely maximized at \(\theta_{0}\in\Theta\)
- Uniform Convergence: \(\sup_{\theta\in\Theta}|Q_{N}(\theta)-Q_{0}(\theta)|\rightarrow_{p}0\)
Then \(\hat{\theta}\rightarrow_{p}\theta_{0}\).
Compactness is a strong assumption. Many parameter spaces of interest are not bounded (e.g. regression coefficients). The following result relaxes compactness at the cost of requiring concavity.
Theorem 13.2 (Consistency without Compactness) Suppose the following conditions hold:
- \(\theta_{0}\in\text{int}(\Theta)\)
- \(Q_{N}(\theta)\) is concave in \(\theta\) for all realizations of the data
- \(Q_{N}(\theta)\) is a measurable function of the data for all \(\theta\in\Theta\)
and additionally:
- Identification: \(Q_{0}(\theta)\) is uniquely maximized at \(\theta_{0}\in\Theta\)
- Pointwise Convergence: \(Q_{N}(\theta)\rightarrow_{p}Q_{0}(\theta)\) for all \(\theta\in\Theta\)
Then \(\hat{\theta}\rightarrow_{p}\theta_{0}\).
The key insight is that concavity turns pointwise convergence into uniform convergence on compact subsets (this follows from a result in convex analysis due to Rockafellar, 1970). Combined with the fact that \(\theta_{0}\) is an interior point, one can construct a compact set around \(\theta_{0}\) that traps \(\hat{\theta}\) with probability approaching 1 and then apply the logic of Theorem 13.1.
13.2.1 Uniform Convergence for M-Estimators
The uniform convergence condition of Theorem 13.1 is stronger than pointwise convergence and deserves some attention. For M-estimators of the form \(Q_{N}(\theta)=\frac{1}{N}\sum_{n=1}^{N}m(\mathbf{w}_{n},\theta)\), the question reduces to asking for a uniform law of large numbers. The following result provides simple sufficient conditions.
Theorem 13.3 (Uniform Law of Large Numbers) Suppose that \(\{\mathbf{w}_{n}\}_{n=1}^{N}\) is an ergodic stationary sequence and:
- \(\Theta\) is compact
- \(m(\mathbf{w},\theta)\) is continuous in \(\theta\) for all \(\mathbf{w}\)
- \(m(\mathbf{w},\theta)\) is measurable in \(\mathbf{w}\) for all \(\theta\)
- There exists \(d(\mathbf{w})\) with \(|m(\mathbf{w},\theta)|\leq d(\mathbf{w})\) for all \(\theta\in\Theta\) and \(\mathbb{E}[d(\mathbf{w})]<\infty\)
Then:
- \(\sup_{\theta\in\Theta}|Q_{N}(\theta)-Q_{0}(\theta)|\rightarrow_{p}0\); and
- \(Q_{0}(\theta) = \mathbb{E}[m(\mathbf{w},\theta)]\) is continuous in \(\theta\).
In practice, the dominance condition is verified by checking \(\mathbb{E}[\sup_{\theta\in\Theta}|m(\mathbf{w},\theta)|]<\infty\), i.e. taking \(d(\mathbf{w})=\sup_{\theta\in\Theta}|m(\mathbf{w},\theta)|\). This is straightforward for many common estimators. See Newey and McFadden (1994) for a comprehensive treatment of these results.
13.2.2 Consistency of Maximum Likelihood
For maximum likelihood, the objective is \(Q_{N}(\theta)=\frac{1}{N}\sum_{n=1}^{N}\log f(\mathbf{w}_{n};\theta)\), and the population analogue is \(Q_{0}(\theta) = \mathbb{E}_{\theta_{0}}[\log f(\mathbf{w};\theta)]\). Identification for MLE has an elegant justification through the Kullback-Leibler inequality: for any two densities \(g\) and \(h\),
\[\mathbb{E}_{g}\left[\log\frac{g(\mathbf{w})}{h(\mathbf{w})}\right]\geq 0\]
with equality if and only if \(g=h\) almost everywhere. Applying this with \(g(\cdot) = f(\cdot;\theta_{0})\) and \(h(\cdot) = f(\cdot;\theta)\) gives:
\[\mathbb{E}_{\theta_{0}}[\log f(\mathbf{w};\theta_{0})] \geq \mathbb{E}_{\theta_{0}}[\log f(\mathbf{w};\theta)]\]
with equality if and only if \(f(\cdot;\theta)=f(\cdot;\theta_{0})\) almost everywhere. Thus, as long as different values of \(\theta\) imply different densities (a natural notion of identification for parametric models), the population log-likelihood is uniquely maximized at \(\theta_{0}\).
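As a quick numerical illustration (not part of the formal theory), the sketch below approximates \(Q_{0}(\mu)=\mathbb{E}_{\mu_{0}}[\log f(w;\mu)]\) by a large simulated sample for a hypothetical normal location model and confirms that the maximizer sits near \(\mu_{0}\), as the Kullback-Leibler inequality implies.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical example: w ~ N(mu_0, 1), with theta = mu.
rng = np.random.default_rng(1)
mu_0 = 2.0
w = rng.normal(mu_0, 1.0, size=100_000)  # large sample approximates the expectation

mu_grid = np.linspace(0.0, 4.0, 81)
# Monte Carlo approximation of Q_0(mu) = E_{mu_0}[log f(w; mu)]
Q0 = np.array([norm.logpdf(w, loc=mu, scale=1.0).mean() for mu in mu_grid])

print(mu_grid[np.argmax(Q0)])  # close to mu_0 = 2.0
```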
Theorem 13.4 (Consistency of Maximum Likelihood) Suppose that \(\{\mathbf{w}_{n}\}\) is ergodic stationary with density \(f(\mathbf{w};\theta_{0})\) and that \(\theta_{0}\in\Theta\). If:
- \(\Theta\) is compact
- \(\log f(\mathbf{w};\theta)\) is continuous in \(\theta\)
- \(f(\mathbf{w};\theta_{0})\neq f(\mathbf{w};\theta)\) with positive probability for all \(\theta\neq\theta_{0}\) (identification)
- \(\mathbb{E}[\sup_{\theta\in\Theta}|\log f(\mathbf{w};\theta)|]<\infty\) (dominance)
Then \(\hat{\theta}_{ML}\rightarrow_{p}\theta_{0}\).
Notice that identification here takes a model-specific form: we need different parameter values to imply different distributions for the data. This is a consequence of the fact that MLE relies on a fully specified parametric model.
An analogous result holds without compactness when the log-likelihood is concave in \(\theta\) (as is often the case for exponential family models), replacing compactness of \(\Theta\) with \(\theta_{0}\in\text{int}(\Theta)\) and the dominance condition with pointwise moment conditions.
13.3 Asymptotic Normality for M-Estimators
Having established when \(\hat{\theta}\rightarrow_{p}\theta_{0}\), we now turn to characterizing the rate and distribution of \(\hat{\theta}\) around \(\theta_{0}\). The answer will justify the standard errors and confidence intervals that we routinely compute in applied work.
Consider an M-estimator: \(Q_{N}(\theta) = \frac{1}{N}\sum_{n=1}^{N}m(\mathbf{w}_{n},\theta)\). Define the score (gradient) and Hessian of \(m\):
\[\mathbf{s}(\mathbf{w},\theta) = \frac{\partial m(\mathbf{w},\theta)}{\partial\theta}\qquad(p\times 1)\]
\[\mathbf{H}(\mathbf{w},\theta) = \frac{\partial^{2}m(\mathbf{w},\theta)}{\partial\theta\partial\theta'}\qquad(p\times p)\]
13.3.1 Derivation via the Mean Value Theorem
Since \(\hat{\theta}\) maximizes \(Q_{N}\), the first-order condition gives: \[\frac{1}{N}\sum_{n=1}^{N}\mathbf{s}(\mathbf{w}_{n},\hat{\theta}) = \mathbf{0}\]
A mean value expansion around \(\theta_{0}\) yields: \[\mathbf{0} = \frac{1}{N}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0}) + \left[\frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\bar{\theta})\right](\hat{\theta}-\theta_{0})\]
where \(\bar{\theta}\) lies between \(\hat{\theta}\) and \(\theta_{0}\) (the mean value theorem is applied row by row, so \(\bar{\theta}\) may differ across rows). Rearranging: \[\sqrt{N}(\hat{\theta}-\theta_{0}) = -\left[\frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\bar{\theta})\right]^{-1}\frac{1}{\sqrt{N}}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0})\]
Now apply two standard arguments:
- By the Central Limit Theorem: \(\frac{1}{\sqrt{N}}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Sigma)\) where \(\Sigma = \mathbb{E}[\mathbf{s}(\mathbf{w},\theta_{0})\mathbf{s}(\mathbf{w},\theta_{0})']\).
- By a Law of Large Numbers and continuity: \(\frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\bar{\theta})\rightarrow_{p}\mathbb{E}[\mathbf{H}(\mathbf{w},\theta_{0})]\), using that \(\bar{\theta}\rightarrow_{p}\theta_{0}\).
Combining these via Slutsky’s theorem gives the result.
Theorem 13.5 (Asymptotic Normality for M-estimators) Suppose that the consistency conditions hold and additionally:
- \(\theta_{0}\in\text{int}(\Theta)\)
- \(m(\mathbf{w},\theta)\) is twice continuously differentiable in \(\theta\)
- \(\frac{1}{\sqrt{N}}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Sigma)\) with \(\Sigma\) positive definite
- \(\mathbb{E}[\sup_{\theta\in\mathcal{N}(\theta_{0})}\|\mathbf{H}(\mathbf{w},\theta)\|]<\infty\) for some neighborhood \(\mathcal{N}(\theta_{0})\)
- \(\mathbb{E}[\mathbf{H}(\mathbf{w},\theta_{0})]\) is nonsingular
Then: \[\sqrt{N}(\hat{\theta}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ \mathbb{E}[\mathbf{H}]^{-1}\Sigma\mathbb{E}[\mathbf{H}]^{-1}\right)\]
where \(\mathbb{E}[\mathbf{H}]=\mathbb{E}[\mathbf{H}(\mathbf{w},\theta_{0})]\) and \(\Sigma = \mathbb{E}[\mathbf{s}(\mathbf{w},\theta_{0})\mathbf{s}(\mathbf{w},\theta_{0})']\).
The asymptotic variance \(\mathbb{E}[\mathbf{H}]^{-1}\Sigma\mathbb{E}[\mathbf{H}]^{-1}\) is often called the sandwich formula. In practice, we replace the population expectations with sample analogues: \[\widehat{\mathbb{V}}[\hat{\theta}] = \hat{H}^{-1}\hat{\Sigma}\hat{H}^{-1}/N\] where \(\hat{H} = \frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\hat{\theta})\) and \(\hat{\Sigma} = \frac{1}{N}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\hat{\theta})\mathbf{s}(\mathbf{w}_{n},\hat{\theta})'\).
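The sketch below shows one way to implement the sandwich formula using finite-difference derivatives of a user-supplied contribution function. The probit model, the simulated data, and the helper name sandwich_vcov are all hypothetical, included only to make the example self-contained.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def sandwich_vcov(m_contrib, theta_hat, eps=1e-5):
    """Sandwich variance estimate H^{-1} Sigma H^{-1} / N, where m_contrib(theta)
    returns the N-vector of contributions m(w_n, theta) and derivatives are
    approximated by central finite differences."""
    N = m_contrib(theta_hat).shape[0]
    p = theta_hat.shape[0]

    def scores(theta):
        # N x p matrix of per-observation gradients s(w_n, theta)
        S = np.empty((N, p))
        for j in range(p):
            e = np.zeros(p); e[j] = eps
            S[:, j] = (m_contrib(theta + e) - m_contrib(theta - e)) / (2 * eps)
        return S

    S_hat = scores(theta_hat)
    Sigma_hat = S_hat.T @ S_hat / N
    # Average Hessian: finite differences of the average score
    H_hat = np.empty((p, p))
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        H_hat[:, j] = (scores(theta_hat + e).mean(0) - scores(theta_hat - e).mean(0)) / (2 * eps)
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ Sigma_hat @ H_inv.T / N

# Hypothetical probit example to exercise the helper.
rng = np.random.default_rng(2)
N = 2000
x = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.2, 0.8])
y = (x @ beta_true + rng.normal(size=N) > 0).astype(float)

def m_contrib(theta):
    prob = np.clip(norm.cdf(x @ theta), 1e-10, 1 - 1e-10)
    return y * np.log(prob) + (1 - y) * np.log(1 - prob)  # log-likelihood contributions

theta_hat = minimize(lambda t: -m_contrib(t).mean(), x0=np.zeros(2), method="BFGS").x
se = np.sqrt(np.diag(sandwich_vcov(m_contrib, theta_hat)))
print(theta_hat, se)
```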
13.3.2 The Information Matrix Equality
For maximum likelihood, \(m(\mathbf{w},\theta)=\log f(\mathbf{w};\theta)\), and a remarkable simplification occurs. Under standard regularity conditions, the information matrix equality holds:
\[\mathcal{I}(\theta_{0}) \equiv \mathbb{E}\left[\mathbf{s}(\mathbf{w},\theta_{0})\mathbf{s}(\mathbf{w},\theta_{0})'\right] = -\mathbb{E}\left[\mathbf{H}(\mathbf{w},\theta_{0})\right]\]
This means that for MLE, the sandwich formula collapses to: \[\sqrt{N}(\hat{\theta}_{ML}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ \mathcal{I}(\theta_{0})^{-1}\right)\]
The matrix \(\mathcal{I}(\theta)\) is called the Fisher information matrix. The MLE variance can be estimated using either the Hessian or the outer product of the score — or indeed the sandwich (which is robust to certain forms of misspecification).
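As a quick numerical check of the information matrix equality, the following hypothetical Poisson example, where the score and Hessian are available in closed form, compares the Hessian-based, outer-product, and sandwich variance estimates; under correct specification all three should be close.

```python
import numpy as np

# Hypothetical Poisson(lambda) sample; log f(w; lam) = w*log(lam) - lam - log(w!)
rng = np.random.default_rng(3)
lam_0 = 3.0
w = rng.poisson(lam_0, size=5000)
lam_hat = w.mean()                 # the MLE

score = w / lam_hat - 1.0          # s(w, lam) = w/lam - 1
hess = -w / lam_hat**2             # H(w, lam) = -w/lam^2

I_hess = -hess.mean()              # -E[H]: Hessian-based information estimate
I_opg = np.mean(score**2)          # E[s s']: outer-product-of-the-score estimate

var_hess = 1.0 / I_hess / len(w)
var_opg = 1.0 / I_opg / len(w)
var_sandwich = I_opg / I_hess**2 / len(w)
print(var_hess, var_opg, var_sandwich)   # all close to lam_0 / N under correct specification
```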
13.3.3 The Delta Method
In many settings, the parameters of direct interest are not \(\theta\) itself but some smooth function \(g(\theta)\). For instance, in Example 12.1 the search model is parameterized by \(\theta=(h,\delta,\mu,\sigma,w^{*})\) but the economically meaningful objects include \(\lambda\) and \(b\), which are nonlinear functions of \(\theta\). The delta method provides the asymptotic distribution of such transformed parameters.
Theorem 13.6 (Delta Method) If \(\sqrt{N}(\hat{\theta}-\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},V)\) and \(g:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}\) is continuously differentiable at \(\theta_{0}\) with Jacobian \(\nabla g(\theta_{0})\) of full row rank, then:
\[\sqrt{N}(g(\hat{\theta})-g(\theta_{0}))\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ \nabla g(\theta_{0})\ V\ \nabla g(\theta_{0})'\right)\]
The proof is a direct application of the continuous mapping theorem to the first-order Taylor expansion \(g(\hat{\theta})\approx g(\theta_{0})+\nabla g(\theta_{0})(\hat{\theta}-\theta_{0})\).
In practice, the Jacobian \(\nabla g\) can be computed analytically or by automatic differentiation. This is convenient when \(g\) is a complex function — for instance, when it involves solving the model as in the case of the reservation wage \(w^{*}\) in the search model.
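A minimal sketch of the delta method with a finite-difference Jacobian is given below; the transformation \(g\), the estimate theta_hat, and its covariance V_hat are hypothetical stand-ins for output from any of the estimators above.

```python
import numpy as np

def delta_method(g, theta_hat, V_hat, eps=1e-6):
    """Covariance of g(theta_hat) given the covariance V_hat of theta_hat,
    with the Jacobian of g approximated by central finite differences."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    p = theta_hat.size
    g0 = np.atleast_1d(g(theta_hat))
    J = np.empty((g0.size, p))
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        J[:, j] = (np.atleast_1d(g(theta_hat + e)) - np.atleast_1d(g(theta_hat - e))) / (2 * eps)
    return g0, J @ V_hat @ J.T

# Hypothetical usage: theta = (mu, log_sigma), interest in sigma = exp(log_sigma).
theta_hat = np.array([1.2, -0.5])
V_hat = np.array([[0.01, 0.002], [0.002, 0.02]])   # e.g. a sandwich estimate divided by N
g_hat, V_g = delta_method(lambda t: np.array([np.exp(t[1])]), theta_hat, V_hat)
print(g_hat, np.sqrt(np.diag(V_g)))
```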
13.4 Minimum Distance Estimators
The minimum distance estimator takes a different approach from the M-estimators discussed above. Instead of directly optimizing an objective over the raw data, it works in two stages: first estimate a reduced-form object \(\pi\), then find the structural parameters \(\theta\) that best fit the model’s implications for \(\pi\).
13.4.1 Setup
Let \(\psi(\pi,\theta)\) be a vector of \(J\) model restrictions satisfying \(\psi(\pi_{0},\theta_{0})=\mathbf{0}\). For example, in the savings model from Example 12.2, \(\pi\) consists of the variances of log income at each age, and \(\psi(\pi,\theta)=\pi - \mathbf{v}(\theta)\) measures the gap between observed and model-implied moments.
Suppose we have a first-stage estimator \(\hat{\pi}\) with: \[\sqrt{N}(\hat{\pi}-\pi_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\]
The minimum distance estimator is: \[\hat{\theta} = \arg\min_{\theta}\psi(\hat{\pi},\theta)'\mathbf{W}_{N}\psi(\hat{\pi},\theta)\] where \(\mathbf{W}_{N}\) is a positive definite weighting matrix.
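A minimal sketch of the estimator in the \(\psi(\pi,\theta)=\pi-h(\theta)\) case appears below. The mapping h, the first-stage estimate pi_hat, and its covariance Omega_hat are hypothetical stand-ins for reduced-form output; the weighting matrix anticipates the optimal choice discussed in the next subsection.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical model mapping: the reduced-form vector pi is implied by
# structural parameters theta = (a, b) through h(theta).
def h(theta):
    a, b = theta
    return np.array([a + b, a * b, a - 2 * b])

# Stand-ins for first-stage output: an estimate pi_hat and the estimated
# asymptotic variance Omega_hat of sqrt(N) * (pi_hat - pi_0).
pi_hat = np.array([1.52, 0.49, 0.03])
Omega_hat = np.diag([0.4, 0.3, 0.5])

W_N = np.linalg.inv(Omega_hat)   # anticipating the optimal choice W* = Omega^{-1}

def objective(theta):
    psi = pi_hat - h(theta)      # psi(pi_hat, theta) in the pi - h(theta) case
    return psi @ W_N @ psi

theta_hat = minimize(objective, x0=np.array([0.8, 0.3]), method="BFGS").x
print(theta_hat)                 # roughly (1.0, 0.5) for these hypothetical inputs
```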
13.4.2 Asymptotic Distribution
Theorem 13.7 (Asymptotics for Minimum Distance) Suppose:
- \(\psi(\pi_{0},\theta_{0})=\mathbf{0}\) and \(\psi(\pi_{0},\theta)\neq\mathbf{0}\) for all \(\theta\neq\theta_{0}\) (identification)
- \(\sqrt{N}(\hat{\pi}-\pi_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\)
- \(\mathbf{W}_{N}\rightarrow_{p}\mathbf{W}\), symmetric and nonsingular
- \(\psi\) is differentiable with \(\text{rank}(\nabla_{\theta}\psi_{0})=p\)
Define \(\nabla_{\theta}\psi_{0} = \frac{\partial\psi(\pi_{0},\theta_{0})'}{\partial\theta}\) and \(\nabla_{\pi}\psi_{0}=\frac{\partial\psi(\pi_{0},\theta_{0})'}{\partial\pi}\). Then:
\[\sqrt{N}(\hat{\theta}-\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\ V_{MD})\] where: \[V_{MD} = \left(\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}'\right)^{-1}\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\pi}\psi_{0}'\Omega\nabla_{\pi}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}'\left(\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}'\right)^{-1}\]
13.4.3 The Optimal Weighting Matrix
The variance \(V_{MD}\) depends on the choice of \(\mathbf{W}\). The optimal weighting matrix minimizes \(V_{MD}\) (in the positive semi-definite sense) and is given by:
\[\mathbf{W}^{*} = \left(\nabla_{\pi}\psi_{0}'\Omega\nabla_{\pi}\psi_{0}\right)^{-1}\]
Under this choice, the variance simplifies to: \[V_{MD}^{*} = \left(\nabla_{\theta}\psi_{0}\left(\nabla_{\pi}\psi_{0}'\Omega\nabla_{\pi}\psi_{0}\right)^{-1}\nabla_{\theta}\psi_{0}'\right)^{-1}\]
In the common case where \(\psi(\pi,\theta) = \pi-h(\theta)\) for some function \(h\), the derivatives simplify: \(\nabla_{\pi}\psi_{0} = I\) and \(\nabla_{\theta}\psi_{0} = -\nabla_{\theta}h(\theta_{0})\). Then \(\mathbf{W}^{*}=\Omega^{-1}\) and:
\[V_{MD}^{*} = \left(\nabla_{\theta}h_{0}\Omega^{-1}\nabla_{\theta}h_{0}'\right)^{-1}\]
An important special case arises when the model is just-identified: \(\text{dim}(\psi)=\text{dim}(\theta)\). Then, as with just-identified GMM below, the restrictions can typically be solved exactly (by the implicit function theorem, \(\hat{\theta}\) is locally a smooth function of \(\hat{\pi}\)) and the choice of weighting matrix is irrelevant. If, in addition, the first-stage estimator \(\hat{\pi}\) is asymptotically efficient (for example, the unrestricted MLE of the reduced form), the optimally weighted minimum distance estimator attains the same asymptotic variance as MLE; with an inefficient first stage, minimum distance generally sacrifices some efficiency relative to MLE in exchange for robustness and computational convenience.
13.5 The Generalized Method of Moments
GMM is an extremum estimator with objective: \[Q_{N}(\theta) = -\frac{1}{2}\mathbf{g}_{N}(\theta)'\mathbf{W}_{N}\mathbf{g}_{N}(\theta),\qquad\mathbf{g}_{N}(\theta)=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n},\theta)\]
where \(\mathbb{E}[g(\mathbf{w},\theta_{0})]=\mathbf{0}\) are the moment conditions. The asymptotic distribution follows from Theorem 13.5 as a special case, but the structure of the problem leads to a particularly clean expression.
Theorem 13.8 (Asymptotic Distribution of GMM) Suppose that the standard regularity conditions hold and \(\mathbf{W}_{N}\rightarrow_{p}\mathbf{W}\). Let \(G=\mathbb{E}[\partial g(\mathbf{w},\theta_{0})/\partial\theta']\) denote the Jacobian of the moment conditions with respect to \(\theta\) and \(S=\mathbb{E}[g(\mathbf{w},\theta_{0})g(\mathbf{w},\theta_{0})']\). Then:
\[\sqrt{N}(\hat{\theta}_{GMM}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ (G'\mathbf{W}G)^{-1}G'\mathbf{W}S\mathbf{W}G(G'\mathbf{W}G)^{-1}\right)\]
13.5.1 The Optimal Weighting Matrix
As with minimum distance, the asymptotic variance depends on the choice of \(\mathbf{W}\). The optimal weighting matrix is:
\[\mathbf{W}^{*}=S^{-1} = \left(\mathbb{E}[g(\mathbf{w},\theta_{0})g(\mathbf{w},\theta_{0})']\right)^{-1}\]
Under this choice, the variance simplifies to:
\[V_{GMM}^{*} = (G'S^{-1}G)^{-1}\]
When the model is just-identified (the number of moments equals the number of parameters), the GMM estimator does not depend on \(\mathbf{W}\) at all: the sample moments can be set exactly to zero, \(\mathbf{g}_{N}(\hat{\theta})=\mathbf{0}\), regardless of the weighting.
13.5.2 Feasible Efficient GMM
In practice, \(S\) depends on \(\theta_{0}\) and must be estimated. A common approach is two-step GMM:
- Estimate \(\hat{\theta}_{1}\) using some initial weighting matrix (e.g. \(\mathbf{W}=I\)).
- Compute \(\hat{S}=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n},\hat{\theta}_{1})g(\mathbf{w}_{n},\hat{\theta}_{1})'\).
- Re-estimate: \(\hat{\theta}_{2} = \arg\min_{\theta}\mathbf{g}_{N}(\theta)'\hat{S}^{-1}\mathbf{g}_{N}(\theta)\).
The resulting estimator \(\hat{\theta}_{2}\) is asymptotically efficient. The first-stage estimation of \(\hat{S}\) does not affect the asymptotic variance because \(\hat{S}\rightarrow_{p}S\) under standard conditions, and the weighting matrix appears in the asymptotic variance only through its probability limit.
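A minimal sketch of the two-step procedure for a hypothetical over-identified problem is given below: estimating the mean \(\mu\) of an exponential random variable from the two moment conditions \(\mathbb{E}[w-\mu]=0\) and \(\mathbb{E}[w^{2}-2\mu^{2}]=0\). The data and parameter values are simulated purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical over-identified example: w ~ Exponential with mean mu_0.
rng = np.random.default_rng(4)
mu_0 = 2.0
w = rng.exponential(mu_0, size=2000)

def g(mu):
    """N x 2 matrix of moment contributions g(w_n, mu)."""
    return np.column_stack([w - mu, w**2 - 2 * mu**2])

def gmm_objective(mu, W):
    g_bar = g(mu).mean(axis=0)
    return g_bar @ W @ g_bar

# Step 1: initial estimate with identity weighting.
mu_1 = minimize_scalar(lambda mu: gmm_objective(mu, np.eye(2)),
                       bounds=(0.1, 10.0), method="bounded").x

# Step 2: estimate S at mu_1 and re-minimize with W = S^{-1}.
G1 = g(mu_1)
S_hat = G1.T @ G1 / len(w)
mu_2 = minimize_scalar(lambda mu: gmm_objective(mu, np.linalg.inv(S_hat)),
                       bounds=(0.1, 10.0), method="bounded").x
print(mu_1, mu_2)  # both close to mu_0; mu_2 is asymptotically efficient within this moment set
```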
13.6 Efficiency
Given two consistent, asymptotically normal estimators \(\hat{\theta}_{1}\) and \(\hat{\theta}_{2}\), we say \(\hat{\theta}_{1}\) is asymptotically efficient relative to \(\hat{\theta}_{2}\) if \(V_{2}-V_{1}\) is positive semi-definite, where \(V_{j}\) is the asymptotic variance of \(\hat{\theta}_{j}\). A natural question is: among the class of estimators we have discussed, is there a “best” one?
13.6.1 Efficiency of Maximum Likelihood
The answer is yes, under the assumption that the model is correctly specified. The MLE attains the Cramér-Rao lower bound: for any (suitably regular) consistent, asymptotically normal estimator \(\hat{\theta}\) with asymptotic variance \(V[\hat{\theta}]\),
\[V[\hat{\theta}] - \mathcal{I}(\theta_{0})^{-1}\geq 0\]
in the positive semi-definite sense. Since the MLE has asymptotic variance \(\mathcal{I}(\theta_{0})^{-1}\), no other estimator in this class can do better.
More concretely, one can show that MLE is efficient relative to any GMM estimator that uses moment conditions implied by the model. The argument proceeds by showing that for any GMM estimator with moments \(g(\mathbf{w},\theta)\), the difference in asymptotic variances is:
\[V_{GMM} - V_{MLE} = \mathbb{E}[\mathbf{m}\mathbf{s}']^{-1}\mathbb{E}[\mathbf{U}\mathbf{U}']\mathbb{E}[\mathbf{s}\mathbf{m}']^{-1}\geq 0\]
where \(\mathbf{m}\) is the influence function of the GMM estimator and \(\mathbf{U} = \mathbf{m}-\mathbb{E}[\mathbf{m}\mathbf{s}']\mathbb{E}[\mathbf{s}\mathbf{s}']^{-1}\mathbf{s}\) is the projection residual of \(\mathbf{m}\) on \(\mathbf{s}\). This is non-negative by construction, and equals zero only when \(\mathbf{m}\) is a linear function of \(\mathbf{s}\) — i.e. when the GMM estimator fully exploits the likelihood.
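The efficiency ranking can be illustrated with a small Monte Carlo experiment; the Poisson example below is hypothetical. The MLE of \(\lambda\) is the sample mean, while an alternative GMM estimator based only on the second-moment condition \(\mathbb{E}[w^{2}]=\lambda+\lambda^{2}\) is consistent but does not fully exploit the likelihood, so its sampling variance should be (weakly) larger.

```python
import numpy as np

# Hypothetical Monte Carlo: Poisson(lambda_0) data.
# MLE of lambda is the sample mean; the alternative estimator solves the
# sample analogue of E[w^2] = lambda + lambda^2.
rng = np.random.default_rng(5)
lam_0, N, reps = 3.0, 500, 5000

mle, alt = np.empty(reps), np.empty(reps)
for r in range(reps):
    w = rng.poisson(lam_0, size=N)
    mle[r] = w.mean()
    m2 = np.mean(w**2)
    alt[r] = (-1.0 + np.sqrt(1.0 + 4.0 * m2)) / 2.0   # positive root of lam + lam^2 = m2

print(mle.var(), alt.var())   # the MLE variance should be (weakly) smaller
```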
13.7 Two-Step Estimators
Many structural estimators proceed in stages. In Example 7.1, we first estimated the selection equation by MLE and then used these estimates in a second-stage OLS regression. In the search model, we might first estimate the wage distribution and then back out the reservation wage. The theory of two-step estimators formalizes the effect of first-stage estimation uncertainty on second-stage inference.
13.7.1 Setup
Suppose the estimator is defined by two sets of moment conditions:
- First step: Estimate \(\hat{\gamma}\) via \(\frac{1}{N}\sum_{n}g_{1}(\mathbf{w}_{n},\hat{\gamma})=\mathbf{0}\)
- Second step: Estimate \(\hat{\beta}\) via \(\frac{1}{N}\sum_{n}g_{2}(\mathbf{w}_{n},\hat{\gamma},\hat{\beta})=\mathbf{0}\)
The key feature is that the second step depends on the first-step estimates.
13.7.2 Asymptotic Distribution
To derive the joint distribution, stack the moment conditions. Let \(\alpha=(\gamma',\beta')'\) and write the full system as: \[\frac{1}{N}\sum_{n}\begin{bmatrix}g_{1}(\mathbf{w}_{n},\gamma)\\ g_{2}(\mathbf{w}_{n},\gamma,\beta)\end{bmatrix} = \mathbf{0}\]
The Jacobian of this system has a block-triangular structure: \[\Gamma = \begin{bmatrix}\Gamma_{1\gamma} & 0 \\ \Gamma_{2\gamma} & \Gamma_{2\beta}\end{bmatrix}\]
where \(\Gamma_{1\gamma}=\mathbb{E}[\nabla_{\gamma}g_{1}']\), \(\Gamma_{2\gamma}=\mathbb{E}[\nabla_{\gamma}g_{2}']\), and \(\Gamma_{2\beta}=\mathbb{E}[\nabla_{\beta}g_{2}']\). The zero in the upper-right block reflects the fact that the first step does not depend on \(\beta\).
Applying the standard GMM formula, the asymptotic variance of \(\hat{\beta}\) is:
\[V_{\beta} = \Gamma_{2\beta}^{-1}\mathbb{E}[(g_{2}-\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1})(g_{2}-\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1})']\Gamma_{2\beta}^{-1\prime}\]
The term \(\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1}\) captures the correction for first-stage estimation error. If we ignored this term and computed standard errors using only the second-stage moment conditions, we would generally get incorrect inference.
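The correction can be computed mechanically from sample analogues of the stacked system. The helper below (the name two_step_vcov and its interface are hypothetical) takes the per-observation moment contributions and the Jacobian blocks of the mean moments evaluated at the estimates, and returns both corrected and naive variance estimates for \(\hat{\beta}\).

```python
import numpy as np

def two_step_vcov(g1, g2, G1g, G2g, G2b):
    """Corrected variance estimate for the second-step estimator beta_hat.

    g1  : (N, k1) per-observation first-step moments at gamma_hat
    g2  : (N, k2) per-observation second-step moments at (gamma_hat, beta_hat)
    G1g : (k1, k1) sample Jacobian of the mean first-step moments w.r.t. gamma
    G2g : (k2, k1) sample Jacobian of the mean second-step moments w.r.t. gamma
    G2b : (k2, k2) sample Jacobian of the mean second-step moments w.r.t. beta
    """
    N = g1.shape[0]
    A = np.linalg.solve(G1g.T, G2g.T).T          # Gamma_2gamma @ Gamma_1gamma^{-1}
    adj = g2 - g1 @ A.T                          # corrected moment: g2 - A g1
    G2b_inv = np.linalg.inv(G2b)
    # Variance estimates for beta_hat itself (already divided by N)
    V_corrected = G2b_inv @ (adj.T @ adj / N) @ G2b_inv.T / N
    V_naive = G2b_inv @ (g2.T @ g2 / N) @ G2b_inv.T / N
    return V_corrected, V_naive
```

The naive output corresponds to treating \(\hat{\gamma}\) as known; comparing the two diagonals shows how much the first-stage noise matters in a given application.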
There is an important special case: if \(\Gamma_{2\gamma}=\mathbb{E}[\nabla_{\gamma}g_{2}']=0\), then the correction vanishes and the first-stage estimation has no effect on the second-stage asymptotic variance. Intuitively, this happens when the second-step moment conditions are locally insensitive to the first-step parameters at the true values.
A classic example is the two-stage IV estimator where the first stage is a probit. Here, the second-stage moment condition takes the form \(\mathbb{E}[\Phi(\mathbf{x}'\gamma_{0})\cdot u]=0\). One can show that \(\mathbb{E}[\nabla_{\gamma}g_{2}']=0\): the derivative with respect to \(\gamma\) is \(\phi(\mathbf{x}'\gamma_{0})\,\mathbf{x}\cdot u\), which has mean zero because \(u\) is orthogonal to any function of the instruments under the exogeneity assumption.