13  Asymptotic Theory

In the previous chapter, we introduced three classes of extremum estimators — maximum likelihood, GMM, and minimum distance — with examples from each of our prototype models. Now we turn to the statistical theory that governs these estimators. Recall the two key properties we set out to establish:

  1. Consistency: Does \(\hat{\theta}\rightarrow\theta_{0}\) as the sample grows?
  2. Inference: What is the sampling distribution of \(\hat{\theta}\) around \(\theta_{0}\)?

We first develop results that apply broadly to all extremum estimators, and then specialize to maximum likelihood, minimum distance, and GMM in turn. Throughout, we use our prototype models to illustrate how the theory translates into practice.

The results in this section follow Newey and McFadden (1994) very closely, and you can find a more thorough and precise treatment of the theory in that text.

13.1 Definitions

We begin by formally defining the classes of estimators we will study. The broadest class is the extremum estimator.

Definition 13.1 (Extremum Estimator) \(\hat{\theta}\) is an extremum estimator if: \[\hat{\theta} = \arg\max_{\theta\in\Theta}Q_{N}(\theta)\] where \(\Theta\subset\mathbb{R}^{p}\) and \(Q_{N}(\cdot)\) is some objective function that depends on the data.

This is a very broad definition. All of the estimators we encountered in the previous chapter fall into this class. What distinguishes them is the structure of \(Q_{N}\).

13.1.1 M-estimators

An important subclass of extremum estimators arises when the objective function is an average over the sample:

Definition 13.2 (M-estimator) \(\hat{\theta}\) is an M-estimator if: \[Q_{N}(\theta) = \frac{1}{N}\sum_{n=1}^{N}m(\mathbf{w}_{n},\theta)\] for some known function \(m\).

The “M” stands for “maximum” (or “minimum”). Two of our workhorse estimators are M-estimators:

  • Maximum Likelihood: \(m(\mathbf{w}_{n},\theta) = \log f(y_{n}|\mathbf{x}_{n},\theta)\). This is the log-likelihood contribution of observation \(n\).
  • Nonlinear Least Squares: \(m(\mathbf{w}_{n},\theta) = -(y_{n}-\varphi(\mathbf{x}_{n},\theta))^{2}\). Here \(\varphi(\mathbf{x},\theta)\) is a regression function and the objective penalizes deviations of \(y\) from its conditional mean.
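To make the definition concrete, here is a minimal sketch of nonlinear least squares as an M-estimator, using a simulated exponential regression (the model and parameter values here are illustrative, not one of our prototypes):

```julia
using Optim, Random, Statistics

Random.seed!(1234)
# Simulate y = exp(θ₀ x) + ε with θ₀ = 0.5 (illustrative values)
N = 2_000
x = randn(N)
y = exp.(0.5 .* x) .+ 0.3 .* randn(N)

# Q_N(θ) = (1/N) Σₙ m(wₙ, θ) with m(w, θ) = -(y - exp(θx))²
Q_N(θ) = -mean((y .- exp.(θ[1] .* x)).^2)

# The M-estimator maximizes Q_N, i.e. minimizes -Q_N
res = optimize(θ -> -Q_N(θ), [0.0], BFGS())
θ̂ = res.minimizer[1]
```

Since \(Q_{N}\) is a sample average of per-observation terms, this fits Definition 13.2 exactly; the same template covers maximum likelihood by swapping a log-density in for \(m\).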

13.1.2 GMM Estimator

The GMM estimator is defined by a set of moment conditions \(\mathbb{E}[g(\mathbf{w},\theta_{0})]=\mathbf{0}\):

Definition 13.3 (GMM Estimator) \[Q_{N}(\theta) = -\frac{1}{2}\mathbf{g}_{N}(\theta)'\hat{\mathbf{W}}\mathbf{g}_{N}(\theta),\qquad\mathbf{g}_{N}(\theta)=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n},\theta)\] where \(\hat{\mathbf{W}}\) is a positive definite weighting matrix.

Note that GMM is not, in general, an M-estimator: \(Q_{N}\) is a quadratic form in the sample average \(\mathbf{g}_{N}(\theta)\), not a sample average of per-observation terms (expanding the quadratic form produces cross-products across observations). We will return to the specific properties of GMM in a later section.

13.1.3 Minimum Distance Estimator

The minimum distance estimator works with a first-stage reduced-form estimate \(\hat{\pi}\) and model restrictions \(\psi(\pi,\theta)\):

Definition 13.4 (Minimum Distance Estimator) \[Q_{N}(\theta) = -\frac{1}{2}\psi(\hat{\pi}_{N},\theta)'\hat{\mathbf{W}}\psi(\hat{\pi}_{N},\theta)\] where \(\psi(\pi_{0},\theta_{0})=\mathbf{0}\) and \(\sqrt{N}(\hat{\pi}_{N}-\pi_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\).

The minimum distance estimator differs from GMM in that the objective depends on the data only through the first-stage statistic \(\hat{\pi}\), rather than through the individual observations directly. We will study its asymptotic properties in a dedicated section.

13.2 Consistency

An extremum estimator solves \(\hat{\theta} = \arg\max_{\theta\in\Theta}Q_{N}(\theta)\). Let \(Q_{0}(\theta)\) denote the population analogue: the probability limit of \(Q_{N}(\theta)\).

When can we guarantee that \(\hat{\theta}\rightarrow_{p}\theta_{0}\)? Intuitively, two conditions are needed:

  1. Identification: The population objective \(Q_{0}(\theta)\) must be uniquely maximized at \(\theta_{0}\). If there were multiple maximizers, convergence of \(Q_{N}\) to \(Q_{0}\) would not pin down which one \(\hat{\theta}\) approaches.
  2. Convergence: \(Q_{N}(\theta)\) must converge to \(Q_{0}(\theta)\) in a sufficiently strong sense that the maximizer of \(Q_{N}\) tracks the maximizer of \(Q_{0}\).

These two conditions are the backbone of every consistency argument. The precise form of the convergence condition depends on the structure of the problem.

Theorem 13.1 (Consistency with Compactness) Suppose the following conditions hold:

  1. \(\Theta\) is a compact subset of \(\mathbb{R}^{p}\)
  2. \(Q_{N}(\theta)\) is continuous in \(\theta\) for all realizations of the data
  3. \(Q_{N}(\theta)\) is a measurable function of the data for all \(\theta\in\Theta\)

and additionally:

  1. Identification: \(Q_{0}(\theta)\) is uniquely maximized at \(\theta_{0}\in\Theta\)
  2. Uniform Convergence: \(\sup_{\theta\in\Theta}|Q_{N}(\theta)-Q_{0}(\theta)|\rightarrow_{p}0\)

Then \(\hat{\theta}\rightarrow_{p}\theta_{0}\).

The idea is straightforward. Pick any open neighborhood \(\mathcal{N}\) around \(\theta_{0}\). We want to show that \(\hat{\theta}\in\mathcal{N}\) with probability approaching 1.

Since \(\theta_{0}\) uniquely maximizes \(Q_{0}\), \(Q_{0}\) is continuous (being the uniform limit of continuous functions), and \(\Theta\setminus\mathcal{N}\) is compact (a closed subset of a compact set), there exists a gap: \[\varepsilon = Q_{0}(\theta_{0}) - \sup_{\theta\in\Theta\setminus\mathcal{N}}Q_{0}(\theta) > 0\]

Now, by uniform convergence: \[Q_{N}(\hat{\theta}) \geq Q_{N}(\theta_{0}) \geq Q_{0}(\theta_{0}) - \varepsilon/2\] with probability approaching 1. At the same time, for any \(\theta\in\Theta\setminus\mathcal{N}\): \[Q_{N}(\theta) \leq Q_{0}(\theta) + \varepsilon/2 \leq Q_{0}(\theta_{0}) - \varepsilon/2\] also with probability approaching 1. Thus \(\hat{\theta}\) cannot lie outside \(\mathcal{N}\), and since \(\mathcal{N}\) was arbitrary, \(\hat{\theta}\rightarrow_{p}\theta_{0}\).

The Necessity of Uniform Convergence

Here is an example showing the necessity of uniform (as opposed to pointwise) convergence. Suppose that \(\{X_{n}\}_{n=1}^{N}\) are iid data with \(\mathbb{E}[X]=1\) and \(\overline{X}_{N}=\frac{1}{N}\sum_{n=1}^{N}X_{n}\). Define:

\[Q_{N}(\theta) = 2\mathbf{1}\{\theta = \sqrt{N}/(1+\sqrt{N})\} + \overline{X}_{N} - \theta^2 \]

Note that \(Q_{0}(\theta) = 1 - \theta^2\) and so \(\theta_{0} = 0\), but that \(\hat{\theta}_{N}\rightarrow_{p}1\).

The following animation illustrates this. For each \(N\), we draw \(\{X_n\}\) and plot \(Q_N(\theta)\) as a continuous curve (the \(\overline{X}_N - \theta^2\) part) plus the spike at \(\theta = \sqrt{N}/(1+\sqrt{N})\). Even though \(Q_N\) converges pointwise to \(Q_0\) (shown in red), the spike always makes the maximizer \(\hat{\theta}_N\) (orange diamond) drift toward 1, never settling at \(\theta_0 = 0\) (green circle).
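A quick numerical sketch of the same point (standard library only): for any realization of the data, the spike dominates, since \(2 > s^{2}\) for \(s=\sqrt{N}/(1+\sqrt{N})<1\), so the maximizer sits exactly at the spike:

```julia
using Random, Statistics

Random.seed!(1234)
θ̂s = Float64[]
for N in (10, 1_000, 100_000)
    X̄ = mean(1 .+ randn(N))       # iid draws with E[X] = 1
    s = sqrt(N) / (1 + sqrt(N))    # location of the spike
    # The spike value 2 + X̄ - s² always beats the smooth part's maximum X̄ (at θ = 0)
    θ̂ = (2 + X̄ - s^2 > X̄) ? s : 0.0
    push!(θ̂s, θ̂)
    println("N = $N:  θ̂_N = $(round(θ̂, digits = 4))")
end
```

Pointwise, the indicator at any fixed \(\theta\) is eventually zero, so \(Q_{N}(\theta)\rightarrow_{p}Q_{0}(\theta)\); but the sup-distance between \(Q_{N}\) and \(Q_{0}\) never shrinks, which is exactly what uniform convergence rules out.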

Compactness is a strong assumption. Many parameter spaces of interest are not bounded (e.g. regression coefficients). The following result relaxes compactness at the cost of requiring concavity.

Theorem 13.2 (Consistency without Compactness) Suppose the following conditions hold:

  1. \(\theta_{0}\in\text{int}(\Theta)\)
  2. \(Q_{N}(\theta)\) is concave in \(\theta\) for all realizations of the data
  3. \(Q_{N}(\theta)\) is a measurable function of the data for all \(\theta\in\Theta\)

and additionally:

  1. Identification: \(Q_{0}(\theta)\) is uniquely maximized at \(\theta_{0}\in\Theta\)
  2. Pointwise Convergence: \(Q_{N}(\theta)\rightarrow_{p}Q_{0}(\theta)\) for all \(\theta\in\Theta\)

Then \(\hat{\theta}\rightarrow_{p}\theta_{0}\).

The key insight is that concavity turns pointwise convergence into uniform convergence on compact subsets (this follows from a result in convex analysis due to Rockafellar, 1970). Combined with the fact that \(\theta_{0}\) is an interior point, one can construct a compact set around \(\theta_{0}\) that traps \(\hat{\theta}\) with probability approaching 1 and then apply the logic of Theorem 13.1.

13.2.1 Uniform Convergence for M-Estimators

The uniform convergence condition of Theorem 13.1 is stronger than pointwise convergence and deserves some attention. For M-estimators of the form \(Q_{N}(\theta)=\frac{1}{N}\sum_{n=1}^{N}m(\mathbf{w}_{n},\theta)\), the question reduces to asking for a uniform law of large numbers. The following result provides simple sufficient conditions.

Theorem 13.3 (Uniform Law of Large Numbers) Suppose that \(\{\mathbf{w}_{n}\}_{n=1}^{N}\) is an ergodic stationary sequence and:

  1. \(\Theta\) is compact
  2. \(m(\mathbf{w},\theta)\) is continuous in \(\theta\) for all \(\mathbf{w}\)
  3. \(m(\mathbf{w},\theta)\) is measurable in \(\mathbf{w}\) for all \(\theta\)
  4. There exists \(d(\mathbf{w})\) with \(|m(\mathbf{w},\theta)|\leq d(\mathbf{w})\) for all \(\theta\in\Theta\) and \(\mathbb{E}[d(\mathbf{w})]<\infty\)

Then:

  1. \(\sup_{\theta\in\Theta}|Q_{N}(\theta)-Q_{0}(\theta)|\rightarrow_{p}0\); and
  2. \(Q_{0}(\theta) = \mathbb{E}[m(\mathbf{w},\theta)]\) is continuous in \(\theta\).

In practice, the dominance condition (4) is verified by checking \(\mathbb{E}[\sup_{\theta\in\Theta}|m(\mathbf{w},\theta)|]<\infty\). This is straightforward for many common estimators. See Newey and McFadden (1994) for a comprehensive treatment of these results.
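As a sanity check of the uniform law of large numbers, here is a small sketch with the illustrative choice \(m(\mathbf{w},\theta)=-(w-\theta)^{2}\) and \(w\sim\mathcal{N}(0,1)\), so that \(Q_{0}(\theta)=-(1+\theta^{2})\) and the dominance condition holds on \(\Theta=[-2,2]\) with \(d(w)=(|w|+2)^{2}\):

```julia
using Random, Statistics

Random.seed!(1234)
θgrid = range(-2, 2, length = 201)   # grid over the compact set Θ = [-2, 2]
sups = Float64[]
for N in (100, 10_000, 1_000_000)
    w = randn(N)
    Q_N(θ) = -mean((w .- θ).^2)      # sample objective
    Q_0(θ) = -(1 + θ^2)              # population objective
    push!(sups, maximum(abs(Q_N(θ) - Q_0(θ)) for θ in θgrid))
end
sups   # sup-distance shrinks toward 0, at roughly a 1/√N rate
```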

13.2.2 Consistency of Maximum Likelihood

For maximum likelihood, the objective is \(Q_{N}(\theta)=\frac{1}{N}\sum_{n=1}^{N}\log f(\mathbf{w}_{n};\theta)\), and the population analogue is \(Q_{0}(\theta) = \mathbb{E}_{\theta_{0}}[\log f(\mathbf{w};\theta)]\). Identification for MLE has an elegant justification through the Kullback-Leibler inequality: for any two densities \(g\) and \(h\),

\[\mathbb{E}_{g}\left[\log\frac{g(\mathbf{w})}{h(\mathbf{w})}\right]\geq 0\]

with equality if and only if \(g=h\) almost everywhere. Applying this with \(g(\cdot) = f(\cdot;\theta_{0})\) and \(h(\cdot) = f(\cdot;\theta)\) gives:

\[\mathbb{E}_{\theta_{0}}[\log f(\mathbf{w};\theta_{0})] \geq \mathbb{E}_{\theta_{0}}[\log f(\mathbf{w};\theta)]\]

with equality if and only if \(f(\cdot;\theta)=f(\cdot;\theta_{0})\) almost everywhere. Thus, as long as different values of \(\theta\) imply different densities (a natural notion of identification for parametric models), the population log-likelihood is uniquely maximized at \(\theta_{0}\).
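We can illustrate this identification argument with a quick Monte Carlo for a normal location model, \(w\sim\mathcal{N}(\theta_{0},1)\) with \(\theta_{0}=0.5\) (an illustrative choice; additive constants in the log-density are dropped):

```julia
using Random, Statistics

Random.seed!(1234)
w = 0.5 .+ randn(100_000)             # draws from f(w; θ₀) with θ₀ = 0.5
# Sample analogue of E_{θ₀}[log f(w; θ)] for N(θ, 1), up to a constant
Q_0(θ) = mean(-0.5 .* (w .- θ).^2)
θgrid = -1.0:0.25:2.0
θ_best = θgrid[argmax(Q_0.(θgrid))]   # grid maximizer
```

The average log-likelihood peaks at the grid point closest to \(\theta_{0}\), as the Kullback-Leibler inequality predicts: any \(\theta\neq\theta_{0}\) that implies a different density does strictly worse in the limit.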

Theorem 13.4 (Consistency of Maximum Likelihood) Suppose that \(\{\mathbf{w}_{n}\}\) is ergodic stationary with density \(f(\mathbf{w};\theta_{0})\) and that \(\theta_{0}\in\Theta\). If:

  1. \(\Theta\) is compact
  2. \(\log f(\mathbf{w};\theta)\) is continuous in \(\theta\)
  3. \(f(\mathbf{w};\theta_{0})\neq f(\mathbf{w};\theta)\) with positive probability for all \(\theta\neq\theta_{0}\) (identification)
  4. \(\mathbb{E}[\sup_{\theta\in\Theta}|\log f(\mathbf{w};\theta)|]<\infty\) (dominance)

Then \(\hat{\theta}_{ML}\rightarrow_{p}\theta_{0}\).

Notice that identification here takes a model-specific form: we need different parameter values to imply different distributions for the data. This is a consequence of the fact that MLE relies on a fully specified parametric model.

An analogous result holds without compactness when the log-likelihood is concave in \(\theta\) (as is often the case for exponential family models), replacing (1) with \(\theta_{0}\in\text{int}(\Theta)\) and (4) with pointwise moment conditions.

13.3 Asymptotic Normality for M-Estimators

Having established when \(\hat{\theta}\rightarrow_{p}\theta_{0}\), we now turn to characterizing the rate and distribution of \(\hat{\theta}\) around \(\theta_{0}\). The answer will justify the standard errors and confidence intervals that we routinely compute in applied work.

Consider an M-estimator: \(Q_{N}(\theta) = \frac{1}{N}\sum_{n=1}^{N}m(\mathbf{w}_{n},\theta)\). Define the score (gradient) and Hessian of \(m\):

\[\mathbf{s}(\mathbf{w},\theta) = \frac{\partial m(\mathbf{w},\theta)}{\partial\theta}\qquad(p\times 1)\]

\[\mathbf{H}(\mathbf{w},\theta) = \frac{\partial^{2}m(\mathbf{w},\theta)}{\partial\theta\partial\theta'}\qquad(p\times p)\]

13.3.1 Derivation via the Mean Value Theorem

Since \(\hat{\theta}\) maximizes \(Q_{N}\), the first-order condition gives: \[\frac{1}{N}\sum_{n=1}^{N}\mathbf{s}(\mathbf{w}_{n},\hat{\theta}) = \mathbf{0}\]

A mean value expansion around \(\theta_{0}\) yields: \[\mathbf{0} = \frac{1}{N}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0}) + \left[\frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\bar{\theta})\right](\hat{\theta}-\theta_{0})\]

where \(\bar{\theta}\) lies between \(\hat{\theta}\) and \(\theta_{0}\) (applied row-by-row). Rearranging: \[\sqrt{N}(\hat{\theta}-\theta_{0}) = -\left[\frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\bar{\theta})\right]^{-1}\frac{1}{\sqrt{N}}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0})\]

Now apply two standard arguments:

  1. By the Central Limit Theorem: \(\frac{1}{\sqrt{N}}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Sigma)\) where \(\Sigma = \mathbb{E}[\mathbf{s}(\mathbf{w},\theta_{0})\mathbf{s}(\mathbf{w},\theta_{0})']\).
  2. By a Law of Large Numbers and continuity: \(\frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\bar{\theta})\rightarrow_{p}\mathbb{E}[\mathbf{H}(\mathbf{w},\theta_{0})]\), using that \(\bar{\theta}\rightarrow_{p}\theta_{0}\).

Combining these via Slutsky’s theorem gives the result.

Theorem 13.5 (Asymptotic Normality for M-estimators) Suppose that the consistency conditions hold and additionally:

  1. \(\theta_{0}\in\text{int}(\Theta)\)
  2. \(m(\mathbf{w},\theta)\) is twice continuously differentiable in \(\theta\)
  3. \(\frac{1}{\sqrt{N}}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Sigma)\) with \(\Sigma\) positive definite
  4. \(\mathbb{E}[\sup_{\theta\in\mathcal{N}(\theta_{0})}\|\mathbf{H}(\mathbf{w},\theta)\|]<\infty\) for some neighborhood \(\mathcal{N}(\theta_{0})\)
  5. \(\mathbb{E}[\mathbf{H}(\mathbf{w},\theta_{0})]\) is nonsingular

Then: \[\sqrt{N}(\hat{\theta}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ \mathbb{E}[\mathbf{H}]^{-1}\Sigma\mathbb{E}[\mathbf{H}]^{-1}\right)\]

where \(\mathbb{E}[\mathbf{H}]=\mathbb{E}[\mathbf{H}(\mathbf{w},\theta_{0})]\) and \(\Sigma = \mathbb{E}[\mathbf{s}(\mathbf{w},\theta_{0})\mathbf{s}(\mathbf{w},\theta_{0})']\).

The asymptotic variance \(\mathbb{E}[\mathbf{H}]^{-1}\Sigma\mathbb{E}[\mathbf{H}]^{-1}\) is often called the sandwich formula. In practice, we replace the population expectations with sample analogues: \[\widehat{\mathbb{V}}[\hat{\theta}] = \hat{H}^{-1}\hat{\Sigma}\hat{H}^{-1}/N\] where \(\hat{H} = \frac{1}{N}\sum_{n}\mathbf{H}(\mathbf{w}_{n},\hat{\theta})\) and \(\hat{\Sigma} = \frac{1}{N}\sum_{n}\mathbf{s}(\mathbf{w}_{n},\hat{\theta})\mathbf{s}(\mathbf{w}_{n},\hat{\theta})'\).

13.3.2 The Information Matrix Equality

For maximum likelihood, \(m(\mathbf{w},\theta)=\log f(\mathbf{w};\theta)\), and a remarkable simplification occurs. Under standard regularity conditions, the information matrix equality holds:

\[\mathcal{I}(\theta_{0}) \equiv \mathbb{E}\left[\mathbf{s}(\mathbf{w},\theta_{0})\mathbf{s}(\mathbf{w},\theta_{0})'\right] = -\mathbb{E}\left[\mathbf{H}(\mathbf{w},\theta_{0})\right]\]

Since \(\int f(\mathbf{w};\theta)d\mathbf{w}=1\) for all \(\theta\), differentiating under the integral sign with respect to \(\theta\) gives: \[\int\frac{\partial f(\mathbf{w};\theta)}{\partial\theta}d\mathbf{w} = 0\] which is \(\mathbb{E}_{\theta}[\mathbf{s}(\mathbf{w},\theta)]=0\). Differentiating again: \[\int\frac{\partial^{2}\log f}{\partial\theta\partial\theta'}f\ d\mathbf{w} + \int\frac{\partial\log f}{\partial\theta}\frac{\partial\log f}{\partial\theta'}f\ d\mathbf{w} = 0\] which yields \(\mathbb{E}[\mathbf{H}]+\mathbb{E}[\mathbf{s}\mathbf{s}'] = 0\), i.e. \(\Sigma = -\mathbb{E}[\mathbf{H}]\).

This means that for MLE, the sandwich formula collapses to: \[\sqrt{N}(\hat{\theta}_{ML}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ \mathcal{I}(\theta_{0})^{-1}\right)\]

The matrix \(\mathcal{I}(\theta)\) is called the Fisher information matrix. The MLE variance can be estimated using either the Hessian or the outer product of the score — or indeed the sandwich (which is robust to certain forms of misspecification).

Example 13.1 Let’s illustrate the asymptotic variance formula for the probit model from the Generalized Roy Model (Example 7.1). The probit log-likelihood for a single observation is:

\[m(\mathbf{w}_{n},\gamma) = D_{n}\log\Phi(\mathbf{w}_{n}\gamma) + (1-D_{n})\log(1-\Phi(\mathbf{w}_{n}\gamma))\]

where \(\mathbf{w}_{n} = [1,X_{n},Z_{n}]\). Rather than deriving the Hessian and score analytically, we can use automatic differentiation — a tool covered in more detail in the appendix.

using Distributions, Optim, Random, ForwardDiff, LinearAlgebra, Plots
function sim_data(γ,β0,β1,N ; ρ_0 = 0.3, ρ_1 = -0.3)
    X = rand(Normal(),N)
    Z = rand(Normal(),N)
    v = rand(Normal(),N)
    U0 = rand(Normal(),N) .+ ρ_0.*v
    U1 = rand(Normal(),N) .+ ρ_1.*v
    D = (γ[1] .+ γ[2]*X .+ γ[3]*Z .- v) .> 0
    Y = β0[1] .+ β0[2].*X .+ U0
    Y1 = β1[1] .+ β1[2].*X .+ U1
    Y[D.==1] .= Y1[D.==1]
    return (;X,Z,Y,D)
end

function log_likelihood(γ,data)
    (;D,X,Z) = data
    ll = 0.
    Fv = Normal()
    for n in eachindex(D)
        xg = γ[1] + γ[2]*X[n] + γ[3]*Z[n]
        if D[n] == 1
            ll += log(cdf(Fv,xg))
        else
            ll += log(1-cdf(Fv,xg))
        end
    end
    return ll / length(D)
end

function estimate_probit(data)
    res = optimize(x->-log_likelihood(x,data),zeros(3),Newton(),autodiff=:forward)
    return res.minimizer
end
estimate_probit (generic function with 1 method)

Now let’s estimate the probit model on a single dataset and compute standard errors using the information matrix. We use ForwardDiff to compute the score and Hessian numerically.

gamma = [0., 0.5, 0.1]
beta0 = [0., 0.3]
beta1 = [0., 0.5]
Random.seed!(123)
data = sim_data(gamma, beta0, beta1, 5000)
γ_hat = estimate_probit(data)

# Compute score for each observation
function score_n(n, γ, data)
    function ll_n(g)
        xg = g[1] + g[2]*data.X[n] + g[3]*data.Z[n]
        if data.D[n] == 1
            return log(cdf(Normal(),xg))
        else
            return log(1-cdf(Normal(),xg))
        end
    end
    return ForwardDiff.gradient(ll_n, γ)
end

N = length(data.D)

# Outer product of scores (estimate of Fisher information)
Σ_hat = zeros(3,3)
for n in 1:N
    s = score_n(n, γ_hat, data)
    Σ_hat += s * s'
end
Σ_hat ./= N

# Hessian of average log-likelihood
H_hat = ForwardDiff.hessian(g -> log_likelihood(g, data), γ_hat)

# Asymptotic variance (three ways)
V_sandwich = inv(H_hat) * Σ_hat * inv(H_hat) / N
V_hessian = -inv(H_hat) / N
V_opg = inv(Σ_hat) / N

se_sandwich = sqrt.(diag(V_sandwich))
se_hessian = sqrt.(diag(V_hessian))
se_opg = sqrt.(diag(V_opg))
println("Estimates: $(round.(γ_hat,digits=3))")
println("SE (sandwich): $(round.(se_sandwich,digits=4))")
println("SE (Hessian):  $(round.(se_hessian,digits=4))")
println("SE (OPG):      $(round.(se_opg,digits=4))")
Estimates: [-0.005, 0.481, 0.103]
SE (sandwich): [0.0184, 0.0203, 0.0185]
SE (Hessian):  [0.0185, 0.0203, 0.0187]
SE (OPG):      [0.0185, 0.0202, 0.0188]

Let’s verify these asymptotic standard errors against a Monte Carlo simulation. If the asymptotic theory is working, the standard deviation of the Monte Carlo estimates should be close to the asymptotic SE.

gamma_mc = mapreduce(vcat, 1:500) do b
    d = sim_data(gamma, beta0, beta1, 5000)
    estimate_probit(d)'
end

se_mc = std.(eachcol(gamma_mc))
println("SE (Monte Carlo): $(round.(se_mc,digits=4))")
println("SE (asymptotic):  $(round.(se_hessian,digits=4))")
SE (Monte Carlo): [0.0186, 0.0197, 0.0185]
SE (asymptotic):  [0.0185, 0.0203, 0.0187]

We can also plot the Monte Carlo distribution against the asymptotic normal approximation.

pl = [begin
    histogram(gamma_mc[:,j], normalize=:pdf, label=false, alpha=0.5)
    xgrid = range(extrema(gamma_mc[:,j])..., length=100)
    plot!(xgrid, pdf.(Normal(gamma[j], se_hessian[j]), xgrid),
          linewidth=2, label="Asymptotic Normal", color="red")
    plot!(title = "γ_$(j)")
end for j in 1:3]
plot(pl..., layout=(1,3), size=(900,300))

13.3.3 The Delta Method

In many settings, the parameters of direct interest are not \(\theta\) itself but some smooth function \(g(\theta)\). For instance, in Example 12.1 the search model is parameterized by \(\theta=(h,\delta,\mu,\sigma,w^{*})\) but the economically meaningful objects include \(\lambda\) and \(b\), which are nonlinear functions of \(\theta\). The delta method provides the asymptotic distribution of such transformed parameters.

Theorem 13.6 (Delta Method) If \(\sqrt{N}(\hat{\theta}-\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},V)\) and \(g:\mathbb{R}^{p}\rightarrow\mathbb{R}^{k}\) is continuously differentiable at \(\theta_{0}\) with Jacobian \(\nabla g(\theta_{0})\) of full row rank, then:

\[\sqrt{N}(g(\hat{\theta})-g(\theta_{0}))\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ \nabla g(\theta_{0})\ V\ \nabla g(\theta_{0})'\right)\]

The proof is a direct application of the continuous mapping theorem to the first-order Taylor expansion \(g(\hat{\theta})\approx g(\theta_{0})+\nabla g(\theta_{0})(\hat{\theta}-\theta_{0})\).

In practice, the Jacobian \(\nabla g\) can be computed analytically or by automatic differentiation. This is convenient when \(g\) is a complex function — for instance, when it involves solving the model as in the case of the reservation wage \(w^{*}\) in the search model.
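Here is a minimal sketch of the delta method with an automatic-differentiation Jacobian. The estimates \(\hat{\theta}\) and \(\hat{V}\) below are made-up numbers for illustration, and the transformations (a ratio and a log) are hypothetical:

```julia
using ForwardDiff, LinearAlgebra

# Hypothetical estimates of θ = (θ₁, θ₂) and their variance (illustrative numbers)
θ̂ = [1.0, 0.5]
V̂ = [0.04 0.01;
     0.01 0.09]

# Parameters of interest: g(θ) = (θ₁/θ₂, log θ₁)
g(θ) = [θ[1] / θ[2], log(θ[1])]

# Delta method: V[g(θ̂)] ≈ ∇g V̂ ∇g'
∇g = ForwardDiff.jacobian(g, θ̂)
V_g = ∇g * V̂ * ∇g'
se_g = sqrt.(diag(V_g))
```

When \(g\) involves solving the model numerically (as with the reservation wage \(w^{*}\)), ForwardDiff can often differentiate through the solver as well, provided the solver code is written generically in the element type.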

13.4 Minimum Distance Estimators

The minimum distance estimator takes a different approach from the M-estimators discussed above. Instead of directly optimizing an objective over the raw data, it works in two stages: first estimate a reduced-form object \(\pi\), then find the structural parameters \(\theta\) that best fit the model’s implications for \(\pi\).

13.4.1 Setup

Let \(\psi(\pi,\theta)\) be a vector of \(J\) model restrictions satisfying \(\psi(\pi_{0},\theta_{0})=\mathbf{0}\). For example, in the savings model from Example 12.2, \(\pi\) consists of the variances of log income at each age, and \(\psi(\pi,\theta)=\pi - \mathbf{v}(\theta)\) measures the gap between observed and model-implied moments.

Suppose we have a first-stage estimator \(\hat{\pi}\) with: \[\sqrt{N}(\hat{\pi}-\pi_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\]

The minimum distance estimator is: \[\hat{\theta} = \arg\min_{\theta}\psi(\hat{\pi},\theta)'\mathbf{W}_{N}\psi(\hat{\pi},\theta)\] where \(\mathbf{W}_{N}\) is a positive definite weighting matrix.

13.4.2 Asymptotic Distribution

Theorem 13.7 (Asymptotics for Minimum Distance) Suppose:

  1. \(\psi(\pi_{0},\theta_{0})=\mathbf{0}\) and \(\psi(\pi_{0},\theta)\neq\mathbf{0}\) for all \(\theta\neq\theta_{0}\) (identification)
  2. \(\sqrt{N}(\hat{\pi}-\pi_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\)
  3. \(\mathbf{W}_{N}\rightarrow_{p}\mathbf{W}\), symmetric and nonsingular
  4. \(\psi\) is differentiable with \(\text{rank}(\nabla_{\theta}\psi_{0})=p\)

Define \(\nabla_{\theta}\psi_{0} = \frac{\partial\psi(\pi_{0},\theta_{0})'}{\partial\theta}\) and \(\nabla_{\pi}\psi_{0}=\frac{\partial\psi(\pi_{0},\theta_{0})'}{\partial\pi}\). Then:

\[\sqrt{N}(\hat{\theta}-\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\ V_{MD})\] where: \[V_{MD} = \left(\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}'\right)^{-1}\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\pi}\psi_{0}'\Omega\nabla_{\pi}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}'\left(\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}'\right)^{-1}\]

The first-order condition of the minimum distance problem is: \[\nabla_{\theta}\psi(\hat{\pi},\hat{\theta})\mathbf{W}_{N}\psi(\hat{\pi},\hat{\theta}) = \mathbf{0}\]

Expanding \(\psi(\hat{\pi},\hat{\theta})\) around \((\pi_{0},\theta_{0})\): \[\psi(\hat{\pi},\hat{\theta})\approx \nabla_{\pi}\psi_{0}'(\hat{\pi}-\pi_{0}) + \nabla_{\theta}\psi_{0}'(\hat{\theta}-\theta_{0})\]

Substituting and solving: \[\sqrt{N}(\hat{\theta}-\theta_{0})\approx -(\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\theta}\psi_{0}')^{-1}\nabla_{\theta}\psi_{0}\mathbf{W}\nabla_{\pi}\psi_{0}'\sqrt{N}(\hat{\pi}-\pi_{0})\]

The result follows from the asymptotic distribution of \(\hat{\pi}\).

13.4.3 The Optimal Weighting Matrix

The variance \(V_{MD}\) depends on the choice of \(\mathbf{W}\). The optimal weighting matrix minimizes \(V_{MD}\) (in the positive semi-definite sense) and is given by:

\[\mathbf{W}^{*} = \left(\nabla_{\pi}\psi_{0}'\Omega\nabla_{\pi}\psi_{0}\right)^{-1}\]

Under this choice, the variance simplifies to: \[V_{MD}^{*} = \left(\nabla_{\theta}\psi_{0}\left(\nabla_{\pi}\psi_{0}'\Omega\nabla_{\pi}\psi_{0}\right)^{-1}\nabla_{\theta}\psi_{0}'\right)^{-1}\]

In the common case where \(\psi(\pi,\theta) = \pi-h(\theta)\) for some function \(h\), the derivatives simplify: \(\nabla_{\pi}\psi_{0} = I\) and \(\nabla_{\theta}\psi_{0} = -\nabla_{\theta}h(\theta_{0})\). Then \(\mathbf{W}^{*}=\Omega^{-1}\) and:

\[V_{MD}^{*} = \left(\nabla_{\theta}h_{0}\Omega^{-1}\nabla_{\theta}h_{0}'\right)^{-1}\]

Efficiency of Optimal Minimum Distance

Suppose that \(\hat{\pi}\) is itself estimated by maximum likelihood, so that \(\Omega = \mathbb{E}[s_{\pi}s_{\pi}']^{-1}\), the inverse of the Fisher information for \(\pi\), and let \(\dim(\psi)=\dim(\pi)\), so that \(\nabla_{\pi}\psi_{0}\) is square and invertible. The implicit function theorem gives: \[ \nabla_\theta \pi = -\nabla_\theta\psi_{0}\,(\nabla_\pi\psi_{0})^{-1} \]

So, starting from the optimal-weighting variance: \[\begin{align} V_{MD}^{*} &= \left(\nabla_\theta\psi_0\left(\nabla_\pi\psi_0'\, \mathbb{E}[s_\pi s_\pi']^{-1}\, \nabla_\pi \psi_0\right)^{-1}\nabla_\theta \psi_0'\right)^{-1} \\ &= \left(\nabla_\theta \psi_0\,(\nabla_\pi\psi_0)^{-1}\,\mathbb{E}[s_\pi s_\pi']\,(\nabla_\pi \psi_0')^{-1}\,\nabla_\theta \psi_0'\right)^{-1} \\ &= \left(\nabla_\theta \pi\, \mathbb{E}[s_\pi s_\pi']\,\nabla_\theta \pi'\right)^{-1} \\ &= \mathbb{E}[s_\theta s_\theta']^{-1} = \mathcal{I}_{\theta}^{-1} \end{align}\]

where the last step uses the chain rule \(s_{\theta} = \nabla_{\theta}\pi\, s_{\pi}\). Thus, the optimally weighted minimum distance estimator \(\hat{\theta}\) attains the Cramér-Rao lower bound: it is asymptotically as efficient as maximizing the likelihood over \(\theta\) directly.

This efficiency result relies on \(\dim(\psi)=\dim(\pi)\): the restrictions fully characterize the reduced form, so no information in \(\hat{\pi}\) is discarded. When the restrictions summarize only part of the reduced form, the minimum distance estimator generally loses some efficiency relative to MLE, but it gains robustness, since only the matched features of the model need to be correctly specified.

Example 13.2 Let’s extend the minimum distance estimation from Example 12.2 to compute standard errors for the income process parameters. Recall that we matched the variance of log income at each age to the model-implied variances.

Since our restrictions take the form \(\psi(\pi,\theta) = \pi - \mathbf{v}(\theta)\), we have \(\nabla_{\pi}\psi=I\) and \(\nabla_{\theta}\psi = -\nabla_{\theta}\mathbf{v}\). Using the identity weighting matrix, the asymptotic variance is:

\[V_{MD} = (\nabla_{\theta}\mathbf{v}'\nabla_{\theta}\mathbf{v})^{-1}\nabla_{\theta}\mathbf{v}'\Omega\nabla_{\theta}\mathbf{v}(\nabla_{\theta}\mathbf{v}'\nabla_{\theta}\mathbf{v})^{-1}\]

where \(\Omega\) is the variance-covariance matrix of the sample variance estimates \(\hat{\pi}\).

using CSV, DataFrames, DataFramesMeta, Statistics, Optim, ForwardDiff, LinearAlgebra, Plots

# Load and prepare data
data_psid = @chain begin
    CSV.read("../data/abb_aea_data.csv",DataFrame,missingstring = "NA")
    @select :person :y :tot_assets1 :asset :age :year
    @subset :age.>=25 :age.<=64
end

# Calculate sample variances by age
moments_df = @chain data_psid begin
    groupby(:age)
    @combine begin
        :var_logy = var(log.(:y))
        :n = length(:y)
    end
    @orderby :age
end
m_hat = moments_df.var_logy
n_age = moments_df.n

# Model-implied moments
function model_moments(θ, T)
    ρ, σ2_α, σ2_η = θ
    m = [σ2_α + (1 - ρ^(2*(t-1)))/(1 - ρ^2) * σ2_η for t in 1:T]
    return m
end

# Transform parameters to enforce constraints
function transform(x)
    ρ = tanh(x[1])
    σ2_α = exp(x[2])
    σ2_η = exp(x[3])
    return [ρ, σ2_α, σ2_η]
end

# Minimum distance objective (identity weight)
function md_objective(x, m_hat)
    θ = transform(x)
    T = length(m_hat)
    m_model = model_moments(θ, T)
    diff = m_hat .- m_model
    return diff' * diff
end

# Estimate
x0 = [0.5, log(0.1), log(0.05)]
res = optimize(x -> md_objective(x, m_hat), x0, Newton(), autodiff = :forward)
x_hat = res.minimizer
θ_hat = transform(x_hat)
3-element Vector{Float64}:
 0.9180604516854505
 0.27854304267515606
 0.08522314351469629

Now we compute standard errors. We need \(\nabla_{\theta}\mathbf{v}\) (the Jacobian of model moments with respect to parameters) and \(\Omega\) (the variance of sample moments). We compute the Jacobian with ForwardDiff and use the delta method to account for the parameter transformation.

T = length(m_hat)

# Jacobian of model moments w.r.t. θ = (ρ, σ²_α, σ²_η)
∇v = ForwardDiff.jacobian(θ -> model_moments(θ, T), θ_hat)


# Estimate Ω: for variances, Var(σ̂²) ≈ 2σ⁴/(n-1) under normality
# A simple approximation using sample sizes at each age
Ω = Diagonal(2 .* m_hat.^2 ./ (n_age .- 1))

# Total sample size (use average n per age as approximation)
N_eff = Int(round(mean(n_age)))

# Asymptotic variance of θ̂ (identity weighting). Ω above is the
# finite-sample variance of π̂ (it carries the 1/(n-1) factor),
# so no further division by the sample size is needed.
G = ∇v  # J × p Jacobian
V_md = inv(G' * G) * G' * Ω * G * inv(G' * G)

# Standard errors
se = sqrt.(diag(V_md))

println("Minimum Distance Estimates with Standard Errors:")
println("  ρ     = $(round(θ_hat[1], digits=3))  ($(round(se[1], digits=4)))")
println("  σ²_α  = $(round(θ_hat[2], digits=3))  ($(round(se[2], digits=4)))")
println("  σ²_η  = $(round(θ_hat[3], digits=3))  ($(round(se[3], digits=4)))")
Minimum Distance Estimates with Standard Errors:
  ρ     = 0.918  (0.0005)
  σ²_α  = 0.279  (0.0009)
  σ²_η  = 0.085  (0.0005)

We can also compute the standard errors under the optimal weighting matrix and compare:

# Optimal weighting
W_opt = inv(Ω)
V_md_opt = inv(G' * W_opt * G)  # Ω already carries the finite-sample scale
se_opt = sqrt.(diag(V_md_opt))

# Re-estimate with optimal weighting
function md_objective_opt(x, m_hat, W)
    θ = transform(x)
    T = length(m_hat)
    m_model = model_moments(θ, T)
    diff = m_hat .- m_model
    return diff' * W * diff
end

res_opt = optimize(x -> md_objective_opt(x, m_hat, W_opt), x0, Newton(), autodiff = :forward)
θ_hat_opt = transform(res_opt.minimizer)

# Recompute Jacobian at new estimates
∇v_opt = ForwardDiff.jacobian(θ -> model_moments(θ, T), θ_hat_opt)
V_md_opt2 = inv(∇v_opt' * W_opt * ∇v_opt)
se_opt2 = sqrt.(diag(V_md_opt2))

println("\nOptimally Weighted Estimates:")
println("  ρ     = $(round(θ_hat_opt[1], digits=3))  ($(round(se_opt2[1], digits=4)))")
println("  σ²_α  = $(round(θ_hat_opt[2], digits=3))  ($(round(se_opt2[2], digits=4)))")
println("  σ²_η  = $(round(θ_hat_opt[3], digits=3))  ($(round(se_opt2[3], digits=4)))")

Optimally Weighted Estimates:
  ρ     = 0.922  (0.0004)
  σ²_α  = 0.271  (0.0008)
  σ²_η  = 0.062  (0.0003)

The optimally weighted estimator should be at least as precise, and is often substantially more so when the moment conditions have very different scales or variances.

Estimating the Variance of Moments

Note that in this example we estimated the variance of the moments under the model-implied assumption that log income is normally distributed. We also assumed that the off-diagonal elements of \(\Omega\) (the covariances across ages) are zero. That assumption is clearly wrong, since the same individuals appear at multiple ages in the sample. How can we estimate these variances and covariances without such strict (and incorrect) assumptions? The next chapter illustrates how, using a block bootstrap routine.

13.5 The Generalized Method of Moments

GMM is an extremum estimator with objective: \[Q_{N}(\theta) = -\frac{1}{2}\mathbf{g}_{N}(\theta)'\mathbf{W}_{N}\mathbf{g}_{N}(\theta),\qquad\mathbf{g}_{N}(\theta)=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n},\theta)\]

where \(\mathbb{E}[g(\mathbf{w},\theta_{0})]=\mathbf{0}\) are the moment conditions. The asymptotic distribution follows from Theorem 13.5 as a special case, but the structure of the problem leads to a particularly clean expression.
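To make the notation concrete, here is a minimal sketch of the GMM objective for a linear instrumental variables model, where \(g(\mathbf{w}_{n},\theta)=\mathbf{z}_{n}(y_{n}-\mathbf{x}_{n}'\theta)\). The data are simulated purely for illustration, and all names are hypothetical rather than taken from earlier examples:

```julia
using Random, LinearAlgebra

Random.seed!(42)
N  = 5_000
θ0 = [1.0, -0.5]

# Simulate a simple linear IV model: x2 is endogenous, z is its instrument
z  = randn(N)                     # instrument
v  = randn(N)                     # first-stage error
u  = 0.5 .* v .+ randn(N)         # structural error, correlated with v
x2 = 0.8 .* z .+ v                # endogenous regressor
X  = [ones(N) x2]
Z  = [ones(N) z]
y  = X * θ0 .+ u

# Sample moments g_N(θ) = (1/N) Σ_n z_n (y_n - x_n'θ)
g_N(θ) = Z' * (y .- X * θ) ./ N

# GMM objective Q_N(θ) = -(1/2) g_N(θ)' W g_N(θ)
Q_N(θ, W) = -0.5 * g_N(θ)' * W * g_N(θ)

# Because the moments are linear in θ, the maximizer of Q_N is available in
# closed form: θ̂ = (G'WG)⁻¹ G'W (-Z'y/N), with Jacobian G = -Z'X/N
G = -Z' * X ./ N
W = Matrix{Float64}(I, 2, 2)
θ̂ = (G' * W * G) \ (G' * W * (-Z' * y ./ N))
println("θ̂ = ", round.(θ̂, digits=3))
```

With nonlinear moments one would instead pass `θ -> -Q_N(θ, W)` to a numerical optimizer, exactly as in the minimum distance code above.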

Theorem 13.8 (Asymptotic Distribution of GMM) Suppose that the standard regularity conditions hold and \(\mathbf{W}_{N}\rightarrow_{p}\mathbf{W}\). Let \(G=\mathbb{E}[\nabla_{\theta}g(\mathbf{w},\theta_{0})']\) and \(S=\mathbb{E}[g(\mathbf{w},\theta_{0})g(\mathbf{w},\theta_{0})']\). Then:

\[\sqrt{N}(\hat{\theta}_{GMM}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ (G'\mathbf{W}G)^{-1}G'\mathbf{W}S\mathbf{W}G(G'\mathbf{W}G)^{-1}\right)\]

13.5.1 The Optimal Weighting Matrix

As with minimum distance, the asymptotic variance depends on the choice of \(\mathbf{W}\). The optimal weighting matrix is:

\[\mathbf{W}^{*}=S^{-1} = \left(\mathbb{E}[g(\mathbf{w},\theta_{0})g(\mathbf{w},\theta_{0})']\right)^{-1}\]

Under this choice, the variance simplifies to:

\[V_{GMM}^{*} = (G'S^{-1}G)^{-1}\]

When the model is just-identified (number of moments equals number of parameters), the GMM estimator does not depend on \(\mathbf{W}\) at all. This is because the sample moments \(\mathbf{g}_{N}(\hat{\theta})=\mathbf{0}\) are set exactly to zero, regardless of the weighting.
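This invariance is easy to verify numerically. The sketch below (simulated data, hypothetical names) computes a just-identified linear IV-GMM estimate under two different weighting matrices and confirms the estimates coincide:

```julia
using Random, LinearAlgebra

Random.seed!(7)
N = 2_000
z = randn(N)
x = 0.7 .* z .+ randn(N)            # one regressor, one instrument
y = 1.5 .* x .+ randn(N)
Z = [ones(N) z]
X = [ones(N) x]

# Closed-form linear GMM: θ̂(W) = (B'WB)⁻¹ B'W a with B = Z'X/N, a = Z'y/N
B = Z' * X ./ N
a = Z' * y ./ N
θ̂(W) = (B' * W * B) \ (B' * W * a)

W1 = Matrix{Float64}(I, 2, 2)
A  = randn(2, 2)
W2 = A' * A + I                     # an arbitrary positive definite matrix

# Just-identified: both choices set g_N(θ̂) = 0 exactly, so estimates agree
println(maximum(abs.(θ̂(W1) .- θ̂(W2))))
```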

13.5.2 Feasible Efficient GMM

In practice, \(S\) depends on \(\theta_{0}\) and must be estimated. A common approach is two-step GMM:

  1. Estimate \(\hat{\theta}_{1}\) using some initial weighting matrix (e.g. \(\mathbf{W}=I\)).
  2. Compute \(\hat{S}=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n},\hat{\theta}_{1})g(\mathbf{w}_{n},\hat{\theta}_{1})'\).
  3. Re-estimate: \(\hat{\theta}_{2} = \arg\min_{\theta}\mathbf{g}_{N}(\theta)'\hat{S}^{-1}\mathbf{g}_{N}(\theta)\).

The resulting estimator \(\hat{\theta}_{2}\) is asymptotically efficient. The first-stage estimation of \(\hat{S}\) does not affect the asymptotic variance because \(\hat{S}\rightarrow_{p}S\) under standard conditions, and the weighting matrix appears in the asymptotic variance only through its probability limit.
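The three steps above can be sketched for an over-identified linear IV model with heteroskedastic errors, where the weighting actually matters. This is a minimal illustration with simulated data and hypothetical names, not code from an earlier example:

```julia
using Random, LinearAlgebra

Random.seed!(11)
N = 5_000
z1, z2 = randn(N), randn(N)
v = randn(N)
u = (0.4 .* v .+ randn(N)) .* (1 .+ 0.5 .* abs.(z1))   # heteroskedastic error
x = 0.6 .* z1 .+ 0.3 .* z2 .+ v                        # endogenous regressor
X = [ones(N) x]
Z = [ones(N) z1 z2]               # 3 moments, 2 parameters: over-identified
y = X * [0.5, 2.0] .+ u

B = Z' * X ./ N                   # moment Jacobian (up to sign)
a = Z' * y ./ N
gmm(W) = (B' * W * B) \ (B' * W * a)

# Step 1: initial estimate with W = I
θ1 = gmm(Matrix{Float64}(I, 3, 3))

# Step 2: estimate S = E[g g'] at θ1, then re-estimate with W = Ŝ⁻¹
gn = Z .* (y .- X * θ1)           # N × 3 matrix of moment contributions
Ŝ  = gn' * gn ./ N
θ2 = gmm(inv(Ŝ))

# Step 3 (inference): efficient asymptotic variance (G'S⁻¹G)⁻¹ / N
V̂  = inv(B' * inv(Ŝ) * B) ./ N
se = sqrt.(diag(V̂))
println("two-step θ̂ = ", round.(θ2, digits=3), "  se = ", round.(se, digits=3))
```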

13.6 Efficiency

Given two consistent, asymptotically normal estimators \(\hat{\theta}_{1}\) and \(\hat{\theta}_{2}\), we say \(\hat{\theta}_{1}\) is asymptotically efficient relative to \(\hat{\theta}_{2}\) if \(V_{2}-V_{1}\) is positive semi-definite, where \(V_{j}\) is the asymptotic variance of \(\hat{\theta}_{j}\). A natural question is: among the class of estimators we have discussed, is there a “best” one?

13.6.1 Efficiency of Maximum Likelihood

The answer is yes, under the assumption that the model is correctly specified. The MLE achieves the Cramér-Rao lower bound: for any regular, consistent, asymptotically normal estimator \(\hat{\theta}\) based on the likelihood,

\[V[\hat{\theta}] - \mathcal{I}(\theta_{0})^{-1}\geq 0\]

in the positive semi-definite sense. Since the MLE has asymptotic variance \(\mathcal{I}(\theta_{0})^{-1}\), no other estimator in this class can do better.

More concretely, one can show that MLE is efficient relative to any GMM estimator that uses moment conditions implied by the model. The argument proceeds by showing that for a GMM estimator with moment function \(\mathbf{m}=g(\mathbf{w},\theta_{0})\), the difference in asymptotic variances is:

\[V_{GMM} - V_{MLE} = \mathbb{E}[\mathbf{m}\mathbf{s}']^{-1}\mathbb{E}[\mathbf{U}\mathbf{U}']\mathbb{E}[\mathbf{s}\mathbf{m}']^{-1}\geq 0\]

where \(\mathbf{s}=\nabla_{\theta}\log f(\mathbf{w};\theta_{0})\) is the score and \(\mathbf{U} = \mathbf{m}-\mathbb{E}[\mathbf{m}\mathbf{s}']\mathbb{E}[\mathbf{s}\mathbf{s}']^{-1}\mathbf{s}\) is the residual from projecting \(\mathbf{m}\) on \(\mathbf{s}\). The middle term is positive semi-definite by construction, and the difference equals zero only when \(\mathbf{m}\) is a linear function of \(\mathbf{s}\), that is, when the GMM estimator fully exploits the likelihood.
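The algebra behind this expression is worth sketching in the just-identified case, where the generalized information matrix equality \(\mathbb{E}[\nabla_{\theta}\mathbf{m}']=-\mathbb{E}[\mathbf{m}\mathbf{s}']\) gives \(V_{GMM}=\mathbb{E}[\mathbf{m}\mathbf{s}']^{-1}\mathbb{E}[\mathbf{m}\mathbf{m}']\mathbb{E}[\mathbf{s}\mathbf{m}']^{-1}\). Expanding the projection residual and cancelling terms:

\[\mathbb{E}[\mathbf{U}\mathbf{U}'] = \mathbb{E}[\mathbf{m}\mathbf{m}'] - \mathbb{E}[\mathbf{m}\mathbf{s}']\mathbb{E}[\mathbf{s}\mathbf{s}']^{-1}\mathbb{E}[\mathbf{s}\mathbf{m}']\]

and therefore

\[\mathbb{E}[\mathbf{m}\mathbf{s}']^{-1}\mathbb{E}[\mathbf{U}\mathbf{U}']\mathbb{E}[\mathbf{s}\mathbf{m}']^{-1} = \underbrace{\mathbb{E}[\mathbf{m}\mathbf{s}']^{-1}\mathbb{E}[\mathbf{m}\mathbf{m}']\mathbb{E}[\mathbf{s}\mathbf{m}']^{-1}}_{V_{GMM}} - \underbrace{\mathbb{E}[\mathbf{s}\mathbf{s}']^{-1}}_{V_{MLE}}\]

since \(\mathbb{E}[\mathbf{s}\mathbf{s}']^{-1}=\mathcal{I}(\theta_{0})^{-1}\) is exactly the MLE variance.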

Discussion: Efficiency vs. Robustness

The efficiency of MLE comes at a price: it requires the entire parametric model to be correctly specified. If the density \(f(\mathbf{w};\theta)\) is wrong — even slightly — the MLE may still converge, but it will converge to a pseudo-true value that minimizes the Kullback-Leibler divergence to the truth, and the information matrix equality will fail. In contrast, GMM only requires the moment conditions to be correct, making it more robust to partial misspecification. The sandwich variance estimator \(\hat{H}^{-1}\hat{\Sigma}\hat{H}^{-1}\) remains valid for MLE even under misspecification, which is why it is sometimes preferred in practice.
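As a simple illustration of the sandwich estimator (a sketch with simulated data, not an example from the text): estimate a mean by "MLE" under a normal likelihood with unit variance. When the data actually have variance 4, the information-based standard error \(1/\sqrt{N}\) is wrong, while the sandwich recovers the correct \(\mathrm{sd}(x)/\sqrt{N}\):

```julia
using Random, Statistics

Random.seed!(3)
N = 10_000
x = 2.0 .* randn(N)            # true variance is 4, not the model's 1

# "MLE" of μ under N(μ, 1): μ̂ is the sample mean; score s_n = x_n - μ
μ̂ = mean(x)
s = x .- μ̂
H = -1.0                       # Hessian of each log-likelihood contribution
Σ = mean(s .^ 2)               # variance of the score

se_info     = sqrt(-inv(H) / N)              # trusts the information equality
se_sandwich = sqrt(inv(H) * Σ * inv(H) / N)  # H⁻¹ Σ H⁻¹ / N, robust version

println("info-based se = ", round(se_info, digits=4),
        "   sandwich se = ", round(se_sandwich, digits=4))
```

Here the information-based standard error is 0.01 while the sandwich is close to 0.02, twice as large, because the model understates the dispersion of the data.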

Discussion: Pros and Cons of MLE

The single greatest strength of MLE is that it uses every piece of information in the data. If you know that your model is identified by the population distribution, you do not have to think about which features of the data to match in order for estimation to work well in practice. This is particularly useful when you want to deal with unobserved heterogeneity in the model, as we will see later on. MLE extracts the most information from panel data and uses it optimally.

The greatest weakness of MLE is also that it uses every piece of information in the data. You do not get to control which features of the data the model fits and which it misses; the likelihood decides. Given that you know your model is wrong, and that it is not designed to fit everything perfectly, MLE takes away your ability to govern this tradeoff. As a result, models estimated by MLE often fit the data poorly along the dimensions the researcher cares most about. This relates to the "whether" versus "how" of identification: the way you would most credibly map the data to the model (say, by fitting how the data respond to plausibly exogenous variation) may not coincide with what the MLE estimator does.

13.7 Two-Step Estimators

Many structural estimators proceed in stages. In Example 7.1, we first estimated the selection equation by MLE and then used these estimates in a second-stage OLS regression. In the search model, we might first estimate the wage distribution and then back out the reservation wage. The theory of two-step estimators formalizes the effect of first-stage estimation uncertainty on second-stage inference.

13.7.1 Setup

Suppose the estimator is defined by two sets of moment conditions:

  1. First step: Estimate \(\hat{\gamma}\) via \(\frac{1}{N}\sum_{n}g_{1}(\mathbf{w}_{n},\hat{\gamma})=\mathbf{0}\)
  2. Second step: Estimate \(\hat{\beta}\) via \(\frac{1}{N}\sum_{n}g_{2}(\mathbf{w}_{n},\hat{\gamma},\hat{\beta})=\mathbf{0}\)

The key feature is that the second step depends on the first-step estimates.

13.7.2 Asymptotic Distribution

To derive the joint distribution, stack the moment conditions. Let \(\alpha=(\gamma',\beta')'\) and write the full system as: \[\frac{1}{N}\sum_{n}\begin{bmatrix}g_{1}(\mathbf{w}_{n},\gamma)\\ g_{2}(\mathbf{w}_{n},\gamma,\beta)\end{bmatrix} = \mathbf{0}\]

The Jacobian of this system has a block-triangular structure: \[\Gamma = \begin{bmatrix}\Gamma_{1\gamma} & 0 \\ \Gamma_{2\gamma} & \Gamma_{2\beta}\end{bmatrix}\]

where \(\Gamma_{1\gamma}=\mathbb{E}[\nabla_{\gamma}g_{1}']\), \(\Gamma_{2\gamma}=\mathbb{E}[\nabla_{\gamma}g_{2}']\), and \(\Gamma_{2\beta}=\mathbb{E}[\nabla_{\beta}g_{2}']\). The zero in the upper-right block reflects the fact that the first step does not depend on \(\beta\).

Applying the standard GMM formula, the asymptotic variance of \(\hat{\beta}\) is:

\[V_{\beta} = \Gamma_{2\beta}^{-1}\mathbb{E}[(g_{2}-\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1})(g_{2}-\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1})']\Gamma_{2\beta}^{-1\prime}\]

The term \(\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1}\) captures the correction for first-stage estimation error. If we ignored this term and computed standard errors using only the second-stage moment conditions, we would generally get incorrect inference.
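The correction is easiest to see in a toy example (a sketch with simulated data, not from the text): first estimate \(\gamma_{0}=\mathbb{E}[x]\) from \(g_{1}=x-\gamma\), then estimate \(\beta_{0}=\mathbb{E}[(x-\gamma_{0})y]\) from \(g_{2}=(x-\gamma)y-\beta\). Here \(\Gamma_{1\gamma}=-1\), \(\Gamma_{2\gamma}=-\mathbb{E}[y]\), and \(\Gamma_{2\beta}=-1\), so the corrected influence function is \((x-\gamma_{0})(y-\mathbb{E}[y])-\beta_{0}\):

```julia
using Random, Statistics

Random.seed!(21)
N = 100_000
x = randn(N) .+ 1.0
y = 0.5 .* x .+ randn(N) .+ 3.0     # E[y] ≠ 0, so the correction matters

# Step 1: γ̂ solves (1/N) Σ (x_n - γ) = 0
γ̂ = mean(x)
# Step 2: β̂ solves (1/N) Σ ((x_n - γ̂) y_n - β) = 0
β̂ = mean((x .- γ̂) .* y)

# Naive SE: ignores first-stage noise, uses only g₂ at (γ̂, β̂)
g2 = (x .- γ̂) .* y .- β̂
se_naive = sqrt(mean(g2 .^ 2) / N)

# Corrected SE: subtract Γ₂γ Γ₁γ⁻¹ g₁ = E[y]·(x - γ̂) from g₂
ψ = (x .- γ̂) .* (y .- mean(y)) .- β̂
se_corr = sqrt(mean(ψ .^ 2) / N)

println("naive se = ", round(se_naive, digits=4),
        "   corrected se = ", round(se_corr, digits=4))
```

In this example the naive standard error overstates the uncertainty; in general the omitted correction can push inference in either direction.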

When Does the First Stage Not Matter?

There is an important special case: if \(\Gamma_{2\gamma}=\mathbb{E}[\nabla_{\gamma}g_{2}']=0\), then the correction vanishes and the first-stage estimation has no effect on the second-stage asymptotic variance. Intuitively, this happens when the second-step moment conditions are locally insensitive to the first-step parameters at the true values.

A classic example is the two-stage IV estimator where the first stage is a probit. Here, the second-stage moment condition takes the form \(\mathbb{E}[\Phi(\mathbf{x}'\gamma_{0})\cdot u]=0\). One can show that \(\mathbb{E}[\nabla_{\gamma}g_{2}']=0\) because the projection of \(u\) onto functions of the instruments is zero by construction.
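Another textbook case where the correction vanishes (a sketch, hypothetical setup) is estimating a variance with a pre-estimated mean: \(g_{1}=x-\gamma\) and \(g_{2}=(x-\gamma)^{2}-\beta\). Since \(\Gamma_{2\gamma}=-2\,\mathbb{E}[x-\gamma_{0}]=0\), the first stage is asymptotically irrelevant, and the sample analogue of \(\nabla_{\gamma}g_{2}\) is exactly zero at \(\hat{\gamma}\):

```julia
using Random, Statistics

Random.seed!(5)
N = 50_000
x = 3.0 .+ 2.0 .* randn(N)

γ̂ = mean(x)                        # step 1: estimate the mean
β̂ = mean((x .- γ̂) .^ 2)            # step 2: variance with estimated mean

# Sample analogue of Γ₂γ = E[∇_γ g₂] = E[-2(x - γ)]: exactly zero at γ̂,
# because γ̂ is the sample mean
Γ̂2γ = mean(-2 .* (x .- γ̂))
println("Γ̂₂γ = ", Γ̂2γ, "   β̂ = ", round(β̂, digits=3))
```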

Discussion: Bootstrap vs. Analytical Standard Errors

Computing analytical standard errors for two-step estimators can be tedious, as the example above illustrates. An attractive alternative is the bootstrap: resample the data, re-run the entire two-step procedure on each bootstrap sample, and compute standard errors from the distribution of bootstrap estimates. This automatically accounts for first-stage estimation uncertainty without requiring explicit computation of correction terms. We will discuss the bootstrap formally in the chapter on simulation methods.
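A minimal sketch of this idea (simulated data, hypothetical names), using the toy two-step estimator \(\gamma=\mathbb{E}[x]\), \(\beta=\mathbb{E}[(x-\gamma)y]\): resample observations with replacement, re-run both steps on each bootstrap sample, and take the standard deviation of the bootstrap estimates.

```julia
using Random, Statistics

Random.seed!(9)
N, B = 2_000, 400
x = randn(N) .+ 1.0
y = 0.5 .* x .+ randn(N) .+ 3.0

# The full two-step procedure wrapped as one function of the data
function two_step(x, y)
    γ̂ = mean(x)                     # step 1
    return mean((x .- γ̂) .* y)      # step 2
end

β̂ = two_step(x, y)

# Nonparametric bootstrap: resample (x, y) pairs jointly, redo both steps
β_boot = map(1:B) do _
    idx = rand(1:N, N)
    two_step(x[idx], y[idx])
end
se_boot = std(β_boot)
println("β̂ = ", round(β̂, digits=3), "   bootstrap se = ", round(se_boot, digits=4))
```

Because each bootstrap replication re-runs the first step, the resulting standard error automatically includes the \(\Gamma_{2\gamma}\Gamma_{1\gamma}^{-1}g_{1}\) correction term.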

13.8 Exercises


Exercise 13.1 Extend Example 12.1 to:

  1. Compute asymptotic standard errors for the estimated parameters \(\hat{\theta}=(h,\delta,\mu,\sigma,w^{*})\) using the information matrix.
  2. Use the delta method to compute standard errors for the derived estimates \(\hat{\lambda}\) and \(\hat{b}\).
  3. Estimate the search model separately for men with and without a bachelor’s degree.
  4. Report and compare your estimates. Are the differences across education groups economically meaningful? What does the model imply about the sources of wage differences between these groups?

Exercise 13.2 Blundell, Pistaferri, and Preston (2008) estimate income and consumption dynamics using a model where: \[\Delta y_{n,t} = \zeta_{n,t} + \Delta\xi_{n,t}\] where \(\zeta_{n,t}\) is a permanent shock and \(\xi_{n,t}\) is a transitory shock. Consumption responds differently to each: \[\Delta c_{n,t} = \phi\zeta_{n,t} + \psi\xi_{n,t} + \epsilon_{n,t}\] where \(\phi\) and \(\psi\) capture the “insurance coefficients” — the degree to which consumption responds to permanent and transitory income shocks, respectively.

  1. Show that from the second moments of \((\Delta y, \Delta c)\), one can identify \(\sigma^{2}_{\zeta}\), \(\sigma^{2}_{\xi}\), \(\phi\), and \(\psi\). Hint: consider \(\mathbb{V}[\Delta y_{t}]\), \(\mathbb{C}(\Delta y_{t},\Delta y_{t-1})\), \(\mathbb{C}(\Delta c_{t},\Delta y_{t})\), and \(\mathbb{C}(\Delta c_{t},\Delta y_{t-1})\).
  2. Using the data from Example 10.1, implement a minimum distance estimator for \((\sigma^{2}_{\zeta},\sigma^{2}_{\xi},\phi,\psi)\) and report standard errors.
  3. What do the estimates of \(\phi\) and \(\psi\) tell us about how well households insure against permanent versus transitory income shocks?

Exercise 13.3 Consider estimation of the entry-exit model by minimum distance, as described in the introduction. Recall that for each state \((x,a,a')\), the model implies a choice probability \(p(x,a,a';\phi,\beta)\) that depends on the payoff parameters through the equilibrium of the game. Data consist of entry decisions across many independent markets.

  1. Simulate data from the entry-exit model for \(M=500\) markets at 5 equally spaced values of \(x\) and use the empirical choice frequencies as your target moments \(\hat{\mathbf{p}}\).
  2. Implement the minimum distance estimator: \[\hat{\phi} = \arg\min_{\phi}(\hat{\mathbf{p}}-\mathbf{p}(\phi,\beta))'\mathbf{W}(\hat{\mathbf{p}}-\mathbf{p}(\phi,\beta))\] using the identity weighting matrix. Note: for each candidate \(\phi\), you must solve for the equilibrium to compute \(\mathbf{p}(\phi,\beta)\).
  3. Compute standard errors for \(\hat{\phi}\) using the minimum distance variance formula from Theorem 13.7.
  4. Re-estimate with the optimal weighting matrix and compare your standard errors.