14 Simulation Methods

In many structural models, the likelihood or moment conditions cannot be computed in closed form because they involve high-dimensional integrals over unobserved heterogeneity, latent states, or expectations over future shocks. The savings model is a good example: even if we knew the true parameters, computing the likelihood of an observed consumption path requires integrating over sequences of income shocks. In these settings, simulation provides a way forward.

Example: Calculating Mean Assets

Example 14.1 Suppose you are estimating the savings model by matching mean assets at different ages over the life-cycle. Note that \(A_{t}\) depends on the sequence of realizations of the income shock \(\varepsilon_{s}\) for all perios \(s<t\). Assuming a discrete approximation to the income process, the numerical equivalent to the moment is:

\[ \mathbb{E}[A_{t}] = \sum_{\varepsilon_{1}^{t-1}\in\mathcal{E}^{t-1}}\pi(\varepsilon_1)\prod_{s=2}^{t=1}\pi(\varepsilon_{s}|\varepsilon_{s-1})A^*(\varepsilon_1,\varepsilon_2,...,\varepsilon_{t-1}) \]

where \(A^*\) indicates accumulated savings for the optimal savings policy under a particular guess of the model’s parameters. Notice that if we assume \(K_\varepsilon\) grid points for \(\varepsilon\), the number of sequences \(\varepsilon_{1}^{t-1}\) that we must sum over is \(K_{\varepsilon}^{t-1}\). For larger \(t\), this will be intractably large.

This chapter covers three topics. First, we discuss simulation-based estimators — the method of simulated moments (MSM) and indirect inference — which replace intractable analytical objects with simulation approximations. Second, we introduce the bootstrap, a general resampling-based approach to inference that is particularly useful when analytical standard errors are difficult to derive. Third, we bring these tools together with our prototype models.

14.1 Integration by Simulation

At the heart of simulation-based estimation is the problem of computing integrals and high-dimensional sums. Suppose we need to evaluate:

\[f(y|\mathbf{x},\theta) = \int h(y|\mathbf{x},\theta,\mathbf{u})g(\mathbf{u})d\mathbf{u}\]

where \(\mathbf{u}\) is a vector of unobservables with density \(g\). If we want to evaluate this integral with quadrature methods (Section 2.5.1) then we face a curse of dimensionality. When the number of dimensions in \(\mathbf{u}\) is large then the problem is simply intractable. One solution is to approximate by Monte Carlo integration: draw \(\mathbf{u}^{1},\ldots,\mathbf{u}^{R}\) from \(g\) and compute:

\[\hat{f}(y|\mathbf{x},\theta) = \frac{1}{R}\sum_{r=1}^{R}h(y|\mathbf{x},\theta,\mathbf{u}^{r})\]

By the law of large numbers, \(\hat{f}\rightarrow_{p} f\) as \(R\rightarrow\infty\). This is unbiased by construction: \(\mathbb{E}_{\mathbf{u}}[\hat{f}]=f\).

14.1.1 Importance Sampling

Sometimes we want to draw from an alternative density \(w(\mathbf{u})\) instead of \(g(\mathbf{u})\) — for instance because \(g\) is hard to sample from, or because draws from \(w\) reduce the variance of the estimator. In this case:

\[\hat{f}(y|\mathbf{x},\theta) = \frac{1}{R}\sum_{r=1}^{R}h(y|\mathbf{x},\theta,\mathbf{u}^{r})\frac{g(\mathbf{u}^{r})}{w(\mathbf{u}^{r})},\qquad \mathbf{u}^{r}\sim w\]

This is still unbiased and consistent. The ratio \(g/w\) is called the importance weight.

Variance reduction. Why can the choice of \(w\) reduce variance? The variance of the importance-sampling estimator depends on the variance of \(h \cdot g/w\) under draws from \(w\). Intuitively, if \(w\) concentrates draws in regions where the integrand \(h \cdot g\) is large, the estimator “wastes” fewer draws on regions that contribute little to the integral, reducing variance. The ideal importance density is proportional to \(|h| \cdot g\), which puts the most weight exactly where it matters most. In practice, one looks for a convenient \(w\) that roughly approximates this shape.

Example: Variance Reduction via Importance Sampling

Example 14.2 To illustrate, consider estimating a tail probability: \(P(Z > 3)\) where \(Z\sim\mathcal{N}(0,1)\). The true value is approximately \(0.00135\).

Naive Monte Carlo draws \(Z^{1},\ldots,Z^{R}\sim\mathcal{N}(0,1)\) and computes \(\hat{p}=\frac{1}{R}\sum_{r}\mathbf{1}\{Z^{r}>3\}\). Because the event \(\{Z>3\}\) is rare, most draws contribute zero to the sum — the estimator “wastes” draws in the center of the distribution where \(\mathbf{1}\{Z>3\}\cdot\phi(Z)\) is negligible.

Importance sampling instead draws from \(w=\mathcal{N}(3,1)\), which concentrates mass in the tail where the integrand is large. The estimator becomes: \[\hat{p}_{IS}=\frac{1}{R}\sum_{r}\mathbf{1}\{Z^{r}>3\}\frac{\phi(Z^{r})}{\phi_{w}(Z^{r})},\qquad Z^{r}\sim\mathcal{N}(3,1)\]

We compare both estimators across 2,000 independent replications, each using \(R=2{,}000\) draws.

using Distributions, Random, Plots, Statistics

Random.seed!(42)
c = 3.0
p_true = 1 - cdf(Normal(), c)

R = 2000; n_reps = 2000
g = Normal(0, 1)      # target distribution
w = Normal(c, 1)       # importance distribution (shifted to the tail)

# Naive MC: draw from g, average the indicators
naive_estimates = [mean(rand(g, R) .> c) for _ in 1:n_reps]

# Importance sampling: draw from w, reweight
is_estimates = map(1:n_reps) do _
    z = rand(w, R)
    weights = pdf.(g, z) ./ pdf.(w, z)
    mean((z .> c) .* weights)
end

println("True probability: P(Z > $c) = $(round(p_true, sigdigits=5))")
println("Naive MC:       mean = $(round(mean(naive_estimates), sigdigits=5)),  std = $(round(std(naive_estimates), sigdigits=3))")
println("Importance MC:  mean = $(round(mean(is_estimates), sigdigits=5)),  std = $(round(std(is_estimates), sigdigits=3))")
println("Variance ratio (naive / IS): $(round(var(naive_estimates) / var(is_estimates), digits=1))×")

True probability: P(Z > 3.0) = 0.0013499
Naive MC:       mean = 0.0013565,  std = 0.000817
Importance MC:  mean = 0.0013504,  std = 5.47e-5
Variance ratio (naive / IS): 223.3×

p1 = histogram(naive_estimates, normalize=:pdf, alpha=0.6, label=false,
               color=:steelblue, xlabel="Estimate of P(Z > $c)", title="Naive MC")
vline!(p1, [p_true], color=:black, linewidth=2, linestyle=:dash, label="True value")

p2 = histogram(is_estimates, normalize=:pdf, alpha=0.6, label=false,
               color=:firebrick, xlabel="Estimate of P(Z > $c)", title="Importance Sampling")
vline!(p2, [p_true], color=:black, linewidth=2, linestyle=:dash, label="True value")

plot(p1, p2, layout=(1,2), size=(800, 350))

The contrast is stark. The naive estimator is highly discrete — it can only take values that are multiples of \(1/R\) — and its distribution is spread widely around the truth. The importance sampling estimator is continuous (because the importance weights are continuous) and tightly concentrated. Both are unbiased, but importance sampling achieves a dramatic reduction in variance by directing draws to the region that matters.

Differentiability. Importance sampling also plays a role in making simulated integrals differentiable with respect to the parameters \(\theta\). When the integrand involves indicator functions — as in discrete choice models where \(h(y|\mathbf{x},\theta,\mathbf{u})=\mathbf{1}\{\cdot\}\) — the crude Monte Carlo approximation is a step function in \(\theta\), which creates problems for gradient-based optimization. Through a change of variables, it is sometimes possible to move the dependence on \(\theta\) out of the indicator and into smooth importance weights, yielding a simulated integral that is differentiable in \(\theta\). The GHK simulator for multivariate probit models is a prominent example of this strategy (see Section 14.2.1 below).

14.2 Maximum Simulated Likelihood

A natural first idea is to replace the likelihood with its simulated analogue. The maximum simulated likelihood (MSL) estimator is:

\[\hat{\theta}_{MSL} = \arg\max_{\theta}\sum_{n=1}^{N}\log\hat{f}(y_{n}|\mathbf{x}_{n},\theta)\]

where \(\hat{f}\) is the simulated probability from above.

A Subtlety with MSL

There is an important issue with MSL: because \(\log\) is concave, Jensen’s inequality gives \(\mathbb{E}[\log\hat{f}]\leq \log\mathbb{E}[\hat{f}]=\log f\). This means the simulated log-likelihood is biased downward, and the MSL estimator is inconsistent for fixed \(R\). Consistency requires that the number of simulation draws grows with the sample: both \(N,R\rightarrow\infty\) with \(\sqrt{N}/R\rightarrow 0\).

This is in contrast to the method of simulated moments, which we discuss next, where a fixed number of simulation draws is sufficient for consistency.

Under standard regularity conditions and \(\sqrt{N}/R\rightarrow 0\), the MSL estimator has the same asymptotic distribution as the (infeasible) MLE:

\[\sqrt{N}(\hat{\theta}_{MSL}-\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\ \mathcal{I}(\theta_{0})^{-1})\]

In practice, one should use enough draws that the simulation error is small relative to the sampling variability. A common rule of thumb is to set \(R\) large enough that results are insensitive to further increases.

14.2.1 Differentiability of the Simulated Likelihood

Warning

A separate challenge with MSL arises when the model involves discrete outcomes. Consider a simple probit: \(Y=\mathbf{1}\{X\beta+\varepsilon>0\}\) with \(\varepsilon\sim\mathcal{N}(0,1)\). The true choice probability \(\Phi(X\beta)\) is a smooth function of \(\beta\). But if we approximate it by crude Monte Carlo — drawing \(\varepsilon^{1},\ldots,\varepsilon^{R}\sim\mathcal{N}(0,1)\) and computing \(\hat{P}=\frac{1}{R}\sum_{r}\mathbf{1}\{X\beta+\varepsilon^{r}>0\}\) — the result is a step function in \(\beta\). Each indicator \(\mathbf{1}\{X\beta+\varepsilon^{r}>0\}\) switches from 0 to 1 at \(\beta=-\varepsilon^{r}/X\), so \(\hat{P}\) jumps at finitely many points and is flat everywhere else. Its derivative is zero almost everywhere, making gradient-based optimization of the simulated log-likelihood impossible.

The figure below illustrates this for \(X=1\) and \(R=20\) draws. The true probability \(\Phi(\beta)\) is smooth, but the simulated probability is a step function that jumps at each \(\beta=-\varepsilon^{r}\). The kernel-smoothed simulator replaces the indicator with \(\Phi(\cdot/h)\), restoring differentiability at the cost of a small bias.

using Distributions, Plots, Random

Random.seed!(123)
R = 20
ε = rand(Normal(), R)

β_grid = range(-3, 3, length=1000)
P_true = cdf.(Normal(), β_grid)
P_sim = [mean(β .+ ε .> 0) for β in β_grid]
h = 0.3
P_smooth = [mean(cdf.(Normal(), (β .+ ε) ./ h)) for β in β_grid]

plot(β_grid, P_true, label="True Φ(β)", linewidth=2.5, color=:black,
     xlabel="β", ylabel="P(Y=1 | X=1)")
plot!(β_grid, P_sim, label="Crude MC (R=$R)", linewidth=1.5,
     color=:steelblue, alpha=0.8)
plot!(β_grid, P_smooth, label="Kernel-smoothed (h=$h)", linewidth=2,
     color=:firebrick, linestyle=:dash)

This problem is not specific to probit — it arises whenever the simulated likelihood involves averaging indicator functions. Solutions include:

Kernel smoothing: replace the indicator \(\mathbf{1}\{\cdot>0\}\) with a smooth approximation such as \(\Phi(\cdot/h)\) for a small bandwidth \(h\) (the dashed red curve above).
Importance sampling / change of variables: restructure the integral so that \(\beta\) appears in smooth weights rather than inside an indicator. The GHK simulator for multivariate probit uses sequential conditioning to achieve this.
Derivative-free optimization: use methods like Nelder-Mead that do not require gradients. This is simple but can be slow and unreliable in high dimensions.

14.3 Method of Simulated Moments

The method of simulated moments (MSM) is the simulation analogue of GMM. Suppose that the moment conditions take the form:

\[\mathbb{E}[g(\mathbf{w},\theta_{0})] = \mathbb{E}\left[\int h(\mathbf{w},\theta_{0},\mathbf{u})f(\mathbf{u}|\theta_{0})d\mathbf{u}\right] = \mathbf{0}\]

If we cannot evaluate \(g\) analytically, we can replace it with a simulated version: \[\hat{g}(\mathbf{w}_{n},\theta) = \frac{1}{R}\sum_{r=1}^{R}\tilde{g}(\mathbf{w}_{n},\mathbf{u}^{r},\theta)\]

where \(\mathbb{E}_{\mathbf{u}}[\tilde{g}(\mathbf{w},\mathbf{u},\theta)]=g(\mathbf{w},\theta)\). The MSM estimator is:

\[\hat{\theta}_{MSM} = \arg\min_{\theta}\ \hat{\mathbf{g}}_{N}(\theta)'\mathbf{W}\hat{\mathbf{g}}_{N}(\theta),\qquad\hat{\mathbf{g}}_{N}(\theta) = \frac{1}{N}\sum_{n=1}^{N}\hat{g}(\mathbf{w}_{n},\theta)\]

14.3.1 A Common Special Case

In many structural applications, the moment conditions do not involve observation-level simulation. Instead, we match aggregate statistics from the data to their model-implied counterparts computed by simulating the model. The moments take the form:

\[\hat{\mathbf{g}}_{N}(\theta) = \hat{\mathbf{m}} - \tilde{\mathbf{m}}(\theta)\]

where \(\hat{\mathbf{m}}\) is a vector of sample statistics (means, variances, covariances, etc.) and \(\tilde{\mathbf{m}}(\theta)\) is the corresponding vector computed from simulated data. This is the structure we will use for the savings model below.

14.3.2 Asymptotic Distribution

Theorem 14.1 (Asymptotic Distribution of MSM) Suppose the standard regularity conditions for GMM hold and \(\mathbb{E}_{\mathbf{u}}[\tilde{g}(\mathbf{w},\mathbf{u},\theta)]=g(\mathbf{w},\theta)\). Then with \(R\) fixed:

\[\sqrt{N}(\hat{\theta}_{MSM}-\theta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Sigma_{MSM})\]

where: \[\Sigma_{MSM} = (\nabla_{\theta}g_{0}\mathbf{W}\nabla_{\theta}g_{0}')^{-1}\nabla_{\theta}g_{0}\mathbf{W}\mathbb{E}[\hat{g}_{0}\hat{g}_{0}']\mathbf{W}\nabla_{\theta}g_{0}'(\nabla_{\theta}g_{0}\mathbf{W}\nabla_{\theta}g_{0}')^{-1}\]

with \(\nabla_{\theta}g_{0}=\mathbb{E}[\nabla_{\theta}g(\mathbf{w},\theta_{0})']\).

Several features are worth highlighting:

Consistency does not require \(R\rightarrow\infty\). Because \(\hat{g}\) is an unbiased estimator of \(g\), the simulated sample moments converge to zero at \(\theta_{0}\) by the law of large numbers. This is in contrast to MSL.
Simulation adds variance. Because \(\mathbb{V}[\hat{g}]\geq\mathbb{V}[g]\), the MSM estimator is less efficient than the infeasible GMM estimator that uses the true \(g\). For the common case of a frequency simulator (where \(\tilde{g}\) is a binary indicator), the variance inflates by a factor of \((1+1/R)\).
If \(\sqrt{N}/R\rightarrow 0\), the efficiency loss from simulation disappears and the MSM achieves the same asymptotic variance as GMM.

See Newey and McFadden (1994) for a comprehensive treatment.

Example: Simulating the Savings Model

Example 14.3 To estimate the savings model by MSM, we first need to be able to simulate data from it. Given a solution (the policy function \(A\)), we can forward-simulate a panel of individuals by drawing income shocks and applying the optimal savings rule.

using Distributions, Random, LinearAlgebra, Statistics

# === Model setup (from savings model chapter) ===
Φ_cdf(x) = cdf(Normal(),x)

function tauchen(ρ,ση,Kϵ)
    sd = ση/sqrt(1-ρ^2)
    grid = range(-3sd,stop=3sd,length=Kϵ)
    Π = zeros(Kϵ,Kϵ)
    Δ = grid[2]-grid[1]
    for j=1:Kϵ
        Π[1,j] = Φ_cdf((grid[1] + Δ/2 - ρ*grid[j])/ση)
        Π[end,j] = 1 - Φ_cdf((grid[end] - Δ/2 - ρ*grid[j])/ση)
        for k=2:(Kϵ-1)
            Π[k,j] = Φ_cdf((grid[k] + Δ/2 - ρ*grid[j])/ση) - Φ_cdf((grid[k] - Δ/2 - ρ*grid[j])/ση)
        end
    end
    return Π,collect(grid)
end

u(c,σ) = c^(1-σ) / (1-σ)

function solve_max(V,t,iϵ,ia,pars)
    (;agrid,ϵgrid,Π,σ,Ka,r,β) = pars
    cash = exp(pars.μ[t] + ϵgrid[iϵ]) + agrid[ia]
    amax = 1
    vmax = -Inf
    loop = true
    a = 1
    while loop && a<Ka
        c = cash - agrid[a] / (1+r)
        if c>0
            v = u(c,σ)
            for iϵ′ in axes(V,1)
                v += β * Π[iϵ′,iϵ] * V[iϵ′,a,t+1]
            end
            if v>vmax
                vmax = v
                amax = a
            end
        else
            loop = false
        end
        a += 1
    end
    return amax,vmax
end

function iterate!(V,A,t,pars)
    for ia in axes(V,2), iϵ in axes(V,1)
        A[iϵ,ia,t],V[iϵ,ia,t] = solve_max(V,t,iϵ,ia,pars)
    end
end

function terminal_values!(V,pars)
    (;σ,ψ,agrid) = pars
    for ia in axes(V,2), iϵ in axes(V,1)
        V[iϵ,ia] = ψ > 0 ? ψ * u(max(agrid[ia], 1e-10), σ) : 0.0
    end
end

function backward_induction!(V,A,pars)
    (;T) = pars
    @views terminal_values!(V[:,:,T+1],pars)
    for t in reverse(1:T)
        iterate!(V,A,t,pars)
    end
end

function setup_and_solve(θ)
    β, σ, ψ = θ
    T = 45
    Ka = 100
    Kϵ = 5
    ρ = 0.9
    ση = 0.1
    r = 0.05
    μ = fill(2., T)
    agrid = collect(LinRange(0, μ[1] * T, Ka))
    Π, ϵgrid = tauchen(ρ, ση, Kϵ)
    pars = (;T, β, σ, ρ, ση, μ, ψ, r, Ka, Kϵ, agrid, Π, ϵgrid)
    V = zeros(Kϵ, Ka, T+1)
    A = zeros(Int64, Kϵ, Ka, T)
    backward_induction!(V, A, pars)
    return A, pars
end

setup_and_solve (generic function with 1 method)

Now we write a simulation function. Given the policy function, we draw income shocks and track each individual’s asset and income path.

function simulate_panel(A, pars, N_sim; seed=nothing)
    (;T, Kϵ, Ka, agrid, ϵgrid, Π, μ, r) = pars
    if !isnothing(seed)
        Random.seed!(seed)
    end

    # Pre-compute CDF for sampling from Markov chain
    cumΠ = cumsum(Π, dims=1)

    # Storage
    assets = zeros(N_sim, T+1)   # assets at start of each period
    income = zeros(N_sim, T)
    consumption = zeros(N_sim, T)
    iϵ_sim = zeros(Int, N_sim, T)

    # Initial conditions: start with zero assets, draw initial ϵ from stationary dist
    # Use middle state as starting point for simplicity
    iϵ_sim[:, 1] .= (Kϵ + 1) ÷ 2

    for n in 1:N_sim
        ia = 1  # start with zero assets
        for t in 1:T
            iϵ = iϵ_sim[n, t]
            y = exp(μ[t] + ϵgrid[iϵ])
            income[n, t] = y
            cash = y + agrid[ia]

            # Look up optimal savings
            ia_next = A[iϵ, ia, t]
            c = cash - agrid[ia_next] / (1 + r)
            consumption[n, t] = c
            assets[n, t+1] = agrid[ia_next]
            ia = ia_next

            # Draw next period's ϵ
            if t < T
                u_draw = rand()
                iϵ_next = findfirst(cumΠ[:, iϵ] .>= u_draw)
                iϵ_sim[n, t+1] = iϵ_next
            end
        end
    end
    return (;assets, income, consumption)
end

# Solve and simulate
A, pars = setup_and_solve((0.95, 2.0, 0.0))
sim = simulate_panel(A, pars, 1000; seed=123)

(assets = [0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.9090909090909092 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.9090909090909092 0.0], income = [7.38905609893065 7.38905609893065 … 5.2376681994924175 5.2376681994924175; 7.38905609893065 7.38905609893065 … 7.38905609893065 7.38905609893065; … ; 7.38905609893065 7.38905609893065 … 7.38905609893065 7.38905609893065; 7.38905609893065 7.38905609893065 … 7.38905609893065 7.38905609893065], consumption = [7.38905609893065 7.38905609893065 … 5.2376681994924175 5.2376681994924175; 7.38905609893065 7.38905609893065 … 8.341437051311603 8.298147008021559; … ; 7.38905609893065 7.38905609893065 … 7.38905609893065 7.38905609893065; 7.38905609893065 7.38905609893065 … 7.432346142220693 8.298147008021559])

Let’s check the simulation by plotting average assets and consumption over the life cycle.

using Plots
p1 = plot(1:pars.T, mean(sim.assets[:,1:pars.T], dims=1)', label="Mean Assets",
          xlabel="Age", title="Life-Cycle Profiles")
plot!(p1, 1:pars.T, mean(sim.consumption, dims=1)', label="Mean Consumption")
plot!(p1, 1:pars.T, mean(sim.income, dims=1)', label="Mean Income")

The hump-shaped asset profile and the smooth consumption path are consistent with the model’s predictions: individuals accumulate assets during working years as a buffer against income risk and draw them down toward the end of life. With no bequest motive (\(\psi=0\)), assets are fully decumulated by the final period.

Example: Simulated Moments for the Savings Model

Example 14.4 Now let’s use the simulation machinery to define a set of moments and construct an MSM estimator for the preference parameters \((\beta,\sigma)\), setting \(\psi=0\) (no bequest motive). We take the income process parameters \((\rho,\sigma_{\eta})\) as given (estimated in Example 12.2).

Our target moments will be:

Mean assets at ages 30, 40, 50, 60 (to pin down the savings profile)
Variance of log consumption at ages 30, 40, 50, 60 (to pin down risk aversion)

# Compute moments from simulated data
function simulated_moments(θ; N_sim=2000, seed=42)
    A, pars = setup_and_solve(θ)
    sim = simulate_panel(A, pars, N_sim; seed=seed)
    target_ages = [6, 16, 26, 36]  # ages 30, 40, 50, 60 (relative to t=1 at age 25)
    m_assets = [mean(sim.assets[:, t]) for t in target_ages]
    m_var_logc = [var(log.(sim.consumption[:, t])) for t in target_ages]
    return vcat(m_assets, m_var_logc)
end

# "True" parameters and data moments (ψ=0: no bequest motive)
θ_true = (0.95, 2.0, 0.0)
m_data = simulated_moments(θ_true; N_sim=2_000, seed=1)

# MSM objective: estimate (β, σ) with ψ fixed at 0
function msm_objective(x, m_data)
    β = 0.8 + 0.19 * (1 / (1 + exp(-x[1])))  # map to (0.8, 0.99)
    σ = exp(x[2])
    θ = (β, σ, 0.0)
    m_sim = simulated_moments(θ; N_sim=5_000)
    diff = m_data .- m_sim
    return diff' * diff
end

msm_objective (generic function with 1 method)

This estimator is computationally intensive — each evaluation of the objective requires solving the dynamic programming problem and simulating the model.

using Optim
x0 = [0.0, log(2.0)]
res = optimize(x -> msm_objective(x, m_data), x0, NelderMead(),
               Optim.Options(iterations=200))
x_hat = res.minimizer
β_hat = 0.8 + 0.19 / (1 + exp(-x_hat[1]))
σ_hat = exp(x_hat[2])
println("MSM Estimates vs Truth:")
println("  β: $(round(β_hat, digits=4))  (true: 0.95)")
println("  σ: $(round(σ_hat, digits=4))  (true: 2.0)")
println("  Objective value: $(round(res.minimum, digits=6))")

MSM Estimates vs Truth:
  β: 0.897  (true: 0.95)
  σ: 5.8734  (true: 2.0)
  Objective value: 0.099968

Let’s check the model fit by simulating at both the true and estimated parameters.

# Simulate at estimated and true parameters
θ_hat_msm = (β_hat, σ_hat, 0.0)
A_hat, pars_hat = setup_and_solve(θ_hat_msm)
sim_hat = simulate_panel(A_hat, pars_hat, 5_000)

A_true, pars_true = setup_and_solve(θ_true)
sim_true = simulate_panel(A_true, pars_true, 2_000; seed=1)

T = pars_true.T
ages = 25 .+ (1:T)
target_ages_actual = [30, 40, 50, 60]
target_periods = target_ages_actual .- 24  # periods corresponding to target ages

# Mean assets over the life cycle
p1 = plot(ages, mean(sim_true.assets[:, 1:T], dims=1)', label="Data (true θ)",
          linewidth=2, color=:black, xlabel="Age", ylabel="Mean Assets",
          title="Mean Assets")
plot!(p1, ages, mean(sim_hat.assets[:, 1:T], dims=1)', label="Estimated θ",
      linewidth=2, color=:steelblue, linestyle=:dash)
vline!(p1, target_ages_actual, color=:gray, alpha=0.3, linestyle=:dot, label=false)

# Variance of log consumption over the life cycle
var_logc_true = [var(log.(sim_true.consumption[:, t])) for t in 1:T]
var_logc_hat = [var(log.(sim_hat.consumption[:, t])) for t in 1:T]

p2 = plot(ages, var_logc_true, label="Data (true θ)",
          linewidth=2, color=:black, xlabel="Age", ylabel="Var(log c)",
          title="Variance of Log Consumption")
plot!(p2, ages, var_logc_hat, label="Estimated θ",
      linewidth=2, color=:steelblue, linestyle=:dash)
vline!(p2, target_ages_actual, color=:gray, alpha=0.3, linestyle=:dot, label=false)

plot(p1, p2, layout=(1,2), size=(800, 350))

The discrete jumps visible in the estimated profiles are a consequence of the discrete asset grid — the policy function maps to grid points, so small parameter changes can cause the optimal grid point to jump (see Section 14.2.1).

Non-Differentiability of Simulated Moments

Notice that we used the Nelder-Mead algorithm above — a derivative-free optimizer. This is not a coincidence. In the savings model, the policy function \(A(\varepsilon,a,t)\) maps to a discrete asset grid, so the simulated consumption and asset paths are step functions of the parameters \(\theta\). As \(\theta\) changes continuously, the optimal grid point occasionally jumps, producing discontinuities in the simulated moments. This is the same non-differentiability problem discussed in Section 14.2.1, now appearing in the MSM context.

Gradient-based optimizers will fail here because the numerical gradient is zero almost everywhere and undefined at the jumps. Derivative-free methods like Nelder-Mead avoid this issue entirely. Alternatives include using a finer asset grid (which reduces the size of the jumps), interpolating the policy function between grid points to smooth the simulated moments, or using kernel-smoothed simulators.

Fixing Simulation Error

Notice in Example 14.4 that we fixed the seed of the simulation when we call the function simulate_panel. What do you think would happen if we failed to do this inside our optimization routine?

14.4 Indirect Inference

Indirect inference is a simulation-based estimator that works by fitting an auxiliary model to both the real data and to data simulated from the structural model, and then finding the structural parameters that make the two sets of auxiliary estimates as close as possible.

14.4.1 Setup

Let \(\hat{\beta}_{N}\) be a vector of statistics estimated from the data (the “auxiliary parameters”). For example, \(\hat{\beta}\) might be OLS coefficients from a wage regression, or the parameters of a reduced-form probit. The key requirement is that \(\hat{\beta}\) is asymptotically normal:

\[\sqrt{N}(\hat{\beta}_{N}-\beta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\]

Now, for each candidate \(\theta\), simulate data from the structural model and compute the same statistic on the simulated data, yielding \(\hat{\beta}^{R}(\theta)\). Let \(\beta(\theta)=\text{plim}\ \hat{\beta}^{R}(\theta)\) denote the “pseudo-true” value of the auxiliary parameter implied by \(\theta\). The indirect inference estimator is:

\[\hat{\theta}_{II} = \arg\min_{\theta}\ (\hat{\beta}_{N}-\hat{\beta}^{R}(\theta))'\mathbf{W}(\hat{\beta}_{N}-\hat{\beta}^{R}(\theta))\]

14.4.2 Asymptotic Distribution

Theorem 14.2 (Asymptotic Distribution of Indirect Inference) Suppose:

\(\sqrt{N}(\hat{\beta}_{N}-\beta_{0})\rightarrow_{d}\mathcal{N}(\mathbf{0},\Omega)\)
\(R,N\rightarrow\infty\) with \(\sqrt{N}/R\rightarrow 0\)
\(\theta_{0}\) is the unique solution to \(\beta(\theta)=\beta_{0}\) (identification)

Then: \[\sqrt{N}(\hat{\theta}_{II}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(\mathbf{0},\ (\nabla_{\theta}\beta_{0}\mathbf{W}\nabla_{\theta}\beta_{0}')^{-1}\nabla_{\theta}\beta_{0}\mathbf{W}\Omega\mathbf{W}\nabla_{\theta}\beta_{0}'(\nabla_{\theta}\beta_{0}\mathbf{W}\nabla_{\theta}\beta_{0}')^{-1}\right)\]

where \(\nabla_{\theta}\beta_{0}=\frac{\partial\beta(\theta_{0})'}{\partial\theta}\).

The variance formula has the same sandwich structure as minimum distance estimation (Theorem 13.7), which makes sense: indirect inference is a simulated minimum distance estimator where the reduced-form statistics play the role of \(\pi\).

14.4.3 Relationship to MSM

Indirect inference nests MSM as a special case. If the auxiliary model consists of sample moments \(\hat{\beta}=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n})\) and the simulated counterpart is \(\hat{\beta}^{R}(\theta)=\frac{1}{NR}\sum_{r}\sum_{n}g(\mathbf{w}_{n}^{r}(\theta))\), then the indirect inference estimator reduces to MSM. The appeal of indirect inference is that it allows the use of richer auxiliary models — like regressions or multinomial choice models — that can capture complex features of the data.

Discussion: Choosing Auxiliary Statistics

The choice of auxiliary model is both an art and a science. A good auxiliary model should:

Be sensitive to the structural parameters of interest (for efficiency).
Capture the key features of the data that the structural model is designed to explain.
Be computationally cheap to estimate (since it must be re-estimated on each simulated dataset).

In practice, it is useful to think of the auxiliary model as a diagnostic: if the structural model cannot match the auxiliary estimates, it suggests the model is missing something important. This perspective connects indirect inference to the broader goal of model evaluation.

14.5 The Bootstrap

We now turn to a fundamentally different use of simulation: not for evaluating the objective function, but for inference. The bootstrap uses resampling to approximate the sampling distribution of an estimator, providing standard errors and confidence intervals without requiring analytical derivations.

14.5.1 Motivation

In many settings we face one or more of the following challenges:

The variance formula is difficult to derive (e.g. for multi-step estimators).
The derivatives of the criterion function are hard or costly to compute.
The estimation criterion is non-smooth in finite samples.
The sample size is too small for the CLT to provide a good approximation.

The bootstrap addresses all of these by directly approximating the finite-sample distribution.

14.5.2 Framework

Let \(T_{N} = T_{N}(X_{1},\ldots,X_{N})\) be a statistic — for instance, \(T_{N}=\hat{\theta}\) or \(T_{N}=\sqrt{N}(\hat{\theta}-\theta_{0})\). The finite-sample distribution \(G_{N}(\cdot,F_{0})\) is typically unknown. The bootstrap idea is to replace \(F_{0}\) with an estimate \(\hat{F}_{N}\) and use \(G_{N}(\cdot,\hat{F}_{N})\) to approximate \(G_{N}(\cdot,F_{0})\).

There are two main approaches to constructing \(\hat{F}_{N}\):

Definition 14.1 (Nonparametric Bootstrap) Replace \(F_{0}\) with the empirical distribution function: \[\hat{F}_{N}(x) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{1}\{X_{n}\leq x\}\] Bootstrap samples are obtained by sampling \(N\) observations with replacement from the original data.

Definition 14.2 (Parametric Bootstrap) Replace \(F_{0}\) with \(F(\cdot,\hat{\theta}_{N})\), the model-implied distribution evaluated at the estimated parameters. Bootstrap samples are obtained by simulating from the estimated model.

The nonparametric bootstrap is more general and requires fewer assumptions, while the parametric bootstrap can be more efficient when the model is correctly specified. For structural models, the parametric bootstrap is natural since we already have the machinery to simulate from the model.

14.5.3 Procedure

Given an estimator \(\hat{\theta}_{N}\):

Draw a bootstrap sample of size \(N\): \(\mathbf{X}^{b}=\{X_{1}^{b},\ldots,X_{N}^{b}\}\)
- Nonparametric: sample with replacement from \(\{X_{1},\ldots,X_{N}\}\)
- Parametric: simulate from \(F(\cdot,\hat{\theta}_{N})\)
Compute the bootstrap estimate: \(\hat{\theta}^{b}_{N}=T_{N}(\mathbf{X}^{b})\)
Repeat steps 1-2 for \(b=1,\ldots,B\) to obtain the bootstrap distribution \(\{\hat{\theta}^{1}_{N},\ldots,\hat{\theta}^{B}_{N}\}\)

From this distribution we can compute:

Bootstrap standard errors: \(\text{se}^{*}(\hat{\theta}) = \text{sd}(\hat{\theta}^{1},\ldots,\hat{\theta}^{B})\)
Percentile confidence intervals: \([q_{\alpha/2},\ q_{1-\alpha/2}]\) where \(q_{\alpha}\) is the \(\alpha\)-quantile of the bootstrap distribution
Bootstrap-\(t\) confidence intervals: based on the distribution of the pivotal statistic \(t^{b}=(\hat{\theta}^{b}-\hat{\theta})/\text{se}(\hat{\theta}^{b})\)

14.5.4 When Does the Bootstrap Work?

The bootstrap is consistent when the statistic \(T_{N}\) is asymptotically normal. In this case, \(G_{N}(\cdot,\hat{F}_{N})\rightarrow_{p}G_{\infty}(\cdot,F_{0})\), so the bootstrap distribution converges to the true asymptotic distribution.

The bootstrap can actually do better than the asymptotic normal approximation when the statistic is asymptotically pivotal — that is, when its asymptotic distribution does not depend on unknown parameters. The \(t\)-statistic \(\sqrt{N}(\hat{\theta}-\theta_{0})/\text{se}(\hat{\theta})\) is a canonical example. In this case, the bootstrap provides a higher-order refinement over the normal approximation. See Horowitz (2001) for a detailed discussion.

Replicating the Dependence Structure

The bootstrap requires that the resampling scheme correctly replicates the dependence structure in the data. For iid data, sampling with replacement is sufficient. For panel data, one should resample entire individuals (not individual observations) to preserve the within-person correlation structure. For time series data, block bootstrap methods are needed.

Example: Bootstrap Confidence Intervals for the Income Process

Example 14.5 Let’s apply the bootstrap to construct confidence intervals for the income process parameters estimated by minimum distance in Example 12.2. Rather than relying on the analytical standard errors from Example 13.2, we resample individuals from the PSID panel and re-estimate the parameters on each bootstrap sample.

Since the PSID is panel data, we resample entire individuals (all observations for a given person) to preserve the within-person correlation.

using CSV, DataFrames, DataFramesMeta, Statistics, Optim, Plots, Random

# Load and prepare data
data_psid = @chain begin
    CSV.read("../data/abb_aea_data.csv",DataFrame,missingstring = "NA")
    @select :person :y :tot_assets1 :asset :age :year
    @subset :age.>=25 :age.<=64
end

# Model moments and estimation functions (from @exm-md_income)
function model_moments(θ, T)
    ρ, σ2_α, σ2_η = θ
    m = [σ2_α + (1-ρ^(2(t-1)))/(1-ρ^2) * σ2_η for t in 1:T]
    return m
end

function md_estimate(data)
    m_hat = @chain data begin
        groupby(:age)
        @combine :var_logy = var(log.(:y))
        @orderby :age
        _.var_logy
    end
    T = length(m_hat)

    function md_objective(x)
        ρ = tanh(x[1])
        σ2_α = exp(x[2])
        σ2_η = exp(x[3])
        θ = (ρ, σ2_α, σ2_η)
        m_model = model_moments(θ, T)
        diff = m_hat .- m_model
        return diff' * diff
    end

    x0 = [0.5, log(0.1), log(0.05)]
    res = optimize(md_objective, x0, Newton(), autodiff=:forward)
    x = res.minimizer
    return [tanh(x[1]), exp(x[2]), exp(x[3])]
end

# Point estimate
θ_hat = md_estimate(data_psid)
println("Point estimates: ρ=$(round(θ_hat[1],digits=3)), σ²_α=$(round(θ_hat[2],digits=3)), σ²_η=$(round(θ_hat[3],digits=3))")

Point estimates: ρ=0.918, σ²_α=0.279, σ²_η=0.085

Now the bootstrap. We resample persons with replacement and re-estimate on each bootstrap sample.

Random.seed!(42)
persons = unique(data_psid.person)
N_persons = length(persons)
B = 200

θ_boot = mapreduce(vcat, 1:B) do b
    # Resample persons with replacement
    boot_persons = persons[rand(1:N_persons, N_persons)]
    # Build bootstrap dataset (preserving panel structure)
    boot_data = mapreduce(vcat, boot_persons) do p
        @subset(data_psid, :person .== p)
    end
    try
        return md_estimate(boot_data)'
    catch
        return [NaN NaN NaN]
    end
end

# Remove any failed bootstrap replications
θ_boot = θ_boot[.!any(isnan.(θ_boot), dims=2)[:], :]

# Bootstrap standard errors
se_boot = std.(eachcol(θ_boot))
println("\nBootstrap standard errors (B=$B):")
println("  se(ρ)    = $(round(se_boot[1], digits=4))")
println("  se(σ²_α) = $(round(se_boot[2], digits=4))")
println("  se(σ²_η) = $(round(se_boot[3], digits=4))")

# 95% percentile confidence intervals
ci_lower = [quantile(θ_boot[:,j], 0.025) for j in 1:3]
ci_upper = [quantile(θ_boot[:,j], 0.975) for j in 1:3]
println("\n95% Confidence Intervals:")
println("  ρ:    [$(round(ci_lower[1],digits=3)), $(round(ci_upper[1],digits=3))]")
println("  σ²_α: [$(round(ci_lower[2],digits=3)), $(round(ci_upper[2],digits=3))]")
println("  σ²_η: [$(round(ci_lower[3],digits=3)), $(round(ci_upper[3],digits=3))]")


Bootstrap standard errors (B=200):
  se(ρ)    = 0.079
  se(σ²_α) = 0.115
  se(σ²_η) = 0.0758

95% Confidence Intervals:
  ρ:    [0.742, 1.0]
  σ²_α: [0.144, 0.515]
  σ²_η: [0.012, 0.249]

Let’s visualize the bootstrap distributions.

labels = ["ρ", "σ²_α", "σ²_η"]
pl = [begin
    histogram(θ_boot[:,j], normalize=:pdf, label=false, alpha=0.5)
    plot!([θ_hat[j], θ_hat[j]], [0, ylims()[2]*0.9], color="red",
          linewidth=2, label="Point Estimate")
    plot!(title=labels[j])
end for j in 1:3]
plot(pl..., layout=(1,3), size=(900,300))

Notice that bootstrapping panel data by resampling individuals is straightforward to implement and automatically accounts for within-person dependence, arbitrary heteroskedasticity, and the full complexity of the minimum distance estimator. Compare these to the analytical standard errors from Example 13.2.

14.6 Practical Considerations

14.6.1 Simulation Noise and Smoothing

When simulation draws are held fixed across evaluations of the objective function, the simulated objective is a smooth function of \(\theta\) (assuming the model itself is smooth in \(\theta\)). This means standard gradient-based optimizers can be used. If simulation draws are re-drawn at each evaluation, the objective becomes noisy and optimization may fail or converge to the wrong point. Always fix the random seed when evaluating the simulated objective across different \(\theta\) values.

14.6.2 Simulation Noise and Standard Errors

Although it is critical to fix the seed for randomization when implementing these methods, you have to think carefully about how you then set the seed if you are using bootstrap methods for inference. If the number of simulations \(R\) is comparable to the number of observations, our theory tells us that simulation error will be a non-negligible component of the variance of the estimator. To accommodate this, you shouild ensure that you change the initial seed in each bootstrap trial. Even if you think simulation error is negligible compared to sampling error (i.e. \(R>>N\)), it is still good practice and costs nothing to incorporate it.

14.6.3 How Many Draws?

For MSM, consistency holds for any fixed \(R\), but efficiency improves with \(R\). A practical approach is to start with a moderate \(R\) for exploration, then increase \(R\) and check that the estimates are stable. If the estimates change substantially, \(R\) is too small.

For MSL, the bias from taking logarithms of simulated probabilities is of order \(1/R\). In practice, \(R\) should be large enough that this bias is negligible relative to the standard error.

14.6.4 Choosing Moments for MSM

In Example 14.4, we chose a small set of moments. In practice, the choice of moments should be guided by:

Identification: the moments should be informative about the parameters. Moments that are insensitive to a parameter will not help estimate it.
Parsimony: more moments improve efficiency (in principle) but increase computational cost and can lead to poorly conditioned weighting matrices.
Model fit: if the model cannot match a particular moment, including it will distort the estimates of other parameters. Start with moments the model should be able to match.

The discussion in the identification chapter on which moments pin down which preference parameters is directly relevant here.

14.6.5 Diagonal Weighting Matrices

When using MSM or minimum distance with many moments, the optimal weighting matrix \(\mathbf{W}^{*}=\hat{S}^{-1}\) can be poorly estimated and lead to erratic results in finite samples. A common and practical alternative is to use a diagonal weighting matrix, where each moment is weighted by the inverse of its variance but cross-moment covariances are ignored. This sacrifices some efficiency but tends to be much more stable in practice.

14.7 Exercises

Exercise

Exercise 14.1 Estimate the preference parameters \((\beta,\sigma)\) of the savings model (with \(\psi=0\)) using the method of simulated moments. Take the income process parameters as given from Example 12.2.

Choose a set of target moments from the PSID data (Example 10.1) that are informative about \((\beta,\sigma)\). Justify your choice based on the identification discussion in the savings identification chapter.
Implement the MSM estimator using the simulation code from Example 14.3.
Report your estimates and plot the model fit for your chosen moments.
Assess sensitivity: how do the estimates change when you use a different set of moments?

Exercise

Exercise 14.2 Continuing from Exercise 14.1:

Implement a bootstrap to compute standard errors for \((\hat{\beta},\hat{\sigma})\). Use the same bootstrap samples (resampled individuals from the PSID) that you used to construct confidence intervals for the income process parameters in Example 14.5. For each bootstrap replication \(b\), re-estimate the income process parameters \((\hat{\rho}^{b},\hat{\sigma}^{2b}_{\alpha},\hat{\sigma}^{2b}_{\eta})\) on the bootstrap sample, then re-estimate the savings model preference parameters \((\hat{\beta}^{b},\hat{\sigma}^{b})\) using those first-stage estimates. Use \(B=100\) replications.
Construct 95% confidence intervals for each parameter.
Explain why it is important to use the same bootstrap draws for both the income process and the savings model estimation, rather than bootstrapping each independently. What would go wrong if you used separate bootstrap samples?
Which parameter is most precisely estimated? Which is least? Does this match your intuition from the identification discussion?

Exercise

Exercise 14.3 Return to the minimum distance estimation of the income process from Example 12.2. Instead of using the identity weighting matrix, implement estimation with:

The diagonal weighting matrix, where each moment is weighted by the inverse of the variance of the corresponding sample moment. Use the bootstrap (resampling individuals as in Example 14.5) to estimate the variance of each sample moment.
The optimal weighting matrix, which uses the full variance-covariance matrix of the sample moments. Again, estimate this variance-covariance matrix using the bootstrap.

Compare the estimates and standard errors across all three weighting schemes (identity, diagonal, optimal). How much does the choice of weighting matrix matter in practice?