In many structural models, the likelihood or moment conditions cannot be computed in closed form because they involve high-dimensional integrals over unobserved heterogeneity, latent states, or expectations over future shocks. The savings model is a good example: even if we knew the true parameters, computing the likelihood of an observed consumption path requires integrating over sequences of income shocks. In these settings, simulation provides a way forward.
Example: Calculating Mean Assets
Example 14.1 Suppose you are estimating the savings model by matching mean assets at different ages over the life-cycle. Note that \(A_{t}\) depends on the sequence of realizations of the income shock \(\varepsilon_{s}\) for all periods \(s<t\). Assuming a discrete approximation to the income process, the numerical equivalent to the moment is: \[\mathbb{E}[A_{t}] = \sum_{\varepsilon_{1}^{t-1}}\Pr\left(\varepsilon_{1}^{t-1}\right)A^{*}\left(\varepsilon_{1}^{t-1}\right)\]
where \(A^*\) indicates accumulated savings for the optimal savings policy under a particular guess of the model’s parameters. Notice that if we assume \(K_\varepsilon\) grid points for \(\varepsilon\), the number of sequences \(\varepsilon_{1}^{t-1}\) that we must sum over is \(K_{\varepsilon}^{t-1}\). For larger \(t\), this will be intractably large.
This chapter covers three topics. First, we discuss simulation-based estimators — the method of simulated moments (MSM) and indirect inference — which replace intractable analytical objects with simulation approximations. Second, we introduce the bootstrap, a general resampling-based approach to inference that is particularly useful when analytical standard errors are difficult to derive. Third, we bring these tools together with our prototype models.
14.1 Integration by Simulation
At the heart of simulation-based estimation is the problem of computing integrals and high-dimensional sums. Suppose we need to evaluate: \[f = \mathbb{E}_{\mathbf{u}}[h(\mathbf{u})] = \int h(\mathbf{u})\,g(\mathbf{u})\,d\mathbf{u}\]
where \(\mathbf{u}\) is a vector of unobservables with density \(g\). If we want to evaluate this integral with quadrature methods (Section 2.5.1), we face a curse of dimensionality: when the number of dimensions of \(\mathbf{u}\) is large, the problem is simply intractable. One solution is to approximate \(f\) by Monte Carlo integration: draw \(\mathbf{u}^{1},\ldots,\mathbf{u}^{R}\) from \(g\) and compute: \[\hat{f} = \frac{1}{R}\sum_{r=1}^{R}h(\mathbf{u}^{r})\]
By the law of large numbers, \(\hat{f}\rightarrow_{p} f\) as \(R\rightarrow\infty\), and the estimator is unbiased by construction: \(\mathbb{E}_{\mathbf{u}}[\hat{f}]=f\).
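To fix ideas, here is a minimal sketch (my own illustration, not from the text) that approximates \(\mathbb{E}[u^{2}]=1\) for \(u\sim\mathcal{N}(0,1)\) by crude Monte Carlo:

```julia
using Distributions, Random, Statistics

Random.seed!(1)
g = Normal(0, 1)      # density of the unobservable u
h(u) = u^2            # integrand; E[h(u)] = 1 exactly

# Crude Monte Carlo: average h over R draws from g
f_hat(R) = mean(h.(rand(g, R)))

println(f_hat(100))        # rough
println(f_hat(1_000_000))  # close to 1; the error shrinks at rate 1/√R
```

The \(1/\sqrt{R}\) convergence rate does not depend on the dimension of \(\mathbf{u}\), which is exactly why Monte Carlo sidesteps the curse of dimensionality that quadrature suffers from.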
14.1.1 Importance Sampling
Sometimes we want to draw from an alternative density \(w(\mathbf{u})\) instead of \(g(\mathbf{u})\) — for instance because \(g\) is hard to sample from, or because draws from \(w\) reduce the variance of the estimator. In this case: \[\hat{f} = \frac{1}{R}\sum_{r=1}^{R}h(\mathbf{u}^{r})\frac{g(\mathbf{u}^{r})}{w(\mathbf{u}^{r})},\qquad \mathbf{u}^{r}\sim w\]
This estimator is still unbiased and consistent. The ratio \(g/w\) is called the importance weight.
Variance reduction. Why can the choice of \(w\) reduce variance? The variance of the importance-sampling estimator depends on the variance of \(h \cdot g/w\) under draws from \(w\). Intuitively, if \(w\) concentrates draws in regions where the integrand \(h \cdot g\) is large, the estimator “wastes” fewer draws on regions that contribute little to the integral, reducing variance. The ideal importance density is proportional to \(|h| \cdot g\), which puts the most weight exactly where it matters most. In practice, one looks for a convenient \(w\) that roughly approximates this shape.
Example: Variance Reduction via Importance Sampling
Example 14.2 To illustrate, consider estimating a tail probability: \(P(Z > 3)\) where \(Z\sim\mathcal{N}(0,1)\). The true value is approximately \(0.00135\).
Naive Monte Carlo draws \(Z^{1},\ldots,Z^{R}\sim\mathcal{N}(0,1)\) and computes \(\hat{p}=\frac{1}{R}\sum_{r}\mathbf{1}\{Z^{r}>3\}\). Because the event \(\{Z>3\}\) is rare, most draws contribute zero to the sum — the estimator “wastes” draws in the center of the distribution where \(\mathbf{1}\{Z>3\}\cdot\phi(Z)\) is negligible.
Importance sampling instead draws from \(w=\mathcal{N}(3,1)\), which concentrates mass in the tail where the integrand is large. The estimator becomes: \[\hat{p}_{IS}=\frac{1}{R}\sum_{r}\mathbf{1}\{Z^{r}>3\}\frac{\phi(Z^{r})}{\phi_{w}(Z^{r})},\qquad Z^{r}\sim\mathcal{N}(3,1)\]
We compare both estimators across 2,000 independent replications, each using \(R=2{,}000\) draws.
```julia
using Distributions, Random, Plots, Statistics

Random.seed!(42)
c = 3.0
p_true = 1 - cdf(Normal(), c)
R = 2000; n_reps = 2000
g = Normal(0, 1)  # target distribution
w = Normal(c, 1)  # importance distribution (shifted to the tail)

# Naive MC: draw from g, average the indicators
naive_estimates = [mean(rand(g, R) .> c) for _ in 1:n_reps]

# Importance sampling: draw from w, reweight
is_estimates = map(1:n_reps) do _
    z = rand(w, R)
    weights = pdf.(g, z) ./ pdf.(w, z)
    mean((z .> c) .* weights)
end

println("True probability: P(Z > $c) = $(round(p_true, sigdigits=5))")
println("Naive MC: mean = $(round(mean(naive_estimates), sigdigits=5)), std = $(round(std(naive_estimates), sigdigits=3))")
println("Importance MC: mean = $(round(mean(is_estimates), sigdigits=5)), std = $(round(std(is_estimates), sigdigits=3))")
println("Variance ratio (naive / IS): $(round(var(naive_estimates) / var(is_estimates), digits=1))×")
```
True probability: P(Z > 3.0) = 0.0013499
Naive MC: mean = 0.0013565, std = 0.000817
Importance MC: mean = 0.0013504, std = 5.47e-5
Variance ratio (naive / IS): 223.3×
The contrast is stark. The naive estimator is highly discrete — it can only take values that are multiples of \(1/R\) — and its distribution is spread widely around the truth. The importance sampling estimator is continuous (because the importance weights are continuous) and tightly concentrated. Both are unbiased, but importance sampling achieves a dramatic reduction in variance by directing draws to the region that matters.
Differentiability. Importance sampling also plays a role in making simulated integrals differentiable with respect to the parameters \(\theta\). When the integrand involves indicator functions — as in discrete choice models where \(h(y|\mathbf{x},\theta,\mathbf{u})=\mathbf{1}\{\cdot\}\) — the crude Monte Carlo approximation is a step function in \(\theta\), which creates problems for gradient-based optimization. Through a change of variables, it is sometimes possible to move the dependence on \(\theta\) out of the indicator and into smooth importance weights, yielding a simulated integral that is differentiable in \(\theta\). The GHK simulator for multivariate probit models is a prominent example of this strategy (see Section 14.2.1 below).
14.2 Maximum Simulated Likelihood
A natural first idea is to replace the likelihood with its simulated analogue. The maximum simulated likelihood (MSL) estimator is: \[\hat{\theta}_{MSL} = \arg\max_{\theta}\sum_{n=1}^{N}\log\hat{f}(y_{n}|\mathbf{x}_{n},\theta)\]
where \(\hat{f}\) is the simulated probability from above.
A Subtlety with MSL
There is an important issue with MSL: because \(\log\) is concave, Jensen’s inequality gives \(\mathbb{E}[\log\hat{f}]\leq \log\mathbb{E}[\hat{f}]=\log f\). This means the simulated log-likelihood is biased downward, and the MSL estimator is inconsistent for fixed \(R\). Consistency requires that the number of simulation draws grows with the sample: both \(N,R\rightarrow\infty\) with \(\sqrt{N}/R\rightarrow 0\).
This is in contrast to the method of simulated moments, which we discuss next, where a fixed number of simulation draws is sufficient for consistency.
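A quick numerical check of this bias (my own illustration, not from the text): let \(f=\mathbb{E}[e^{Z}]=e^{1/2}\) for \(Z\sim\mathcal{N}(0,1)\), so \(\log f = 0.5\), and compare \(\mathbb{E}[\log\hat{f}]\) with \(\log f\) as \(R\) grows:

```julia
using Random, Statistics

Random.seed!(2)
logf = 0.5       # log E[exp(Z)] for Z ~ N(0,1)
n_reps = 20_000

# Average of log f̂ across many independent simulators with R draws each
mean_log_fhat(R) = mean(log(mean(exp.(randn(R)))) for _ in 1:n_reps)

# Jensen's inequality: E[log f̂] < log f for every finite R
println(logf - mean_log_fhat(5))    # sizeable downward bias
println(logf - mean_log_fhat(500))  # much smaller bias
```

The bias falls at rate \(1/R\), which is why consistency of MSL requires \(R\rightarrow\infty\).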
Under standard regularity conditions and \(\sqrt{N}/R\rightarrow 0\), the MSL estimator has the same asymptotic distribution as the (infeasible) MLE: \[\sqrt{N}(\hat{\theta}_{MSL}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(0,\ \mathcal{I}(\theta_{0})^{-1}\right)\]
where \(\mathcal{I}(\theta_{0})\) is the Fisher information.
In practice, one should use enough draws that the simulation error is small relative to the sampling variability. A common rule of thumb is to set \(R\) large enough that results are insensitive to further increases.
14.2.1 Differentiability of the Simulated Likelihood
Warning
A separate challenge with MSL arises when the model involves discrete outcomes. Consider a simple probit: \(Y=\mathbf{1}\{X\beta+\varepsilon>0\}\) with \(\varepsilon\sim\mathcal{N}(0,1)\). The true choice probability \(\Phi(X\beta)\) is a smooth function of \(\beta\). But if we approximate it by crude Monte Carlo — drawing \(\varepsilon^{1},\ldots,\varepsilon^{R}\sim\mathcal{N}(0,1)\) and computing \(\hat{P}=\frac{1}{R}\sum_{r}\mathbf{1}\{X\beta+\varepsilon^{r}>0\}\) — the result is a step function in \(\beta\). Each indicator \(\mathbf{1}\{X\beta+\varepsilon^{r}>0\}\) switches from 0 to 1 at \(\beta=-\varepsilon^{r}/X\), so \(\hat{P}\) jumps at finitely many points and is flat everywhere else. Its derivative is zero almost everywhere, making gradient-based optimization of the simulated log-likelihood impossible.
The figure below illustrates this for \(X=1\) and \(R=20\) draws. The true probability \(\Phi(\beta)\) is smooth, but the simulated probability is a step function that jumps at each \(\beta=-\varepsilon^{r}\). The kernel-smoothed simulator replaces the indicator with \(\Phi(\cdot/h)\), restoring differentiability at the cost of a small bias.
This problem is not specific to probit — it arises whenever the simulated likelihood involves averaging indicator functions. Solutions include:
Kernel smoothing: replace the indicator \(\mathbf{1}\{\cdot>0\}\) with a smooth approximation such as \(\Phi(\cdot/h)\) for a small bandwidth \(h\) (the dashed red curve above).
Importance sampling / change of variables: restructure the integral so that \(\beta\) appears in smooth weights rather than inside an indicator. The GHK simulator for multivariate probit uses sequential conditioning to achieve this.
Derivative-free optimization: use methods like Nelder-Mead that do not require gradients. This is simple but can be slow and unreliable in high dimensions.
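To make the first solution concrete, here is a sketch (my own, following the probit setup of this section with \(X=1\)) of the crude frequency simulator next to its kernel-smoothed version:

```julia
using Distributions, Random

Random.seed!(3)
R = 20
eps = rand(Normal(0, 1), R)   # simulation draws, held fixed across β

# Crude frequency simulator: a step function of β
P_crude(β) = mean(β .+ eps .> 0)

# Kernel-smoothed simulator: replace the indicator with Φ(·/h)
h = 0.1
P_smooth(β) = mean(cdf.(Normal(), (β .+ eps) ./ h))

# P_crude is flat almost everywhere (zero derivative between jumps);
# P_smooth varies smoothly with β
println(P_crude(0.5), " ", P_smooth(0.5))
```

With the draws held fixed, `P_smooth` can be handed to a gradient-based optimizer; the bandwidth \(h\) trades off smoothness against bias.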
14.3 Method of Simulated Moments
The method of simulated moments (MSM) is the simulation analogue of GMM. Suppose that the moment conditions take the form: \[\mathbb{E}[g(\mathbf{w}_{n},\theta_{0})]=\mathbf{0}\]
If we cannot evaluate \(g\) analytically, we can replace it with a simulated version: \[\hat{g}(\mathbf{w}_{n},\theta) = \frac{1}{R}\sum_{r=1}^{R}\tilde{g}(\mathbf{w}_{n},\mathbf{u}^{r},\theta)\]
where \(\mathbb{E}_{\mathbf{u}}[\tilde{g}(\mathbf{w},\mathbf{u},\theta)]=g(\mathbf{w},\theta)\). The MSM estimator is: \[\hat{\theta}_{MSM} = \arg\min_{\theta}\ \left(\frac{1}{N}\sum_{n=1}^{N}\hat{g}(\mathbf{w}_{n},\theta)\right)'\mathbf{W}\left(\frac{1}{N}\sum_{n=1}^{N}\hat{g}(\mathbf{w}_{n},\theta)\right)\]
14.3.1 Matching Aggregate Statistics
In many structural applications, the moment conditions do not involve observation-level simulation. Instead, we match aggregate statistics from the data to their model-implied counterparts computed by simulating the model. The moments take the form: \[\hat{g}(\theta) = \hat{\mathbf{m}} - \tilde{\mathbf{m}}(\theta)\]
where \(\hat{\mathbf{m}}\) is a vector of sample statistics (means, variances, covariances, etc.) and \(\tilde{\mathbf{m}}(\theta)\) is the corresponding vector computed from simulated data. This is the structure we will use for the savings model below.
14.3.2 Asymptotic Distribution
Theorem 14.1 (Asymptotic Distribution of MSM) Suppose the standard regularity conditions for GMM hold and \(\mathbb{E}_{\mathbf{u}}[\tilde{g}(\mathbf{w},\mathbf{u},\theta)]=g(\mathbf{w},\theta)\). Then with \(R\) fixed: \[\sqrt{N}(\hat{\theta}_{MSM}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(0,\ \left(\nabla_{\theta}g_{0}'\mathbf{W}\nabla_{\theta}g_{0}\right)^{-1}\nabla_{\theta}g_{0}'\mathbf{W}\,\mathbb{V}[\hat{g}]\,\mathbf{W}\nabla_{\theta}g_{0}\left(\nabla_{\theta}g_{0}'\mathbf{W}\nabla_{\theta}g_{0}\right)^{-1}\right)\]
with \(\nabla_{\theta}g_{0}=\mathbb{E}[\nabla_{\theta}g(\mathbf{w},\theta_{0})']\), and where \(\mathbb{V}[\hat{g}]\) includes the simulation variance in addition to the sampling variance of the moment.
Several features are worth highlighting:
Consistency does not require \(R\rightarrow\infty\). Because \(\hat{g}\) is an unbiased estimator of \(g\), the simulated sample moments converge to zero at \(\theta_{0}\) by the law of large numbers. This is in contrast to MSL.
Simulation adds variance. Because \(\mathbb{V}[\hat{g}]\geq\mathbb{V}[g]\), the MSM estimator is less efficient than the infeasible GMM estimator that uses the true \(g\). For the common case of a frequency simulator (where \(\tilde{g}\) is a binary indicator), the variance inflates by a factor of \((1+1/R)\).
If \(\sqrt{N}/R\rightarrow 0\), the efficiency loss from simulation disappears and the MSM achieves the same asymptotic variance as GMM.
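The \((1+1/R)\) factor is easy to verify numerically. A small check (my own illustration): for a Bernoulli(\(p\)) outcome, replacing \(p\) in the moment \(y-p\) with a frequency simulator based on \(R\) draws inflates the moment's variance from \(p(1-p)\) to \(p(1-p)(1+1/R)\):

```julia
using Random, Statistics

Random.seed!(8)
p, n = 0.3, 200_000
base_var = p * (1 - p)   # variance of the infeasible moment y − p

for R in (1, 10, 100)
    # frequency simulator: replace p with the share of R binary draws
    g = [(rand() < p) - mean(rand(R) .< p) for _ in 1:n]
    println("R = $R: inflation ≈ $(round(var(g) / base_var, digits=3)) (theory: $(1 + 1/R))")
end
```

Even \(R=10\) draws per observation already brings the inflation factor down to 1.1.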
See Newey and McFadden (1994) for a comprehensive treatment.
Example: Simulating the Savings Model
Example 14.3 To estimate the savings model by MSM, we first need to be able to simulate data from it. Given a solution (the policy function \(A\)), we can forward-simulate a panel of individuals by drawing income shocks and applying the optimal savings rule.
```julia
using Distributions, Random, LinearAlgebra, Statistics

# === Model setup (from savings model chapter) ===
Φ_cdf(x) = cdf(Normal(), x)

function tauchen(ρ, ση, Kϵ)
    sd = ση / sqrt(1 - ρ^2)
    grid = range(-3sd, stop=3sd, length=Kϵ)
    Π = zeros(Kϵ, Kϵ)
    Δ = grid[2] - grid[1]
    for j = 1:Kϵ
        Π[1, j] = Φ_cdf((grid[1] + Δ/2 - ρ*grid[j]) / ση)
        Π[end, j] = 1 - Φ_cdf((grid[end] - Δ/2 - ρ*grid[j]) / ση)
        for k = 2:(Kϵ-1)
            Π[k, j] = Φ_cdf((grid[k] + Δ/2 - ρ*grid[j]) / ση) - Φ_cdf((grid[k] - Δ/2 - ρ*grid[j]) / ση)
        end
    end
    return Π, collect(grid)
end

u(c, σ) = c^(1-σ) / (1-σ)

function solve_max(V, t, iϵ, ia, pars)
    (;agrid, ϵgrid, Π, σ, Ka, r, β) = pars
    cash = exp(pars.μ[t] + ϵgrid[iϵ]) + agrid[ia]
    amax = 1
    vmax = -Inf
    loop = true
    a = 1
    while loop && a < Ka
        c = cash - agrid[a] / (1+r)
        if c > 0
            v = u(c, σ)
            for iϵ′ in axes(V, 1)
                v += β * Π[iϵ′, iϵ] * V[iϵ′, a, t+1]
            end
            if v > vmax
                vmax = v
                amax = a
            end
        else
            loop = false
        end
        a += 1
    end
    return amax, vmax
end

function iterate!(V, A, t, pars)
    for ia in axes(V, 2), iϵ in axes(V, 1)
        A[iϵ, ia, t], V[iϵ, ia, t] = solve_max(V, t, iϵ, ia, pars)
    end
end

function terminal_values!(V, pars)
    (;σ, ψ, agrid) = pars
    for ia in axes(V, 2), iϵ in axes(V, 1)
        V[iϵ, ia] = ψ > 0 ? ψ * u(max(agrid[ia], 1e-10), σ) : 0.0
    end
end

function backward_induction!(V, A, pars)
    (;T) = pars
    @views terminal_values!(V[:, :, T+1], pars)
    for t in reverse(1:T)
        iterate!(V, A, t, pars)
    end
end

function setup_and_solve(θ)
    β, σ, ψ = θ
    T = 45
    Ka = 100
    Kϵ = 5
    ρ = 0.9
    ση = 0.1
    r = 0.05
    μ = fill(2., T)
    agrid = collect(LinRange(0, μ[1] * T, Ka))
    Π, ϵgrid = tauchen(ρ, ση, Kϵ)
    pars = (;T, β, σ, ρ, ση, μ, ψ, r, Ka, Kϵ, agrid, Π, ϵgrid)
    V = zeros(Kϵ, Ka, T+1)
    A = zeros(Int64, Kϵ, Ka, T)
    backward_induction!(V, A, pars)
    return A, pars
end
```
setup_and_solve (generic function with 1 method)
Now we write a simulation function. Given the policy function, we draw income shocks and track each individual’s asset and income path.
```julia
function simulate_panel(A, pars, N_sim; seed=nothing)
    (;T, Kϵ, Ka, agrid, ϵgrid, Π, μ, r) = pars
    if !isnothing(seed)
        Random.seed!(seed)
    end
    # Pre-compute CDF for sampling from Markov chain
    cumΠ = cumsum(Π, dims=1)
    # Storage
    assets = zeros(N_sim, T+1)   # assets at start of each period
    income = zeros(N_sim, T)
    consumption = zeros(N_sim, T)
    iϵ_sim = zeros(Int, N_sim, T)
    # Initial conditions: start with zero assets, draw initial ϵ from stationary dist
    # Use middle state as starting point for simplicity
    iϵ_sim[:, 1] .= (Kϵ + 1) ÷ 2
    for n in 1:N_sim
        ia = 1   # start with zero assets
        for t in 1:T
            iϵ = iϵ_sim[n, t]
            y = exp(μ[t] + ϵgrid[iϵ])
            income[n, t] = y
            cash = y + agrid[ia]
            # Look up optimal savings
            ia_next = A[iϵ, ia, t]
            c = cash - agrid[ia_next] / (1 + r)
            consumption[n, t] = c
            assets[n, t+1] = agrid[ia_next]
            ia = ia_next
            # Draw next period's ϵ
            if t < T
                u_draw = rand()
                iϵ_next = findfirst(cumΠ[:, iϵ] .>= u_draw)
                iϵ_sim[n, t+1] = iϵ_next
            end
        end
    end
    return (;assets, income, consumption)
end

# Solve and simulate
A, pars = setup_and_solve((0.95, 2.0, 0.0))
sim = simulate_panel(A, pars, 1000; seed=123)
```
The hump-shaped asset profile and the smooth consumption path are consistent with the model’s predictions: individuals accumulate assets during working years as a buffer against income risk and draw them down toward the end of life. With no bequest motive (\(\psi=0\)), assets are fully decumulated by the final period.
Example: Simulated Moments for the Savings Model
Example 14.4 Now let’s use the simulation machinery to define a set of moments and construct an MSM estimator for the preference parameters \((\beta,\sigma)\), setting \(\psi=0\) (no bequest motive). We take the income process parameters \((\rho,\sigma_{\eta})\) as given (estimated in Example 12.2).
Our target moments will be:
Mean assets at ages 30, 40, 50, 60 (to pin down the savings profile)
Variance of log consumption at ages 30, 40, 50, 60 (to pin down risk aversion)
```julia
# Compute moments from simulated data
function simulated_moments(θ; N_sim=2000, seed=42)
    A, pars = setup_and_solve(θ)
    sim = simulate_panel(A, pars, N_sim; seed=seed)
    target_ages = [6, 16, 26, 36]   # ages 30, 40, 50, 60 (relative to t=1 at age 25)
    m_assets = [mean(sim.assets[:, t]) for t in target_ages]
    m_var_logc = [var(log.(sim.consumption[:, t])) for t in target_ages]
    return vcat(m_assets, m_var_logc)
end

# "True" parameters and data moments (ψ=0: no bequest motive)
θ_true = (0.95, 2.0, 0.0)
m_data = simulated_moments(θ_true; N_sim=2_000, seed=1)

# MSM objective: estimate (β, σ) with ψ fixed at 0
function msm_objective(x, m_data)
    β = 0.8 + 0.19 * (1 / (1 + exp(-x[1])))   # map to (0.8, 0.99)
    σ = exp(x[2])
    θ = (β, σ, 0.0)
    m_sim = simulated_moments(θ; N_sim=5_000)
    diff = m_data .- m_sim
    return diff' * diff
end
```
msm_objective (generic function with 1 method)
This estimator is computationally intensive — each evaluation of the objective requires solving the dynamic programming problem and simulating the model.
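The estimation step itself is a short call to a derivative-free optimizer. Here is a sketch, assuming the `msm_objective` and `m_data` objects from above and the Optim.jl package (the starting values and the stand-in fallback are my own choices):

```julia
using Optim

# `msm_objective` and `m_data` come from the code above; the guard below
# substitutes a cheap stand-in objective so the sketch also runs on its own
if !@isdefined(msm_objective)
    msm_objective(x, m_data) = x[1]^2 + (x[2] - log(2.0))^2   # stand-in only
    m_data = nothing
end

# Nelder-Mead: derivative-free, since the simulated moments are not smooth in θ
x0 = [0.0, log(2.0)]
res = optimize(x -> msm_objective(x, m_data), x0, NelderMead())

# Undo the transformations used inside msm_objective
x_hat = Optim.minimizer(res)
β_hat = 0.8 + 0.19 * (1 / (1 + exp(-x_hat[1])))
σ_hat = exp(x_hat[2])
println("β̂ = $(round(β_hat, digits=3)), σ̂ = $(round(σ_hat, digits=3))")
```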
Let’s check the model fit by simulating at both the true and estimated parameters.
```julia
# Simulate at estimated and true parameters
θ_hat_msm = (β_hat, σ_hat, 0.0)
A_hat, pars_hat = setup_and_solve(θ_hat_msm)
sim_hat = simulate_panel(A_hat, pars_hat, 5_000)
A_true, pars_true = setup_and_solve(θ_true)
sim_true = simulate_panel(A_true, pars_true, 2_000; seed=1)

T = pars_true.T
ages = 25 .+ (1:T)
target_ages_actual = [30, 40, 50, 60]
target_periods = target_ages_actual .- 24   # periods corresponding to target ages

# Mean assets over the life cycle
p1 = plot(ages, mean(sim_true.assets[:, 1:T], dims=1)', label="Data (true θ)",
    linewidth=2, color=:black, xlabel="Age", ylabel="Mean Assets", title="Mean Assets")
plot!(p1, ages, mean(sim_hat.assets[:, 1:T], dims=1)', label="Estimated θ",
    linewidth=2, color=:steelblue, linestyle=:dash)
vline!(p1, target_ages_actual, color=:gray, alpha=0.3, linestyle=:dot, label=false)

# Variance of log consumption over the life cycle
var_logc_true = [var(log.(sim_true.consumption[:, t])) for t in 1:T]
var_logc_hat = [var(log.(sim_hat.consumption[:, t])) for t in 1:T]
p2 = plot(ages, var_logc_true, label="Data (true θ)", linewidth=2, color=:black,
    xlabel="Age", ylabel="Var(log c)", title="Variance of Log Consumption")
plot!(p2, ages, var_logc_hat, label="Estimated θ", linewidth=2, color=:steelblue, linestyle=:dash)
vline!(p2, target_ages_actual, color=:gray, alpha=0.3, linestyle=:dot, label=false)

plot(p1, p2, layout=(1,2), size=(800, 350))
```
The discrete jumps visible in the estimated profiles are a consequence of the discrete asset grid — the policy function maps to grid points, so small parameter changes can cause the optimal grid point to jump (see Section 14.2.1).
Non-Differentiability of Simulated Moments
Notice that we used the Nelder-Mead algorithm above — a derivative-free optimizer. This is not a coincidence. In the savings model, the policy function \(A(\varepsilon,a,t)\) maps to a discrete asset grid, so the simulated consumption and asset paths are step functions of the parameters \(\theta\). As \(\theta\) changes continuously, the optimal grid point occasionally jumps, producing discontinuities in the simulated moments. This is the same non-differentiability problem discussed in Section 14.2.1, now appearing in the MSM context.
Gradient-based optimizers will fail here because the numerical gradient is zero almost everywhere and undefined at the jumps. Derivative-free methods like Nelder-Mead avoid this issue entirely. Alternatives include using a finer asset grid (which reduces the size of the jumps), interpolating the policy function between grid points to smooth the simulated moments, or using kernel-smoothed simulators.
Fixing Simulation Error
Notice in Example 14.4 that we fixed the seed of the simulation when we call the function simulate_panel. What do you think would happen if we failed to do this inside our optimization routine?
14.4 Indirect Inference
Indirect inference is a simulation-based estimator that works by fitting an auxiliary model to both the real data and to data simulated from the structural model, and then finding the structural parameters that make the two sets of auxiliary estimates as close as possible.
14.4.1 Setup
Let \(\hat{\beta}_{N}\) be a vector of statistics estimated from the data (the “auxiliary parameters”). For example, \(\hat{\beta}\) might be OLS coefficients from a wage regression, or the parameters of a reduced-form probit. The key requirement is that \(\hat{\beta}\) is asymptotically normal: \[\sqrt{N}(\hat{\beta}_{N}-\beta_{0})\rightarrow_{d}\mathcal{N}(0,\mathbf{V}_{\beta})\]
Now, for each candidate \(\theta\), simulate data from the structural model and compute the same statistic on the simulated data, yielding \(\hat{\beta}^{R}(\theta)\). Let \(\beta(\theta)=\text{plim}\ \hat{\beta}^{R}(\theta)\) denote the “pseudo-true” value of the auxiliary parameter implied by \(\theta\). The indirect inference estimator is: \[\hat{\theta}_{II} = \arg\min_{\theta}\ \left(\hat{\beta}_{N}-\hat{\beta}^{R}(\theta)\right)'\mathbf{W}\left(\hat{\beta}_{N}-\hat{\beta}^{R}(\theta)\right)\]
14.4.2 Asymptotic Distribution
Under regularity conditions (and abstracting from simulation noise in \(\hat{\beta}^{R}\)), the estimator is asymptotically normal: \[\sqrt{N}(\hat{\theta}_{II}-\theta_{0})\rightarrow_{d}\mathcal{N}\left(0,\ \left(\nabla_{\theta}\beta_{0}'\mathbf{W}\nabla_{\theta}\beta_{0}\right)^{-1}\nabla_{\theta}\beta_{0}'\mathbf{W}\mathbf{V}_{\beta}\mathbf{W}\nabla_{\theta}\beta_{0}\left(\nabla_{\theta}\beta_{0}'\mathbf{W}\nabla_{\theta}\beta_{0}\right)^{-1}\right)\]
where \(\nabla_{\theta}\beta_{0}=\frac{\partial\beta(\theta_{0})'}{\partial\theta}\).
The variance formula has the same sandwich structure as minimum distance estimation (Theorem 13.7), which makes sense: indirect inference is a simulated minimum distance estimator where the reduced-form statistics play the role of \(\pi\).
14.4.3 Relationship to MSM
Indirect inference nests MSM as a special case. If the auxiliary model consists of sample moments \(\hat{\beta}=\frac{1}{N}\sum_{n}g(\mathbf{w}_{n})\) and the simulated counterpart is \(\hat{\beta}^{R}(\theta)=\frac{1}{NR}\sum_{r}\sum_{n}g(\mathbf{w}_{n}^{r}(\theta))\), then the indirect inference estimator reduces to MSM. The appeal of indirect inference is that it allows the use of richer auxiliary models — like regressions or multinomial choice models — that can capture complex features of the data.
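As a toy illustration of the mechanics (my own construction, not an example from the text): estimate the slope of a probit by matching the OLS coefficient of a linear probability model computed on the data and on model simulations:

```julia
using Random, Statistics

Random.seed!(4)
N = 5_000
θ0 = 1.0
x = randn(N)
y = Float64.(x .* θ0 .+ randn(N) .> 0)     # "observed" data from a probit

# Auxiliary statistic: OLS slope of y on x (a linear probability model)
ols_slope(x, y) = cov(x, y) / var(x)
β_data = ols_slope(x, y)

# Same statistic on simulated data, with draws held fixed across θ
u = randn(N)
β_sim(θ) = ols_slope(x, Float64.(x .* θ .+ u .> 0))

# Indirect inference by grid search over θ
grid = 0.5:0.01:1.5
θ_hat = grid[argmin([(β_data - β_sim(θ))^2 for θ in grid])]
println(θ_hat)   # close to θ0 = 1.0
```

Here the auxiliary model (a linear probability model) is misspecified for the structural probit, but that does not matter: it only needs to be informative about \(\theta\).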
Discussion: Choosing Auxiliary Statistics
The choice of auxiliary model is both an art and a science. A good auxiliary model should:
Be sensitive to the structural parameters of interest (for efficiency).
Capture the key features of the data that the structural model is designed to explain.
Be computationally cheap to estimate (since it must be re-estimated on each simulated dataset).
In practice, it is useful to think of the auxiliary model as a diagnostic: if the structural model cannot match the auxiliary estimates, it suggests the model is missing something important. This perspective connects indirect inference to the broader goal of model evaluation.
14.5 The Bootstrap
We now turn to a fundamentally different use of simulation: not for evaluating the objective function, but for inference. The bootstrap uses resampling to approximate the sampling distribution of an estimator, providing standard errors and confidence intervals without requiring analytical derivations.
14.5.1 Motivation
In many settings we face one or more of the following challenges:
The variance formula is difficult to derive (e.g. for multi-step estimators).
The derivatives of the criterion function are hard or costly to compute.
The estimation criterion is non-smooth in finite samples.
The sample size is too small for the CLT to provide a good approximation.
The bootstrap addresses all of these by directly approximating the finite-sample distribution.
14.5.2 Framework
Let \(T_{N} = T_{N}(X_{1},\ldots,X_{N})\) be a statistic — for instance, \(T_{N}=\hat{\theta}\) or \(T_{N}=\sqrt{N}(\hat{\theta}-\theta_{0})\). The finite-sample distribution \(G_{N}(\cdot,F_{0})\) is typically unknown. The bootstrap idea is to replace \(F_{0}\) with an estimate \(\hat{F}_{N}\) and use \(G_{N}(\cdot,\hat{F}_{N})\) to approximate \(G_{N}(\cdot,F_{0})\).
There are two main approaches to constructing \(\hat{F}_{N}\):
Definition 14.1 (Nonparametric Bootstrap) Replace \(F_{0}\) with the empirical distribution function: \[\hat{F}_{N}(x) = \frac{1}{N}\sum_{n=1}^{N}\mathbf{1}\{X_{n}\leq x\}\] Bootstrap samples are obtained by sampling \(N\) observations with replacement from the original data.
Definition 14.2 (Parametric Bootstrap) Replace \(F_{0}\) with \(F(\cdot,\hat{\theta}_{N})\), the model-implied distribution evaluated at the estimated parameters. Bootstrap samples are obtained by simulating from the estimated model.
The nonparametric bootstrap is more general and requires fewer assumptions, while the parametric bootstrap can be more efficient when the model is correctly specified. For structural models, the parametric bootstrap is natural since we already have the machinery to simulate from the model.
14.5.3 Procedure
Given an estimator \(\hat{\theta}_{N}\):
Draw a bootstrap sample of size \(N\): \(\mathbf{X}^{b}=\{X_{1}^{b},\ldots,X_{N}^{b}\}\)
Nonparametric: sample with replacement from \(\{X_{1},\ldots,X_{N}\}\)
Parametric: simulate from \(F(\cdot,\hat{\theta}_{N})\)
Compute the bootstrap estimate: \(\hat{\theta}^{b}_{N}=T_{N}(\mathbf{X}^{b})\)
Repeat steps 1-2 for \(b=1,\ldots,B\) to obtain the bootstrap distribution \(\{\hat{\theta}^{1}_{N},\ldots,\hat{\theta}^{B}_{N}\}\)
From this distribution we can compute:
Bootstrap standard errors: \(\text{se}^{*}(\hat{\theta}) = \text{sd}(\hat{\theta}^{1},\ldots,\hat{\theta}^{B})\)
Percentile confidence intervals: \([q_{\alpha/2},\ q_{1-\alpha/2}]\) where \(q_{\alpha}\) is the \(\alpha\)-quantile of the bootstrap distribution
Bootstrap-\(t\) confidence intervals: based on the distribution of the pivotal statistic \(t^{b}=(\hat{\theta}^{b}-\hat{\theta})/\text{se}(\hat{\theta}^{b})\)
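These steps take only a few lines in practice. A minimal nonparametric-bootstrap sketch (my own illustration) for the standard error and percentile interval of a sample median:

```julia
using Random, Statistics

Random.seed!(5)
x = randn(200) .+ 1.0     # the original sample
θ_hat = median(x)

# Steps 1-2, repeated B times: resample with replacement, recompute the statistic
B = 2_000
boot = [median(rand(x, length(x))) for _ in 1:B]

se_boot = std(boot)                     # bootstrap standard error
ci = quantile(boot, [0.025, 0.975])     # percentile confidence interval
println("median = $θ_hat, se* = $se_boot, 95% CI = $ci")
```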
14.5.4 When Does the Bootstrap Work?
The bootstrap is consistent when the statistic \(T_{N}\) is asymptotically normal. In this case, \(G_{N}(\cdot,\hat{F}_{N})\rightarrow_{p}G_{\infty}(\cdot,F_{0})\), so the bootstrap distribution converges to the true asymptotic distribution.
The bootstrap can actually do better than the asymptotic normal approximation when the statistic is asymptotically pivotal — that is, when its asymptotic distribution does not depend on unknown parameters. The \(t\)-statistic \(\sqrt{N}(\hat{\theta}-\theta_{0})/\text{se}(\hat{\theta})\) is a canonical example. In this case, the bootstrap provides a higher-order refinement over the normal approximation. See Horowitz (2001) for a detailed discussion.
Replicating the Dependence Structure
The bootstrap requires that the resampling scheme correctly replicates the dependence structure in the data. For iid data, sampling with replacement is sufficient. For panel data, one should resample entire individuals (not individual observations) to preserve the within-person correlation structure. For time series data, block bootstrap methods are needed.
Example: Bootstrap Confidence Intervals for the Income Process
Example 14.5 Let’s apply the bootstrap to construct confidence intervals for the income process parameters estimated by minimum distance in Example 12.2. Rather than relying on the analytical standard errors from Example 13.2, we resample individuals from the PSID panel and re-estimate the parameters on each bootstrap sample.
Since the PSID is panel data, we resample entire individuals (all observations for a given person) to preserve the within-person correlation.
```julia
using CSV, DataFrames, DataFramesMeta, Statistics, Optim, Plots, Random

# Load and prepare data
data_psid = @chain begin
    CSV.read("../data/abb_aea_data.csv", DataFrame, missingstring = "NA")
    @select :person :y :tot_assets1 :asset :age :year
    @subset(:age .>= 25, :age .<= 64)
end

# Model moments and estimation functions (from @exm-md_income)
function model_moments(θ, T)
    ρ, σ2_α, σ2_η = θ
    m = [σ2_α + (1 - ρ^(2(t-1))) / (1 - ρ^2) * σ2_η for t in 1:T]
    return m
end

function md_estimate(data)
    m_hat = @chain data begin
        groupby(:age)
        @combine :var_logy = var(log.(:y))
        @orderby :age
        _.var_logy
    end
    T = length(m_hat)
    function md_objective(x)
        ρ = tanh(x[1])
        σ2_α = exp(x[2])
        σ2_η = exp(x[3])
        θ = (ρ, σ2_α, σ2_η)
        m_model = model_moments(θ, T)
        diff = m_hat .- m_model
        return diff' * diff
    end
    x0 = [0.5, log(0.1), log(0.05)]
    res = optimize(md_objective, x0, Newton(), autodiff=:forward)
    x = res.minimizer
    return [tanh(x[1]), exp(x[2]), exp(x[3])]
end

# Point estimate
θ_hat = md_estimate(data_psid)
println("Point estimates: ρ=$(round(θ_hat[1],digits=3)), σ²_α=$(round(θ_hat[2],digits=3)), σ²_η=$(round(θ_hat[3],digits=3))")
```
Point estimates: ρ=0.918, σ²_α=0.279, σ²_η=0.085
Now the bootstrap. We resample persons with replacement and re-estimate on each bootstrap sample.
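A sketch of that loop, reusing the `data_psid` and `md_estimate` objects defined above (the value of \(B\), the stand-in fallback, and the output format are my own choices):

```julia
using DataFrames, Random, Statistics

# `data_psid` and `md_estimate` come from the code above; the guard below
# creates a stand-in panel so the sketch also runs on its own
if !@isdefined(data_psid)
    data_psid = DataFrame(person = repeat(1:30, inner=4), y = exp.(0.1 .* randn(120)))  # stand-in only
    md_estimate(df) = [0.9, var(log.(df.y)), 0.05]                                      # stand-in only
end

Random.seed!(6)
B = 500
persons = unique(data_psid.person)
boot_estimates = zeros(B, 3)

for b in 1:B
    # Resample entire individuals (with replacement) so that the
    # within-person dependence structure is preserved
    draw = rand(persons, length(persons))
    boot_data = reduce(vcat, [data_psid[data_psid.person .== p, :] for p in draw])
    boot_estimates[b, :] = md_estimate(boot_data)
end

# Bootstrap standard errors and percentile 95% confidence intervals
for (j, name) in enumerate(["ρ", "σ²_α", "σ²_η"])
    lo, hi = quantile(boot_estimates[:, j], [0.025, 0.975])
    se = std(boot_estimates[:, j])
    println("$name: se* = $(round(se, digits=4)), 95% CI = [$(round(lo, digits=4)), $(round(hi, digits=4))]")
end
```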
Notice that bootstrapping panel data by resampling individuals is straightforward to implement and automatically accounts for within-person dependence, arbitrary heteroskedasticity, and the full complexity of the minimum distance estimator. Compare these to the analytical standard errors from Example 13.2.
14.6 Practical Considerations
14.6.1 Simulation Noise and Smoothing
When simulation draws are held fixed across evaluations of the objective function, the simulated objective is a smooth function of \(\theta\) (assuming the model itself is smooth in \(\theta\)). This means standard gradient-based optimizers can be used. If simulation draws are re-drawn at each evaluation, the objective becomes noisy and optimization may fail or converge to the wrong point. Always fix the random seed when evaluating the simulated objective across different \(\theta\) values.
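A minimal demonstration (my own) of the difference: a simulated moment-matching objective evaluated with common random numbers is a smooth function of \(\theta\), while re-drawing shocks at every evaluation buries the minimum in noise:

```julia
using Random, Statistics

R = 200
# Simulated objective: squared distance between a simulated mean and a
# data moment of zero
objective(θ, u) = (mean(θ .+ u))^2

u_fixed = randn(MersenneTwister(7), R)
θgrid = -0.2:0.002:0.2

smooth = [objective(θ, u_fixed) for θ in θgrid]   # draws held fixed
noisy  = [objective(θ, randn(R)) for θ in θgrid]  # fresh draws each call

# `smooth` is an exact parabola in θ; `noisy` jumps from point to point
println(extrema(smooth))
println(extrema(noisy))
```

An optimizer sees a clean convex bowl in the first case and a jagged surface in the second.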
14.6.2 Simulation Noise and Standard Errors
Although it is critical to fix the seed for randomization when implementing these methods, you have to think carefully about how you set the seed if you are using bootstrap methods for inference. If the number of simulations \(R\) is comparable to the number of observations, our theory tells us that simulation error will be a non-negligible component of the variance of the estimator. To accommodate this, you should ensure that you change the initial seed in each bootstrap trial. Even if you think simulation error is negligible compared to sampling error (i.e. \(R \gg N\)), it is still good practice and costs nothing to incorporate it.
14.6.3 How Many Draws?
For MSM, consistency holds for any fixed \(R\), but efficiency improves with \(R\). A practical approach is to start with a moderate \(R\) for exploration, then increase \(R\) and check that the estimates are stable. If the estimates change substantially, \(R\) is too small.
For MSL, the bias from taking logarithms of simulated probabilities is of order \(1/R\). In practice, \(R\) should be large enough that this bias is negligible relative to the standard error.
14.6.4 Choosing Moments for MSM
In Example 14.4, we chose a small set of moments. In practice, the choice of moments should be guided by:
Identification: the moments should be informative about the parameters. Moments that are insensitive to a parameter will not help estimate it.
Parsimony: more moments improve efficiency (in principle) but increase computational cost and can lead to poorly conditioned weighting matrices.
Model fit: if the model cannot match a particular moment, including it will distort the estimates of other parameters. Start with moments the model should be able to match.
The discussion in the identification chapter on which moments pin down which preference parameters is directly relevant here.
14.6.5 Diagonal Weighting Matrices
When using MSM or minimum distance with many moments, the optimal weighting matrix \(\mathbf{W}^{*}=\hat{S}^{-1}\) can be poorly estimated and lead to erratic results in finite samples. A common and practical alternative is to use a diagonal weighting matrix, where each moment is weighted by the inverse of its variance but cross-moment covariances are ignored. This sacrifices some efficiency but tends to be much more stable in practice.
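A sketch of how this looks in code (my own illustration; the `boot_moments` matrix stands in for bootstrapped sample moments from any of the examples above):

```julia
using LinearAlgebra, Statistics, Random

Random.seed!(9)
# Stand-in: a B×K matrix of bootstrapped sample moments with very
# different scales across columns
boot_moments = randn(500, 4) .* [1.0 0.5 2.0 0.1]

# Diagonal weighting: inverse variances, ignoring cross-moment covariances
W = Diagonal(1 ./ vec(var(boot_moments, dims=1)))

# Precisely estimated moments (small variance) receive large weight
objective(diff) = diff' * W * diff
println(diag(W))
println(objective([0.1, 0.1, 0.1, 0.1]))
```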
14.7 Exercises
Exercise
Exercise 14.1 Estimate the preference parameters \((\beta,\sigma)\) of the savings model (with \(\psi=0\)) using the method of simulated moments. Take the income process parameters as given from Example 12.2.
Choose a set of target moments from the PSID data (Example 10.1) that are informative about \((\beta,\sigma)\). Justify your choice based on the identification discussion in the savings identification chapter.
Implement the MSM estimator using the simulation code from Example 14.3.
Report your estimates and plot the model fit for your chosen moments.
Assess sensitivity: how do the estimates change when you use a different set of moments?
Implement a bootstrap to compute standard errors for \((\hat{\beta},\hat{\sigma})\). Use the same bootstrap samples (resampled individuals from the PSID) that you used to construct confidence intervals for the income process parameters in Example 14.5. For each bootstrap replication \(b\), re-estimate the income process parameters \((\hat{\rho}^{b},\hat{\sigma}^{2b}_{\alpha},\hat{\sigma}^{2b}_{\eta})\) on the bootstrap sample, then re-estimate the savings model preference parameters \((\hat{\beta}^{b},\hat{\sigma}^{b})\) using those first-stage estimates. Use \(B=100\) replications.
Construct 95% confidence intervals for each parameter.
Explain why it is important to use the same bootstrap draws for both the income process and the savings model estimation, rather than bootstrapping each independently. What would go wrong if you used separate bootstrap samples?
Which parameter is most precisely estimated? Which is least? Does this match your intuition from the identification discussion?
Exercise
Exercise 14.3 Return to the minimum distance estimation of the income process from Example 12.2. Instead of using the identity weighting matrix, implement estimation with:
The diagonal weighting matrix, where each moment is weighted by the inverse of the variance of the corresponding sample moment. Use the bootstrap (resampling individuals as in Example 14.5) to estimate the variance of each sample moment.
The optimal weighting matrix, which uses the full variance-covariance matrix of the sample moments. Again, estimate this variance-covariance matrix using the bootstrap.
Compare the estimates and standard errors across all three weighting schemes (identity, diagonal, optimal). How much does the choice of weighting matrix matter in practice?
Horowitz, Joel L. 2001. “The Bootstrap.” In Handbook of Econometrics, 5:3159–3228. Elsevier.
Newey, Whitney K., and Daniel McFadden. 1994. “Large Sample Estimation and Hypothesis Testing.” In Handbook of Econometrics, 4:2111–2245. Elsevier.