include("../scripts/search_model.jl")solve_res_wage (generic function with 1 method)
Before diving into statistical theory, we will introduce the estimators by proposing estimation methods for each of our prototype models.
The three workhorse methods are:
- maximum likelihood;
- the generalized method of moments (GMM); and
- minimum distance.
Each of these approaches is an extremum estimator: any estimator that can be characterized as the solution to a maximization or minimization problem.
Definition 12.1 \(\hat{\theta}\) is an extremum estimator if:
\[ \hat{\theta} = \arg\max_{\theta\in\Theta} Q_{N}(\theta) \] where \(\Theta\subset\mathbb{R}^{p}\).
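For example, the maximum likelihood estimator below corresponds to \(Q_{N}(\theta) = \frac{1}{N}\sum_{n} l(X_n;\theta)\), while GMM and minimum distance correspond to \(Q_{N}(\theta) = -g_{N}(\theta)^\prime \mathbf{W}_{N} g_{N}(\theta)\), i.e. a minimization written as maximization of the negative.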
Just to clarify where we are going, it helps to reiterate the key theoretical questions we would like to answer for each estimation approach described below:
- Is the estimator consistent for the true parameters?
- What is its asymptotic (sampling) distribution, and how do we use it to conduct inference?
- How do implementation choices, such as the weighting matrix, affect the precision of the estimator?
To discuss estimation of this model, let’s assume a linear form for the selection and outcomes equations:
\[ D = \mathbf{1}\{\gamma_0 + \gamma_1X + \gamma_2Z - V \geq0\} \] \[ Y_{D} = \beta_{D,0} + \beta_{D,1}X + U_D \] with \(V\sim\mathcal{N}(0,1)\).
Our identification argument suggested a two-step estimator for the Generalized Roy Model, which we implemented in Example 7.1: estimate the selection equation in a first step (a probit of \(D\) on \(X\) and \(Z\)), then estimate the outcome equations in a second step using a control function built from the estimated selection index.
Note that this is a two-step estimator. The second stage relies on parameters estimated in the first stage. We will need to develop theory for this!
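To make the two steps concrete, here is a minimal sketch of a control-function implementation using GLM.jl. The data frame df and its column names (D, X, Z, Y) are hypothetical and this is not the code from Example 7.1; the \(D=0\) outcome equation would be handled analogously with the control function \(\phi(\text{index})/(1-\Phi(\text{index}))\).

using DataFrames, GLM, Distributions

# A minimal two-step sketch. `df` is a hypothetical DataFrame with columns
# D (selection indicator), X, Z (excluded instrument) and Y (the observed outcome Y_D).
function roy_two_step(df)
    # Step 1: probit for the selection equation, P(D = 1 | X, Z) = Φ(γ₀ + γ₁X + γ₂Z)
    probit = glm(@formula(D ~ X + Z), df, Binomial(), ProbitLink())
    γ = coef(probit)
    idx = γ[1] .+ γ[2] .* df.X .+ γ[3] .* df.Z          # estimated selection index
    # Step 2: outcome equation for the D = 1 sample, adding a control function.
    # Under joint normality, E[U₁ | D = 1, X, Z] is proportional to φ(index)/Φ(index).
    sel = df.D .== 1
    df1 = DataFrame(Y = df.Y[sel], X = df.X[sel],
                    mills = pdf.(Normal(), idx[sel]) ./ cdf.(Normal(), idx[sel]))
    second = lm(@formula(Y ~ X + mills), df1)
    return (first_stage = probit, second_stage = second)
end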
For this example, let’s assume that we observe wages with some small amount of known measurement error:
\[ \log(W^{o}_{n,t}) = \log(W_{n,t}) + \zeta_{n,t}\]
where \(\zeta_{n,t} \sim \mathcal{N}(0,\sigma^2_\zeta)\) and \(\sigma_\zeta = 0.05\).
Recall that without further variation, we must make a parametric assumption on the wage distribution, and so we assume that \(W\) is log-normally distributed with mean \(\mu\) and variance \(\sigma^2_{W}\).
Our strategy here is to estimate the parameters
\[ \theta = (\mu,\sigma^2_{W},h,\delta,w^*) \]
and invert out \(\lambda\) and \(b\) (the latter using the reservation wage equation). Let \(X_{n} = (W^o,t_u,E)\) indicate the data. The log-likelihood of a single observation is:
\[ l(X;\theta) = E \times \int f_{W|W>w^*}(\log(W^{o})-\zeta)\phi(\zeta;\sigma_\zeta)d\zeta + (1-E)\times[\log(h) + t_u\log(1-h)] \]
where, according to our parametric specifications:
\[ f_{W|W>w^*}(w) = \frac{\phi(w-\mu;\sigma_{W})}{1-\Phi\left((w^*-\mu)/\sigma_{W}\right)}.\]
\(\phi(\cdot;\sigma)\) is the pdf of a normal with standard deviation \(\sigma\) and \(\Phi\) is the cdf of a standard normal.
The maximum likelihood estimator is:
\[ \hat{\theta} = \arg\max_\theta \frac{1}{N}\sum_{n}l(X_n;\theta) \] while the MLE estimates of \(\lambda\) and \(b\) are:
\[ \hat{\lambda} = \hat{h} / \left(1 - \widehat{F}_{W}(\hat{w}^*)\right) \]
\[ \hat{b} = \hat{w}^* - \frac{\hat{\lambda}}{1 - \beta(1-\hat{\delta})}\int_{\hat{w}^*}\left(1-\widehat{F}_{W|W>w^*}(w)\right)dw \]
When we get to the theory we will consider the asymptotic properties of not just \(\hat{\theta}\) but also the derived estimates \(\hat{b}\) and \(\hat{\lambda}\).
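As a preview of how this inversion might look in code, here is a hedged sketch using Distributions and QuadGK; the function name invert_lambda_b and the use of QuadGK (rather than the course quadrature routines used below) are illustrative choices, and the default β = 0.995 matches the value used in Example 12.1.

using Distributions, QuadGK

# A sketch of recovering λ̂ and b̂ from the estimated (ĥ, δ̂, ŵ*) and the
# estimated unconditional wage offer distribution F (here a LogNormal).
function invert_lambda_b(h, δ, wres, F; β = 0.995)
    λ = h / (1 - cdf(F, wres))                        # ĥ = λ̂ · P(W > ŵ*)
    ub = quantile(F, 0.9999)                          # truncate the upper tail
    Fcond(w) = (cdf(F, w) - cdf(F, wres)) / (1 - cdf(F, wres))   # F_{W|W>w*}
    surplus, _ = quadgk(w -> 1 - Fcond(w), wres, ub)  # the surplus integral above
    b = wres - λ / (1 - β * (1 - δ)) * surplus
    return (λ = λ, b = b)
end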
Example 12.1 First, let’s load the routines that we previously wrote to solve the model and take numerical integrals with quadrature. These are identical to what we’ve seen before and are available on the course github.
include("../scripts/search_model.jl")solve_res_wage (generic function with 1 method)
Before writing the likelihood, let’s load the data, clean, and create the data frame.
using CSV, DataFrames, DataFramesMeta, Statistics
data = CSV.read("../data/cps_00019.csv",DataFrame)
data = @chain data begin
@transform :E = :EMPSTAT.<21
@transform @byrow :wage = begin
if :PAIDHOUR==0
return missing
elseif :PAIDHOUR==2
if :HOURWAGE<99.99 && :HOURWAGE>0
return :HOURWAGE
else
return missing
end
elseif :PAIDHOUR==1
if :EARNWEEK>0 && :UHRSWORKT<997 && :UHRSWORKT>0
return :EARNWEEK / :UHRSWORKT
else
return missing
end
end
end
@subset :MONTH.==1
@select :AGE :SEX :RACE :EDUC :wage :E :DURUNEMP
@transform begin
:bachelors = :EDUC.>=111
:nonwhite = :RACE.!=100
:female = :SEX.==2
:DURUNEMP = round.(:DURUNEMP .* 12/52)
end
end
# the whole dataset in a named tuple
wage_missing = ismissing.(data.wage)
wage = coalesce.(data.wage,1.)
N = length(data.AGE)
# create a named tuple with all variables to conveniently pass to the log-likelihood:
dat = (;logwage = log.(wage),wage_missing,E = data.E,tU = data.DURUNEMP)
(logwage = [0.0, 0.0, 0.0, 3.0368742168851663, 2.302585092994046, 3.2188758248682006, 2.2512917986064953, 0.0, 0.0, 0.0 … 0.0, 0.0, 2.4849066497880004, 3.056356895370426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], wage_missing = Bool[1, 1, 1, 0, 0, 0, 0, 1, 1, 1 … 1, 1, 0, 0, 1, 1, 1, 1, 1, 1], E = Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1 … 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], tU = [231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0 … 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0])
Now, let’s write the log-likelihood as above.
using Distributions, Optim
ϕ(x,μ,σ) = pdf(Normal(μ,σ),x)
Φ(x,μ,σ) = cdf(Normal(μ,σ),x)
# a function for the log-likelihood of observed wages (integrating out measurement error)
function logwage_likelihood(logwage,F,σζ,wres)
f(x) = pdf(F,x) / (1-cdf(F,wres)) * ϕ(logwage,log(x),σζ)
ub = quantile(F,0.9999)
return integrateGL(f,wres,ub)
end
# a function to get the log-likelihood of a single observation
# note this function assumes that data holds vectors
# E, tU, and logwage
function log_likelihood(n,data,pars)
(;h,δ,wres,F,σζ) = pars
ll = 0.
if data.E[n]
ll += log(h) - log(h + δ)
if !data.wage_missing[n]
ll += logwage_likelihood(data.logwage[n],F,σζ,wres)
end
else
ll += log(δ) - log(h + δ)
ll += log(h) + data.tU[n] * log(1-h)
end
return ll
end
# a function to iterate over all observations
function log_likelihood_obj(x,pars,data)
pars = update(pars,x)
ll = 0.
for n in eachindex(data.E)
ll += log_likelihood(n,data,pars)
end
return ll / length(data.E)
end
log_likelihood_obj (generic function with 1 method)
Finally, since optimization routines like those in Optim work with vectors of unconstrained parameters, we write an update routine that takes a vector x and maps it into new parameters. Here we use transformation functions to ensure that parameters obey their bound constraints. There are other ways to ensure this, but this is one way that works.
logit(x) = exp(x) / (1+exp(x))   # logistic transform: maps the real line into (0,1)
logit_inv(x) = log(x/(1-x))      # its inverse: maps (0,1) back to the real line
function update(pars,x)
h = logit(x[1])
δ = logit(x[2])
μ = x[3]
σ = exp(x[4])
wres = exp(x[5])
F = LogNormal(μ,σ)
return (;pars...,h,δ,μ,σ,wres,F)
end
update (generic function with 1 method)
Now we can finally test our likelihood to see how it runs:
x0 = [logit_inv(0.5),logit_inv(0.03),2.,log(1.),log(5.)]
pars = (;σζ = 0.05, β = 0.995)
log_likelihood_obj(x0,pars,dat) #<- test.
res = optimize(x->-log_likelihood_obj(x,pars,dat),x0,Newton(),Optim.Options(show_trace=true),autodiff = :forward)
Iter     Function value   Gradient norm
0 2.428210e-01 2.348428e-01
* time: 5.984306335449219e-5
1 2.051862e-01 1.072566e-01
* time: 1.8515889644622803
2 1.827395e-01 4.626301e-01
* time: 3.292236804962158
3 1.758193e-01 4.441528e-01
* time: 4.586717844009399
4 1.692587e-01 3.168138e-02
* time: 6.080646991729736
5 1.668869e-01 1.483080e-01
* time: 7.363166809082031
6 1.647839e-01 2.507384e-01
* time: 8.843937873840332
7 1.638445e-01 1.648957e-01
* time: 10.333063840866089
8 1.633820e-01 2.338710e-02
* time: 11.627697944641113
9 1.633356e-01 2.095668e-02
* time: 13.11549687385559
10 1.633224e-01 1.774668e-03
* time: 14.597818851470947
11 1.633213e-01 3.628638e-03
* time: 16.073489904403687
12 1.633211e-01 1.280346e-04
* time: 17.559399843215942
13 1.633210e-01 3.719531e-04
* time: 19.04760980606079
14 1.633210e-01 5.299485e-06
* time: 20.525203943252563
15 1.633210e-01 3.008006e-05
* time: 22.201759815216064
16 1.633210e-01 8.243085e-08
* time: 23.69115400314331
17 1.633210e-01 6.359010e-08
* time: 24.976842880249023
18 1.633210e-01 2.488281e-07
* time: 26.455304861068726
19 1.633210e-01 2.588340e-08
* time: 27.74436092376709
20 1.633210e-01 1.472700e-08
* time: 29.030247926712036
21 1.633210e-01 2.956400e-09
* time: 30.599798917770386
* Status: success (objective increased between iterations)
* Candidate solution
Final objective value: 1.633210e-01
* Found with
Algorithm: Newton's Method
* Convergence measures
|x - x'| = 6.08e-06 ≰ 0.0e+00
|x - x'|/|x'| = 2.83e-07 ≰ 0.0e+00
|f(x) - f(x')| = 1.60e-13 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 9.82e-13 ≰ 0.0e+00
|g(x)| = 2.96e-09 ≤ 1.0e-08
* Work counters
Seconds run: 31 (vs limit Inf)
Iterations: 21
f(x) calls: 66
∇f(x) calls: 66
∇²f(x) calls: 21
Here we tell Optim to make use of automatic differentiation with ForwardDiff via the autodiff = :forward option. Let's take a peek at the parameter estimates:
DataFrame(;update(pars,res.minimizer)...)

| Row | σζ | β | h | δ | μ | σ | wres | F |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | LogNormal… |
| 1 | 0.05 | 0.995 | 0.184915 | 0.00764741 | 3.47993 | 1.09258 | 4.80711e-10 | LogNormal{Float64}(μ=3.47993, σ=1.09258) |
Note that in Example 12.1 we created a NamedTuple called dat from the data frame, which cements the type of each vector of data into dat.
This is important for performance! Working with DataFrame types directly can dramatically slow down your code because the columns of these data frames are not typed concretely. See the performance tips for more discussion.
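As a small illustration of this point (not part of the example above), the Tables.jl interface provides a one-line way to convert a DataFrame into a NamedTuple of concretely typed column vectors, which achieves the same thing as building the tuple by hand:

using DataFrames, Tables

df = DataFrame(x = randn(5), e = rand(Bool, 5))
nt = Tables.columntable(df)   # a NamedTuple of concretely typed vectors
typeof(nt.x)                  # Vector{Float64}, known to the compiler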
Suppose that we have a vector of instruments \(\mathbf{z}_{n}\) that we hope will jointly move consumption and labor supply, with a single cross-section of observations:
\[ (W_{n},H_{n},C_{n},\mathbf{z}_{n}).\]
We write the labor supply equation as
\[ \log(H) = \mu - \psi\log(W) - \psi\sigma\log(C) + \epsilon \]
where we assume that \(\mathbb{E}[\epsilon\ |\mathbf{z}] = 0\), implying the moment condition:
\[ \mathbb{E}[\epsilon \mathbf{z}] = 0.\]
Let \(\theta = (\mu,\sigma,\psi)\) and define the sample moment:
\[g_{N}(\theta) = \frac{1}{N}\sum_{n}\left(\log(H_{n})-\mu+\psi\log(W_{n})+\psi\sigma\log(C_{n})\right)\mathbf{z}_{n}.\]
The GMM estimator is:
\[ \hat{\theta} = \arg\min_{\theta} g_{N}(\theta)^\prime \mathbf{W}_{N} g_{N}(\theta) \]
where \(\mathbf{W}_{N}\) is a symmetric, positive definite weighting matrix. Since we have a linear system, this becomes a quadratic minimization problem with a known solution,1 but the theory we develop will be more general.
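For reference, here is a minimal sketch of that closed-form solution (the formula in the footnote); the construction of the stacked matrices X, Z and the outcome vector y is left to the reader, and the function name is illustrative.

using LinearAlgebra

# Linear GMM / IV: β̂ = (X'Z W Z'X)⁻¹ X'Z W Z'y
# X: N×k regressors, Z: N×ℓ instruments (ℓ ≥ k), y: length-N outcome, W: ℓ×ℓ PSD weighting matrix
function linear_gmm(y, X, Z, W = I)
    A = X' * Z * W * Z' * X
    b = X' * Z * W * Z' * y
    return A \ b
end

Setting W to the inverse of Z'Z recovers the familiar two-stage least squares estimator.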
Relevant questions for GMM are:
- How does the choice of weighting matrix \(\mathbf{W}_{N}\) affect the sampling distribution (i.e. precision) of \(\hat{\theta}\), and is there an optimal choice?
- What are the consistency and asymptotic properties of \(\hat{\theta}\) when the moment conditions outnumber the parameters?
Let’s consider estimation of the income process for this model and save estimation of the preference parameters for our chapter on simulation. Recall from our discussion of identification and from Exercise 10.1 that we can identify the parameters of this process by matching implied variances and covariances. Supposing that we have more of these moments than we do parameters (i.e. that the parameters are over-identified by the moments), we can estimate the income process by minimum distance.
Recall the income process: \[ \varepsilon_{n,t+1} = \rho \varepsilon_{n,t} + \eta_{n,t},\qquad \eta_{n,t}\sim\mathcal{N}(0,\sigma^2_\eta) \]
and consider the extended specification with permanent heterogeneity: \[ \log(y_{n,t}) = \mu_t + \alpha_n + \varepsilon_{n,t} \]
where \(\alpha_n \sim (0,\sigma^2_\alpha)\) is an individual fixed effect. Let us further assume that in the first period, \(\varepsilon_{1} = 0\). This gives us that
\[\varepsilon_{t} = \sum_{s=1}^{t-1}\rho^{t-1-s}\eta_{s}.\]
Define \(\theta = (\rho, \sigma^2_\alpha, \sigma^2_\eta)\) as the parameters we wish to estimate.
In Example 10.1, we considered identification of the income process by examining covariance restrictions in panel data. Here we’ll consider this approach as well as an alternative. Define
\[\epsilon = \log(y) - \mu_{t} = \alpha + \varepsilon \]
Let’s begin by noting the following generic relationships:
\[ \mathbb{V}[\epsilon_{t}] = \sigma^2_{\alpha} + \frac{(1-\rho^{2(t-1)})}{1-\rho^2}\sigma^2_{\eta}\]
and
\[ \mathbb{V}[\epsilon_{t+1}] = \sigma^2_{\alpha} + \rho^2\mathbb{V}[\varepsilon_{t}] + \sigma^2_{\eta} \]
\[ \mathbb{C}(\epsilon_{t},\epsilon_{t+s}) = \sigma^2_{\alpha} + \rho^{s}\mathbb{V}[\varepsilon_{t}] \]
We’ll consider two potential vectors of moments to match. The first vector consists of the variance of \(\epsilon\) at each \(t\):
\[ \mathbf{v} = [\mathbb{V}[\epsilon_{1}],\ \mathbb{V}[\epsilon_{2}],\ ...,\ \mathbb{V}[\epsilon_{T}]]^\prime \]
while the second takes two variances and \(K\) covariances:
\[\mathbf{c} = [\mathbb{V}[\epsilon_{t}],\ \mathbb{V}[\epsilon_{t+1}],\ \mathbb{C}(\epsilon_{t},\epsilon_{t+1}),\ ...,\ \mathbb{C}(\epsilon_{t},\epsilon_{t+K})]^\prime \]
Let \(\mathbf{v}(\theta)\) and \(\mathbf{c}(\theta)\) be the model-implied values of these moments, given by the expressions above. The minimum distance estimator is
\[ \hat{\theta} = \arg\min_\theta \left(\hat{\mathbf{v}} - \mathbf{v}(\theta)\right)^\prime \mathbf{W} \left(\hat{\mathbf{v}} - \mathbf{v}(\theta)\right) \]
where \(\mathbf{W}\) is a positive definite weighting matrix. An estimator is equivalently defined for the second set of moments \(\mathbf{c}\).
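For concreteness, here is a hedged sketch of the model-implied vector \(\mathbf{c}(\theta)\) built from the variance and covariance expressions above; the function name and argument layout are illustrative, and Example 12.2 below targets only the first vector \(\mathbf{v}(\theta)\).

# Model-implied moments c(θ) = [V[ϵ_t], V[ϵ_{t+1}], C(ϵ_t,ϵ_{t+1}), ..., C(ϵ_t,ϵ_{t+K})]
function cov_moments(θ, t, K)
    ρ, σ2_α, σ2_η = θ
    var_ar(s) = (1 - ρ^(2(s-1))) / (1 - ρ^2) * σ2_η   # V[ε_s] with ε₁ = 0
    v_t  = σ2_α + var_ar(t)                            # V[ϵ_t]
    v_t1 = σ2_α + ρ^2 * var_ar(t) + σ2_η               # V[ϵ_{t+1}]
    covs = [σ2_α + ρ^s * var_ar(t) for s in 1:K]       # C(ϵ_t, ϵ_{t+s}), s = 1,...,K
    return vcat(v_t, v_t1, covs)
end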
Example 12.2 Building on Example 10.1, let’s implement a minimum distance estimator for the income process parameters. First, we compute sample moments from the PSID data.
using CSV, DataFrames, DataFramesMeta, Statistics, Optim, Plots
# Load and prepare data
data = @chain begin
CSV.read("../data/abb_aea_data.csv",DataFrame,missingstring = "NA")
@select :person :y :tot_assets1 :asset :age :year
@subset :age.>=25 :age.<=64
end
# Calculate the variance of log income at each age
m_hat = @chain data begin
groupby(:age)
@combine :var_logy = var(log.(:y))
@orderby :age
_.var_logy
end
40-element Vector{Float64}:
0.2832928560329017
0.31385672611280657
0.3481614098261801
0.49042885748190596
0.7186945866590636
0.8145871761041581
0.3899819595842863
0.4310734870614536
1.0048586189529876
0.8330684076119356
0.6927461795744777
0.4530783605245448
0.7647115466410289
⋮
0.7856911636904547
0.8002257744890358
0.6119942670835942
0.9594333073919417
0.6036241679859082
1.0296983130781643
0.6008575915783718
1.1495148217769573
0.7851363933479678
1.6514958037883842
0.5559690181469094
1.1855708905092428
Now we define the model-implied moments, using the variance expression derived above for each period (age):
function model_moments(θ, T)
    ρ, σ2_α, σ2_η = θ
    # model-implied variance of log income in period t:
    # V[ϵ_t] = σ²_α + (1 - ρ^(2(t-1)))/(1 - ρ²) * σ²_η
    m = [σ2_α + (1-ρ^(2(t-1)))/(1-ρ^2) * σ2_η for t in 1:T]
    return m
end
# Minimum distance objective (identity weighting matrix)
function md_objective(x, m_hat)
# Transform to ensure constraints: ρ ∈ (-1,1), σ² > 0
ρ = tanh(x[1])
σ2_α = exp(x[2])
σ2_η = exp(x[3])
θ = (ρ, σ2_α, σ2_η)
T = length(m_hat)
m_model = model_moments(θ, T)
diff = m_hat .- m_model
return diff' * diff # Identity weighting
end
md_objective (generic function with 1 method)
Finally, we estimate the parameters:
# Initial values
x0 = [0.5, log(0.1), log(0.05)]
# Optimize
res = optimize(x -> md_objective(x, m_hat), x0, Newton(),autodiff = :forward)
# Extract estimates
x_hat = res.minimizer
ρ_hat = tanh(x_hat[1])
σ2_α_hat = exp(x_hat[2])
σ2_η_hat = exp(x_hat[3])
θ_hat = (ρ_hat, σ2_α_hat, σ2_η_hat)
println("Minimum Distance Estimates:")
println(" ρ = $(round(ρ_hat, digits=3))")
println(" σ²_α = $(round(σ2_α_hat, digits=3))")
println(" σ²_η = $(round(σ2_η_hat, digits=3))")
# and finally a plot of model fit
T = length(m_hat)
scatter(1:T,m_hat,label = "data",title = "Model Fit of Targeted Moments")
plot!(1:T,model_moments(θ_hat,length(m_hat)),label = "model fit")
xlabel!("Model Periods (Age)")Minimum Distance Estimates:
ρ = 0.918
σ²_α = 0.279
σ²_η = 0.085
As with the case of GMM, we would like to know how our choice of \(\mathbf{W}\) affects the sampling distribution (i.e. precision) of our estimator and if there is an “optimal” choice.
Consider two alternative estimators of the entry-exit model. The key insight from our identification discussion is that the choice probability \(p(x,a,a')\) is directly observable in the data and encodes information about the underlying payoff parameters.
Recall the payoff specification: \[ u_{1}(x,a,d^{\prime}) = \phi_{0} + \phi_{1}x - \phi_{2}d^\prime - \phi_{3}(1-a) \] \[ u_{0}(x,a) = \phi_{4}a \]
and let \(\phi = (\phi_0, \phi_1, \phi_2, \phi_3, \phi_4)\) denote the vector of payoff parameters.
The minimum distance approach directly exploits the mapping between parameters and choice probabilities. For each market-state combination \((x,a,a')\), the model implies a choice probability:
\[ p(x,a,a';\phi,\beta) = \frac{\exp(v_1(x,a,a';\phi,\beta))}{\exp(v_0(x,a,a';\phi,\beta)) + \exp(v_1(x,a,a';\phi,\beta))} \]
where \(v_0\) and \(v_1\) are the choice-specific value functions that depend on the equilibrium solution.
Suppose we have a cross-section of data:
\[ (X_{m},D_{m},A_{m},A'_{m})_{m=1}^{M} \]
for each of \(M\) markets. Further assume that \(X\) is a variable that takes one of a discrete number of values in the support \(\mathcal{X}\).
For each unique state \((x,a,a')\) in the data, we can compute the empirical choice frequency:
\[ \hat{p}(x,a,a') = \frac{\sum_{m} D_{m}\mathbf{1}\{X_m = x, A_{m} = a, A'_{m} = a'\}}{\sum_{m} \mathbf{1}\{X_m = x, A_{m} = a, A'_{m} = a'\}} \]
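As a sketch, these frequencies can be computed with a grouped mean; the DataFrame df and its column names (X, A, Ap, D) are hypothetical here.

using DataFrames, DataFramesMeta, Statistics

# empirical choice frequencies p̂(x, a, a′): the mean of D within each observed state
p_hat = @chain df begin
    groupby([:X, :A, :Ap])
    @combine :p_hat = mean(:D)
end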
The minimum distance estimator minimizes the weighted sum of squared deviations between observed and predicted choice probabilities. Let \(\mathbf{p}(\phi)\) be the vector of choice probabilities across the state space \(\mathcal{X}\times\{0,1\}^2\) and let \(\widehat{\mathbf{p}}\) be the corresponding vector of frequency estimates. The minimum distance estimator is:
\[ \hat{\phi} = \arg\min_\phi (\widehat{\mathbf{p}}-\mathbf{p}(\phi))^\prime \mathbf{W}_{M}(\widehat{\mathbf{p}}-\mathbf{p}(\phi))\]
where \(\mathbf{W}_{M}\) is once again a positive definite weighting matrix.
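A hedged sketch of this objective, which makes clear that the model must be re-solved for \(\mathbf{p}(\phi)\) at each trial value of \(\phi\); the function solve_choice_probs and the argument names are assumptions, not part of the course code.

# Minimum distance objective for the entry-exit model.
# `solve_choice_probs(φ, β)` is an assumed user-supplied routine that solves the model and
# returns the model-implied choice probabilities in the same order as the frequency vector p̂.
function md_objective_entry_exit(φ, β, p̂, W, solve_choice_probs)
    Δ = p̂ .- solve_choice_probs(φ, β)
    return Δ' * W * Δ
end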
An alternative approach uses the Generalized Method of Moments. The key insight is that choice probabilities satisfy certain orthogonality conditions that can be expressed as moment restrictions.
Given the discrete choice structure, the residual: \[ \xi_{m} = D_{m} - p(X_m, A_{m}, A'_{m}; \phi, \beta) \]
has the property that \(\mathbb{E}[\xi_{m} | X_m, A_{m}, A'_{m}] = 0\) when evaluated at the true parameters. This suggests the moment conditions:
\[ \mathbb{E}\left[(D_{m} - p(X_m, A_{m}, A'_{m}; \phi, \beta)) \cdot \mathbf{z}_{m}\right] = 0 \]
where \(\mathbf{z}_{m}\) is a vector of instruments. Natural choices include functions of \((X_m, A_{m}, A'_{m})\) themselves, such as:
\[ \mathbf{z}_{m} = [1,\ X_m,\ A_{m},\ A'_{m},\ X_m \cdot A_{m}]^\prime \]
The GMM estimator minimizes:
\[ \hat{\phi} = \arg\min_\phi g_M(\phi)^\prime \mathbf{W}_M g_M(\phi) \]
where the sample moment is:
\[ g_M(\phi) = \frac{1}{M}\sum_{m}\left(D_{m} - p(X_m, A_{m}, A'_{m}; \phi, \beta)\right) \mathbf{z}_{m} \]
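A hedged sketch of these sample moments in code, assuming a user-supplied function choice_prob(x, a, a′, φ, β) that solves the model and returns \(p(x,a,a';\phi,\beta)\), and a NamedTuple data with fields X, A, Ap, D; all of these names are assumptions.

# g_M(φ) = (1/M) Σ_m (D_m - p(X_m, A_m, A′_m; φ, β)) z_m,  with z_m = [1, X_m, A_m, A′_m, X_m·A_m]
function gmm_moments(φ, β, data, choice_prob)
    g = zeros(5)
    M = length(data.D)
    for m in 1:M
        x, a, ap, d = data.X[m], data.A[m], data.Ap[m], data.D[m]
        z = [1.0, x, a, ap, x * a]                        # instruments from the text
        g .+= (d - choice_prob(x, a, ap, φ, β)) .* z
    end
    return g ./ M
end

# the GMM objective to be minimized over φ
gmm_objective(φ, β, data, choice_prob, W) = (g = gmm_moments(φ, β, data, choice_prob); g' * W * g)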
Specifically, letting \(\beta = [\mu,\ \psi,\ \psi\sigma]^\prime\), we know that \(\hat{\beta} = (\mathbf{X}^\prime\mathbf{Z}\mathbf{W}_{N}\mathbf{Z}^\prime\mathbf{X})^{-1}(\mathbf{X}^\prime\mathbf{Z}\mathbf{W}_{N}\mathbf{Z}^\prime\mathbf{Y})\), where \(\mathbf{X}\), \(\mathbf{Z}\), and \(\mathbf{Y}\) are the appropriately stacked regressors, instruments, and outcomes (log hours).↩︎