include("../scripts/search_model.jl")solve_res_wage (generic function with 1 method)
Before diving into statistical theory, we will introduce the estimators by proposing estimation methods for each of our prototype models.
The three workhorse methods are:
- maximum likelihood;
- the generalized method of moments (GMM); and
- minimum distance.
Each of these approaches is an extremum estimator: any estimator that can be characterized as the solution to a maximization or minimization problem.
Definition 12.1 \(\hat{\theta}\) is an extremum estimator if:
\[ \hat{\theta} = \arg\max_{\theta\in\Theta} Q_{N}(\theta) \] where \(\Theta\subset\mathbb{R}^{p}\).
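For example, the maximum likelihood estimator below corresponds to \(Q_{N}(\theta) = \frac{1}{N}\sum_{n} l(X_n;\theta)\), while GMM and minimum distance correspond to \(Q_{N}(\theta) = -g_{N}(\theta)^\prime \mathbf{W}_{N} g_{N}(\theta)\), i.e. a minimization written as maximization of the negative.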
Just to clarify where we are going, it helps to reiterate the key theoretical questions we would like to answer for each estimation approach described below:
- Is the estimator consistent for the true parameters?
- What is its asymptotic (sampling) distribution, and how do we use it to conduct inference?
- How do implementation choices, such as the weighting matrix, affect the precision of the estimator?
To discuss estimation of this model, let’s assume a linear form for the selection and outcomes equations:
\[ D = \mathbf{1}\{\gamma_0 + \gamma_1X + \gamma_2Z - V \geq0\} \] \[ Y_{D} = \beta_{D,0} + \beta_{D,1}X + U_D \] with \(V\sim\mathcal{N}(0,1)\).
Our identification argument suggested a two-step estimator for the Generalized Roy Model, which we implemented in Example 7.1: estimate the selection equation in a first step (a probit of \(D\) on \(X\) and \(Z\)), then estimate the outcome equations in a second step using a control function built from the estimated selection index.
Note that this is a two-step estimator. The second stage relies on parameters estimated in the first stage. We will need to develop theory for this!
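To make the two steps concrete, here is a minimal sketch of a control-function implementation using GLM.jl. The data frame df and its column names (D, X, Z, Y) are hypothetical and this is not the code from Example 7.1; the \(D=0\) outcome equation would be handled analogously with the control function \(\phi(\text{index})/(1-\Phi(\text{index}))\).

using DataFrames, GLM, Distributions

# A minimal two-step sketch. `df` is a hypothetical DataFrame with columns
# D (selection indicator), X, Z (excluded instrument) and Y (the observed outcome Y_D).
function roy_two_step(df)
    # Step 1: probit for the selection equation, P(D = 1 | X, Z) = Φ(γ₀ + γ₁X + γ₂Z)
    probit = glm(@formula(D ~ X + Z), df, Binomial(), ProbitLink())
    γ = coef(probit)
    idx = γ[1] .+ γ[2] .* df.X .+ γ[3] .* df.Z          # estimated selection index
    # Step 2: outcome equation for the D = 1 sample, adding a control function.
    # Under joint normality, E[U₁ | D = 1, X, Z] is proportional to φ(index)/Φ(index).
    sel = df.D .== 1
    df1 = DataFrame(Y = df.Y[sel], X = df.X[sel],
                    mills = pdf.(Normal(), idx[sel]) ./ cdf.(Normal(), idx[sel]))
    second = lm(@formula(Y ~ X + mills), df1)
    return (first_stage = probit, second_stage = second)
end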
For this example, let’s assume that we observe wages with some small amount of known measurement error:
\[ \log(W^{o}_{n,t}) = \log(W_{n,t}) + \zeta_{n,t}\]
where \(\zeta_{n,t} \sim \mathcal{N}(0,\sigma^2_\zeta)\) and \(\sigma_\zeta = 0.05\).
Recall that without further variation, we must make a parametric assumption on the wage distribution, and so we assume that \(W\) is log-normally distributed with mean \(\mu\) and variance \(\sigma^2_{W}\).
Our strategy here is to estimate the parameters
\[ \theta = (\mu,\sigma^2_{W},h,\delta,w^*) \]
and invert out \(\lambda\) and \(b\) (the latter using the reservation wage equation). Let \(X_{n} = (W^o,t_u,E)\) indicate the data. The log-likelihood of a single observation is:
\[ l(X;\theta) = E \times \int f_{W|W>w^*}(\log(W^{o})-\zeta)\phi(\zeta;\sigma_\zeta)d\zeta + (1-E)\times[\log(h) + t_u\log(1-h)] \]
where, according to our parametric specifications:
\[ f_{W|W>w^*}(w) = \frac{\phi(w-\mu;\sigma_{W})}{1-\Phi\left((w^*-\mu)/\sigma_{W}\right)}.\]
\(\phi(\cdot;\sigma)\) is the pdf of a normal with standard deviation \(\sigma\) and \(\Phi\) is the cdf of a standard normal.
The maximum likelihood estimator is:
\[ \hat{\theta} = \arg\max_\theta \frac{1}{N}\sum_{n}l(X_n;\theta) \] while the MLE estimates of \(\lambda\) and \(b\) are:
\[ \hat{\lambda} = \hat{h} / \left(1 - \widehat{F}_{W}(\hat{w}^*)\right) \]
\[ \hat{b} = \hat{w}^* - \frac{\hat{\lambda}}{1 - \beta(1-\hat{\delta})}\int_{\hat{w}^*}\left(1-\widehat{F}_{W|W>w^*}(w)\right)dw \]
When we get to the theory we will consider the asymptotic properties of not just \(\hat{\theta}\) but also the derived estimates \(\hat{b}\) and \(\hat{\lambda}\).
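As a preview of how this inversion might look in code, here is a hedged sketch using Distributions and QuadGK; the function name invert_lambda_b and the use of QuadGK (rather than the course quadrature routines used below) are illustrative choices, and the default β = 0.995 matches the value used in Example 12.1.

using Distributions, QuadGK

# A sketch of recovering λ̂ and b̂ from the estimated (ĥ, δ̂, ŵ*) and the
# estimated unconditional wage offer distribution F (here a LogNormal).
function invert_lambda_b(h, δ, wres, F; β = 0.995)
    λ = h / (1 - cdf(F, wres))                        # ĥ = λ̂ · P(W > ŵ*)
    ub = quantile(F, 0.9999)                          # truncate the upper tail
    Fcond(w) = (cdf(F, w) - cdf(F, wres)) / (1 - cdf(F, wres))   # F_{W|W>w*}
    surplus, _ = quadgk(w -> 1 - Fcond(w), wres, ub)  # the surplus integral above
    b = wres - λ / (1 - β * (1 - δ)) * surplus
    return (λ = λ, b = b)
end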
Example 12.1 First, let’s load the routines that we previously wrote to solve the model and take numerical integrals with quadrature. These are identical to what we’ve seen before and are available on the course github.
include("../scripts/search_model.jl")solve_res_wage (generic function with 1 method)
Before writing the likelihood, let’s load the data, clean, and create the data frame.
using CSV, DataFrames, DataFramesMeta, Statistics
data = CSV.read("../data/cps_00019.csv",DataFrame)
data = @chain data begin
@transform :E = :EMPSTAT.<21
@transform @byrow :wage = begin
if :PAIDHOUR==0
return missing
elseif :PAIDHOUR==2
if :HOURWAGE<99.99 && :HOURWAGE>0
return :HOURWAGE
else
return missing
end
elseif :PAIDHOUR==1
if :EARNWEEK>0 && :UHRSWORKT<997 && :UHRSWORKT>0
return :EARNWEEK / :UHRSWORKT
else
return missing
end
end
end
@subset :MONTH.==1
@select :AGE :SEX :RACE :EDUC :wage :E :DURUNEMP
@transform begin
:bachelors = :EDUC.>=111
:nonwhite = :RACE.!=100
:female = :SEX.==2
:DURUNEMP = round.(:DURUNEMP .* 12/52)
end
end
# the whole dataset in a named tuple
wage_missing = ismissing.(data.wage)
wage = coalesce.(data.wage,1.)
N = length(data.AGE)
# create a named tuple with all variables to conveniently pass to the log-likelihood:
dat = (;logwage = log.(wage),wage_missing,E = data.E,tU = data.DURUNEMP)
(logwage = [0.0, 0.0, 0.0, 3.0368742168851663, 2.302585092994046, 3.2188758248682006, 2.2512917986064953, 0.0, 0.0, 0.0 … 0.0, 0.0, 2.4849066497880004, 3.056356895370426, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], wage_missing = Bool[1, 1, 1, 0, 0, 0, 0, 1, 1, 1 … 1, 1, 0, 0, 1, 1, 1, 1, 1, 1], E = Bool[1, 1, 1, 1, 1, 1, 1, 1, 1, 1 … 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], tU = [231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0 … 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0, 231.0])
Now, let’s write the log-likelihood as above.
using Distributions, Optim
ϕ(x,μ,σ) = pdf(Normal(μ,σ),x)
Φ(x,μ,σ) = cdf(Normal(μ,σ),x)
# a function for the log-likelihood of observed wages (integrating out measurement error)
function logwage_likelihood(logwage,F,σζ,wres)
f(x) = pdf(F,x) / (1-cdf(F,wres)) * ϕ(logwage,log(x),σζ)
ub = quantile(F,0.9999)
return integrateGL(f,wres,ub)
end
# a function to get the log-likelihood of a single observation
# note this function assumes that data holds vectors
# E, tU, and logwage
function log_likelihood(n,data,pars)
(;h,δ,wres,F,σζ) = pars
ll = 0.
if data.E[n]
ll += log(h) - log(h + δ)
if !data.wage_missing[n]
ll += logwage_likelihood(data.logwage[n],F,σζ,wres)
end
else
ll += log(δ) - log(h + δ)
ll += log(h) + data.tU[n] * log(1-h)
end
return ll
end
# a function to iterate over all observations
function log_likelihood_obj(x,pars,data)
pars = update(pars,x)
ll = 0.
for n in eachindex(data.E)
ll += log_likelihood(n,data,pars)
end
return ll / length(data.E)
end
log_likelihood_obj (generic function with 1 method)
Finally, since optimization routines like those in Optim work with vectors of unconstrained parameters, we write an update routine that takes a vector x and maps it into new parameters. Here we use transformation functions to ensure that parameters obey their bound constraints. There are other ways to ensure this, but this is one way that works.
logit(x) = exp(x) / (1+exp(x))   # logistic transform: maps the real line into (0,1)
logit_inv(x) = log(x/(1-x))      # its inverse: maps (0,1) back to the real line
function update(pars,x)
h = logit(x[1])
δ = logit(x[2])
μ = x[3]
σ = exp(x[4])
wres = exp(x[5])
F = LogNormal(μ,σ)
return (;pars...,h,δ,μ,σ,wres,F)
end
update (generic function with 1 method)
Now we can finally test our likelihood to see how it runs:
x0 = [logit_inv(0.5),logit_inv(0.03),2.,log(1.),log(5.)]
pars = (;σζ = 0.05, β = 0.995)
log_likelihood_obj(x0,pars,dat) #<- test.
res = optimize(x->-log_likelihood_obj(x,pars,dat),x0,Newton(),Optim.Options(show_trace=true),autodiff = :forward)
Iter     Function value   Gradient norm
0 2.428210e-01 2.348428e-01
* time: 5.984306335449219e-5
1 2.051862e-01 1.072566e-01
* time: 1.8515889644622803
2 1.827395e-01 4.626301e-01
* time: 3.292236804962158
3 1.758193e-01 4.441528e-01
* time: 4.586717844009399
4 1.692587e-01 3.168138e-02
* time: 6.080646991729736
5 1.668869e-01 1.483080e-01
* time: 7.363166809082031
6 1.647839e-01 2.507384e-01
* time: 8.843937873840332
7 1.638445e-01 1.648957e-01
* time: 10.333063840866089
8 1.633820e-01 2.338710e-02
* time: 11.627697944641113
9 1.633356e-01 2.095668e-02
* time: 13.11549687385559
10 1.633224e-01 1.774668e-03
* time: 14.597818851470947
11 1.633213e-01 3.628638e-03
* time: 16.073489904403687
12 1.633211e-01 1.280346e-04
* time: 17.559399843215942
13 1.633210e-01 3.719531e-04
* time: 19.04760980606079
14 1.633210e-01 5.299485e-06
* time: 20.525203943252563
15 1.633210e-01 3.008006e-05
* time: 22.201759815216064
16 1.633210e-01 8.243085e-08
* time: 23.69115400314331
17 1.633210e-01 6.359010e-08
* time: 24.976842880249023
18 1.633210e-01 2.488281e-07
* time: 26.455304861068726
19 1.633210e-01 2.588340e-08
* time: 27.74436092376709
20 1.633210e-01 1.472700e-08
* time: 29.030247926712036
21 1.633210e-01 2.956400e-09
* time: 30.599798917770386
* Status: success (objective increased between iterations)
* Candidate solution
Final objective value: 1.633210e-01
* Found with
Algorithm: Newton's Method
* Convergence measures
|x - x'| = 6.08e-06 ≰ 0.0e+00
|x - x'|/|x'| = 2.83e-07 ≰ 0.0e+00
|f(x) - f(x')| = 1.60e-13 ≰ 0.0e+00
|f(x) - f(x')|/|f(x')| = 9.82e-13 ≰ 0.0e+00
|g(x)| = 2.96e-09 ≤ 1.0e-08
* Work counters
Seconds run: 31 (vs limit Inf)
Iterations: 21
f(x) calls: 66
∇f(x) calls: 66
∇²f(x) calls: 21
Here we tell Optim to make use of automatic differentiation with ForwardDiff via the autodiff = :forward option. Let's take a peek at the parameter estimates:
DataFrame(;update(pars,res.minimizer)...)

| Row | σζ | β | h | δ | μ | σ | wres | F |
|---|---|---|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | LogNormal… |
| 1 | 0.05 | 0.995 | 0.184915 | 0.00764741 | 3.47993 | 1.09258 | 4.80711e-10 | LogNormal{Float64}(μ=3.47993, σ=1.09258) |
Note that in Example 12.1 we created a NamedTuple called dat from the data frame, which cements the type of each vector of data into dat.
This is important for performance! Working with DataFrame types directly can dramatically slow down your code because the columns of these data frames are not typed concretely. See the performance tips for more discussion.
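As a small illustration of this point (not part of the example above), the Tables.jl interface provides a one-line way to convert a DataFrame into a NamedTuple of concretely typed column vectors, which achieves the same thing as building the tuple by hand:

using DataFrames, Tables

df = DataFrame(x = randn(5), e = rand(Bool, 5))
nt = Tables.columntable(df)   # a NamedTuple of concretely typed vectors
typeof(nt.x)                  # Vector{Float64}, known to the compiler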
Suppose that we have a vector of instruments \(\mathbf{z}_{n}\) that we hope will jointly move consumption and labor supply, with a single cross-section of observations:
\[ (W_{n},H_{n},C_{n},\mathbf{z}_{n}).\]
We write the labor supply equation as
\[ \log(H) = \mu - \psi\log(W) - \psi\sigma\log(C) + \epsilon \]
where we assume that \(\mathbb{E}[\epsilon\ |\mathbf{z}] = 0\), implying the moment condition:
\[ \mathbb{E}[\epsilon \mathbf{z}] = 0.\]
Let \(\theta = (\mu,\sigma,\psi)\) and define the sample moment:
\[g_{N}(\theta) = \frac{1}{N}\sum_{n}\left(\log(H_{n})-\mu+\psi\log(W_{n})+\psi\sigma\log(C_{n})\right)\mathbf{z}_{n}.\]
The GMM estimator is:
\[ \hat{\theta} = \arg\min_{\theta} g_{N}(\theta)^\prime \mathbf{W}_{N} g_{N}(\theta) \]
where \(\mathbf{W}_{N}\) is a symmetric, positive definite weighting matrix. Since we have a linear system, this becomes a quadratic minimization problem with a known solution,1 but the theory we develop will be more general.
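For reference, here is a minimal sketch of that closed-form solution (the formula in the footnote); the construction of the stacked matrices X, Z and the outcome vector y is left to the reader, and the function name is illustrative.

using LinearAlgebra

# Linear GMM / IV: β̂ = (X'Z W Z'X)⁻¹ X'Z W Z'y
# X: N×k regressors, Z: N×ℓ instruments (ℓ ≥ k), y: length-N outcome, W: ℓ×ℓ PSD weighting matrix
function linear_gmm(y, X, Z, W = I)
    A = X' * Z * W * Z' * X
    b = X' * Z * W * Z' * y
    return A \ b
end

Setting W to the inverse of Z'Z recovers the familiar two-stage least squares estimator.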
Relevant questions for GMM are:
- How does the choice of weighting matrix \(\mathbf{W}_{N}\) affect the sampling distribution (i.e. precision) of \(\hat{\theta}\), and is there an optimal choice?
- What are the consistency and asymptotic properties of \(\hat{\theta}\) when the moment conditions outnumber the parameters?
Let’s consider estimation of the income process for this model and save estimation of the preference parameters for our chapter on simulation. Recall from our discussion of identification and from Exercise 10.1 that we can identify the parameters of this process by matching implied variances and covariances. Supposing that we have more of these moments than we do parameters (i.e. that the parameters are over-identified by the moments), we can estimate the income process by minimum distance.
Recall the income process: \[ \varepsilon_{n,t+1} = \rho \varepsilon_{n,t} + \eta_{n,t},\qquad \eta_{n,t}\sim\mathcal{N}(0,\sigma^2_\eta) \]
and consider the extended specification with permanent heterogeneity: \[ \log(y_{n,t}) = \mu_t + \alpha_n + \varepsilon_{n,t} \]
where \(\alpha_n \sim (0,\sigma^2_\alpha)\) is an individual fixed effect. Let us further assume that in the first period, \(\varepsilon_{1} = 0\). This gives us that
\[\varepsilon_{t} = \sum_{s=1}^{t-1}\rho^{t-1-s}\eta_{s}.\]
Define \(\theta = (\rho, \sigma^2_\alpha, \sigma^2_\eta)\) as the parameters we wish to estimate.
In Example 10.1, we considered identification of the income process by examining covariance restrictions in panel data. Here we’ll consider this approach as well as an alternative. Define
\[\epsilon = \log(y) - \mu_{t} = \alpha + \varepsilon \]
Let’s begin by noting the following generic relationships:
\[ \mathbb{V}[\epsilon_{t}] = \sigma^2_{\alpha} + \frac{(1-\rho^{2(t-1)})}{1-\rho^2}\sigma^2_{\eta}\]
and
\[ \mathbb{V}[\epsilon_{t+1}] = \sigma^2_{\alpha} + \rho^2\mathbb{V}[\varepsilon_{t}] + \sigma^2_{\eta} \]
\[ \mathbb{C}(\epsilon_{t},\epsilon_{t+s}) = \sigma^2_{\alpha} + \rho^{s}\mathbb{V}[\varepsilon_{t}] \]
We’ll consider two potential vectors of moments to match. The first vector consists of the variance of \(\epsilon\) at each \(t\):
\[ \mathbf{v} = [\mathbb{V}[\epsilon_{1}],\ \mathbb{V}[\epsilon_{2}],\ ...,\ \mathbb{V}[\epsilon_{T}]]^\prime \]
while the second takes two variances and \(K\) covariances:
\[\mathbf{c} = [\mathbb{V}[\epsilon_{t}],\ \mathbb{V}[\epsilon_{t+1}],\ \mathbb{C}(\epsilon_{t},\epsilon_{t+1}),\ ...,\ \mathbb{C}(\epsilon_{t},\epsilon_{t+K})]^\prime \]
Let \(\mathbf{v}(\theta)\) and \(\mathbf{c}(\theta)\) be the model-implied values of these moments, given by the expressions above. The minimum distance estimator is
\[ \hat{\theta} = \arg\min_\theta \left(\hat{\mathbf{v}} - \mathbf{v}(\theta)\right)^\prime \mathbf{W} \left(\hat{\mathbf{v}} - \mathbf{v}(\theta)\right) \]
where \(\mathbf{W}\) is a positive definite weighting matrix. An estimator is equivalently defined for the second set of moments \(\mathbf{c}\).
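For concreteness, here is a hedged sketch of the model-implied vector \(\mathbf{c}(\theta)\) built from the variance and covariance expressions above; the function name and argument layout are illustrative, and Example 12.2 below targets only the first vector \(\mathbf{v}(\theta)\).

# Model-implied moments c(θ) = [V[ϵ_t], V[ϵ_{t+1}], C(ϵ_t,ϵ_{t+1}), ..., C(ϵ_t,ϵ_{t+K})]
function cov_moments(θ, t, K)
    ρ, σ2_α, σ2_η = θ
    var_ar(s) = (1 - ρ^(2(s-1))) / (1 - ρ^2) * σ2_η   # V[ε_s] with ε₁ = 0
    v_t  = σ2_α + var_ar(t)                            # V[ϵ_t]
    v_t1 = σ2_α + ρ^2 * var_ar(t) + σ2_η               # V[ϵ_{t+1}]
    covs = [σ2_α + ρ^s * var_ar(t) for s in 1:K]       # C(ϵ_t, ϵ_{t+s}), s = 1,...,K
    return vcat(v_t, v_t1, covs)
end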
Example 12.2 Building on Example 10.1, let’s implement a minimum distance estimator for the income process parameters. First, we compute sample moments from the PSID data.
using CSV, DataFrames, DataFramesMeta, Statistics, Optim, Plots
# Load and prepare data
data = @chain begin
CSV.read("../data/abb_aea_data.csv",DataFrame,missingstring = "NA")
@select :person :y :tot_assets1 :asset :age :year
@subset :age.>=25 :age.<=64
end
# Calculate the variance of log income at each age
m_hat = @chain data begin
groupby(:age)
@combine :var_logy = var(log.(:y))
@orderby :age
_.var_logy
end
40-element Vector{Float64}:
0.2832928560329017
0.31385672611280657
0.3481614098261801
0.49042885748190596
0.7186945866590636
0.8145871761041581
0.3899819595842863
0.4310734870614536
1.0048586189529876
0.8330684076119356
0.6927461795744777
0.4530783605245448
0.7647115466410289
⋮
0.7856911636904547
0.8002257744890358
0.6119942670835942
0.9594333073919417
0.6036241679859082
1.0296983130781643
0.6008575915783718
1.1495148217769573
0.7851363933479678
1.6514958037883842
0.5559690181469094
1.1855708905092428
Now we define the model-implied moments, using the variance expression derived above for each period (age):
function model_moments(θ, T)
    ρ, σ2_α, σ2_η = θ
    # model-implied variance of log income in period t:
    # V[ϵ_t] = σ²_α + (1 - ρ^(2(t-1)))/(1 - ρ²) * σ²_η
    m = [σ2_α + (1-ρ^(2(t-1)))/(1-ρ^2) * σ2_η for t in 1:T]
    return m
end
# Minimum distance objective (identity weighting matrix)
function md_objective(x, m_hat)
# Transform to ensure constraints: ρ ∈ (-1,1), σ² > 0
ρ = tanh(x[1])
σ2_α = exp(x[2])
σ2_η = exp(x[3])
θ = (ρ, σ2_α, σ2_η)
T = length(m_hat)
m_model = model_moments(θ, T)
diff = m_hat .- m_model
return diff' * diff # Identity weighting
end
md_objective (generic function with 1 method)
Finally, we estimate the parameters:
# Initial values
x0 = [0.5, log(0.1), log(0.05)]
# Optimize
res = optimize(x -> md_objective(x, m_hat), x0, Newton(),autodiff = :forward)
# Extract estimates
x_hat = res.minimizer
ρ_hat = tanh(x_hat[1])
σ2_α_hat = exp(x_hat[2])
σ2_η_hat = exp(x_hat[3])
θ_hat = (ρ_hat, σ2_α_hat, σ2_η_hat)
println("Minimum Distance Estimates:")
println(" ρ = $(round(ρ_hat, digits=3))")
println(" σ²_α = $(round(σ2_α_hat, digits=3))")
println(" σ²_η = $(round(σ2_η_hat, digits=3))")
# and finally a plot of model fit
T = length(m_hat)
scatter(1:T,m_hat,label = "data",title = "Model Fit of Targeted Moments")
plot!(1:T,model_moments(θ_hat,length(m_hat)),label = "model fit")
xlabel!("Model Periods (Age)")Minimum Distance Estimates:
ρ = 0.918
σ²_α = 0.279
σ²_η = 0.085
As with the case of GMM, we would like to know how our choice of \(\mathbf{W}\) affects the sampling distribution (i.e. precision) of our estimator and if there is an “optimal” choice.
Consider two alternative estimators of the entry-exit model. The key insight from our identification discussion is that the choice probability \(p(x,a,a')\) is directly observable in the data and encodes information about the underlying payoff parameters.
Recall the payoff specification: \[ u_{1}(x,a,d^{\prime}) = \phi_{0} + \phi_{1}x - \phi_{2}d^\prime - \phi_{3}(1-a) \] \[ u_{0}(x,a) = \phi_{4}a \]
and let \(\phi = (\phi_0, \phi_1, \phi_2, \phi_3, \phi_4)\) denote the vector of payoff parameters.
The minimum distance approach directly exploits the mapping between parameters and choice probabilities. For each market-state combination \((x,a,a')\), the model implies a choice probability:
\[ p(x,a,a';\phi,\beta) = \frac{\exp(v_1(x,a,a';\phi,\beta))}{\exp(v_0(x,a,a';\phi,\beta)) + \exp(v_1(x,a,a';\phi,\beta))} \]
where \(v_0\) and \(v_1\) are the choice-specific value functions that depend on the equilibrium solution.
Suppose we have a cross-section of data:
\[ (X_{m},D_{m},A_{m},A'_{m})_{m=1}^{M} \]
for each of \(M\) markets. Further assume that \(X\) is a variable that takes one of a discrete number of values in the support \(\mathcal{X}\).
For each unique state \((x,a,a')\) in the data, we can compute the empirical choice frequency:
\[ \hat{p}(x,a,a') = \frac{\sum_{m} D_{m}\mathbf{1}\{X_m = x, A_{m} = a, A'_{m} = a'\}}{\sum_{m} \mathbf{1}\{X_m = x, A_{m} = a, A'_{m} = a'\}} \]
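As a sketch, these frequencies can be computed with a grouped mean; the DataFrame df and its column names (X, A, Ap, D) are hypothetical here.

using DataFrames, DataFramesMeta, Statistics

# empirical choice frequencies p̂(x, a, a′): the mean of D within each observed state
p_hat = @chain df begin
    groupby([:X, :A, :Ap])
    @combine :p_hat = mean(:D)
end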
The minimum distance estimator minimizes the weighted sum of squared deviations between observed and predicted choice probabilities. Let \(\mathbf{p}(\phi)\) be the vector of choice probabilities across the state space \(\mathcal{X}\times\{0,1\}^2\) and let \(\widehat{\mathbf{p}}\) be the corresponding vector of frequency estimates. The minimum distance estimator is:
\[ \hat{\phi} = \arg\min_\phi (\widehat{\mathbf{p}}-\mathbf{p}(\phi))^\prime \mathbf{W}_{M}(\widehat{\mathbf{p}}-\mathbf{p}(\phi))\]
where \(\mathbf{W}_{M}\) is once again a positive definite weighting matrix.
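A hedged sketch of this objective, which makes clear that the model must be re-solved for \(\mathbf{p}(\phi)\) at each trial value of \(\phi\); the function solve_choice_probs and the argument names are assumptions, not part of the course code.

# Minimum distance objective for the entry-exit model.
# `solve_choice_probs(φ, β)` is an assumed user-supplied routine that solves the model and
# returns the model-implied choice probabilities in the same order as the frequency vector p̂.
function md_objective_entry_exit(φ, β, p̂, W, solve_choice_probs)
    Δ = p̂ .- solve_choice_probs(φ, β)
    return Δ' * W * Δ
end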
An alternative approach uses the Generalized Method of Moments. The key insight is that choice probabilities satisfy certain orthogonality conditions that can be expressed as moment restrictions.
Given the discrete choice structure, the residual: \[ \xi_{m} = D_{m} - p(X_m, A_{m}, A'_{m}; \phi, \beta) \]
has the property that \(\mathbb{E}[\xi_{m} | X_m, A_{m}, A'_{m}] = 0\) when evaluated at the true parameters. This suggests the moment conditions:
\[ \mathbb{E}\left[(D_{m} - p(X_m, A_{m}, A'_{m}; \phi, \beta)) \cdot \mathbf{z}_{m}\right] = 0 \]
where \(\mathbf{z}_{m}\) is a vector of instruments. Natural choices include functions of \((X_m, A_{m}, A'_{m})\) themselves, such as:
\[ \mathbf{z}_{m} = [1,\ X_m,\ A_{m},\ A'_{m},\ X_m \cdot A_{m}]^\prime \]
The GMM estimator minimizes:
\[ \hat{\phi} = \arg\min_\phi g_M(\phi)^\prime \mathbf{W}_M g_M(\phi) \]
where the sample moment is:
\[ g_M(\phi) = \frac{1}{M}\sum_{m}\left(D_{m} - p(X_m, A_{m}, A'_{m}; \phi, \beta)\right) \mathbf{z}_{m} \]
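A hedged sketch of these sample moments in code, assuming a user-supplied function choice_prob(x, a, a′, φ, β) that solves the model and returns \(p(x,a,a';\phi,\beta)\), and a NamedTuple data with fields X, A, Ap, D; all of these names are assumptions.

# g_M(φ) = (1/M) Σ_m (D_m - p(X_m, A_m, A′_m; φ, β)) z_m,  with z_m = [1, X_m, A_m, A′_m, X_m·A_m]
function gmm_moments(φ, β, data, choice_prob)
    g = zeros(5)
    M = length(data.D)
    for m in 1:M
        x, a, ap, d = data.X[m], data.A[m], data.Ap[m], data.D[m]
        z = [1.0, x, a, ap, x * a]                        # instruments from the text
        g .+= (d - choice_prob(x, a, ap, φ, β)) .* z
    end
    return g ./ M
end

# the GMM objective to be minimized over φ
gmm_objective(φ, β, data, choice_prob, W) = (g = gmm_moments(φ, β, data, choice_prob); g' * W * g)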
Specifically, letting \(\beta = [\mu,\ \psi,\ \psi\sigma]^\prime\), we know that \(\hat{\beta} = (\mathbf{X}^\prime\mathbf{Z}\mathbf{W}_{N}\mathbf{Z}^\prime\mathbf{X})^{-1}(\mathbf{X}^\prime\mathbf{Z}\mathbf{W}_{N}\mathbf{Z}^\prime\mathbf{Y})\), where \(\mathbf{X}\), \(\mathbf{Z}\), and \(\mathbf{Y}\) are the appropriately stacked regressors, instruments, and outcomes (log hours).↩︎