Let us review the empirical content of this model without placing further restrictions on the functional forms. Let’s start with the selection equation. Let \(P(X,Z) = P[D=1|X,Z]\). For any distribution of \(V\), since \(F_{V}\) is monotonically increasing (under support conditions on \(V\)), we can write: \[ D = \mathbf{1}\{\mu_{d}(X,Z) \geq V\} = \mathbf{1}\{F_{V}(\mu_{d}(X,Z)) \geq F_{V}(V)\}.\] Since \(F_{V}(V)\) is a uniform random variable in \([0,1]\), we can always write the selection equation without loss of generality as:
\[ D = \mathbf{1}\{P(X,Z) - V \geq 0\},\qquad V\sim U[0,1] \]
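To see that the normalized index in this representation is exactly the propensity score, note that (assuming \(V\) is independent of \((X,Z)\), or applying the same argument conditionally): \[ F_{V}(\mu_{d}(X,Z)) = P[V \leq \mu_{d}(X,Z) \mid X,Z] = P[D=1|X,Z] = P(X,Z), \] so relabeling \(F_{V}(V)\) as \(V\) delivers the uniform-threshold form above.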
Now consider the conditional expectations: \[ \mathbb{E}[Y|X,Z,D=1] = \mu_{1}(X) + \underbrace{\mathbb{E}[U_1 | V \leq P(X,Z)]}_{h_{1}(P(X,Z))} \]\[ \mathbb{E}[Y|X,Z,D=0] = \mu_{0}(X) + \underbrace{\mathbb{E}[U_0 | V > P(X,Z)]}_{h_{0}(P(X,Z))} \]
Two observations follow:
These equations illustrate the classic selection problem. If the unobservable that determines \(D\) is related to the unobservables that determine the potential outcomes \((Y_0,Y_1)\), then the difference in conditional means is partly contaminated by this selection effect, as the decomposition after these observations makes explicit.
The selection model implies dimension reduction in the conditional expectation of each \(U_{D}\) given \(X\) and \(Z\): the propensity score \(P(X,Z)\) encodes all the relevant information needed to control for selection. This is a useful property, but the underlying index model is not without loss of generality; we will explore this a little more below.
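To make the selection problem explicit, difference the two conditional means above: \[ \mathbb{E}[Y|X,Z,D=1]-\mathbb{E}[Y|X,Z,D=0] = \underbrace{\mu_{1}(X)-\mu_{0}(X)}_{ATE(X)} + \underbrace{h_{1}(P(X,Z)) - h_{0}(P(X,Z))}_{\text{selection effect}}. \] The second term is generally nonzero whenever \(V\) is related to \((U_0,U_1)\).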
7.2 Identification by Functional Form
We’ll begin by considering identification under some additional functional form restrictions. First, consider the example where the triple \((V,U_0,U_1)\) is jointly normally distributed: \[ (V, U_0, U_1) \sim \mathcal{N}\left(0,\ \begin{pmatrix} 1 & \sigma_{V0} & \sigma_{V1} \\ \sigma_{V0} & \sigma_{0}^{2} & \sigma_{01} \\ \sigma_{V1} & \sigma_{01} & \sigma_{1}^{2} \end{pmatrix}\right) \]
where we have normalized the location of the unobservables by setting their means to zero, and normalized the scale of \(V\) by assuming a unit variance. Additionally, let us specify that each \(\mu_{D}\) is linear in \(X\): \[ \mu_{D}(X) = X\beta_{D} \] Assume that our data is a single cross-section of observations \((Y_{D},D,X)\). Identification is a statement about population values, so we can take as given that we observe the joint distribution \(\mathbb{P}_{Y_{D},D,X}\).
Step 1 Note that the distribution of \(D\) given \(X\) is a probit model: \[ P[D=1|X] = \Phi(\mu_{d}(X)) \] where \(\Phi\) is the cdf of the standard normal. Thus \(\mu_{d}(X)\) is identified as \(\mu_{d}(X) = \Phi^{-1}(P[D=1|X])\) for any \(X\). If we additionally impose that \(\mu_{d}(X) = X\gamma\), then identification of \(\gamma\) follows from the usual rank condition on \(X\) (just as in OLS).
Step 2 Now consider the identification of each \(\beta_{D}\). We have:
\[\begin{align}
\mathbb{E}[Y_{1}|X,D=1] &= X\beta_{1} + \mathbb{E}[U_1 | V<\mu_{d}(X)] \\
&= X\beta_{1} - \sigma_{V1}\frac{\phi(\mu_{d}(X))}{\Phi(\mu_{d}(X))}
\end{align}\] and similarly: \[
\mathbb{E}[Y_{0}|X,D=0] = X\beta_{0} + \sigma_{V0}\frac{\phi(\mu_{d}(X))}{1-\Phi(\mu_{d}(X))}
\] Under the same rank conditions on \(X\), both \(\beta_0\) and \(\beta_1\) are identified: \(\mu_{d}(X)\) is already identified from Step 1, so the selection-correction terms are known functions of \(X\), and their nonlinearity separates them from the linear index \(X\beta_{D}\). This also means that \(ATE(X) = X(\beta_1 - \beta_0)\) is identified for all \(X\).
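These correction terms follow from standard properties of the joint normal: since \(\mathrm{Var}(V)=1\), we have \(\mathbb{E}[U_{D}|V] = \sigma_{VD}V\), and the truncated means of a standard normal give \[ \mathbb{E}[U_1|V<c] = \sigma_{V1}\,\mathbb{E}[V|V<c] = -\sigma_{V1}\frac{\phi(c)}{\Phi(c)}, \qquad \mathbb{E}[U_0|V>c] = \sigma_{V0}\,\mathbb{E}[V|V>c] = \sigma_{V0}\frac{\phi(c)}{1-\Phi(c)}, \] evaluated at \(c=\mu_{d}(X)\). The same projection \(\mathbb{E}[U_{D}|V]=\sigma_{VD}V\) delivers the expression for \(\mathbb{E}[U_1-U_0|V]\) used just below.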
Although we cannot identify the full distribution of treatment effects, this also gives us the average treatment effect conditional on the unobservable \(V\) that drives selection: \[\mathbb{E}[Y_1-Y_0|X,V] = X(\beta_1-\beta_0) + \underbrace{(\sigma_{V1}-\sigma_{V0})V}_{\mathbb{E}[U_1-U_0|V]}\] We’ll come back to this object in the next section.
Notice that identification holds here even without an excluded variable \(Z\). It follows from the assumption of linearity and normality of the error terms, which yield a particular parametric decomposition of the conditional expectation that can be identified.
This is an example of identification by functional form: identification depends crucially on these particular parametric assumptions. Notice that without the linearity of the outcome equations \(\mu_{D}(X)=X\beta_{D}\), it would not be possible to separate \(\mu_{D}\) from the selection correction, since both would be unrestricted functions of \(X\). Would you say that this is a very credible approach to identifying the key causal parameters in the model? Lewbel (2019) provides a broad and satisfying discussion of these issues.
7.3 Identification with Exclusion Restrictions
If we have access to an excluded regressor, we do not have to rely so much on seemingly arbitrary functional form restrictions. Formally, we make two assumptions:
\(\mu_{D}(X,Z)=\mu_{D}(X)\) almost everywhere (exclusion)
\(Z \perp\!\!\!\perp (V,U_0,U_1) \mid X\) (independence)
This implies that we can write: \[\mathbb{E}[U_1|X,Z,D=1] = \mathbb{E}[U_1|V\leq P(X,Z)] \] and similarly for \(\mathbb{E}[Y|X,Z,D=0]\). So the expectation of \(Y\) (unconditional on \(D\)) is: \[\begin{align}
\mathbb{E}[Y|X,P(X,Z)=p] &= \mu_{0}(X) + p[\mu_{1}(X)-\mu_{0}(X)] + p\,\mathbb{E}[U_{1}-U_{0}\mid V\leq p] \\
&= \mu_{0}(X) + p[\mu_{1}(X)-\mu_{0}(X)] + \int_{0}^{p}\mathbb{E}[U_1 - U_0\mid V=u]\,du
\end{align}\]
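The second line uses that \(V\sim U[0,1]\), so that \[ p\,\mathbb{E}[U_1-U_0\mid V\leq p] = \int_{0}^{p}\mathbb{E}[U_1-U_0\mid V=u]\,du. \]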
Taking a derivative with respect to \(p\) gives: \[\begin{align}
\frac{\partial \mathbb{E}[Y|X,P(X,Z)=p]}{\partial p} &= \mu_{1}(X)-\mu_{0}(X) + \mathbb{E}[U_1 - U_0 | V = p] \\
&= \mathbb{E}[Y_1 - Y_0 | V = p] \\
&\equiv MTE(p)
\end{align}\]
Heckman and Vytlacil (2005) call this derivative the local instrumental variables (LIV) estimand and define \(MTE(p)\) as the marginal treatment effect. It is the average treatment effect among individuals with unobserved resistance \(V=p\), i.e. those who are indifferent between treatment and no treatment when the propensity score equals \(p\). They show that commonly used estimators (instrumental variables, difference-in-differences) are often weighted averages of this marginal treatment effect, and that the marginal treatment effect is a useful building block for constructing policy-relevant treatment parameters.
Note that under support conditions on \(Z\), as the support of \(P(X,Z)\) approaches the unit interval \([0,1]\), the average treatment effect \(ATE(X) = \mu_{1}(X)-\mu_{0}(X)\) is also identified.
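To see why full support matters, note that the LIV argument identifies \(MTE(u)\) only for values of \(u\) in the support of \(P(X,Z)\), and under the usual normalization \(\mathbb{E}[U_1-U_0|X]=0\): \[ ATE(X) = \mu_{1}(X)-\mu_{0}(X) = \int_{0}^{1} MTE(u)\,du, \] which requires recovering \(MTE(u)\) over the entire unit interval.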
Credibility Whether or not you think that this approach to identifying treatment effects is more or less credible than using functional form restrictions may depend on how valid you think the exclusion and independence restrictions are. However, assuming that we have access to a “good” instrument, it should be clear that this approach is preferable.
In practice, even when we have access to an instrument, the data requirements for fully non-parametric estimation can be overly demanding. A reasonable compromise is to illustrate the conditions under which your instrument provides nonparametric identification, then introduce parametric assumptions that interpolate from this plausible variation. This is a way to establish that identification of your parameters of interest is not purely driven by functional form restrictions. See Cunha, Heckman, and Schennach (2010) and Carneiro, Heckman, and Vytlacil (2011) for two examples of this kind of approach.
Example: Parametric and Semiparametric Estimation
Example 7.1 Let’s compare estimation methods for the Generalized Roy Model. Our estimation approach will follow the identification argument: we’ll estimate the probit model and then run a selection-corrected regression to recover the conditional means of the potential outcomes. Let’s assume linearity: \[ Y_D = \beta_{D,1} + \beta_{D,2}X + U_D \]\[ D = \mathbf{1}\{\gamma_1 + \gamma_2 X + \gamma_3 Z - V \geq 0 \} \] Here is a function to simulate the data, a function to calculate the log-likelihood, and a function to estimate \(\gamma\) by maximizing the log-likelihood.
```julia
using Distributions, Optim, Random, Plots

function sim_data(γ, β0, β1, N; ρ_0 = 0.3, ρ_1 = -0.3)
    X = rand(Normal(), N)
    Z = rand(Normal(), N)
    v = rand(Normal(), N)
    U0 = rand(Normal(), N) .+ ρ_0 .* v   # outcome unobservables correlated with selection error v
    U1 = rand(Normal(), N) .+ ρ_1 .* v
    D = (γ[1] .+ γ[2]*X .+ γ[3]*Z .- v) .> 0
    Y = β0[1] .+ β0[2] .* X .+ U0
    Y1 = β1[1] .+ β1[2] .* X .+ U1
    Y[D .== 1] .= Y1[D .== 1]            # observed outcome: Y1 if treated, Y0 otherwise
    return (;X, Z, Y, D)
end

function log_likelihood(γ, data)
    (;D, X, Z) = data
    ll = 0.
    Fv = Normal()
    for n in eachindex(D)
        xg = γ[1] + γ[2]*X[n] + γ[3]*Z[n]
        if D[n] == 1
            ll += log(cdf(Fv, xg))
        else
            ll += log(1 - cdf(Fv, xg))
        end
    end
    return ll
end

function estimate_probit(data)
    res = optimize(x -> -log_likelihood(x, data), zeros(3), Newton(), autodiff = :forward)
    return res.minimizer
end
```
estimate_probit (generic function with 1 method)
Before we move on to estimating the other parameters, let’s run a quick Monte-Carlo to make sure we did everything correctly.
```julia
gamma = [0., 0.5, 0.5]
beta0 = [0., 0.3]
beta1 = [0., 0.5]
gamma_boot = [estimate_probit(sim_data(gamma, beta0, beta1, 500))[2] for b in 1:500]
histogram(gamma_boot, label = false)
#plot!([0.5,0.5],[0.,10.],label="true value")
```
Great! Now let’s write code to estimate the \(\beta\) parameters via a selection correction. We’ll try a non-parametric and a parametric selection correction. The nonparametric correction uses the fact that \(\mathbb{E}[Y|X,P(X,Z)=p,D=1] = \mu_{1}(X) + h_{1}(p)\) for an unknown function \(h_{1}\) (and likewise for \(D=0\)), so \(h_{D}\) can be approximated with a flexible function of the estimated propensity score; the parametric correction instead imposes joint normality and uses the inverse Mills ratio terms derived above.
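Here is one way these two estimators might look: a minimal sketch in which the `ols` helper, the sample split by treatment status, and the cubic polynomial in the estimated propensity score (`K = 3`) are illustrative choices rather than the only option.

```julia
using Distributions

# simple least-squares helper (illustrative): regress y on the columns of X
ols(y, X) = X \ y

function estimate_parametric(data, γ)
    (;Y, X, Z, D) = data
    xg = γ[1] .+ γ[2] .* X .+ γ[3] .* Z                    # estimated probit index
    λ1 = pdf.(Normal(), xg) ./ cdf.(Normal(), xg)          # inverse Mills ratio, D = 1 sample
    λ0 = pdf.(Normal(), xg) ./ (1 .- cdf.(Normal(), xg))   # inverse Mills ratio, D = 0 sample
    d1 = D .== 1
    d0 = .!d1
    B1 = ols(Y[d1], [ones(sum(d1)) X[d1] λ1[d1]])          # intercept, slope, Mills loading
    B0 = ols(Y[d0], [ones(sum(d0)) X[d0] λ0[d0]])
    return B0, B1
end

function estimate_semiparametric(data, γ; K = 3)
    (;Y, X, Z, D) = data
    p = cdf.(Normal(), γ[1] .+ γ[2] .* X .+ γ[3] .* Z)     # estimated propensity score
    poly = hcat((p .^ k for k in 1:K)...)                  # flexible control function in p
    d1 = D .== 1
    d0 = .!d1
    C1 = ols(Y[d1], [ones(sum(d1)) X[d1] poly[d1, :]])
    C0 = ols(Y[d0], [ones(sum(d0)) X[d0] poly[d0, :]])
    return C0, C1
end
```

Each estimator returns the corrected regression coefficients for the untreated and treated samples, so the slope estimates used in the Monte Carlo below are the second entries of each vector.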
estimate_semiparametric (generic function with 1 method)
Now let’s finish off with a Monte Carlo to see how each estimator performs.
```julia
ests = mapreduce(vcat, 1:500) do b
    data = sim_data([0., 1., 1.], [0., 0.3], [0., 0.5], 500)
    gamma = estimate_probit(data)
    B0, B1 = estimate_parametric(data, gamma)
    C0, C1 = estimate_semiparametric(data, gamma)
    return [B0[2] B1[2] C0[2] C1[2]]
end
pl = plot((histogram(ests[:, j], label = false) for j in 1:4)...)
plot!(pl[1], title = "β0[2] - Parametric")
plot!(pl[1], [0.3, 0.3], [0, 150], color = "red", label = "True Value")
plot!(pl[2], title = "β1[2] - Parametric")
plot!(pl[2], [0.5, 0.5], [0, 150], color = "red", label = "True Value")
plot!(pl[3], title = "β0[2] - Semi-Parametric")
plot!(pl[3], [0.3, 0.3], [0, 150], color = "red", label = "True Value")
plot!(pl[4], title = "β1[2] - Semi-Parametric")
plot!(pl[4], [0.5, 0.5], [0, 150], color = "red", label = "True Value")
```
Exercise
Exercise 7.1 Using Example 7.1, try setting \(\gamma_{3} = 0\) and verify that identification by functional form works in practice: the model can still be estimated using the parametric selection correction.
Exercise
Exercise 7.2 Consider the following problem that features endogeneity and selection. Let \(W\) be wages, let \(X\) be an endogenous variable of interest, and let \(Z\) be an instrument for \(X\). Wages are selected because we only observe them when the individual chooses to work (\(H=1\)):
\[\begin{eqnarray}
\log(W) = \alpha_0 + \alpha_1 X + \epsilon + \rho\eta \\
X = \gamma_0 + \gamma_1 Z + \eta \\
H = \mathbf{1}\{\beta_0 + \beta_1\log(W) + \zeta >0\}
\end{eqnarray}\]
Here is a function that simulates these data for a fixed set of parameter values:
```julia
using Distributions, Random

function sim_data(N)
    α = [1., 0.5]
    γ = [1., 0.5]
    ρ = 0.4
    β = [-0.25, 0.5]
    ϵ = rand(Normal(), N)
    η = rand(Normal(), N)
    ζ = rand(Normal(), N)
    Z = rand(Normal(), N)
    X = γ[1] .+ γ[2]*Z .+ η
    logW = α[1] .+ α[2]*X .+ ρ*η .+ ϵ
    H = (β[1] .+ β[2]*logW .+ ζ) .> 0
    logW[H .== 0] .= -1   #<- set to missing if H=0
    return (;logW, X, Z, H)
end
```
sim_data (generic function with 2 methods)
Suppose that you decide to handle the endogeneity problem by estimating \(\alpha_1\) with 2SLS. Below is a Monte Carlo simulation of the performance of this estimator:
```julia
using Plots, StatsBase

function monte_carlo_trial(N)
    data = sim_data(N)
    # pull out non-missing data
    W = data.logW[data.H .== 1]
    Z = data.Z[data.H .== 1]
    X = data.X[data.H .== 1]
    # run 2SLS
    alpha_est = cov(W, Z) / cov(X, Z)
    return alpha_est
end

# run 500 trials
alpha_boot = [monte_carlo_trial(10_000) for b in 1:500]
histogram(alpha_boot, normalize = :probability, label = false)
plot!([0.5, 0.5], [0., 0.2], color = "red", label = "True Value")
xlabel!("2SLS Estimate")
```
Explain why this estimator appears to be showing asymptotic bias.
Assume that \(\zeta\sim\mathcal{N}(0,1)\) and \(\epsilon\sim\mathcal{N}(0,\sigma^2_\epsilon)\). Show that the reduced form for this model (\(H\) and \(W\) written in terms of exogenous variables) can be written as: \[ \log(W) = A_0 + A_1z + A_2\eta + \xi_1 \]\[ H = \mathbf{1}\{B_0 + B_1z + B_2\eta + \xi_2 >0\} \]\[ X = \gamma_0 + \gamma_1 z + \eta \] where \([\xi_1,\ \xi_2]\) is a bivariate normal with non-zero covariance and the coefficients \(A\) and \(B\) are combinations of underlying structural parameters.
Show that the parameters of the reduced form are identified (hint: start by identifying \([\gamma_0,\gamma_1]\) and then inverting \(\eta\) from \(Z\) and \(X\), treating it as observable).
Show that the structural parameters can be identified from this reduced form.
Does this approach to identification work if we do not make these functional form assumptions? In other words, could we nonparametrically identify the reduced form parameters of the model?
Introduce a variable to this model that would allow you to nonparametrically estimate the reduced form once again (hint: think back to what allows this in the Generalized Roy Model).
7.4 Monotonicity and Potential Outcomes
Recall that the Generalized Roy Model embeds the potential outcomes framework (this is the model you get if you throw away the selection equation). Imbens and Angrist (1994) consider what can be estimated by two-stage least squares when you have access to an instrument \(Z\) and heterogeneous potential outcomes. To formalize their setup, assume that individuals are indexed by elements \(\omega\in\Omega\) of some measurable space, so that we write the model as a triple \((Y_1(\omega),Y_0(\omega),D_{Z}(\omega))\), where \(D_{Z}(\omega)\in\{0,1\}\) indicates the choice of an individual of type \(\omega\) when the instrument takes the value \(Z\).
They combine four assumptions:
Relevance: \(P[D=1|Z=z]\) is a nonconstant function of \(z\)
Exclusion: \(Y_{D}(\omega,Z) = Y_{D}(\omega)\) for \(D\in\{0,1\}\)
Independence: \(Z \perp\!\!\!\perp \left(Y_1(\omega),Y_0(\omega),\{D_{z}(\omega)\}_{z}\right)\)
Monotonicity: For any pair \((z,z')\) in the support of \(Z\), either \(D_{z}(\omega)\geq D_{z'}(\omega)\) for all \(\omega\in\Omega\) or \(D_{z}(\omega)\leq D_{z'}(\omega)\) for all \(\omega\in\Omega\).
In this case, monotonicity rules out the existence of defiers (without loss of generality, since we can always relabel the values of the instrument so that the treatment response is weakly increasing).
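For a binary instrument \(Z\in\{0,1\}\), the response types can be read off the pair \((D_{1}(\omega), D_{0}(\omega))\): \[ \underbrace{(1,1)}_{\text{always-takers}}, \qquad \underbrace{(0,0)}_{\text{never-takers}}, \qquad \underbrace{(1,0)}_{\text{compliers}}, \qquad \underbrace{(0,1)}_{\text{defiers}}, \] and monotonicity (in the direction \(D_{1}\geq D_{0}\)) leaves only the first three.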
Exercise
Exercise 7.3 Under these assumptions, show that the 2SLS estimand: \[
\frac{\mathbb{E}[Y|Z=1]-\mathbb{E}[Y|Z=0]}{P[D=1|Z=1]-P[D=1|Z=0]}
\] is equal to the average treatment effect among compliers: \(\mathbb{E}[Y_1(\omega)-Y_0(\omega)|\omega \in\text{Compliers}]\).
Imbens and Angrist (1994) call this object the Local Average Treatment Effect (LATE). They show that for multi-valued instruments, 2SLS produces a weighted average of these LATEs for different complier groups.
This interpretation of IV estimators is extremely widely used, so you should know the definition and the assumptions under which it holds.
7.4.1 Relationship to the Generalized Roy Model
There is a deeper relationship between the monotonicity assumption and the Generalized Roy Model. Vytlacil (2002) shows that the monotonicity assumption is equivalent to a latent index model representation. One direction of this proof is easy, since it is simple to verify that the latent index model: \[D = \mathbf{1}\{P(Z)-V \geq 0\}\] obeys monotonicity. Going the other way, it is relatively intuitive to see that one can construct a mapping from \(\Omega\) to \([0,1]\) such that: \[ D_{Z}(\omega) = \mathbf{1}\{P(Z)-V(\omega) \geq 0\}. \]
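A heuristic sketch of that construction (glossing over support and measurability issues that Vytlacil (2002) treats carefully): order the values of the instrument by their propensity \(P(z)=P[D_{z}=1]\). Monotonicity together with independence implies that the set of instrument values at which \(\omega\) takes the treatment is nested as \(P(z)\) grows, so defining \[ V(\omega) = \inf\{P(z) : D_{z}(\omega) = 1\} \] gives \(D_{Z}(\omega) = \mathbf{1}\{P(Z) \geq V(\omega)\}\) (up to ties at the boundary), with \(P[V(\omega)\leq P(z)] = P(z)\), which is exactly the uniform normalization used above.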
Exercise
Exercise 7.4 Use the Generalized Roy Model to show that for some binary instrument \(Z\in\{0,1\}\) with \(P(Z=1)>P(Z=0)\), the 2SLS estimand is equal to: \[\frac{1}{P(Z=1)-P(Z=0)}\int_{P(Z=0)}^{P(Z=1)}MTE(u)\,du \]
Carneiro, P., J. J. Heckman, and E. J. Vytlacil. 2011. “Estimating Marginal Returns to Education.” American Economic Review 101 (October): 2754–81. http://www.nber.org/papers/w16474.
Cunha, Flavio, James Heckman, and Susanne Schennach. 2010. “Estimating the Technology of Cognitive and Noncognitive Skill Formation.” Econometrica 78 (3): 883–931.
Heckman, James, and Edward Vytlacil. 2005. “Structural Equations, Treatment Effects, and Econometric Policy Evaluation.” Econometrica 73 (3): 669–738.
Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62 (2): 467–75. http://www.jstor.org/stable/2951620.
Lewbel, Arthur. 2019. “The Identification Zoo: Meanings of Identification in Econometrics.” Journal of Economic Literature 57 (4): 835–903. https://doi.org/10.1257/jel.20181361.
Vytlacil, Edward J. 2002. “Independence, Monotonicity and Latent Index Models: An Equivalence Result.” Econometrica 70 (1): 331–41.