using Distributions, ForwardDiff, Optim, Random, LinearAlgebra, Plots
Recitation 3
Recitation 3
In this recitation we’ll do some maximum likelihood exercises using the simple probit model. First let’s write code to simulate data from this model and calculate the likelihood of the data given a set of parameters.
Recall that the probit model can be written as:
\[ Y_{n} = \mathbf{1}\{X_{n}\beta - \epsilon_{n} \geq 0 \},\ \epsilon\sim \mathcal{N}(0,1) \]
So it will be simple to simulate data from this model. We’ll assume that \(X_{n} = [1\ X_{n,1}\ X_{n,2}]\) where each \(X_{n,j}\) is distributed as a standard normal.
For no good reason other than that we want to explore the issues later on, suppose that the parameters \(\beta_{1},\beta_{2},\beta_{3}\) are a function of deeper structural parameters \(\gamma\):
\[ \beta_{1} = \exp(\gamma_{1}+\gamma_{2}), \beta_{2} = \gamma_{2}, \beta_{3} = \gamma_{1} \]
Below is code for simulating the data and calculating the log-likelihood given \(\beta\).
function simulate_probit(β,N)
= Normal()
F = [ones(N) rand(F,N,2)]
X = rand(F,N)
eps = (X * β .- eps) .> 0
Y return (Y,X)
end
function log_likelihood(n,Y,X,β,Fϵ)
@views xb = dot(X[n,:],β)
if Y[n]==1
return log(cdf(Fϵ,xb))
else
return log(1-cdf(Fϵ,xb))
end
end
function log_likelihood(Y,X,β)
= 0.
ll = length(Y)
N = Normal()
Fϵ for n in eachindex(Y)
+= log_likelihood(n,Y,X,β,Fϵ)
ll end
return ll / N
end
log_likelihood (generic function with 2 methods)
A Warning when Using Automatic Differentiation
For simple models you are unlikely to run into this issue, but for more complicated models you need to be careful when pre-allocating arrays in functions that you plan to differentiate “automatically”. Here is an example. Suppose we write the following function to get the reduced form coefficients \(\beta\) as a function of \(\gamma\):
function get_β_stupid(γ)
= zeros(3)
β 1] = exp(γ[1]+γ[2])
β[2] = γ[2]
β[3] = γ[1]
β[return β
end
get_β_stupid (generic function with 1 method)
Let’s see what happens when we try to take the derivative of the log-likelihood with respect to \(\gamma\) using this function:
= [0.,0.2]
γ = 200
N = get_β_stupid(γ)
β = simulate_probit(β,N)
Y,X
L(γ) = log_likelihood(Y,X,get_β_stupid(γ))
L(γ)
-0.33713288149822374
= ForwardDiff.gradient(L,γ) dL
MethodError: no method matching Float64(::ForwardDiff.Dual{ForwardDiff.Tag{typeof(L), Float64}, Float64, 2}) Closest candidates are: (::Type{T})(::Real, ::RoundingMode) where T<:AbstractFloat at rounding.jl:200 (::Type{T})(::T) where T<:Number at boot.jl:772 (::Type{T})(::AbstractChar) where T<:Union{AbstractChar, Number} at char.jl:50 ...
Ok, we get an error! Why? When we pre-allocated the array \(\beta\) in the function get_β_stupid
, using the zeros
function, this asserts that the vector can only hold Float64
variables. When ForwardDiff
attempts to take the derivative, it instead passes arguments that have a special Dual
type that cannot be converted to a float. Here are two workarounds for this problem:
function get_β_sensible(γ)
= zeros(eltype(γ),3) #<- we let the array take the same type as the argument γ
β 1] = exp(γ[1]+γ[2])
β[2] = γ[2]
β[3] = γ[1]
β[return β
end
function get_β_also_sensible(γ)
return [exp(γ[1]+γ[2]) γ[2] γ[1]]
end
L(γ) = log_likelihood(Y,X,get_β_sensible(γ))
L2(γ) = log_likelihood(Y,X,get_β_also_sensible(γ))
L(γ) L2(γ)] [
1×2 Matrix{Float64}:
-0.337133 -0.337133
gradient(L,γ) ForwardDiff.gradient(L2,γ)] [ForwardDiff.
2×2 Matrix{Float64}:
0.0175352 0.0175352
0.024424 0.024424
So now we have something we can pass to the optimizer and there will be no drama when taking derivatives.
= 2_000
N = simulate_probit(β,N)
Y,X = optimize(x->-log_likelihood(Y,X,get_β_sensible(x)),γ,Newton(),autodiff=:forward,Optim.Options(show_trace=true))
r = r.minimizer
γ_est γ_est
Iter Function value Gradient norm
0 3.436787e-01 1.513857e-02
* time: 5.0067901611328125e-5
1 3.435150e-01 1.481243e-04
* time: 0.0008308887481689453
2 3.435149e-01 5.684522e-09
* time: 0.0011279582977294922
2-element Vector{Float64}:
-0.004991197627244919
0.22372487257130555
Estimating standard errors
In class we saw that for maximum likelihood:
\[ \sqrt{N}(\hat{\theta}-\theta_{0}) \rightarrow_{d} \mathcal{N}(0,-\mathbb{E}H(w,\theta_{0})^{-1}) \]
or alternatively:
\[ \sqrt{N}(\hat{\theta}-\theta_{0}) \rightarrow_{d} \mathcal{N}(0,\mathbb{E}[s(w,\theta_{0})s(w,\theta_{0})^{T}]^{-1}) \]
We can calculate standard errors by estimating either term. In the first case we just take the hessian of the average log-likelihood at the estimate. In the second we calculate the sample covariance of the score at the estimate. The law of large numbers guarantees that either approach is consistent.
# using the Hessian
= ForwardDiff.hessian(x->log_likelihood(Y,X,get_β_sensible(x)),γ_est)
H = -inv(H) avar_est
2×2 Matrix{Float64}:
1.64561 -1.02332
-1.02332 2.00287
# using the score
= hcat((ForwardDiff.gradient(x->log_likelihood(n,Y,X,get_β_sensible(x),Normal()),γ_est) for n in 1:N)...)
S = inv(cov(S')) avar_est_2
2×2 Matrix{Float64}:
1.64537 -0.964128
-0.964128 1.89245
Notice that the estimates are close to each other but not identical. They are only exactly equal in the population limit. To get standard errors, we note that \(\hat{\theta}\) is going to be approximately normally distributed around \(\theta\) with variance equal to the asymptotic variance divided by \(N\). So for example the standard errors for \(\gamma\) are:
= sqrt.(diag(avar_est / N)) se
2-element Vector{Float64}:
0.02868456117694807
0.03164541875055554
Monte-Carlo Validation
Let’s validate the estimated variances using a monte-carlo simulation:
function monte_carlo(num_trials,γ,N)
= zeros(2,num_trials)
γ_sim = get_β_sensible(γ)
β for b in axes(γ_sim,2)
= simulate_probit(β,N)
Y,X = optimize(x->-log_likelihood(Y,X,get_β_sensible(x)),γ,Newton(),autodiff=:forward)
r :,b] = r.minimizer
γ_sim[end
return γ_sim
end
= monte_carlo(200,γ,N)
γ_sim = cov(γ_sim')
var_sim = sqrt.(diag(var_sim)) se_sim
2-element Vector{Float64}:
0.02941598344758601
0.03372272126529965
Additional Exercises
Some questions for you to ponder.
Exercise 1
Suppose you estimate the \(\beta\) parameters directly. It would be simple enough to do here. How you could you then back out estimates of \(\gamma\) that would give you the same asymptotic variance as maximum likelihood? We will see how to do this when we study Optimal Minimum Distance
Exercise 2
Suppose we keep the same data-generating process, but that we misspecify the model as: \[ \beta = [\gamma_{1}+\gamma_{2},\ \gamma_{1},\ \gamma_{2}] \] The monte-carlo simulation below shows that the estimates of this misspecified model are still approximately normal. Which assumption is now violated and what formula for the asymptotic variance should we use instead for this pseudo-likelihood?
get_β_misspecified(γ) = [γ[1]+γ[2],γ[2],γ[1]]
function monte_carlo_misspecified(num_trials,γ,N)
= zeros(2,num_trials)
γ_sim = get_β_sensible(γ)
β for b in axes(γ_sim,2)
= simulate_probit(β,N)
Y,X = optimize(x->-log_likelihood(Y,X,get_β_misspecified(x)),γ,Newton(),autodiff=:forward)
r :,b] = r.minimizer
γ_sim[end
return γ_sim
end
= monte_carlo_misspecified(200,γ,N)
γ_sim histogram(γ_sim[1,:])