Introduction to Julia, Automatic Differentiation and Optimization

We are going to work through an example of “identification via functional form” that we will see in class. The model looks as follows:

\[ Y = X\beta + \alpha D + \epsilon - \varphi\nu \]

and

\[ D = \mathbf{1}\{X\gamma - \nu \geq 0\} \]

where \(\epsilon\) and \(\nu\) are independent with \(\epsilon\sim\mathcal{N}(0,\sigma^2_{\epsilon})\) and \(\nu\sim\mathcal{N}(0,1)\). This is equivalent to writing:

\[ Y = X\beta + \alpha D + \xi \]

with

\[ \left[\begin{array}{c}\xi \\ \nu \end{array}\right] \sim \mathcal{N}\left(0,\left[\begin{array}{cc}\sigma^2_{\epsilon} + \varphi^2 & -\varphi \\ -\varphi & 1\end{array}\right]\right).\]
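
To see where this covariance matrix comes from: since \(\xi = \epsilon - \varphi\nu\) with \(\epsilon\) independent of \(\nu\), we have \(\text{Var}(\xi) = \sigma^2_{\epsilon} + \varphi^2\) and \(\text{Cov}(\xi,\nu) = -\varphi\). A quick Monte Carlo check of this (a small sketch using the illustrative values \(\sigma_{\epsilon} = 0.5\) and \(\varphi = 1\), which are not fixed by the model):

using Distributions, Statistics

ν_check = rand(Normal(), 1_000_000)
ϵ_check = 0.5 * rand(Normal(), 1_000_000)   # σ_ϵ = 0.5 (illustrative)
ξ_check = ϵ_check .- 1.0 .* ν_check         # φ = 1 (illustrative)
var(ξ_check), cov(ξ_check, ν_check)         # should be ≈ 1.25 and ≈ -1.0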

Let’s start by writing some code to simulate data from this simple selection model:

using Random, Distributions

function sim_data(X ; β,γ,α,φ,σ_ϵ)
    N = size(X,1)
    ν = rand(Normal(),N)             # selection shock
    ϵ = σ_ϵ * rand(Normal(),N)       # outcome shock
    D = (X * γ .- ν) .> 0            # selection rule: D = 1{Xγ - ν ≥ 0}
    Y = X * β .+ α * D .+ ϵ .- φ*ν   # outcome equation
    return Y, D
end
sim_data (generic function with 1 method)

Let’s quickly test this function by selecting some default parameters.

N = 1000
X = [ones(N) 2*rand(Normal(),N)]
β = [0., 1.]
γ = [0.1, 0.5]
φ = 1.
α = 0.6
σ_ϵ = 0.5

Y, D = sim_data(X ; β,γ,α,φ,σ_ϵ)
([1.132860301951354, -3.6466921358145283, -2.7966645365786507, 1.074252425709782, 5.251260632435156, 5.333302836895356, 1.048140172519132, -0.012316247760416288, 1.0187652704332761, 3.398234585007554  …  3.1520102965904524, 6.306622263324812, 3.810961616754668, 0.6280107123658936, 0.23612972322392936, -0.735663499435032, 3.6357910280620107, 1.5955063606600457, 2.4628872110429034, 1.1654458069287195], Bool[1, 0, 0, 1, 1, 1, 1, 0, 1, 1  …  1, 1, 1, 1, 1, 0, 1, 1, 1, 1])
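
As a quick sanity check (a small sketch, not part of the original code), the share of selected observations should line up with the average selection probability implied by the model, \(P(D=1|X) = \Phi(X\gamma)\):

using Statistics

# These two numbers should be close (up to simulation error)
mean(D), mean(cdf.(Normal(), X * γ))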

To start, let’s think about estimating the parameters \(\gamma\) using maximum likelihood. We will establish the properties of this estimator in class. Here let’s just focus on how to attack the optimization problem numerically. The log-likelihood of the data D given X is:

\[ \mathcal{L}(\gamma) = \sum_{n}l(D_{n}; X_{n},\gamma) = \sum_{n=1}^{N}D_{n}\log(\Phi(X_{n}\gamma)) + (1-D_{n})\log(1-\Phi(X_{n}\gamma)) \]

Let’s write up this likelihood function.

function log_likelihood(D,X,γ)
    ll = 0.
    for n in eachindex(D)
        xg = X[n,1] * γ[1] + X[n,2] * γ[2] 
        if D[n]
            ll += log(cdf(Normal(),xg))
        else
            ll += log(1-cdf(Normal(),xg))
        end
    end
    return ll
end
log_likelihood(D,X,[0.,0.])
-693.1471805599322
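
One practical note: for large values of \(|X_{n}\gamma|\), cdf(Normal(),xg) can underflow to zero and log(0) returns -Inf. Distributions also provides logcdf and logccdf, which compute these logs directly and more stably. A sketch of an alternative version using them (the version above is fine for this example):

function log_likelihood_stable(D,X,γ)
    ll = 0.
    for n in eachindex(D)
        xg = X[n,1] * γ[1] + X[n,2] * γ[2]
        # logcdf gives log(Φ(xg)) and logccdf gives log(1 - Φ(xg)) without underflow
        ll += D[n] ? logcdf(Normal(),xg) : logccdf(Normal(),xg)
    end
    return ll
end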

A necessary condition for the maximum likelihood estimator is:

\[ \frac{\partial \mathcal{L}}{\partial \gamma} = \sum_{n}s_{n}(\hat{\gamma}) = 0\]

where

\[s_{n}(\gamma) = \frac{\partial l(D_{n} ; X_{n},\gamma)}{\partial \gamma} \]

is often referred to as the “score” of the likelihood.
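
For the probit log-likelihood above, differentiating \(l(D_{n};X_{n},\gamma)\) with respect to \(\gamma\) (using \(\Phi' = \phi\)) gives the score explicitly:

\[ s_{n}(\gamma) = \left[D_{n}\frac{\phi(X_{n}\gamma)}{\Phi(X_{n}\gamma)} - (1-D_{n})\frac{\phi(X_{n}\gamma)}{1-\Phi(X_{n}\gamma)}\right]X_{n}' \]

This is exactly the expression that the hand-coded derivative below implements.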

Numerical Optimization and Automatic Differentiation

Optimization is most efficient when we have access to the first and second order derivatives of the function. There is a general class of hill-climbing (or descent in the case of minimization) algorithms that find new guesses \(\gamma_{k+1}\) given \(\gamma_{k}\) as:

\[ \gamma_{k+1} = \gamma_{k} + \lambda_{k}A_{k}\frac{\partial Q}{\partial \gamma} \]

where \(Q\) is the function being maximized (or minimized). \(A_{k}\) defines a direction in which to search (providing weights on the derivatives) and \(\lambda_{k}\) is a scalar known as the step size, which is often chosen optimally at each iteration \(k\). For Newton’s method, \(A_{k}\) is minus the inverse of the Hessian of the objective function \(Q\), so that the update is \(\gamma_{k+1} = \gamma_{k} - \lambda_{k}H_{k}^{-1}\frac{\partial Q}{\partial \gamma}\). Since the Hessian can sometimes be expensive to calculate, other methods use cheaper approximations to it.
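
As a concrete illustration, here is a minimal sketch of a Newton update loop, assuming we supply functions grad and hess that return the first and second derivatives of \(Q\) (this is just the bare logic, not the Optim implementation we use below):

function newton_maximize(grad, hess, γ0; maxiter = 100, tol = 1e-8)
    γ = copy(γ0)
    for k in 1:maxiter
        g = grad(γ)
        maximum(abs, g) < tol && break     # stop when the gradient is (numerically) zero
        γ = γ + hess(γ) \ (-g)             # Newton step: A_k = -H^{-1}, step size λ_k = 1
    end
    return γ
end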

Since we have a simple model, we can calculate derivatives relatively easily. However, Julia and many other languages now provide packages for automatic differentiation, which essentially trace all of the operations inside a function and apply the chain rule. It is very quick! Below we’ll compare a hard-coded derivative to this automatic differentiation.

using ForwardDiff

function deriv_ll(D,X,γ)
    dll = zeros(2)
    for n in eachindex(D)
        xg = X[n,1] * γ[1] + X[n,2] * γ[2] 
        if D[n]
            dl = pdf(Normal(),xg) / cdf(Normal(),xg)
        else
            dl = - pdf(Normal(),xg) / (1 - cdf(Normal(),xg))
        end
        dll[1] += X[n,1] * dl
        dll[2] += X[n,2] * dl            
    end
    return dll
end

auto_deriv_ll(D,X,γ) = ForwardDiff.gradient(x->log_likelihood(D,X,x),γ)

d1 = deriv_ll(D,X,γ)
d2 = auto_deriv_ll(D,X,γ)
[d1 d2]
2×2 Matrix{Float64}:
 -31.4069  -31.4069
 -29.5423  -29.5423

Ok so we’re confident that these functions work as intended, but how do they compare in performance?

@time deriv_ll(D,X,γ);
@time auto_deriv_ll(D,X,γ);
  0.000028 seconds (1 allocation: 80 bytes)
  0.000043 seconds (6 allocations: 352 bytes)

Both are quite quick but you can see that we’re not losing much with automatic differentiation. In my experience, the gap between the two methods can narrow for more complicated functions.
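
For more reliable timings than a single @time call, the BenchmarkTools package (an extra dependency, not used above) runs each call many times and reports summary statistics:

using BenchmarkTools

# Interpolating the globals with $ keeps untyped-global overhead out of the measurement
@btime deriv_ll($D, $X, $γ);
@btime auto_deriv_ll($D, $X, $γ);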

Numerical Optimization using Optim

So now let’s try implementing the maximum likelihood estimator using two different gradient-based algorithms: Newton’s Method (which uses the Hessian), and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (which updates the search direction using changes in the first derivative). For maximum likelihood there is also the Berndt–Hall–Hall–Hausman (BHHH) algorithm, which uses the following equality that we will derive in class:

\[ \mathbb{E}[\nabla_{\gamma'}s(\gamma)] = -\mathbb{E}[s(\gamma)s(\gamma)'] \]

This formula states that the expected Hessian of the log-likelihood is equal to the negative of the expected outer product of the score (its covariance, since the score has mean zero at the true parameter). It only applies when the objective is a log-likelihood. Unfortunately, Optim does not provide an implementation of BHHH, but it is worth knowing about.
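
To see the idea concretely, here is a sketch (not an Optim routine) that stacks the per-observation scores \(s_{n}(\gamma)\) into a matrix \(S\) and forms the BHHH approximation \(-S'S\) to the Hessian:

function bhhh_hessian_approx(D,X,γ)
    S = zeros(length(D), length(γ))
    for n in eachindex(D)
        xg = X[n,1] * γ[1] + X[n,2] * γ[2]
        dl = D[n] ? pdf(Normal(),xg) / cdf(Normal(),xg) : -pdf(Normal(),xg) / (1 - cdf(Normal(),xg))
        S[n,1] = X[n,1] * dl       # row n of S is the score of observation n
        S[n,2] = X[n,2] * dl
    end
    return -S' * S                 # BHHH: outer product of scores replaces the Hessian
end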

While Newton’s method requires calculation of the Hessian (second derivatives), BFGS and related methods only require first derivatives. Typically, this makes each iteration cheaper but requires more iterations to converge. Let’s test them.

using Optim
min_objective(x) = -log_likelihood(D,X,x) #<- Optim assumes that we will minimize a function, hence the negative
γ_guess = zeros(2)
println(" ---- Using Newton's Method ------ ")
res1 = optimize(min_objective,γ_guess,Newton(),autodiff=:forward,Optim.Options(show_trace=true))
println(" ---- Using BFGS ------ ")
res2 = optimize(min_objective,γ_guess,BFGS(),autodiff=:forward,Optim.Options(show_trace=true))
[res1.minimizer res2.minimizer γ]
 ---- Using Newton's Method ------ 
Iter     Function value   Gradient norm 
     0     6.931472e+02     9.072234e+02
 * time: 4.696846008300781e-5
     1     5.170501e+02     1.388497e+02
 * time: 0.0003650188446044922
     2     5.074227e+02     2.790597e+00
 * time: 0.0006308555603027344
     3     5.074194e+02     1.179917e-04
 * time: 0.0008649826049804688
     4     5.074194e+02     9.096168e-12
 * time: 0.0010650157928466797
 ---- Using BFGS ------ 
Iter     Function value   Gradient norm 
     0     6.931472e+02     9.072234e+02
 * time: 7.295608520507812e-5
     1     5.238897e+02     1.767467e+02
 * time: 0.00039386749267578125
     2     5.236553e+02     1.761320e+02
 * time: 0.0006258487701416016
     3     5.076641e+02     2.366312e+01
 * time: 0.0008478164672851562
     4     5.074194e+02     7.988743e-02
 * time: 0.0010349750518798828
     5     5.074194e+02     4.225729e-03
 * time: 0.0012378692626953125
     6     5.074194e+02     7.219726e-08
 * time: 0.0014328956604003906
     7     5.074194e+02     2.127742e-13
 * time: 0.0016629695892333984
2×3 Matrix{Float64}:
 0.0335542  0.0335542  0.1
 0.472962   0.472962   0.5
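
A small usage note: instead of reading fields like res1.minimizer directly, Optim also exposes accessor functions that return the same information:

# the estimate, the minimized objective value, and a convergence flag
Optim.minimizer(res1), Optim.minimum(res1), Optim.converged(res1)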

Identification via Functional Form

In class we will see that

\[ \mathbb{E}[Y|X,D] = X\beta + \alpha D - \varphi\left[(1-D)\frac{\phi(X\gamma)}{1-\Phi(X\gamma)} - D\frac{\phi(X\gamma)}{\Phi(X\gamma)}\right] \]

which allows \(\beta\) and \(\alpha\) to be estimated by virtue of a functional form assumption on the distribution of unobservables. Numerically we can test this by adding the selection correction term to a regression of \(Y\) on \(X\) and \(D\):

using LinearAlgebra
γ_est = res1.minimizer
xg = X * γ_est
selection_correction = (1 .- D) .* pdf.(Normal(),xg) ./ (1 .- cdf.(Normal(),xg)) .- D .* pdf.(Normal(),xg) ./ cdf.(Normal(),xg)
X2 = [X D selection_correction]
b_est = inv(X2' * X2) * X2' * Y
[b_est [β; α; -φ]]
4×2 Matrix{Float64}:
  0.313929   0.0
  1.10537    1.0
 -0.14238    0.6
 -1.44073   -1.0

At first this seems way off, but if we increase the sample size by an order of magnitude we should (up to sampling error) see estimates much closer to the truth, in line with the estimator being consistent:

N = 10000
X = [ones(N) 2*rand(Normal(),N)]
Y, D = sim_data(X ; β,γ,α,φ,σ_ϵ)
res1 = optimize(min_objective,γ_guess,Newton(),autodiff=:forward,Optim.Options(show_trace=true))
γ_est = res1.minimizer
xg = X * γ_est
selection_correction = (1 .- D) .* pdf.(Normal(),xg) ./ (1 .- cdf.(Normal(),xg)) .- D .* pdf.(Normal(),xg) ./ cdf.(Normal(),xg)
X2 = [X D selection_correction]
b_est = inv(X2' * X2) * X2' * Y
[b_est [β; α; -φ]]
Iter     Function value   Gradient norm 
     0     6.931472e+03     8.736635e+03
 * time: 7.987022399902344e-5
     1     5.159514e+03     1.376363e+03
 * time: 0.0021448135375976562
     2     5.054556e+03     3.380701e+01
 * time: 0.00448298454284668
     3     5.054503e+03     2.126288e-03
 * time: 0.005650043487548828
     4     5.054503e+03     2.817037e-10
 * time: 0.00681304931640625
4×2 Matrix{Float64}:
 -0.0458329   0.0
  0.987815    1.0
  0.654354    0.6
 -0.960434   -1.0
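
To see why the correction term matters, a quick comparison (a sketch reusing the objects defined above) is to run the same regression without it; with \(\varphi \neq 0\) the coefficient on \(D\) should be biased away from \(\alpha = 0.6\):

# Naive regression of Y on X and D, omitting the selection correction term
X_naive = [X D]
b_naive = X_naive \ Y    # backslash solves the least-squares problem; equivalent to inv(X'X)X'Y but more stable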