Introduction to Julia, Automatic Differentiation and Optimization
We are going to work through an example of “identification via functional form” that we will see in class. The model looks as follows:
\[ Y = X\beta + \alpha D + \epsilon - \varphi\nu \]
and
\[ D = \mathbf{1}\{X\gamma - \nu \geq 0\} \]
where \(\epsilon\) and \(\nu\) are independent with \(\epsilon\sim\mathcal{N}(0,\sigma^2_{\epsilon})\) and \(\nu\sim\mathcal{N}(0,1)\). This is equivalent to writing:
\[ P(D = 1 \mid X) = \Phi(X\gamma) \]
where \(\Phi\) denotes the standard normal CDF.
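Since the examples below work with simulated data, here is a minimal sketch of a simulator consistent with this model. The function name sim_data and its keyword arguments match the calls later in these notes, but the implementation and the parameter values shown are illustrative guesses rather than the exact ones used in class.

```julia
using Distributions

# Simulate (Y, D) from the model: D = 1{Xγ - ν ≥ 0}, Y = Xβ + αD + ϵ - φν
function sim_data(X; β, γ, α, φ, σ_ϵ)
    N = size(X, 1)
    ν = rand(Normal(), N)            # ν ~ N(0, 1)
    ϵ = σ_ϵ * rand(Normal(), N)      # ϵ ~ N(0, σ_ϵ²)
    D = X * γ .- ν .>= 0             # selection indicator
    Y = X * β .+ α .* D .+ ϵ .- φ .* ν
    return Y, D
end

# Hypothetical parameter values and a small simulated sample (the true values
# used in the original notes are not shown here):
β = [1.0, 0.5]; γ = [0.2, 0.8]; α = 1.0; φ = 0.5; σ_ϵ = 1.0
N = 1000
X = [ones(N) 2 * rand(Normal(), N)]
Y, D = sim_data(X; β, γ, α, φ, σ_ϵ)
```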
To start, let’s think about estimating the parameters \(\gamma\) using maximum likelihood. We will establish the properties of this estimator in class. Here let’s just focus on how to attack the minimization problem numerically. The log-likelihood of the data \(D\) given \(X\) is:
\[ \ell(\gamma) = \sum_{i=1}^{N}\left[D_{i}\log\Phi(X_{i}\gamma) + (1-D_{i})\log\big(1-\Phi(X_{i}\gamma)\big)\right] \]
The gradient of this log-likelihood with respect to \(\gamma\),
\[ \frac{\partial \ell}{\partial \gamma} = \sum_{i=1}^{N}\phi(X_{i}\gamma)\left[\frac{D_{i}}{\Phi(X_{i}\gamma)} - \frac{1-D_{i}}{1-\Phi(X_{i}\gamma)}\right]X_{i}' \]
(where \(\phi\) is the standard normal density) is often referred to as the “score” of the likelihood.
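As a concrete reference for the estimation code that follows, here is one way the probit log-likelihood and its score could be written; the name log_likelihood matches its use below, while score is an assumed helper for the hand-coded derivative.

```julia
using Distributions

# Probit log-likelihood of D given X, evaluated at γ
function log_likelihood(D, X, γ)
    Φxg = cdf.(Normal(), X * γ)
    return sum(D .* log.(Φxg) .+ (1 .- D) .* log.(1 .- Φxg))
end

# Hand-coded score: gradient of the log-likelihood with respect to γ
function score(D, X, γ)
    xg = X * γ
    w = pdf.(Normal(), xg) .* (D ./ cdf.(Normal(), xg) .- (1 .- D) ./ (1 .- cdf.(Normal(), xg)))
    return X' * w
end
```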
Numerical Optimization and Automatic Differentiation
Optimization is most efficient when we have access to the first and second order derivatives of the function. There is a general class of hill-climbing (or descent, in the case of minimization) algorithms that find new guesses \(\gamma_{k+1}\) given \(\gamma_{k}\) as:
\[ \gamma_{k+1} = \gamma_{k} - \lambda_{k}A_{k}\frac{\partial Q}{\partial \gamma}(\gamma_{k}) \]
where \(Q\) is the function being maximized (or minimized). \(A_{k}\) defines a direction in which to search (providing weights on the derivatives) and \(\lambda_{k}\) is a scalar known as the step size, which is often calculated optimally in each iteration \(k\). For Newton’s method, the matrix \(A_{k}\) is the inverse of the Hessian of the objective function \(Q\). Since the Hessian can sometimes be expensive to calculate, other methods use approximations to it that are cheaper to compute.
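To make the update rule concrete, here is a minimal, illustrative sketch of a single Newton-type step with the step size held fixed rather than chosen optimally; the function name newton_step is hypothetical and the derivatives are obtained with the ForwardDiff package:

```julia
using ForwardDiff

# One Newton-type step: γ_{k+1} = γ_k - λ * H⁻¹ * ∇Q(γ_k), with A_k the inverse Hessian
# and the step size λ fixed (a line search would normally choose it each iteration).
function newton_step(Q, γ_k; λ = 1.0)
    g = ForwardDiff.gradient(Q, γ_k)   # first derivatives of Q at γ_k
    H = ForwardDiff.hessian(Q, γ_k)    # second derivatives of Q at γ_k
    return γ_k - λ * (H \ g)           # H \ g applies the inverse Hessian via a linear solve
end
```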
Since we have a simple model, we can calculate derivatives relatively easily. However, Julia and many other languages now provide packages for Automatic Differentiation, which essentially trace all of the operations inside the function and apply the chain rule. It is very quick! Below we’ll compare a hard-coded derivative to this automatic differentiation.
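The comparison code itself is not reproduced here, but it might look something like the following sketch, which benchmarks the hand-coded score from above against a ForwardDiff gradient of the same log-likelihood (BenchmarkTools is assumed to be installed, and D and X are the simulated data):

```julia
using ForwardDiff, BenchmarkTools

# Automatic differentiation of the log-likelihood with respect to γ
ad_score(D, X, γ) = ForwardDiff.gradient(g -> log_likelihood(D, X, g), γ)

γ_test = zeros(2)

@btime score($D, $X, $γ_test)      # hand-coded derivative
@btime ad_score($D, $X, $γ_test)   # automatic differentiation
```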
Both are quite quick but you can see that we’re not losing much with automatic differentiation. In my experience, the gap between the two methods can narrow for more complicated functions.
Numerical Optimization using Optim
So now let’s try implementing the maximum likelihood estimator using two different gradient-based algorithms: Newton’s Method (which uses the Hessian), and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (which updates the search direction using changes in the first derivative). For maximum likelihood there is also the Berndt–Hall–Hall–Hausman (BHHH) algorithm, which uses the following equality that we will derive in class:
\[ E\left[\frac{\partial^{2}\ell_{i}}{\partial\gamma\,\partial\gamma'}\right] = -E\left[\frac{\partial \ell_{i}}{\partial\gamma}\frac{\partial \ell_{i}}{\partial\gamma'}\right] \]
This formula states that the Hessian of the log-likelihood is equal to the negative covariance of the score. This is a method that only works when maximizing a log-likelihood. Unfortunately Optim does not provide an implementation of BHHH, but it is worth knowing about.
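For reference only, a single BHHH-style update for this probit log-likelihood could be sketched roughly as below, replacing the Hessian with the negative outer product of the observation-level scores; this is an illustrative sketch rather than code from the notes:

```julia
using Distributions

# One BHHH step: approximate the Hessian by H ≈ -S'S, where row i of S is
# observation i's score, so the Newton step -H⁻¹∇ℓ becomes +(S'S)⁻¹∇ℓ.
function bhhh_step(D, X, γ; λ = 1.0)
    xg = X * γ
    w = pdf.(Normal(), xg) .* (D ./ cdf.(Normal(), xg) .- (1 .- D) ./ (1 .- cdf.(Normal(), xg)))
    S = X .* w                      # N×K matrix of observation-level scores
    g = vec(sum(S, dims = 1))       # total score ∇ℓ(γ)
    return γ + λ * ((S' * S) \ g)   # BHHH update with fixed step size λ
end
```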
While Newton’s method requires calculation of the Hessian (second derivatives), BFGS and related methods only require first derivatives. Typically, this makes each iteration quicker, but the algorithm may need more iterations to converge. Let’s test them.
```julia
using Optim

min_objective(x) = -log_likelihood(D, X, x) #<- Optim assumes that we will minimize a function, hence the negative
γ_guess = zeros(2)

println(" ---- Using Newton's Method ------ ")
res1 = optimize(min_objective, γ_guess, Newton(), autodiff = :forward, Optim.Options(show_trace = true))

println(" ---- Using BFGS ------ ")
res2 = optimize(min_objective, γ_guess, BFGS(), autodiff = :forward, Optim.Options(show_trace = true))

[res1.minimizer res2.minimizer γ]
```
Normality of \(\nu\) delivers a closed form for the conditional expectation of the outcome:
\[ E[Y \mid X, D] = X\beta + \alpha D - \varphi\left[(1-D)\frac{\phi(X\gamma)}{1-\Phi(X\gamma)} - D\frac{\phi(X\gamma)}{\Phi(X\gamma)}\right] \]
which allows \(\beta\) and \(\alpha\) to be estimated by virtue of a functional form assumption on the distribution of unobservables. Numerically we can test this by adding the selection correction term to a regression of \(Y\) on \(X\) and \(D\):
```julia
using LinearAlgebra

γ_est = res1.minimizer
xg = X * γ_est
selection_correction = (1 .- D) .* pdf.(Normal(), xg) ./ (1 .- cdf.(Normal(), xg)) .- D .* pdf.(Normal(), xg) ./ cdf.(Normal(), xg)
X2 = [X D selection_correction]
b_est = inv(X2' * X2) * X2' * Y
[b_est [β; α; -φ]]
```
At first this seems way off, but if we increase the sample size by an order of magnitude we should (sampling error aside) see the estimates move much closer to the true values, in line with the estimator being consistent:
```julia
N = 10000
X = [ones(N) 2 * rand(Normal(), N)]
Y, D = sim_data(X; β, γ, α, φ, σ_ϵ)

res1 = optimize(min_objective, γ_guess, Newton(), autodiff = :forward, Optim.Options(show_trace = true))
γ_est = res1.minimizer
xg = X * γ_est
selection_correction = (1 .- D) .* pdf.(Normal(), xg) ./ (1 .- cdf.(Normal(), xg)) .- D .* pdf.(Normal(), xg) ./ cdf.(Normal(), xg)
X2 = [X D selection_correction]
b_est = inv(X2' * X2) * X2' * Y
[b_est [β; α; -φ]]
```