Automatic differentiation (AD) is a technique for computing exact derivatives of functions specified by computer programs. Unlike symbolic differentiation (which manipulates mathematical expressions) or numerical differentiation (which uses finite differences), AD exploits the fact that every program, no matter how complex, executes a sequence of elementary operations. By applying the chain rule systematically to these operations, AD computes derivatives to machine precision.
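As a toy illustration of this principle (a sketch only, not how ForwardDiff.jl is implemented internally, though it rests on the same dual-number idea), we can carry a value and a derivative together through each elementary operation:
struct Dual
    val::Float64
    der::Float64
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)              # sum rule
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der*b.val + a.val*b.der)  # product rule
h(x) = x*x + x
h(Dual(3.0, 1.0))   # Dual(12.0, 7.0): h(3) = 12 and h'(3) = 2*3 + 1 = 7
Seeding der = 1 on the input means every operation propagates the exact derivative alongside the value; this chain-rule bookkeeping is what forward-mode AD automates.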
In structural econometrics, we frequently need gradients for maximum likelihood and other extremum estimators, whose objective functions must be optimized numerically over many parameters.
Hand-coding derivatives is tedious and error-prone. Finite differences are slow (requiring \(O(n)\) function evaluations for an \(n\)-dimensional gradient) and can be numerically unstable. AD provides exact gradients efficiently.
AD comes in two flavors:
Forward mode propagates derivatives forward through the computation. For a function \(f: \mathbb{R}^n \to \mathbb{R}^m\), computing the full Jacobian requires \(n\) forward passes. This is efficient when \(n \ll m\).
Reverse mode propagates derivatives backward (like backpropagation in neural networks). Computing the full Jacobian requires \(m\) reverse passes. This is efficient when \(m \ll n\).
For most estimation problems, we have a scalar objective (\(m = 1\)) and many parameters (\(n\) large), so reverse mode is typically preferred. In my experience, however, I have had more success writing code that is compatible with forward-mode differentiation. You will learn from experience that these tools can be fussy.
Julia’s AD ecosystem is excellent. The main packages are:
ForwardDiff.jl provides forward-mode AD. It is simple and robust, and works out of the box for most pure Julia code.
using ForwardDiff
f(x) = sum(x.^2)
x = [1.0, 2.0, 3.0]
# Gradient
ForwardDiff.gradient(f, x)
3-element Vector{Float64}:
2.0
4.0
6.0
# Hessian
ForwardDiff.hessian(f, x)
3×3 Matrix{Float64}:
2.0 0.0 0.0
0.0 2.0 0.0
0.0 0.0 2.0
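ForwardDiff also handles vector-valued functions through ForwardDiff.jacobian. A small sketch (the function g below is just an illustration):
# Jacobian of g: R^3 -> R^2 (a 2×3 matrix)
g(x) = [x[1] * x[2], sin(x[3])]
ForwardDiff.jacobian(g, [1.0, 2.0, 3.0])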
Enzyme.jl is a high-performance AD engine that works at the LLVM level. It supports both forward and reverse mode and is often the fastest option, especially for code with loops and in-place mutation.
using Enzyme
f(x) = x[1]^2 + sin(x[2])
x = [1.0, 2.0]
dx = zeros(2)
# Reverse mode gradient
Enzyme.autodiff(Reverse, f, Active, Duplicated(x, dx))
dx
2-element Vector{Float64}:
2.0
-0.4161468365471424
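One practical detail worth knowing: Enzyme accumulates gradients into the shadow vector dx rather than overwriting it, so reset it before differentiating again. A small usage sketch:
fill!(dx, 0.0)                                          # zero the shadow before reusing it
Enzyme.autodiff(Reverse, f, Active, Duplicated(x, dx))
dx                                                      # the gradient at x again, not doubled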
Zygote.jl is a source-to-source reverse-mode AD system, popular in machine learning (it is used by Flux.jl). It works well for array-heavy code but does not support array mutation and may struggle with scalar-heavy control flow.
using Zygote
f(x) = sum(x.^2)
x = [1.0, 2.0, 3.0]
Zygote.gradient(f, x)
([2.0, 4.0, 6.0],)
Start with ForwardDiff for problems with few parameters (< 100). It’s the most reliable.
Use Enzyme for performance-critical code, especially if you have loops or in-place mutations.
Be aware of limitations: AD systems can fail on code that uses certain constructs (try-catch, foreign function calls, some global variables). When in doubt, test that your gradients match finite differences:
using ForwardDiff, FiniteDiff
f(x) = log(1 + exp(x[1] * x[2])) + x[3]^2
x = [1.0, 2.0, 3.0]
ad_grad = ForwardDiff.gradient(f, x)
fd_grad = FiniteDiff.finite_difference_gradient(f, x)
maximum(abs.(ad_grad .- fd_grad))
5.3512749786932545e-12
Most Julia optimization packages accept AD gradients. Here’s an example with Optim.jl:
using Optim, ForwardDiff
rosenbrock(x) = (1 - x[1])^2 + 100*(x[2] - x[1]^2)^2
x0 = [0.0, 0.0]
# With automatic gradients via ForwardDiff
result = optimize(rosenbrock, x0, LBFGS(); autodiff = :forward)
result.minimizer
2-element Vector{Float64}:
0.999999999999928
0.9999999999998559
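If a solver does not offer an autodiff option, you can usually supply the gradient yourself. A minimal sketch, where g! is our own in-place gradient helper built with ForwardDiff (Optim expects the signature g!(G, x)):
function g!(G, x)
    ForwardDiff.gradient!(G, rosenbrock, x)   # write the gradient into G in place
end
result = optimize(rosenbrock, g!, x0, LBFGS())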
As a more extended example with Optim.jl, consider a simple probit model:
\[ D = \mathbf{1}\{X\gamma - \nu \geq 0\},\qquad \nu \sim \mathcal{N}(0,1) \]
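Since \(\nu\) is standard normal, the conditional probability of observing \(D = 1\) is
\[ P(D = 1 \mid X) = P(\nu \leq X\gamma) = \Phi(X\gamma), \]
which is the expression that will enter the likelihood below.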
Here is code to simulate data for this model:
using Random, Distributions
function sim_data(X ; γ)
    N = size(X,1)
    ν = rand(Normal(),N)      # standard normal shocks
    D = (X * γ .- ν) .> 0     # choice indicator
    return D
end
# a quick test of the function:
N = 1000
X = [ones(N) 2*rand(Normal(),N)]
γ = [0.1, 0.5]
D = sim_data(X ; γ);
Consider the problem of estimating \(\gamma\) using maximum likelihood. We will establish the properties of this estimator in class. Here let’s just focus on how to attack the optimization problem numerically. The log-likelihood of the data D given X is:
\[ \mathcal{L}(\gamma) = \sum_{n=1}^{N}l(D_{n}; X_{n},\gamma) = \sum_{n=1}^{N}\left[D_{n}\log(\Phi(X_{n}\gamma)) + (1-D_{n})\log(1-\Phi(X_{n}\gamma))\right] \]
Let’s write up this likelihood function.
function log_likelihood(D,X,γ)
    ll = 0.
    for n in eachindex(D)
        xg = X[n,1] * γ[1] + X[n,2] * γ[2]   # the index X_n * γ
        if D[n]
            ll += log(cdf(Normal(),xg))
        else
            ll += log(1-cdf(Normal(),xg))
        end
    end
    return ll
end
log_likelihood(D,X,[0.,0.])
-693.1471805599322
Optimization is most efficient when we have access to the first and second order derivatives of the function. There is a general class of hill-climbing (or descent in the case of minimization) algorithms that find new guesses \(\gamma_{k+1}\) given \(\gamma_{k}\) as:
\[ \gamma_{k+1} = \gamma_{k} + \lambda_{k}A_{k}\frac{\partial Q}{\partial \gamma} \]
where \(Q\) is the function being maximized (or minimized). \(A_{k}\) defines a direction in which to search (providing weights on the derivatives), and \(\lambda_{k}\) is a scalar known as the step size, which is often chosen optimally at each iteration \(k\). For Newton’s method, the matrix \(A_{k}\) is (minus) the inverse of the Hessian of the objective function \(Q\). Since the Hessian can sometimes be expensive to calculate, other methods use approximations to the Hessian that are cheaper to compute.
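To make the update rule concrete, here is a minimal sketch of a single Newton step for the log-likelihood defined above, with the gradient and Hessian supplied by ForwardDiff (newton_step is just an illustrative helper, not a full optimizer with a line search):
using ForwardDiff
function newton_step(γ, D, X)
    Q(g) = log_likelihood(D, X, g)
    grad = ForwardDiff.gradient(Q, γ)
    hess = ForwardDiff.hessian(Q, γ)
    return γ .- hess \ grad   # ascent step: since Q is concave, -inv(hess) plays the role of A_k (with λ_k = 1)
end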
Since we have a simple model, we can calculate derivatives relatively easily. Below we’ll compare a hard-coded derivative to automatic differentiation.
using ForwardDiff, Enzyme
function deriv_ll(D,X,γ)
    dll = zeros(2)
    for n in eachindex(D)
        xg = X[n,1] * γ[1] + X[n,2] * γ[2]
        if D[n]
            dl = pdf(Normal(),xg) / cdf(Normal(),xg)         # ∂l/∂(xg) when D = 1
        else
            dl = - pdf(Normal(),xg) / (1 - cdf(Normal(),xg)) # ∂l/∂(xg) when D = 0
        end
        dll[1] += X[n,1] * dl
        dll[2] += X[n,2] * dl
    end
    return dll
end
dx = zeros(2)
# forward mode
auto_deriv_ll(D,X,γ) = ForwardDiff.gradient(x->log_likelihood(D,X,x),γ)
# reverse mode
auto_deriv2_ll(D,X,γ,dx) = Enzyme.autodiff(Reverse, x->log_likelihood(D,X,x), Active, Duplicated(γ, dx))
d1 = deriv_ll(D,X,γ)
d2 = auto_deriv_ll(D,X,γ)
auto_deriv2_ll(D,X,γ,dx)
[d1 d2 dx]
2×3 Matrix{Float64}:
11.1675 11.1675 11.1675
59.9985 59.9985 59.9985
Ok so we’re confident that these functions work as intended, but how do they compare in performance?
@time deriv_ll(D,X,γ);
@time auto_deriv_ll(D,X,γ);
@time auto_deriv2_ll(D,X,γ,dx);
0.000028 seconds (2 allocations: 80 bytes)
0.000044 seconds (7 allocations: 304 bytes)
0.000038 seconds
All are quite quick, and you can see that we’re not losing much with automatic differentiation. In my experience, the gap between hand-coded and automatic derivatives can narrow further for more complicated functions.
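One caveat on these timings: @time measures a single call and, on the first call, includes compilation time. For more careful comparisons, BenchmarkTools.jl (an extra package, not used elsewhere in these notes) runs many samples. A sketch:
using BenchmarkTools
@btime deriv_ll($D, $X, $γ);       # hand-coded gradient
@btime auto_deriv_ll($D, $X, $γ);  # ForwardDiff gradient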
So now let’s try implementing the maximum likelihood estimator using two different gradient-based algorithms: Newton’s method (which uses the Hessian) and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm (which updates the search direction using changes in the first derivative).
While Newton’s method requires calculation of the Hessian (second derivatives), BFGS and related methods only require first derivatives. Typically this makes each iteration cheaper, but the algorithm takes more iterations to converge. Let’s test them.
using Optim
min_objective(x) = -log_likelihood(D,X,x) #<- Optim assumes that we will minimize a function, hence the negative
γ_guess = zeros(2)
println(" ---- Using Newton's Method ------ ")
res1 = optimize(min_objective,γ_guess,Newton(),autodiff=:forward,Optim.Options(show_trace=true))
println(" ---- Using BFGS ------ ")
res2 = optimize(min_objective,γ_guess,BFGS(),autodiff=:forward,Optim.Options(show_trace=true))
[res1.minimizer res2.minimizer γ]
 ---- Using Newton's Method ------ 
Iter Function value Gradient norm
0 6.931472e+02 9.429516e+02
* time: 0.012720108032226562
1 4.873710e+02 1.564794e+02
* time: 0.43104004859924316
2 4.693793e+02 9.133735e+00
* time: 0.4315500259399414
3 4.693308e+02 2.431698e-03
* time: 0.4318211078643799
4 4.693308e+02 4.359872e-09
* time: 0.43204617500305176
---- Using BFGS ------
Iter Function value Gradient norm
0 6.931472e+02 9.429516e+02
* time: 7.987022399902344e-5
1 4.849950e+02 1.434033e+02
* time: 0.00822591781616211
2 4.825821e+02 1.387009e+02
* time: 0.00848388671875
3 4.696142e+02 2.158030e+01
* time: 0.008702993392944336
4 4.693308e+02 1.473504e-01
* time: 0.008884906768798828
5 4.693308e+02 7.674024e-03
* time: 0.0090789794921875
6 4.693308e+02 3.727930e-07
* time: 0.009229898452758789
7 4.693308e+02 8.768958e-13
* time: 0.009376049041748047
2×3 Matrix{Float64}:
0.129586 0.129586 0.1
0.56237 0.56237 0.5