Recitation 6: Introduction to the Bootstrap

We are going to compute bootstrapped standard errors for our estimates of the income process parameters.
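As a quick refresher, the income process we have in mind (and that the code below estimates) is

\[
\log(y_{it}) = \mu_{t} + \eta_{it}, \qquad \eta_{it} = \rho\,\eta_{i,t-1} + \varepsilon_{it}, \qquad \varepsilon_{it} \sim (0,\sigma_{\eta}^{2}),
\]

where \(\mu_{t}\) is the mean of log income at age \(t\) and the residual \(\eta_{it}\) follows an AR(1) with persistence \(\rho\) and innovation variance \(\sigma_{\eta}^{2}\).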

Setup: Loading the Data

Nothing new here relative to previous weeks: we load the data and keep observations with ages 25 to 64.

using CSV, DataFrames, DataFramesMeta, Statistics
data = @chain begin 
    CSV.read("../data/abb_aea_data.csv",DataFrame,missingstring = "NA")
    @select :person :y :tot_assets1 :asset :age :year
    @subset :age.>=25 :age.<=64
end
19139×6 DataFrame
   Row │ person       y  tot_assets1    asset    age   year
       │  Int64   Int64        Int64  Float64  Int64  Int64
───────┼────────────────────────────────────────────────────
     1 │  17118   54000        60000      0.0     49     98
     2 │  12630   61283       224000  39283.0     59     98
     3 │  12647   42300        28240      0.0     38     98
     4 │   5239   82275         7500      0.0     56     98
     5 │   2671   69501        48000   3600.0     35     98
     6 │  13027   68000       148000  20000.0     49     98
     7 │   6791   93758        80000    160.0     41     98
     8 │   6475   26581        23300      0.0     35     98
     9 │  18332   33785            0      0.0     42     98
    10 │   3856   55300       311000   5300.0     33     98
    11 │  19326   40200       105250      0.0     40     98
    12 │  21818   42500        13000      0.0     36     98
    13 │   7300  121508       178000  10008.0     59     98
     ⋮ │      ⋮       ⋮            ⋮        ⋮      ⋮      ⋮
 19128 │   6617  115887       241000  21346.0     62    108
 19129 │    626  128600        98000      0.0     46    108
 19130 │   4795  105000       -68000      0.0     34    108
 19131 │   3223  120000       132000      0.0     47    108
 19132 │   8098   26527         4700      0.0     37    108
 19133 │   8954  144026       220000     25.0     46    108
 19134 │  12990  122665       220000      0.0     53    108
 19135 │   8782   55000        69000      0.0     31    108
 19136 │  13059   42728       -10000      0.0     26    108
 19137 │  13535   57000            0      0.0     26    108
 19138 │   3806   87000        74200      0.0     26    108
 19139 │  11085   74000       -50000      0.0     31    108
                                          19114 rows omitted

Here is a function to calculate \(\hat{\mu}_{t}\), the mean of log income at each age:

function estimate_mu(data)
    μ = @chain data begin
        groupby(:age)
        @combine :μ = mean(log.(:y)) #<- mean of log income within each age group
        @orderby :age
        _.μ #<- extract just the vector of means, ordered by age
    end
    return μ
end

μ_est = estimate_mu(data)
40-element Vector{Float64}:
 10.760860880321463
 10.841247443701185
 10.902095326248284
 10.91448929395576
 10.998556154206424
 10.953465022792393
 11.100954062096019
 11.091165981452196
 11.084065028498708
 11.063159174262802
 11.164647085813545
 11.09762896621344
 11.0958958847166
  ⋮
 11.305097087283851
 11.30167368879086
 11.289951332574214
 11.218984368534626
 11.22669701047502
 11.157589262616021
 11.212633411833302
 11.139162326045438
 11.048452223663316
 11.020306496546475
 10.955869785866497
 10.995127001069111

A function to calculate residuals and get lagged residuals:

function get_resids(data)

    # residual: log income net of the age-specific mean
    data = @chain data begin
        groupby(:age)
        @transform :resid = log.(:y) .- mean(log.(:y))
    end

    # the panel is biennial, so one lag = two calendar years: shifting :year
    # forward by 2 and joining on (:person,:year) attaches the previous wave's
    # residual as :rlag1
    d1 = @chain data begin
        @select :year :person :resid
        @transform :year = :year .+ 2
        @rename :rlag1 = :resid
    end

    # same trick for the second lag (four calendar years back)
    d2 = @chain data begin
        @select :year :person :resid
        @transform :year = :year .+ 4
        @rename :rlag2 = :resid
    end

    # keep only observations for which both lags exist
    data = @chain data begin
        innerjoin(d1 , on=[:person,:year])
        innerjoin(d2 , on=[:person,:year])
    end
    return data
end
get_resids (generic function with 1 method)
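To make the lag construction concrete, here is a minimal illustration of the shift-and-join trick on a made-up two-person panel (the toy data below is hypothetical, not from our dataset):

toy = DataFrame(person = [1,1,1,2,2], year = [98,100,102,98,100],
                resid = [0.1,0.2,0.3,-0.1,-0.2])
lag1 = @chain toy begin
    @transform :year = :year .+ 2 #<- shift each observation forward one wave...
    @rename :rlag1 = :resid       #<- ...so :rlag1 lines up with the next wave's :resid
end
innerjoin(toy, lag1, on = [:person,:year]) #<- keeps only rows that have a valid lag

The join returns three rows (person 1 at years 100 and 102, person 2 at year 100): exactly the observations for which a one-wave lag exists.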

Then, finally, a function to calculate the moments we are going to target in the minimum distance step: the variance of the residual and its covariances with the first and second lags.

function get_moments(data)
    data = get_resids(data)
    return [var(data.resid), cov(data.resid,data.rlag1), cov(data.resid,data.rlag2)]
end

m = get_moments(data)
3-element Vector{Float64}:
 0.6042229503656219
 0.3937491407622511
 0.3537264122186026

Creating a Bootstrapped Sample

Recall from class that we have to be careful to resample at the correct unit level when bootstrapping. Since observations are correlated within person, we have to resample the person identifiers with replacement. Here’s a function to create a bootstrapped panel dataset by sampling individuals with replacement.

function draw_bootstrap_sample(data)
    id = unique(data.person) #<- create a list of all person identifiers in the data
    N = length(id)
    id_boot = id[rand(1:N,N)] #<- sample N individuals randomly with replacement
    d = DataFrame(person = id_boot, id_boot = 1:N) #<- add a unique identifier for each draw (:id_boot)
    boot_data = innerjoin(d,data,on=:person) 
    @transform!(boot_data,:id_old = :person, :person = :id_boot) #<- since we create lags using the person identifier, it must be unique for each draw
    return boot_data
end

db = draw_bootstrap_sample(data)
19078×8 DataFrame
   Row │ person  id_boot       y  tot_assets1    asset    age   year  id_old
       │  Int64    Int64   Int64        Int64  Float64  Int64  Int64   Int64
───────┼─────────────────────────────────────────────────────────────────────
     1 │   4489     4489   42300        28240      0.0     38     98   12647
     2 │   4158     4158   82275         7500      0.0     56     98    5239
     3 │   3180     3180   69501        48000   3600.0     35     98    2671
     4 │    937      937   68000       148000  20000.0     49     98   13027
     5 │   4278     4278   93758        80000    160.0     41     98    6791
     6 │   4855     4855   93758        80000    160.0     41     98    6791
     7 │    590      590   26581        23300      0.0     35     98    6475
     8 │   3136     3136   55300       311000   5300.0     33     98    3856
     9 │   4937     4937   55300       311000   5300.0     33     98    3856
    10 │   2966     2966   42500        13000      0.0     36     98   21818
    11 │   1300     1300  121508       178000  10008.0     59     98    7300
    12 │   3010     3010  121508       178000  10008.0     59     98    7300
    13 │   3144     3144  121508       178000  10008.0     59     98    7300
     ⋮ │      ⋮        ⋮       ⋮            ⋮        ⋮      ⋮      ⋮       ⋮
 19067 │   2635     2635  128600        98000      0.0     46    108     626
 19068 │   1170     1170  105000       -68000      0.0     34    108    4795
 19069 │   3701     3701  105000       -68000      0.0     34    108    4795
 19070 │   5023     5023  105000       -68000      0.0     34    108    4795
 19071 │   1703     1703   26527         4700      0.0     37    108    8098
 19072 │   2410     2410   26527         4700      0.0     37    108    8098
 19073 │   4654     4654  144026       220000     25.0     46    108    8954
 19074 │   1207     1207   55000        69000      0.0     31    108    8782
 19075 │   4153     4153   55000        69000      0.0     31    108    8782
 19076 │   2238     2238   42728       -10000      0.0     26    108   13059
 19077 │   3722     3722   42728       -10000      0.0     26    108   13059
 19078 │   4025     4025   42728       -10000      0.0     26    108   13059
                                                           19053 rows omitted

Bootstrapping for a Weighting Matrix

Suppose we would like to weight our minimum distance criterion using the inverse of the variance of each statistic. A simple way to compute this variance is to use the bootstrap. Here is code to do that:

using Random
function boot_moment_variance(data,B ; seed = 1010)
    M = zeros(3,B) #<- one column of moments per bootstrap replication
    Random.seed!(seed) #<- fix the seed so the draws are reproducible
    for b in axes(M,2)
        db = draw_bootstrap_sample(data)
        M[:,b] = get_moments(db)
    end
    V = cov(M') #<- covariance of the moments across replications
    return V
end

V_mom = boot_moment_variance(data,100)
3×3 Matrix{Float64}:
 0.00162775   0.000432301  0.000423284
 0.000432301  0.000378095  0.000195091
 0.000423284  0.000195091  0.00035493

We could use the inverse of this matrix as the weighting matrix, or just the inverse of its diagonal.
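For example, a minimal sketch of both options (Diagonal and diag come from the LinearAlgebra standard library, which is loaded in the next section anyway):

using LinearAlgebra
W_opt  = inv(V_mom)                 #<- full inverse: the "optimal" weighting matrix
W_diag = Diagonal(1 ./ diag(V_mom)) #<- inverse of the diagonal only, ignoring covariances

Either can be passed as W to the estimation routine below.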

Bootstrapping the Parameters

To bootstrap the parameters we first need to write some simple code to estimate the income process.
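To see where the model moments come from: under the AR(1) assumption, stationarity implies

\[
\mathbb{V}(\eta_{it}) = \frac{\sigma_{\eta}^{2}}{1-\rho^{2}} \equiv v, \qquad \mathrm{Cov}(\eta_{it},\eta_{i,t-k}) = \rho^{k} v.
\]

Because the panel is biennial, the first and second observed lags correspond to \(k=2\) and \(k=4\), which is why model_moms returns \([v,\ \rho^{2}v,\ \rho^{4}v]\).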

using LinearAlgebra, Optim
function model_moms(ρ,σ)
    v = σ^2 / (1 - ρ^2) #<- stationary variance of the AR(1)
    return [v, ρ^2 * v, ρ^4 * v] #<- variance plus the two- and four-year autocovariances
end

function min_distance_obj(x,m0,W)
    m1 = model_moms(x[1],x[2])
    dm = m1 .- m0 #<- gap between model and data moments
    return dm' * W * dm #<- weighted quadratic form
end

function estimate_income_process(data, W)
    μ = estimate_mu(data)
    m0 = get_moments(data)
    res = optimize(x->min_distance_obj(x,m0,W),[0.9,0.1],Newton(),autodiff=:forward) #<- Newton's method with forward-mode autodiff
    return (; μ, ρ = res.minimizer[1], ση = res.minimizer[2])
end

r = estimate_income_process(data,I(3))
(μ = [10.760860880321463, 10.841247443701185, 10.902095326248284, 10.91448929395576, 10.998556154206424, 10.953465022792393, 11.100954062096019, 11.091165981452196, 11.084065028498708, 11.063159174262802  …  11.289951332574214, 11.218984368534626, 11.22669701047502, 11.157589262616021, 11.212633411833302, 11.139162326045438, 11.048452223663316, 11.020306496546475, 10.955869785866497, 10.995127001069111], ρ = 0.8619354766570221, ση = 0.3888428287852483)

Now we can apply the exact same routine as before. In this case, let’s return the entire bootstrapped sample of parameters instead of just the variance.

function boot_parameters(data,B,W ; seed = 2020)
    mu_b = zeros(40,B) #<- one column of age means per replication (40 ages, 25 to 64)
    ρ_b = zeros(B)
    σ_b = zeros(B)
    Random.seed!(seed)
    for b in eachindex(ρ_b)
        db = draw_bootstrap_sample(data)
        r = estimate_income_process(db,W)
        mu_b[:,b] = r.μ
        ρ_b[b] = r.ρ
        σ_b[b] = r.ση
    end
    return mu_b, ρ_b, σ_b
end

W_opt = inv(V_mom)

mu_b, ρ_b, σ_b = boot_parameters(data,100,W_opt)
([10.73580760699535 10.816247401707118 … 10.826997992362788 10.823451250771296; 10.80821234096262 10.838035193746032 … 10.838646491668136 10.860638767368634; … ; 10.932575248306065 10.976627931101024 … 10.993294394103312 10.905486358310869; 10.987084615895972 10.949352984020404 … 10.90096845528756 11.097782268019063], [0.8758970696492175, 0.8749031842046412, 0.876266168284345, 0.8842489702570284, 0.8895021308118279, 0.8813061168077119, 0.8674746704022943, 0.8631370148641132, 0.869562261996464, 0.8691779540595619  …  0.873224740029997, 0.8858033021530047, 0.8832392563927576, 0.8705757274838543, 0.8793051161142941, 0.8883241567846492, 0.9232806638603563, 0.871593632447117, 0.8752252490344966, 0.9017572628768009], [0.3631400069730627, 0.34742241728061213, 0.352626326206807, 0.34117177871761845, 0.33614178219528, 0.343324062252806, 0.3863865751127758, 0.3778308056994547, 0.373132407007769, 0.37179238682244176  …  0.37077420410018186, 0.3435244271756953, 0.339663045471818, 0.36043023196833307, 0.3503088163031108, 0.3481738289608236, 0.2622013601425302, 0.35282338167619254, 0.35564318535399464, 0.3018655922610499])

This would give us standard errors:
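In other words, for each parameter \(\theta\) we take the standard deviation of the estimates across the \(B\) bootstrap replications,

\[
\widehat{se}(\hat{\theta}) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\big(\hat{\theta}_{b}-\bar{\theta}\big)^{2}}, \qquad \bar{\theta} = \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}_{b},
\]

which is what std computes (it applies the \(B-1\) correction by default).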

se_mu = std(mu_b,dims=2)
@show se_ρ = std(ρ_b)
@show se_σ = std(σ_b)
se_ρ = std(ρ_b) = 0.015419484415975956
se_σ = std(σ_b) = 0.029651429799185287
0.029651429799185287

Exercise

Try comparing the variance of these estimators under the estimated optimal weighting matrix versus the identity matrix. Does it make an appreciable difference to the estimated distribution of the parameter estimates?
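One possible starting point (a sketch reusing the functions from above, with B = 100 as before):

mu_I, ρ_I, σ_I = boot_parameters(data,100,I(3)) #<- identity weighting matrix
@show std(ρ_b), std(ρ_I) #<- compare the spread of the two bootstrap distributions
@show std(σ_b), std(σ_I)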