Assignment 4: Income and Savings from the PSID

In this assignment we’ll start working with data from the PSID. If you would like more details on how these data are constructed, you should refer to Arellano, Blundell, and Bonhomme (2018).

To begin, let’s load the data and pull out the variables we are interested in using. These are person identifiers (person), year, total income (y), savings (tot_assets1) and age. You should bear in mind that it is by no means trivial to measure total income and total assets in these data. The variables we are looking at are the product of a lot of data cleaning and careful choices by the authors.

using CSV, DataFrames, DataFramesMeta, Statistics
data = @chain begin 
    CSV.read("../data/abb_aea_data.csv",DataFrame,missingstring = "NA")
    @select :person :y :tot_assets1 :asset :age :year
end
19317×6 DataFrame
19292 rows omitted
  Row  person       y  tot_assets1    asset    age   year
        Int64   Int64        Int64  Float64  Int64  Int64
    1   12061  173100       605000  15500.0     65     98
    2   17118   54000        60000      0.0     49     98
    3   12630   61283       224000  39283.0     59     98
    4   12647   42300        28240      0.0     38     98
    5    5239   82275         7500      0.0     56     98
    6    2671   69501        48000   3600.0     35     98
    7   13027   68000       148000  20000.0     49     98
    8    6791   93758        80000    160.0     41     98
    9    6475   26581        23300      0.0     35     98
   10   18332   33785            0      0.0     42     98
   11    3856   55300       311000   5300.0     33     98
   12   19326   40200       105250      0.0     40     98
   13   21818   42500        13000      0.0     36     98
    ⋮
19306    6617  115887       241000  21346.0     62    108
19307     626  128600        98000      0.0     46    108
19308    4795  105000       -68000      0.0     34    108
19309    3223  120000       132000      0.0     47    108
19310    8098   26527         4700      0.0     37    108
19311    8954  144026       220000     25.0     46    108
19312   12990  122665       220000      0.0     53    108
19313    8782   55000        69000      0.0     31    108
19314   13059   42728       -10000      0.0     26    108
19315   13535   57000            0      0.0     26    108
19316    3806   87000        74200      0.0     26    108
19317   11085   74000       -50000      0.0     31    108

You are going to estimate the parameters of the income process in the simple savings model by matching the implied variances and covariances from the model to those that are calculated from the data.

Recall that the income process is:

\[ \log(y_{it}) = \mu_{t} + \varepsilon_{it} \]

where \(\varepsilon_{it}\) is an AR1 process with autocorrelation \(\rho\) and variance \(\sigma^2_{\eta} / (1-\rho^2)\). Thus, there are only two parameters dictating the income process: (\(\rho,\sigma_\eta\)).
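For reference, the variance quoted above can be recovered in two lines. This is a sketch under assumptions not spelled out in the text: the innovation is written \(\eta_{it}\) (the symbol suggested by \(\sigma_\eta\)), it has mean zero and is uncorrelated with \(\varepsilon_{i,t-1}\), and \(|\rho|<1\) so that the process is stationary.

\[
\begin{aligned}
\varepsilon_{it} &= \rho\,\varepsilon_{i,t-1} + \eta_{it}, \qquad \mathbb{V}(\eta_{it}) = \sigma^2_{\eta} \\
\mathbb{V}(\varepsilon_{it}) &= \rho^2\,\mathbb{V}(\varepsilon_{i,t-1}) + \sigma^2_{\eta}
\quad\Longrightarrow\quad
\mathbb{V}(\varepsilon_{it}) = \frac{\sigma^2_{\eta}}{1-\rho^2}
\end{aligned}
\]

The last step uses stationarity, \(\mathbb{V}(\varepsilon_{it}) = \mathbb{V}(\varepsilon_{i,t-1})\).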

Setup

To map to the model, assume that agents begin (\(t=1\)) when aged 25 and live for 40 years (so the “terminal” period is at age 64). Thus, we should filter the data to look at only these ages.

@subset!(data,:age.>=25,:age.<=64)
19139×6 DataFrame
19114 rows omitted
  Row  person       y  tot_assets1    asset    age   year
        Int64   Int64        Int64  Float64  Int64  Int64
    1   17118   54000        60000      0.0     49     98
    2   12630   61283       224000  39283.0     59     98
    3   12647   42300        28240      0.0     38     98
    4    5239   82275         7500      0.0     56     98
    5    2671   69501        48000   3600.0     35     98
    6   13027   68000       148000  20000.0     49     98
    7    6791   93758        80000    160.0     41     98
    8    6475   26581        23300      0.0     35     98
    9   18332   33785            0      0.0     42     98
   10    3856   55300       311000   5300.0     33     98
   11   19326   40200       105250      0.0     40     98
   12   21818   42500        13000      0.0     36     98
   13    7300  121508       178000  10008.0     59     98
    ⋮
19128    6617  115887       241000  21346.0     62    108
19129     626  128600        98000      0.0     46    108
19130    4795  105000       -68000      0.0     34    108
19131    3223  120000       132000      0.0     47    108
19132    8098   26527         4700      0.0     37    108
19133    8954  144026       220000     25.0     46    108
19134   12990  122665       220000      0.0     53    108
19135    8782   55000        69000      0.0     31    108
19136   13059   42728       -10000      0.0     26    108
19137   13535   57000            0      0.0     26    108
19138    3806   87000        74200      0.0     26    108
19139   11085   74000       -50000      0.0     31    108

Part 1

Estimate the parameters \(\mu_{t}\) using the sample mean of log income at each age. Then use these estimates to construct the residual \(\hat{\varepsilon}_{it} = \log(y_{it}) - \hat{\mu}_{t}\) for each individual in each period.
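One possible way to set this up is a grouped transform: take logs, then subtract the within-age mean. The sketch below runs on a made-up toy table rather than the PSID extract; the column names `logy` and `resid` are hypothetical, and plain DataFrames.jl functions are used in place of the DataFramesMeta macros from the rest of the assignment.

```julia
using DataFrames, Statistics

# Toy data standing in for the PSID extract (values are made up for illustration).
toy = DataFrame(person = [1, 2, 3, 1, 2, 3],
                age    = [30, 30, 30, 32, 32, 32],
                y      = [40_000, 50_000, 60_000, 45_000, 52_000, 70_000])

# Log income.
toy.logy = log.(toy.y)

# Subtract the age-specific sample mean (the estimate of μ_t) within each age group.
toy = transform(groupby(toy, :age), :logy => (x -> x .- mean(x)) => :resid)
```

By construction the residuals average to zero within each age group, which is a quick sanity check when applying the same steps to `data`.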

Part 2

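The PSID data are collected biennially (every two years), so consecutive observations of the same person are two model periods apart. With this in mind, write a function that takes a guess of \((\rho,\sigma_\eta)\) and calculates the following model-implied moments: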

  1. The unconditional variance of the residual.
  2. The covariance of the residual with its two year lag.
  3. The covariance of the residual with its four year lag.
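A minimal sketch of such a function follows. It assumes the model period is one year (so a two-year lag is two periods), that \(|\rho|<1\), and that the process is stationary, in which case the covariance at lag \(k\) is \(\rho^{k}\) times the unconditional variance. The name `model_moments` is hypothetical.

```julia
# Model-implied moments of an annual AR(1) residual observed every two years.
# Assumes |ρ| < 1 and stationarity; `model_moments` is an illustrative name.
function model_moments(ρ, σ_η)
    v = σ_η^2 / (1 - ρ^2)           # unconditional variance of ε
    return [v, ρ^2 * v, ρ^4 * v]    # variance, two-year covariance, four-year covariance
end

# Example call with arbitrary parameter values.
model_moments(0.9, 0.1)
```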

Part 3

Calculate the sample equivalents of these moments from the data, and write a function that calculates the sum of squared differences between the data moments and those predicted by a particular choice of \((\rho,\sigma_\eta)\).

If it helps, here is code to create the lags for income (you could adapt this code to create lags for the residuals you calculated in part 1).

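# First lag: shift each observation forward one survey wave (two years)
# so that it lines up with the same person's next observation.
d1 = @chain data begin
    @select :year :person :y
    @transform :year = :year .+ 2
    @rename :ylag1 = :y
end

# Second lag: shift forward two survey waves (four years).
d2 = @chain data begin
    @select :year :person :y
    @transform :year = :year .+ 4
    @rename :ylag2 = :y
end

# Inner joins keep only the person-years for which both lags are observed.
data = @chain data begin
    innerjoin(d1, on = [:person, :year])
    innerjoin(d2, on = [:person, :year])
end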
9785×8 DataFrame
9760 rows omitted
  Row  person       y  tot_assets1    asset    age   year   ylag1   ylag2
        Int64   Int64        Int64  Float64  Int64  Int64   Int64   Int64
    1   17118   43799        -2000      0.0     53    102   51700   54000
    2   12630   68554      1519000  29454.0     63    102  104104   61283
    3   12647   35000        78000      0.0     42    102   30500   42300
    4    5239   49948        29900      0.0     60    102   54332   82275
    5    2671   77000        84000      0.0     39    102   75000   69501
    6   13027   91000       248000  25000.0     53    102   50678   68000
    7    6791  122296       118650    154.0     45    102  100503   93758
    8   18332   54000        56000      0.0     46    102   40200   33785
    9    3856   95800       357000   9600.0     37    102   76400   55300
   10   21818   72334        25540      0.0     40    102   58700   42500
   11    7300   64319       710000   6130.0     63    102  111140  121508
   12   20796   88000        75000      0.0     48    102  112600  105000
   13    8455   50880       110500    130.0     53    102   46000   54000
    ⋮
 9774    3360  198300       775000   3300.0     40    108  161000  161000
 9775    1204   44710         5200      0.0     30    108   46360   35000
 9776    6483   55019        -3000      0.0     38    108   28000   22720
 9777    2182   53908         9000      0.0     32    108   51247   57625
 9778    3971   48406       108000  15905.0     43    108  109000   55400
 9779   12094   70500        83000      0.0     41    108   55830   10375
 9780   12975  104500       266800   2500.0     40    108   77240   98200
 9781    9940  133074       282000    274.0     45    108  102900  100300
 9782    8048  185500        88000  50000.0     45    108  183000   75200
 9783    2921   84845       390800    773.0     59    108   76507   77235
 9784   13562   39200        18000      0.0     43    108   35000   47500
 9785    3193   66000        37400      0.0     44    108   51000   61210

An example of calculating covariances:

@chain data begin
    @combine begin 
        :c1 = cov(log.(:y),log.(:ylag1)) 
        :c2 = cov(log.(:y),log.(:ylag2))
    end
end
1×2 DataFrame
  Row        c1        c2
        Float64   Float64
    1  0.402346  0.363079
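The same pattern extends to the three moments needed in Part 3, which also include the variance. A minimal sketch on made-up residuals, where the column names `e`, `elag1` and `elag2` are hypothetical stand-ins for the Part 1 residual and its two- and four-year lags:

```julia
using DataFrames, Statistics

# Made-up residuals for illustration only (not PSID values).
toy = DataFrame(e     = [0.10, -0.20, 0.30, -0.10, 0.05],
                elag1 = [0.05, -0.10, 0.20, -0.20, 0.10],
                elag2 = [0.00, -0.15, 0.25, -0.05, 0.02])

# Variance, two-year covariance and four-year covariance, in that order.
data_moments = [var(toy.e), cov(toy.e, toy.elag1), cov(toy.e, toy.elag2)]
```

Whether to compute the variance on the joined sample (only person-years with both lags) or on the full sample is a choice left open here; the joined sample keeps all three moments on the same observations.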

Part 4

Now estimate the income process parameters by minimizing this sum of squared differences (i.e. implement a minimum distance estimator with an identity weighting matrix).
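As a rough illustration of the mechanics, the sketch below minimizes the distance with Optim.jl. Several things here are assumptions rather than part of the assignment: the use of Optim.jl and the Nelder-Mead method, the hypothetical function name `ssq_distance`, the starting values, and the placeholder `data_moments`, which are illustrative numbers and not computed from the PSID.

```julia
using Optim

# Placeholder data moments (illustrative numbers, NOT computed from the PSID):
# variance, two-year covariance, four-year covariance.
data_moments = [0.50, 0.40, 0.32]

# Sum of squared differences between data moments and AR(1)-implied moments.
# Assumes an annual AR(1) observed every two years; θ = [ρ, σ_η].
function ssq_distance(θ, m)
    ρ, σ_η = θ
    v = σ_η^2 / (1 - ρ^2)
    return sum((m .- [v, ρ^2 * v, ρ^4 * v]) .^ 2)
end

# Derivative-free minimization from an arbitrary interior starting point.
res = optimize(θ -> ssq_distance(θ, data_moments), [0.5, 0.5], NelderMead())
ρ_hat, σ_hat = Optim.minimizer(res)
```

Because only \(\rho^2\) and \(\rho^4\) enter these three moments, the sign of \(\rho\) is not pinned down by them; starting from a positive guess is one simple way to steer the search toward the positive root.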

Part 5

Note that in this model:

\[ \rho = \frac{\mathbb{C}(\varepsilon_{it},\varepsilon_{it-1})}{\mathbb{V}(\varepsilon_{it})}. \]

Suppose that the true model is:

\[ \log(y_{it}) = \mu_{t} + \varepsilon_{it} + \zeta_{it} \]

where \(\zeta_{it}\) is an additional shock to income that is completely iid (i.e. it has no persistence). Suppose we estimate the persistence parameter \(\rho\) using the relationship above (which is now misspecified):

\[ \hat{\rho} = \frac{\widehat{\mathbb{C}(\hat{\varepsilon}_{it},\hat{\varepsilon}_{it-1})}}{\widehat{\mathbb{V}(\hat{\varepsilon}_{it})}} \]

Does the population limit of our estimator over- or under-estimate \(\rho\), the persistence in \(\varepsilon\)?
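One way to approach this is to write out the moments of the residual that is actually observed, \(\varepsilon_{it}+\zeta_{it}\). The sketch below assumes that \(\zeta_{it}\) is independent of \(\varepsilon\) at all leads and lags, writes its variance as \(\sigma^2_{\zeta}\) (a symbol introduced here), and ignores estimation error in \(\hat{\mu}_{t}\).

\[
\begin{aligned}
\mathbb{V}(\varepsilon_{it}+\zeta_{it}) &= \mathbb{V}(\varepsilon_{it}) + \sigma^2_{\zeta} \\
\mathbb{C}(\varepsilon_{it}+\zeta_{it},\,\varepsilon_{it-1}+\zeta_{it-1}) &= \mathbb{C}(\varepsilon_{it},\varepsilon_{it-1}) = \rho\,\mathbb{V}(\varepsilon_{it}) \\
\operatorname{plim}\,\hat{\rho} &= \rho\,\frac{\mathbb{V}(\varepsilon_{it})}{\mathbb{V}(\varepsilon_{it}) + \sigma^2_{\zeta}}
\end{aligned}
\]

Comparing the final ratio with \(\rho\) gives the direction of the bias.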

References

Arellano, Manuel, Richard Blundell, and Stephane Bonhomme. 2018. “Nonlinear Persistence and Partial Insurance: Income and Consumption Dynamics in the PSID.” AEA Papers and Proceedings 108 (May): 281–86. https://doi.org/10.1257/pandp.20181049.