Assignment 4: Income and Savings from the PSID

In this assignment we’ll start working with data from the Panel Study of Income Dynamics (PSID). For more details on how these data are constructed, refer to Arellano, Blundell, and Bonhomme (2018).

To begin, let’s load the data and pull out the variables we are interested in using: person identifiers (person), year, total income (y), savings (tot_assets1), and age. Bear in mind that measuring total income and total assets in these data is by no means trivial; the variables we are looking at are the product of a lot of data cleaning and careful choices by the authors.

using CSV, DataFrames, DataFramesMeta, Statistics
# Read the data ("NA" marks missing values) and keep the columns we need.
data = @chain begin
    CSV.read("../data/abb_aea_data.csv", DataFrame, missingstring = "NA")
    @select :person :y :tot_assets1 :asset :age :year
end

19317×6 DataFrame

19292 rows omitted

Row	person	y	tot_assets1	asset	age	year
	Int64	Int64	Int64	Float64	Int64	Int64
1	12061	173100	605000	15500.0	65	98
2	17118	54000	60000	0.0	49	98
3	12630	61283	224000	39283.0	59	98
4	12647	42300	28240	0.0	38	98
5	5239	82275	7500	0.0	56	98
6	2671	69501	48000	3600.0	35	98
7	13027	68000	148000	20000.0	49	98
8	6791	93758	80000	160.0	41	98
9	6475	26581	23300	0.0	35	98
10	18332	33785	0	0.0	42	98
11	3856	55300	311000	5300.0	33	98
12	19326	40200	105250	0.0	40	98
13	21818	42500	13000	0.0	36	98
⋮	⋮	⋮	⋮	⋮	⋮	⋮
19306	6617	115887	241000	21346.0	62	108
19307	626	128600	98000	0.0	46	108
19308	4795	105000	-68000	0.0	34	108
19309	3223	120000	132000	0.0	47	108
19310	8098	26527	4700	0.0	37	108
19311	8954	144026	220000	25.0	46	108
19312	12990	122665	220000	0.0	53	108
19313	8782	55000	69000	0.0	31	108
19314	13059	42728	-10000	0.0	26	108
19315	13535	57000	0	0.0	26	108
19316	3806	87000	74200	0.0	26	108
19317	11085	74000	-50000	0.0	31	108

You are going to estimate the parameters of the income process in the simple savings model by matching the implied variances and covariances from the model to those that are calculated from the data.

Recall that the income process is:

\[ \log(y_{it}) = \mu_{t} + \varepsilon_{it} \]

where \(\varepsilon_{it}\) is an AR(1) process with autocorrelation \(\rho\), innovation variance \(\sigma^2_{\eta}\), and stationary variance \(\sigma^2_{\eta} / (1-\rho^2)\). Thus, only two parameters govern the income process: \((\rho,\sigma_\eta)\).
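
To see where the variance expression comes from, it can help to write the process as a recursion. The step below assumes, as is standard for an AR(1), that the innovation \(\eta_{it}\) has mean zero and variance \(\sigma^2_\eta\) and is independent of \(\varepsilon_{it-1}\):

\[ \varepsilon_{it} = \rho\,\varepsilon_{it-1} + \eta_{it} \quad\Rightarrow\quad \mathbb{V}(\varepsilon_{it}) = \rho^2\,\mathbb{V}(\varepsilon_{it-1}) + \sigma^2_\eta \]

If the process is stationary, then \(\mathbb{V}(\varepsilon_{it}) = \mathbb{V}(\varepsilon_{it-1})\), and solving gives \(\sigma^2_\eta/(1-\rho^2)\), provided \(|\rho|<1\).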

Setup

To map to the model, assume that agents begin (\(t=1\)) when aged 25 and live for 40 years (so the “terminal” period is at age 64). Thus, we should filter the data to look at only these ages.

@subset!(data, :age .>= 25, :age .<= 64)

19139×6 DataFrame

19114 rows omitted

Row	person	y	tot_assets1	asset	age	year
	Int64	Int64	Int64	Float64	Int64	Int64
1	17118	54000	60000	0.0	49	98
2	12630	61283	224000	39283.0	59	98
3	12647	42300	28240	0.0	38	98
4	5239	82275	7500	0.0	56	98
5	2671	69501	48000	3600.0	35	98
6	13027	68000	148000	20000.0	49	98
7	6791	93758	80000	160.0	41	98
8	6475	26581	23300	0.0	35	98
9	18332	33785	0	0.0	42	98
10	3856	55300	311000	5300.0	33	98
11	19326	40200	105250	0.0	40	98
12	21818	42500	13000	0.0	36	98
13	7300	121508	178000	10008.0	59	98
⋮	⋮	⋮	⋮	⋮	⋮	⋮
19128	6617	115887	241000	21346.0	62	108
19129	626	128600	98000	0.0	46	108
19130	4795	105000	-68000	0.0	34	108
19131	3223	120000	132000	0.0	47	108
19132	8098	26527	4700	0.0	37	108
19133	8954	144026	220000	25.0	46	108
19134	12990	122665	220000	0.0	53	108
19135	8782	55000	69000	0.0	31	108
19136	13059	42728	-10000	0.0	26	108
19137	13535	57000	0	0.0	26	108
19138	3806	87000	74200	0.0	26	108
19139	11085	74000	-50000	0.0	31	108

Part 1

Estimate the parameters \(\mu_t\) using the sample mean of log income at each age (recall that \(t\) indexes age, with \(t=1\) at age 25). Create residuals \(\hat{\varepsilon}_{it}\) for each individual in each period using these estimates.
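
As a rough illustration of the mechanics, here is a minimal sketch on a made-up toy panel. The data frame `toy` and the column names `logy`, `mu_hat`, and `eps_hat` are hypothetical and not part of the assignment data. It demeans log income within each age group:

```julia
using DataFrames, DataFramesMeta, Statistics

# Toy panel standing in for `data` (values are made up for illustration).
toy = DataFrame(person = [1, 2, 3, 4],
                age    = [30, 30, 31, 31],
                y      = [40_000, 60_000, 50_000, 80_000])

resid = @chain toy begin
    @transform :logy = log.(:y)
    groupby(:age)
    # Within each age group, mu_hat is the sample mean of log income.
    @transform :mu_hat = mean(:logy)
    # The residual is the deviation of log income from its age-specific mean.
    @transform :eps_hat = :logy .- :mu_hat
end
```

By construction the residuals average to zero within each age, which is a quick sanity check to run on the real data.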

Part 2

The PSID data are collected biennially (every two years), so the moments we can compute from the data are at two-year intervals. Write a function that takes a guess of \((\rho,\sigma_\eta)\) and calculates the model-implied:

The unconditional variance of the residual.
The covariance of the residual with its two-year lag.
The covariance of the residual with its four-year lag.
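
One possible shape for such a function is sketched below. It assumes the model period is one year and that the process is stationary, so that a two-year lag spans two model periods and the covariance at lag \(k\) is \(\rho^k\) times the variance. The name `model_moments` is only illustrative.

```julia
# Model-implied moments of a stationary AR(1) observed every two years.
# Assumes a one-year model period, so a two-year lag is two model periods.
function model_moments(ρ, σ_η)
    v = σ_η^2 / (1 - ρ^2)          # unconditional variance
    return [v, ρ^2 * v, ρ^4 * v]   # variance, two-year cov, four-year cov
end

m = model_moments(0.5, 1.0)
```

Returning the moments as a vector makes it easy to subtract the corresponding vector of sample moments later on.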

Part 3

Calculate the sample equivalents of these moments from the data, and write a function that calculates the sum of squared differences between the sample moments and those predicted by a particular choice of \((\rho,\sigma_\eta)\).

If it helps, here is code to create the lags for income (you could adapt this code to create lags for the residuals you calculated in part 1).

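# Two-year lag: shifting year forward by 2 lines up y from t-2 with year t.
d1 = @chain data begin
    @select :year :person :y
    @transform :year = :year .+ 2
    @rename :ylag1 = :y
end

# Four-year lag, built the same way.
d2 = @chain data begin
    @select :year :person :y
    @transform :year = :year .+ 4
    @rename :ylag2 = :y
end

# Inner joins keep only person-years with both lags observed.
data = @chain data begin
    innerjoin(d1, on = [:person, :year])
    innerjoin(d2, on = [:person, :year])
end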

9785×8 DataFrame

9760 rows omitted

Row	person	y	tot_assets1	asset	age	year	ylag1	ylag2
	Int64	Int64	Int64	Float64	Int64	Int64	Int64	Int64
1	17118	43799	-2000	0.0	53	102	51700	54000
2	12630	68554	1519000	29454.0	63	102	104104	61283
3	12647	35000	78000	0.0	42	102	30500	42300
4	5239	49948	29900	0.0	60	102	54332	82275
5	2671	77000	84000	0.0	39	102	75000	69501
6	13027	91000	248000	25000.0	53	102	50678	68000
7	6791	122296	118650	154.0	45	102	100503	93758
8	18332	54000	56000	0.0	46	102	40200	33785
9	3856	95800	357000	9600.0	37	102	76400	55300
10	21818	72334	25540	0.0	40	102	58700	42500
11	7300	64319	710000	6130.0	63	102	111140	121508
12	20796	88000	75000	0.0	48	102	112600	105000
13	8455	50880	110500	130.0	53	102	46000	54000
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
9774	3360	198300	775000	3300.0	40	108	161000	161000
9775	1204	44710	5200	0.0	30	108	46360	35000
9776	6483	55019	-3000	0.0	38	108	28000	22720
9777	2182	53908	9000	0.0	32	108	51247	57625
9778	3971	48406	108000	15905.0	43	108	109000	55400
9779	12094	70500	83000	0.0	41	108	55830	10375
9780	12975	104500	266800	2500.0	40	108	77240	98200
9781	9940	133074	282000	274.0	45	108	102900	100300
9782	8048	185500	88000	50000.0	45	108	183000	75200
9783	2921	84845	390800	773.0	59	108	76507	77235
9784	13562	39200	18000	0.0	43	108	35000	47500
9785	3193	66000	37400	0.0	44	108	51000	61210

An example of calculating covariances:

@chain data begin
    @combine begin 
        :c1 = cov(log.(:y),log.(:ylag1)) 
        :c2 = cov(log.(:y),log.(:ylag2))
    end
end

1×2 DataFrame

Row	c1	c2
	Float64	Float64
1	0.402346	0.363079
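
The same pattern can be adapted to residuals. The sketch below uses a made-up toy panel in which the column `e` stands in for the Part 1 residuals; `toy`, `lagged`, `pairs2`, `pairs4`, and `data_moments` are hypothetical names. Unlike the code above, it joins each lag separately, so the two-year covariance is not restricted to observations that also have a four-year lag. That is a design choice rather than a requirement.

```julia
using DataFrames, DataFramesMeta, Statistics

# Toy residual panel (made-up numbers): 3 people observed in 98, 100, 102.
toy = DataFrame(person = repeat(1:3, inner = 3),
                year   = repeat([98, 100, 102], outer = 3),
                e      = [0.1, 0.2, 0.0, -0.3, -0.1, -0.2, 0.2, -0.1, 0.2])

# Shift `year` forward by `k` so that a join lines up e(t-k) with e(t).
lagged(df, k) = @chain df begin
    @select :person :year :e
    @transform :year = :year .+ k
    @rename :elag = :e
end

pairs2 = innerjoin(toy, lagged(toy, 2), on = [:person, :year])
pairs4 = innerjoin(toy, lagged(toy, 4), on = [:person, :year])

# Variance, two-year covariance, four-year covariance.
data_moments = [var(toy.e),
                cov(pairs2.e, pairs2.elag),
                cov(pairs4.e, pairs4.elag)]
```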

Part 4

Now estimate the income process parameters by minimizing this sum of squares (i.e. implement a minimum distance estimator with the identity weighting matrix).
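
To illustrate the idea without committing to a particular optimizer, the sketch below uses a coarse grid search in place of a numerical minimizer (in practice you would likely use an optimization package such as Optim.jl). The functions `model_moments`, `distance`, and `grid_estimate` are hypothetical names. The target moments are generated from the model at known parameters rather than taken from the data, so we can check that the procedure recovers them. It assumes a stationary AR(1) with a one-year model period.

```julia
# Model-implied variance, two-year covariance, and four-year covariance.
model_moments(ρ, σ_η) = (σ_η^2 / (1 - ρ^2)) .* [1.0, ρ^2, ρ^4]

# Sum of squared differences (identity weighting matrix).
distance(ρ, σ_η, m_data) = sum(abs2, model_moments(ρ, σ_η) .- m_data)

# Coarse grid search: a stand-in for a proper numerical optimizer.
function grid_estimate(m_data; ρ_grid = 0.01:0.01:0.99, σ_grid = 0.01:0.01:1.0)
    best = (ρ = NaN, σ_η = NaN, loss = Inf)
    for ρ in ρ_grid, σ_η in σ_grid
        loss = distance(ρ, σ_η, m_data)
        if loss < best.loss
            best = (ρ = ρ, σ_η = σ_η, loss = loss)
        end
    end
    return best
end

# Hypothetical check: moments generated at known parameters should be recovered.
m_fake = model_moments(0.9, 0.3)
est = grid_estimate(m_fake)
```

Note that only even powers of \(\rho\) enter these biennial moments, so the sign of \(\rho\) is not pinned down by them; the grid above restricts attention to \(\rho > 0\).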

Part 5

Note that in this model:

\[ \rho = \frac{\mathbb{C}(\varepsilon_{it},\varepsilon_{it-1})}{\mathbb{V}(\varepsilon_{it})}. \]

Suppose that the true model is:

\[ \log(y_{it}) = \mu_{t} + \varepsilon_{it} + \zeta_{it} \]

where \(\zeta_{it}\) is an additional shock to income that is i.i.d. (i.e. has no persistence). Suppose we estimate the persistence parameter \(\rho\) using the relationship above (which is now misspecified):

\[ \hat{\rho} = \frac{\widehat{\mathbb{C}(\hat{\varepsilon}_{it},\hat{\varepsilon}_{it-1})}}{\widehat{\mathbb{V}(\hat{\varepsilon}_{it})}} \]

Does the population limit of our estimator over- or under-estimate \(\rho\), the persistence in \(\varepsilon\)?
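
One way to organize the argument is to work out what the numerator and denominator converge to when the residual is really \(\varepsilon_{it} + \zeta_{it}\). The expressions below assume that \(\zeta_{it}\) is independent of \(\varepsilon\) at all leads and lags and has variance \(\sigma^2_\zeta\) (notation introduced here for illustration):

\[ \mathbb{C}(\varepsilon_{it}+\zeta_{it},\,\varepsilon_{it-1}+\zeta_{it-1}) = \rho\,\mathbb{V}(\varepsilon_{it}), \qquad \mathbb{V}(\varepsilon_{it}+\zeta_{it}) = \mathbb{V}(\varepsilon_{it}) + \sigma^2_\zeta \]

Taking the ratio of these two expressions gives the population limit of \(\hat{\rho}\), which you can then compare with \(\rho\).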

References

Arellano, Manuel, Richard Blundell, and Stephane Bonhomme. 2018. “Nonlinear Persistence and Partial Insurance: Income and Consumption Dynamics in the PSID.” AEA Papers and Proceedings 108 (May): 281–86. https://doi.org/10.1257/pandp.20181049.