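library(tidyverse)
# Read the CPS extract, keep 2014-2018, and recode missing-value codes to NA
D <- read.csv("../data/cps-econ-4261.csv") %>%
  filter(YEAR>=2014,YEAR<=2018) %>%
  mutate(EARNWEEK = na_if(EARNWEEK,9999.99),
         # zero hours are also set to NA so we never divide by zero below
         UHRSWORKT = na_if(na_if(na_if(UHRSWORKT,999),997),0),
         HOURWAGE = na_if(HOURWAGE,999.99)) %>%
  # Hourly wage: weekly earnings over usual hours if PAIDHOUR==1, reported hourly wage if PAIDHOUR==2
  mutate(Wage = case_when(PAIDHOUR==1 ~ EARNWEEK/UHRSWORKT,PAIDHOUR==2 ~ HOURWAGE)) %>%
  filter(!is.na(Wage))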
Week 4
(1) Saturated Models
Suppose we have a model with a vector of dummy variables:
\[ \mathbb{E}[y_{n}|\mathbf{d}_{n}] = \mathbf{d}_{n}\beta \]
with \(\mathbf{d}_{n} = [d_{1,n},\ d_{2,n},...,\ d_{K,n}]\) where each dummy is an indicator for a mutually exclusive event, meaning that \(d_{k,n}=1\) implies that \(d_{j,n}=0\) for all \(j\neq k\).
Example
Suppose that \(\mathbf{x}_{n} = [AGE_{n},SEX_{n}]\) and we define \(\mathbf{d}_{n}\) to be a vector of dummies for each combination of age and sex. Argue that this satisfies the conditions above. We call this a saturated model. There is one dummy for each unique potential realization of \(\mathbf{x}_{n}\).
Exercise
Re-writing the model as
\[ \mathbf{Y} = \mathbf{D}\beta + \epsilon \]
the OLS estimator is:
\[ \hat{\beta} = \left(\mathbf{D}^{T}\mathbf{D}\right)^{-1}\mathbf{D}^{T}\mathbf{Y}.\]
Show that each element \(\hat{\beta}_{k}\) is equal to the sample mean of \(y_{n}\) using only observations for which \(d_{k,n}=1\).
This implies a general rule for any regression involving dummy variables. Suppose that the vector \(\mathbf{x}_{n}\) can take one of \(K\) unique values \((x^{k})_{k=1}^{K}\), and suppose that there is a unique combination of dummy variables \(\mathbf{d}^{k}\) for each value. Then the OLS estimator solves:
\[ \mathbf{d}^{k}\hat{\beta} =\frac{\sum_{n}y_{n}\mathbf{1}\{\mathbf{x}_{n}=x^{k}\}}{\sum_{n}\mathbf{1}\{\mathbf{x}_{n}=x^k\}} \]
For example, if \(\mathbf{x}_{n} = [1,F_{n},C_{n},F_{n}C_{n}]\) where \(F_{n}\) is a dummy variable equal to 1 if person \(n\) is female, and \(C_{n}\) is a dummy variable equal to 1 if person \(n\) has a college education, then \(\mathbf{x}_{n}\) has four unique values. Hence \(\mathbb{E}[y_{n}|\text{Male},\text{Non-College}]=\beta_{0}\), and \(\hat{\beta}_{0}\) is the sample mean of \(y_{n}\) over all observations of non-college educated men. Similarly, \(\hat{\beta}_{0} + \hat{\beta}_{1}\) is equal to the sample mean of \(y_{n}\) over all non-college educated women, and so on.
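As a quick numerical check of this rule, here is a minimal R sketch on simulated data. The variable names `Fem` and `Col` and all parameter values are made up for illustration; nothing here comes from the CPS file.

```r
# Simulated data: all names and numbers are illustrative assumptions
set.seed(1)
N   <- 1000
Fem <- rbinom(N, 1, 0.5)   # 1 if female
Col <- rbinom(N, 1, 0.4)   # 1 if college educated
y   <- 2 + 0.3 * Col - 0.1 * Fem - 0.05 * Fem * Col + rnorm(N)

# Saturated regression: intercept, both dummies, and their interaction
b <- coef(lm(y ~ Fem * Col))

# Sample means of y within each of the four cells (rows: Fem, columns: Col)
m <- tapply(y, list(Fem = Fem, Col = Col), mean)

# Combinations of coefficients implied by each cell's dummy values
fitted_cells <- c(
  male_noncollege   = b[["(Intercept)"]],
  female_noncollege = b[["(Intercept)"]] + b[["Fem"]],
  male_college      = b[["(Intercept)"]] + b[["Col"]],
  female_college    = sum(b)
)
cell_means <- c(m["0", "0"], m["1", "0"], m["0", "1"], m["1", "1"])

# The two vectors should agree up to floating-point error
all.equal(unname(fitted_cells), unname(cell_means))
```

The comparison holds for any simulated coefficients, because the result is a property of OLS with a saturated set of dummies rather than of the data-generating process.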
(2) Testing hypotheses in a linear model
Here’s an exercise in how to use the correct interpretation of coefficients in a linear model to test hypotheses. First we read in the CPS data.
Suppose we wanted to test whether the gender wage gap is larger for married than for unmarried individuals. The estimand (i.e. the population parameter we are after) is:
\[ \begin{multline} \Delta = \left(\mathbb{E}[\log(W)|\text{Female,Married}] - \mathbb{E}[\log(W)|\text{Male,Married}]\right) \\ - \left(\mathbb{E}[\log(W)|\text{Female,Single}] - \mathbb{E}[\log(W)|\text{Male,Single}]\right) \end{multline} \]
A simple way to test this would be to write the model with a dummy \(F_{n}=1\) if female (0 otherwise) and a dummy \(M_{n}=1\) if married (0 otherwise): \[\mathbb{E}[\log(W)|F_{n},M_{n}] = \beta_{0} + \beta_{1}F_{n} + \beta_{2}M_{n} + \beta_{3}M_{n}F_{n} \]
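As a starting point for interpreting the coefficients, it may help to write out the four conditional means that this model implies, obtained by plugging each combination of \(F_{n}\) and \(M_{n}\) into the equation above:
\[ \begin{aligned} \mathbb{E}[\log(W)|\text{Male,Single}] &= \beta_{0} \\ \mathbb{E}[\log(W)|\text{Female,Single}] &= \beta_{0} + \beta_{1} \\ \mathbb{E}[\log(W)|\text{Male,Married}] &= \beta_{0} + \beta_{2} \\ \mathbb{E}[\log(W)|\text{Female,Married}] &= \beta_{0} + \beta_{1} + \beta_{2} + \beta_{3} \end{aligned} \]
Any difference between groups can then be expressed by subtracting the relevant rows.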
Exercise
Show that the population parameter \(\Delta\) defined above is equal to \(\beta_{3}\).
Exercise
Use the following regression to test the null hypothesis that wage gaps are the same for married and single individuals.
D %>%
  mutate(Married = MARST==1,Female = SEX==2) %>%
  lm(log(Wage) ~ Female*Married,data=.) %>%
  summary()
Call:
lm(formula = log(Wage) ~ Female * Married, data = .)

Residuals:
    Min      1Q  Median      3Q     Max
-7.4330 -0.4229 -0.0461  0.3956  3.5880

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             2.803056   0.005147 544.640   <2e-16 ***
FemaleTRUE             -0.068686   0.007134  -9.629   <2e-16 ***
MarriedTRUE             0.353256   0.006866  51.450   <2e-16 ***
FemaleTRUE:MarriedTRUE -0.136014   0.009717 -13.997   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5902 on 59518 degrees of freedom
Multiple R-squared:  0.07279,   Adjusted R-squared:  0.07274
F-statistic:  1557 on 3 and 59518 DF,  p-value: < 2.2e-16
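Notice that the two-sided p-value on the interaction term is tiny: we strongly reject the null hypothesis that wage gaps are the same. The estimated interaction coefficient is negative (about -0.136), so the gap appears to be much bigger for married individuals.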
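To make the test explicit, the following sketch recomputes the t-statistic and two-sided p-value for the interaction term by hand, using only the estimate, standard error, and degrees of freedom reported in the output above. It assumes the default standard errors from `summary()` are appropriate; the variable names are illustrative.

```r
# Values copied from the FemaleTRUE:MarriedTRUE row of the regression output
est <- -0.136014   # estimated interaction coefficient
se  <- 0.009717    # its standard error
df  <- 59518       # residual degrees of freedom

# t-statistic for the null hypothesis that the interaction coefficient is zero
t_stat <- est / se

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval for the interaction coefficient
ci <- est + c(-1, 1) * qt(0.975, df = df) * se

t_stat
p_val
ci
```

The t-statistic matches the one in the table up to rounding, and the confidence interval lies entirely below zero, which is consistent with rejecting the null of equal wage gaps.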
Exercise
Which parameter indicates the wage gap amongst unmarried individuals?