Week 4

(1) Saturated Models

Suppose we have a model with a vector of dummy variables:

\[ \mathbb{E}[y_{n}|\mathbf{d}_{n}] = \mathbf{d}_{n}\beta \]

with \(\mathbf{d}_{n} = [d_{1,n},\ d_{2,n},...,\ d_{K,n}]\) where each dummy is an indicator for a mutually exclusive event, meaning that \(d_{k,n}=1\) implies that \(d_{j,n}=0\) for all \(j\neq k\).

Example

Suppose that \(\mathbf{x}_{n} = [AGE_{n},SEX_{n}]\) and we define \(\mathbf{d}_{n}\) to be a vector containing one dummy for each combination of age and sex. Argue that this satisfies the conditions above. We call this a saturated model: there is one dummy for each unique potential realization of \(\mathbf{x}_{n}\).

Exercise

Re-writing the model as

\[ \mathbf{Y} = \mathbf{D}\beta + \epsilon \]

the OLS estimator is:

\[ \hat{\beta} = \left(\mathbf{D}^{T}\mathbf{D}\right)^{-1}\mathbf{D}^{T}\mathbf{Y}.\]

Show that each element \(\hat{\beta}_{k}\) is equal to the sample mean of \(y_{n}\) using only observations for which \(d_{k,n}=1\).
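A numerical check is not a proof, but it can help build intuition for the result. The sketch below uses simulated data (the variable names, group labels, and sample size are illustrative assumptions, not part of the course data), builds a saturated design matrix, and compares the matrix formula for \(\hat{\beta}\) with the cell-by-cell sample means.

```r
# Simulated data; names and sizes are illustrative assumptions.
set.seed(1)
N <- 200
age_group <- sample(c("young", "old"), N, replace = TRUE)
sex <- sample(c("F", "M"), N, replace = TRUE)
y <- rnorm(N)

# One dummy per (age, sex) cell and no intercept: a saturated design matrix.
cell <- interaction(age_group, sex)
D <- model.matrix(~ 0 + cell)

# Mutual exclusivity: each observation belongs to exactly one cell.
stopifnot(all(rowSums(D) == 1))

# OLS via the matrix formula (D'D)^{-1} D'Y.
beta_hat <- drop(solve(t(D) %*% D, t(D) %*% y))

# Sample mean of y within each cell, in the same order as the columns of D.
cell_means <- tapply(y, cell, mean)

# Compare the two, ignoring names.
all.equal(unname(beta_hat), as.vector(cell_means))
```

Inspecting `t(D) %*% D` and `t(D) %*% y` directly is a useful starting point for the algebraic argument.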

The same logic gives a rule that holds for any saturated regression involving dummy variables. Suppose that the vector \(\mathbf{x}_{n}\) can take one of \(K\) unique values \((x^{k})_{k=1}^{K}\), and suppose that there is a unique combination of dummy variables \(\mathbf{d}^{k}\) for each value. Then, for each \(k\), the OLS estimator satisfies:

\[ \mathbf{d}^{k}\hat{\beta} =\frac{\sum_{n}y_{n}\mathbf{1}\{\mathbf{x}_{n}=x^{k}\}}{\sum_{n}\mathbf{1}\{\mathbf{x}_{n}=x^k\}} \]

For example, if \(\mathbf{x}_{n} = [1,F_{n},C_{n},F_{n}C_{n}]\) where \(F_{n}\) is a dummy variable equal to 1 if person \(n\) is female, and \(C_{n}\) is a dummy variable equal to 1 if person \(n\) has a college education, then \(\mathbf{x}_{n}\) has four unique values. Hence, for example, \(\mathbb{E}[y_{n}|\text{Male},\text{Non-College}]=\beta_{0}\) and \(\hat{\beta}_{0}\) is the sample mean of \(y_{n}\) over all observations of non-college educated men. Similarly, \(\hat{\beta}_{0} + \hat{\beta}_{1}\) is equal to the sample mean of \(y_{n}\) over all non-college educated women, and so on.
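The female/college example can be checked on simulated data. This is a minimal sketch: the data-generating coefficients and the names `Fem` and `Col` are assumptions made for illustration (`F` is avoided as a variable name because R uses it as shorthand for `FALSE`).

```r
# Simulated data; coefficients and names are illustrative assumptions.
set.seed(2)
N <- 500
Fem <- rbinom(N, 1, 0.5)  # plays the role of F_n
Col <- rbinom(N, 1, 0.4)  # plays the role of C_n
y <- 1 + 0.5 * Fem + 0.8 * Col - 0.3 * Fem * Col + rnorm(N)

# Saturated regression: intercept, Fem, Col, and the interaction Fem:Col.
fit <- lm(y ~ Fem * Col)
b <- unname(coef(fit))  # order: intercept, Fem, Col, Fem:Col

# Sample means of y within three of the four cells.
mean_m_nc <- mean(y[Fem == 0 & Col == 0])  # non-college men
mean_f_nc <- mean(y[Fem == 1 & Col == 0])  # non-college women
mean_f_c  <- mean(y[Fem == 1 & Col == 1])  # college women

all.equal(b[1], mean_m_nc)         # intercept vs. non-college men
all.equal(b[1] + b[2], mean_f_nc)  # intercept + Fem vs. non-college women
all.equal(sum(b), mean_f_c)        # all four coefficients vs. college women
```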

(2) Testing hypotheses in a linear model

Here’s an exercise in how to use the correct interpretation of coefficients in a linear model to test hypotheses. First we read in the CPS data.

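library(tidyverse)
# Read the CPS extract and keep survey years 2014 through 2018
D <- read.csv("../data/cps-econ-4261.csv") %>%
  filter(YEAR>=2014,YEAR<=2018) %>%
  # Recode missing-value codes (and zero usual hours) as NA
  mutate(EARNWEEK = na_if(EARNWEEK,9999.99),
         UHRSWORKT = na_if(na_if(na_if(UHRSWORKT,999),997),0),
         HOURWAGE = na_if(HOURWAGE,999.99)) %>%
  # Hourly wage: weekly earnings over usual hours if PAIDHOUR is 1, reported hourly wage if PAIDHOUR is 2
  mutate(Wage = case_when(PAIDHOUR==1 ~ EARNWEEK/UHRSWORKT,PAIDHOUR==2 ~ HOURWAGE)) %>%
  # Drop observations with no usable wage
  filter(!is.na(Wage))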

Suppose we wanted to test whether the gender wage gap differs between married and unmarried individuals. The estimand (i.e. the population parameter we are after) is:

\[ \begin{aligned} \Delta = {} & \left(\mathbb{E}[\log(W)|\text{Female,Married}] - \mathbb{E}[\log(W)|\text{Male,Married}]\right) \\ & - \left(\mathbb{E}[\log(W)|\text{Female,Single}] - \mathbb{E}[\log(W)|\text{Male,Single}]\right) \end{aligned} \]

A simple way to test this would be to write the model with a dummy \(F_{n}=1\) if female (0 otherwise) and a dummy \(M_{n}=1\) if married (0 otherwise): \[\mathbb{E}[\log(W)|F_{n},M_{n}] = \beta_{0} + \beta_{1}F_{n} + \beta_{2}M_{n} + \beta_{3}M_{n}F_{n} \]

Exercise

Show that the population parameter \(\Delta\) defined above is equal to \(\beta_{3}\).
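The sample analogue of this claim can be checked numerically before working through the algebra. The sketch below is not a proof and uses simulated data (the coefficients and sample size are illustrative assumptions). It compares the difference-in-differences of the four cell means with the estimated interaction coefficient.

```r
# Simulated data; coefficients and sample size are illustrative assumptions.
set.seed(3)
N <- 1000
Female  <- rbinom(N, 1, 0.5) == 1
Married <- rbinom(N, 1, 0.5) == 1
logW <- 2.8 - 0.1 * Female + 0.3 * Married -
  0.15 * Female * Married + rnorm(N, sd = 0.5)

# Sample mean of log wages in a given (Female, Married) cell.
cell_mean <- function(f, m) mean(logW[Female == f & Married == m])

# Sample analogue of Delta: gap among married minus gap among single.
delta_hat <- (cell_mean(TRUE, TRUE) - cell_mean(FALSE, TRUE)) -
  (cell_mean(TRUE, FALSE) - cell_mean(FALSE, FALSE))

# Interaction coefficient from the saturated regression.
fit <- lm(logW ~ Female * Married)
b3_hat <- unname(coef(fit)["FemaleTRUE:MarriedTRUE"])

all.equal(delta_hat, b3_hat)
```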

Exercise

Use the following regression to test the null hypothesis that wage gaps are the same for married and single individuals.

D %>%
  mutate(Married = MARST==1,Female = SEX==2) %>%
  lm(log(Wage) ~ Female*Married,data=.) %>%
  summary()


Call:
lm(formula = log(Wage) ~ Female * Married, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.4330 -0.4229 -0.0461  0.3956  3.5880 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             2.803056   0.005147 544.640   <2e-16 ***
FemaleTRUE             -0.068686   0.007134  -9.629   <2e-16 ***
MarriedTRUE             0.353256   0.006866  51.450   <2e-16 ***
FemaleTRUE:MarriedTRUE -0.136014   0.009717 -13.997   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5902 on 59518 degrees of freedom
Multiple R-squared:  0.07279,   Adjusted R-squared:  0.07274 
F-statistic:  1557 on 3 and 59518 DF,  p-value: < 2.2e-16

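Notice that the two-sided p-value on the interaction coefficient is tiny: we strongly reject the null hypothesis that wage gaps are the same. Because the interaction estimate is negative (about -0.136), the female-male gap in log wages appears to be much bigger for married individuals.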
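The test statistic for the interaction coefficient can be reproduced from the printed table. This minimal sketch copies the estimate, standard error, and residual degrees of freedom from the output above, and assumes the usual homoskedastic t-test that `summary()` reports. The confidence interval is an addition for illustration.

```r
# Values copied from the FemaleTRUE:MarriedTRUE row of the regression output.
est <- -0.136014
se  <- 0.009717
df  <- 59518  # residual degrees of freedom

# t-statistic for H0: interaction coefficient equals zero.
t_stat <- est / se

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval for the interaction coefficient.
ci_95 <- est + c(-1, 1) * qt(0.975, df = df) * se

t_stat
p_value
ci_95
```

The computed `t_stat` should agree with the reported t value up to rounding of the printed estimate and standard error.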

Exercise

Which parameter indicates the wage gap amongst unmarried individuals?