Recitation 3

Part 1: Linear Regression

We are going to play around with the data we cleaned last week and test some regression specifications. This is going to be a simple exercise in fitting data by experimenting with different versions of the linear model. Part of the goal is to show how flexible the class of “linear models” really is.

First, let’s repeat the code to impute wages from last week. The only twist is that here we’ll just look at the year 2018.

library(tidyverse)
D <- read.csv("../data/cps-econ-4261.csv") %>%
  filter(YEAR == 2018) %>%
  mutate(EARNWEEK = na_if(EARNWEEK, 9999.99),
         UHRSWORKT = na_if(na_if(na_if(UHRSWORKT, 999), 997), 0),
         HOURWAGE = na_if(HOURWAGE, 999.99)) %>%
  mutate(Wage = case_when(PAIDHOUR == 1 ~ EARNWEEK / UHRSWORKT,
                          PAIDHOUR == 2 ~ HOURWAGE)) %>%
  filter(!is.na(Wage)) %>%
  mutate(SEX = factor(SEX, levels = c(1, 2), labels = c("Male", "Female")))

Suppose we wanted to estimate the conditional expectation of wages given age and sex using a linear model. As a first step, let’s try:

mod1 <- lm(log(Wage) ~ AGE + SEX,D)
summary(mod1)

Call:
lm(formula = log(Wage) ~ AGE + SEX, data = D)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.8218 -0.3929 -0.0533  0.3807  3.4134 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.6046250  0.0189484  137.46   <2e-16 ***
AGE          0.0114159  0.0004245   26.89   <2e-16 ***
SEXFemale   -0.1747719  0.0110014  -15.89   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5801 on 11131 degrees of freedom
Multiple R-squared:  0.07859,   Adjusted R-squared:  0.07842 
F-statistic: 474.7 on 2 and 11131 DF,  p-value: < 2.2e-16
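Since the outcome is log wages, each coefficient is approximately a proportional effect; the exact proportional difference implied by a coefficient \(b\) is \(e^{b}-1\). For example, using the SEXFemale estimate from the summary above:

```r
# The SEXFemale coefficient from mod1 above
b_female <- -0.1747719

# Exact implied proportional wage gap: exp(b) - 1
exp(b_female) - 1  # about -0.16: roughly 16% lower wages, holding age fixed
```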

Clearly, age is a strong predictor of wages (for obvious reasons). But how do we know that this linear specification fits well? Let’s try adding \(AGE^2\) to the specification:

mod2 <- lm(log(Wage) ~ poly(AGE,2) + SEX,D)
summary(mod2)

Call:
lm(formula = log(Wage) ~ poly(AGE, 2) + SEX, data = D)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9636 -0.3659 -0.0366  0.3606  3.6006 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.072204   0.007502  409.49   <2e-16 ***
poly(AGE, 2)1  15.597198   0.564526   27.63   <2e-16 ***
poly(AGE, 2)2 -14.203695   0.564341  -25.17   <2e-16 ***
SEXFemale      -0.169776   0.010703  -15.86   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5642 on 11130 degrees of freedom
Multiple R-squared:  0.1282,    Adjusted R-squared:  0.128 
F-statistic: 545.6 on 3 and 11130 DF,  p-value: < 2.2e-16
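A note on the output above: by default, `poly()` constructs orthogonal polynomials, so the two `poly(AGE, 2)` coefficients are not the raw coefficients on \(AGE\) and \(AGE^2\). The fitted values, however, are identical either way. A minimal sketch on simulated data (hypothetical variables, not the CPS extract):

```r
set.seed(1)
age  <- sample(18:65, 500, replace = TRUE)
logw <- 1 + 0.05 * age - 0.0005 * age^2 + rnorm(500, sd = 0.5)

m_orth <- lm(logw ~ poly(age, 2))               # orthogonal basis (default)
m_raw  <- lm(logw ~ poly(age, 2, raw = TRUE))   # raw age and age^2 terms

# Different coefficients, identical fit:
max(abs(fitted(m_orth) - fitted(m_raw)))  # essentially zero
```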

Did this help very much? One way to see this is to look at the \(R^2\), which we will go over in class. Informally, you can think of it as the fraction of the overall variation in the outcome variable (log wages) that is explained by the right-hand-side variables. Notice that when we add \(AGE^2\), the \(R^2\) rises from about 0.079 to 0.128, so the share of variation we can explain increases by more than half.
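To make the informal definition concrete: \(R^2 = 1 - SSR/SST\), the residual sum of squares over the total sum of squares. A quick check on simulated data (hypothetical variables, not the CPS extract):

```r
set.seed(2)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
mod <- lm(y ~ x)

ssr <- sum(resid(mod)^2)       # residual sum of squares
sst <- sum((y - mean(y))^2)    # total sum of squares
r2_manual <- 1 - ssr / sst

all.equal(r2_manual, summary(mod)$r.squared)  # TRUE
```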

Often in simple scenarios like this one, we can visualize the fit of the model. Since age is a discrete random variable, this is actually pretty simple in this case. First let’s take averages.

mean_wage <- D %>%
  group_by(AGE,SEX) %>%
  summarize(meanwage=mean(log(Wage),na.rm = TRUE)) %>%
  mutate(case="true mean")

Now, in the same dataframe, let’s fill in both models’ predictions for mean wages.

mean_wage$mod1 <- predict(mod1,mean_wage)
mean_wage$mod2 <- predict(mod2,mean_wage)

And now let’s plot the first model’s predictions against the actual means:

ggplot(mean_wage,aes(x=AGE,y=meanwage,color=SEX)) + geom_point() + geom_line(aes(x=AGE,y=mod1,color=SEX))

And the second:

ggplot(mean_wage,aes(x=AGE,y=meanwage,color=SEX)) + geom_point() + geom_line(aes(x=AGE,y=mod2,color=SEX))

So it looks like the second model is doing pretty well! One might be tempted to allow for different age trends by sex. This can be done by specifying an interaction between sex and the polynomial like so:

mod3 <- lm(log(Wage) ~ SEX*poly(AGE,2),D)
summary(mod3)

Call:
lm(formula = log(Wage) ~ SEX * poly(AGE, 2), data = D)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.9768 -0.3635 -0.0382  0.3613  3.6187 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               3.072712   0.007498 409.795  < 2e-16 ***
SEXFemale                -0.169765   0.010694 -15.875  < 2e-16 ***
poly(AGE, 2)1            17.964729   0.798518  22.498  < 2e-16 ***
poly(AGE, 2)2           -15.304270   0.798145 -19.175  < 2e-16 ***
SEXFemale:poly(AGE, 2)1  -4.804650   1.128593  -4.257 2.09e-05 ***
SEXFemale:poly(AGE, 2)2   2.352687   1.128223   2.085   0.0371 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5637 on 11128 degrees of freedom
Multiple R-squared:   0.13, Adjusted R-squared:  0.1296 
F-statistic: 332.5 on 5 and 11128 DF,  p-value: < 2.2e-16

Note that the gain in model fit (as measured by \(R^2\)) is much more modest. The visual fit is:

mean_wage$mod3 <- predict(mod3,mean_wage)
ggplot(mean_wage,aes(x=AGE,y=meanwage,color=SEX)) + geom_point() + geom_line(aes(x=AGE,y=mod3,color=SEX))

There seems to be a clear improvement in fitting wages for younger ages here.

Of course, the linear model can also fit the conditional sample means perfectly: we just need to include a dummy variable for each combination of age and sex in the data. (We have either covered dummy variables in class already or will very soon; more on this in recitation next week.) For reference, this can be achieved in lm by interacting AGE dummies with SEX dummies:

mod4 <- lm(log(Wage) ~ as.factor(AGE)*SEX,D)
summary(mod4)

Call:
lm(formula = log(Wage) ~ as.factor(AGE) * SEX, data = D)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7996 -0.3654 -0.0306  0.3577  3.4768 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 2.313976   0.068544  33.759  < 2e-16 ***
as.factor(AGE)19            0.104450   0.091660   1.140 0.254505    
as.factor(AGE)20            0.138925   0.092396   1.504 0.132719    
as.factor(AGE)21            0.144780   0.085694   1.689 0.091152 .  
as.factor(AGE)22            0.287991   0.086799   3.318 0.000910 ***
as.factor(AGE)23            0.313305   0.083948   3.732 0.000191 ***
as.factor(AGE)24            0.390041   0.086948   4.486 7.33e-06 ***
as.factor(AGE)25            0.600890   0.086510   6.946 3.97e-12 ***
as.factor(AGE)26            0.559281   0.084602   6.611 4.00e-11 ***
as.factor(AGE)27            0.596582   0.084054   7.098 1.35e-12 ***
as.factor(AGE)28            0.656963   0.084832   7.744 1.05e-14 ***
as.factor(AGE)29            0.710577   0.082275   8.637  < 2e-16 ***
as.factor(AGE)30            0.654140   0.083063   7.875 3.72e-15 ***
as.factor(AGE)31            0.719313   0.084268   8.536  < 2e-16 ***
as.factor(AGE)32            0.711952   0.082880   8.590  < 2e-16 ***
as.factor(AGE)33            0.730648   0.083642   8.735  < 2e-16 ***
as.factor(AGE)34            0.721370   0.086092   8.379  < 2e-16 ***
as.factor(AGE)35            0.823971   0.083845   9.827  < 2e-16 ***
as.factor(AGE)36            0.882312   0.083157  10.610  < 2e-16 ***
as.factor(AGE)37            0.944540   0.083063  11.371  < 2e-16 ***
as.factor(AGE)38            0.844584   0.083444  10.122  < 2e-16 ***
as.factor(AGE)39            0.946144   0.084602  11.183  < 2e-16 ***
as.factor(AGE)40            1.010466   0.085314  11.844  < 2e-16 ***
as.factor(AGE)41            0.872607   0.084268  10.355  < 2e-16 ***
as.factor(AGE)42            0.747968   0.085438   8.754  < 2e-16 ***
as.factor(AGE)43            0.894767   0.087408  10.237  < 2e-16 ***
as.factor(AGE)44            0.949078   0.084378  11.248  < 2e-16 ***
as.factor(AGE)45            0.974837   0.085191  11.443  < 2e-16 ***
as.factor(AGE)46            0.879415   0.083742  10.501  < 2e-16 ***
as.factor(AGE)47            0.974550   0.083444  11.679  < 2e-16 ***
as.factor(AGE)48            0.905618   0.085191  10.631  < 2e-16 ***
as.factor(AGE)49            0.853233   0.084716  10.072  < 2e-16 ***
as.factor(AGE)50            0.851412   0.083444  10.203  < 2e-16 ***
as.factor(AGE)51            0.944642   0.085694  11.023  < 2e-16 ***
as.factor(AGE)52            0.959412   0.086368  11.108  < 2e-16 ***
as.factor(AGE)53            0.968176   0.084268  11.489  < 2e-16 ***
as.factor(AGE)54            0.930445   0.083948  11.084  < 2e-16 ***
as.factor(AGE)55            0.919127   0.083157  11.053  < 2e-16 ***
as.factor(AGE)56            0.838584   0.085069   9.858  < 2e-16 ***
as.factor(AGE)57            0.777065   0.088940   8.737  < 2e-16 ***
as.factor(AGE)58            0.895532   0.087252  10.264  < 2e-16 ***
as.factor(AGE)59            0.935476   0.086948  10.759  < 2e-16 ***
as.factor(AGE)60            0.885407   0.088757   9.976  < 2e-16 ***
as.factor(AGE)61            0.866235   0.089315   9.699  < 2e-16 ***
as.factor(AGE)62            0.902350   0.092914   9.712  < 2e-16 ***
as.factor(AGE)63            0.773231   0.096230   8.035 1.03e-15 ***
as.factor(AGE)64            0.777869   0.098870   7.868 3.95e-15 ***
as.factor(AGE)65            0.739744   0.101098   7.317 2.71e-13 ***
SEXFemale                  -0.055022   0.100626  -0.547 0.584528    
as.factor(AGE)19:SEXFemale -0.071465   0.134910  -0.530 0.596314    
as.factor(AGE)20:SEXFemale -0.083930   0.130024  -0.645 0.518619    
as.factor(AGE)21:SEXFemale  0.001741   0.125805   0.014 0.988959    
as.factor(AGE)22:SEXFemale  0.016112   0.126679   0.127 0.898795    
as.factor(AGE)23:SEXFemale -0.027590   0.126080  -0.219 0.826787    
as.factor(AGE)24:SEXFemale  0.086259   0.126320   0.683 0.494710    
as.factor(AGE)25:SEXFemale -0.163560   0.126480  -1.293 0.195980    
as.factor(AGE)26:SEXFemale  0.003127   0.122733   0.025 0.979677    
as.factor(AGE)27:SEXFemale -0.080538   0.122517  -0.657 0.510965    
as.factor(AGE)28:SEXFemale -0.065484   0.123304  -0.531 0.595374    
as.factor(AGE)29:SEXFemale -0.112805   0.119990  -0.940 0.347178    
as.factor(AGE)30:SEXFemale -0.064558   0.122181  -0.528 0.597250    
as.factor(AGE)31:SEXFemale -0.116877   0.121832  -0.959 0.337412    
as.factor(AGE)32:SEXFemale -0.039126   0.121317  -0.323 0.747069    
as.factor(AGE)33:SEXFemale  0.010848   0.122664   0.088 0.929531    
as.factor(AGE)34:SEXFemale -0.065837   0.124901  -0.527 0.598125    
as.factor(AGE)35:SEXFemale -0.039240   0.121978  -0.322 0.747689    
as.factor(AGE)36:SEXFemale -0.147871   0.122608  -1.206 0.227825    
as.factor(AGE)37:SEXFemale -0.211890   0.123909  -1.710 0.087287 .  
as.factor(AGE)38:SEXFemale -0.010687   0.122803  -0.087 0.930650    
as.factor(AGE)39:SEXFemale -0.205477   0.123782  -1.660 0.096945 .  
as.factor(AGE)40:SEXFemale -0.304310   0.122287  -2.488 0.012843 *  
as.factor(AGE)41:SEXFemale -0.044927   0.123092  -0.365 0.715127    
as.factor(AGE)42:SEXFemale -0.023990   0.126651  -0.189 0.849767    
as.factor(AGE)43:SEXFemale -0.079584   0.125006  -0.637 0.524370    
as.factor(AGE)44:SEXFemale -0.212457   0.122659  -1.732 0.083284 .  
as.factor(AGE)45:SEXFemale -0.064950   0.125462  -0.518 0.604687    
as.factor(AGE)46:SEXFemale -0.125364   0.122142  -1.026 0.304736    
as.factor(AGE)47:SEXFemale -0.119691   0.122993  -0.973 0.330500    
as.factor(AGE)48:SEXFemale -0.212884   0.121417  -1.753 0.079573 .  
as.factor(AGE)49:SEXFemale -0.196208   0.121677  -1.613 0.106876    
as.factor(AGE)50:SEXFemale -0.119624   0.123822  -0.966 0.334019    
as.factor(AGE)51:SEXFemale -0.264906   0.122892  -2.156 0.031136 *  
as.factor(AGE)52:SEXFemale -0.253078   0.124282  -2.036 0.041742 *  
as.factor(AGE)53:SEXFemale -0.286925   0.121179  -2.368 0.017913 *  
as.factor(AGE)54:SEXFemale -0.133262   0.122049  -1.092 0.274913    
as.factor(AGE)55:SEXFemale -0.219904   0.121661  -1.808 0.070710 .  
as.factor(AGE)56:SEXFemale -0.105486   0.123821  -0.852 0.394275    
as.factor(AGE)57:SEXFemale  0.039996   0.126166   0.317 0.751243    
as.factor(AGE)58:SEXFemale -0.134651   0.126312  -1.066 0.286441    
as.factor(AGE)59:SEXFemale -0.250031   0.127276  -1.964 0.049499 *  
as.factor(AGE)60:SEXFemale -0.184519   0.125954  -1.465 0.142955    
as.factor(AGE)61:SEXFemale -0.062423   0.128299  -0.487 0.626592    
as.factor(AGE)62:SEXFemale -0.261293   0.132506  -1.972 0.048642 *  
as.factor(AGE)63:SEXFemale -0.037717   0.135299  -0.279 0.780426    
as.factor(AGE)64:SEXFemale -0.231381   0.137499  -1.683 0.092445 .  
as.factor(AGE)65:SEXFemale -0.043713   0.143682  -0.304 0.760957    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5611 on 11038 degrees of freedom
Multiple R-squared:  0.1452,    Adjusted R-squared:  0.1378 
F-statistic: 19.73 on 95 and 11038 DF,  p-value: < 2.2e-16

mean_wage$mod4 <- predict(mod4,mean_wage)
ggplot(mean_wage,aes(x=AGE,y=meanwage,color=SEX)) + geom_point() + geom_line(aes(x=AGE,y=mod4,color=SEX))

We will show how this works next week.
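As a preview, here is a sketch (on simulated data, with hypothetical variables standing in for AGE and SEX) of the key fact: a fully interacted dummy regression reproduces every cell mean exactly.

```r
set.seed(3)
# Two discrete regressors, standing in for AGE and SEX
g1 <- factor(sample(c("a", "b", "c"), 300, replace = TRUE))
g2 <- factor(sample(c("m", "f"), 300, replace = TRUE))
y  <- rnorm(300)

mod_sat <- lm(y ~ g1 * g2)   # saturated: one dummy per (g1, g2) cell

cellmean <- ave(y, g1, g2)   # each observation's cell mean
max(abs(fitted(mod_sat) - cellmean))  # essentially zero
```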

Part 2: Theory

Suppose that you have iid data \((X_{n},Y_{n})_{n=1}^{N}\). While each pair indexed by \(n\) is independent of every other pair \(n'\), the two variables within a pair can still covary:

\[ \mathbb{C}(X_{n},Y_{n}) = \beta.\]

Suppose you want to test the hypothesis that \(\mu_{X}=\mu_{Y}\) against a two-sided alternative.

First, show that:

\[ \mathbb{V}[\overline{X} - \overline{Y}] = \frac{\sigma^2_{X}+\sigma^2_{Y} - 2\beta}{N} \]
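A sketch of the key step: expand the variance of the difference and use independence across pairs,

\[ \mathbb{V}[\overline{X}-\overline{Y}] = \mathbb{V}[\overline{X}] + \mathbb{V}[\overline{Y}] - 2\,\mathbb{C}(\overline{X},\overline{Y}), \qquad \mathbb{C}(\overline{X},\overline{Y}) = \frac{1}{N^{2}}\sum_{n=1}^{N}\sum_{n'=1}^{N}\mathbb{C}(X_{n},Y_{n'}) = \frac{N\beta}{N^{2}} = \frac{\beta}{N}, \]

where the double sum collapses because \(\mathbb{C}(X_{n},Y_{n'})=0\) whenever \(n\neq n'\).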

Now, suppose that (1) \(\beta>0\); (2) you incorrectly test the hypothesis under the assumption that \(X\) and \(Y\) are independent; and (3) the null is true (\(\mu_{X}=\mu_{Y}\)).

For a given target size \(\alpha\) of the test, is the actual size of the test larger or smaller than intended as \(N\) grows large? You can use the simulation code below to help you:

monte_carlo_test <- function(B,N,beta,alpha) {
  zcrit = qnorm(1-alpha/2) #<- two-sided critical value
  reject <- numeric(B)
  for (b in 1:B) {
    x = rnorm(N) #<- draw X as standard normal: mean 0, variance 1
    y = beta*x + rnorm(N) #<- Y has mean 0, variance 1+beta^2, and Cov(x,y) = beta
    test_stat <- (mean(x)-mean(y))/sqrt((var(x)+var(y))/N) #<- test statistic under the (incorrect) independence assumption
    reject[b] <- abs(test_stat)>zcrit
  }
  mean(reject) #<- simulated rejection rate = estimated size of the test
}

The simulation essentially gives away the answer, but it is more important that you understand why.

monte_carlo_test(5000,50,0.3,0.05)
[1] 0.026
monte_carlo_test(5000,100,0.3,0.05)
[1] 0.023
monte_carlo_test(5000,1000,0.3,0.05)
[1] 0.0172

Compared to:

monte_carlo_test(5000,50,0.,0.05)
[1] 0.0504
monte_carlo_test(5000,100,0.,0.05)
[1] 0.0524
monte_carlo_test(5000,1000,0.,0.05)
[1] 0.0474
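The pattern has a simple analytical explanation. Under the null, the statistic above divides by the independence-based standard error \(\sqrt{(\sigma^2_X+\sigma^2_Y)/N}\), while its true standard deviation is \(\sqrt{(\sigma^2_X+\sigma^2_Y-2\beta)/N}\); when \(\beta>0\), the statistic is therefore under-dispersed and the test rejects too rarely. For the simulation’s parameterization (\(\sigma^2_X=1\), \(\sigma^2_Y=1+\beta^2\)), the asymptotic size can be computed directly:

```r
# Asymptotic size of the two-sided test when independence is (wrongly) assumed,
# matching the simulation above: X ~ N(0,1), Y = beta*X + N(0,1).
actual_size <- function(beta, alpha = 0.05) {
  zcrit <- qnorm(1 - alpha / 2)
  vx <- 1
  vy <- 1 + beta^2
  s  <- sqrt((vx + vy - 2 * beta) / (vx + vy))  # true sd of the statistic
  2 * (1 - pnorm(zcrit / s))
}

actual_size(0.3)  # about 0.020, consistent with the simulations above
actual_size(0)    # exactly 0.05 when beta = 0
```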

A Disclaimer for IPUMS CPS data

These data are a subsample of the IPUMS CPS data available from cps.ipums.org. Any use of these data should be cited as follows:

Sarah Flood, Miriam King, Renae Rodgers, Steven Ruggles, J. Robert Warren, Daniel Backman, Annie Chen, Grace Cooper, Stephanie Richards, Megan Schouweiler, and Michael Westberry. IPUMS CPS: Version 11.0 [dataset]. Minneapolis, MN: IPUMS, 2023. https://doi.org/10.18128/D030.V11.0

The CPS data file is intended only for exercises as part of ECON4261. Individuals are not to redistribute the data without permission. Contact for redistribution requests. For all other uses of these data, please access data directly via cps.ipums.org.