Week 6

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.5     ✔ purrr   0.3.4
✔ tibble  3.2.1     ✔ dplyr   1.1.1
✔ tidyr   1.1.3     ✔ stringr 1.4.0
✔ readr   2.0.0     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

1 - Deriving “clustered” standard errors

Recall that if we have a random sample of \(C\) “clusters” of observations of \(X\), indexed \(c=1,2,\dots,C\), each with \(N_{c}\) observations, then the sample mean can be written as:

\[ \overline{X} = \frac{1}{\overline{N}}\sum_{c=1}^{C}\sum_{n=1}^{N_{c}} X_{c,n},\] where \(\overline{N} = \sum_{c}N_{c}\) is the total number of observations.

Suppose that \(\mathbb{E}[X]=0\), and that observations of \(X\) are independent across clusters but not within them. Independence across clusters means the cross-cluster covariance terms vanish, so the variance of the sample mean is the sum of the variances of the within-cluster sums; with identically distributed clusters this is:

\[ V[\overline{X}] = \frac{1}{\overline{N}^2}\times C \times \mathbb{E}\left[\left(\sum_{n=1}^{N_{c}}X_{c,n}\right)^2\right] \]

which we can estimate by replacing each observation with its deviation from the sample mean and averaging the squared within-cluster sums.
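
As a quick illustration, here is a minimal sketch of this estimator in R (the cluster structure, sizes, and names here are assumptions for the example, not part of the derivation above):

set.seed(60)
C <- 500; Nc <- 5                      # C clusters of Nc observations each
cl <- rep(1:C, each = Nc)              # cluster labels
x <- rep(rnorm(C), each = Nc) + rnorm(C * Nc)  # cluster shock induces within-cluster correlation
Nbar <- C * Nc

# sample analogue: demean by the sample mean, then sum the squared within-cluster sums
cluster_sums <- tapply(x - mean(x), cl, sum)
v_clustered <- sum(cluster_sums^2) / Nbar^2
v_iid <- var(x) / Nbar                 # what we would (wrongly) use under iid
c(clustered = v_clustered, iid = v_iid)

The iid formula ignores the positive within-cluster covariances, so it will typically be too small here.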

Now suppose instead that we have the regression model:

\[ Y_{c,n} = X_{c,n}\beta + \epsilon_{c,n} \]

where the error terms \(\epsilon_{c,n}\) are independent across clusters but not within them. In other words, \(\epsilon_{c,n}\) and \(\epsilon_{d,m}\) are independent whenever \(c\neq d\), but may be correlated when \(c=d\). Recall that:

\[ V[\hat{\beta}] = V[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'{\epsilon}]\]

which we now must calculate allowing for the correlation within each cluster. Following the logic above, we can estimate the variance of \(\hat{\beta}\) as:

\[ \widehat{V[\hat{\beta}]} = (\mathbf{X}'\mathbf{X})^{-1}\sum_{c=1}^{C}\left(\sum_{n=1}^{N_{c}}X_{c,n}'\hat{\epsilon}_{c,n}\right)\left(\sum_{n=1}^{N_{c}}\hat{\epsilon}_{c,n}X_{c,n}\right)(\mathbf{X}'\mathbf{X})^{-1}\]

which, if we let \(\mathbf{X}_{c}\) denote all the observations in cluster \(c\) stacked in a matrix, and similarly for \(\hat{\epsilon}_{c}\), can be written as:

\[ \widehat{V[\hat{\beta}]} = (\mathbf{X}'\mathbf{X})^{-1}\sum_{c=1}^{C}\left(\mathbf{X}_{c}'\hat{\epsilon}_{c}\hat{\epsilon}_{c}'\mathbf{X}_{c}\right)(\mathbf{X}'\mathbf{X})^{-1}.\]

This is the calculation being made when researchers refer to “clustering” their standard errors.
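
To make the formula concrete, here is a hand-rolled version of this sandwich estimator in R. This is a sketch for intuition only: the data-generating process and variable names (`cl`, `Nc`, and so on) are illustrative assumptions, not part of the example below.

set.seed(60)
C <- 200; Nc <- 10
cl <- rep(1:C, each = Nc)                 # cluster labels
x <- rnorm(C * Nc)
y <- 0.5 * x + rep(rnorm(C), each = Nc) + rnorm(C * Nc)  # cluster-correlated errors

X <- cbind(1, x)                          # design matrix with an intercept
bread <- solve(t(X) %*% X)                # (X'X)^{-1}
ehat <- y - X %*% (bread %*% t(X) %*% y)  # OLS residuals

# the "meat": sum over clusters of (X_c' e_c)(e_c' X_c)
meat <- matrix(0, ncol(X), ncol(X))
for (c in unique(cl)) {
  s <- t(X[cl == c, , drop = FALSE]) %*% ehat[cl == c]
  meat <- meat + s %*% t(s)
}
V <- bread %*% meat %*% bread             # the sandwich
sqrt(diag(V))                             # clustered standard errors

Packages such as fixest layer finite-sample corrections on top of this formula, so hand-rolled numbers will differ slightly from package output.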

2 - An Example

Suppose that we are evaluating a classroom intervention, and the model of outcomes for student \(n\) in classroom \(c\) of school \(s\) is:

\[ Y_{s,c,n} = \alpha D_{s} + \mu_{s} + \gamma_{s,c} + \epsilon_{s,c,n} \]

We will assume 3 classrooms per school and 10 students per classroom, with \(\mu_{s}\) a school-level effect and \(\gamma_{s,c}\) a classroom-level effect. Here, \(\alpha\) is the parameter of interest: the treatment effect of the intervention. Notice that \(D_{s}\) is assigned at the level of the school.

Here is code to generate the outcomes:

generate_data <- function(alpha,S) {
  # draw the school effect and the school-level treatment assignment
  mu_s = rnorm(S)
  D = runif(S)<0.5
  d = data.frame()
  for (c in 1:3) {
    # classroom effect: one draw per school for classroom c (mean 0.2)
    gamma_c = rnorm(S,0.2)
    for (n in 1:10) {
      # student-level error (mean 0.2)
      eps_n = rnorm(S,0.2)
      # class ids are unique across schools: (s-1)*3 + c
      d = rbind(d,data.frame(school = 1:S,class = (1:S - 1)*3 + c,D = D,y = alpha*D + mu_s + gamma_c + eps_n))
    }
  }
  d
}
d = generate_data(0.1,100)

Questions:

  1. Notice that the error terms must be correlated. What is the appropriate level of clustering here?
  2. Previously we saw that including fixed effects could also remove this correlation. What prohibits us from doing that in this model? (hint: would the rank condition hold if we used school or classroom fixed effects? see the quick check below)
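
On question 2, here is a quick check using the simulated data `d`: because \(D_{s}\) is constant within each school, it is a linear combination of the school dummies, so the rank condition fails and \(\alpha\) is not identified.

# with school fixed effects included first, D is perfectly collinear
# and lm() returns NA for its coefficient
coef(lm(y ~ factor(school) + D, data = d))["DTRUE"]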

Here is a way to estimate the model with proper standard errors:

library(fixest)

feols(y ~ D,d) %>%
  summary(cluster = "school")
OLS estimation, Dep. Var.: y
Observations: 3,000 
Standard-errors: Clustered (school) 
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.321996   0.159157 2.02313 0.045756 *  
DTRUE       0.597401   0.231198 2.58394 0.011228 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.73363   Adj. R2: 0.027789

We can compare this to the standard errors without clustering:

feols(y ~ D,d) %>%
  summary("standard")
OLS estimation, Dep. Var.: y
Observations: 3,000 
Standard-errors: IID 
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) 0.321996   0.041574 7.74504 1.2975e-14 ***
DTRUE       0.597401   0.064151 9.31245  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 1.73363   Adj. R2: 0.027789

To see the consequence of failing to account for the correlation in errors, let’s check the actual size of a two-sided test of \(\alpha=0\) when this is the true value. If the standard errors are correct, we should reject close to 5% of the time when we use a critical value of 1.96.

# a helper to test H0: alpha = 0 against a two-sided alternative at the 5% level,
# using the model's default standard errors
testcoef <- function(mod) {
  stat <- mod$coefficients[2]/mod$se[2]
  abs(stat)>1.96 
}
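
For example, applied to the regression above (with its default iid standard errors), the t statistic of roughly 9.3 implies a rejection:

testcoef(feols(y ~ D,d))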

# compares rejection rates under iid and clustered standard errors
# for a sample of S schools over B Monte Carlo trials
simulation_test <- function(B,S) {
  test1 <- numeric(B)
  test2 <- numeric(B)
  for (b in 1:B) {
    d = generate_data(0,S)
    mod <- feols(y ~ D,d)
    se1 = mod$se[2] #<- assuming iid errors
    se2 = sqrt(vcov(mod,cluster ="school")[2,2])
    t1 = mod$coefficients[2]/se1
    t2 = mod$coefficients[2]/se2
    test1[b] <- abs(t1)>1.96
    test2[b] <- abs(t2)>1.96
  }
  print(paste("The mean rejection rate under the iid assumption is ",mean(test1)))
  print(paste("The mean rejection rate when clustering standard errors by school is ",mean(test2)))
}
simulation_test(100,100)
[1] "The mean rejection rate under the iid assumption is  0.61"
[1] "The mean rejection rate when clustering standard errors by school is  0.09"

Look how badly the test does when the correlation in errors is not accounted for! In this case, it drastically overstates the precision available from these 3,000 observations. The size of the test based on clustered standard errors won’t always be exactly 5% in finite samples, but it becomes increasingly accurate as \(S\rightarrow\infty\).
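
To see this, one can rerun the simulation with more schools (slower to run, and the exact rate will vary with the random draws):

# with more clusters, the clustered test's size should approach the nominal 5%
simulation_test(100,500)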