# 1 - Deriving "clustered" standard errors

```{r}
library(tidyverse)
```

Recall that if we have a random sample of \(C\) “clusters” of observations of \(X\), indexed \(c=1,2,\ldots,C\), each with \(N_{c}\) observations, then the sample mean can be written as:
\[ \overline{X} = \frac{1}{\overline{N}}\sum_{c=1}^{C}\sum_{n=1}^{N_{c}} X_{c,n}, \]
where \(\overline{N} = \sum_{c}N_{c}\) is the total number of observations.
Suppose that \(\mathbb{E}[X]=0\), and that observations of \(X\) are independent across clusters but *not* within them. Since the cluster sums are then independent and mean-zero, their variances add, and (treating clusters as identically distributed) the variance of the sample mean is:
\[ V[\overline{X}] = \frac{1}{\overline{N}^2}\times C \times \mathbb{E}\left[\left(\sum_{n=1}^{N_{c}}X_{c,n}\right)^2\right] \]
which we could then estimate by replacing the expectation with the sample average of the squared cluster sums.
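To make this concrete, here is a minimal sketch (the data-generating process, cluster count, and cluster sizes are illustrative assumptions, not from the notes) that estimates \(V[\overline{X}]\) both ways on simulated clustered data:

```{r}
# illustrative: clusters share a common shock, so observations are correlated within clusters
set.seed(1)
C <- 200                                   # number of clusters (assumed)
Nc <- 5                                    # observations per cluster (assumed)
cluster <- rep(1:C, each = Nc)
x <- rnorm(C)[cluster] + rnorm(C * Nc)     # cluster shock + individual noise
Nbar <- length(x)

# clustered estimate: average of squared (demeaned) cluster sums, scaled by 1/Nbar^2
cluster_sums <- tapply(x - mean(x), cluster, sum)
V_clustered <- sum(cluster_sums^2) / Nbar^2

# naive iid estimate, which ignores the within-cluster correlation
V_iid <- var(x) / Nbar

c(clustered = V_clustered, iid = V_iid)    # the iid estimate is too small here
```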
Now suppose instead that we have the regression model:
\[ Y_{c,n} = X_{c,n}\beta + \epsilon_{c,n} \]
where the error terms \(\epsilon_{c,n}\) are independent across clusters but not within them. In other words, \(\epsilon_{c,n}\) and \(\epsilon_{d,m}\) are independent whenever \(c\neq d\), but may be correlated when \(c=d\). Recall that:
\[ V[\hat{\beta}] = V[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\epsilon] \]
which we must now calculate allowing for the correlation within each cluster. Following the logic above, we can estimate the variance of \(\hat{\beta}\) as:
\[ \widehat{V[\hat{\beta}]} = (\mathbf{X}'\mathbf{X})^{-1}\sum_{c=1}^{C}\left(\sum_{n=1}^{N_{c}}X_{c,n}'\hat{\epsilon}_{c,n}\right)\left(\sum_{n=1}^{N_{c}}\hat{\epsilon}_{c,n}X_{c,n}\right)(\mathbf{X}'\mathbf{X})^{-1} \]
which, if we let \(\mathbf{X}_{c}\) be all the observations in cluster \(c\) stacked in a matrix, and likewise \(\hat{\epsilon}_{c}\) the stacked residuals, can be written as:
\[ \widehat{V[\hat{\beta}]} = (\mathbf{X}'\mathbf{X})^{-1}\sum_{c=1}^{C}\left(\mathbf{X}_{c}'\hat{\epsilon}_{c}\hat{\epsilon}_{c}'\mathbf{X}_{c}\right)(\mathbf{X}'\mathbf{X})^{-1}. \]
This is the calculation being made when researchers refer to “clustering” their standard errors.
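As a sketch of what this sandwich formula computes (the simulated data and variable names here are illustrative assumptions, not from the notes), we can code it by hand:

```{r}
# hand-rolled clustered sandwich estimator; the data-generating process is illustrative
set.seed(2)
C <- 50; Nc <- 8
cl <- rep(1:C, each = Nc)
X <- cbind(1, rnorm(C * Nc))              # intercept plus one regressor
eps <- rnorm(C)[cl] + rnorm(C * Nc)       # errors share a within-cluster shock
y <- X %*% c(1, 0.5) + eps

XtX_inv <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y
ehat <- as.vector(y - X %*% beta_hat)

# "meat": sum over clusters of (X_c' e_c)(e_c' X_c)
meat <- matrix(0, ncol(X), ncol(X))
for (c in 1:C) {
  idx <- cl == c
  s <- t(X[idx, , drop = FALSE]) %*% ehat[idx]   # X_c' e_c
  meat <- meat + s %*% t(s)
}
V_clustered <- XtX_inv %*% meat %*% XtX_inv      # the sandwich
sqrt(diag(V_clustered))                          # clustered standard errors
```

Comparing these with a packaged implementation is a useful sanity check; note that packages such as fixest apply finite-sample corrections by default, so their reported standard errors will differ slightly from this raw formula.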
# 2 - An Example

Suppose that we are evaluating a classroom intervention and that the model of outcomes for student \(n\) in classroom \(c\) at school \(s\) is:
\[ Y_{s,c,n} = \alpha D_{s} + \mu_{s} + \gamma_{c} + \epsilon_{n} \]
We will assume 3 classrooms per school and 10 students per classroom. Here, \(\alpha\) is the parameter of interest: the treatment effect of the intervention. Notice that \(D_{s}\) is assigned at the level of the school.

Here is code to generate the outcomes:

```{r}
generate_data <- function(alpha, S) {
  # draw the school effect and the school-level treatment assignment
  mu_s <- rnorm(S)
  D <- runif(S) < 0.5
  d <- data.frame()
  for (c in 1:3) {
    # draw the classroom effect
    gamma_c <- rnorm(S, 0.2)
    for (n in 1:10) {
      # draw the individual error
      eps_n <- rnorm(S, 0.2)
      d <- rbind(d, data.frame(
        school = 1:S,
        class = (1:S - 1) * 3 + c,
        D = D,
        y = alpha * D + mu_s + gamma_c + eps_n
      ))
    }
  }
  d
}
```

```{r}
d <- generate_data(0.1, 100)
```
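As a quick sanity check (this snippet is an illustrative addition, not part of the original notes), we can see the within-school correlation in the generated outcomes: school means of \(y\) vary far more than iid sampling would imply.

```{r}
# if errors were iid, the variance of a 30-student school mean would be about var(y)/30;
# the shared school and classroom effects make the observed variance much larger
school_means <- tapply(d$y, d$school, mean)
c(observed = var(school_means), iid_implied = var(d$y) / 30)
```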
Questions:

1. Notice that the error terms must be correlated. What is the appropriate level of clustering here?
2. Previously we saw that including fixed effects could also remove correlation. What prohibits us from doing that in this model? (Hint: would the rank condition hold if we used school or classroom fixed effects? See the illustrative check after the estimation code below.)
Here is a way to estimate the model with proper standard errors:

```{r}
library(fixest)
feols(y ~ D, d) %>% summary(cluster = "school")
```

which we can compare to standard errors without the clustering:

```{r}
feols(y ~ D, d) %>% summary("standard")
```
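As a nudge toward the hint in question 2 (this check is an illustrative addition, not part of the original notes): \(D_{s}\) does not vary within a school, so it is collinear with school fixed effects and the rank condition fails.

```{r}
# D is constant within school, so school fixed effects absorb it entirely;
# fixest detects the collinearity and drops D from the regression
feols(y ~ D | school, d)
```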
In order to see the consequence of failing to deal with the correlation in errors, let's check the actual size of a two-sided test of \(\alpha=0\) when this is the true value. If the standard errors are correct, the rejection rate should be close to 5% when we use critical values of \(\pm 1.96\).
```{r}
# a function to test that $\alpha=0$ against a two-sided alternative
testcoef <- function(mod) {
  stat <- mod$coefficients[2] / mod$se[2]
  abs(stat) > 1.96
}

# tests the two assumptions for a sample of size S over B Monte Carlo trials
simulation_test <- function(B, S) {
  test1 <- numeric(B)
  test2 <- numeric(B)
  for (b in 1:B) {
    d <- generate_data(0, S)
    mod <- feols(y ~ D, d)
    se1 <- mod$se[2]                                  # <- assuming iid errors
    se2 <- sqrt(vcov(mod, cluster = "school")[2, 2])  # <- clustering by school
    t1 <- mod$coefficients[2] / se1
    t2 <- mod$coefficients[2] / se2
    test1[b] <- abs(t1) > 1.96
    test2[b] <- abs(t2) > 1.96
  }
  print(paste("The mean rejection rate under the iid assumption is", mean(test1)))
  print(paste("The mean rejection rate when clustering standard errors by school is", mean(test2)))
}
```
```{r}
simulation_test(100, 100)
```

```
[1] "The mean rejection rate under the iid assumption is 0.61"
[1] "The mean rejection rate when clustering standard errors by school is 0.09"
```
Look how badly the test does when the correlation in errors is not accounted for! In this case, it drastically overstates the precision that is available from these 3000 observations. The size of the test based on clustered standard errors won't always be exactly right either, but it becomes increasingly accurate as \(S\rightarrow\infty\).
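To see that convergence, one could re-run the Monte Carlo with more schools and watch the clustered rejection rate settle near 5% (this call is an illustrative addition and takes a while, since it fits the model B times):

```{r}
# illustrative: with more clusters, the clustered test's size should approach 0.05
simulation_test(100, 1000)
```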