library(dplyr)
library(broom)
library(sandwich)
library(lmtest)
set.seed(100)
n <- 100000
baseline_effect <- 1
hte_factor <- 1
# Generate X with whales
X <- rlnorm(n, meanlog = 2, sdlog = 1)
# Simulate imbalance
df <- data.frame(X = X) %>%
arrange(desc(X)) %>%
mutate(
# Create a propensity vector where the first half has high propensity
# but the second half has low propensity. Since the data is sorted this
# corresponds to a correlation with X.
prob_t = c(rep(0.6, n/2), rep(0.4, n/2)),
treated = rbinom(n, 1, prob_t),
noise = rnorm(n, 0, 15)
) %>%
mutate(
effect_if_treated = rnorm(n, baseline_effect, 1) + hte_factor * X,
y = noise + rnorm(n, 1, 1)*X + effect_if_treated * treated
)
CUPED is a well-known technique for greatly increasing statistical power in a controlled experiment. The CUPED formulation has similarities to the OLS model
\[y = \beta_0 + T\beta_1 + (y_{\text{pre}}-1_n \bar{y}_\text{pre, c}) \beta_2 + \varepsilon\]
This blog post describes how the CUPED formulation comes out of the Frisch–Waugh–Lovell (FWL) theorem for ordinary least squares. This is my favorite way to derive the CUPED estimator, and the exercise helps build a deeper appreciation for the algebra of variance reduction. I particularly like this formulation because FWL is very similar to the formulation of other modern causal ML techniques for variance reduction, like the R-learner, MLRate, and Double ML.
In this model, variation in \(y\) is explained by the previous values contained in \(y_{\text{pre}}\). For example, if someone watched a lot of movies during the pretreatment window, they are likely to watch a lot of movies during the posttreatment window too. Their particular \(y\) values are explainable, and we can reduce the residual variance. This increases power.
While there are three parameters in the model, only two are meaningful. \(\hat{\beta}_1\) represents the average treatment effect, and \(\hat{\beta}_0\) represents the baseline value for the average user in the control, so \(\hat{\beta}_0 = E[y | T = 0, y_{\text{pre}} = \bar{y}_\text{pre, c}]\). While \(\beta_2\) is useful for decreasing variance, it is not useful for inference: it plays no role in computing the treatment effect. \(\beta_2\) is referred to as a nuisance parameter.
The CUPED formulation has a special computing strategy, making it not only great for statistical power, but also easy to implement.
The FWL Theorem
If a regression can be partitioned into variables of interest, \(T\), and nuisance variables, \(X\), then inference on \(T\) can be done extremely efficiently.
\[y = \beta_0 + T\beta_1 + X\beta_2 + \varepsilon\]
1. Regress \(y\) on \([1, X]\). Then get the residuals, \(\tilde{y}\).
2. Regress \(T\) on \([1, X]\). Then get the residuals, \(\tilde{T}\).
3. Regress \(\tilde{y}\) on \(\tilde{T}\). The estimated slope from this regression is exactly \(\hat{\beta}_1\) from the original regression.
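The three steps can be checked numerically. Below is a minimal sketch on simulated data (all variable names are illustrative, not from the post):

```r
# Sketch: verify FWL on simulated data.
set.seed(1)
n <- 1000
X <- rnorm(n)
Tr <- rbinom(n, 1, 0.5)
y <- 1 + 2 * Tr + 3 * X + rnorm(n)
y_tilde <- resid(lm(y ~ X))   # step 1: residualize y on [1, X]
T_tilde <- resid(lm(Tr ~ X))  # step 2: residualize T on [1, X]
b1_fwl <- coef(lm(y_tilde ~ T_tilde))[2]  # step 3
b1_ols <- coef(lm(y ~ Tr + X))[2]         # coefficient from the full regression
all.equal(unname(b1_fwl), unname(b1_ols)) # TRUE
```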
Step 1
The regression \(y\) on \([1, X]\) is a simple least squares regression with an intercept, e.g. \(y = \gamma_{y, 0} + X\gamma_{y, 1} + \nu\). The solution for \(\gamma\), and the residuals \(\tilde{y}\) are simply
\[\begin{align} \hat{\gamma}_{y, 1} &= \frac{Cov(X, y)}{Var(X)} \\ \hat{\gamma}_{y, 0} &= \bar{y} - \hat{\gamma}_{y, 1} \bar{X} \\ \tilde{y} &= y - (\hat{\gamma}_{y, 0} + X \hat{\gamma}_{y, 1}) \end{align}\]
Step 2
The regression of \(T\) on \([1, X]\) can arguably be a no-op. Under perfect randomization, \(T\) is uncorrelated with everything, so the slope of this regression must be 0. The intercept is just the proportion of users in the treatment. This elegant subtlety is leveraged in CUPED.
\[\begin{align} \hat{\gamma}_{T, 1} &= 0 \\ \hat{\gamma}_{T, 0} &= \bar{T} \\ \tilde{T} &= T - \bar{T} \end{align}\]
However, the fully general OLS solution would say
\[\begin{align} \hat{\gamma}_{T, 1} &= \frac{Cov(X, T)}{Var(X)} \\ \hat{\gamma}_{T, 0} &= \bar{T} - \hat{\gamma}_{T, 1} \bar{X} \\ \tilde{T} &= T - (\hat{\gamma}_{T, 0} + X \hat{\gamma}_{T, 1}) \end{align}\]
Step 3
Now we regress \(\tilde{y}\) on \(\tilde{T}\) to get the treatment effect.
\[\hat{\beta}_1 = \frac{Cov(\tilde{y}, \tilde{T})}{Var(\tilde{T})}\]
The numerator can be consolidated. \[\begin{align} Cov(\tilde{y}, \tilde{T}) &= Cov(y - (\hat{\gamma}_{y, 0} + X \hat{\gamma}_{y, 1}), T - (\hat{\gamma}_{T, 0} + X \hat{\gamma}_{T, 1})) \\ &= Cov(y, T) - \hat{\gamma}_{T, 1} Cov(y, X) - \hat{\gamma}_{y, 1} Cov(X, T) + \hat{\gamma}_{y, 1} \hat{\gamma}_{T, 1} Var(X) \end{align}\]
The second and fourth terms cancel out because \(\hat{\gamma}_{y, 1} = \frac{Cov(X, y)}{Var(X)}\), leaving \[Cov(\tilde{y}, \tilde{T}) = Cov(y, T) - \hat{\gamma}_{y, 1} Cov(X, T)\]
The denominator is also consolidated \[\begin{align} Var(\tilde{T}) &= Var(T - (\hat{\gamma}_{T, 0} + X \hat{\gamma}_{T, 1})) \\ &= Var(T) - 2\hat{\gamma}_{T, 1} Cov(T, X) + \hat{\gamma}_{T, 1}^2 Var(X) \end{align}\]
Note \(\hat{\gamma}_{T, 1} = \frac{Cov(T, X)}{Var(X)}\), which gives the simplification \[Var(\tilde{T}) = Var(T) - \hat{\gamma}_{T, 1} Cov(T, X)\] This yields
\[\begin{align} \hat{\gamma}_{y, 1} &= \frac{Cov(X, y)}{Var(X)} \\ \hat{\gamma}_{T, 1} &= \frac{Cov(X, T)}{Var(X)} \\ \hat{\beta}_1 &= \frac{Cov(y, T) - \hat{\gamma}_{y, 1} Cov(X, T)}{Var(T) - \hat{\gamma}_{T, 1} Cov(X, T)} \end{align}\]
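As a sanity check, this closed form can be computed from sample moments alone and compared against lm() (a sketch on simulated data; variable names are illustrative):

```r
# Sketch: beta_1 recovered from second-order aggregates only.
set.seed(2)
n <- 1000
X <- rnorm(n)
Tr <- rbinom(n, 1, 0.5)
y <- 1 + 2 * Tr + 3 * X + rnorm(n)
g_y1 <- cov(X, y) / var(X)    # slope of y on X
g_T1 <- cov(X, Tr) / var(X)   # slope of T on X
b1_aggs <- (cov(y, Tr) - g_y1 * cov(X, Tr)) / (var(Tr) - g_T1 * cov(X, Tr))
b1_ols <- unname(coef(lm(y ~ Tr + X))[2])
all.equal(b1_ols, b1_aggs) # TRUE
```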
In the same spirit as sufficient statistics, this compute strategy means the average treatment effect, controlling for a continuous covariate, can be estimated entirely from aggregated data.
Estimating \(\hat{\beta}_2\)
Our original linear model is
\[y = \beta_0 + T\beta_1 + X\beta_2 + \varepsilon.\]
Thinking of \(\beta_2\) as a nuisance parameter, we partialed it out using FWL. However, there is a way to compute \((\hat{\beta}_1, \hat{\beta}_2)\) simultaneously, which will be useful for computing robust standard errors below.
Consider the demeaned version of the problem, where there is no intercept.
\[y = T\beta_1 + X\beta_2 + \varepsilon.\]
Take the covariance with respect to \(T\) on both sides to get
\[\begin{align} Cov(y, T) &= Cov(T\beta_1 + X\beta_2 + \varepsilon, T) \\ &= \beta_1 Var(T) + \beta_2 Cov(X, T) + Cov(\varepsilon, T) \\ &= \beta_1 Var(T) + \beta_2 Cov(X, T) \end{align}\]
where, at the least squares solution, \(Cov(\varepsilon, T) = 0\). Similarly
\[\begin{align} Cov(y, X) &= Cov(T\beta_1 + X\beta_2 + \varepsilon, X) \\ &= \beta_1 Cov(T, X) + \beta_2 Var(X) + Cov(\varepsilon, X) \\ &= \beta_1 Cov(T, X) + \beta_2 Var(X) \end{align}\]
These two covariance identities form a system relating \((\beta_1, \beta_2)\)
\[ \begin{bmatrix} Var(T) & Cov(T, X) \\ Cov(T, X) & Var(X) \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} = \begin{bmatrix} Cov(y, T) \\ Cov(y, X) \end{bmatrix} \]
This is a tiny \(2 \times 2\) system. Using Cramer’s Rule, we write a solution for the coefficients in closed form. First, we define the determinant of the covariance matrix. Then the estimates are
\[\begin{align} D &= Var(T)Var(X) - Cov(T, X)^2 \\ \hat{\beta}_1 &= \frac{Var(X)Cov(y, T) - Cov(T, X)Cov(y, X)}{D} \\ \hat{\beta}_2 &= \frac{Var(T)Cov(y, X) - Cov(T, X)Cov(y, T)}{D} \end{align}\]
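The Cramer's Rule solution can also be checked directly against lm() (illustrative simulated data):

```r
# Sketch: Cramer's rule on the 2x2 covariance system.
set.seed(3)
n <- 1000
X <- rnorm(n)
Tr <- rbinom(n, 1, 0.5)
y <- 1 + 2 * Tr + 3 * X + rnorm(n)
D  <- var(Tr) * var(X) - cov(Tr, X)^2                          # determinant
b1 <- (var(X) * cov(y, Tr) - cov(Tr, X) * cov(y, X)) / D       # ATE
b2 <- (var(Tr) * cov(y, X) - cov(Tr, X) * cov(y, Tr)) / D      # covariate slope
fit <- unname(coef(lm(y ~ Tr + X)))
all.equal(fit[2:3], c(b1, b2)) # TRUE
```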
Heteroskedastic Consistent Covariances (Robust Standard Errors)
The above section covers how to fit point estimates. Below we will discuss the variance of the effect, approximations, and how to estimate the variance using only aggregates of data.
Normally in OLS we want to fit the parameters with robust standard errors. The idea is that not only is the expectation of \(y\) a linear combination of the features and the parameters, the variance of \(y\) is also allowed to vary across observations. We see this already in the t test, though the t test does not make use of \(X\). When computing the standard error for the average effect, a t test calculates
\[Var(\widehat{ATE}) = \hat{\sigma}^2_T / n_T + \hat{\sigma}^2_C / n_C\] which fits separate variance terms for the treatment and the control. In an OLS model, the t test is the same as fitting \(y = \beta_0 + T\beta_1 + \varepsilon\) with HC0 robust standard errors. When there is only 1 feature, we are saying the variance of \(y\) is allowed to vary, but it simply varies across the \(T = 0\) group and the \(T = 1\) group. The HC0 robust standard error in this scenario could have also been called a “group robust standard error”.
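This equivalence between the t test and HC0 with a lone treatment dummy can be verified numerically (a sketch; note the within-group variances use the \(1/n_g\) convention, matching HC0):

```r
# Sketch: HC0 for y ~ 1 + T equals the t-test variance with 1/n group variances.
set.seed(4)
n <- 500
Tr <- rbinom(n, 1, 0.5)
y <- 1 + 2 * Tr + rnorm(n, sd = 2)
fit <- lm(y ~ Tr)
M <- model.matrix(fit)                 # columns [1, T]
e <- resid(fit)
bread <- solve(crossprod(M))           # (M'M)^{-1}
meat  <- t(M) %*% (M * e^2)            # sum_i m_i m_i' e_i^2
V_hc0 <- bread %*% meat %*% bread
mle_var <- function(v) mean((v - mean(v))^2)  # variance with 1/n denominator
v_ttest <- mle_var(y[Tr == 1]) / sum(Tr == 1) + mle_var(y[Tr == 0]) / sum(Tr == 0)
all.equal(unname(V_hc0[2, 2]), v_ttest) # TRUE
```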
As we introduce \(X\) into the regression, the OLS formulation would say the variance of each \(y_i\) is different and is a function of the \(x_i\). This is formalized through the sandwich estimator.
\[\begin{align} M &= \begin{bmatrix} 1 & T & X \end{bmatrix} \\ Var(\hat{\beta}) &= (M^T M)^{-1} \left( \sum_{i=1}^{n} m_i m_i^T \hat{\varepsilon}_i^2 \right) (M^T M)^{-1} \end{align}\]
After running FWL steps 1-3, we can invoke the sandwich estimator using the residuals \(\tilde{y}\) and \(\tilde{T}\).
Compute Strategy with 4th order moments
The point estimates for the parameters can be derived from just means, variances, and covariances. This applies to both FWL and CUPED. To compute the traditional sandwich, however, we either need the individual-level data or fourth-order aggregates. The bread matrix can be computed from second-order aggregates, the same ones used to get the point estimates. The meat matrix is much more demanding: the residual \(\hat{\varepsilon}_i\) contains \(x_i\) and \(m_i m_i^T\) contains \(x_i^2\), so \(m_i m_i^T \hat{\varepsilon}_i^2\) contains terms up to \(x_i^4\). If we are able to store the fourth-order moments, we can completely recover the HC0 covariance matrix, but this is a hassle. If we limit ourselves to the best approximation to the standard error using just second-order aggregates, we arrive at the group robust standard error.
Heteroskedastic Consistent (by groups) with only 2nd order moments
Finding an estimator that only uses second-order moments would be very convenient. The best estimator operating within these constraints is the HC-by-group sandwich estimator. Similar to how the t test allowed variability across the treatment and control groups, we allow the meat matrix to vary by group; we do not allow it to vary by individual \(m_i\) and \(\hat{\varepsilon}_i\). The formulation would be \[\widehat{\text{Var}}(\hat{\beta}) = (M^T M)^{-1} \left( \sum_g \sigma_g^2 (M_g^T M_g) \right) (M^T M)^{-1}\] where \(M_g\) contains the rows of \(M\) belonging to group \(g\).
The residual variance for a group \(g\) is \(\sigma^2_g\), and is defined \[Var(\varepsilon_{i, g}) = Var(y_{i,g} - (\hat{\beta}_0 + \hat{\beta}_1 T_i + \hat{\beta}_2 y_{i, \text{pre}}))\]
If \(g\) is defined by treatment or control, then \(\hat{\beta}_0\) and \(\hat{\beta}_1 T_i\) are constants within group \(g\). Thus, the group variance is simply \[\sigma^2_g = Var(y_g - \hat{\beta}_2 y_{\text{pre}, g}) = Var(y_g) + \hat{\beta}_2^2 Var(y_{\text{pre}, g}) - 2\hat{\beta}_2 Cov(y_g, y_{\text{pre}, g})\]
The entire covariance matrix for the parameters can be derived from aggregated data. Remember, this is the best approximation possible to the sandwich when only 2nd order information is available.
CUPED is a special case. But it is not OLS.
CUPED leverages perfectly randomized data, so \(Cov(X, T) = 0\). This effectively makes step 2 of FWL a no-op. Now we will rewrite the solution for \(\hat{\beta}_1\) in a clever way, based on our FWL solution.
\[\begin{align} \hat{\beta}_1 &= \frac{Cov(y, T) - \hat{\gamma}_{y, 1} Cov(X, T)}{Var(T)} \\ &= \frac{Cov(y, T)}{Var(T)} - \hat{\gamma}_{y, 1} \frac{Cov(X, T)}{Var(T)} \end{align}\] which looks like the difference between 2 different OLS solutions: \(y\) regressed on \(T\) and \(X\) regressed on \(T\). Knowing \(T\) is binary, each of these OLS solutions is simply the difference in group means:
\[\begin{align} \hat{\beta}_1 &= E[y | T = 1] - E[y | T = 0] - \hat{\gamma}_{y, 1} (E[X | T = 1] - E[X | T = 0]) \\ &= \biggl(E[y | T = 1] - \hat{\gamma}_{y, 1} E[X | T = 1] \biggr) - \biggl(E[y | T = 0] - \hat{\gamma}_{y, 1} E[X | T = 0] \biggr) \end{align}\]
This is exactly the CUPED formulation! It is a difference in means that adjusts for a difference in the pretreatment baseline. Or, it can be thought of as the difference in the means of the adjusted \(y\) variable. CUPED is an output of FWL when step 2 is skipped. But this requires \(Cov(X, T) = 0\).
Since CUPED is a difference in the means of the adjusted \(y\), its variance looks just like the variance of a t test. \[Var(\widehat{ATE}) = \frac{Var(y_T - \hat{\gamma}_{y, 1}x_T)}{n_T} + \frac{Var(y_C - \hat{\gamma}_{y, 1} x_C)}{n_C}.\]
Major Differences with OLS
There are three major differences between CUPED and OLS:
1. They produce different point estimates for the ATE. CUPED hyper-optimizes for the fact that \(Cov(T, X) = 0\), pushing the estimator closer to its asymptotic properties.
2. CUPED does not produce sandwich variances.
3. CUPED is more vulnerable to imperfect experiment environments, such as pretreatment biases.
Regarding 2: ironically, the above formulation of the variance equals neither the HC variances nor the group robust variances. It simply weights the group residual variances by their sample sizes, while the group robust variance weights them by \(M_g^T M_g\). The latter is a more desirable property when the data has whales; the \(M_g^T M_g\) term makes the variance larger for whales that have high leverage. Under CUPED, \(\hat{\gamma}_{y, 1}\) is assumed to have no estimation variance of its own; it is treated as a fixed constant. Moreover, that \(\hat{\gamma}_{y, 1}\) is derived under the assumption \(Cov(X, T) = 0\).
Regarding 3: CUPED and OLS (via 2nd order aggs) differ in cases of pretreatment bias. Say the true linear model has a heterogeneous effect, so \[y = \beta_0 + T\beta_1 + X\beta_2 + (T \cdot X)\beta_3 + \varepsilon\] Also say there is a pretreatment imbalance: there are more whales in the treatment, and these whales have large values of \(X\), and hence large treatment effects. Say we only estimate the smaller model via CUPED \[y = \beta_0 + T\beta_1 + X\beta_2 + \varepsilon\] This model is missing the \((T \cdot X)\beta_3\) term. CUPED's \(\beta_2\) value, which is not estimated jointly with \(T\), will absorb some of \(\beta_3\). As a result, when we subtract \(y - \beta_2 X\) to get the adjusted \(y\), the treatment group will be overcorrected. This creates bias in the treatment effect.
On the other hand, in this heterogeneous treatment effect setting, OLS is less biased. When we estimate \(\beta_2\), we do this jointly together with \(\beta_1\). This prevents \(\beta_2\) from absorbing some of the heterogeneous effect. This third difference between CUPED and OLS is hard to see, so we demonstrate with an example.
Below we simulate data that has a heterogeneous effect along the \(X\) variable. There are some whales in the data that have large \(X\). Also, the treatment group has more whales in it.
In this example, no sample ratio mismatch (SRM) is detected, but a pretreatment bias is.
df$treated %>% table() %>% chisq.test(p = c(.5, .5))
Chi-squared test for given probabilities
data: .
X-squared = 3.0692, df = 1, p-value = 0.07979
# Demonstrate pretreatment imbalance
lm(X ~ treated, df) %>% summary() %>% tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 10.5 0.0698 150. 0
2 treated 3.38 0.0984 34.3 4.23e-256
Now we move to the model. The best model would include the interaction term, but we will benchmark CUPED and OLS with main effects only.
lm(y ~ treated * X, data = df) %>% summary() %>% tidy()
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 0.0588 0.141 0.418 0.676
2 treated 0.990 0.203 4.89 0.00000102
3 X 1.00 0.00796 126. 0
4 treated:X 0.995 0.0104 95.6 0
The CUPED model produces an ATE that is too small, as well as an SE that is artificially small. The ATE should be larger, but CUPED's estimate for the \(X\) coefficient never sees the \(T\) variable, and hence absorbs part of the treatment effect.
# CUPED model
theta <- cov(df$y, df$X) / var(df$X)
df <- df %>%
mutate(y_cuped = y - theta * X)
cuped_res <- t.test(y_cuped ~ treated, data = df)
cuped_ate <- diff(cuped_res$estimate)
cuped_se <- cuped_res$stderr
# OLS results
ols_model <- lm(y ~ treated + X, data = df)
ols_res <- coeftest(ols_model, vcov = vcovHC(ols_model, type = "HC0"))
ols_ate <- ols_res["treated", "Estimate"]
ols_se <- ols_res["treated", "Std. Error"]
# Comparison
cat("--- Simulation Results ---\n")
cat(sprintf("True Population Avg Effect: %.4f\n", mean(df$effect_if_treated)))
cat(sprintf("CUPED ATE: %.4f (SE: %.4f)\n", cuped_ate, cuped_se))
cat(sprintf("OLS ATE (Group-Robust): %.4f (SE: %.4f)\n", ols_ate, ols_se))
--- Simulation Results ---
True Population Avg Effect: 13.1620
CUPED ATE: 12.6496 (SE: 0.1667)
OLS ATE (Group-Robust): 12.7983 (SE: 0.2044)
For Relative Effects
The relative effect is \[\hat{R} = \frac{\hat{\beta}_1}{E[y | T = 0, y_\text{pre} = \bar{y}_\text{pre}]} = \frac{\hat{\beta}_1}{\hat{\beta}_0 + \hat{\beta}_2 \bar{y}_\text{pre}} = \frac{\hat{\beta}_1}{\hat{\mu}_C}\]
Since we were able to derive the point estimates of the parameter vector \(\hat{\beta}\), the point estimate of the relative effect is simple. Now onto its variance.
Using a trick from Delta Vectors, the denominator can be written as a product with the parameter vector. Then its variance is simply
\[\begin{align} \hat{\mu}_C &= \begin{bmatrix} 1 & 0 & \bar{y}_\text{pre} \end{bmatrix} \hat{\beta} \\ Var(\hat{\mu}_C) &= \begin{bmatrix} 1 & 0 & \bar{y}_\text{pre} \end{bmatrix} Cov(\hat{\beta}) \begin{bmatrix} 1 \\ 0 \\ \bar{y}_\text{pre} \end{bmatrix} \end{align}\]
The standard error for \(\hat{R}\) is computed using the delta method. In general, when the relative effect is expressed as a ratio of two random variables, \(R = \frac{N}{D}\), the variance of the estimator is approximately \[Var\left( \frac{\hat{N}}{\hat{D}} \right) \approx \frac{Var(\hat{N})}{\mu_D^2} + \frac{\mu_N^2 Var(\hat{D})}{\mu_D^4} - \frac{2 \mu_N Cov(\hat{N}, \hat{D})}{\mu_D^3}\]
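A small helper makes the ratio formula concrete (a sketch; the function name is mine, not from any library):

```r
# Sketch: delta-method variance for a ratio R = N / D.
delta_ratio_var <- function(mu_n, mu_d, var_n, var_d, cov_nd) {
  var_n / mu_d^2 + mu_n^2 * var_d / mu_d^4 - 2 * mu_n * cov_nd / mu_d^3
}
# Made-up moments: N ~ (mean 2, var 0.1), D ~ (mean 4, var 0.2), cov 0.05
delta_ratio_var(2, 4, 0.1, 0.2, 0.05) # 0.00625
```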
The final variance of the relative effect uses
\[\begin{align} Var(\hat{\mu}_C) &= \begin{bmatrix} 1 & 0 & \bar{y}_\text{pre} \end{bmatrix} Cov(\hat{\beta}) \begin{bmatrix} 1 \\ 0 \\ \bar{y}_\text{pre} \end{bmatrix} \\ Cov(\hat{\beta}_1, \hat{\mu}_c) &= \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} Cov(\hat{\beta}) \begin{bmatrix} 1 \\ 0 \\ \bar{y}_\text{pre} \end{bmatrix} \\ Var(\hat{R}) &\approx \frac{Var(\hat{\beta}_1)}{\hat{\mu}_C^2} + \frac{\hat{\beta}_1^2 Var(\hat{\mu}_C)}{\hat{\mu}_C^4} - \frac{2 \hat{\beta}_1 Cov(\hat{\beta}_1, \hat{\mu}_C)}{\hat{\mu}_C^3} \end{align}\]
Leveraging CUPED Assumption
When leveraging CUPED’s assumption of perfect randomization, there is a clever rewrite. In general, the intercept of a regression is computed as \(\hat{\beta}_0 = \bar{y}_C - \hat{\beta}_2 \bar{X}_C\). Then, we can make the rewrite
\[\begin{align} \hat{\mu}_C &= E[y | T = 0, y_\text{pre} = \bar{y}_\text{pre}] \\ &= \hat{\beta}_0 + \hat{\beta}_2 \bar{y}_\text{pre} \\ &= \bar{y}_C - \hat{\beta}_2 (\bar{X}_C - \bar{X}) \\ R &= \frac{\hat{\beta}_1}{\hat{\mu}_C} \end{align}\]
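The intercept identity behind this rewrite can be verified numerically for an OLS fit with a treatment dummy (a sketch; for CUPED itself, \(\hat{\beta}_2\) would instead be the globally pooled slope):

```r
# Sketch: beta_0 + beta_2 * mean(X) equals ybar_C - beta_2 * (xbar_C - xbar).
set.seed(6)
n <- 1000
X <- rnorm(n)
Tr <- rbinom(n, 1, 0.5)
y <- 1 + 2 * Tr + 3 * X + rnorm(n)
b <- unname(coef(lm(y ~ Tr + X)))
mu_c_direct  <- b[1] + b[3] * mean(X)                              # beta_0 + beta_2 * xbar
mu_c_rewrite <- mean(y[Tr == 0]) - b[3] * (mean(X[Tr == 0]) - mean(X))
all.equal(mu_c_direct, mu_c_rewrite) # TRUE
```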
For the variance,
\[\begin{align} Var(\hat{\mu}_C) &= Var(\bar{y}_C) + Var(\hat{\beta}_2 (\bar{X}_C - \bar{X})) - 2Cov(\bar{y}_C, \hat{\beta}_2 (\bar{X}_C - \bar{X})) \\ &= \sigma^2_C / n_C \end{align}\] Treating \(\hat{\beta}_2\) and \(\bar{X}\) as fixed constants, the three terms collapse into the variance of the mean adjusted outcome in the control, where \(\sigma^2_C = Var(y_C - \hat{\beta}_2 X_C)\). The final variance of the relative effect uses
\[\begin{align} Var(\hat{\mu}_C) &= \sigma^2_C / n_C \\ Cov(\hat{\beta}_1, \hat{\mu}_C) &= -\sigma^2_C / n_C \\ Var(\hat{R}) &\approx \frac{Var(\hat{\beta}_1)}{\hat{\mu}_C^2} + \frac{\hat{\beta}_1^2 Var(\hat{\mu}_C)}{\hat{\mu}_C^4} - \frac{2 \hat{\beta}_1 Cov(\hat{\beta}_1, \hat{\mu}_C)}{\hat{\mu}_C^3} \end{align}\]
Code
In this section we'll demonstrate two examples. We will replicate the difficult data generator above, with whales and pretreatment bias. In addition, we will run an easier data generator that follows naive assumptions.
Below are the implementations for estimating the linear model via OLS, OLS with 2nd order aggregates, and CUPED.
ols = function(data) {
model = lm(y ~ treated + x, data)
covariances = vcovHC(model, 'HC0')
# ATE
ate = coef(model)[2]
var_ate = covariances[2,2]
se_ate = sqrt(var_ate)
# Relative ATE
mean_x_c = data %>% filter(treated == 0) %>% pull(x) %>% mean
mean_x = data %>% pull(x) %>% mean
baseline = coef(model)[1] + coef(model)[3] * mean_x
rel_effect = ate / baseline
var_baseline = matrix(c(1, 0, mean_x), nrow = 1) %*% covariances %*% c(1, 0, mean_x)
cov_ate_baseline = matrix(c(0, 1, 0), nrow = 1) %*% covariances %*% c(1, 0, mean_x)
# Delta Method
var_rel_effect = var_ate/baseline^2 + ate^2 * var_baseline/baseline^4 - 2*ate*cov_ate_baseline/baseline^3
se_rel_effect = sqrt(var_rel_effect)
return(list(
method = 'OLS',
ate = as.numeric(ate),
se_ate = as.numeric(se_ate),
relative_effect = as.numeric(rel_effect),
se_relative = as.numeric(se_rel_effect)
))
}
ols_from_2nd_order_aggs <- function(n_treatment, sum_y_treatment, sum_x_treatment, sum_y2_treatment, sum_x2_treatment, sum_y_x_treatment,
n_control, sum_y_control, sum_x_control, sum_y2_control, sum_x2_control, sum_y_x_control) {
n <- n_treatment + n_control
sum_y <- sum_y_treatment + sum_y_control
sum_x <- sum_x_treatment + sum_x_control
sum_x2 <- sum_x2_treatment + sum_x2_control
sum_y_x <- sum_y_x_treatment + sum_y_x_control
XtX <- matrix(c(
n, n_treatment, sum_x,
n_treatment, n_treatment, sum_x_treatment,
sum_x, sum_x_treatment, sum_x2
), nrow = 3, byrow = TRUE)
XtX_inv <- solve(XtX)
Xty <- c(sum_y, sum_y_treatment, sum_y_x)
beta <- XtX_inv %*% Xty
ate <- beta[2]
slope_x <- beta[3]
# Group Robust Covariances
residual_variance_for_group <- function(n_g, s_y, s_x, s_y2, s_x2, s_yx, b_x) {
m_y <- s_y / n_g
m_x <- s_x / n_g
v_y <- (s_y2 / n_g) - m_y^2
v_x <- (s_x2 / n_g) - m_x^2
cov_yx <- (s_yx / n_g) - (m_y * m_x)
return(v_y + (b_x^2 * v_x) - (2 * b_x * cov_yx))
}
sigma2_treatment <- residual_variance_for_group(n_treatment, sum_y_treatment, sum_x_treatment,
sum_y2_treatment, sum_x2_treatment, sum_y_x_treatment, slope_x)
sigma2_control <- residual_variance_for_group(n_control, sum_y_control, sum_x_control,
sum_y2_control, sum_x2_control, sum_y_x_control, slope_x)
# meat matrix for treatment
XtX_treatment <- matrix(c(
n_treatment, n_treatment, sum_x_treatment,
n_treatment, n_treatment, sum_x_treatment,
sum_x_treatment, sum_x_treatment, sum_x2_treatment
), nrow = 3, byrow = TRUE)
# meat matrix for control. Many elements are zeroed out.
XtX_control <- matrix(c(
n_control, 0, sum_x_control,
0, 0, 0,
sum_x_control, 0, sum_x2_control
), nrow = 3, byrow = TRUE)
Meat <- (XtX_treatment * sigma2_treatment) + (XtX_control * sigma2_control)
covariances <- XtX_inv %*% Meat %*% XtX_inv
var_ate = covariances[2, 2]
se_ate <- sqrt(var_ate)
# Relative effects
## Point estimate
mean_x_global <- sum_x / n
mean_y_c <- sum_y_control / n_control
mean_x_c <- sum_x_control / n_control
mu_control <- mean_y_c - slope_x * (mean_x_c - mean_x_global)
rel_effect <- ate / mu_control
# Variance
var_mu_control = matrix(c(1, 0, mean_x_global), nrow = 1) %*% covariances %*% c(1, 0, mean_x_global)
cov_ate_mu_control = matrix(c(0, 1, 0), nrow = 1) %*% covariances %*% c(1, 0, mean_x_global)
# Delta Method
var_rel_effect = var_ate/mu_control^2 + ate^2 * var_mu_control/mu_control^4 - 2*ate*cov_ate_mu_control/mu_control^3
se_rel_effect = sqrt(var_rel_effect)
return(list(
method = 'OLS with 2nd order aggs',
ate = as.numeric(ate),
se_ate = as.numeric(se_ate),
relative_effect = as.numeric(rel_effect),
se_relative = as.numeric(se_rel_effect)
))
}
ols_from_2nd_order = function(data) {
aggs = data %>%
group_by(treated) %>%
summarise(
n = n(),
sum_y = sum(y),
sum_y2 = sum(y^2),
sum_x = sum(x),
sum_x2 = sum(x^2),
sum_y_x = sum(y * x)
)
treatment_aggs = aggs %>%
filter(treated == 1)
control_aggs = aggs %>%
filter(treated == 0)
n_treatment = treatment_aggs %>% pull(n)
sum_y_treatment = treatment_aggs %>% pull(sum_y)
sum_x_treatment = treatment_aggs %>% pull(sum_x)
sum_y2_treatment = treatment_aggs %>% pull(sum_y2)
sum_x2_treatment = treatment_aggs %>% pull(sum_x2)
sum_y_x_treatment = treatment_aggs %>% pull(sum_y_x)
n_control = control_aggs %>% pull(n)
sum_y_control = control_aggs %>% pull(sum_y)
sum_x_control = control_aggs %>% pull(sum_x)
sum_y2_control = control_aggs %>% pull(sum_y2)
sum_x2_control = control_aggs %>% pull(sum_x2)
sum_y_x_control = control_aggs %>% pull(sum_y_x)
return(ols_from_2nd_order_aggs(
n_treatment = n_treatment,
sum_y_treatment = sum_y_treatment,
sum_x_treatment = sum_x_treatment,
sum_y2_treatment = sum_y2_treatment,
sum_x2_treatment = sum_x2_treatment,
sum_y_x_treatment = sum_y_x_treatment,
n_control = n_control,
sum_y_control = sum_y_control,
sum_x_control = sum_x_control,
sum_y2_control = sum_y2_control,
sum_x2_control = sum_x2_control,
sum_y_x_control = sum_y_x_control
))
}
cuped <- function(data) {
# 1. Calculate the CUPED slope (theta) using the pooled covariance
# CUPED typically assumes Cov(X, T) = 0, so it calculates theta globally
theta <- cov(data$y, data$x) / var(data$x)
# 2. Generate the adjusted outcome
# Y_cuped = Y - theta * (X - mean(X))
data$y_cuped <- data$y - theta * (data$x - mean(data$x))
# 3. Calculate Group Statistics
stats <- data %>%
group_by(treated) %>%
summarise(
n = n(),
mean_y = mean(y_cuped),
var_y = var(y_cuped),
.groups = 'drop'
)
n_t <- stats$n[stats$treated == 1]
n_c <- stats$n[stats$treated == 0]
mu_t <- stats$mean_y[stats$treated == 1]
mu_c <- stats$mean_y[stats$treated == 0]
v_t <- stats$var_y[stats$treated == 1]
v_c <- stats$var_y[stats$treated == 0]
# 4. Average Treatment Effect (ATE)
ate <- mu_t - mu_c
se_ate <- sqrt((v_t / n_t) + (v_c / n_c))
# 5. Relative Effect (Lift)
relative_effect <- ate / mu_c
# 6. Delta Method for Relative SE
# Var(N/D) approx Var(N)/D^2 + N^2*Var(D)/D^4 - 2*N*Cov(N,D)/D^3
# Here: N = mu_t - mu_c, D = mu_c
# Cov(mu_t - mu_c, mu_c) = -Var(mu_c) = -v_c / n_c
var_n <- (v_t / n_t) + (v_c / n_c)
var_d <- v_c / n_c
cov_nd <- -(v_c / n_c)
se_rel <- sqrt(
(var_n / mu_c^2) +
(ate^2 * var_d / mu_c^4) -
(2 * ate * cov_nd / mu_c^3)
)
return(list(
method = "CUPED",
ate = ate,
se_ate = se_ate,
relative_effect = relative_effect,
se_relative = se_rel
))
}
Hard Scenario
Data Generator
set.seed(100)
n <- 100000
baseline_effect <- 1
hte_factor <- 1
# Generate X with whales
x <- rlnorm(n, meanlog = 2, sdlog = 1)
# Simulate imbalance
data <- data.frame(x = x) %>%
arrange(desc(x)) %>%
mutate(
# Create a propensity vector where the first half has high propensity
# but the second half has low propensity. Since the data is sorted this
# corresponds to a correlation with X.
prob_t = c(rep(0.6, n/2), rep(0.4, n/2)),
treated = rbinom(n, 1, prob_t),
noise = rnorm(n, 0, 15)
) %>%
mutate(
effect_if_treated = rnorm(n, baseline_effect, 1) + hte_factor * x,
y = noise + rnorm(n, 1, 1)*x + effect_if_treated * treated
)
Estimates
ols_solution = ols(data)
ols_2nd_order_solution = ols_from_2nd_order(data)
cuped_solution = cuped(data)
rbind(
ols_solution %>% data.frame(),
ols_2nd_order_solution %>% data.frame(),
cuped_solution %>% data.frame()
)
method ate se_ate relative_effect se_relative
1 OLS 12.79828 0.2043816 0.9680216 0.02515193
2 OLS with 2nd order aggs 12.79828 0.1676463 0.9680216 0.01924609
3 CUPED 12.64960 0.1667247 0.9513963 0.01900091
Easy Scenario
Data Generator
set.seed(100)
n = 1000
data = data.frame(
treated = sample(0:1, n, replace = TRUE),
baseline = rnorm(n, 20),
x = rnorm(n),
effect_if_treated = rnorm(n, 2, 1)
) %>%
mutate(
effect = treated * effect_if_treated,
y = baseline + effect + 0.5 * x
)
Estimates
ols_solution = ols(data)
ols_2nd_order_solution = ols_from_2nd_order(data)
cuped_solution = cuped(data)
rbind(
ols_solution %>% data.frame(),
ols_2nd_order_solution %>% data.frame(),
cuped_solution %>% data.frame()
)
method ate se_ate relative_effect se_relative
1 OLS 2.008227 0.07559402 0.1002997 0.003926712
2 OLS with 2nd order aggs 2.008227 0.07548562 0.1002997 0.003920985
3 CUPED 2.005816 0.07555123 0.1001731 0.003924205