A Better, Robust CUPED

experimentation
statistics
Author

Jeffrey Wong

Published

April 21, 2026

We previously discussed the similarities and differences between CUPED and OLS. While CUPED is a computationally efficient implementation of a linear model, it has some unexpected properties. It has a strict assumption of perfect randomization. By using OLS, we can relax that assumption to the textbook assumption of conditional exogeneity. Using OLS, we can also formulate a group robust standard error.

While OLS has some of these advantages, the form of the model still makes the ATE vulnerable. The model

\[y = \beta_0 + T \beta_1 + X \beta_2 + \varepsilon\] relies on a critical assumption: that the treatment effect \(\tau(X) = \beta_1\) is constant across all units. In reality, treatment effects are often heterogeneous. In these environments, the effect can be biased, or the regression adjustment can actually increase the variance of the effect over a regular t-test. This was Freedman’s major criticism of regression for experiments.

Lin 2013 argued that changing the form of the model can make it robust. By including the interaction term, the regression adjustment will be at least as efficient as the unadjusted difference in means, even if the model is misspecified. In this case, even if there is a pretreatment difference in \(X\)

The Model: Centering and Interaction

To guard against these edge cases, Lin proposes a fully interacted model. Crucially, we center our covariate \(X\) around the global population mean \(\bar{X}\).

The model form is: \[y = \beta_0 + \beta_1 T + \beta_2 (X - \bar{X}) + \beta_3 T \cdot (X - \bar{X}) + \varepsilon\]

Where:

  1. \(\beta_0\): The expected baseline (the counterfactual mean of the population under control).
  2. \(\beta_1\): The Average Treatment Effect (ATE).
  3. \(\beta_2\): The relationship between \(X\) and \(Y\) in the control group.
  4. \(\beta_3\): The Heterogeneous Treatment Effect (HTE)—how the effect changes as \(X\) varies.

Derivation from Second-Order Aggregates

Just as before we derive a solution using only second-order aggregates. The OLS normal equation is

\[\begin{align} M^T M &= \begin{bmatrix} n & \sum T_i & \sum (X_i - \bar{X}) & \sum T_i(X_i - \bar{X}) \\ \sum T_i & \sum T_i & \sum T_i(X_i - \bar{X}) & \sum T_i(X_i - \bar{X}) \\ \sum (X_i - \bar{X}) & \sum T_i(X_i - \bar{X}) & \sum (X_i - \bar{X})^2 & \sum T_i(X_i - \bar{X})^2 \\ \sum T_i(X_i - \bar{X}) & \sum T_i(X_i - \bar{X}) & \sum T_i(X_i - \bar{X})^2 & \sum T_i(X_i - \bar{X})^2 \end{bmatrix} \\ M^T y &= \begin{bmatrix} \sum Y_i \\ \sum T_i Y_i \\ \sum (X_i - \bar{X}) Y_i \\ \sum T_i(X_i - \bar{X}) Y_i \end{bmatrix} \\ M^T M \beta &= M^T y \end{align}\]

Code. But it can be better.

The following function computes the aggregates using dplyr, solves for the Lin estimator, and applies the Delta Method to find the standard error for the relative effect. This is how we can construct the Lin estimator using only aggregated data, like CUPED.

We also benchmark the second order approximation to the sandwich with ground truth model that is able to see all raw data. The implementation below is an intermediate implementation that represents the simplest pivot of OLS to the Lin Estimator using only aggregated data, but it will fall short. The true estimate of the sandwich will require 4th order aggregates. Using only 2nd order aggregates means our estimate of the standard error is wrong when there is strong heteroskedasticity. In the next section we will show a better approximation.

Modeling functions

#' Compute 2nd-Order Aggregates for Lin Model
#' 
#' @param df Dataframe containing raw user-level data.
#' @param x_col String. Name of the covariate column.
#' @param y_col String. Name of the outcome column.
#' @param t_col String. Name of the treatment indicator column (0/1).
#' @return A summary dataframe with group-level sums and cross-products.
compress <- function(df, x_col = "x", y_col = "y", t_col = "treated") {
  x_bar_global <- mean(df[[x_col]], na.rm = TRUE)
  
  df %>%
    rename(x = !!sym(x_col), y = !!sym(y_col), treated = !!sym(t_col)) %>%
    mutate(xc = x - x_bar_global) %>%
    group_by(treated) %>%
    summarise(
      n     = n(),
      s_xc  = sum(xc),
      s_y   = sum(y),
      s_y2  = sum(y^2),
      s_xc2 = sum(xc^2),
      s_xcy = sum(xc * y),
      .groups = "drop"
    )
}

#' Calculate Residual Sum of Squares from Aggregates

#' @param compressed_data Dataframe of aggregates from compress. 
#' @param params Numeric vector of coefficients [beta0, beta1, beta2, beta3].
#' @param group_label Integer (0 or 1). The group to calculate RSS for.
#' @return Numeric RSS for the group.
approximate_group_rss <- function(compressed_data, params, group_label) {
  g_data <- compressed_data %>% filter(treated == group_label)
  
  # Group-specific sums
  n_g   <- g_data$n
  s_y   <- g_data$s_y
  s_y2  <- g_data$s_y2
  s_xc  <- g_data$s_xc
  s_xc2 <- g_data$s_xc2
  s_xcy <- g_data$s_xcy
  
  # Reconstruct the Z'y (Target) and Z'Z (Moment) matrices for the specific group
  if(group_label == 1) {
    # Treated: All features [1, 1, xc, xc] active
    zy_sum <- c(s_y, s_y, s_xcy, s_xcy)
    zz_sum <- matrix(c(
      n_g, n_g, s_xc, s_xc, 
      n_g, n_g, s_xc, s_xc,
      s_xc, s_xc, s_xc2, s_xc2, 
      s_xc, s_xc, s_xc2, s_xc2
    ), 4)
  } else {
    # Control: Only [1, 0, xc, 0] active
    zy_sum <- c(s_y, 0, s_xcy, 0)
    zz_sum <- matrix(c(
      n_g, 0, s_xc, 0, 
      0, 0, 0, 0, 
      s_xc, 0, s_xc2, 0, 
      0, 0, 0, 0
    ), 4)
  }
  
  # RSS = sum(y^2) - 2*beta'*(Z'y) + beta'*(Z'Z)*beta
  as.numeric(s_y2 - 2 * (t(params) %*% zy_sum) + (t(params) %*% zz_sum %*% params))
}

#' Fit Lin Model and Compute Robust Effects
#' 
#' @param compressed_data Dataframe of aggregates.
#' @return A list of estimates and robust standard errors.
fit_lin_model <- function(compressed_data) {
  t_compressed_data <- compressed_data %>% filter(treated == 1)
  c_compressed_data <- compressed_data %>% filter(treated == 0)
  n_tot   <- sum(compressed_data$n)
  
  MtM <- matrix(c(
    n_tot, t_compressed_data$n, (t_compressed_data$s_xc + c_compressed_data$s_xc), t_compressed_data$s_xc,
    t_compressed_data$n, t_compressed_data$n, t_compressed_data$s_xc, t_compressed_data$s_xc,
    (t_compressed_data$s_xc + c_compressed_data$s_xc), t_compressed_data$s_xc, (t_compressed_data$s_xc2 + c_compressed_data$s_xc2), t_compressed_data$s_xc2,
    t_compressed_data$s_xc, t_compressed_data$s_xc, t_compressed_data$s_xc2, t_compressed_data$s_xc2
  ), nrow = 4, byrow = TRUE)
  MtM_inv <- solve(MtM)
  Mty <- c(sum(compressed_data$s_y), t_compressed_data$s_y, sum(compressed_data$s_xcy), t_compressed_data$s_xcy)
  beta <- MtM_inv %*% Mty
  
  # Meat matrix and covariances
  sig0 <- approximate_group_rss(compressed_data, beta, 0) / (c_compressed_data$n - 2)
  sig1 <- approximate_group_rss(compressed_data, beta, 1) / (t_compressed_data$n - 2)
  Meat_C <- sig0 * matrix(c(
    c_compressed_data$n, 0, c_compressed_data$s_xc, 0, 
    0, 0, 0, 0, 
    c_compressed_data$s_xc, 0, c_compressed_data$s_xc2, 0,
    0, 0, 0, 0
  ), 4)
  Meat_T <- sig1 * matrix(c(
    t_compressed_data$n, t_compressed_data$n, t_compressed_data$s_xc, t_compressed_data$s_xc, 
    t_compressed_data$n, t_compressed_data$n, t_compressed_data$s_xc, t_compressed_data$s_xc, 
    t_compressed_data$s_xc, t_compressed_data$s_xc, t_compressed_data$s_xc2, t_compressed_data$s_xc2, 
    t_compressed_data$s_xc, t_compressed_data$s_xc, t_compressed_data$s_xc2, t_compressed_data$s_xc2
  ), 4)
  covariances <- MtM_inv %*% (Meat_C + Meat_T) %*% MtM_inv
  
  # 5. Delta Method for Relative Effect: g(beta0, beta1) = beta1 / beta0
  # Gradient: [-beta1/beta0^2, 1/beta0]
  beta0 <- beta[1]
  beta1   <- beta[2]
  grad  <- c(-beta1 / (beta0^2), 1 / beta0)
  rel_var <- t(grad) %*% covariances[1:2, 1:2] %*% grad
  
  list(
    ate               = beta1,
    ate_se            = sqrt(covariances[2, 2]),
    hte_slope         = beta[4],
    hte_se            = sqrt(covariances[4, 4]),
    expected_baseline = beta0,
    relative_lift     = beta1 / beta0,
    relative_lift_se  = sqrt(as.numeric(rel_var))
  )
}
library(dplyr)
library(sandwich)

#' Benchmark Lin Estimator using lm() and sandwich
#' 
#' @param df Dataframe containing the raw user-level data.
#' @param x_col String. Name of the covariate.
#' @param y_col String. Name of the outcome.
#' @param t_col String. Name of the treatment indicator.
#' @return A list of estimates and HC standard errors.
ground_truth <- function(df, x_col = "x", y_col = "y", t_col = "treated") {
  # 1. Global Centering
  # This is crucial: centering at the global mean ensures 
  # the 'treated' coefficient is the population ATE.
  x_bar <- mean(df[[x_col]], na.rm = TRUE)
  df$xc <- df[[x_col]] - x_bar
  
  # 2. Fit the Interacted Model
  # Y ~ T + (X - x_bar) + T:(X - x_bar)
  # Intercept = Alpha, treated = Tau (ATE), xc = Beta, treated:xc = Gamma (HTE)
  formula <- as.formula(paste(y_col, "~", t_col, "* xc"))
  fit <- lm(formula, data = df)
  
  # 3. Robust Covariance Matrix (HC1 is standard)
  # This provides the "Sandwich" estimator (M^-1 %*% Sigma %*% M^-1)
  v_cov_robust <- vcovHC(fit, type = "HC0")
  se_robust    <- sqrt(diag(v_cov_robust))
  
  # 4. Extract Parameters
  # beta[1]: Intercept (Alpha)
  # beta[2]: Treated (Tau / ATE)
  # beta[4]: Treated:xc (Gamma / HTE)
  beta <- coef(fit)
  alpha_hat <- beta[1]
  tau_hat   <- beta[2]
  gamma_hat <- beta[4]
  
  # 5. Delta Method for Relative Lift: g(alpha, tau) = tau / alpha
  # Grad: [-tau/alpha^2, 1/alpha]
  grad <- c(-tau_hat / (alpha_hat^2), 1 / alpha_hat)
  
  # Sub-matrix for [Intercept, Treated]
  sub_cov <- v_cov_robust[1:2, 1:2]
  rel_var <- t(grad) %*% sub_cov %*% grad
  rel_se  <- sqrt(as.numeric(rel_var))
  
  return(list(
    ate               = as.numeric(tau_hat),
    ate_se            = se_robust[2],
    hte_slope         = as.numeric(gamma_hat),
    hte_se            = se_robust[4],
    expected_baseline = as.numeric(alpha_hat),
    relative_lift     = as.numeric(tau_hat / alpha_hat),
    relative_lift_se  = rel_se
  ))
}

Data Generator

set.seed(100)
n <- 100000
baseline_effect <- 1
hte_factor <- 1

# Generate X with whales
x <- rlnorm(n, meanlog = 2, sdlog = 1)

# Simulate imbalance
data <- data.frame(x = x) %>%
  arrange(desc(x)) %>%
  mutate(
    # Create a propensity vector where the first half has high propensity
    # but the second half has low propensity. Since the data is sorted this
    # corresponds to a correlation with X.
    prob_t = c(rep(0.6, n/2), rep(0.4, n/2)),
    treated = rbinom(n, 1, prob_t),
    noise = rnorm(n, 0, 15)
   ) %>%
  mutate(
    effect_if_treated = rnorm(n, baseline_effect, 1) + hte_factor * x,
    y = noise + rnorm(n, 1, 1)*x + effect_if_treated * treated
  )

Benchmark

cbind(
  model = c("naive 2nd order aggs", "ground truth"),
  rbind(
  data %>%
    compress() %>%
    fit_lin_model() %>%
    as.data.frame(),
  data %>%
    ground_truth() %>%
    as.data.frame(row.names = NULL)
  )
)
                 model      ate    ate_se hte_slope     hte_se
1 naive 2nd order aggs 13.09455 0.1604445 0.9951046 0.01029416
2         ground truth 13.09455 0.1987060 0.9951046 0.07023514
  expected_baseline relative_lift relative_lift_se
1          12.23248      1.070474       0.02055102
2          12.23248      1.070474       0.02979132

An Even Better 2nd Order Approximation

The problem with the above implementation of the sandwich is that it is only group robust. In this example the groups are simply defined as the control and treatment groups. We are able to capture variation in the variance along these 2 groups, but we do not capture variation in the \(x\) variable.

Another way to think about group robust covariances is to say that the data is locally homoskedastic. For example, within the control group it is locally homoskedastic, but between the control and treatment group it is heteroskedastic. We can define additional grouping variables that allow us to be more precise. Say that we bin the \(x\) variable into quantiles We can argue that data within the same bin truly is locally homoskedastic, and across different bins it is heteroskedastic. Instead of collapsing the user level data into just 2 rows, one for treatment aggregates and one for control aggregates, we’d collapse the data into say 200 rows for (treatment/control, bin) aggregates. The aggregates we compute are the same as before.

We use the variable n_bins to control how many bins are created in the aggregation step. When n_bins = 1 there is a lot of error on the variance of the estimator, though the parameters themselves are recovered with zero error. When n_bins = n, there is zero error on the variance of the estimator. The error is proportional to the inverse of n_bins squared.

compress_with_binning <- function(df, n_bins = 1000) {
  df %>%
    mutate(bin = ntile(x, n_bins)) %>%
    group_by(treated, bin) %>%
    summarise(
      n      = n(),
      sum_x  = sum(x),
      sum_y  = sum(y),
      sum_x2 = sum(x^2),
      sum_y2 = sum(y^2),
      sum_xy = sum(x * y),
      .groups = "drop"
    )
}

#' Vectorized YOCO Lin Estimator with Expected Baseline
#' Efficiently recovers raw-data OLS, Robust SEs, and the Baseline Intercept
fit_yoco_lin <- function(agg_df) {
  # 1. Global Constants & Centering
  n_total <- sum(agg_df$n)
  n_t1 <- sum(agg_df$n[agg_df$treated == 1])
  
  # Weighted global mean for centering (Ensures intercept = baseline)
  global_x_bar <- sum(agg_df$sum_x) / n_total
  
  # 2. Reconstruct the Global Moment Matrix (Z'Z) and Vector (Z'y)
  agg_df <- agg_df %>%
    mutate(
      xc      = (sum_x / n) - global_x_bar,
      sum_xc  = sum_x - n * global_x_bar,
      sum_xc2 = sum_x2 - 2 * global_x_bar * sum_x + n * (global_x_bar^2),
      sum_yxc = sum_xy - global_x_bar * sum_y
    )
  
  t1 <- agg_df[agg_df$treated == 1, ]
  t0 <- agg_df[agg_df$treated == 0, ]
  
  s_xc_t1  <- sum(t1$sum_xc);  s_xc_t0  <- sum(t0$sum_xc)
  s_xc2_t1 <- sum(t1$sum_xc2); s_xc2_t0 <- sum(t0$sum_xc2)
  s_y_t1   <- sum(t1$sum_y);   s_y_t0   <- sum(t0$sum_y)
  s_yxc_t1 <- sum(t1$sum_yxc); s_yxc_t0 <- sum(t0$sum_yxc)
  
  # Z'Z Matrix construction
  ZZ <- matrix(c(
    n_total,       n_t1,          s_xc_t1+s_xc_t0, s_xc_t1,
    n_t1,          n_t1,          s_xc_t1,         s_xc_t1,
    s_xc_t1+s_xc_t0, s_xc_t1,     s_xc2_t1+s_xc2_t0, s_xc2_t1,
    s_xc_t1,       s_xc_t1,       s_xc2_t1,        s_xc2_t1
  ), nrow = 4, byrow = TRUE)
  
  # Z'y Vector
  Zy <- c(s_y_t1 + s_y_t0, s_y_t1, s_yxc_t1 + s_yxc_t0, s_yxc_t1)
  
  # Solve for Theta (Point Estimates)
  theta <- solve(ZZ, Zy)
  
  # 3. Vectorized "Meat" Matrix (Robust SE calculation)
  Zb <- cbind(1, agg_df$treated, agg_df$xc, agg_df$treated * agg_df$xc)
  y_mu_b <- agg_df$sum_y / agg_df$n
  e_b    <- y_mu_b - as.numeric(Zb %*% theta)
  
  # Within-bin variance adjustment (The YOCO magic)
  within_ss <- agg_df$sum_y2 - (agg_df$sum_y^2 / agg_df$n)
  W <- (agg_df$n * e_b^2) + within_ss
  Meat <- t(Zb) %*% (W * Zb)
  
  # 4. Sandwich Finalization
  ZZ_inv      <- solve(ZZ)
  vcov_robust <- ZZ_inv %*% Meat %*% ZZ_inv
  
  # 5. Extraction
  beta  <- theta
  alpha <- beta[1]  # The Expected Baseline
  tau   <- beta[2]  # The ATE
  se    <- sqrt(diag(vcov_robust))
  
  # Delta Method for Relative Lift SE
  grad   <- c(-tau / (alpha^2), 1 / alpha)
  rel_se <- sqrt(as.numeric(t(grad) %*% vcov_robust[1:2, 1:2] %*% grad))
  
  return(list(
    ate               = tau,
    ate_se            = se[2],
    hte_slope         = beta[4],
    hte_se            = se[4],
    expected_baseline = alpha,
    relative_lift     = tau / alpha,
    relative_lift_se  = rel_se
  ))
}
cbind(
  model = c("naive 2nd order", "yoco with quantile binning", "ground truth"),
  rbind(
    data %>%
      compress() %>%
      fit_lin_model() %>%
      as.data.frame(),
    data %>%
      compress_with_binning(n_bins = 1000) %>%
      fit_yoco_lin() %>%
      as.data.frame(),
    data %>%
      ground_truth() %>%
      as.data.frame(row.names = NULL)
  )
)
                       model      ate    ate_se hte_slope     hte_se
1            naive 2nd order 13.09455 0.1604445 0.9951046 0.01029416
2 yoco with quantile binning 13.09455 0.1818185 0.9951046 0.05583564
3               ground truth 13.09455 0.1987060 0.9951046 0.07023514
  expected_baseline relative_lift relative_lift_se
1          12.23248      1.070474       0.02055102
2          12.23248      1.070474       0.02711254
3          12.23248      1.070474       0.02979132