Jeffrey Wong
January 30, 2026

The SRM (sample ratio mismatch) test is a popular way to tell if your AB test is healthy. But it's not enough. The SRM is powered by a chi-square test under the hood, which compares observed count data with an expected distribution. For example, if we run an AB test and observe 10100 users in the treatment and 9990 in the control, we can compare that to an expected 50/50 split among the users. In this case, the observed counts show a 50.3% allocation into the treatment and 49.7% into the control. While this looks imbalanced, it's actually within tolerance, and the AB data would not fail the SRM.

```r
chisq.test(c(10100, 9990), p = c(.5, .5))
```

```
Chi-squared test for given probabilities

data:  c(10100, 9990)
X-squared = 0.60229, df = 1, p-value = 0.4377
```
The SRM is not enough. Even if the count data were 60% in the treatment and 40% in the control, and the SRM rejected, that alone is not enough to tell whether the AB test is healthy. It is certainly a red flag, but it is not conclusive.
The SRM only matters if there is a pretreatment covariate imbalance. Let’s build up that conclusion from first principles. We’ll also examine other possible dangerous scenarios, such as when there are heterogeneous effects.
Start with a constant-effect model, where every user's outcome is \(y_i = \beta_0 + k T_i\) and \(k\) is the treatment effect. Then the difference in means is

\[\begin{align} E[y | T = 1] - E[y | T = 0] &= \frac{1}{n_T} \sum_{i \in T} (\beta_0 + k) - \frac{1}{n_C} \sum_{i \in C} \beta_0 \\ &= \frac{1}{n_T} (n_T \beta_0 + n_T k) - \frac{1}{n_C} (n_C \beta_0) \\ &= k \end{align}\]
What’s interesting here is that we can derive the treatment effect, \(k\), without relying on the values of \(n_T\) and \(n_C\). It did not matter what the allocation was, or whether an SRM was triggered.
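As a quick sanity check, here is a minimal simulation sketch (all variable names and parameter values are illustrative, not from the original analysis): even with a deliberately imbalanced 60/40 allocation that any SRM would flag, the difference in means still recovers the constant effect \(k\).

```r
set.seed(1)

# Illustrative simulation: constant treatment effect under a 60/40 split
n <- 100000
beta0 <- 2                       # baseline, same for everyone
k <- 0.3                         # constant treatment effect
T <- rbinom(n, 1, 0.6)           # imbalanced allocation: an SRM would reject
y <- beta0 + k * T + rnorm(n)    # outcome under the constant-effect model

# Despite the imbalance, the difference in means is still close to k
diff_in_means <- mean(y[T == 1]) - mean(y[T == 0])
```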
Now suppose the treatment effect is heterogeneous: \(y_i = \beta_0 + \tau(x_i) T_i\), where \(x_i\) is a pretreatment covariate. Then

\[\begin{align} E[y | T = 1] - E[y | T = 0] &= \frac{1}{n_T} \sum_{i \in T} (\beta_0 + \tau(x_i)) - \frac{1}{n_C} \sum_{i \in C} \beta_0 \\ &= \frac{1}{n_T} \sum_{i \in T} \tau(x_i) \end{align}\]
The ground truth for the average treatment effect is \(ATE = \int \tau(x) f(x) dx\), the average of \(\tau(x)\) weighted by the pdf of \(x\). Meanwhile, \(\frac{1}{n_T} \sum_{i \in T} \tau(x_i)\) is the average value of \(\tau(x_i)\) among the treatment users. When \(f(x | T = 1) = f(x | T = 0) = f(x)\), the average among the treatment users still matches the ATE in expectation. So even when the SRM is violated, and even when there is a heterogeneous effect, there is still no issue.
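A similar sketch for the heterogeneous case (again with illustrative names and values): the allocation is imbalanced, but \(x\) has the same distribution in both arms, so the difference in means still lands on the ATE.

```r
set.seed(2)

# Illustrative simulation: heterogeneous effect, imbalanced allocation,
# but x is distributed identically in both arms
n <- 200000
x <- runif(n)                    # covariate, independent of assignment
T <- rbinom(n, 1, 0.6)           # imbalanced split, but unrelated to x
tau <- function(x) 1 + x         # heterogeneous effect; ATE = E[1 + x] = 1.5
y <- 2 + tau(x) * T + rnorm(n)

# Still close to the true ATE of 1.5
diff_in_means <- mean(y[T == 1]) - mean(y[T == 0])
```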
Finally, suppose the baseline also varies with \(x\), so \(y_i = \beta_0(x_i) + \tau(x_i) T_i\), and the distribution of \(x\) differs between the two arms, i.e. \(f(x | T = 1) \neq f(x | T = 0)\). Then

\[\begin{align} ATE &= \int \tau(x) f(x) dx \\ E[y | T = 1] - E[y | T = 0] &= \frac{1}{n_T} \sum_{i \in T} (\beta_0(x_i) + \tau(x_i)) - \frac{1}{n_C} \sum_{i \in C} \beta_0(x_i) \\ &= \frac{1}{n_T} \sum_{i \in T} \tau(x_i) + \left[ \frac{1}{n_T} \sum_{i \in T} \beta_0(x_i) - \frac{1}{n_C} \sum_{i \in C} \beta_0(x_i) \right] \\ &= \underbrace{\int \tau(x) f(x | T = 1) dx}_{\text{looks like an ATE}} + \underbrace{\left[ \int \beta_0(x) \left( f(x | T = 1) - f(x | T = 0) \right) dx \right]}_{\text{bias in the baseline}} \end{align}\]
The difference in means is now flawed in two ways: the first term is a weighted average of \(\tau(x)\) weighted by \(f(x | T = 1)\) rather than \(f(x)\), so it is biased for the ATE; and the second term is a nonzero bias in the baseline. Unless we control for \(x\), we cannot untangle which part is the treatment effect and which part is just baseline.
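To see the bias concretely, here is a sketch where assignment leaks the covariate (names and values are illustrative). The naive difference in means picks up the baseline bias, while regressing on \(x\) recovers the effect:

```r
set.seed(3)

# Illustrative simulation: assignment probability depends on x,
# and the baseline also depends on x
n <- 200000
x <- runif(n)
T <- rbinom(n, 1, plogis(2 * (x - 0.5)))  # covariate imbalance
y <- 3 * x + 0.5 * T + rnorm(n)           # baseline 3x, true effect 0.5

naive <- mean(y[T == 1]) - mean(y[T == 0])  # contaminated by baseline bias
adjusted <- coef(lm(y ~ T + x))["T"]        # controlling for x recovers 0.5
```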
What we have shown is that the danger doesn’t come from the SRM itself; the danger comes from covariate imbalance. Ultimately, we need to know whether the randomization in our test is controlled randomization.
One way to test this is to examine the correlation between \(x\) and \(T\). Ideally the correlation is 0, which is what we would see if \(f(x | T = 1) = f(x | T = 0) = f(x)\). We can test this by regressing
\[T = \alpha_0 + X \alpha_1 + \varepsilon\] If the t-test for \(H_0: \alpha_1 = 0\) is rejected, then there is nonzero correlation.
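A sketch of this check on simulated data (all names and values are illustrative): regress \(T\) on a single covariate and read off the t-test. A healthy assignment shows no relationship; an assignment that leaks \(x\) is rejected decisively.

```r
set.seed(4)

n <- 50000
x <- rnorm(n)

# Healthy test: assignment is independent of x
T_good <- rbinom(n, 1, 0.5)
p_good <- summary(lm(T_good ~ x))$coefficients["x", "Pr(>|t|)"]

# Broken test: assignment probability depends on x
T_bad <- rbinom(n, 1, plogis(0.5 * x))
p_bad <- summary(lm(T_bad ~ x))$coefficients["x", "Pr(>|t|)"]
```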
In practice though, we want to check for covariate imbalance along multiple \(x\) variables. So the regression extends to
\[T = \alpha_0 + X_1 \alpha_1 + X_2 \alpha_2 + ... + \varepsilon\]
Ideally the coefficients \(\alpha_1, \alpha_2, \ldots\) are jointly zero. Another way to frame this is that the \(R^2\) of the regression is ideally 0. This can be tested using the F test.
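With multiple covariates, the same check reads the overall F statistic and \(R^2\) from the regression summary. A minimal sketch, assuming simulated data with 5 illustrative covariates and a healthy assignment:

```r
set.seed(5)

n <- 50000
X <- matrix(rnorm(n * 5), n, 5)   # 5 pretreatment covariates, illustrative
T <- rbinom(n, 1, 0.5)            # healthy assignment

fit <- lm(T ~ X)
r2 <- summary(fit)$r.squared      # should be near 0
fstat <- summary(fit)$fstatistic  # overall F test of all coefficients
p_value <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)
```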
An interesting note on the side is that the test for \(H_0: \alpha_0 = 0.5\) is akin to the SRM test, but formulated from regression (assuming the intended traffic split was 0.5).
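To illustrate with the same counts as the chi-square example above: the intercept-only regression, tested against \(H_0: \alpha_0 = 0.5\), gives essentially the same p-value as the SRM chi-square test. (This sketch computes the normal-approximation p-value by hand rather than using the default intercept t-test, which tests against 0.)

```r
# Same counts as the chi-square SRM example: 10100 treatment, 9990 control
T <- c(rep(1, 10100), rep(0, 9990))

fit <- lm(T ~ 1)                   # intercept-only regression
alpha0_hat <- coef(fit)[1]         # = 10100 / 20090, the observed allocation
se <- summary(fit)$coefficients[1, "Std. Error"]

# Test H0: alpha_0 = 0.5
z <- (alpha0_hat - 0.5) / se
p_value <- 2 * pnorm(-abs(z))      # close to the chi-square p-value of 0.4377
```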
An important duty in AB testing is determining whether the test is healthy. We do not want the F test to reject, as that would indicate \(T\) is not actually random. If the F test is well powered, and fails to reject, we could conclude that the test is healthy.
The power of the F test is

\[\boxed{\text{Power} = P(F > F_{\text{crit}} \mid p, N - p - 1, \lambda)}\]

where \(F\) follows a noncentral F distribution with \(p\) and \(N - p - 1\) degrees of freedom and noncentrality parameter \(\lambda = N \frac{R^2}{1 - R^2}\).
We can use a root-finding algorithm to invert this equation and find the necessary sample size to achieve 80% power. If we have this much data, power is 80%, and we fail to reject, then we can conclude the test is healthy. If we want to scan across 20 covariates, and want to be able to detect an \(R^2\) as small as 0.0001, we only need roughly \(N = 200{,}000\). This is a very interesting result: large companies with small effect sizes tend to need \(N > 1{,}000{,}000\) to detect an effect, but only roughly \(N = 200{,}000\) to tell if the test is healthy.
```r
#' Calculate required sample size for F-test power
#'
#' @param r_squared The R-squared value to detect
#' @param p Number of covariates
#' @param target_power Desired power (e.g., 0.80)
#' @param alpha Significance level (default 0.05)
#'
#' @return The required sample size N (rounded up)
required_sample_size <- function(r_squared, p, target_power = 0.8, alpha = 0.05) {
  power_root_finder <- function(n) {
    df1 <- p
    df2 <- n - p - 1
    ncp <- n * (r_squared / (1 - r_squared))
    f_crit <- qf(1 - alpha, df1, df2)
    current_power <- pf(f_crit, df1, df2, ncp = ncp, lower.tail = FALSE)
    return(current_power - target_power)
  }
  solution <- uniroot(power_root_finder, interval = c(p + 2, 1e9))
  return(ceiling(solution$root))
}

p <- 20
target_r2 <- 0.0001
needed_n <- required_sample_size(target_r2, p)
cat(sprintf("Required Sample Size: %s\n", format(needed_n, big.mark = ",")))
```

```
Required Sample Size: 209,603
```