Introduction

Randomized experiments have a rich history, almost none of which is the concern of this post. Rather, I want to focus on a tiny slice of randomized experiment literature: CUPED (Controlled-experiment Using Pre-Experiment Data). The technique has been impactful in the online randomized experiment (a.k.a. A/B Testing) community, with many vendors – Eppo included – offering the technique. Despite the popularity and impact, customers and data scientists still seem to misunderstand why CUPED works, why different implementations of the procedure (a la regression adjustment) still provide the same benefit, and why different implementations are equivalent.

This sequence of posts seeks to answer popular questions regarding the CUPED, and in particular frames the technique as a regression procedure first as opposed to a “new” or different technique. I discuss papers that pre-date the CUPED as well as discuss how the literature has evolved since the publication of CUPED. This is not a systematic review, nor a scoping review; this is a story of regression erasure (I kid).

This post in particular will demonstrate the equivlence between CUPED and regression.

Set Up

Most of this blog post concerns online randomized experiments or “A/B tests” as they are known in industry. Generally, subjects will be exposed to treatment or control at some known probability (usually 50/50 since this is the most efficient design). Some outcome, \(Y\), is then recorded and groups are compared statistically. Various estimands are used in industry, but for simplicity we will use the average treatment effect as our estimand of choice.

CUPED: For Those Who Don’t Know

Deng and colleagues published CUPED in their 2013 paper “Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data”. The main motivation for the paper is increasing sensitivity of experiments to detect smaller effects, detect effects faster, or both. Deng et. al motivate their approach through “control variates”. Briefly – and aside from experimentation for a moment – given pairs of random variables \((Y_i, X_i)\) in which \(E[X]\) is known, define

\[\widehat{Y}_{c v}=\bar{Y}-\theta \bar{X}+\theta E[X]\]

where \(\theta\) is an arbitrary constant. It can be shown that \(\widehat{Y}_{c v}\) is an unbiased estimator of \(E[Y]\). Because \(\theta\) is arbitrary, we can choose a \(\theta\) so that the variance of the estimator is minimized. The variance of \(\widehat{Y}_{c v}\) is

\[ \begin{align} \operatorname{Var}(\widehat{Y}_{c v}) &= \operatorname{Var}(\bar{Y}-\theta \bar{X}+\theta E[X])\\ &=\dfrac{1}{n} \left( \operatorname{Var}(Y) + \theta^2 \operatorname{Var}(X) - 2\theta\operatorname{Cov}(X,Y) \right) \end{align}\]

and is notably quadratic in \(\theta\). Application of elementary calculus shows that the variance is then minimized when

\[ \theta = \dfrac{\operatorname{Cov}(Y, X)}{\operatorname{Var}(X)} \] and further algebraic manipulation shows

\[ \operatorname{var}\left(\widehat{Y}_{c v}\right)=\operatorname{var}(\bar{Y})\left(1-\rho^2\right) \] where \(\rho = \operatorname{Cor}(Y, X)\).

In short, the standard error of the mean can be reduced so long as there is some other variable, \(X\), correlated with \(Y\), where larger correlations lead to more variance reduction.

Back to experimentation, the insight in the CUPED paper was that while finding an \(X\) with known expectation is hard, the difference in expectations for any random variable in the pre-treatment period is 0. Deng et. al then estimate the difference in means between treatment in control by computing the control variate versions of outcomes in each group and then estimating the difference in means using ¹

\[ \Delta_{c v}=\widehat{Y}_{c v}^{(t)}-\widehat{Y}_{c v}^{(c)} \>. \]

In the remainder of the paper, Deng et. al provide some recommendations on choosing \(X\), landing on the recommendation to “[…]to use the same variable from the pre-experiment period as the covariate” – i.e. to use the pre-experiment outcome as the control variable. Deng et. al write

It is interesting to point out the connection with linear regression. The optimal \(\theta\) tunrs out to be the ordinary least squares (OLS) solution regressing (centered) \(Y\) on (centered) \(X\)

so the connection to regression was clearly made to readers. As we will see, the method is formally equivalent to regression, but in order to see this more clearly we will need the Frisch Waugh Lovell theorem.

Frisch Waugh Lovell Theorem: For Those Who Don’t Know

Econometricians will know the Frisch Waugh Lovell (FWL) theorem well. Borrowing from wikipedia, the theorem states that the estimate for \(\beta_2\) in the model (henceforth full model)

\[ Y = X_1 \beta_1 + X_2 \beta_2 +u \] will be the same as the estimate of \(\beta_2\) in the model (henceforth FWL model)

\[M_{X_1} Y=M_{X_1} X_2 \beta_2+M_{X_1} u\]

where \(M_{X_1} = I-X_1(X^T_1X_1)^{-1}X_1^T\). Squint hard enough and you will see that the projection, or hat, matrix appears in \(M_{X_1}\). The FWL theorem then says, more or less,

First regress \(Y\) onto \(X_1\) and compute the residuals. Then, regress \(X_2\) onto \(X_1\) and compute the residuals. The regression of the result from the first operation onto the result from the second operation will yield the same estimate of \(\beta_2\).

The proof can (probably) be found in most econometrics textbooks, but for now we will suffice for an example in R.

fit <- lm(am ~ vs + mpg, data=mtcars)

(ols_estimate <- coef(fit)['vs'])

        vs 
-0.4072501

r1 <- resid(lm(am ~ mpg, data=mtcars))
r2 <- resid(lm(vs ~ mpg, data=mtcars))

fwl_fit <- lm(r1~r2)

(fwl_estimate <- coef(fwl_fit)['r2'])

        r2 
-0.4072501

The procedure described above is sometimes called “partialling out” or “residualizing”. Point estimates and standard errors should be the same from both procedures (note that you have to adjust the partialled out estimate of the standard error so that it has the same degrees of freedom as the full estimate, see code below).

(se_vs_full <- summary(fit)$coefficients["vs", "Std. Error"])

[1] 0.1816192

se_vs_fwl_raw <- summary(fwl_fit)$coefficients["r2", "Std. Error"]

n <- nrow(mtcars)
p_full <- length(coef(fit))  # Number of parameters in full model
p_fwl <- length(coef(fwl_fit))  # Number of parameters in FWL regression

sigma2_full <- sum(residuals(fit)^2) / (n - p_full)
sigma2_fwl <- sum(residuals(fwl_fit)^2) / (n - p_fwl)

(se_vs_fwl_corrected <- se_vs_fwl_raw * sqrt(sigma2_full / sigma2_fwl))

[1] 0.1816192

Note also that the standard error from either the FWL model or the full model is smaller than it would be had \(X_2\) (or vs in my example) been the only variable in the regression. While I omit a proof of this claim, the intuition is straight forward. When including \(X_1\) in the regression (either in the full or FWL model), some of the variation in \(Y\) will be explained by variation in \(X_1\). This leads to smaller residual variation, which in turn shrinks the sampling variability of the coefficient for \(X_2\). Hence variance is reduced by including \(X_1\) in the regression. Furthermore, there is little risk of inflated standard errors due to collinearity since \(X_2\) is independent of \(X_1\) and hence uncorrelated in expectation. Hence, variance should only decrease when adjusting for pre-treatment variables correlated with \(Y\).

Why You Can’t Spell CUPED without FWL

Let’s bring this all together.² I’m going to start with a regression model and apply the FWL theorem in order to prove that CUPED is equivalent to regression.

Let \((Y, X, D)\) be a triple of random variables such that

\(E[Y] = E[X] = E[D] = 0\)
\(D\) is a randomly assigned binary treatment indicator and takes on values \(\{-0.5, 0.5\}\). This is mostly for convenience.
\(Y, X \perp D\) due to randomization
\(\operatorname{Cov}(Y, X) \neq 0\)
The true data generating mechanism is \(Y = \theta X + \Delta D + \epsilon\) with \(E[\epsilon] = 0\) and \(E[\epsilon D] = 0\).

First, partial out the effect of \(X\) on \(Y\) by regression \(Y\) on \(X\). The residual \(r_1\) is defined as

\[ r_1 = Y - \hat{\theta}X \]

with \(\hat{\theta} = \operatorname{Cov}(Y, X) / \operatorname{Var}(X)\). Note that this choice of \(\theta\) minimizes the variance of \(r_1\).

Second, partial out the effect of \(X\) on \(D\) by computing

\[r_2 = D-\hat{\beta} X \]

Since \(D\) is assumed independent of \(X\), then \(\hat{beta}=0\) and therefore \(r_2=D\).

By the FWL theorem, the coefficeint \(\Delta\) can be estimated using the following regression

\[ r_1 = \Delta r_2 + \varepsilon = \Delta D + \varepsilon \] Since \(D\) is binary and treatment corresponds to a unit change in \(D\), this model estimates the difference in means between the two groups. The difference in means is

\[ \Delta = E[Y - \hat{\theta} X \mid D=0.5] - E[Y - \hat{\theta} X \mid D=-0.5]\]

which is equivalent to the CUPED estimator.

Statistics, machine learning, and online experimentation are just different dialects of the same language. It can be useful to be proficient in all dialects because that means we can borrow ideas from one to apply to the other without the need to re-invent the wheel. Now that we know CUPED is just regression, we can just skip right to fitting a linear model and interpreting the coefficient knowing that we benefit from the variance reduction CUPED promises us.

I want to bring special attention to a quote from the CUPED paper regarding the choice of \(\theta\).

There is a slight subtlety that’s worth pointing out. The pair \((Y, X)\) may have different distributions in treatment and control when there is an experiment effect. For \(\Delta_{cv}\) to be unbiased, the same \(\theta\) has to be used for both control and treatment. The simplest way to estimate it is from the pooled population of control and treatment. The impact on variance reduction will likely be negligible.

In the next post in this series, we’ll examine this statement more closely along with earler criticisms of the use of OLS for estimating the average treatment effect in randomized experiments.

Footnotes

The algebraic details are as follows. Let \(Y_i^{(t)}\) be an outcome in the treatment group and let \(Y_i^{(c)}\) be an outcome in the control group. Then \[\begin{align} \Delta_{c v}&=\widehat{Y}_{c v}^{(t)}-\widehat{Y}_{c v}^{(c)}\\ &= \bar{Y}^{(t)}-\theta \bar{X}^{(t)}+\theta E[X^{(t)}] - (\bar{Y}^{(c)}-\theta \bar{X}^{(c)}+\theta E[X^{(c)}] ) \end{align}\] Note that prior to the treatment \(E[X^{(t)}] = E[X^{(c)}]\). Deng et. al also recommend using the same \(\theta\) for both treatment and control groups, which we will see in a future post can be improved upon.↩︎
Hat tip Evan Miller, I saw this section’s title in an internal doc and had a good laugh.↩︎