Bayesian Futility Analysis for A/B Tests

Introduction

The sequential (or always-valid) confidence interval we use at Datadog is great because experimenters can run an experiment until they detect an effect (albeit with a loss in power, but hey, there is no free lunch). But what if that day never comes?

In a previous post, I discussed how to think about experiment run times for Bayesian A/B tests. The key result was a closed-form expression for the expected impact to the product as a function of total sample size, given a prior on the true lift and a decision rule of the form \(S(N) < \hat \lambda\). That post answered the question: before seeing any data, how long should I plan to run?

This post answers the complementary question: given what I’ve seen so far, should I keep going? That is to say, this post concerns futility. If at some interim point the probability of eventually reaching a ship decision is sufficiently low, we can stop the experiment early and save the traffic for something more promising.

Frequentists have a well-developed theory for this in the form of group sequential designs (I’ve written about group sequential designs here and here, although I didn’t cover futility back then). The Bayesian version is, in my opinion, more natural: we condition directly on the posterior and ask a straightforward probability question.

Setup and Notation

Recall the setup from the previous post. We target the lift

\[ \lambda = \dfrac{E[Y(1)]}{E[Y(0)]} - 1 \>. \]

We observe an estimate \(\hat \lambda\) with standard error \(s_\lambda\) (via the delta method), and use a conjugate normal-normal model with prior \(\lambda \sim \operatorname{Normal}(\mu, \tau^2)\). The posterior variance and mean are

\[ V^{-1} = \dfrac{1}{s^2_\lambda} + \dfrac{1}{\tau^2} \>, \qquad M = V \times \left(\dfrac{\hat \lambda}{s^2_\lambda} + \dfrac{\mu}{\tau^2}\right) \>. \]
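As a quick illustration, the update above fits in a few lines of Python. This is a minimal sketch rather than code from the original analysis: the function name `posterior` and the example numbers are hypothetical.

```python
def posterior(lam_hat, s, mu, tau):
    """Normal-normal update: return the posterior mean M and variance V.

    lam_hat, s : observed lift and its standard error
    mu, tau    : prior mean and prior standard deviation
    """
    precision = 1.0 / s**2 + 1.0 / tau**2        # V^{-1}
    V = 1.0 / precision
    M = V * (lam_hat / s**2 + mu / tau**2)       # precision-weighted average
    return M, V


# Hypothetical inputs: observed lift of 2% with a 1% standard error,
# and a Normal(0, 0.02^2) prior on the lift.
M, V = posterior(lam_hat=0.02, s=0.01, mu=0.0, tau=0.02)
print(M, V)
```

The posterior mean lands between the prior mean and the observed lift, pulled toward whichever of the two is more precise.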

The three decision rules from the previous post could all be expressed in the form

\[ S(N) < \hat \lambda \>, \]

where

\[ S(N) = s_\lambda^2 \left[ \Phi^{-1}(p) \sqrt{\frac{1}{\tau^2} + \frac{1}{s_\lambda^2}} - \frac{\mu}{\tau^2} \right] \]

and \(p = c\) (probability to beat control), \(p = 1-\alpha/2\) (credible interval), or \(p = 0.5\) (posterior mean positive).
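To make the threshold concrete, here is a small Python sketch of \(S(N)\), together with a check that an estimate sitting exactly on the threshold gives a posterior probability of a positive lift equal to \(p\). The function names and the example numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal


def ship_threshold(s, mu, tau, p):
    """S(N): the smallest observed lift that satisfies the decision rule."""
    z = STD.inv_cdf(p)
    return s**2 * (z * sqrt(1 / tau**2 + 1 / s**2) - mu / tau**2)


def prob_lift_positive(lam_hat, s, mu, tau):
    """Posterior probability that the true lift is positive."""
    V = 1 / (1 / s**2 + 1 / tau**2)
    M = V * (lam_hat / s**2 + mu / tau**2)
    return STD.cdf(M / sqrt(V))


# Hypothetical inputs: standard error 1%, prior Normal(0, 0.02^2), p = 0.95.
S = ship_threshold(s=0.01, mu=0.0, tau=0.02, p=0.95)
# An estimate exactly at the threshold should give a probability of p.
print(S, prob_lift_positive(S, s=0.01, mu=0.0, tau=0.02))
```

Writing the rule as a threshold on \(\hat \lambda\) is what makes the later closed form possible: the ship decision becomes a single comparison against a normally distributed quantity.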

Suppose we have run the experiment for \(K\) weeks, accruing \(N_K\) total subjects. We have observed \(\hat \lambda_K\) with standard error \(s_K\), giving us a posterior \((M_K, V_K)\). If we continue for \(W\) more weeks at a rate of \(n_w\) subjects per week, the final sample size will be \(N_f = N_K + W \cdot n_w\). What is the probability we reach our shipping criterion after getting \(W \cdot n_w\) more samples?

The Closed-Form Solution

We can avoid simulation entirely. The key insight is that the posterior \((M_K, V_K)\) plays exactly the same role as the prior \((\mu, \tau^2)\) did in the previous post.

Conditional on the current posterior, the true lift is \(\lambda \sim \operatorname{Normal}(M_K, V_K)\) and the final observed lift given \(\lambda\) is \(\hat \lambda_f \mid \lambda \sim \operatorname{Normal}(\lambda, s_f^2)\). Marginalizing over \(\lambda\):

\[ \hat \lambda_f \sim \operatorname{Normal}(M_K, \> V_K + s_f^2) \>. \]

The ship decision requires \(\hat \lambda_f > S(N_f)\), so

\[\Pr(\text{ship} \mid \text{continue until reaching } N_f) = \Phi\left(\frac{M_K - S(N_f)}{\sqrt{V_K + s_f^2}}\right) \>.\]
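A minimal Python sketch of this formula follows. The function `prob_ship` is a direct transcription and takes the threshold as an input. The usage example rests on an assumption that the text does not spell out: it reads \(s_f\) as the standard error of the estimate built from the data collected after the interim look, and computes \(S(N_f)\) with \((M_K, V_K)\) in the prior's role. All names and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal


def ship_threshold(s, mu, tau, p):
    """S(N) from the setup section, for a Normal(mu, tau^2) prior."""
    z = STD.inv_cdf(p)
    return s**2 * (z * sqrt(1 / tau**2 + 1 / s**2) - mu / tau**2)


def prob_ship(M_K, V_K, S_f, s_f):
    """Closed-form probability of a ship decision if we continue to N_f."""
    return STD.cdf((M_K - S_f) / sqrt(V_K + s_f**2))


# Hypothetical interim state: posterior Normal(0.016, 8e-5).
M_K, V_K = 0.016, 8e-5
s_f = 0.01  # ASSUMPTION: standard error of the estimate from the remaining data
# ASSUMPTION: the interim posterior plays the prior's role in the threshold.
S_f = ship_threshold(s_f, mu=M_K, tau=sqrt(V_K), p=0.95)
print(prob_ship(M_K, V_K, S_f, s_f))
```

Keeping the threshold as an argument means the same function works if you compute \(S(N_f)\) some other way.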

Seems pretty straightforward. The nice part about this approach is that we can check how good our forecasts are since \(\Pr(\text{ship} \mid \text{continue until reaching } N_f)\) can be empirically validated. This brings us to one of my favorite topics to talk to data scientists about: calibration!

Calibration, Such an Aggravation

The closed-form \(\Pr(\text{ship} \mid \text{continue until reaching } N_f)\) is a prediction. We should verify it is calibrated: among experiments where we predict \(\Pr(\text{ship} \mid \text{continue until reaching } N_f) = p\), roughly fraction \(p\) should actually ship.

To check this, we simulate the full lifecycle of many experiments end-to-end:

1. Draw a true lift from the prior.
2. Simulate interim data at week \(K\) and compute the posterior \((M_K, V_K)\).
3. Compute the closed-form \(\Pr(\text{ship} \mid \text{continue until reaching } N_f)\).
4. Simulate the actual remaining data and check whether the experiment actually ships.
5. Evaluate the calibration using something like rms::val.prob or similar tools.
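The steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code behind the figure: it assumes the standard error scales as `sigma / sqrt(n)` for a hypothetical per-subject scale `sigma`, treats the interim posterior as the prior for the remaining data, and replaces rms::val.prob with a simple binned comparison. Parameter values other than \(N_K\) and \(N_f\) are made up.

```python
import random
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal


def simulate_calibration(n_sims=100_000, mu=0.0, tau=0.05, sigma=1.0,
                         N_K=1_000, N_f=5_000, p=0.95, n_bins=10, seed=0):
    """Return (mean predicted Pr(ship), observed ship rate, count) per bin."""
    rng = random.Random(seed)
    z = STD.inv_cdf(p)
    s_K = sigma / sqrt(N_K)            # ASSUMPTION: SE shrinks like 1/sqrt(n)
    s_rem = sigma / sqrt(N_f - N_K)    # SE of the estimate from remaining data
    bins = [[0.0, 0, 0] for _ in range(n_bins)]  # [sum of preds, ships, count]
    for _ in range(n_sims):
        lam = rng.gauss(mu, tau)                       # 1. true lift
        lam_hat_K = rng.gauss(lam, s_K)                # 2. interim estimate
        V_K = 1 / (1 / s_K**2 + 1 / tau**2)
        M_K = V_K * (lam_hat_K / s_K**2 + mu / tau**2)
        # 3. closed-form prediction, with the posterior in the prior's role
        S = s_rem**2 * (z * sqrt(1 / V_K + 1 / s_rem**2) - M_K / V_K)
        pred = STD.cdf((M_K - S) / sqrt(V_K + s_rem**2))
        # 4. simulate the remaining data and apply the rule to the final posterior
        lam_hat_rem = rng.gauss(lam, s_rem)
        V_f = 1 / (1 / s_rem**2 + 1 / V_K)
        M_f = V_f * (lam_hat_rem / s_rem**2 + M_K / V_K)
        shipped = M_f / sqrt(V_f) > z                  # Pr(lift > 0) > p
        # 5. accumulate by predicted-probability bin
        b = min(int(pred * n_bins), n_bins - 1)
        bins[b][0] += pred
        bins[b][1] += shipped
        bins[b][2] += 1
    return [(sp / c, sh / c, c) for sp, sh, c in bins if c > 0]


for mean_pred, ship_rate, count in simulate_calibration(n_sims=20_000):
    print(f"predicted {mean_pred:.3f}  observed {ship_rate:.3f}  n={count}")
```

The ship decision in step 4 is computed from the final posterior directly rather than from the threshold, so the comparison is not circular: the prediction and the outcome take different routes to the same rule.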

You can see the result below.

Calibration of the closed-form futility probability. We simulate 100,000 experiments with an interim analysis at \(N_K = 1{,}000\) and a final analysis at \(N_f = 5{,}000\). Each point represents a bin of experiments grouped by their predicted \(\Pr(\text{ship} \mid \text{continue until reaching } N_f)\). The dashed line is perfect calibration.

Now, you might say “Demetri, you were simulating from the prior. I am not surprised that the procedure is calibrated”. While correct, I think you are missing my point. You can check if your procedure is calibrated yourself! You can even do this retrospectively. Take a look at what you knew a week before an experiment ended. Then, compute the probability you would have made a ship decision a week later, and voilà. This is a great way to check if your prior is “right” in my opinion (a topic I anticipate I will write a lot more on in the coming months).

Bonus: Futility Boundary

We can also ask: at each interim week, what is the minimum posterior mean \(M_K\) such that \(\Pr(\text{ship} \mid \text{continue until reaching } N_f) > x\), for a given threshold \(x < 0.5\)? This gives a futility boundary: if your posterior mean is below the line at any point, stop the experiment! Simply invert the ship probability calculation to obtain

\[ S(N_f) - z_{1-x} \sqrt{V_K + s^2_f} \leq M_K \>, \]

where \(z_{1-x} = \Phi^{-1}(1-x)\).
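Here is a hedged Python sketch of the boundary. It assumes, as one reading of the setup, that \(S(N_f)\) is computed with \((M_K, V_K)\) in the prior's role and that \(s_f\) is the standard error of the estimate from the remaining data. Under that reading \(S(N_f)\) itself depends on \(M_K\), so the sketch solves the inequality for \(M_K\) explicitly; if your threshold does not depend on \(M_K\), the inequality can be used as written. The weekly traffic, `sigma`, and the prior are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal


def prob_ship(M_K, V_K, s_f, p):
    """Pr(ship | continue), with the interim posterior acting as the prior."""
    z_p = STD.inv_cdf(p)
    S_f = s_f**2 * (z_p * sqrt(1 / V_K + 1 / s_f**2) - M_K / V_K)
    return STD.cdf((M_K - S_f) / sqrt(V_K + s_f**2))


def futility_boundary(V_K, s_f, p, x):
    """Posterior mean M_K at which prob_ship equals x.

    Obtained by substituting S(N_f) into S(N_f) - z_{1-x} sqrt(V_K + s_f^2) = M_K
    and solving for M_K (valid only under the assumption stated above).
    """
    z_p = STD.inv_cdf(p)
    z_1mx = STD.inv_cdf(1 - x)
    return (z_p * s_f * sqrt(V_K) - z_1mx * V_K) / sqrt(V_K + s_f**2)


# Hypothetical 8-week experiment with 1,000 subjects per week.
sigma, tau, n_w, total_weeks = 1.0, 0.05, 1_000, 8
for K in range(1, total_weeks):
    s_K = sigma / sqrt(K * n_w)                   # ASSUMPTION: SE ~ 1/sqrt(n)
    V_K = 1 / (1 / s_K**2 + 1 / tau**2)           # posterior variance at week K
    s_f = sigma / sqrt((total_weeks - K) * n_w)   # SE of the remaining data
    bound = futility_boundary(V_K, s_f, p=0.95, x=0.10)
    print(f"week {K}: stop if posterior mean is below {bound:.4f}")
```

A quick sanity check is to plug the boundary back into `prob_ship` and confirm it returns \(x\).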

Futility boundaries for three thresholds (\(x = 0.1\%\), \(1\%\), \(10\%\)): the minimum posterior mean \(M_K\) at each interim week such that \(\Pr(\text{ship} \mid \text{continue until reaching } N_f) > x\) if the experiment continues to week 8. Decision rule: probability to beat control \(> 0.95\).