How Long To Run an A/B Test as a Bayesian

Author

Demetri Pananos

Published

February 20, 2026

Disclaimer: This blog post is inspired, almost entirely, by Tyler Buffington’s excellent statistics reading group presentation today at Datadog on the value of information in A/B testing. The ideas presented here are mainly his, I’ve just gone ahead and done a little more math (but not much more).

Introduction

In this post, I discuss decision criteria for Bayesian A/B testing and a way to think about experiment run times for Bayesian A/B tests. I’ll also present some analytic results that can help us avoid doing simulations in practice at the cost of some mild assumptions.

First, I’ll give a little background on some straightforward ways to analyze A/B tests as a Bayesian and cover a few options for ship/no-ship decision rules. Then, I’ll introduce a way to think about expected impact as a Bayesian, and finally share some analytic results.

Bayesian Approaches to A/B Testing

Most A/B tests target the lift in a metric, defined as

\[ \lambda = \dfrac{E[Y(1)]}{E[Y(0)]} - 1 \>. \]

We can obtain an estimate of the lift, \(\hat \lambda\), via the plug-in principle and the associated standard error, \(s_\lambda\), via the delta method. Assuming \(\hat \lambda\) is drawn from a normal distribution (I’ve talked about this before here), and that \(s_\lambda\) is known exactly, we can use a conjugate normal-normal model to obtain the posterior distribution. Assuming the prior mean and variance are \(\mu\) and \(\tau^2\) respectively, the posterior variance and mean, \(V\) and \(M\), are

\[ V^{-1} = \dfrac{1}{s^2_\lambda} + \dfrac{1}{\tau^2} \>, \] \[ M = V \times \left(\dfrac{\hat \lambda}{s^2_\lambda} + \dfrac{\mu}{\tau^2}\right) \>. \]
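
As a concrete (if toy) illustration, here is what that update looks like in R; the numbers below are made up for demonstration and are not from a real experiment.

```r
# Conjugate normal-normal update for the lift, assuming s_lambda is known.
# All values below are illustrative.
lambda_hat <- 0.02   # estimated lift (plug-in)
s_lambda   <- 0.01   # standard error of the lift (delta method)
mu         <- 0      # prior mean
tau        <- 0.05   # prior standard deviation

V <- 1 / (1 / s_lambda^2 + 1 / tau^2)            # posterior variance
M <- V * (lambda_hat / s_lambda^2 + mu / tau^2)  # posterior mean
```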

Whereas a Frequentist would make their ship/no-ship decision based on statistical significance, a Bayesian has a few options. Assuming increases to the metric are favorable, the following are some of the most popular rules I have seen in the wild (a small R sketch of each follows the list):

  • Ship if \(c \lt \Pr(\lambda \gt 0 \mid \hat \lambda)\). This is sometimes called “Probability to beat control”.
  • Ship if \(0 \lt M - z_{1-\alpha/2}\sqrt{V}\). This is like a Bayesian p-value being less than \(\alpha\).
  • Ship if \(E[\mathcal{L}(\lambda)] < \epsilon\). Here, \(\mathcal{L}\) is a loss function, so the decision to ship the treatment is based on its expected loss being sufficiently small. In what follows, I will use \(\mathcal{L}(\lambda) = \max(-\lambda, 0)\).
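
Here is the small sketch promised above: each rule written as a function of the posterior mean \(M\) and variance \(V\). The thresholds are illustrative choices, not recommendations.

```r
# Three ship/no-ship rules as functions of the posterior mean M and variance V.
ship_prob_beat_control <- function(M, V, c = 0.95) {
  pnorm(M / sqrt(V)) > c                      # Pr(lambda > 0 | data) exceeds c
}

ship_credible_interval <- function(M, V, alpha = 0.05) {
  M - qnorm(1 - alpha / 2) * sqrt(V) > 0      # lower credible bound is above 0
}

ship_expected_loss <- function(M, V, epsilon = 0.001) {
  # E[max(-lambda, 0)] under a normal posterior, via the usual censored-normal identity
  s <- sqrt(V)
  expected_loss <- s * dnorm(M / s) - M * (1 - pnorm(M / s))
  expected_loss < epsilon
}
```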

Now, these are just a few ways to make decisions (they are certainly not exhaustive). Prior to Tyler’s talk, I didn’t really have a great way to answer how long an A/B test should run (save some ramblings about EVSI here) were you to use a Bayesian analysis. However, with these details, I think I can now give an answer.

How Long To Run an A/B Test as a Bayesian

The question of how long to run an A/B test is really a question of sample size. Assuming you accrue some number of subjects per week, you can do some back-of-the-napkin math to say how long the test should run to sufficiently power the experiment given an MDE and some information on the metric. Often, we flip the script and present MDEs as a function of time (e.g. if you want a 3% MDE, run this experiment for 2 weeks. Want a smaller MDE? Run it longer). Run length is now a knob one can tune.

This is doable as a Frequentist because we have a very clear relationship between power, MDE, and sample size (which is a function of time). Although there are Bayesian conceptions of power (e.g. conditional power, unconditional power, etc.), the formulae for these don’t so straightforwardly relate time and desired effect sizes. This is further complicated by the prior – if your prior says a 5% lift is not probable, you probably shouldn’t design experiments to detect 5% lifts.
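
For reference, the standard frequentist back-of-the-napkin relationship (for a two-sided test at level \(\alpha\) with power \(1-\beta\), equal allocation, known outcome standard deviation \(\sigma\), and an absolute MDE of \(\Delta\)) is roughly

\[ n_{\text{per arm}} \approx \dfrac{2\left(z_{1-\alpha/2} + z_{1-\beta}\right)^2 \sigma^2}{\Delta^2} \>, \]

which you can invert to present the MDE as a function of sample size, and hence time.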

Rather than seek a formula, we can simply draw random numbers to see what happens under different scenarios. Consider the following algorithm (an R sketch of it follows the list):

  • Draw a true lift from your prior, \(\lambda\).
  • Draw a control group mean from the implied sampling distribution (you need the control mean and standard error to do this, but you should know these if you were going to do a power calculation anyway).
  • Draw a treatment group mean from the implied sampling distribution.
  • Compute \(\hat \lambda\) and \(s_\lambda\).
  • Compute \(M\) and \(V\).
  • Apply your decision rule. If you reach ship criteria, then return the true lift, else return 0.
  • Repeat a few thousand times.
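
Here is that sketch in R. The prior, \(\sigma\), and allocation mirror the figures below; the normalization \(\bar y_c = 1\) is an assumption for illustration.

```r
# Simulate the distribution of true impacts under the probability to beat control rule.
# N is the total sample size across both groups; all parameter values are illustrative.
simulate_impact <- function(N, tau = 0.05, sigma = 0.5, ybar_c = 1, c_ship = 0.95, n_sims = 1e5) {
  n_group <- N / 2                                          # equal allocation
  se      <- sigma / sqrt(n_group)                          # standard error of each group mean
  lambda  <- rnorm(n_sims, 0, tau)                          # true lifts drawn from the prior
  ybar_c_hat <- rnorm(n_sims, ybar_c, se)                   # observed control mean
  ybar_t_hat <- rnorm(n_sims, ybar_c * (1 + lambda), se)    # observed treatment mean
  lambda_hat <- ybar_t_hat / ybar_c_hat - 1                 # plug-in lift estimate
  s_lambda2  <- 2 * se^2 / ybar_c_hat^2                     # delta method variance (lambda near 0)
  V <- 1 / (1 / s_lambda2 + 1 / tau^2)                      # posterior variance
  M <- V * lambda_hat / s_lambda2                           # posterior mean (prior mean is 0)
  ship <- pnorm(M / sqrt(V)) > c_ship                       # probability to beat control rule
  ifelse(ship, lambda, 0)                                   # impact: true lift if shipped, else 0
}

impacts <- simulate_impact(N = 5000)
mean(impacts)   # average impact to the product at this sample size
```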

Here, time affects the sampling distributions of treatment and control: the longer the experiment runs, the more samples we accrue, and the greater the precision of the group means. This procedure is shown for 4 scenarios below. The histograms show the distribution of true impacts to the product under the decision rule. The large spike at 0 indicates those A/B tests which resulted in no-ship decisions, and the remainder are the impacts from shipping.

Note a few things. First, while difficult to see, there are some cases (especially at small sample sizes) where we ship harm to the product! This makes sense: when we have noisy measurements, we will sometimes make type \(S\) errors. As the sample size increases, the probability we make these errors decreases, as does the proportion of no-ship decisions (the height of the spike decreases). Second, the mean of this distribution is the “average impact to the product”. This mean changes slightly due to my first point, and eventually the distribution of these impacts stabilizes to some limiting distribution (Tyler referred to this as the Clairvoyant distribution, essentially the distribution of impacts were you to know the true lift perfectly). Hence, there is some upper bound to the mean.
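
In fact, if we take the clairvoyant decision to be simply “ship whenever \(\lambda > 0\)”, then under a \(\operatorname{Normal}(0, \tau^2)\) prior that upper bound has a simple closed form:

\[ E[\max(\lambda, 0)] = \dfrac{\tau}{\sqrt{2\pi}} \>, \]

which for the \(\tau = 0.05\) prior used below works out to an average impact of roughly 2%.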

Distribution of shipped lifts under the probability to beat control decision rule (\(c = 0.95\)) for increasing sample sizes. Prior is \(\operatorname{Normal}(0, 0.05^2)\), \(\sigma = 0.5\), and equal allocation (\(p = 0.5\)). At small \(N\), most simulations result in no-ship (the spike at 0). As \(N\) grows, the posterior tightens and more true positives are shipped.

Below is a plot of the mean of this distribution as a function of sample size, and you can see this limiting behavior quite clearly. The proposal is to base your run time on the expectation of the Clairvoyant distribution. From the plot, running A/B tests for too short a time means the impact is going to be very small on average, but running longer means that (under the prior) you will impact the product more. There comes a point where running the test longer is not helpful; you encounter diminishing returns. This is a nice way to think about run time in my opinion! You know what your average impact would be were you to have perfect information, and you can collect enough samples so that you are sufficiently close to this without wasting time obtaining additional samples. I think that is pretty elegant!

Average impact to the product as a function of sample size under the probability to beat control rule (\(c = 0.95\)). The curve flattens as \(N\) grows, reflecting diminishing returns from longer experiments. The loess-smoothed curve is based on 100,000 simulations at each sample size.

Note that I’ve just shown simulations for probability to beat control, but the simulation approach is sufficiently flexible to accommodate any decision rule (expected loss, Bayesian p-values, or whatever you can cook up). If you’re an analyst, you can do this fairly readily on your computer. What if you wanted to make this a product many people can use, and use with reproducible results? Simulating is fine in my opinion (generating random numbers and counting should be pretty cheap), but it would be cool to have a formula for the expected value of this distribution as a function of the total sample size.

An Analytic Solution

Let’s stick with probability to beat control for a moment. The ship decision can be written as

\[ c \lt \Phi\left(\dfrac{M}{\sqrt{V}}\right) \>, \]

or equivalently

\[ \Phi^{-1}(c) \lt \dfrac{M}{\sqrt{V}} \>. \]

Here, \(\Phi\) is the CDF of a standard normal distribution. The discourse and evidence across the A/B testing literature suggest that most A/B tests fail to move metrics, meaning that we should expect \(\mu \approx 0\) as our prior mean. As a consequence, our ship condition would be

\[ \Phi^{-1}(c) \lt \dfrac{\hat \lambda}{s^2_\lambda\sqrt{\dfrac{1}{s^2_\lambda} + \dfrac{1}{\tau^2}}} \>.\]

Additionally, I personally think that an appropriate prior standard deviation is going to be small, since we don’t routinely detect enormous hurt/help to the product. As a consequence, \(\lambda \approx 0\), meaning the average outcomes in the two groups are going to be similar, at least to a first-order approximation! This means that the sampling variance of the lift can be approximated as

\[ s^2_\lambda \approx \left( \dfrac{s^2_t + s^2_c}{\bar y^2_c} \right) \>. \]

Here, the subscripts indicate estimates for treatment, \(t\), or control, \(c\): \(s^2_t\) and \(s^2_c\) are the squared standard errors of the group means, and \(\bar y_c\) is the control group mean. Note that this is where the sample size, and hence time, enters the equation: the squared standard errors shrink as \(N\) grows, though I’ve not written this out explicitly. I’ll add a dependency on \(N\) going forward, and we can re-write our decision criterion solely as a function of the estimated lift as

\[ S(N) = \Phi^{-1}(c) \cdot s^2_\lambda \cdot \sqrt{\dfrac{1}{s^2_\lambda} + \dfrac{1}{\tau^2}} \lt \hat \lambda \>. \]

If we knew the true lift \(\lambda\), we could calculate the probability that this criterion would be met by \(\hat \lambda\). It would be

\[ \Pr(S(N) \lt \hat \lambda \mid \lambda) = 1 - \Phi\left(\dfrac{S(N)-\lambda}{s_\lambda} \right)\]

but of course we don’t know \(\lambda\). We do however have a prior over \(\lambda\), so we can compute the expected impact as

\[ E[\mbox{impact} \mid N] = \int_{-\infty}^{\infty} \lambda \cdot \left(1 - \Phi\left(\dfrac{S(N)-\lambda}{s_\lambda} \right) \right) f(\lambda) d\lambda \>. \]

Here, \(f\) is the prior density of \(\lambda\). Now, this looks like a right hairy bastard, but it is actually quite manageable for R or other numerical integration tools, since the integrand is nice and smooth. Shown below are the simulated and analytic results for the probability to beat control decision rule. The approximation isn’t bad!

Simulated vs analytic expected impact for the probability to beat control rule (\(c = 0.95\)). The close agreement confirms the analytic solution is a good approximation, at least under the proposed parameters. Prior is \(\operatorname{Normal}(0, 0.05^2)\), \(\sigma = 0.5\), and equal allocation (\(p = 0.5\)).
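
For completeness, here is a rough R sketch of that integration. The parameters again mirror the figures (with \(\bar y_c = 1\) as an assumption), and the sampling variance uses the equal-allocation form \(s^2_\lambda(N) = 4\sigma^2 / (N \bar y_c^2)\) implied by the approximation above.

```r
# Analytic expected impact as a function of total sample size N,
# for the probability to beat control rule with a Normal(0, tau^2) prior.
expected_impact <- function(N, tau = 0.05, sigma = 0.5, ybar_c = 1, c_ship = 0.95) {
  s_lambda2 <- 4 * sigma^2 / (N * ybar_c^2)                            # sampling variance of the lift
  s_lambda  <- sqrt(s_lambda2)
  S_N <- qnorm(c_ship) * s_lambda2 * sqrt(1 / s_lambda2 + 1 / tau^2)   # ship threshold on lambda_hat
  integrand <- function(lambda) {
    lambda * (1 - pnorm((S_N - lambda) / s_lambda)) * dnorm(lambda, 0, tau)
  }
  integrate(integrand, lower = -Inf, upper = Inf)$value
}

# Expected impact at N = 5,000 versus the clairvoyant upper bound
expected_impact(5000)
0.05 / sqrt(2 * pi)

# Sample size at which expected impact reaches, say, 90% of the clairvoyant mean
target <- 0.9 * 0.05 / sqrt(2 * pi)
uniroot(function(N) expected_impact(N) - target, interval = c(100, 1e6))$root
```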

This sort of exercise can be done for the remaining two decision rules. I leave that exercise to the reader (or perhaps to their AI), but will close with a plot comparing the analytic formulae and the simulations.

Simulated (red) and analytic (blue) expected impact curves for three decision rules: probability to beat control (solid, \(c = 0.95\)), 95% credible interval excluding zero (dashed), and expected loss (dotted, \(\epsilon = 0.001\)). The expected loss rule is the most permissive, shipping more often and yielding higher average impact at each sample size.