library(tidyverse)
set.seed(0)
# Simulation Parameters
p <- 0.5
n <- 250
nsims <- as.integer((1/0.01)^2)

# Run the simulation
sims <- rerun(nsims, {
  # K=1
  x1 <- rbinom(1, n, p)
  # K=2, accumulating data from each stage
  x2 <- x1 + rbinom(1, n, p)

  # Compute the various quantities we will need, like the Z score
  K <- str_c('K=', 1:2)
  X <- c(x1, x2) / ((1:2)*n)
  mu <- p
  sds <- sqrt(p*(1-p)/(n*1:2))
  Z <- (X - p)/sds
  reject <- abs(Z) > 1.96

  tibble(K, X, mu, sds, Z, reject)
}) %>%
  bind_rows(.id = 'sim')

fpr <- sims %>%
  group_by(sim) %>%
  summarise(result = any(reject)) %>%
  summarise(fpr = mean(result)) %>%
  pull(fpr)
Special thanks to Jacob Fiksel for writing a great blog post which inspired me to write my own.
At Zapier, AB testing kind of has a bad rap. AB testing is perceived as slow – sometimes taking up to a month to complete a single test – with the chance that we don’t get a definitive result (i.e. we fail to reject the null). One of our priorities (and hence my priority) is to find a way to speed up AB testing so we can learn faster.
Peeking is one way to do that. Peeking involves testing the experimental data before the end of the experiment (“peeking” at the results to see if they indicate a change). As you may know from other popular posts on the matter, or from sophomore stats, this can inflate the type one error. That’s a real shame, because peeking is a really attractive way to end an experiment early and save some time. Additionally, people are curious! They want to know how things are going. Fortunately, there are ways to satisfy the urge to peek while preserving the type one error rate.
One way to peek while preserving the type one error rate is through Group Sequential Designs (GSDs). This series of blog posts is intended to delve into some of the theory of GSDs. To me, theoretical understanding – knowing why something works, or at least being able to understand how in principle I could do this myself – is the key to learning. I’m happy to just do this in isolation, but I bet someone else may benefit too.
I’m working mainly from this book, but I don’t anticipate I will discuss the entirety of the book. I really want to know a few key things:
- What is the foundational problem for peeking?
- How can we address that problem (i.e. How can we preserve the type one error when we peek)?
- How else can we speed up experiments (e.g. by declaring an experiment futile)?
- What is the theory underlying each of the above?
Goal For This Post
We know that operating under “peeking conditions” – just testing the data as they roll in – inflates the type one error rate. In this post, I want to understand why that happens. Like… where, exactly, is the problem? And what will be our theoretical basis for attacking the problem of controlling the type one error rate?
But first, a little background on GSDs.
Background
The “G” in GSD means that the hypothesis test is performed on groups of observations. Given a maximum number of groups $K$, the data are collected and analyzed in up to $K$ groups rather than all at once at the end of the experiment.
The “S” in GSD means the test is performed sequentially. If after observing the $k$-th group the test rejects the null, the experiment stops early; otherwise we collect the next group and test again, up to the maximum of $K$ groups.
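In code, the whole procedure looks roughly like the sketch below. This is my own schematic with hypothetical helper functions (collect_group, test_stat, critical_value), not code from the book or from the rest of this post:

run_gsd <- function(K, n, collect_group, test_stat, critical_value) {
  data <- numeric(0)
  for (k in seq_len(K)) {
    # n new observations arrive in group k and are pooled with earlier groups
    data <- c(data, collect_group(n))
    # test the accumulated data; stop early if we reject
    if (abs(test_stat(data)) > critical_value(k)) {
      return(list(stopped_at = k, reject = TRUE))
    }
  }
  list(stopped_at = K, reject = FALSE)
}

If critical_value(k) returns 1.96 for every $k$, this is exactly the naive “peeking” procedure analyzed below; the whole game of GSDs is choosing those critical values more carefully.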
Some Math on Means
Means are a fairly standard place to start for a statistical test, so we will start there too. Let $X_{1k}, \dots, X_{nk}$ be the $n$ observations in group $k$, and let $\bar{X}_k$ be the sample mean of group $k$.
Since we are accumulating data, let’s write the cumulative mean up to and including group $k$ as

$$\bar{X}^{(k)} = \frac{1}{kn}\sum_{j=1}^{k}\sum_{i=1}^{n} X_{ij} = \frac{1}{k}\sum_{j=1}^{k}\bar{X}_j.$$
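In code, the cumulative means are just running totals divided by the accumulated sample size. A tiny illustration with made-up group totals (my example, not numbers from the post):

# Cumulative means: running totals over the accumulated sample size
# (toy group success counts; 250 observations per group as in the post)
group_totals <- c(130, 121)
n_per_group <- 250
cum_means <- cumsum(group_totals) / (seq_along(group_totals) * n_per_group)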
A Simple Example
Remember that our goal is to understand why the type one error rate increases when we peek as data accumulates, as we might do in an AB test. Answering how much is a little easier, so let’s do that first. Let’s do so by analyzing a simple $K=2$ group design under the following assumptions:
- That each group has the same sample size $n = 250$.
- That the data we observe are IID Bernoulli trials, $X_{ik} \sim \operatorname{Bernoulli}(p)$, for $p = 0.5$ and $k = 1, 2$.
- That our false positive rate is $\alpha = 0.05$.
How Much Does The Type One Inflate?
Let’s just simulate data under the assumptions above. At each stage, let’s test the null that $p = 0.5$ (which is true by construction), rejecting whenever $\lvert Z^{(k)} \rvert > 1.96$. Any rejection is therefore a false positive.
From our simulation, we reject the null around 8.6% of the time. That is certainly higher than the nominal 5%, but if we recall our sophomore stats classes, isn’t there a 9.8% ($1 - (1-0.05)^2$) chance of committing at least one false positive when performing two independent tests at the 5% level? So why do we see 8.6% instead?
The 8.6% isn’t simulation error. We forgot that $\bar{X}^{(1)}$ and $\bar{X}^{(2)}$ are not independent: the cumulative mean at the second analysis contains all of the data from the first, so the two test statistics are positively correlated.
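(Just to spell out the 9.8% arithmetic in code – this snippet is mine, not from the original post:)

# Chance of at least one false positive across two *independent* 5% tests
1 - (1 - 0.05)^2   # 0.0975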
Why The Type One Inflates
The assumptions we made above allow us to get a little analytic traction. We know that the sampling distribution of $\bar{X}^{(k)}$ is approximately normal with mean $p$ and standard deviation $\sqrt{p(1-p)/(kn)}$, which we can check against the simulated sampling distributions below.
Code
my_blue <- rgb(45/250, 62/250, 80/250, 1)
theme_set(theme_classic())

sims %>%
  ggplot(aes(X))+
  geom_histogram(aes(y=..density..), fill = 'light gray', color = 'black')+
  facet_wrap(~K) +
  geom_line(aes(y = dnorm(X,
                          mean = mu,
                          sd = sds[PANEL])),
            color = my_blue,
            size = 1)+
  theme(
    panel.grid.major = element_line()
  )+
  labs(y='Density',
       x = expression(bar(X)^(k)))
Consider the random vector $\left(\bar{X}^{(1)}, \bar{X}^{(2)}\right)$ of cumulative means. Jointly, these are approximately multivariate normal with mean $(p, p)$ and covariance matrix

$$\Sigma = p(1-p)\begin{pmatrix} \frac{1}{n} & \frac{1}{2n} \\ \frac{1}{2n} & \frac{1}{2n} \end{pmatrix}$$

(see the appendix for the covariance calculation).
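As a quick sanity check – my addition, not part of the original post – we can compare the empirical covariance of the simulated cumulative means to this matrix. This assumes the sims, p, and n objects created in the simulation above; the two should agree up to simulation noise.

# Empirical covariance of the simulated cumulative means...
emp_cov <- sims %>%
  select(sim, K, X) %>%
  pivot_wider(names_from = 'K', values_from = 'X') %>%
  select(-sim) %>%
  cov()

# ...versus the theoretical covariance matrix derived in the appendix
theo_cov <- p*(1-p) * matrix(c(1/n, 1/(2*n), 1/(2*n), 1/(2*n)), nrow = 2)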
Code
sigma_1 <- sqrt(qchisq(0.95, 2))
sigma_2 <- sqrt(qchisq(0.99, 2))
sig <- p*(1-p) * matrix(c(1/n, 1/(2*n), 1/(2*n), 1/(2*n)), nrow = 2)

tt <- seq(0, 1, 0.01)
x <- cos(2*pi*tt)
y <- sin(2*pi*tt)
R <- cbind(x, y)
e = eigen(sig)
V = sqrt(diag(e$values))

level_curve_1 <- sigma_1*R %*% (e$vectors %*% V %*% t(e$vectors)) + p
colnames(level_curve_1) <- c("X1", "X2")
level_curve_1 <- as_tibble(level_curve_1)

level_curve_2 <- sigma_2*R %*% (e$vectors %*% V %*% t(e$vectors)) + p
colnames(level_curve_2) <- c("X1", "X2")
level_curve_2 <- as_tibble(level_curve_2)

joint <- sims %>%
  select(sim, K, X) %>%
  pivot_wider(names_from='K', values_from='X') %>%
  rename(X1 = `K=1`, X2=`K=2`) %>%
  select(-sim) %>%
  sample_n(1000)

joint %>%
  ggplot(aes(X1, X2))+
  geom_point(color = 'dark gray', fill='gray', alpha = 0.5, shape=21)+
  geom_path(data=level_curve_1, aes(X1, X2), color = my_blue, linetype='dashed')+
  geom_path(data=level_curve_2, aes(X1, X2), color = my_blue)+
  lims(x=c(.4, .6), y=c(.4, .6))+
  theme(
    panel.grid.major = element_line(),
    aspect.ratio = 1
  )+
  labs(x=expression(bar(X)^(1)),
       y=expression(bar(X)^(2)))
Now that we know the joint sampling distribution for our statistics of interest (namely $\bar{X}^{(1)}$ and $\bar{X}^{(2)}$, or equivalently their standardized versions $Z^{(1)}$ and $Z^{(2)}$), we can ask the question we really care about: under the null, what is the probability that at least one of the two tests rejects?
Because the joint is multivariate normal, we can compute this probability directly. However, I’m just going to simulate it.
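(If you did want the direct computation, it might look something like the sketch below. This is my addition, assuming the mvtnorm package is available; it uses the fact, which follows from the covariance calculation in the appendix, that the correlation between the two statistics is $1/\sqrt{2}$.)

# Direct computation sketch: P(at least one rejection) under the null.
# Corr(Z1, Z2) = 1/sqrt(2) for this two-group design.
rho <- 1/sqrt(2)
corr_z <- matrix(c(1, rho, rho, 1), nrow = 2)
p_no_reject <- mvtnorm::pmvnorm(lower = c(-1.96, -1.96),
                                upper = c(1.96, 1.96),
                                corr  = corr_z)
fpr_exact <- 1 - p_no_reject

Back to the simulation approach: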
# Standardize the MVN by converting covariance matrix into a correlation matrix
D <- solve(diag(sqrt(diag(sig))))
cormat <- D %*% sig %*% D
Z <- MASS::mvrnorm((1/0.001)^2, rep(0, 2), cormat)

z1 = abs(Z[, 1])>1.96
z2 = abs(Z[, 2])>1.96

fpr <- mean(z1|z2)
and we get something like 8.3%. But that doesn’t answer why; it just means I did my algebra correctly. As always, a visualization might help. Take a look at Figure 3. The shaded regions show the areas where the null would be rejected. These are the areas where we would make a false positive. The dots are the standardized draws $\left(Z^{(1)}, Z^{(2)}\right)$ from the simulation.
The shaded region depends on the critical values we use for each test in the sequence. If we naively use $\pm 1.96$ as the critical value for each group, as we would under “peeking” conditions, then the shaded region is too big!
Code
sims %>%
  select(sim, K, Z) %>%
  pivot_wider(names_from='K', values_from='Z') %>%
  rename(Z1 = `K=1`, Z2=`K=2`) %>%
  select(-sim) %>%
  sample_n(1000) %>%
  ggplot(aes(Z1, Z2))+
  geom_point(color = 'dark gray', fill='gray', alpha = 0.5, shape=21)+
  scale_x_continuous(limits = c(-5, 5), expand=c(0,0))+
  scale_y_continuous(limits = c(-5, 5), expand=c(0,0))+
  annotate("rect", xmin = -5, xmax = -1.96, ymin = -5, ymax = 5, alpha = .5, fill=my_blue)+
  annotate("rect", xmin = 1.96, xmax = 5, ymin = -5, ymax = 5, alpha = .5, fill=my_blue)+
  annotate("rect", xmin = -1.96, xmax = 1.96, ymin = 1.96, ymax = 5, alpha = .5, fill=my_blue)+
  annotate("rect", xmin = -1.96, xmax = 1.96, ymin = -5, ymax = -1.96, alpha = .5, fill=my_blue)+
  geom_hline(aes(yintercept=-1.96), linetype='dashed')+
  geom_hline(aes(yintercept=1.96), linetype='dashed')+
  geom_vline(aes(xintercept=-1.96), linetype='dashed')+
  geom_vline(aes(xintercept=1.96), linetype='dashed')+
  theme(
    panel.grid.major = element_line(),
    aspect.ratio = 1
  )+
  labs(x=expression(Z^(1)),
       y=expression(Z^(2)))
find_region <- function(za){
  z1 = abs(Z[, 1])>za
  z2 = abs(Z[, 2])>za

  fpr <- mean(z1|z2)

  (fpr - 0.05)^2
}

result <- optimize(find_region, interval=c(0, 3))

That is the why! When we naively just run our test each time we peek, we are defining a rejection region in the $\left(Z^{(1)}, Z^{(2)}\right)$ plane that contains more than 5% of the probability under the null. To preserve the type one error rate we have to shrink the shaded region, which is what the optimize call above does: it finds a common critical value, necessarily larger than 1.96, for which the whole region contains exactly 5%.
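Stated as a requirement (my phrasing of the condition the optimization targets): we want a common critical value $c$ such that

$$\Pr\left(\,\lvert Z^{(1)}\rvert > c \ \text{ or } \ \lvert Z^{(2)}\rvert > c \;\middle|\; H_0 \right) = 0.05,$$

and since $c = 1.96$ makes this probability larger than 0.05, the solution must satisfy $c > 1.96$.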
Conclusion
We’ve done the algebra, and it wasn’t for nothing. It gave us insight into exactly what is going on and why the type one error increases under peeking. We also know that there is a way to fix it: we just need to define the shaded region a little more carefully. This will lead us to talk about alpha spending and various alpha spending functions.
Appendix
Covariance Calculation
The diagonals of the covariance matrix $\Sigma$ are just the variances of the cumulative means,

$$\operatorname{Var}\left(\bar{X}^{(k)}\right) = \frac{p(1-p)}{kn}, \qquad k = 1, 2.$$

What remains is the covariance, which can be obtained with some covariance rules. Writing $\bar{X}^{(2)} = \tfrac{1}{2}\left(\bar{X}_1 + \bar{X}_2\right)$,

$$\operatorname{Cov}\left(\bar{X}^{(1)}, \bar{X}^{(2)}\right) = \operatorname{Cov}\left(\bar{X}_1, \tfrac{1}{2}\bar{X}_1 + \tfrac{1}{2}\bar{X}_2\right) = \tfrac{1}{2}\operatorname{Var}\left(\bar{X}_1\right) + \tfrac{1}{2}\operatorname{Cov}\left(\bar{X}_1, \bar{X}_2\right).$$

Since the groups are independent, the group sample means are also independent (but the cumulative means are not), meaning $\operatorname{Cov}\left(\bar{X}_1, \bar{X}_2\right) = 0$ and therefore

$$\operatorname{Cov}\left(\bar{X}^{(1)}, \bar{X}^{(2)}\right) = \tfrac{1}{2}\operatorname{Var}\left(\bar{X}_1\right) = \frac{p(1-p)}{2n}.$$
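A corollary worth noting (my addition): the implied correlation between the two cumulative means is

$$\operatorname{Corr}\left(\bar{X}^{(1)}, \bar{X}^{(2)}\right) = \frac{p(1-p)/(2n)}{\sqrt{\frac{p(1-p)}{n}}\sqrt{\frac{p(1-p)}{2n}}} = \frac{1}{\sqrt{2}} \approx 0.707,$$

which is the correlation used in the direct pmvnorm sketch earlier in the post.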
Footnotes
See Section 6.1 for a calculation.↩︎