A Practical A/B Testing Brain Dump

AB Tests
Statistics
Author

Demetri Pananos

Published

May 14, 2023

What Is This Post?

Over the past year or so, I’ve helped move Zapier towards doing better experiments and more of them. I wanted to document some of this for posterity; what I did, what worked, how we know it worked, and where I might want to go in the future. This is more of a story rather than a set of instructions.

My hope is that there is some sort of generalizable knowledge here. I don’t want to turn this into a set of instructions, but I do think there is something to learn (even if this has been written about before. I mean, I couldn’t find anything I needed, so here I am making it). As such, its going to be a long post. I might periodically update this post from time to time so that I can just put something out there initially.

What This Post Is Not

This is not a list of statistical tests to run (spoiler: its all regression anyway). Nor is this a technical post about feature flagging, persisting experiences across sessions, caching, or any other technical aspects of running online experiments.

What Am I Going To Cover

Roughly, I want to go through the steps I took over the last year. Much of it was off the cuff and improvisation, but a lot of it was inspired by working with clinicians and contents of the book Trustworthy Online Controlled Experiments. Topics will include:

  • Understanding where we are and where we came from
  • Determining the quickest way to improve the situation
  • Building credibility and trust
  • Scaling experimentation – even if it means buying rather than building
  • (Counter intuitive) signs we are moving in the right direction
  • Next Steps

You should know: This was not a one man effort. There are lots of people to thank for any success I’ve had. Guidance from leadership, all the people who let me touch their projects, and all the data scientists who tapped me for help or a consult.


Empathy: Understanding Where We Are and Where We Came From

The first thing I did when I arrived at Zapier was shut up and listen. It would have been very easy for me to come in, see a few experiments, and then insist we do everything as Bayesians and use exotic methods to estimate quantities that didn’t actually matter but sounded important. There is a folk theorem I like that goes like

“If you have a problem, someone smarter than you has already solved it”.

A small corollary to that theorem is

“If you have a possible solution, someone has already tried it and failed.”

That failure is likely not because they just did it wrong. Don’t recreate stuff that doesn’t work. I set up a “listening tour” (basically a bunch of 1:1s) with anyone who was interested in talking to me or had something to say about experimentation. This was the single best thing I could have done. I learned a lot – which I will get to – but before that I want to talk about how to run these.

Your main job here is to empathize and be curious about what you’re hearing. Don’t solution, don’t look for a nail you can hit with your statistical hammer, just listen. I politely asked everyone if I could record the sessions so that I could come back and listen more intently later. This removed the burden of taking notes during the meeting, I could do that after. Instead, we were able to have a real connection. If you find yourself in the same position as I found myself, do the following: Shut up, listen, and empathize.

What did I learn? A few things:

  • There was a culture of experimentation. People were running experiments and generating insights. However, the way experiments were being run was heterogeneous. There was no global process for ideating, hypothesizing, designing, implementing, analyzing, and reporting on an experiment. People were running experiments differently between teams, and in some cases were running them differently within teams. If the process for running experiments differs person to person, there is no consistency. That needs to be solved.

  • Experimentation was seen as a roadblock for a lot of people. It took time, engineering effort, and would often not result in a clear winner which was frustrating. Not only that, but it took data some time to actually analyze these experiments. Data was considered a bottleneck because we would have to make a JIRA ticket to analyze the data, and then once we did analyze it stakeholders just had more questions so we’d have to repeat the process.

  • Additionally, there was just no guidance on how to run experiments. I remember one conversation so viscerally because a stakeholder told me “If I wanted to run an experiment, I wouldn’t know where to start”. We needed some basic information on why we experiment, how we experiment, and even when to experiment. When we couldn’t run an experiment, we needed some guidance on what other forms of learning we could use and when those were appropriate.

There were a few other pain points, but these were the biggest three in my opinion.

I don’t have much advice on empathizing except the following: You’re not going to solve someone’s problems with math. Do not go into these conversations looking for evidence that your sexy model is going to work. Instead, be curious, collect data, and try to see if problems you are hearing exist elsewhere. There is a whole art to this that I will not pretend to have mastered. In any case, do not go in thinking more math will save the day. It won’t.

After these interviews, I consolidated all my learning and presented them to other data scientists. The one thing I wish I added to my research was why previous attempts to solve these problems failed. There were at least two attempts to improve experimentation at Zapier and it would have been useful to get a more robust understanding of what they attempted to solve and how.


Determining The Quickest Way To Improve The Situation

At this point, I had a pretty good understanding of what people thought the problems with experimentation were. It was time for me to get some skin in the game and analyze a few experiments myself. After getting a few experiment reps in, the first thing I sought to introduce to improve the situation was an A/B Testing Brief Template.

My hope here was that this would be an evergreen place to learn about experiments being run. It includes all the information I would need to know about to help run an experiment. This includes, but is not limited to:

  • Phase of development (are we planning, are we developing, are we running, have we finished, etc)
  • Experiment start and end dates
  • Variant names and links to Figma
  • Hypotheses, metrics, and assumptions, etc
Here is the template brief I created. Its simple, and has a lot more than what is shown here, including some steps to estimate sample size.

It was a really simple document that tried to extract the most information with the least amount of pain. To my surprise, people loved it. I quickly saw teams adopt this template, and would even encourage other teams to use it too. This was the quickest way to improve the situation. It gave a small global process to think about what was important when planning an experiment. This helped pave the way to introducing more global processes and an eventual repository of experimentation information (more on that later maybe).

Of course, nothing is perfect. I was hoping teams would be able to self serve sample size requests with just a few simple steps. Some of these questions are easy to answer (like baseline performance) but others created a lot of questions. What should our MDE be? what should our significance level be? This section is rarely filled out by stakeholders and left to data scientists, which is too bad because it really should be something a stakeholder can do. I’ll talk about how this was solved later.

Sample size instructions

Additionally, the quality and depth of these briefs varied team to team. Some teams were very verbose, offering links and tons of details. Others would kind of slap something in there, maybe a few bullet points at the most. It isn’t my job to say what is good and what is bad, but certainly more info made it feel like the hypotheses were better researched and supported. No word yet on if more complete briefs yielded more successful experiments.

An Important Side About Frequentism vs Bayesianism

You may have noticed the mention of sample size and significance level, implying a Frequentist and NHST framework being adopted here. I’m well aware of the problems with this framework, so why did I allow us to keep using this approach?

  • First, keep in mind that 80% of A/B testing is not about what model you use. It is more about the stuff that comes prior to hitting Run on your R or python script. We needed to work on that 80%, and letting people use Frequentist models at least let them be familliar with the harder parts of AB testing. Keeping that familiarity and not introducing exotic models (again, math solution to a non-math problem) helped me build trust with those data scientists instead of having me be perceived as some egg-head completely detached from business needs. I’m not saying this would be the case for everyone, but I made the decision to leave some things the same as we fixed the more important pieces. P values were one of those things.

  • Not only that, but you need to remember that you are working with people on various different stages in their journey. You had some people like Statwonk, who really knows his stuff, and you had others who hadn’t taken stats in 10 years and didn’t understand that all statistical tests are forms of regression. Bayesian stats is hard, and once we were to move past Beta-Binomial models for conversion, I would have to hold many people’s hands as they fit and interpreted their Bayesian model. That doesn’t scale very well.

  • Not only that, but I’m no longer interested in teaching statistics to people. There are plenty of resources online that are leagues better than what I could put together, so why waste my breath? If you wanna learn Bayesian stats then go read McElreath or Gelman or Kruschke or whomever. Its just not something I’m interested in doing ever again.

That being said, there is no reason to stay Frequentist. If we feel there is a real advantage to being Bayesian, we can and have used Bayesian models. These are especially useful in revenue experiments where we statistical significance is not our concern and we should instead be using decision theory. So let’s do what we know for now and re-evaluate when all the supporting processes are working as we intend them too. Don’t be so infelxible as to think “Bayes or bust”, that isn’t helpful and lacks empathy.

An Experimentation Hub

The AB testing Brief was a really good move. It saw quick and wide spread adoption and helped Zapier begin to think about planning experiments in the same way by making everyone answer the same questions. This helped a lot with the variance problem.

There was, however, a scalability problem. While I was very keen to consult on as many A/B tests as I could, I very quickly learned that I was not a scalable solution; I could not realistically consult on experiment ideation, design, quality control, monitoring, analysis, and reporting for every experiment we would run. As soon as I got up to about 3 or 4 experiments simultaneously, I started to get really bogged down. Not only that, but I became a victim of my own success. Teams would have a good experience with an experiment I helped run and then come back to me, which lead to another good experience, and soon enough I became their go to person for experimentation. My goal was not to give myself too much work, my goal was to improve experimentation across the org and this workflow was actively working against me.

I’ve learned that as you become more senior, your job is less writing code and more writing documentation1. I began writing documentation on experimentation with the help of my VP. Much of the content was mine, but the idea was his and he helped me identify where and what to write about. I didn’t try to re-write textbooks, and neither should you if you’re thinking of doing the same. Instead, my goal was to capture the most salient information so that people would be able to self serve as much as possible, and when I was required to consult I could do the most important work only I could do and then dip. Examples include:

  • Here are the criteria you need in order to experiment. Here is the decision tree for if you should experiment. It isn’t perfect and that is OK.
  • Sample size? Nope, here is a calculator and what each of the fields mean. Let me know when you’ve got a number and I will check it.
  • Need to run a test of some sort? Here is the R function. Don’t know R? Here is the python function. Don’t code all that much? Here is a link to an online calculator.
  • Here are suggested checks you should do on your experiment data.

You get the jist. This worked…OK. What I found is that this “just in time” education is extremely prone to edge cases and that means I just created different problems to solve initially (which was good, that is progress).

2: For more on this, I would recommend reading “The Staff Engineer’s Path”.

Footnotes

  1. 1↩︎

  2. 1↩︎