| N (per group) | Treatment Conversions | Control Conversions | Relative Lift | p-value |
|---|---|---|---|---|
| 50,000 | 541 | 496 | 1.09 | 0.08 |
| 50,000 | 557 | 524 | 1.06 | 0.16 |
| 50,000 | 559 | 486 | 1.15 | 0.01 |
| 50,000 | 556 | 500 | 1.11 | 0.04 |
| 50,000 | 530 | 516 | 1.03 | 0.34 |
| 50,000 | 532 | 475 | 1.12 | 0.04 |
| 50,000 | 516 | 507 | 1.02 | 0.40 |
| 50,000 | 532 | 475 | 1.12 | 0.04 |
| 50,000 | 528 | 490 | 1.08 | 0.12 |
| 50,000 | 544 | 506 | 1.08 | 0.13 |
| 50,000 | 519 | 512 | 1.01 | 0.43 |
| 50,000 | 552 | 489 | 1.13 | 0.03 |
You’re part of a team at a company that is tasked with improving conversion on a web page. You’ve run a few experiments already with mixed results, and now it is time to set some goals for the next year. Here is a question:
Based on your team’s performance to date, how do you set realistic goals for incremental conversion?
Maybe your approach for your end-of-year targets would look something like this: figure out the typical lift a winning experiment produces, guess how many of next year’s experiments will win, and compound those lifts together.
It’s a good back-of-the-napkin approach to the problem. But if you come up short, is that necessarily a failure? Or could it be well within expectation?
This post is about forecasting how much a given team can move a metric within some time frame. You’re going to forecast the lift the team can generate given their past performance. The forecasting is Bayesian, but it assumes the team works within a frequentist framework.
Assumptions
Your team can run approximately one experiment per month, or 12 in a calendar year (though the method we develop can be extended to an arbitrary number of experiments per month). Let’s say you start experimenting on January 1 and will evaluate your performance on December 31. In addition to this, assume:
- All your experiments are A/B tests with two and only two groups: test and control.
- Your main metric is a conversion rate and the baseline value is 1%.
- Every intervention has an effect, though it may be small. The null is never true.
- Your site sees 100,000 unique users per month, and you split all 100,000 into two groups at random (50,000 per group).
- You measure lift in a relative sense (this is sometimes called relative risk in epidemiology).
Let’s make some additional assumptions about experiments:
- Your team is consistent. They don’t get better at thinking up interventions, so the effects they generate do not change over time, except for random variation.
- Experiment effects are independent of one another, so implementing one change does not alter the effect of the next experiment.
Scenario
Shown in the table at the top of this post are your results over the last year. Nice job: lots of wins, a few failures to reject the null, but overall very good. Using the estimated relative lifts from the experiments where you did reject the null, you managed to increase conversion by 80%. Now, your PM is asking you to shoot for 2x conversion this year.
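That 80% appears to come from compounding the estimated lifts of the five experiments that reached significance:

$$
1.15 \times 1.11 \times 1.12 \times 1.12 \times 1.13 \approx 1.81.
$$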
Is that reasonable[^1]? How likely are you to generate at least a 2x lift over 12 months, given your past performance? I mean, it’s only a little better than you did this past year, right? Luckily, you’re a good data scientist. Even though your team uses frequentism to evaluate their A/B tests, you are not beholden to one ideology over another. So, you decide to use a hierarchical Bayesian model to estimate what kinds of lifts your team is likely to generate in the future.
Hierarchical Model
Let $y_{C,j}$ and $y_{T,j}$ be the number of conversions in the control and treatment groups of experiment $j$, each arm observed on $n = 50{,}000$ users. Model each arm with a binomial likelihood,

$$
y_{C,j} \sim \operatorname{Binomial}(n,\, p_{C,j}), \qquad y_{T,j} \sim \operatorname{Binomial}\!\left(n,\, p_{C,j}\, e^{\delta_j}\right),
$$

so that $e^{\delta_j}$ is the relative lift for experiment $j$, and model the log lifts hierarchically,

$$
\delta_j \sim \operatorname{Normal}(\mu, \sigma).
$$

Here, $\mu$ and $\sigma$ describe the distribution of lifts your team’s interventions tend to produce: $\mu$ is the typical log lift and $\sigma$ is how much that lift varies from experiment to experiment.

Now, you just need to place priors on $\mu$, $\sigma$, and the control conversion rates $p_{C,j}$, and fit the model to last year’s twelve experiments.
Forecasting Lift
Once you fit your model, you can generate hypothetical relative lifts by sampling from the model. Let

$$
\tilde{\delta} \sim \operatorname{Normal}(\mu, \sigma)
$$

be a draw from the posterior predictive distribution of the log lift for a future experiment.

If your team were to implement an experiment which had a relative lift of $RR = e^{\tilde{\delta}}$[^2], conversion would change by exactly that factor. But your team works in a frequentist framework: an intervention only gets shipped if its A/B test rejects the null, and the probability that the test rejects is the power of the test at that lift, call it $\pi(RR)$.

Now for the fun part. Say you run $J = 12$ experiments next year, with posterior predictive lifts $RR_1, \dots, RR_J$. Let $I_j \sim \operatorname{Bernoulli}(\pi(RR_j))$ indicate whether experiment $j$’s test rejects the null, and the forecasted lift,

$$
L = \prod_{j=1}^{J} RR_j^{\,I_j},
$$

is the product of the lifts you actually get to keep.

Think this through. If you were to implement every intervention, your lift would simply be $\prod_j RR_j$. Because you only ship the interventions whose tests reach significance, each factor counts only when $I_j = 1$, which happens with probability equal to the test’s power at that lift.
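To get a feel for $\pi(RR)$, here is a rough normal-approximation power calculation; the 10% lift is just an illustrative value. With a 1% baseline, a relative lift of 1.10 means $p_C = 0.01$ and $p_T = 0.011$, so with 50,000 users per arm the standard error of the difference in proportions is about

$$
\sqrt{\frac{0.01 \times 0.99}{50{,}000} + \frac{0.011 \times 0.989}{50{,}000}} \approx 6.4 \times 10^{-4},
$$

and the power of a two-sided test at $\alpha = 0.05$ is roughly

$$
\Phi\!\left(\frac{0.011 - 0.01}{6.4 \times 10^{-4}} - 1.96\right) \approx \Phi(-0.40) \approx 0.34.
$$

Even a healthy 10% lift only reaches significance about a third of the time at these sample sizes, which is exactly why the forecast discounts each lift by the test’s power.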
Modelling
Luckily, all of the computation above, even the power calculation, can be done inside Stan (and you’re pretty good at writing Stan code[^3]).
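The actual Stan file lives in the github repo (see the footnote); below is only a sketch of what such a program might look like. The priors, the hard-coded 1% baseline, and the normal-approximation power formula are assumptions of this sketch, not necessarily what the real model uses.

```stan
data {
  int<lower=1> J;               // number of past experiments (12)
  int<lower=1> n;               // users per arm (50,000)
  array[J] int<lower=0> y_c;    // control conversions
  array[J] int<lower=0> y_t;    // treatment conversions
  int<lower=1> J_future;        // number of experiments to forecast (12)
}
parameters {
  vector<lower=0, upper=1>[J] p_c;  // control conversion rates
  vector[J] delta;                  // log relative lifts
  real mu;                          // typical log lift
  real<lower=0> sigma;              // experiment-to-experiment variation
}
model {
  // priors (assumed for this sketch)
  p_c ~ beta(10, 990);              // centred near the 1% baseline
  mu ~ normal(0, 0.2);
  sigma ~ exponential(10);
  // hierarchy over log lifts
  delta ~ normal(mu, sigma);
  // binomial likelihood for each arm; conversion rates are small,
  // so p_c .* exp(delta) stays comfortably below 1
  y_c ~ binomial(n, p_c);
  y_t ~ binomial(n, p_c .* exp(delta));
}
generated quantities {
  real forecast_lift = 1;   // cumulative lift over next year's experiments
  int hits_2x = 0;          // 1 if the forecasted lift is at least 2x
  for (j in 1:J_future) {
    // posterior predictive draw of a future experiment's log lift
    real d_new = normal_rng(mu, sigma);
    real p0 = 0.01;                  // assumed baseline conversion rate
    real p1 = p0 * exp(d_new);
    real diff = p1 - p0;
    // normal-approximation power of a two-sided z-test at alpha = 0.05
    real se = sqrt(p0 * (1 - p0) / n + p1 * (1 - p1) / n);
    real power = 1 - normal_cdf(1.96 * se | diff, se)
                   + normal_cdf(-1.96 * se | diff, se);
    // the lift is only realised if the test rejects the null
    if (bernoulli_rng(power) == 1) {
      forecast_lift *= exp(d_new);
    }
  }
  hits_2x = forecast_lift >= 2;
}
```

Averaging `hits_2x` over posterior draws then estimates the probability of hitting a 2x lift; the exact number will depend on the prior choices.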
Shown in Figure 1 is the forecasted lift, as compared to baseline, after the $j^{\text{th}}$ experiment.
Let’s pull out the forecasted lifts after the final experiment. Now, what about that goal of increasing conversion by 2x? Well, it isn’t looking good. Looks like there is only a 12% chance you meet or exceed the 2x goal. Could it be your performance last year was just extreme? The distribution of forecasted lifts is long-tailed; maybe you’re just going to regress to the mean. Sounds like a good time to push back on your boss and come prepared with data.
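That 12% figure, by the way, is just the share of posterior draws in which the forecasted lift clears the target: with $S$ draws $L^{(1)}, \dots, L^{(S)}$,

$$
\Pr(L \ge 2) \approx \frac{1}{S} \sum_{s=1}^{S} \mathbf{1}\!\left\{ L^{(s)} \ge 2 \right\}.
$$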
Conclusion
You’re clever and realized you could use a hierarchical model to simulate future experiment results and use those to forecast your team’s performance. Your boss’s goal of a 2x increase is nice insofar as it shows they have confidence in you and your team, but the model says it isn’t super achievable.
If 2x isn’t achievable, what is a better target? Or maybe, what is a better range of targets? I’m not sure; that isn’t the point of the post. The point was to equip you with a means of answering that question yourself, and I know you’re capable of answering it. I mean…look at all this cool modelling you did.
Post Script
Ok, breaking away from the narrative for a moment…this is a continuous approximation to a discrete process. We should simulate that process to see how real experiments would stack up against my forecast. I’ve gone ahead and actually simulated running the 12 tests and computed the resulting lift. Shown below is the forecasted lift versus the relative error as compared to simulation. I’ll let you come to your own conclusion about the quality of the approximation.
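Here is a sketch of what that discrete check might look like if you dropped it into the generated quantities block of the model above. The z-test at $\alpha = 0.05$, the 1% baseline, and the rule that only significant improvements get shipped are assumptions of the sketch, not necessarily a description of the simulation used for the figure.

```stan
generated quantities {
  // discrete simulation of next year's 12 experiments
  real simulated_lift = 1;
  for (j in 1:12) {
    real d_new = normal_rng(mu, sigma);   // posterior predictive log lift
    real p0 = 0.01;                       // assumed baseline conversion rate
    real p1 = p0 * exp(d_new);
    int yc = binomial_rng(n, p0);         // simulated control conversions
    int yt = binomial_rng(n, p1);         // simulated treatment conversions
    real phat_c = yc * 1.0 / n;
    real phat_t = yt * 1.0 / n;
    real se = sqrt(phat_c * (1 - phat_c) / n + phat_t * (1 - phat_t) / n);
    // ship only if the two-sided z-test rejects in the direction of an improvement
    if (phat_t - phat_c > 1.96 * se) {
      simulated_lift *= phat_t / phat_c;
    }
  }
}
```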
Footnotes
[^1]: Just go with it; it’s a work of fiction, who knows if it is reasonable.

[^2]: for relative risk, sorry, my epidemiology is showing.

[^3]: Speaking of Stan code, the Stan file is in the github repo for this post (see “Edit this page on github” on the right hand side).