How to Split Test Like A Boss


TL;DR

I argue (without data) that you should decide when to end your A/B tests by considering for each variant the amount you stand to lose if you go with that variant. If you see an acceptable deal, then take it and end the test! Broadly speaking, the presented deals tend to get better as more samples are collected. I believe that what is "acceptable" should depend on extrinsic factors such as how long you've been running the test, the excitement factor of the next A/B tests in the queue, and the relative maintainability of the variants.

The "amount you stand to lose" is the worst-case drop in conversion rate of the variant compared to the best of the other variants. I define "worst case" as there being a 95% probability that things are not this bad. Or 98%+ if you're conservative. As you collect more data, the "worst case" for each variant becomes more realistic. Once a variant's worst case is acceptable to you - either because even the worst case is a gain (great success!), or because it's a tolerably small loss and you've run out of patience (formerly known in sloppy circles as a null result) - you should end the test and go with that variant.

Beware: if you don't implement the correct statistics, then the above recommendations can be disastrous!

Introduction - Data at Freelancer

At Freelancer.com we do a lot of A/B testing. We've made a lot of progress in the usability of our site (and averted the occasional backwards step!) by testing ideas against each other and seeing how real people react. Our customers vote with their feet, and we can find out whether a new form does indeed make it easier to post a project, or we can check that our new matchmaking algorithm is actually helping real people to find the most suitable freelancers.

We're a fast-paced, data-driven company. In the last 8 hours we automatically generated over 3,000 graphs for our internal dashboard. Our data scientists take turns to comb through these and present the results in Daily Stats emails, which often set the discussion agenda for the day.

Bayesian analysis of our A/B tests is a natural fit for us. We can check test results as often as we like. We can end tests early if there's a strong result and move on to the next test. Or we can keep a test running a little longer than planned in the hope of reaching a significant result.

We can also directly answer questions like "What is the probability that variant B is better than variant A?". When you're making a business decision, this is a more directly useful question than "What is the probability that the two variants are equally effective and the observed results arose by chance?", which is the question usually asked in null hypothesis testing.

But enough jabs in the Bayesian vs. (non-sequential) Frequentist debate! You can retread that debate in lots of places. The point of this article is to introduce a new (to me, and maybe to you) question to ask when analysing A/B tests.

In this article I'm going to assume you're modeling a Bernoulli process, i.e. one where each sample either converts or does not convert. Your split test can involve two or more variants.

First I recap the most common question evaluated by Bayesian A/B tests, and point out that it doesn't handle null results gracefully. Then I discuss a straightforward method of resolving this problem ("Lower your standards"). Then I take it one step further, to evaluate quantities I subjectively believe to be more "boss" and more useful for making business decisions ("Making deals"). Following the hand-wavy discussion, I give mathematical formulations of each of the questions ("The math"), then conclude with an outline of the numerical implementation ("The code").

Ask the right question

It's super important to ask the right question of the data. Typically you formulate a single question in words, translate it into maths, and then ask the data for an answer. The Bayesian way is to ask as few separate questions as possible, and wherever possible to do comparisons and evaluations inside the probability clouds before finally extracting the answer as one small piece of information - rather than asking n questions, extracting n pieces of information that are detached from probabilities, and then doing further processing on them to arrive at the final result.

The most common question that split testers ask (including Google Experiments) seems to be, for each variant,

    "What is the probability that this variant is the best?"

If you know that there is going to be a best variant, then this is a great question to ask. You keep collecting data until one variant has >95% probability of being the best (or >98%, or >99.9%, or >100 - eps%), at which point you call the test done and proclaim that variant the winner.

But there isn't always going to be a single best variant. Most of the A/B tests I've been involved in have not ended with a clear winner. If it doesn't actually matter whether the button is deep blue (variant 1) or purple (control), then unless you're unlucky, asking this question will never let you declare a winner. How and when should you give up on such a test?

The problem is that the above question can't distinguish between two common cases:

  1. One variant is better than the others, but not enough data has been collected to be sure of this (and we should keep running the test)
  2. The two best variants are pretty much equally effective (and we should stop the test)

Amusingly, this distinction is closely related to the null-hypothesis question that the standard frequentist methods try to answer, but let's not go there.

Lower your standards

One solution is simply to lower your standards before you start the test. Could you live with a conversion rate that's possibly 2% less than optimal - whatever optimal happens to be? How about a 5% worst case drop? C'mon, 5% will allow you to finish the test so much faster... Okay, okay, let's go with 2%. We can now run a handicapped race, and the question becomes:

    "What is the probability that either this variant is the best or is within 2% of the best?"

or, equivalently,

    "If we dropped the other variants' conversion rates by 2%, then what would be the probability that this variant would be the best?"

Now you start a test and collect data until one of the variants has >95% probability of being the best or within 2% of the best. This question lets us distinguish between the two cases. For case 2, unless you're unlucky, in the long run one or both of the best variants will reach >95% probability. If multiple variants reach >95% probability of being 'nearly the best', which should you pick? Well according to the test, and your standards, it doesn't really matter!

Be aware that if you do lower your standards, then the consequences of being unlucky (which happens with 5% probability) are more severe than before. Therefore you may wish to increase the threshold from 95%.

Making deals

But by how much should you lower your standards? Ideally you'd like to discover that one of the variants is an outright winner - you don't want to accept a "possibly 2% worse" variant unless you have to. It may make me a sloppy decision maker, but my standards tend to start high while I'm excited and optimistic, then slowly drop as I get impatient with the experiment. (I didn't notice either of these effects on Wikipedia's list of cognitive biases!)

Rather than fixing the handicap margin and calculating a probability, as above, why not fix the probability and calculate the handicap margin? Ask

    "How much of a conversion rate boost would we have to give this variant in order for it to have 95% probability of being the best?"

Or equivalently,

    "In the worst case, how much do we stand to lose (or gain) if we go with this variant instead of the best of the rest. Here, 'worst case' means that there is 95% probability that the loss is not this bad."

When the "boost" or the "worst case change" is zero for a variant, then there is 95% probability that that variant is the best. When the "worst case" is a gain of X%, then you're 95% sure that this variant is at least X% better than the best of the other variants - and you can either stop the test, or keep going and try to rack up bigger bragging rights.

Note that we don't necessarily measure this change against the control's conversion rate - the control is just another variant. And as soon as you start the experiment, you will see that scrapping the test and returning to the control (which I'm assuming is the original version) involves a worst-case drop in conversion rate compared to the hypothetical best of the relatively-unknown variants.

Let's consider some case studies. Here are some results from a test we ran:

           sample_size  expectation_value  worst_case_rel
variation 1       1101           0.076294       -0.146379
variation 2       1074           0.068901       -0.299833
control           1092           0.061355       -0.385461

This is early in an A/B test, and as usual the worst cases are all negative (i.e. a drop). Note that the worst-case change is more pessimistic than the expected change: for the control, the expected change compared to the best of the rest (variation 1) is 0.061 / 0.076 - 1 = -20%, whereas the worst-case change is almost double that at -39%. And for variation 1, which is in the lead, the expected gain over the next-best variant is 0.076 / 0.068 - 1 = +11%, whereas the worst-case change is still negative. The fact that the worst cases are all so pessimistic indicates the high level of uncertainty, which is due to the small sample sizes.
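
To make the arithmetic explicit, here's a minimal sketch that reproduces the expected relative changes from the table above (the helper function and the hard-coded rates are purely illustrative):

# Expected relative change for a variant vs. the best of the rest,
# using the expectation values printed in the table above.
rates = {'variation 1': 0.076294, 'variation 2': 0.068901, 'control': 0.061355}

def expected_rel_change(variant):
    best_of_the_rest = max(v for k, v in rates.items() if k != variant)
    return rates[variant] / best_of_the_rest - 1

print(expected_rel_change('control'))      # roughly -0.20
print(expected_rel_change('variation 1'))  # roughly +0.11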

Should we stop the test now? No, not even if we were happy to cop a 14.6% drop in conversion rate - because looking at the population sizes we might not have enough samples to be able to trust the Bayesian methods yet.


Here's some results from another test:

           sample_size  expectation_value  worst_case_rel
variation 1      46119           0.165572        0.074664
control          51274           0.150349       -0.113701

We have a winner! The variant's worst case is a 7.5% increase in conversion rate. We expect it to be even higher: 0.166 / 0.15 - 1 = +10%. For maximum bragging rights, we could wait, collect more samples and hope that the 7.5% worst case increase approaches the expected 10% increase (and not vice-versa!), or we could stay humble and start the next test sooner. (Actually, in this particular case the difference in sample size between the variants was not intentional, and was due to a technical problem in which the variant was less likely to record failures. Oops. The need to detect circumstances like this is one of the reasons why we don't run Multi-Armed Bandits yet!)

It's also possible to calculate the worst case in absolute percentage points rather than as a relative percentage (both versions appear in the code below).

I find the 'worst case' to be really useful when making decisions, and communicating results to decision makers. You can say things like "But if you end the test now, we could lose up to 10% of our conversion rate!", which I think is more evocative than "But there's only 80% probability that this is the best variant!". You can also say things like "Look, we've been running this test for a month now and have hundreds of thousands of samples for each variant. We still haven't found a winner but the worst case for this variant is a 5% drop.", which I think is a more satisfying answer than "We still haven't found a winner; two of the variants are neck and neck in terms of expected conversion rate but both have <60% probability of being the best", or "There's a 95% probability that the two best variants differ by less than our predefined margin of caring, so we can't distinguish between the two, so let's go with the control".

Additionally, if you don't like the winner for some reason - maybe it involves code debt or the design clashes with the CEO's shirt - this method tells you how much you stand to lose by going with your favorite variant - both the expected loss and the worst-case loss. So if the MVP of the spiffy new redesign doesn't perform as well as the ugly-but-optimized original version, the expectation values alongside the 'worst case' numbers can help you decide whether you're willing to kill off the better performing original and lose a small amount of business while you optimize the new version.

The math

If you understand the maths behind the usual Bayesian methods that ask the originally posed question "What's the probability that this variant is the best?", then the subsequent questions follow pretty easily. If you don't understand the maths, it's very briefly described here - you only need to make it to Eq. 1! If you want more background on the mathematical properties, then these lecture notes may be helpful.

OK, so having read that you're now comfortable integrating joint probability distributions (JPDs) to determine probabilities? Great.

So to answer the original question, you calculate the probability that variant i is the best by integrating the JPD over the region $\mathcal{R}: x_i > \max(\{x_j \,\forall\, j \neq i\})$:

    P(i \text{ is the best}) = \int_{\mathcal{R}} P(\mathbf{x})~d\mathbf{x},

where x_i is the conversion rate of variant i, and x is a vector of all variants' conversion rates.
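
For the Bernoulli model with a beta prior used in "The code" section below, and assuming the variants are independent, the JPD factorizes into a product of beta posteriors - this is the standard beta-Bernoulli conjugacy result, written here to match the parameters used in the code (s_0 and f_0 are the prior pseudo-counts):

    P(\mathbf{x}) = \prod_i \mathrm{Beta}(x_i;\, s_i + s_0 + 1,\ f_i + f_0 + 1),

where s_i and f_i are the number of observed conversions and non-conversions for variant i.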

If you want to implement a "threshold of caring" of c (e.g. c = 1% relative), then for each variant i you simply integrate the JPD over the region $\mathcal{R}: x_i > \max(\{x_j \,\forall\, j \neq i\}) \times (1 + c)$:

    P(i \text{ is near enough}) = \int_{\mathcal{R}} P(\mathbf{x})~d\mathbf{x}.

If you want to find the "worst case change", then you again integrate over the regions $\mathcal{R}_i: x_i > \max(\{x_j \,\forall\, j \neq i\}) \times (1 + c_i)$, but this time you solve for the value of c_i for which

    0.95 = \int_{\mathcal{R}} P(\mathbf{x})~d\mathbf{x}.

This sounds messy, but in practice it's very easy to compute numerically.

Others' improvements on the original question

My friend Thomas Levi developed a method similar to the "threshold of caring" (as yet unpublished) for the control and a single variant. It gives you one of three answers each time you run it:

  1. Stop the test and declare the winner
  2. Stop the test and declare no result: the variant's conversion rate is within Y% either side of the control
  3. Don't stop the test yet!

I argue that answers 1 and 2 have similar consequences in terms of what you actually do - both of them call for you to stop the test and pick a version to push to 100% - so we may as well combine them.


Ben Tilly also suggests moving from "We're confident that we're not wrong" to "We're confident that we didn't screw up too badly. (And hopefully we're right.)" - and provides a sequential frequentist scheme.


Note that the "threshold of caring" in Chris Stucchio's approach (which inspired mine) does something different again. He computes the expectation value of the function max(-drop in conversion rate, 0):

    \int \max\left(0,\ \max(\{x_j \,\forall\, j \neq i\}) - x_i\right) P(\mathbf{x})~d\mathbf{x},

or equivalently, for the region $\mathcal{R}: x_i < \max(\{x_j \,\forall\, j \neq i\})$,

    \int_{\mathcal{R}} \left(\max(\{x_j \,\forall\, j \neq i\}) - x_i\right) P(\mathbf{x})~d\mathbf{x}.

Essentially, he asks

    "Consider the expectation value for the difference in conversion rates between the variants. What is the contribution to that number from the region of the JPD where the underdog is better?"

Or equivalently (I think?),

    "Given that you're making a mistake by choosing variant A, how much will that mistake cost you? Multiply this number by the probability that you did made a mistake by choosing variant A."

I think that the expected cost of a mistake is an interesting quantity, and the probability of making a mistake is an interesting quantity, but I don't intuitively understand why their product is of fundamental importance on its own. I don't know whether it was chosen as an ad-hoc combination of the two factors or to draw on some deep result.
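
For what it's worth, Stucchio's expected loss is easy to estimate from the same Monte Carlo draws used in "The code" section below. Here's a minimal sketch, assuming the draws DataFrame produced there (one column of sampled conversion rates per variant):

def calc_expected_loss(variant, draws):
    """Estimate the expected loss (Stucchio's quantity) for `variant`.

    This is the Monte Carlo estimate of the expectation of
    max(best_of_the_rest - x_variant, 0): the average shortfall,
    counting only the samples where `variant` loses.
    """
    best_of_the_rest = draws.drop(variant, axis=1).max(axis=1)
    shortfall = (best_of_the_rest - draws[variant]).clip(lower=0)
    return shortfall.mean()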


The code

If you have code that calculates $P(i \text{ is the best})$ by Monte-Carlo sampling, then it's very easy to extend it to calculate the other two quantities. Here we only discuss implementation for a Bernoulli model with a beta prior.

$P(i \text{ is the best})$ may be calculated by constructing a beta distribution for each of the variants' conversion rates, then taking a large number of samples from each of the variants' distributions:

import pandas as pd
from scipy.stats import beta

def draw_monte_carlo(data, prior, num_draws=100000):
    """Construct and sample from variants' beta distributions.

    Use the data and the prior to construct a beta distribution for
    the conversion rate of each of the variants. Sample from these
    distributions many times and return the results.

    INPUTS:
    - data (pd.DataFrame): Each row is the collected data for a
    variant, giving the number of successes (conversions) and
    failures (non-conversions)
    - prior (pd.Series or dict): Parameters defining a beta prior.
    - num_draws: The number of draws to take from the distributions.
    The more, the merrier.

    RETURNS:
    - draws (pandas.DataFrame): Each column is a variant's monte
    carlo draws.
    """
    draws = pd.DataFrame(columns=data.index)

    for variant_name, variant_data in data.iterrows():
        # Calculate parameters for beta distribution
        # Here we assume you're using the same prior for all variants
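        # The '+ 1' is the uniform Beta(1, 1) baseline, so prior
        # pseudo-counts of zero correspond to a flat prior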
        a = variant_data['successes'] + prior['successes'] + 1
        b = variant_data['failures'] + prior['failures'] + 1

        # Sample the beta distribution many times and store results
        draws[variant_name] = beta.rvs(a, b, size=num_draws)

    return draws

Compare the nth sample from each distribution and note which variant won. For each variant i, the fraction of samples for which it wins is approximately $P(i \text{ is the best})$.

def calc_prob_of_best(variant, draws):
    """Return the probability `variant` is the best.

    INPUTS:
    - variant (str): Name of the variant
    - draws (pandas.DataFrame): Each column is a variant's monte
        carlo draws.
    """
    best_of_the_rest = draws.drop(variant, axis=1).max(axis=1)
    win = draws[variant] > best_of_the_rest
    return win.sum() / len(win)

From here, calculating $P(i \text{ is near enough})$ for an *absolute* (percentage-point) threshold is extremely easy - simply subtract c from each of variant i's samples before you do the comparison. Effectively you're introducing a simple handicap to the game.

def calc_prob_of_near_enough(variant, draws, care_threshold):
    """Return the probability that `variant` is 'good enough'.

    INPUTS:
    - variant (str): Name of the variant
    - draws (pandas.DataFrame): Each column is a variant's monte
        carlo draws.
    - care_threshold (float): Defines 'good enough'. E.g.
        `care_threshold=-0.01` implies that a 1 percentage point
        drop is 'good enough'.
    """
    best_of_the_rest = draws.drop(variant, axis=1).max(axis=1)
    win = draws[variant] - care_threshold > best_of_the_rest
    return win.sum() / len(win)

Now you _could_ find c_i for each variant i by numerically solving the above function for the value that gives $P(i \text{ is near enough}) = 0.95$. But there's a much easier way: just calculate the differences between i's sampled conversion rates and the best of the others, then take the 5th percentile (the 0.05 quantile):

def calc_worst_case_abs(variant, draws, confidence=0.05):
    """Return the worst-case absolute conversion increase for `variant`,
    compared to the best of the other variants.

    INPUTS:
    - variant (str): Name of the variant
    - draws (pandas.DataFrame): Each column is a variant's monte
        carlo draws.
    - confidence (float): Defines the probability of the
        "worst case"; defaults to a 5% chance of the worst case
        eventuating.
    """
    best_of_the_rest = draws.drop(variant, axis=1).max(axis=1)
    differences = draws[variant] - best_of_the_rest

    return differences.quantile(confidence)

Finally, for the worst case relative change in conversion rate,

def calc_worst_case_rel(variant, draws, confidence=0.05):
    """Calculate the worst-case relative change in conversion rate.

    INPUTS:
    - variant (str): Name of the variant
    - draws (pandas.DataFrame): Each column is a variant's monte
        carlo draws.
    - confidence (float): Defines the probability of the
        "worst case"; defaults to a 5% chance of the worst case
        eventuating.
    """
    best_of_the_rest = draws.drop(variant, axis=1).max(axis=1)
    rel_change = draws[variant] / best_of_the_rest

    return rel_change.quantile(confidence) - 1
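
Putting it all together, here's a hypothetical usage sketch. The counts below are made up (loosely based on the first case study above) purely for illustration; in practice the data frame would be built from your experiment logs:

import pandas as pd

# Made-up counts for three variants, purely for illustration
data = pd.DataFrame(
    {'successes': [84, 74, 67], 'failures': [1017, 1000, 1025]},
    index=['variation 1', 'variation 2', 'control'],
)
prior = {'successes': 0, 'failures': 0}  # zero pseudo-counts, i.e. a flat prior

draws = draw_monte_carlo(data, prior)
for variant in data.index:
    print(variant,
          calc_prob_of_best(variant, draws),
          calc_worst_case_rel(variant, draws))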

Conclusion

I have described a Bayesian technique for analysing A/B tests that I think is very useful for informing decision making. Unlike the most common Bayesian methods, it can distinguish between a near-null result and a test with too few samples.

Acknowledgments

Thanks to Thomas Levi, Matt Gibson, Carlos Pacheco, Shamindra Shrotriya, and Richard Weiss for useful debates and discussions. None of them necessarily agrees with this article's contents!

About the Author

Felix Lawrence is a Data Scientist in the Vancouver office of Freelancer.com. His hobbies include skiing, craft beer, and applied math.
