Avoiding Bias in Offline A/B Testing


When developing an A/B test, there are many unintentional errors that can bias the results, such as failing to take other factors into account or introducing an arrival rate bias, dilution bias, or publication bias into the test. Perhaps most insidious are the biases that arise from desiring a particular outcome, which can lead to stopping the test early (probably the most common error in A/B testing), running the test repeatedly, or dividing the test group into smaller and smaller sub-groups until the desired outcome is reached.


  • A subtle form of bias is the belief that the A and the B are the only factors that might change in the course of testing.
  • For example, if one were to A/B test two kinds of bait, and test one in the morning and the other in the afternoon, one bait is likely to outperform the other even if the fish have no actual preference.
  • The so-called p-value estimates how likely it is that results at least as extreme as those observed would arise from random chance alone; in science, a p-value of 0.05 or less (i.e., a 5% or smaller chance that the results are due to randomness) is the gold standard for concluding that an experiment's results were due to more than random chance.
  • Keeping random noise from biasing, for example, the A/B test of a web page requires at least 25,000 visitors; however, there is also a danger in letting an A/B test go on too long in cases where fewer subjects are available (e.g., a low-volume, high-margin business).
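The p-value for a simple A/B comparison can be sketched with a pooled two-proportion z-test. This is an illustrative calculation, not taken from the sources above; the conversion counts and the `two_proportion_p_value` name are invented for the example.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates,
    using a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal survival function.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical example: 500 vs. 565 conversions out of 12,500 visitors per arm.
p = two_proportion_p_value(500, 12500, 565, 12500)  # falls just below 0.05
```

With identical conversion counts in both arms, the function returns a p-value of 1.0, i.e., no evidence of any difference.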


  • P-hacking occurs when the desire to get a statistically significant result overrides the desire to get an honest result.
  • The most common form of p-hacking is to simply run the test over and over again until statistical significance is achieved.
  • Another variant is to segment the data into smaller and smaller slices until the desired result is found in one of them.
  • Yet another, often unintentional, p-hack involves watching the data in real time and declaring victory whenever it happens to cross the desired threshold. "Statistical significance analysis is based on the assumption that your sample size was fixed in advance"; lacking a fixed end point "is even more dire as you didn’t even have the chance to get it right."
  • "Peeking" at the data twice doubles the chance of error in interpreting the results, peeking 5 times triples it, and peeking 10 times quadruples it.
  • Despite this, 57% of testers will end an experiment when it appears to confirm their hypothesis.
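The inflation from peeking can be demonstrated with a toy simulation of A/A tests (identical conversion rates in both arms), where any "significant" result is by definition a false positive. The rates, sample sizes, and function names below are invented for illustration, not drawn from the sources above.

```python
import math
import random

def z_stat(conv_a, n_a, conv_b, n_b):
    """Absolute z-statistic of a pooled two-proportion test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 0.0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return abs(conv_b / n_b - conv_a / n_a) / se

def false_positive_rate(peeks, trials=2000, n=600, base_rate=0.05):
    """Run A/A experiments (both arms convert at the same 5% rate) and count
    how often at least one of `peeks` interim looks crosses z > 1.96."""
    step = n // peeks
    hits = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for look in range(1, peeks + 1):
            for _ in range(step):
                conv_a += random.random() < base_rate
                conv_b += random.random() < base_rate
            if z_stat(conv_a, look * step, conv_b, look * step) > 1.96:
                hits += 1
                break
    return hits / trials

random.seed(7)
one_look = false_positive_rate(peeks=1)    # stays near the nominal 5%
ten_looks = false_positive_rate(peeks=10)  # substantially inflated
```

In typical runs, a single look at the end keeps the false-positive rate near the nominal 5%, while ten interim looks push it several times higher, even though there is never any real effect to find.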


  • Arrival rate bias is an unintentional bias in the results which is introduced when the participants are selected for the A or B group based on when they "arrive."
  • In practice, existing customers may be assigned to the A or B group randomly in advance, but since A/B testing works best with groups of roughly equal size, which customers are "bucketed" is often determined by when they "arrive," e.g., when they download a new app or when they make a particular purchase.
  • Consequently, customers who interact with the company more often are more likely to be selected, weighting the experiment toward the most active customers, who may not be representative of the customer base as a whole.
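One common way to avoid arrival rate bias is to derive the bucket deterministically from a stable customer ID rather than from arrival order, so a customer's assignment never depends on when or how often they show up. A minimal sketch (the hashing scheme, IDs, and function name are illustrative assumptions, not drawn from the sources):

```python
import hashlib

def assign_bucket(user_id: str, experiment: str, b_fraction: float = 0.5) -> str:
    """Deterministically map a stable customer ID to 'A' or 'B', so bucket
    membership does not depend on when or how often the customer arrives."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    # The first 8 hex digits give a roughly uniform value in [0, 1).
    value = int(digest[:8], 16) / 0x100000000
    return "B" if value < b_fraction else "A"

# The same customer lands in the same bucket on every visit.
assert assign_bucket("cust-10293", "new-checkout") == assign_bucket("cust-10293", "new-checkout")
```

Salting the hash with the experiment name keeps assignments independent across experiments, and the groups come out roughly equal in size without bucketing by arrival.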


  • Conversely, including all customers in a bucket whether or not they trigger the A/B change can also bias the experiment.
  • For example, if all customers of an international company are bucketed, but only those in the US are targeted for the change, the company will inadvertently compare just US customers in the B group against a control (A) group that mixes US and global customers, again biasing the experiment.
  • This is also known as "bucket imbalance."
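The usual remedy is a "triggered" analysis: restrict both arms to the customers who were actually eligible to see the change before comparing conversion rates. A toy sketch with invented records (the field layout and names are assumptions for illustration):

```python
# Toy records invented for illustration: (user_id, bucket, country, converted).
events = [
    ("u1", "A", "US", True),
    ("u2", "B", "US", False),
    ("u3", "A", "DE", True),   # non-US: could never see the change
    ("u4", "B", "US", True),
    ("u5", "A", "US", False),
]

def conversion_rates(records, eligible=lambda rec: True):
    """Per-bucket conversion rates, optionally restricted to customers who
    were actually eligible to see the tested change."""
    totals = {"A": [0, 0], "B": [0, 0]}   # bucket -> [conversions, users]
    for rec in records:
        if not eligible(rec):
            continue
        _, bucket, _, converted = rec
        totals[bucket][0] += converted
        totals[bucket][1] += 1
    return {b: (c / n if n else 0.0) for b, (c, n) in totals.items()}

# Diluted: the A group still contains the non-US customer.
diluted = conversion_rates(events)
# Apples to apples: restrict both arms to the targeted (US) customers.
triggered = conversion_rates(events, eligible=lambda rec: rec[2] == "US")
```

With this toy data, the diluted comparison makes A look better than B (2/3 vs. 1/2), while the triggered comparison correctly shows no difference (1/2 vs. 1/2).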



Note: There was some question about how to interpret the phrase "A/B offline testing." Since our initial research indicated that most or all biases that might occur in A/B testing in an offline environment could also occur in an online environment, we understand this criterion to be inclusive rather than exclusive: that is, the scope of a given bias must include the ability to occur in an offline environment but need not exclude online applicability.

Initial research found no shortage of sources discussing A/B testing and the biases and other errors that can occur in the process. The vast majority were, naturally, focused on A/B testing of website performance and online sales conversions, and we rejected many that simply did not seem to have much application outside the internet. Others went deep into the math behind calculating statistical significance, burying any useful information about biases in highly technical jargon. We rejected these as too arcane to be practically useful. Consequently, most of our best sources proved to be written by marketers for marketers.

There are very few statistics available on the effect of biases in A/B testing. We hypothesize that collecting such statistics would be extremely difficult, since businesses have little motivation to share that much detail about their internal operations. While a consulting firm might be in a position to collate the bias problems faced by its clients, non-disclosure agreements might make publishing the results rather problematic. Consequently, far more of our findings detail how one might create a bias in an A/B test than provide statistics on how often a given error occurs.