A/B Testing

Does your manager decide based only on intuition and experience, not data? Do you like to experiment and analyze test results? Would you like data to validate your assumptions? Do you prefer to make data-driven decisions? If your answers to these questions are "YES", then A/B testing is relevant.

A/B Testing is a method to compare two (or more) variations of something and determine which one works better. In this method, users are randomly assigned to one of the variants. A statistical analysis is then performed to determine which variation performs better for a defined business goal.

While A/B testing is applied in many areas (life sciences, social sciences, advertising, etc.) this article focuses on application to web/mobile app design and user interfaces.

Discussion

• Why should I do A/B testing?

In the world of advertising, there's no way of telling if a campaign will be successful. Huge advertising budgets have been wasted because of bad guesses. What if we could instead test on a small sample and use that data to figure out user preferences for the entire population? This is exactly what A/B testing offers.

When applied to digital marketing, the design of web pages or mobile app screens, A/B testing gives data that can help us choose among alternatives. The goal is Conversion Rate Optimization (CRO). Conversion could mean a purchase, an email sign-up, account creation, survey completion or app download. One survey showed that 67% of participants chose A/B testing for CRO. It was also the most preferred method.

With A/B testing, you can make more out of your existing traffic. Since even small changes can lead to better conversion or lead generation, the return on investment (ROI) can sometimes be massive. The results help us understand user behaviour better and determine what impacts user experience. Testing validates assumptions. It removes guesswork and moves decisions from "we-think" to "we-know".

• What does it mean to call it "A/B"?

The term "A/B" comes from the fact that two groups are used for the test. The control group is shown the base version. The test group is shown the variation. These two groups are also called A and B respectively.

An alternative interpretation is that A and B refer to two variations that need to be tested to ascertain which one is better. In this case, the baseline without either A or B is the control group. We may even call this "A/B/C" testing.

In life sciences where A/B testing is used to measure the effectiveness of a drug, the test group is often called the treatment group.

• What's the recommended process for A/B testing?

The correct way to run an A/B test is to follow a scientific process:

• Construct Hypothesis: Observe user behaviour and brainstorm hypotheses. Prioritize what to test based on expected impact and difficulty of implementation.
• Create Variations: Decide on which feature you want to vary, such as changing the colour or size of the call-to-action button.
• Run Experiment: Randomly assign users to the test and control groups and collect relevant data for analysis.
• Analyze Results: Once the experiment is complete, you can analyze the results. Ask which variation performed better and whether there was a statistically significant difference.
• Repeat: Based on the results, if we want to test another variation, repeat the same process.
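
The "Run Experiment" step depends on assigning each user to a group at random, and on showing a returning user the same variation every time. One common way to get both is to hash the user ID. What follows is a minimal sketch under that assumption, not the method of any particular tool. The function name, the experiment name and the variant labels are all hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "test")):
    """Deterministically map a user to one variant of an experiment."""
    # Hash the experiment name together with the user ID, so the same user
    # can land in different groups across unrelated experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Read the hash as an integer and spread users evenly over the variants.
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# The same user always gets the same answer for a given experiment.
print(assign_variant("user-42", "cta-colour"))
```

Because the assignment is a pure function of the user ID and experiment name, no lookup table is needed. Passing more than two labels in variants covers tests with more than two variations.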

• Could you share some tips when planning for A/B testing?

Use your experience to know what variations could affect user behaviour for the relevant business goal. The variation could be in text such as the title or heading. It could be in terms of design such as the choice of colour. It could be a different layout. It could be number of fields in a form or when to send an email.

How long should you run the test? The common approach is to look at 95% statistical significance, standard deviation of the conversion rate, and sample size. As a rule of thumb, plan for at least 500 users per variation, or about 1000 for search or category tests. The number actually required depends on the baseline conversion rate and on the size of the effect you want to detect. Online calculators are also available to know how long to test: from Convert, from Optimizely.
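
To see how sample size depends on the baseline rate and the expected change, here is a minimal sketch of the textbook sample-size formula for comparing two proportions. It is an illustration only and not necessarily what the online calculators compute. The function names, the 80% power default and the traffic figures are assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline, expected, alpha=0.05, power=0.80):
    """Approximate users needed per variation to detect baseline -> expected."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    pooled = (baseline + expected) / 2
    numerator = (
        z_alpha * sqrt(2 * pooled * (1 - pooled))
        + z_beta * sqrt(baseline * (1 - baseline) + expected * (1 - expected))
    ) ** 2
    return ceil(numerator / (expected - baseline) ** 2)


def days_to_run(n_per_variation, variations, daily_visitors):
    """Days needed if all daily visitors are split evenly across variations."""
    return ceil(n_per_variation * variations / daily_visitors)


# Hypothetical test: 15% baseline conversion, hoping to reach 18%.
n = sample_size_per_variation(0.15, 0.18)
print(n, days_to_run(n, variations=2, daily_visitors=1000))
```

Halving the expected improvement roughly quadruples the required sample, which is why small sites struggle to detect small effects.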

An article on Smashing Magazine gives a list of Do's and Don'ts. Optimizely has shared which variations businesses are testing and which ones are effective.

Some common threats to the validity of A/B tests are explained in an article on Instapage blog.

• What are some myths associated with A/B testing?

Some things about A/B testing are misunderstood. We clarify the following:

• A/B testing is not the same as conversion rate optimization. A/B testing is part of optimization. While conversions are important, pay attention to secondary metrics. Analyze all steps of the funnel.
• To test everything is probably a waste of resources, particularly when your user base is not as large as that of Google or Facebook. Prioritize things you think can make a real difference.
• While changing one variable at a time is good advice, sometimes it makes sense to include multiple changes, such as on a badly designed page. Consider also how factors interact with one another.
• Your tests should be long enough to account for influencing factors, such as weekday vs weekend, or a big sale. Look for 95% statistical significance but don't rely on this alone to stop a test. Wait to reach a calculated sample size or a stable conversion rate.
• If you obsess over velocity and real-time optimization, you might make mistakes and lose out on valuable insights.
• A one-off test may give temporary results. Iterate your testing. Retest to rule out false positives.

• What do you mean by 95% statistical significance?

This is best explained with an example. Suppose the results of an A/B test show "Control: 15% (+/- 2.1%); Variation 18% (+/- 2.3%)". You might conclude that the actual conversion rate for the variation definitely lies between 15.7% and 20.3%, but this is a common error among those not familiar with statistics.

The correct interpretation is that if we were to perform the tests multiple times, 95% of the ranges will include the true conversion rate. In other words, in 5% of the tests, the true conversion rate will fall outside the range.

To put it differently for the above example, we are 95% confident that the true conversion rate is between 15.7% and 20.3%.
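
A range such as "18% (+/- 2.3%)" can be reproduced with the normal-approximation interval for a proportion. The sketch below is one simple way to compute it. The source does not state the visitor counts behind the example, so the 180 conversions out of 1000 visitors used here are hypothetical, as is the function name.

```python
from math import sqrt
from statistics import NormalDist


def conversion_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    # z is about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


low, high = conversion_interval(180, 1000)
print(f"{low:.1%} to {high:.1%}")
```

The margin shrinks with the square root of the number of visitors, so halving it takes about four times as much traffic.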

Ultimately, results from A/B testing must be statistically validated to rule out differences due to chance. Candidate tests include Fisher's Exact, Fisher's Monte Carlo and Chi-squared. The last is the one usually applied and is the simplest for sample sizes exceeding 10,000. Fisher's Monte Carlo, a computationally feasible approximation of Fisher's Exact, is more accurate for smaller samples.
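
As an illustration of the Chi-squared test, here is a minimal sketch for the two-variation case using only the standard library. It applies no continuity correction and assumes both variations have at least one conversion and one non-conversion overall. The visitor counts and the function name are hypothetical.

```python
from math import erfc, sqrt


def chi_squared_2x2(conv_a, total_a, conv_b, total_b):
    """Pearson chi-squared statistic and p-value for two conversion counts."""
    observed = [
        [conv_a, total_a - conv_a],  # variation A: converted, not converted
        [conv_b, total_b - conv_b],  # variation B: converted, not converted
    ]
    grand_total = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [conv_a + conv_b, grand_total - conv_a - conv_b]
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if both variations shared one conversion rate.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


statistic, p_value = chi_squared_2x2(150, 1000, 180, 1000)
print(f"chi2={statistic:.2f}, p={p_value:.3f}")
```

With these hypothetical counts, 15% versus 18% on 1000 visitors each, the p-value stays above 0.05. A three-point difference is therefore not yet significant at the 95% level for that sample size.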

• How is A/B Testing different from Multivariate Testing?

With A/B Testing, you're splitting the traffic between two variations of a single variable. What if we wish to test 10 different shades of blue for a button? This is called A/B/N Testing. A visiting user is shown one of these variations.

When a page is modified in more than one way, we call it Multivariate Testing. For example, in the control variation there's no sign-up form or embedded video. In variations, either one of these or both are included. Tools will generate all possible combinations of variations once the variable-level changes are specified. Such variations may be important to test the influence of one variable over another. For example, a sign-up form might generate more sign-ups if there's also a video on the page. Because there could be many variations to test, this is suitable only for sites with high traffic.
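
Generating all combinations from variable-level changes amounts to a Cartesian product. The sketch below shows the idea for the sign-up form and video example. The function and factor names are hypothetical and do not come from any specific tool.

```python
from itertools import product


def all_combinations(factors):
    """Every combination of levels, given a mapping of factor -> list of levels."""
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[name] for name in names))
    ]


factors = {"signup_form": [False, True], "video": [False, True]}
for combination in all_combinations(factors):
    print(combination)
```

Two on/off factors give four page versions, including the control with neither element. Each added factor multiplies the count, which is why traffic requirements grow so quickly.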

However, use multivariate testing with caution due to the danger of spurious correlations, which could be nothing more than random fluctuations.

• What are some tools available for A/B testing?

TrustRadius lists a number of A/B testing tools and mentions that the top ones are Kameleoon, AB Tasty, Evergage, Qubit, Dynamic Yield, and Visual Website Optimizer. Others include Optimizely, Monetate, and Maxymiser. Another list includes Unbounce, Kissmetrics, Optimizely, Maxymiser, and Webtrends Optimize. Another list of 30 tools includes Optimizely, Unbounce, Visual Website Optimizer, Usability Hub, Maxymiser, A/Bingo, Kissmetrics, Google Analytics, AB Tasty and Adobe Target.

Tools may be for small businesses, mid-sized businesses or large enterprises. While many are paid tools or even managed services, Google Optimize is a free tool. Google Optimize 360 is a paid enterprise-level tool. These Google tools replace the older Analytics Content Experiments.

Basic features of a tool include splitting traffic to different variations, calculating conversions, and measuring statistical likelihood that one variation will consistently outperform another. Additional features may include easy creation of variations, customization, and multi-page campaigns.
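
One way to estimate the likelihood that one variation outperforms another is a Bayesian simulation. The sketch below draws plausible conversion rates for each variation from a Beta distribution and counts how often B comes out ahead. This is an illustrative technique and not a description of how any tool listed above works. The uniform prior, the visitor counts and the function name are all assumptions.

```python
import random


def probability_b_beats_a(conv_a, total_a, conv_b, total_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    wins = 0
    for _ in range(draws):
        # Posterior for each rate is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + total_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + total_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


print(probability_b_beats_a(150, 1000, 180, 1000))
```

A value near 0.5 means the data cannot tell the variations apart. Values close to 1 favour B.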

• Are there situations where A/B testing is not suitable?

A/B testing cannot make a good design choice on its own. It can compare two or more design choices and help us decide which is better.

Incremental changes can improve design only up to a certain extent. In order to improve it radically, the designer should think more creatively and understand the user needs better. With A/B testing, we may find ourselves chasing "local maxima", the best outcome but within narrow constraints. There are therefore situations when we need to think outside the box and make radical changes. A/B testing is not the way to do it.

There are also some reasons against A/B testing. With A/B testing you're testing your own assumptions and therefore prone to individual bias. A test could take 3-4 weeks, valuable time that could be spent elsewhere. If your user base is not large, you'll not achieve statistical significance.

• When applied to websites, how does A/B testing affect SEO? Are there any best practices?

A/B testing or multivariate testing does not pose any inherent risk to the website's search rank if used properly. However, it's important to know the best practices:

• No cloaking: Serve the same content to users and search engines regardless of user-agent.
• Use rel="canonical": In A/B testing, alternative URLs should refer to the original URL as the canonical so that search engines will not index the alternative URLs.
• Use 302 Redirects instead of 301s: When using redirection to a variant URL, use 302 redirects since the variant will exist only for the experiment. 302 is a temporary redirect while 301 is permanent.
• Run experiments only as long as necessary: Once you have enough data, revert to the original site and remove variations.
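
The canonical and redirect practices can be sketched as two small helpers that build the pieces a server would send. This is a framework-independent illustration. The function names and example URLs are hypothetical.

```python
from html import escape
from http import HTTPStatus


def variant_redirect(variant_url):
    """Status code and headers for sending a visitor to a variant URL."""
    # 302 (Found) marks the redirect as temporary; 301 would mark it permanent.
    return HTTPStatus.FOUND.value, {"Location": variant_url}


def canonical_link(original_url):
    """<link> element for the <head> of a variant page, pointing at the original."""
    return f'<link rel="canonical" href="{escape(original_url, quote=True)}">'


print(variant_redirect("https://example.com/pricing-b"))
print(canonical_link("https://example.com/pricing"))
```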

• Could you share some examples where A/B testing has been used?

Here are a few examples:

• Humana's Banner Test: Humana tested two different banners on their homepage. A simpler design plus a stronger call-to-action (CTA) led to 433% more click-throughs. In the control variation, the CTA is not clear and in the test variation the CTA is clear.
• Hubspot's Lead Conversion: Using an inline form for CTA led to 71% better conversion compared to linking to a form on a separate page.
• MSN Home Page Search Box: The Overall Evaluation Criterion was the click-through rate for the search box and the popular searches links. A taller search box was compared with a bigger search button. However, no remarkable differences were found.
• Electronic Arts: For their SimCity 5 game, they tested conversions between direct promotional banner and no banner. Surprisingly, the latter drove 43.4% more purchases.
• Coderwall: A simple title change improved Coderwall's click-through rate (CTR) by 14.8% on Google Search. Over time, this resulted in improved ranking as well.

Milestones

1753

James Lind publishes "A Treatise of the Scurvy" in which he compares different treatments for scurvy using controlled trials. In well-planned clinical trials from the late 1740s, Lind adds different remedies to the otherwise identical diets of six pairs of men with symptoms of scurvy. He finds that a diet that includes oranges and lemons (vitamin C) cures scurvy.

1863

Austin Flint conducts a clinical trial that compares the efficacy of a dummy simulator to an active treatment. The dummy simulator, also called placebo, is really for moral effect. Flint's technique is called "Placebo and Active Treatment". Used from at least the 18th century, placebos avoid observer bias.

1920

Statistician and biologist Ronald Fisher discovers the most important principles behind A/B testing and randomized controlled experiments in general. He runs the first experiment in agriculture.

1923

In his book "Scientific Advertising", Claude Hopkins writes,

Almost any questions can be answered, cheaply, quickly and finally, by a test campaign. And that’s the way to answer them—not by arguments around a table. Go to the court of last resort—the buyers of your product.

1948

The Medical Research Council conducts the first ever randomized clinical trial to determine the efficacy of streptomycin in the treatment of pulmonary tuberculosis. Unlike prevailing practice, subjects are randomly allocated to one of two test groups, thus avoiding any bias. A/B testing is increasingly applied in clinical trials from the early 1950s.

1960

In the 1960s and 1970s, marketers start using A/B testing to evaluate direct response campaigns. An example hypothesis would be, "Would a postcard or a letter to target customers result in more sales?"

1990

With the coming of the World Wide Web in the 1990s, A/B testing goes online and becomes real time. It can also scale to a large number of experiments and participants. The underlying math hasn't changed though.

2000

Google engineers conduct A/B tests to find the optimal number of results to display per page. The test fails because of a performance bias: variants that display more results are slower. Google nevertheless learns many lessons about the impact of speed on user experience.

2007

Obama's presidential campaign has a new digital team advisor, Dan Siroker. He introduces A/B testing in the campaign with a goal of converting visitors to donors.

2009

A team at Google can't decide between two shades. So they test 41 different shades of blue to decide which color to use for advertisement links in Gmail. The company shows each shade of blue to 1% of users. A slightly purple shade of blue gets the maximum clicks, resulting in a $200 million boost in ad revenue.

2011

Google runs 7,000 A/B test experiments on its search algorithm. Like Google, Netflix, Amazon and eBay are keen adopters of A/B testing.


Cite As

Devopedia. 2022. "A/B Testing." Version 13, February 15. Accessed 2022-09-22. https://devopedia.org/a-b-testing
Contributed by
3 authors

Last updated on
2022-02-15 11:49:00
• Multivariate Testing
• Bandit Testing
• Discrete Choice Modelling
• Statistical Inference
• Spurious Correlation