What Is an A/B Test, and How Do You Perform It?

A/B testing is widespread among companies, and it is a useful statistical tool for product managers. How the test is performed and how its results are analyzed vary from company to company and depend on the chosen tool. The A/B test is appropriate for solving a single-variable problem: it suits cases where the product manager needs to examine how users react to one change in the system. The product manager should define two groups, a control group and a test group, and set one measured variable for them.

Many analytical tools allow us to perform A/B tests; they fall into two main groups — internal tools and off-the-shelf (COTS) products. The choice of tool is individual: it depends on the method chosen for analyzing the information as well as on the requirements of the test. Another consideration is the current position of the company. A company with limited resources that is trying to build its product quickly is usually better served by an external testing tool. On the other hand, a company that can allocate development resources should consider tools customized to its needs. Internal tools are the product of the company's tailor-made development; they suit the type of data the company collects, the quantities measured, and the tests performed. COTS tools differ in the accessibility of information, the statistical analysis they provide, the information sources they support, and how easily they integrate with the product.

This article will review the steps of the A/B test process according to the following method:

1) Setting up the experiment

  • Defining the experiment problem
  • Defining the measured variable
  • Defining variations
  • Defining the confidence interval and the statistical significance
  • Setting the sample size
  • Selecting the sample

2) Execution of the experiment

3) Analyzing the results

4) Research and Insights

Part 1 — Setting up the experiment

Defining the experiment problem

The product manager should define the experiment problem measurably and unambiguously, as a numerical value. For example, the problem could be a conversion-rate measurement, and the experiment question would be, “What percentage of users clicked the button in each experiment group?” It is essential to focus the experiment on a single, well-defined user action with a well-defined metric.

Defining the measured variable

The A/B test is a simple and convenient tool: it provides a “yes” or “no” answer to the experiment question, and no more. It is impossible to draw far-reaching conclusions from an A/B test. The most common examples of A/B tests concern the user interface. The following are examples of experimental variables related to user actions in the system UI: button color, font, micro-copy, sizes, and the location of components in the product.

Defining variations

The control group should receive the product with the existing variable, as in the routine. The product manager should then set 2–3 variations of that variable. It is crucial to change only one thing; otherwise, the test results may be skewed and ambiguous. For example: control group — blue button, variation 1 — red button, variation 2 — green button. It is also essential to keep the differences between the variations relatively small.

It is also possible to perform an A/B test on changes that are not in the user interface. For example, if the experiment question is, “How long does a process run in the system?” the control group should measure the duration with the current service, while the variations route users to alternative implementations — some users work with one micro-service and the rest with a second micro-service. The A/B test measures the difference in milliseconds between the two cases, and the PM can then decide which service saves time.
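Such a latency comparison can be analyzed much like a conversion test, only on means instead of proportions. A minimal sketch using only the Python standard library; the sample data and function name are hypothetical, and the normal approximation to the two-sample t-test is assumed to be adequate for large samples:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def latency_difference(samples_a, samples_b):
    """Difference in mean response time (ms) between two services, with a
    two-sided p-value from the normal approximation to the two-sample t-test
    (reasonable when both samples are large)."""
    m_a, m_b = mean(samples_a), mean(samples_b)
    se = sqrt(stdev(samples_a) ** 2 / len(samples_a) +
              stdev(samples_b) ** 2 / len(samples_b))
    z = (m_a - m_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return m_a - m_b, p_value  # positive difference => service B is faster

# Hypothetical latency samples (ms) collected from the two micro-services:
current = [100 + i % 10 for i in range(200)]
candidate = [90 + i % 10 for i in range(200)]
diff_ms, p = latency_difference(current, candidate)
```

If `p` falls below the chosen significance level, the measured difference in milliseconds is unlikely to be noise, and the PM can pick the faster service.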

Defining the confidence interval and the statistical significance

Simply stated, if the confidence level is 95%, then the statistical significance threshold is 5%. A 95% confidence level means that if we repeated the A/B test many times, about 95% of the computed confidence intervals would contain the true value — equivalently, there is only a 5% chance of declaring a difference that is actually due to randomness alone. The PM should decide on the confidence level while planning the experiment.
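For a conversion-rate question, statistical significance is commonly checked with a two-proportion z-test. A minimal sketch using only the Python standard library; the conversion counts below are hypothetical:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)           # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))        # two-sided p-value
    return z, p_value

# Hypothetical counts: control converted 200/2000, variation 250/2000.
z, p = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
# Reject the null hypothesis at the 5% significance level when p < 0.05.
```

The test answers exactly the “yes or no” question the A/B test is built for: is the observed difference larger than randomness alone would plausibly produce at the chosen significance level?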

Setting the sample size

The confidence level set when designing the experiment affects the sample size, while the sample size is bounded by the number of users in the system. For B2B products with ~50 users, an A/B test is challenging to perform because of sample-size limitations; in this case, the PM should consult the customers, show them the variations, and ask for their opinion. The sample size is a critical parameter of the experiment: it is what allows the PM to ensure that the sample is statistically representative of the population. A suitable sample size gives the experiment enough power to confirm or reject the hypothesis defined at its outset.
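For two conversion rates, the required sample size per group can be estimated with the standard normal-approximation formula. A sketch in plain Python; the baseline 10% and target 12% rates are hypothetical, and 80% power is a common default assumption:

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.80):
    """Users needed in each group to detect a change from p_base to p_target
    with a two-sided test, using the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for power = 0.80
    variance = p_base * (1 - p_base) + p_target * (1 - p_target)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_base - p_target) ** 2)

# Hypothetical goal: detect a lift from a 10% to a 12% conversion rate.
n = sample_size_per_group(0.10, 0.12)
```

Notice how sensitive the result is to the effect size: halving the expected lift roughly quadruples the users needed per group, which is exactly why small products struggle to run A/B tests.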

Selecting the sample

The natural choice is to select the sample randomly; however, this is not simple. The PM should be able to distinguish between different segment groups within the sample and needs to verify that the selection is random — that the tested users are not all drawn from one particular segment. Here are some examples of segments: only new users, returning customers, only users from the US, only users who came to the product through a particular channel, only Android users, only iPhone users. The sample must represent all user segments; otherwise, the test results will yield insights only about one or more particular sub-groups and not about the measured feature.

A sample ought to be a cross-section of the entire product user population. When the product manager selects a particular sample, he introduces a new experimental variable, the “sampling error,” represented as a range between two values. The groups in each sample should be well defined, as they affect the statistical significance. The user data should be structured so that the PM can distinguish between the different segments when analyzing the experiment results.
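One common way to make the assignment both random and reproducible is to hash a stable user ID into a bucket, so each user always sees the same variation and the hash spreads users uniformly across groups. A minimal sketch; the variant and experiment names are hypothetical:

```python
import hashlib

VARIANTS = ["control", "red_button", "green_button"]  # hypothetical variants

def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a user to a variant: the same user always lands
    in the same group, and SHA-256 spreads users uniformly over buckets."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(VARIANTS)
    return VARIANTS[bucket]

# The assignment is stable across sessions and devices:
assert assign_variant("user-42", "button_color") == \
       assign_variant("user-42", "button_color")
```

Seeding the hash with the experiment name means the same user can land in different groups across different experiments, which helps keep segment composition independent between tests.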

Part 2 — Execution of the experiment

It is recommended that the regular performance of A/B tests be assimilated into the company's corporate culture. The PM should ensure that all work teams are aware that A/B experiments are being conducted in the system, in a way that does not interfere with the development team's work routine. A practical method of implementing this is the “feature toggle” (or “feature flag”) technique, which makes it possible to “turn on” or “turn off” certain features for different groups as an integral part of the system.
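A minimal sketch of the feature-toggle idea; the flag name and in-memory store are hypothetical (real systems usually back this with a config service or database). Each flag records which groups it is enabled for, and the product checks the flag at render time:

```python
# Hypothetical flag store: each flag lists the groups it is enabled for.
FLAGS = {"new_checkout_button": {"enabled_for": {"test"}}}

def is_enabled(flag: str, user_group: str) -> bool:
    """True when the flag exists and is switched on for this user's group."""
    return user_group in FLAGS.get(flag, {}).get("enabled_for", set())

if is_enabled("new_checkout_button", user_group="test"):
    ...  # render the variation
else:
    ...  # render the control
```

Because toggling is a data change rather than a code change, the PM can start, stop, or roll back an experiment without touching the development team's release schedule.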

Part 3 — Analyzing the results

Reasons for failure of an experiment

  1. A multiplicity of variations relative to one control group — produces a statistical problem in which the experiment results are misled by one of the variables. A particular variation is measured as successful although, in practice, it is not: for example, with ten variations tested against one control at a 5% significance level, the chance that at least one variation appears to “win” purely by chance is about 40%. When performing an A/B test with multiple variables, it is necessary to significantly increase user traffic and the test duration, in correlation with the sample size.
  2. False-positive results — known statistically as a “type I error.” A famous example that illustrates the principle is the “41 shades of blue” test performed by Google. In 2009, a team at Google was challenged to decide which shade of blue encouraged users to click on an advertisement, so they measured 41 shades of blue. Statistically, at a significance level of 5%, this produced an 88% chance that at least one shade would appear to win by chance alone, calculated as 1 − (1 − 0.05)⁴¹ ≈ 0.88.

Tip: choose only 1–3 variations per control group. Above this number, there is a risk that the experiment results will be biased.
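The risk behind this tip can be quantified: with k independent comparisons at significance level α, the probability of at least one false positive is 1 − (1 − α)^k. A one-line sketch:

```python
def familywise_error_rate(num_comparisons: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive across independent
    comparisons, each run at significance level alpha."""
    return 1 - (1 - alpha) ** num_comparisons

familywise_error_rate(41)  # ~0.88 — the "41 shades of blue" case
familywise_error_rate(3)   # ~0.14 — three variations is already noticeable
```

Even three variations raise the false-positive risk from 5% to roughly 14%, which is why the number of variations per control group should stay small.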

3. When the A/B test runs for a long time, two kinds of factors can introduce “noise” into the groups' measurements — external and internal.

  • External — affects the measurements of both groups, but not necessarily in all cases. Examples of external biases that change user behavior: holidays, global sales, sports events, natural events, and more.
  • Internal — when the user's group changes during the A/B test: say the user first uses the group A system but, after a while, uses the group B system — for example, because the user's browser deleted its cookies. This adds an “internal” bias to the measurements. As a result, the experimental results are skewed, undermining the PM's effort to pinpoint which variation induced the user to convert or to abandon the product.

4. Changing the user traffic while A/B tests are in progress, or adding new user sources to the product during the experiment.

5. Simpson's paradox — a statistical paradox named after Edward H. Simpson, who described it in 1951, although similar effects were noted by the statistician Karl Pearson as early as 1899. The paradox is that a trend which appears in each subgroup of the data can reverse when the subgroups are combined, so the average result can suggest the opposite conclusion from a segmented view of the same test. If the PM analyzes results in the middle of the experiment — for example, after one week of a two-week test — he may find the opposite of the average result he will receive at the end. The way to avoid the paradox is to analyze the average results over the entire pre-defined examination period and for each sample group.
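A tiny numeric illustration of the paradox, with hypothetical conversion data: variation B beats A within every segment, yet A wins on the pooled average, because the segments are mixed unevenly between the two groups:

```python
# Hypothetical data: (conversions, users) per group and per segment.
data = {
    "A": {"desktop": (160, 800), "mobile": (20, 200)},   # 20% and 10%
    "B": {"desktop": (44, 200),  "mobile": (96, 800)},   # 22% and 12%
}

def rate(conversions, users):
    return conversions / users

for segment in ("desktop", "mobile"):
    a, b = data["A"][segment], data["B"][segment]
    print(segment, rate(*a), rate(*b))   # B wins within each segment

# Pooled rates: A = 180/1000 = 0.18, B = 140/1000 = 0.14 — A wins overall.
overall = {g: rate(sum(c for c, _ in segs.values()),
                   sum(u for _, u in segs.values()))
           for g, segs in data.items()}
print(overall)
```

The reversal happens because group B is weighted toward the low-converting mobile segment; analyzing per segment, as the article recommends, exposes it immediately.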

Part 4 — Research and Insights

If an experiment fails, the PM needs to investigate the failure to avoid repeating the same mistakes in a similar test. It is essential to get to the root of the problem, understand the errors that caused the experiment to fail, summarize the significant insights, and send them to the work teams and stakeholders. The PM must then perform the test again to get an answer to the experiment question.

In conclusion, the A/B test is a tactical rather than a strategic tool that can support a decision in the product development process. It is not a tool for optimizing components in systems or optimizing processes in a product; for that, there are other tools. The product manager often makes decisions based on intuition on the one hand and long-term strategic planning on the other. The A/B test does not establish full correlation or causality, but it can direct some aspect of the product in the right direction.

Written by Maayan Galperin
