Introduction to A/B Testing
A/B Testing (also known as experimentation or bucket testing) enables product teams to gain insights into PayPal users, their behavior, and their interactions with PayPal product pages, product flows, and products in general (with user permission). It is critical to arrive at product decisions quickly in a data-driven manner; A/B testing facilitates fast iteration and turnaround, speeding up innovation and product development cycles.
By tracking behavior data from different versions of product pages or apps, we can analyze the data to learn which variation maximizes the number of transactions, revenue, conversions, page views, or any other key metric. A/B testing takes the guesswork out of product development and enables data-informed decisions by shifting business conversations from “we think” to “we observe”. By measuring the impact of changes on key metrics, the product engineering team can ensure that every change produces positive results.
Next Generation PayPal Experimentation Platform
To this aim, we have built a next-generation PayPal experimentation platform to simplify and streamline all experimentation operations, and a new PayPal Experimentation Analytics platform to quickly learn insights from experiments and make data-driven decisions.
In this blog, we’ll focus on the effort of building the new PayPal Experimentation Analytics platform, specifically on how we leverage the Druid data engine, and how we extend it to calculate statistical significance.
The new Experimentation Analytics dashboard provides an overview that lets teams easily compare all relevant success metrics for each experiment variation.
In addition, it enables easy zoom-ins to see the details of traffic splits along with their metrics. Most importantly, it can calculate whether the observed metric differences are simply due to chance, or whether they are actually statistically significant.
Figure 1: New PayPal Experimentation Dashboard
Next Generation PayPal Experimentation Analytics Platform
The new PayPal Experimentation Analytics Platform is built to power this dashboard and realize the above goals. At a very high level, there are two stages of processing:
 Data Enrichment – We first enrich experimentation tracking data with experiment and treatment metadata using Apache Spark. In doing so, data sets can be loaded into a data engine for meaningful comparisons, i.e., comparing corresponding data sets of different treatments within the same selected time periods.
 Data Indexing – The new PayPal Experimentation Analytics Platform is built on top of the Druid data engine to slice and dice very-large-scale multidimensional data sets fast enough to power interactive queries. It utilizes Hadoop-based batch ingestion to take advantage of the PayPal Hadoop cluster when indexing data into Druid.
In addition, the PayPal Experimentation Analytics Platform is built with the following forward-looking designs in mind:
 Real time – The platform can be extended to process streaming data in real time if needed; for example, the Spark code can be quickly modified for this purpose, and Druid has built-in real-time nodes for real-time data ingestion. In fact, our vision is to unify streaming and batch processing without duplicating processing capacity or sacrificing correctness, and Druid fits into this architecture naturally.
 Cloud ready – As the Druid deep storage component is pluggable, Druid can utilize cloud storage, for example Google Cloud Storage or AWS S3, when migrating to the cloud.
Hypothesis Testing
Simply speaking, A/B testing is a way of comparing two versions of a single variable, typically by testing a user’s response to version A against version B and determining which version is more effective.
In the field of statistics, A/B testing is essentially a hypothesis test comparing two population proportions. A hypothesis test is a statistical test used to determine whether there is enough evidence from our observations to infer that a certain hypothesis is true.
In other words, we are using hypothesis testing to tell us whether the differences between treatment A and treatment B actually affect user behavior, or whether the observed variations are due to random chance.
For hypothesis testing, we first start with a null hypothesis. We then hypothesize the sampling distribution and select a test statistic. With these in place, we can compute the test statistic and calculate the p-value to determine whether we should reject the null hypothesis (see below).
Null Hypothesis
For A/B testing, the null hypothesis is usually that the observations from the control group are no different from the observations from the treatment group. This is similar to the presumption of innocence in our legal system, where we begin a trial by assuming the defendant is innocent.
The null hypothesis is generally assumed to be true until the evidence indicates otherwise. In statistics, it is often denoted as H_{0}.
In order to evaluate evidence, we need to select an appropriate distribution for the test.
Z-Test
Due to PayPal’s very large user base, the central limit theorem can be applied: the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. As a result, we opt to use the Z-test.
Z-Score
From https://en.wikipedia.org/wiki/Z-test, the z-score is defined as the distance from the sample mean to the population mean in units of the standard error:
z = (M − μ) / SE
where M is the sample mean, μ is the mean of the population, and SE is the standard error of the sample mean.
The z-score is used as a normalized measure of deviation from the center.
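As a quick illustration of the formula above, here is a minimal sketch in Java (the helper name and the sample numbers are hypothetical, chosen only to show the arithmetic):

```java
public class ZScoreExample {
    // z = (M - mu) / SE, as defined above
    static double zScore(double sampleMean, double populationMean, double standardError) {
        return (sampleMean - populationMean) / standardError;
    }

    public static void main(String[] args) {
        // e.g. sample mean 0.52, hypothesized population mean 0.50, SE 0.01:
        // the sample mean sits roughly two standard errors above the center
        System.out.println(zScore(0.52, 0.50, 0.01));
    }
}
```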
Two-Sample Z-Test
We are performing a hypothesis test to compare two population proportions P_{1} and P_{2},
where the null hypothesis H_{0} is:
P_{1} (control metric) = P_{2} (treatment metric)
The metrics can be any relevant metrics, for example, conversion rate, number of transactions, etc.
Please see https://www.isixsigma.com/tools-templates/hypothesis-testing/making-sense-two-proportions-test/ and substitute the null hypothesis P_{1} − P_{2} = 0.
Assuming the null hypothesis H_{0} is true
=> P_{1} – P_{2} = 0
i.e., there is no difference between the two population proportions.
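Under H_{0}: P_{1} − P_{2} = 0, a standard test statistic for two proportions uses the pooled proportion to estimate the standard error. A minimal sketch in Java (method and variable names are illustrative, not the Druid implementation; the conversion counts are made up):

```java
public class TwoSampleZTest {
    // Two-sample z-score for comparing two proportions, using the pooled
    // proportion estimate under H0: P1 - P2 = 0.
    static double zScore(double successes1, double size1,
                         double successes2, double size2) {
        double p1 = successes1 / size1;            // sample proportion 1
        double p2 = successes2 / size2;            // sample proportion 2
        double pooled = (successes1 + successes2) / (size1 + size2);
        double se = Math.sqrt(pooled * (1 - pooled) * (1 / size1 + 1 / size2));
        return (p1 - p2) / se;
    }

    public static void main(String[] args) {
        // e.g. control: 520 conversions out of 10,000; treatment: 570 out of 10,000.
        // The z-score is negative because the control proportion is lower.
        System.out.println(zScore(520, 10000, 570, 10000));
    }
}
```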
P-Value
The p-value is the probability, under the standard normal distribution, of observing a value as extreme as, or more extreme than, the observed z-statistic. Please refer to http://www.stat.yale.edu/Courses/1997-98/101/sigtest.htm for more explanation.
Per https://en.wikipedia.org/wiki/P-value, the p-value is the probability, assuming the null hypothesis is true, of obtaining a result where the test statistic is as extreme as, or more extreme than, what was actually observed.
Figure 2: P-value, from https://en.wikipedia.org/wiki/P-value
P-Value of a Two-Tailed Z-Test
In the context of our A/B testing, we are interested in whether there is a difference between the observations from the control and the observations from the treatment. Therefore, we calculate the p-value of a two-tailed Z-test as 2 * P(Z ≥ |z-score|), under the assumption that the null hypothesis is true.
From Mind on Statistics, 3rd Edition:
Our test data can be hypothesized to follow the normal distribution per the central limit theorem. Since the normal distribution is symmetric, and we standardize it to the standard normal distribution:
=> p-value for a two-tailed test = 2 * (area to the right of |z|)
Recall that the p-value measures the plausibility of H_{0}: the smaller the p-value, the stronger the evidence against H_{0}.
As an example, a p-value of 0.0128 (with alpha = 0.05) reveals that it would be unlikely to observe such an extreme test statistic in the direction of H_{a} if H_{0} were true. Therefore, the initial assumption that H_{0} is true is likely incorrect, so we reject the null hypothesis H_{0} in favor of H_{a}; i.e., the difference between control and treatment is statistically significant.
In the case of a large p-value (say 20%), we would say that we failed to reject H_{0}, similar to failing to disprove innocence.
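The two-tailed p-value = 2 * P(Z ≥ |z|) can be computed from the standard normal upper tail. The Java standard library has no normal CDF, so this sketch uses the well-known Zelen and Severo rational approximation (Abramowitz and Stegun 26.2.17, accurate to about 1e-7); the method names are our own, not from the Druid code:

```java
public class TwoTailedPValue {
    // Standard normal upper-tail probability P(Z >= z) for z >= 0,
    // via the Zelen & Severo rational approximation.
    static double upperTail(double z) {
        double t = 1.0 / (1.0 + 0.2316419 * z);
        double poly = t * (0.319381530 + t * (-0.356563782
                    + t * (1.781477937 + t * (-1.821255978 + t * 1.330274429))));
        double pdf = Math.exp(-z * z / 2.0) / Math.sqrt(2.0 * Math.PI);
        return pdf * poly;
    }

    // Two-tailed p-value = 2 * P(Z >= |z|)
    static double pValue(double zScore) {
        return 2.0 * upperTail(Math.abs(zScore));
    }

    public static void main(String[] args) {
        // z = 1.96 corresponds to the familiar p ~= 0.05 threshold
        System.out.println(pValue(1.96));
        // z ~= 2.49 gives roughly the p-value of 0.0128 discussed above
        System.out.println(pValue(2.49));
    }
}
```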
Druid
Druid is an open-source, high-performance, column-oriented, distributed real-time analytics data engine that we use to power dashboards and interactive queries for PayPal experimentation analytics.
It is horizontally scalable due to its shared-nothing architecture, and its column-oriented storage format makes it suitable for OLAP queries. Druid uses segments as its fundamental storage units. Because Druid segments are immutable, query processing is simplified and scaling out is easy. In addition, the Druid segment size is configurable to tune for optimal query performance. Druid uses ZooKeeper as its coordination service to make itself resilient. Please refer to the Druid paper in SIGMOD 2014 for details about its architecture.
As powerful as Druid is, it can’t tell you whether the differences between observations from group A and group B are due to chance, or whether there is statistical significance behind the data. That’s why we are contributing new operators to facilitate making such decisions.
Druid Post Aggregators
A/B testing boils down to comparing two population proportions with independent samples. As described before, we use the Z-test as our test statistic, since it is safe to assume the distribution is approximately normal, based on the central limit theorem, for our experimentation scenarios.
Druid is built with extensions in mind. To this aim, we are contributing new post aggregators to the druid-stats extension:
zscore2sample
 Calculates the z-score using a two-sample Z-test, while converting binary variables (e.g. success or not) to continuous variables (e.g. conversion rate).
The grammar for a two-sample z-test post aggregation is:
postAggregation : {
  "type": "zscore2sample",
  "name": <output_name>,
  "fields": [<count 1 (post_aggregator1)>, <sample size 1 (post_aggregator2)>,
             <count 2 (post_aggregator3)>, <sample size 2 (post_aggregator4)>]
}
For example, the fields can be
{
  "fields": [<success count of sample 1>, <sample size of sample 1>, <success count of sample 2>, <sample size of sample 2>]
}
pvalue2tailedZtest
 Calculates the p-value of a two-tailed z-test from a z-score. The input is a z-score, which can be the z-score calculated by the zscore2sample post aggregator:
postAggregation : {
  "type": "pvalue2tailedZtest",
  "name": <output_name>,
  "field": <z-score (post_aggregator)>
}
For example:
{
  "type": "pvalue2tailedZtest",
  "name": "pvalue",
  "field": "zscore"
}
To use these post aggregators, you’ll need to include the stats extension in your configuration file:
druid.extensions.loadList=["druid-stats"]
Implementation of the Druid Post Aggregator for the Two-Sample Z-Test
We need to first create a class to implement Druid PostAggregator Interface:
@JsonTypeName("zscore2sample")
public class ZtestPostAggregator implements PostAggregator {
  private final String name;
  private final List<PostAggregator> fields;
  ...
where an important method of this interface to implement is:
public Object compute(Map<String, Object> combinedAggregators);
Druid Query Evaluation of Post Aggregators
A Druid query is translated into a directed acyclic graph (DAG). A Druid post aggregator operator is responsible for unrolling its arguments by invoking its dependent fields (which can themselves be operators) and passing down the query context (stored as a map of <name, object> pairs) to do the evaluation.
A Druid Query Example Using the Z-Score and P-Value Post Aggregators
The Druid query engine relies on the post aggregators to do the unrolling themselves. The following JSON query is an example of how to use these two post aggregators together.
The pvalue2tailedZtest post aggregator is invoked by the Druid query engine first. Since it needs a z-score input field, and the z-score is not yet calculated at that time, the pvalue2tailedZtest post aggregator invokes the zscore2sample post aggregator to do the two-sample z-score calculation and uses its output as the input for calculating the p-value.
"postAggregations": [{
  "type" : "pvalue2tailedZtest",
  "name" : "pvalue",
  "field" : {
    "type" : "zscore2sample",
    "name" : "zscore",
    "fields" : [{
      "type" : "fieldAccess",
      "name" : "success1",
      "fieldName" : "success1"
    }, {
      "type" : "fieldAccess",
      "name" : "sample1",
      "fieldName" : "sample1"
    }, {
      "type" : "fieldAccess",
      "name" : "success2",
      "fieldName" : "success2"
    }, {
      "type" : "fieldAccess",
      "name" : "sample2",
      "fieldName" : "sample2"
    }]
  }
}]
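The unrolling described above can be sketched in simplified form. These are not the actual Druid classes (the real interface works on a Map<String, Object> of aggregator outputs and carries more metadata); the sketch only illustrates how a nested post aggregator evaluates its dependent fields against the shared query context:

```java
import java.util.Map;

// Simplified sketch of post-aggregator unrolling: each operator evaluates
// its dependent fields first, passing the query context down the DAG.
public class PostAggSketch {
    interface PostAgg { double compute(Map<String, Double> context); }

    // Leaf operator: reads an already-aggregated value from the query context.
    static PostAgg fieldAccess(String name) {
        return ctx -> ctx.get(name);
    }

    // zscore2sample-style operator: pooled two-proportion z-score computed
    // from the outputs of its four dependent field accessors.
    static PostAgg zScore2Sample(PostAgg s1, PostAgg n1, PostAgg s2, PostAgg n2) {
        return ctx -> {
            double c1 = s1.compute(ctx), t1 = n1.compute(ctx);
            double c2 = s2.compute(ctx), t2 = n2.compute(ctx);
            double pooled = (c1 + c2) / (t1 + t2);
            double se = Math.sqrt(pooled * (1 - pooled) * (1 / t1 + 1 / t2));
            return (c1 / t1 - c2 / t2) / se;
        };
    }

    public static void main(String[] args) {
        // Query context as produced by the aggregation phase (values made up).
        Map<String, Double> ctx = Map.of(
                "success1", 520.0, "sample1", 10000.0,
                "success2", 570.0, "sample2", 10000.0);
        PostAgg zscore = zScore2Sample(
                fieldAccess("success1"), fieldAccess("sample1"),
                fieldAccess("success2"), fieldAccess("sample2"));
        System.out.println(zscore.compute(ctx));
    }
}
```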
Next Steps in Furthering Druid Post Aggregator Performance
To have a comprehensive view of whether A/B testing results are statistically significant, a couple dozen relevant metrics are usually computed and compared against each other, from the control and treatment groups respectively. The latter is realized by filtering on the treatment ID, while the former are computed using aggregators. However, Druid has a limitation of one aggregator per filtered aggregator as of the 0.10 release. A performance enhancement request has been filed to lift this limitation: https://github.com/druid-io/druid/issues/4267.
Concluding Remarks
We have built a new PayPal Experimentation Platform to streamline setting up and running PayPal experiments, and specifically a new PayPal Experimentation Analytics platform for fast experiment reporting and interactive queries at scale. To speed up data-driven decisions and innovation, we are contributing new Druid post aggregators to calculate whether the differences between control and treatment observations are due to chance, or whether they reveal statistical significance so a decision can be made based on the data.
Since the Druid engine is designed and built with a modular architecture for extensibility, we are adding the new post aggregators to the druid-stats extension without adding overhead to the Druid engine. By contributing these new stats post aggregators back to Druid, the whole Druid community can benefit from Druid not only for fast interactive queries but also for making decisions based on data.
Acknowledgements
The new PayPal Experimentation Platform is a big undertaking by the PayPal Experimentation team (PXP) and the PayPal analyst community. The preliminary results have been very promising.