Democratizing Experimentation Data for Product Innovations

Introduction to A/B Testing

A/B Testing (also known as experimentation or bucket testing) enables product teams to gain more insight into and understanding of PayPal users, their behavior, and their interaction with PayPal product pages, product flows, and products in general (with user permission).  It makes it possible to arrive at product decisions quickly and in a data-driven manner, and hence facilitates fast iteration and turnaround to speed up innovation and product development cycles.

By tracking behavioral data from different versions of product pages or apps, we can analyze the data to identify the variation that maximizes the number of transactions, revenue, conversions, page views, or any other key metric. A/B Testing takes the guesswork out of product development and enables data-informed decisions by shifting business conversations from “we think” to “we observe”.  By measuring the impact of changes on key metrics, the product engineering team can ensure that every change produces a positive result.

Next Generation PayPal Experimentation Platform

To this end, we have built a next generation PayPal experimentation platform to simplify and streamline all experimentation operations, together with a new PayPal Experimentation Analytics platform to quickly surface insights from experiments and support data-driven decisions.

In this blog, we’ll focus on the effort of building the new PayPal Experimentation Analytics platform, specifically on how we leverage the Druid data engine and how we extend it to calculate statistical significance.

The new Experimentation Analytics Dashboard provides an overview that enables teams to easily compare all relevant success metrics across experiment variations.

In addition, it enables easy zoom-ins to see the details of traffic splits along with their metrics. Most importantly, it can determine whether observed metric differences are simply due to chance, or are actually statistically significant.

Figure 1: New PayPal Experimentation Dashboard

Next Generation PayPal Experimentation Analytics Platform

The new PayPal Experimentation Analytics Platform is built to power this dashboard and realize the above goals. At a very high level, there are two stages of processing:

  • Data Enrichment – we first enrich experimentation tracking data with experiment and treatment metadata using Apache Spark. In doing so, data sets can be loaded into a data engine for meaningful comparisons, i.e., comparing corresponding data sets of different treatments within the same selected time periods.
  • Data Indexing – the new PayPal Experimentation Analytics Platform is built on top of the Druid data engine, to slice and dice very-large-scale multi-dimensional data sets fast enough to power interactive queries. We use Hadoop-based batch ingestion to take advantage of the PayPal Hadoop cluster when indexing into Druid.

In addition, the PayPal Experimentation Analytics Platform is built with the following forward-looking designs in mind:

  • Real time – the platform can be extended to process streaming data in real time if needed. For example, the Spark code can be quickly modified for this purpose, and Druid has built-in real-time nodes for real-time data ingestion.  In fact, our vision is to unify streaming and batch processing without duplicating processing capacity or sacrificing correctness.  Druid fits into this architecture naturally.
  • Cloud ready – as Druid’s deep storage component is pluggable, Druid can utilize cloud storage, for example, Google Cloud Storage or AWS S3, when migrating to the cloud.

Hypothesis Testing

Simply speaking, A/B testing is a way of comparing two versions of a single variable, typically by testing a user’s response to version A against version B and determining which version is more effective.

In the field of statistics, A/B Testing is essentially a hypothesis test for comparing two population proportions. A hypothesis test is a statistical test used to determine whether there is enough evidence from our observations to infer that a certain hypothesis is true.

In other words, we use hypothesis testing to tell us whether the differences between treatment A and treatment B actually affect user behavior, or whether the observed variations are due to random chance.

For hypothesis testing, we first start with a null hypothesis. We then hypothesize the sampling distribution and select a test statistic. From there, we can compute the test statistic and calculate the p-value to determine whether we should reject the null hypothesis (see below).

Null Hypothesis

For A/B testing, the null hypothesis is usually that the observations from the control group are no different from the observations from the treatment group, which is similar to the presumption of innocence in our legal system, where we begin a trial by assuming the defendant is innocent.

The null hypothesis is generally assumed to be true until the evidence indicates otherwise. In statistics, it is often denoted as H0.

In order to evaluate evidence, we need to select an appropriate distribution for the test.

Z-Test

Due to PayPal’s very large user base, the central limit theorem can be applied, so the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.  As a result, we opt to use the Z-test.
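As a quick illustration of why the normal approximation is reasonable, the following sketch (with hypothetical numbers, not PayPal data) simulates the sampling distribution of a conversion rate over many repeated samples; its mean and spread match what the central limit theorem predicts:

```python
import random
import statistics

random.seed(42)
p, n = 0.1, 5000  # hypothetical conversion rate and sample size

# Draw many repeated samples and record the sample proportion of each.
props = [sum(random.random() < p for _ in range(n)) / n for _ in range(500)]

mean = statistics.mean(props)
sd = statistics.stdev(props)
# Per the CLT: mean ≈ p, and sd ≈ sqrt(p * (1 - p) / n) ≈ 0.00424
print(mean, sd)
```

The resulting histogram of sample proportions would look approximately normal, which is what justifies treating the test statistic as normally distributed.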

Z-Score

From https://en.wikipedia.org/wiki/Z-test, the z-score is defined as the distance from the sample mean to the population mean in units of the standard error:

z = (M – μ) / SE

where M is the sample mean, μ is the population mean, and SE is the standard error of the sample mean.

The z-score is used as a normalized measure of the deviation from the center.
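The definition above can be sketched directly in code (the numbers below are hypothetical, for illustration only):

```python
import math

def z_score(sample_mean: float, population_mean: float, standard_error: float) -> float:
    """Distance from the sample mean to the population mean in units of standard error."""
    return (sample_mean - population_mean) / standard_error

# Hypothetical example: sample mean 10.4, population mean 10.0,
# population standard deviation 2.0, sample size 100.
se = 2.0 / math.sqrt(100)        # standard error of the sample mean
print(z_score(10.4, 10.0, se))   # ≈ 2.0, i.e., two standard errors above the mean
```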

Two-Sample Z-Test

We are performing a hypothesis test to compare two population proportions P1, and P2,
where the null hypothesis H0 is:

P1 (control metrics) = P2 (treatment metrics)

where the metrics can be any relevant metrics, for example, conversion rate or number of transactions.

Please see https://www.isixsigma.com/tools-templates/hypothesis-testing/making-sense-two-proportions-test/ and substitute with the null hypothesis P1 – P2 = 0


Assuming the null hypothesis H0 is true

     => P1 – P2 = 0

i.e., there is no difference between two population proportions.
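The two-sample z-score under H0 can be sketched as follows, using the pooled-proportion form of the standard error from the iSixSigma reference above (the counts below are hypothetical; the actual extension’s internals may differ in detail):

```python
import math

def two_sample_z_score(successes1: int, size1: int, successes2: int, size2: int) -> float:
    """Two-sample z-test for proportions under H0: P1 - P2 = 0.

    Uses the pooled sample proportion to estimate the standard error.
    """
    p1 = successes1 / size1
    p2 = successes2 / size2
    pooled = (successes1 + successes2) / (size1 + size2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / size1 + 1 / size2))
    return (p1 - p2) / se

# Hypothetical example: 100/1000 conversions in control vs. 130/1000 in treatment.
print(two_sample_z_score(100, 1000, 130, 1000))  # ≈ -2.10
```

A z-score of about -2.10 says the control proportion sits roughly two standard errors below the treatment proportion, which is what the p-value calculation below quantifies.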

P-Value

The p-value is the probability, under a standard normal distribution, of a result as extreme as, or more extreme than, the observed z-statistic. Please refer to http://www.stat.yale.edu/Courses/1997-98/101/sigtest.htm for more explanation.

Per https://en.wikipedia.org/wiki/P-value, the p-value is the probability, assuming the null hypothesis is true, of obtaining a result where the test statistic is as extreme as, or more extreme than, what was actually observed, were we to run the experiment over and over.

 


Figure 2: P-value from https://en.wikipedia.org/wiki/P-value

P-Value Of a Two-Tailed Z-Test

In the context of our A/B testing, we are interested in testing whether there is a difference between the observations from the control and the observations from the treatment. Therefore, we calculate the p-value of a two-tailed Z-test as 2 * P(Z ≥ |zscore|), under the assumption that the null hypothesis is true.

From Mind on Statistics, 3rd Edition:

Our test data can be hypothesized to follow the normal distribution per the central limit theorem. Since a normal distribution is symmetrical, and we standardize the normal distribution to a standard normal distribution:

=>  p-value for a two-tailed test = 2 * (area to the right of |z|)

Recall that the p-value measures the plausibility of H0: the smaller the p-value, the stronger the evidence against H0. A small p-value means that, if H0 were true, results this extreme would rarely occur by chance.
As an example, a p-value of 0.0128 (with alpha = 0.05) reveals that it is unlikely we would observe such an extreme test statistic in the direction of Ha if H0 were true. Therefore, the initial assumption that H0 is true is likely incorrect, so we reject the null hypothesis H0 in favor of Ha, i.e., the difference between control and treatment is statistically significant.

In the case of a large p-value (say 0.20), we would say that we failed to reject H0, similar to failing to disprove innocence.
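This two-tailed p-value calculation can be sketched with the standard normal CDF, using math.erf from the Python standard library (the cutoff values below are for illustration):

```python
import math

def standard_normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard normal variable Z, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value_two_tailed(z_score: float) -> float:
    """p-value of a two-tailed z-test: 2 * P(Z >= |z|)."""
    return 2.0 * (1.0 - standard_normal_cdf(abs(z_score)))

# |z| = 1.96 corresponds to the familiar p ≈ 0.05 threshold.
print(round(p_value_two_tailed(1.96), 3))  # ≈ 0.05
```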

Druid

Druid is an open-source, high-performance, column-oriented, distributed, real-time analytics data engine that we use to power dashboards and interactive queries for PayPal experimentation analytics.

It is horizontally scalable due to its shared-nothing architecture, and its column-oriented storage format makes it suitable for OLAP queries. Druid uses segments as its fundamental storage units. Because Druid segments are immutable, query processing is simplified and scaling out is easy. In addition, segment size is configurable to tune for optimal query performance. Druid uses ZooKeeper as its coordination service for resiliency. Please refer to the Druid paper in SIGMOD 2014 for details about its architecture.

As powerful as Druid is, it can’t tell you whether the observations from group A and group B differ by chance, or whether there is statistical significance behind the data. That’s why we are contributing new operators to facilitate making such decisions.

Druid Post Aggregators

A/B testing boils down to comparing two population proportions with independent samples. As described before, we use the Z-test as our test statistic, since it is safe to assume the distribution is approximately normal, based on the central limit theorem, for our experimentation scenarios.
Druid is built with extensions in mind. To this end, we would like to contribute new post aggregators to the druid-stats extension:

zscore2sample

  • Calculates the z-score using a two-sample Z-test, converting binary variables (e.g., success or not) to continuous variables (e.g., conversion rate).

The grammar for a two sample z-test post aggregation is:

postAggregation : {
  "type": "zscore2sample",
  "name": <output_name>,
  "fields": [<count 1 (post_aggregator1)>, <sample size 1 (post_aggregator2)>,
             <count 2 (post_aggregator3)>, <sample size 2 (post_aggregator4)>]
}

For example, the fields can be

{
  "fields": [<success count of sample 1>, <sample size of population 1>,
             <success count of sample 2>, <sample size of population 2>]
}

pvalue2tailedztest

Calculates the p-value of a two-tailed z-test from a z-score. The input is a z-score, which can be the one calculated using the zscore2sample post aggregator.

postAggregation : {
  "type": "pvalue2tailedztest",
  "name": <output_name>,
  "field": <zscore (post_aggregator)>
}

For example,

{
  "type": "pvalue2tailedztest",
  "name": "pvalue",
  "field": <zscore (post_aggregator)>
}

To use these aggregators, you’ll need to include the druid-stats extension in your configuration file:

druid.extensions.loadList=["druid-stats"] 

Implementation of Druid Post Aggregator for two-sample Z-test

We need to first create a class to implement Druid PostAggregator Interface:


@JsonTypeName("zscore2sample")
public class ZtestPostAggregator implements PostAggregator {
  private final String name;
  private final List<PostAggregator> fields;
  ...

where an important method of this interface to implement is:

public Object compute(Map<String, Object> combinedAggregators);

Druid Query Evaluation of Post Aggregators

A Druid query is translated to a directed acyclic graph (DAG).  A Druid post aggregator operator is responsible for unrolling its arguments by invoking its dependent fields (which may themselves be operators) and passing down the query context (stored as a map of <name, object> pairs) for evaluation.

A Druid Query Example using Zscore and Pvalue Post Aggregators

The Druid query engine relies on the post aggregators to do the unrolling themselves. Please see the following JSON query as an example of how to use these two aggregators together.

The pvalue2tailedztest post aggregator is invoked by the Druid query engine first.  Since it needs a z-score input field, and the z-score has not yet been calculated at that point, the pvalue2tailedztest post aggregator invokes the zscore2sample post aggregator to perform the two-sample z-score calculation, and uses that output as its input to calculate the p-value from the z-score.

"postAggregations": [{
  "type" : "pvalue2tailedztest",
  "name" : "pvalue",
  "field" : {
    "type" : "zscore2sample",
    "name" : "zscore",
    "fields" : [{
      "type" : "fieldAccess",
      "name" : "success1",
      "fieldName" : "success1"
    }, {
      "type" : "fieldAccess",
      "name" : "sample1",
      "fieldName" : "sample1"
    }, {
      "type" : "fieldAccess",
      "name" : "success2",
      "fieldName" : "success2"
    }, {
      "type" : "fieldAccess",
      "name" : "sample2",
      "fieldName" : "sample2"
    }]
  }
}]
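To make the unrolling concrete, here is a minimal Python sketch of this evaluation model (not the actual Java implementation; the pooled-proportion form and sample counts are illustrative assumptions). Each node’s compute call recurses into its dependent fields over the query context map, mirroring the JSON query above:

```python
import math

# The query context: aggregated values keyed by name, as produced upstream.
combined_aggregators = {"success1": 100, "sample1": 1000,
                        "success2": 130, "sample2": 1000}

def field_access(name):
    """Mirrors Druid's fieldAccess post aggregator: look up an aggregated value."""
    return lambda ctx: ctx[name]

def zscore2sample(count1, size1, count2, size2):
    """Pooled two-sample z-score; each argument is itself a post aggregator."""
    def compute(ctx):
        x1, n1, x2, n2 = count1(ctx), size1(ctx), count2(ctx), size2(ctx)
        pooled = (x1 + x2) / (n1 + n2)
        se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
        return (x1 / n1 - x2 / n2) / se
    return compute

def pvalue2tailedztest(zscore):
    """Two-tailed p-value; invokes its dependent z-score post aggregator first."""
    def compute(ctx):
        z = zscore(ctx)  # unroll the dependent field
        return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return compute

# Mirrors the JSON query above: a pvalue node wrapping a zscore node.
pvalue = pvalue2tailedztest(
    zscore2sample(field_access("success1"), field_access("sample1"),
                  field_access("success2"), field_access("sample2")))
print(pvalue(combined_aggregators))  # ≈ 0.035, significant at alpha = 0.05
```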

Next Steps in Furthering Druid Post Aggregator Performance

To have a comprehensive view of whether A/B testing results are statistically significant, a couple dozen relevant metrics are usually computed and compared between the control and treatment groups.  The latter is realized by filtering with the treatment ID, while the former are computed using aggregators.  However, as of the 0.10 release, Druid has a limitation of one aggregator per filtered aggregator.  A performance enhancement request has been filed – https://github.com/druid-io/druid/issues/4267 – to lift this limitation.

Concluding Remarks

We have built a new PayPal Experimentation Platform to streamline setting up and running PayPal experiments, specifically the new PayPal Experimentation Analytics platform for fast experiment reporting and interactive queries at scale. To speed up data-driven decisions and innovation, we are contributing new Druid post-aggregators to calculate whether the observed differences between control and treatments are due to chance, or reveal statistical significance so a decision can be made based on the data.

Because the Druid engine is designed and built with a modular architecture for extensibility, we are able to add new post-aggregators to the druid-stats extension without adding overhead to the Druid engine.  By contributing these new stats post aggregators back to Druid, the whole Druid community can benefit from Druid not only for fast interactive queries but also for making decisions based on data.

Acknowledgements

The new PayPal Experimentation Platform is a big undertaking by the PayPal Experimentation team (PXP) and the PayPal analyst community. The preliminary results have been very promising.

Chung-Ho Chen

Chung-Ho is a hands-on architect in PayPal Experimentation & Tracking Engineering team. Before that, Chung-Ho was Chief Data Engineer with PayPal Consumer data engineering, and PayPal Merchant data engineering. He has been a leading voice in driving the innovative use of new data technologies since joining PayPal in 2013.