
Autoscaling applications @ PayPal


PayPal’s infrastructure has grown tremendously over the past few years, hosting multitudes of applications that serve billions of transactions every year. Our strong belief in decoupling the infrastructure and the app development layers enables each to evolve independently and quickly. With more than 2,500 internal applications and more than 100k instances across our entire infrastructure, we are committed to ensuring high availability of our applications and optimal resource utilization.

In our previous post, we discussed Nyx, an automated remediation system for our infrastructure, in detail. The underlying modular architecture of Nyx is designed to listen to a spectrum of signals from different agents and quickly remediate the anomalies they identify. This lets us scale to more agents, and more functions per agent, while the core module can easily be extended to understand new types of signals. We have also reached a milestone of more than a million auto-remediations across our infrastructure.

In our continued efforts to ensure effective resource utilization across our infrastructure, we have extended Nyx to auto-scale our applications. Optimizing cost remains an interesting challenge in our journey into the public cloud as well. The goal of this addition is an automated process that understands the throughput of a single instance in an application and then right-sizes the number of computing instances for the current traffic trend. Because traffic trends vary, we need an intrinsic way of dynamically quantifying throughput; this is what allows every application at PayPal to remain truly elastic in response to such varying trends.


Autoscaling in general can be done either reactively or predictively. Reactive autoscaling quickly increases or decreases the cluster size in response to changes in traffic load. It is effective for reacting to high-velocity changes, provided the mechanisms to provision and warm up the newly added instances are efficient. A simple way to do this is to let app developers define the thresholds at which instances are scaled up or down.

Predictive autoscaling attempts to model the underlying trend and predict changes in order to scale applications ahead of time. It helps in cases where the provisioning mechanisms and frameworks need more time to warm up before serving incoming requests; however, it is less useful during sudden surges. Other variations combine these two approaches in an ensemble, with efficiency numbers that are very context specific.

Autoscaling @ PayPal

The autoscaling engine for applications hosted on PayPal’s infrastructure is a variant of the ensemble approach. A baseline of assumptions was established after understanding the various constraints on the system:

1. Complex topology – We host a huge number of applications with a wide variety of characteristics and dependencies. Correlating overall incoming traffic to scale every app in the infrastructure seems straightforward, but it is not an optimal way to right-size every application cluster.

2. Developer isolation – Relying on the app development team to establish throughput numbers proves ineffective, because right-sizing requires continuously recalculating the throughput of an instance for the current traffic trend and code version. To do this, the app development team would need to define additional processes to simulate production environments, which defeats the purpose of isolating them from infrastructure details.

3. Frequency of right-sizing – Autoscaling can be triggered on an application pool to better serve the incoming traffic trend over a defined period. Say we want a pool to be right-sized for the next week, during which traffic is trending downward. We may still observe varying patterns within each day, with peaks at specific times. We could either scale up and down within each day, or scale once for the maximum traffic expected during the upcoming week; the latter leaves some resources under-utilized. We also need to take the underlying provisioning and decommissioning mechanisms into consideration when determining the frequency of right-sizing. Hence, we need an effective way of establishing the right-sizing plan once the throughput is established for the input traffic pattern.

Examples of incoming traffic pattern of two different applications (one week)

We have split the work into two parts:

  1. Quantifying an instance’s throughput in an application in a dynamic way.
  2. Dynamically determining an autoscaling plan.

We will discuss only the first part in this post.

Unit instance throughput calculation

The first step in right-sizing is to determine the unit throughput of an instance in the application cluster. A standard approach is to run a stress test to determine the throughput of an instance serving the application. However, this approach has many drawbacks: the development team is required to mock their production environment and write test scripts to automate the process, and replicating the stochastic nature of production environments is very hard and adds a lot of overhead for them. So we had to define a simpler way that isolates teams from these problems.

Our approach 

Our approach safely utilizes the production environment itself to determine the throughput of one instance in an application cluster. Briefly, our throughput test selects one instance from the cluster running in production, serving live traffic behind a load balancer. We then gradually divert an increasing share of traffic to this chosen instance relative to the other instances, monitoring various metrics to observe the point at which it becomes an outlier in comparison to its peers. That point defines the maximum threshold beyond which there would be a negative impact on the end-user experience, and the amount of traffic diverted to the instance just before that point determines its maximum throughput. We will discuss every step involved in this process:

  1. Defining a reference cluster and choosing a candidate instance
  2. Pre-test setup
  3. Load increase and monitoring
  4. Restoration

Entire workflow of determining unit instance’s throughput value.

Defining a reference cluster and choosing a candidate instance

The throughput test runs as one of the Nyx agents. The agent starts by picking one of the applications running behind a datacenter’s load balancer. In this exercise, the state of an application is defined by four metrics: CPU, memory, latency, and error count. The workflow begins by identifying instances that are outliers on any of these metrics, using the DBSCAN algorithm. After removing outliers, the remaining instances form our reference cluster; we need to ensure that all the instances in the reference cluster remain clustered together throughout the exercise. One candidate instance is then chosen from this cluster to run the exercise on.
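To make the outlier step concrete, here is a minimal sketch of how noise points could be flagged with DBSCAN over the four metrics. The metric values, eps, and min_samples below are illustrative assumptions, not our production parameters, and this toy implementation stands in for a real DBSCAN library.

```python
from math import dist

def dbscan_outliers(points, eps, min_samples):
    """Return names of points labeled as noise by a minimal DBSCAN.

    points: dict mapping instance name -> metric vector (cpu, mem, latency, errors).
    eps / min_samples: DBSCAN parameters (illustrative values only).
    """
    names = list(points)
    # neighbors[n] = points within eps of n (including n itself)
    neighbors = {
        n: [m for m in names if dist(points[n], points[m]) <= eps]
        for n in names
    }
    core = {n for n in names if len(neighbors[n]) >= min_samples}
    # Noise = neither a core point nor within eps of any core point
    return [
        n for n in names
        if n not in core and not any(c in core for c in neighbors[n])
    ]

# Hypothetical snapshot: three healthy instances and one drifting away.
metrics = {
    "host-1": (40, 55, 120, 0),
    "host-2": (42, 54, 118, 0),
    "host-3": (41, 56, 122, 0),
    "host-4": (90, 80, 400, 7),
}
outliers = dbscan_outliers(metrics, eps=20.0, min_samples=2)
```

Instances labeled as noise are excluded; the survivors form the reference cluster and are expected to stay clustered together for the rest of the exercise.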

Snapshot of app state after defining a reference cluster and choosing a candidate instance

Pre-test setup

The load balancer running on top of these instances routes traffic in a round-robin fashion. For the test, the load balancing method is changed to ratio-member so that traffic to the candidate instance can be increased in steps. Certain safety checks are also put in place to ensure smoother testing: we lock the entire pool from external modifications to avoid mutations while the test runs, and because we are dealing with live traffic, we keep quick stop and restore options ready so that the system can hard-stop and restore at any point in the workflow should there be a problem.

Load increase and monitoring

After every step increase in the weight of traffic diverted to the candidate instance, the four metrics are monitored continuously for any change in pattern, with a hard check on latency. PayPal’s infrastructure defines acceptable thresholds for metrics like CPU and memory, and we enforce relative thresholds using DBSCAN as well. We monitor the candidate instance for a specific period after each increase, looking for deviations in the above-mentioned metrics, and we store the state of the candidate instance after every increase during the monitoring session.

If there are no deviations, we conclude that the candidate instance can handle the increased traffic and proceed with a further increase. If there is a significant change in (or deviation beyond the maximum threshold for) any monitored metric, we define that state to be the end of the test. We also cap the load balancer weight to put a finite horizon on the step increases of load on the candidate instance.
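The step-increase loop can be sketched as follows. The function names, the linear weight steps, and the observe/outlier hooks are assumptions for illustration; the real agent works against the load balancer and monitoring systems.

```python
def run_throughput_test(set_weight, observe_tpm, is_outlier,
                        max_weight=5, start_weight=1):
    """Sketch of the load-increase loop (names and step policy are assumptions).

    set_weight(w): applies ratio-member weight w to the candidate instance.
    observe_tpm(): returns candidate transactions/min after a monitoring period.
    is_outlier(): True once the candidate deviates from the reference cluster.
    Returns the throughput observed at the last healthy weight.
    """
    throughput = None
    for weight in range(start_weight, max_weight + 1):
        set_weight(weight)
        tpm = observe_tpm()
        if is_outlier():
            break               # use the state *before* the failing step
        throughput = tpm        # last healthy observation
    set_weight(start_weight)    # restoration: revert the load balancer config
    return throughput
```

Both end conditions from the text appear here: the loop stops early when the candidate becomes an outlier, or runs out of weight increases at max_weight.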

Snapshot of host latencies at traffic 2x

Snapshot of host latencies at traffic 3x


The throughput test can end in one of two ways: the chosen candidate fails to handle an increase in traffic, or it exhausts the number of weight increases allowed on the load balancer. In the first case, the state just before the failure occurred is used to determine the throughput; in the second, the state at which the weight updates were exhausted determines it. All configuration changes are then restored on the load balancer, and we enter a post-run monitoring phase to ensure that the candidate instance reverts to its original state.

One such exercise is visualized below. We can observe the change in transaction numbers for the candidate instance during the exercise window; the point at which the change in latency occurs (in this case) is when we end the monitoring phase and enter restoration.

TPM graph during the exercise – Candidate host clearly serving more transactions compared to its peers without affecting end-user experience.

Calculating required capacity

We will discuss determining auto-scaling plans from the unit compute throughput in the next post. For now, let us take a brief look at how we establish the final capacity of an application cluster, which is fairly straightforward at this point.


Peak throughput for the application = 1200

Unit compute throughput = 100

If the adjustment factor AF = 1.5, then final required instances (rounded up) = (1200 / 100) × 1.5 = 18
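The same arithmetic can be sketched as a small function, where AF is the adjustment factor that pads the cluster with headroom:

```python
from math import ceil

def required_instances(peak_throughput, unit_throughput, adjustment_factor):
    """Final cluster size: demand divided by per-instance capacity,
    padded by the adjustment factor (AF) and rounded up."""
    return ceil(peak_throughput / unit_throughput * adjustment_factor)

size = required_instances(1200, 100, 1.5)  # -> 18
```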


Let us consider the case where the candidate instance can take all the traffic diverted to it, and the test exits only after reaching the hard limit on the number of weight increases. Intuitively, the numbers obtained from such a run might not reflect the instance’s actual maximum throughput. However, after many runs in production we have observed that these situations arise only when the input traffic itself is very low or the application is particularly well written. If the only end condition were the candidate instance becoming an outlier, such tests could run forever, and the very high stress loads involved could cause sudden latency spikes and momentary disruption. Hence, we introduce a definite stopping point for these situations; in practice, the throughput numbers obtained still come very close to the true values.

These exercises run in production 24×7, with at most 10 applications profiled at a time to avoid causing CPU spikes on the load balancers. Since applications are constantly profiled, new code versions yield new throughput numbers, which solves the problem of dealing with stale numbers.

Our approach works because we receive enough traffic at any time to saturate one instance and push it out as an outlier in comparison to the other instances in the pool. The advantage of this is that regardless of when the exercise is run (peak traffic time or normal), the throughput number remains the same for a given version of the code.

We will look at dynamically determining the auto-scaling plan in the next post.

DMARC-Related Recommendations Included in NIST Guidance on Trustworthy Email


Another important milestone was recently achieved for Domain-based Message Authentication Reporting and Conformance (DMARC), one of the PayPal Ecosystem Security team’s major undertakings in making the internet a safer, more secure place.

After several years of collaboration with the email security community, the U.S. National Institute of Standards and Technology (NIST) included recommendations for supporting DMARC in NIST’s SP 800-177, Trustworthy Email. SP 800-177 was released in September and is intended to give recommendations and guidelines for enhancing trust in email. While the audience for NIST publications is typically US federal agencies, their guidance does tend to influence other global organizations and industry tides. The recommendations for DMARC in the publication include the following:

  • Security Recommendation 4-11: Sending domain owners who deploy SPF and/or DKIM are recommended to publish a DMARC record signaling to mail receivers the disposition expected for messages purporting to originate from the sender’s domain.
  • Security Recommendation 4-12: Mail receivers who evaluate SPF and DKIM results of received messages are recommended to dispose them in accordance with the sending domain’s published DMARC policy, if any. They are also recommended to initiate failure reports and aggregate reports according to the sending domain’s DMARC policies.

DMARC is a way to make it easier for email senders and receivers to determine whether or not a given message is legitimately from the sender, and what to do if it isn’t. This makes it easier to identify spoofed email messages, a common phishing technique, and reject them. Users often can’t tell a real message from a fake one, and mailbox providers have to make very difficult (and frequently incorrect) choices about which messages to deliver and which might harm users. Senders are also largely unaware of problems with their own email being abused and want feedback from receivers about fraudulent messages. DMARC addresses these issues, helping email senders and receivers work together to better secure email, protecting users and brands from costly abuse.
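Mechanically, a domain publishes its DMARC policy as a DNS TXT record of tag=value pairs that receivers parse and act on. The record below is a made-up example, not any real domain’s published policy, and the small parser is only a sketch:

```python
def parse_dmarc(record):
    """Parse a DMARC TXT record's tag=value pairs into a dict."""
    return dict(
        tag.strip().split("=", 1)
        for tag in record.split(";")
        if "=" in tag
    )

# Illustrative record: reject mail that fails SPF/DKIM alignment and
# send aggregate reports to the listed mailbox (example.com is fictional).
policy = parse_dmarc("v=DMARC1; p=reject; rua=mailto:dmarc-reports@example.com")
```

A receiver evaluating this record would dispose of failing messages per `policy["p"]` and send aggregate reports to the `rua` address, which is exactly the sender/receiver loop the recommendations above describe.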

Our 2015 post on the publication of DMARC within the IETF highlighted a critical step towards making it a widely adopted policy layer that builds on top of Sender Policy Framework (SPF) and DomainKeys Identified Mail (DKIM). In addition to the inclusion by NIST, the U.K.’s Government Digital Service and Germany’s Federal Office for Information Security have incorporated similar DMARC guidance into their recommended email security postures. These global recommendations clearly indicate that DMARC is a required component of effective email security, and PayPal is proud to have led such an important initiative, one that protects not just our company and customers, but anyone who uses email.



Carrier Payments Big Data Pipeline using Apache Storm


Carrier payments is a frictionless payment method that enables users to place charges for digital goods directly on their monthly mobile phone bill. No account is needed, just a phone number. Payment authorization happens via verification of a four-digit PIN sent over SMS to the user’s mobile phone. After a successful payment transaction, the charge appears on the user’s monthly mobile phone bill.

Historically, fraud has been handled on the mobile carrier side through various types of spending caps (daily, weekly, monthly, etc.). While these caps kept fraud at bay in the early years, fraud attempts have matured along with this form of payment. Over the last year we saw an increase in fraudulent activity based on social engineering, through which fraudsters trick victims into revealing their PINs.

To tackle the increasing fraud activity, we decided to build a carrier payments risk platform on the Hadoop ecosystem. The first step in building the risk platform is to create a big data pipeline that collects data from various sources, cleans or filters it, and structures it for data repositories. The accuracy of our decision-making directly depends on the variety, quantity, and speed of data collection.

Data comes from various sources. The payment application generates transaction data; the refund job generates data about processed refunds; the customer service department provides customer complaints about wrong charges or failed payments; and browser fingerprints help establish another identity for the transacting user. These sources generate data in different formats and at varying frequencies, which makes putting together a data pipeline challenging.

We wanted a data pipeline that processes data reliably and can replay in case of failure. It should be easy to add or remove data sources and sinks, the pipeline should be flexible enough to allow schema changes, and we should be able to scale it.

System Architecture

The first challenge was to provide a uniform and flexible entry point to the data pipeline. We built a microservice with a single REST endpoint that accepts data in JSON format, which makes it easy for any system to send data to the pipeline. The schema consists of a property “ApplicationRef” and a JSON object “data”; sending the data as a JSON object allows easy schema changes.
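A minimal sketch of that envelope, with a hypothetical refund event (the field names inside “data” are illustrative, not our actual schema):

```python
import json

def make_event(application_ref, data):
    """Wrap a source event in the pipeline's envelope: a fixed
    "ApplicationRef" identifying the source, plus a free-form "data"
    object so producers can evolve their schemas without touching
    the endpoint."""
    return json.dumps({"ApplicationRef": application_ref, "data": data})

# Hypothetical refund event posted to the ingestion endpoint.
payload = make_event("refund-job", {"txnId": "T-1001", "amountUsd": 25.0})
```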

The microservice writes data into Kafka, which also acts as a buffer. There is a separate Kafka topic for each source that sends data to the pipeline. Assigning separate topics brings logical separation to the data, makes the Kafka consumer logic simpler, and allows us to give more resources to topics that handle large volumes of data.

We set up a Storm cluster to process the data after reading it from Kafka. To do real-time computation on Storm, you create “topologies”; a topology is a directed acyclic graph (DAG) of computation. Each topology contains the logic for reading data from Kafka, transforming it, and writing it to Hive. Typically a topology consists of two nodes of computation, a kafkaspout and a hivebolt; some topologies contain an intermediate node for transforming complex data. We created separate topologies to process data coming from the various sources. At the moment we have four: purchase, customer service, browser, and refund.

Storm allows topologies to be deployed and configured independently. This makes it possible to add more data sources in the future, or change existing ones, without much hassle. It also allows us to allocate resources to topologies depending on data volume and processing complexity.
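The topologies themselves are Java Storm code, but the shape of the computation can be sketched as a linear chain of stages. This toy model (the stage names and transforms are assumptions) mirrors the kafkaspout → transform → hivebolt flow:

```python
import json

def run_topology(source, stages):
    """Toy model of a linear topology: pull records from a source (the
    kafkaspout), pass each through transform stages, and collect what
    the final stage (the hivebolt) would write."""
    sink = []
    for record in source:
        for stage in stages:
            record = stage(record)
        sink.append(record)
    return sink

# Purchase-topology sketch: parse the JSON, then normalize the amount.
raw = ['{"txnId": "T-1", "amount": "10"}']
rows = run_topology(raw, [json.loads,
                          lambda r: {**r, "amount": float(r["amount"])}])
```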


Storm cluster set up and parallelism

We set up a highly available Storm cluster. There are two Nimbus nodes, one of which acts as the leader. We have two supervisor nodes, each with a 2-core Intel® Xeon® CPU E5-2670 0 @ 2.60 GHz.

Each topology is configured with 2 workers, 4 executors, and 4 tasks. The number of executors is the number of threads spawned across the workers (JVM processes). Together these settings define Storm’s parallelism, and changing any of them changes the parallelism accordingly.
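With the per-topology numbers above, the relationship between the three settings works out as:

```python
# Per-topology settings from the text: 2 workers, 4 executors, 4 tasks.
workers, executors, tasks = 2, 4, 4

threads_per_worker = executors // workers   # 2 executor threads inside each JVM process
tasks_per_executor = tasks // executors     # 1 task instance running on each thread
```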


Storm message guarantee set up

Storm’s basic abstraction provides an at-least-once processing guarantee; messages are replayed only in case of failure. These message semantics are ideal for our use case. The only problem was that we didn’t want to keep retrying a failed message indefinitely. We added the following settings to our SpoutConfig to stop retrying a failed message once at least five messages after it have been processed successfully. We also increased the elapsed time between retries of a failed message.

spoutConfig.maxOffsetBehind = 5;                    // give up on a failed offset once 5 newer messages have succeeded
spoutConfig.retryInitialDelayMs = 1000;             // wait 1 second before the first retry
spoutConfig.retryDelayMultiplier = 60.0;            // multiply the delay by 60 on each subsequent retry
spoutConfig.retryDelayMaxMs = 24 * 60 * 60 * 1000;  // cap the retry delay at 24 hours


Our data pipeline currently processes payment transaction, browser, refund, and customer service data at a rate of more than 100 tuples/min. The biggest convenience of this architecture has been the staggered release of topologies: it is very easy to add a new topology in Storm without disturbing those already running, and releasing a new topology takes less than a day. The architecture has proven very reliable, and we haven’t experienced any issues or loss of data so far.





Spark in Flames – Profiling Spark Applications Using Flame Graphs


When your organization runs multiple jobs on a Spark cluster, resource utilization becomes a priority. Ideally, computations receive sufficient resources to complete in an acceptable time and release resources for other work.

In order to make sure applications do not waste any resources, we want to profile their threads to try and spot any problematic code. Common profiling methods are difficult to apply to a distributed application running on a cluster.

This post suggests an approach to profiling Spark applications. The form of thread profiling used is sampling: capturing stack traces and aggregating them into meaningful data, in this case displayed as Flame Graphs. Flame graphs show, in a clear and user-friendly way, what percentage of CPU time is spent running which methods, rendered as interactive SVG files. This solution is a derivation of the Hadoop jobs thread-profiling solution described in a post by Ihor Bobak, which relies on Etsy’s statsd-jvm-profiler.


Since Spark applications run on a cluster, and computation is split up into different executors (on different machines), each running their own processes, profiling such an application is trickier than profiling a simple application running on a single JVM. We need to capture stack traces on each executor’s process and collect them to a single place in order to compute Flame Graphs for our application.

To achieve this, we specify the statsd-jvm-profiler Java agent in the executor processes to capture stack traces. We configure the agent to report to InfluxDB, a time-series database that centrally stores all stack traces with timestamps. Next, we run a script that dumps the stack traces from InfluxDB to text files, which we then feed to the Flame Graph SVG rendering script.

You can generate flame graphs for specific executors, or an aggregation of all executors.

As with other profiling methods, your application will incur a slight performance overhead from the profiling itself, so be wary of constantly running your applications in production with the Java agent attached.


Spark Application Profiling Overview


Below is an example of the result of the solution described. What you see is a Flame Graph, aggregated over stack traces taken from all executors running a simple Spark application, which contains a problematic method named inefficientMethod in its code. Studying the Flame Graph, we see that the application spends 41.29% of its threads running this method. The second column in the graph, where inefficientMethod appears, is the application’s user code; the other columns are Spark native threads that appear in most of the stack traces, so the focus of our profiling is on the column containing the user code. Within that column, inefficientMethod takes up most of the user code’s CPU run time, while the other parts of the user code are just a tiny sliver reaching the top of the Flame Graph.



1. Download and install InfluxDB 1.0.0

1.0.0 is currently in beta; however, 0.13 has concurrency bugs which cause it to fail.

Run it:

# linux
sudo service influxdb start

Access the DB using either the web UI (http://localhost:8083/) or shell:

# linux

Create a database to store the stack traces in:


Create a user:


2. Build/download statsd-jvm-profiler jar

Deploy the jar to the machines that run the executor processes. One way to do this is to use spark-submit’s --jars attribute, which will deploy it to the executors.

--jars /path/to/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar

Specify the Java agent in your executor processes. This can be done, for example, using spark-submit’s --conf attribute.


3. Download

Install all the required Python modules that the script imports.

Run it to create text files with stack traces as input for the flame graphs:


4. Download

Generate flame graphs using the text files you dumped from DB: <INPUT_FILES> > <OUTPUT_FILES>


The following is an example of running and profiling a Spark application.

Submit spark application:

spark-submit \
--deploy-mode cluster \
--master yarn \
--class com.mycorporation.MySparkApplication \
--conf ",port=8086,reporter=InfluxDBReporter,database=profiler,username=profiler,password=profiler,prefix=MyNamespace.MySparkApplication,tagMapping=namespace.application" \
--name MySparkApplication \
--jars /path/to/profiling/statsd-jvm-profiler-2.1.0-jar-with-dependencies.jar \

Dump stack traces from DB: -o "" -u profiler -p profiler -d profiler -t namespace.application -e MyNamespace.MySparkApplication -x "my_stack_traces"

Generate Flame Graph (In this case for all executors): my_stack_traces/all_*.txt > my_flame_graph.svg
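The flame graph scripts consume “folded” stacks: semicolon-joined frames followed by a sample count. A sketch of how sampled stack traces collapse into that format (the frame names here are illustrative):

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse sampled stack traces (outermost frame first) into the
    folded format flame graph scripts consume: 'a;b;c <count>' lines,
    most frequent stack first."""
    counts = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {n}" for stack, n in counts.most_common()]

# Three hypothetical samples: the hot path appears twice.
samples = [
    ["main", "compute", "inefficientMethod"],
    ["main", "compute", "inefficientMethod"],
    ["main", "io", "read"],
]
folded = fold_stacks(samples)
```

In the rendered SVG, each folded line becomes a stack column whose width is proportional to its share of the samples, which is how inefficientMethod’s 41.29% shows up visually.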

Powering Transactions Search with Elastic – Learnings from the Field



We see a lot of transactions at PayPal. Millions every day.

These transactions originate externally (a customer using PayPal to pay for a purchase on a website) as well as internally, as money moves through our system. Regular reports of these transactions are delivered to merchants as CSV or PDF files, which merchants use to reconcile their books.

Recently, we set out to build a REST API that could return transaction data back to merchants. We also wanted to offer the capability to filter on different criteria such as name, email or transaction amount. Support for light aggregation/insight use cases was a stretch goal. This API would be used by our partners, merchants and external developers.

We chose Elastic as the data store. Elastic has proven, over the past six years, to be an actively developed product that constantly evolves to meet user needs. With a great set of core improvements introduced in version 2.x (memory-mapped doc values, auto-regulated merge throttling), we didn’t need to look any further.

Discussed below is our journey and key learnings we had along the way, in setting up and using Elastic for this project.

Will It Tip Over

Once live, the system would have tens of terabytes of data spanning 40+ billion documents. Each document would have over a hundred attributes. There would be tens of millions of documents added every day. Each one of the planned Elastic blades has 20TB of SSD storage, 256 GB RAM and 48 cores (hyper).

While we knew Elastic had all the great features we were looking for, we were not too sure if it would be able to scale to work with this volume (and velocity) of data. There are a host of non-functional requirements that arise naturally in financial systems which have to be met. Let’s limit our discussion in this post to performance – response time to be specific.

Importance of Schema

Getting the fundamentals right is the key.

When we initially set up Elastic, we turned on strict validation of fields in the documents. While this gave us a feeling of security akin to what we’re used to with relational systems (strict field and type checks), it hurt performance.

We inspected the content of the Lucene index Elastic created, using Luke. With our initial index setup, Elastic was creating sub-optimal indexes. For example, in places where we had defined nested arrays (marked index=”no”), Elastic was still creating hidden child documents in Lucene, one per element in the array. The documentation explains why, but those documents were still using up space even though we can’t query the property via the index. Because of this, we switched the “dynamic” index setting from strict to false. Avro helped ensure that each document conformed to a valid schema when we prepared documents for ingestion.
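A sketch of what such a mapping could look like with “dynamic” set to false, in Elastic 2.x style; the type and field names are illustrative, not our actual schema:

```python
import json

# Index mapping sketch: "dynamic": false accepts unmapped fields without
# indexing them (Avro validates the schema upstream), and the nested
# array is kept out of the index entirely. All names are hypothetical.
mapping = {
    "mappings": {
        "transaction": {
            "dynamic": False,
            "properties": {
                "txnId":  {"type": "string", "index": "not_analyzed"},
                "amount": {"type": "double"},
                "lineItems": {"type": "object", "enabled": False},
            },
        }
    }
}
body = json.dumps(mapping)  # what you'd PUT when creating the index
```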

A shard should have no more than 2 billion parent plus nested child documents, if you plan to run force merge on it eventually (Lucene doc_id is an integer). This can seem high but is surprisingly easy to exceed, especially when de-normalizing high cardinality data into the source. An incorrectly configured index could have a large number of hidden Lucene documents being created under the covers.

Too Many Things to Measure

With the index schema in place, we needed a test harness to measure the performance of the cluster under different load conditions, configurations, and query patterns. Taken together, the dimensions total more than 150 test scenarios, and sampling each by hand would be near impossible. jMeter and BeanShell scripting really helped here: we auto-generated scenarios from code and had jMeter sample each one hundreds of times. The results were then fed into Tableau to help make sense of the benchmark runs.

  • Indexing Strategy
    • 1 month data per shard, 1 week data per shard, 1 day data per shard
  • # of shards to use
  • Benchmark different forms of the query (constant score, filter with match all etc.)
  • User’s Transaction Volume
    • Test with accounts having 10 / 100 / 1000 / 10000 / 1000000 transactions per day
  • Query time range
    • 1 / 7 / 15 / 30 days
  • Store source documents in Elastic? Or store source in a different database and fetch only the matching IDs from Elastic?

Establishing a Baseline

The next step was to establish a baseline. We chose to start with a single node with one shard. We loaded a month’s worth of data (2 TB).

Tests showed we could search and get back 500 records spanning 15 days in about 10 seconds using just one node. This was good news, since it could only get better from here. It also showed that a single Elastic (Lucene segments) shard can handle 2 billion documents indexed into it, more than what we’ll end up using.

One takeaway was that a high segment count increases response time significantly. This might not be obvious when querying across multiple shards, but it is very obvious when there’s only one. Use force merge if you have the option (offline index builds). Using a high enough value for refresh_interval and translog.flush_threshold_size lets Elastic collect enough data into a segment before a commit; the flip side is increased latency before the data becomes available in search results (we used 4 GB and 180 seconds for this use case). We could clock over 70,000 writes per second from just one client node.
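For reference, the two settings just mentioned, with the values from this use case, could be expressed as an index-settings fragment (sketched here as a plain dict; exact setting paths should be checked against your Elastic version):

```python
# Batch more data per segment before a commit, at the cost of data
# taking up to 180 s to appear in search results.
settings = {
    "index": {
        "refresh_interval": "180s",
        "translog.flush_threshold_size": "4gb",
    }
}
```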

Nevertheless, data from the recent past is usually hot and we want all the nodes to chip in when servicing those requests. So next, we shard.

Sharding the Data

The same month of data (2 TB) was loaded onto 5 nodes with no replicas, each node holding one shard. We chose 5 nodes so that a few nodes remained unused; they would come in handy if the cluster started to falter and needed additional capacity, and for testing recovery scenarios. Meanwhile, the free nodes were used to load data into Elastic and acted as jMeter slave nodes.

With 5 nodes in play,

  • Response time dropped to 6 seconds (40% gain) for a query that scanned 15 days
  • Filtered queries were the most consistent performers
  • As a query scanned more records due to an increase in the date range, the response time also grew with it linearly
  • A force merge to 20 segments resulted in a response time of 2.5 seconds. This showed a good amount of time was being spent in processing results from individual segments, which numbered over 300 in this case. While tuning the segment merge process is largely automatic starting with Elastic 2.0, we can influence the segment count. This is done using the translog settings discussed before. Also, remember we can’t run a force merge on a live index taking reads or writes, since it can saturate available disk IO
  • Be sure to set the “throttle.max_bytes_per_sec” param to 100 MB or more if you’re using SSDs, the default is too low
  • Having the source documents stored in Elastic did not affect the response time by much, maybe 20ms. It’s surely more performant than having them off cluster on say Couchbase or Oracle. This is due to Lucene storing the source in a separate data structure that’s optimized for Elastic’s scatter gather query format and is memory mapped (see fdx and fdt files section under Lucene’s documentation). Having SSDs helped, of course



Final Setup

The final configuration we used had 5-9 shards per index depending on the age of the data. Each index held a week’s worth of data. This got us a reasonable shard count across the cluster but is something we will continue to monitor and tweak, as the system grows.
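One index per week lends itself to a simple time-based naming scheme. The sketch below is purely illustrative (the "txn-" prefix and epoch-week numbering are invented for this example, not our actual index names), but it shows how a 15-day query fans out to at most three weekly indices.

```javascript
// Hypothetical weekly index naming: one index per week of data.
var MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

function weeklyIndexName(date) {
    // Number weeks from the epoch; illustrative, not a real naming convention.
    var epochWeek = Math.floor(date.getTime() / MS_PER_WEEK);
    return 'txn-w' + epochWeek;
}

// Collect the distinct weekly indices a date-range query must touch.
function indicesForRange(from, to) {
    var names = [];
    for (var t = from.getTime(); ; t += MS_PER_WEEK) {
        var name = weeklyIndexName(new Date(Math.min(t, to.getTime())));
        if (names.indexOf(name) === -1) { names.push(name); }
        if (t >= to.getTime()) { break; }
    }
    return names;
}
```

Because old weeks are never written again, they can be force-merged or dropped independently as they age out.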

We saw response times around the 200 ms mark to retrieve 500 records after scanning 15 days’ worth of data with this setup. The cluster had 6 months of data loaded into it at this point.

Shard counts impact not just read times; they impact JVM heap usage due to Lucene segment metadata as well as recovery time, in case of node failure or a network partition. We’ve found it helps Elastic during rebalance if there are a number of smaller shards rather than a few large ones.

We also plan to spin up multiple instances per node to better utilize the hardware. Don’t forget to look at your kernel IO scheduler (see hardware recommendations) and the impact of NUMA and zone reclaim mode on Linux.



Elastic is a feature-rich platform for building search and data intensive solutions. It removes the need to design for specific use cases, the way some NoSQL databases require. That’s a big win for us, as it enables teams to iterate on solutions faster than would’ve been possible otherwise.

As long as we exercise due care when setting up the system and validate our assumptions on index design, Elastic proves to be a reliable workhorse to build data platforms on.

Node.JS Single Page Apps — handling cookies disabled mode


Cookies have been an integral part of web applications since their advent. Since the underlying HTTP protocol is stateless, cookies provide a nifty way to carry stateful information about the user to the web server. With the rise of the Single Page Application (SPA), cookies have become even more instrumental in providing a stateful front end that communicates with a stateless backend. Cookies are commonly used for user authentication, experience customization, tracking users across multiple visits, etc.

All modern browsers today allow cookies to be set by the various domains by default. However, there may be cases in which the user has disabled cookies, or an HTTP client, such as certain web-views, does not allow cookies by default. To make sure that users who access your website from such clients still get a seamless experience, it becomes imperative to handle the “Cookies Disabled Mode”.

This was one of our major considerations while re-architecting the Checkout flow at PayPal last year. How did we solve it? Memcookies!

Let’s break down the problem statement into its constituents:

  1. We need an alternative to the browser’s cookie jar to persist the cookies in the front end.
  2. We need a mechanism to send these cookies as part of every AJAX request to the server and to ensure that the new cookies being set in the response are again persisted in the alternative storage (mentioned in 1 above).

For the first, since our app is a JavaScript-powered heavyweight SPA, the ideal location for the persisted cookies is the front-end JS context. Hence we wrote a client-side JS script that takes care of persisting the cookies on the front end.

For the second, the client-side JS uses a custom HTTP header to send all the cookies as part of every AJAX request from the client. We ended up writing a Node.JS middleware that intercepts every incoming request, reads the custom header, extracts all the cookies and populates them in the Cookie header for downstream middleware to consume.
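The request side of that middleware can be sketched as below. The header name 'x-cookies' and the JSON payload shape are illustrative assumptions for this sketch; the real names live in the memcookies repo. The point is that downstream cookie parsers keep working unchanged, because the standard Cookie header gets rebuilt before they run.

```javascript
// Illustrative request-side middleware: rebuild the Cookie header from a
// custom header sent by the client ('x-cookies' is an assumed name here).
function cookiesFromHeader(req, res, next) {
    var raw = req.headers['x-cookies'];
    if (raw) {
        var cookies;
        try {
            cookies = JSON.parse(raw); // e.g. { sessionId: 'abc', locale: 'en' }
        } catch (err) {
            return next(err);
        }
        // Serialize into the standard Cookie header format for downstream parsers.
        req.headers.cookie = Object.keys(cookies).map(function (name) {
            return name + '=' + encodeURIComponent(cookies[name]);
        }).join('; ');
    }
    next();
}
```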

For setting the response cookies, the middleware again fires whenever the response headers are being sent; it reads all Set-Cookie headers, extracts the cookie name-value pairs and populates them in the custom HTTP header that our client-side script understands.

Note that this approach has certain limitations though:

  1. This approach works only for AJAX requests and not full page POST since it relies on being sent through custom HTTP headers, and being persisted in memory on a single page.
  2. Using the existing cookie header is off limits. It is not possible to set this header for AJAX requests, nor is it possible to read cookies for a response when they are set as HttpOnly. As such, we use an xcookies header instead.
  3. We do not want JavaScript to have free rein over these cookies, given that most of them are designed to be HttpOnly. Hence, we encrypt all cookies on the way out and decrypt them on the way back in, so cookies are now handled transparently on the front-end.
  4. We rely on the front-end to tell us when it’s in cookies disabled mode.
  5. For the initial page render, we provide res._cookies for the renderer to drop on the page. The alternative is making an additional AJAX request, in order to get the xcookies object in headers.
  6. Since the browser can no longer expire cookies on outgoing responses, we can instead do this on incoming requests, by encoding the expiry time into the encrypted cookie value. Also, we can set a hard ceiling on this expiry time of 20 minutes, given that the cookies are only intended to exist until the user is redirected from the page.
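Point 6 can be sketched as follows. This is an illustrative model only: the expiry timestamp is folded into the cookie value and enforced on incoming requests, with the 20-minute hard ceiling applied at write time. The real implementation also encrypts the value; that step is elided here.

```javascript
// Hard ceiling on cookie lifetime, as described in point 6.
var MAX_AGE_MS = 20 * 60 * 1000;

// Fold the expiry into the stored value (encryption elided in this sketch).
function encodeCookie(value, now, requestedExpiry) {
    var exp = Math.min(requestedExpiry, now + MAX_AGE_MS); // enforce the ceiling
    return JSON.stringify({ v: value, exp: exp });
}

// On incoming requests, treat anything past its expiry as absent.
function decodeCookie(encoded, now) {
    var parsed = JSON.parse(encoded);
    return (now <= parsed.exp) ? parsed.v : null;
}
```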

This approach of persisting cookies in the client-side JS context and sending them to the backend server via HTTP headers has scaled pretty well for us. The middleware can be used in front of any cookie parser middleware to enable reading and setting cookies by the app even for cookie-disabled browsers.

Credits to Daniel Brain, who is the original author of memcookies.

You can check out the memcookies middleware and the bundled client-side JS, along with their usage, in the linked repo. Pull requests and issues are welcome for suggestions and/or improvements.

Open Source javascript offerings from PayPal Checkout!


We’ve had a pretty terrible history of failing to open source code from the PayPal Checkout team. It’s way too easy to get caught in the trap of writing modular code but including a lot of domain-specific concerns, and being left with something that is incredibly useful for your team but incredibly unusable for anyone else.

We’re hoping to change that. Which is why today (after a few weeks of getting everything prepared) we’re releasing a number of modules under the PayPal KrakenJS umbrella, and we’re planning on open sourcing more consistently going forward.

Here are some of the modules that we’ve released today. Bug reports, PRs and general feedback are all extremely welcome!

Massive credit goes to Praveen Gorthy, Meng Shi, Joel Chen, Mark Stuart and Anurag Sinha (all awesome current and former PayPal Checkout engineers) for their contributions to these modules. Thanks guys!


This module attempts to make postMessaging between windows, popups and iframes a lot more deterministic, and a lot less fire-and-forget-y.

Post-robot lets you set up a listener for a particular event, which receives a message and then sends a response — much like a simple client/server pattern.

The difference is, instead of your postMessage being sent off to a window, and you just assuming there’s a listener set up that received your message, with post-robot you’ll actually get an error back if your message didn’t get through. Maybe the window was closed, maybe it hadn’t loaded the javascript to set up the listener, maybe it wasn’t on the right domain. With post-robot you’ll actually know, and be able to handle messaging errors gracefully.

It even comes with simplified support for IE9+, which by default doesn’t allow postMessaging between two different windows.

On top of that, you can even send functions in your post-messages, which can be called from one window to another, further bridging the gap between cross-domain windows and frames.

postRobot.on('getUser', function(source, data) {
    return {
        email: ''
    };
});

postRobot.send(myWindow, 'getUser').then(function(data) {
    console.log('User is',, '!');
}).catch(function(err) {
    console.error('Error getting user:', err.toString());
});

This module is designed to let you do logging on the client side without worrying about sending hundreds of requests to your server.

There are a few modules out there which let you collect errors from the client side and publish them to the server, but beaver-logger will allow you to arbitrarily log as many things as you like, buffer them on the client side, and then periodically flush them to your server in a combined request.

It comes with a simple http endpoint for node/express, which just logs to the console — but you’re free to add whatever kind of logging you like and plug it in. You can even implement a publishing endpoint for other web stacks, and only use the client-side logger. It’s totally up to you.
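The buffer-and-flush idea can be sketched in a few lines. This is not beaver-logger's actual API, just a minimal model of the pattern: log calls append to an in-memory buffer, and a flush hands the whole batch to a single publish function (an HTTP POST to the logging endpoint in the real module).

```javascript
// Minimal buffered-logger sketch (illustrative, not beaver-logger's API).
function createBufferedLogger(publish) {
    var buffer = [];
    return {
        log: function (level, event, payload) {
            buffer.push({ level: level, event: event, payload: payload || {} });
        },
        flush: function () {
            if (!buffer.length) { return; }
            var batch = buffer;
            buffer = [];
            publish(batch); // one combined request instead of one per log line
        }
    };
}
```

In the real module, flushing also happens on an interval and on window unload, so logs are not lost when the user leaves the page.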

We think that client side logging is absolutely essential for figuring out which use-cases your users are going through, what errors they’re getting, and generally what they’re doing on the black-box that is the client side. This logger puts that information back in your hands.

It also optionally logs performance data from your browser, front-end page transitions and loading times, and window unloads, to help you get an idea of your app’s performance.

$'user_logged_in', { email: });
$logger.error('window_error', { err: err.stack || err.toString() });


This module is designed as a toolkit for building cross-domain components, in iframes and popup windows. Think of things like PayPal Checkout in an iframe/lightbox, or Facebook comments.

The problem of same-domain components is mostly solved by frameworks like React, Angular, and Ember. But if you want to write a component that lives on one domain, and is presented on another, with all of the cross-domain sandboxing provided by browsers for iframes and windows, you’re pretty much out of luck.

You need to deal with creating iframes, handling post messages, dealing with error cases, and writing custom javascript that lives on your users’ websites to initialize whatever component you want to show.

xcomponent makes this all really easy. You set up your component’s interface: what tag is it going to use, what props is it going to take, what size is it going to be — and it handles creating the integration points for you. Now, whoever uses your component just has to call your component with whatever props you define, and it will set itself up on their page.

This way, you don’t need to think about sending post-messages or creating iframes. Whoever uses your component just gives you some data and functions in props, like a React component, and you can access any of those props or call any of those functions directly from your page as if you were on the same domain. This gives your component a really clean and easy-to-reason-about interface.

It will even set up your component for you in popular frameworks like React and Angular — so anyone can easily add it to their page, no matter what technology they’re using!

module.exports.loginComponent = xcomponent.create({
    tag: 'log-in',
    props: {
        defaultEmail: {
            type: 'email',
            required: false
        },
        onLoginSuccess: {
            type: 'function'
        }
    },
    dimensions: {
        width: 400,
        height: 200
    }
});

loginComponent.render({
    props: {
        defaultEmail: '',
        onLoginSuccess: function(email) {
            console.log('User logged in with email:', email);
        }
    }
});

This module is an attempt to provide a framework-agnostic version of Angular’s ngMock. We find this module really useful, but we figured there’s no reason this stuff should be restricted to Angular. It’s incredibly useful for testing synchronously and deterministically.

Sync-browser-mocks will, essentially, let you write synchronous tests which make use of window.Promise, window.XmlHttpRequest, window.setTimeout and window.setInterval. Instead of waiting for these async things to complete asynchronously, you can simply flush the results from any queued timeouts, http requests or promises, and catch any errors in a synchronous fashion, which makes testing a lot more straightforward and deterministic.

setTimeout(function() {
    console.log('This will happen at some point in the future');
}, 1000);

setTimeout.flush(); // Actually no, it'll happen now!

var myEndpoint = $mockEndpoint.register({
    method: 'GET',
    uri: '/api/user/.+',
    handler: function() {
        return {
            name: 'Zippy the Pinhead'
        };
    }
});

This module is an (albeit slightly hacky) way to use ES6 modules / imports and exports with Angular, and ditch its DI system. I wrote at length here about why I think Angular DI should be avoided in favor of ES6 modules, and if you’re of a similar mindset, this module is for you!

It enables you to write really clean code and import stuff directly from Angular, without having to set up factories and services, because really, who knows the difference anyway?

import { $q, $http } from 'angular';
import { $foo } from 'some-dependency/foo';
import { $bar } from './some-local-dependency/bar';

export function someHelper() {
    // do something
}

export function someUtil() {
    // do something else
}

— — —

There are a few more modules we open sourced that a couple of my team-mates are blogging about. I’ll share the links soon, but for now they are:


For persisting cookies in cookies-disabled mode


For stateless CSRF protection

Hope you all find these useful! Please let us know if you have any comments or feedback, and keep an eye out for more useful modules!

— Daniel, Engineering Lead, PayPal Checkout Team

Nyx – Lightsout management at PayPal




Increased adoption of cloud-based infrastructure across the industry has brought tremendous improvements in effectively running and managing applications. But most of the industry’s current practices for managing these applications are imperative in nature. In an ever-evolving situation, with increasing demand to better manage these applications, a declarative approach is needed. The ideal declarative system aims to determine the base state of each of the managed applications, monitor them continuously for any induced mutations, and restore them to the desired base state.

PayPal has one of the world’s largest cloud deployments with a wide spectrum of systems varying from legacy to state-of-the-art enabling our global infrastructure to be truly agile. Our uptime is mandated at 99.99% in terms of availability. The idea of a declarative lights out management system to dynamically manage these heterogeneous systems at scale raises a lot of interesting challenges.

Lights out management – A brief history

Administrators initially introduced lights-out management systems as part of the server hardware itself to enable remote management. Administrators could gain access to the server boxes regardless of the underlying operating system’s working status and perform maintenance routines. Technology improvements have poured in over the last decade, paving the way for intelligence in everything. Extending that notion of intelligence to a lights-out management system sounds like a very interesting challenge.

Nyx – PayPal’s lights-out management system

Our journey toward this ambitious idea started with the basic steps of formalizing the scenario and systematically defining all the problems. We named this early-stage project Nyx. The first step is an intuitive inference of each of these problems to understand its components.

At PayPal, we talk scale in its evolved form. The number of systems deployed on our private cloud, and the data being aggregated about those systems, are enormous. It is a very arduous task to identify the various states of our systems and manually invoke corrective actions. Our first task in this process was to establish a baseline for the various dimensions of data that we collect about the systems. We then act upon the systems based on effective analyses of that data and perform dynamic corrective actions. Let us understand this in detail.

Determining the base state

Characterizing the base state for such a varied set of systems exposes a duality in the process. Certain properties, like the desired code version for a system or node health, are very straightforward to determine. A change in such properties can be easily detected and the necessary corrective actions triggered. But consider the case of a more extensive property, like a service level agreement (SLA), and the dynamic maintenance of the desired SLA for all systems.

In this case, the first step is to define the SLA value and convert it into other dimensions representing parameters like compute flavors (whether homogeneous or heterogeneous sizing is needed), horizontal scaling level, response time and more, as per the requirement. In such cases, it is not straightforward to determine the base state for a system under consideration and proactively take corrective actions.

As a start, we took into consideration the intensive properties of a system and defined a model to represent a base state. This section gives you an overview of each of these components, and the next section discusses identifying outliers given these base states.


Reachability

The reachability status of every node in a system can be used collectively as one of the parameters to determine its quality of service. A node’s reachability is determined using the responses obtained from three utilities: ping, telnet and a load balancer check. When the reachability routine starts, these utilities are cascaded one after the other in the given order.

Concluding reachability status using only the plain ping utility is not reliable, because there is a relatively high chance of packets getting dropped along the way. Telnet is used as an added guarantee on top of ping’s result. This is still not good enough: in the event of a network partition, for example, there is a fair chance of Nyx itself getting quarantined from the rest of the servers. As a defensive check against false positives, the result from the corresponding load balancer on each segment of our network is verified against the previous ping and telnet results. Only when the load balancer clearly marks the node as “down” and the previous utilities’ results are also negative is the node determined to be unreachable and an alarm raised.

Code version

The code version check is a primary use case for this system. A regular deployment job executed on ‘n’ nodes updates a central database with the active version of code that was deployed. This central database is considered our source of truth for the desired code version (What It Should Be – WISB). Anomalies during a deployment job, or other unexpected events, can mutate an entire system or a specific node, so that the actual code version running on the nodes (the ground truth, What It Really Is – WIRI) differs from the desired one. The difference between the WISB and WIRI for a given node can be used to determine whether we need to execute any remediation on the node.

In the case of code version, the ideal condition is when the WISB and the WIRI remain equal. Our system identifies the difference between the WISB and WIRI for every node in a system and, on finding a difference, executes a code deployment with the correct version of code. An immediate item of future work for this use case is to identify the intersecting set of problems across multiple deployment failures, effectively classify the underlying cause of failure and execute case-specific remediation.
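The WISB/WIRI comparison itself is a simple diff. The sketch below is illustrative (names and data shapes are invented): given the desired versions from the central database and the actual versions reported from the nodes, it returns the nodes that need a redeploy.

```javascript
// Diff the desired state (WISB) against the observed state (WIRI) and
// return the nodes whose running code version needs remediation.
function nodesNeedingRemediation(wisb, wiri) {
    return Object.keys(wiri).filter(function (node) {
        return wiri[node] !== wisb[node];
    });
}
```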

Runtime metrics

State changes happen unexpectedly, before or after a system is provisioned and starts serving traffic. Runtime data from every node of a system, including error and latency numbers, gives information about the performance of the entire system. But it is very hard to determine an outlier from such variables alone. This section discusses the algorithms used to find outliers in runtime metrics using these multiple variables, along with certain drawbacks:

Cluster based outlier identification

Considering the runtime metrics, it is relatively more complex to determine outliers. We have used the data-clustering algorithm DBSCAN (density-based spatial clustering of applications with noise), as it proves efficient for our case with its unique properties.

Consider a set of points in some space to be clustered. For the purpose of DBSCAN clustering, the points are classified as core points, (density-)reachable points and outliers, as follows:

  • A point p is a core point if at least minPts points are within distance ε of it, and those points are said to be directly reachable from p. No points are directly reachable from a non-core point.
  • A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi (so all the points on the path must be core points, with the possible exception of q).
  • All points not reachable from any other point are outliers.


Source : Wikipedia. Author : Chire

In this diagram, minPts = 3. Point A and the other red points are core points, because each has at least three points within an ε radius. Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor density-reachable.

The reason for using DBSCAN instead of k-means is that the number of clusters present in the data for each metric is not known up front. DBSCAN finds the number of clusters starting from the estimated density distribution. This is ideal for our situation, where an outlier is identified as a value that remains apart from any cluster of nodes (which are considered to be in a healthy state) and the number of such normal clusters is not known beforehand.
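To make the definitions above concrete, here is a compact DBSCAN over a single metric (1-D values, e.g. CPU usage per node). This is an illustrative sketch, not the implementation Nyx uses; epsilon and minPts values are placeholders.

```javascript
// Minimal 1-D DBSCAN sketch: returns the values labelled as noise (outliers).
function dbscanOutliers(values, epsilon, minPts) {
    var labels = new Array(values.length).fill(0); // 0 = unvisited, -1 = noise, >0 = cluster id
    var cluster = 0;

    function neighbours(i) {
        var result = [];
        for (var j = 0; j < values.length; j++) {
            if (Math.abs(values[i] - values[j]) <= epsilon) { result.push(j); }
        }
        return result;
    }

    for (var i = 0; i < values.length; i++) {
        if (labels[i] !== 0) { continue; }
        var seeds = neighbours(i);
        if (seeds.length < minPts) { labels[i] = -1; continue; } // provisionally noise
        cluster += 1;
        labels[i] = cluster;
        for (var k = 0; k < seeds.length; k++) {
            var q = seeds[k];
            if (labels[q] === -1) { labels[q] = cluster; }       // border point, reclaimed
            if (labels[q] !== 0) { continue; }
            labels[q] = cluster;
            var qNeighbours = neighbours(q);
            if (qNeighbours.length >= minPts) {                  // q is core: keep expanding
                seeds = seeds.concat(qNeighbours);
            }
        }
    }
    return values.filter(function (_, idx) { return labels[idx] === -1; });
}
```

Run per metric, a node sitting far from every dense cluster of healthy values comes out labelled as noise and becomes a candidate for remediation.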

The algorithm is executed on each node’s individual variables: CPU usage, memory usage and latency. The outliers marked in the previous step can then be acted upon to reinstate the nodes. (The epsilon value for the algorithm was calculated based on the domain under consideration.) The notion of outlier would be reversed in a situation where the nodes that form a cluster are actually the outliers, and the points not included in any cluster are the ones behaving normally. We avoid such misinterpretation, and false remediation, with the help of an overall threshold as described below. In such exceptional cases of unexpected behavior, the system requires manual intervention.

Threshold based outlier identification

This approach is fairly intuitive from an implementation standpoint. Identifying outliers based on threshold values can work for systems of smaller scale, but determining and maintaining thresholds for thousands of applications is cumbersome, which makes it an unsustainable solution on its own.

Our next priority in this routine is developing a method that is an ensemble of these two approaches. The threshold value patterns generated from the current system will serve as training data for the ensemble model to better classify outliers.



Signaling Agents

The signaling agents run periodically, each with its own function: unreachability, code version and outliers. These agents execute their routines and push the status of each node onto the common messaging bus, from which the rest of the system acts.

Message bus

The messaging bus serves as the common information aggregation point for each component in the system. We have used Kafka and ZeroMQ for the messaging system, enabling a simple producer-consumer architecture.


Executors

Like the signaling agents, there are various types of reactors that listen to the messaging bus for commands: restart, replace and deploy executors. Executors listen to the messaging bus for their respective command type, to be moved forward as jobs ready for execution. In the entire flow of generating signals, converting them to commands, aggregating commands, the safety check and job scheduling, information about every action is persisted in the control plane database for auditing purposes.

Nyx Core

The Nyx core component of the system includes a set of subsystems with distinctive roles and follows a symmetric design approach. The initializer fires up the bare minimum components to coordinate and waits for the ensemble to come up. When it does, a leader election is initiated among the systems, followed by splitting the entire list of available pools across the followers to ensure proper load balancing among the control plane systems. Each Nyx system maintains an in-memory view of the applications that belong to it. The leader takes care of Nyx itself by monitoring the machines, handling request proxies and the auxiliary functions needed for Nyx to function. The followers, on the other hand, take care of the applications being monitored and maintain the desired state.

Another notable characteristic of the Nyx systems is self-healing. When any one of the follower systems goes down, the pool splitting action is re-triggered, the load is split among the available machines, and the in-memory model is rebuilt accordingly. When the leader itself goes down, the other systems in the group trigger the leader election again, followed by pool splitting. The machines in the core listen to the messaging bus for updates from the signaling agents and convert the signals into executable commands.

Each individual command, which targets a single node, is then aggregated by pool using a command compactor module in the core. This ensures that actions on machines belonging to the same pool are executed at the same time rather than scheduled individually, before being passed on to the safety check routines.


Prediction flow

A better way to avoid unexpected interruptions is to be aware of the conditions under which the worst is about to happen. All the components described above are ideal for remediating ailing systems, but executing preventive actions is always the more efficient way to ensure availability. As a straightforward example, it is far less expensive to spin up a few instances when increasing traffic is observed than to spin them up after the old instances go down because of overload.

We have used two simple ways to identify probable anomalies in the near future for various runtime metrics. The signals generated by all the signaling agents are pushed onto a separate queue for prediction. The first component in the prediction flow is a regression module that tries to fit the best possible curve to the recently observed values in the past window. The values for the immediate next window of all metrics are predicted using this fit (with a margin of error).
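The first component can be sketched with ordinary least squares. The real module fits "the best possible curve"; a straight line keeps this illustrative sketch small while showing the idea of extrapolating one step past the observed window.

```javascript
// Fit a least-squares line to the window of recent values (x = 0..n-1)
// and extrapolate to the next time step (x = n).
function predictNext(window) {
    var n = window.length;
    var sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
    for (var x = 0; x < n; x++) {
        sumX += x; sumY += window[x];
        sumXY += x * window[x]; sumXX += x * x;
    }
    var slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    var intercept = (sumY - slope * sumX) / n;
    return slope * n + intercept; // predicted value for the next window
}
```

The predicted value is then compared against the configured thresholds; a projected violation generates a preventive action instead of a remediation.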

The second component is a collection of basic feed-forward networks (from the Java Encog library), trained on runtime metrics data from the past four years collected from our systems. Each metric’s training data was given to four networks: “hour of the day”, “day of the week”, “day of the month” and “month of the year”. Every network is essentially given time series data from which it is trained to predict the possible value of its metric in the next window.

The predicted value from both the components is used together to check for possible threshold violations as defined in the core systems and the preventive actions are generated accordingly.

Safety checks and balances

Systems focusing on remediation need to make sure not to induce further damage to an already ailing application. We introduced various safety checks throughout the Nyx system to avoid such risks. Let us look at each of the safety checks in the system:

1. Consider a situation where six machines out of ten in a pool need a simple restart because of identified outliers in CPU usage. However, they are serving a considerable amount of business-critical traffic at the moment. Bringing down more than 50% of the machines in a pool, even for a brief period of time, is not a good idea. So we enforce a safety threshold on actions taken on every system, and this command would be shrunk down to include only the nodes within the safety range. All the other nodes in the command are persisted for auditing purposes but dropped from execution. Every command executed is thus ensured to be within its safety limits, and the core system waits for another set of signals from the dropped set of machines before acting on them, again with safety limits in place.
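The shrinking step in check 1 amounts to capping a command at a fraction of the pool. The sketch below is illustrative (the function name, shapes and the 50% ratio are for this example only): nodes beyond the cap are deferred, matching the audit-but-drop behavior described above.

```javascript
// Cap a command so at most a fixed fraction of the pool is acted on at once.
function applySafetyLimit(poolSize, targetNodes, maxFraction) {
    var limit = Math.floor(poolSize * maxFraction);
    return {
        execute: targetNodes.slice(0, limit),  // acted on now
        deferred: targetNodes.slice(limit)     // audited, picked up on later signals
    };
}
```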

2. Checks to detect redundant jobs being executed on the same set of nodes are also in place. The core system checks our internal jobs database to ensure that the frequency of jobs executed on the same set of machines is not high. If this check fails, the command is persisted in the database, dropped, and a notification sent: if the same job keeps failing on the same set of machines, the underlying problem needs to be investigated, so our administrators are notified with a history of actions. Commands that pass the check are pushed back onto the messaging bus for the executors to act upon.

3. Mutual exclusion and atomicity are other two safe guards in place to prevent concurrent activities from being executed on the same application. For instance, we don’t want to execute a restart action on a node where a deployment is already in progress. We have achieved this by using distributed locks.

4. After executing a command for a system, a simple check measures its effect by looking for signals in the next batch corresponding to the same system. The overall health of the system is also measured before and after we execute commands on it. A warning alert is raised if signals are still observed after the commands were executed. The redundant jobs check then kicks in to prevent execution of the same commands on the system after a couple of tries.

Execution strategy

To prevent multiple failures at the same time during execution of generated commands, a basic strategy is applied for each type of command. Consider a command that acts upon 10 nodes in a system. The nodes are split into ‘x’ groups with a specified number of nodes in each. The command is executed in parallel on all the nodes in one group, but the groups themselves are cascaded one after the other. The success rate for each group is measured, and execution of the subsequent group is triggered only if a certain success-rate threshold is crossed on the previous group. This information is used by the safety check components as well to decide their triggers during the execution flow.
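The cascade can be sketched as below. This is an illustrative model, not Nyx's code: each group runs (in parallel in the real system, sequentially here for clarity), and the next group only starts if the previous group's success rate clears the threshold.

```javascript
// Execute nodes in cascaded groups; halt if a group's success rate falls
// below the threshold. runOnNode returns true on success for a node.
function executeInGroups(nodes, groupSize, runOnNode, successThreshold) {
    var completed = [];
    for (var i = 0; i < nodes.length; i += groupSize) {
        var group = nodes.slice(i, i + groupSize);
        var successes = group.filter(runOnNode).length;
        completed = completed.concat(group);
        if (successes / group.length < successThreshold) {
            return { completed: completed, halted: true }; // stop the cascade
        }
    }
    return { completed: completed, halted: false };
}
```

A halted cascade leaves the remaining groups untouched for the safety checks and administrators to examine, rather than rolling a bad change across the whole pool.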

Ground truth – Test the waters

We ran the system for a few weeks in manual mode to determine the reliability of generated commands. An administrator looks at each generated command in the UI and either schedules the job or cancels it. Recently, we turned the auto-execute mode on and things are looking good. Here is a screenshot from our 1k remediations milestone.


We have collected vast amounts of data about the conditions under which a job was executed or cancelled during the manual mode. Going forward, our plan is to feed this data to a learning system that can better classify the actions that need to be taken on generated jobs based on their features, thereby taking the first step towards a truly intelligent lights-out management system.


Hour of Code @ PayPal


On Dec 12th, as part of the Hour of Code initiative, I taught programming to a room full of 5- to 11-year-old children of PayPal employees.
I have to say, I’m completely blown away by them.

I thought I’d have to do a lot of teaching. However, once I showed them how the basic structures work and how to put a program together, they took off!

I taught them about concurrent programming and debugging, disguised as a session on doing a dance animation. We started with post-it notes and a whiteboard to show them that computers are really dumb, and they only know how to follow the instructions that we give them. The purpose here was to show that *we* write the instructions, and the computer just runs them.



(Special thanks to our programmable robots Aanan and Pavani)

Once they got comfortable with these concepts, we then moved to the laptops. We used Scratch, a visual programming language, to create a dance animation. The kids got to choose the backdrop, the sprites, and the dance steps. Finally, they got to set the order in which to execute them, as well as create loops in the sequences and add sounds and events to the animation (e.g., onClick do XYZ).

Again, I was impressed with how quickly they took to it and with the diversity of ideas and creativity on display!
Our youngest coder Risha, pictured below, was 5 years old!


At the end of the session I shared a bit of my story of how I got into software development at a young age. A few participants got to show their work to the rest of the class. Their pride in craftsmanship was obvious. They glowed when they got to show the class and their parents what they’d created!


Overhearing so many kids telling their parents that they wanted to continue playing with this project at home makes me extremely proud.

These are my key takeaways from teaching this class:

  • Programming shouldn’t be seen as a big scary thing. It’s not, and these kids prove it.
  • *All* kids should be exposed to programming (even if disguised as a game) from an early age. They get it, and this is the new literacy for this generation.
  • A bit obvious, but worth stating: gender and age were completely irrelevant. All kids did great!

There is a lot of demand for these types of classes and introductory sessions. Mine filled to capacity within 24 hours of the announcement. As technologists and proud craftspeople, we should all be making an effort to educate others. Education is the best way to contribute back to society, and while we, especially in the Valley, might take programming as a given, there really aren’t that many people who can teach young minds about it. I really do encourage other folks to host their own HourOfCode session!


Secure Authentication Proposal Accepted by W3C


Today the World Wide Web Consortium (W3C) accepted a submission of proposed technical work from W3C members PayPal, Google, Microsoft, and NokNok Labs. This submission consists of three draft specifications initially developed by the FIDO Alliance to facilitate browser support for replacing passwords as a means of authentication on the Web with something more secure. It is expected that the W3C will take these draft documents as a starting point and, through its standard process, evaluate, enhance, and publish them as W3C Recommendations. The goal is for the final specifications to be implemented by Web browsers. With a common framework available in all browsers, Web developers will be able to rely on a secure, easy-to-use, and privacy-respecting mechanism for passwordless authentication.

The catalyst for this work is the username/password paradigm for authentication, which has well-known issues (see links below) that have become exacerbated with its widespread use by Web sites. Millions of users of various companies across the world have been subjected to account takeovers, fraud, and identity theft as a direct result. While more secure methods of authentication are available, they have proven too expensive and/or too difficult to use to garner widespread adoption. The members of the FIDO Alliance recognized the need for an authentication paradigm shift and have developed a framework and specifications to support eliminating passwords.

From the outset, the FIDO Alliance recognized that significant, multistakeholder support would be required in order to effect Internet-scale change. The organization worked diligently to convince relying parties, technology vendors, and hardware manufacturers of the need to work cooperatively to address the challenge of replacing passwords. Today the FIDO Alliance includes 250 members and, with today’s acceptance by the W3C, the organization is delivering on its promise to enable platforms with open, free-to-use specifications for passwordless authentication.

The journey is far from over, but the development of the specifications and their acceptance by the W3C are important steps toward improved, easy-to-use, secure authentication. This is yet another example of how we continually strive to improve security not just for our own customers, but for all users of the Web.