Autoscaling applications @ PayPal

By , and

PayPal’s infrastructure has grown tremendously over the past few years hosting multitudes of applications serving billions of transactions every year. Our strong belief in decoupling the infrastructure and the app development layer enables us to independently evolve quicker. With more than 2500 internal applications and more than 100k instances across our entire infrastructure, we are committed in ensuring high availability of our applications and optimal resource utilization.

In our previous post, we discussed in detail about Nyx, an automated remediation system for our infrastructure. The underlying modular architecture of Nyx is designed to listen to a spectrum of signals from different agents and quickly remediate the identified anomalies. This enables us to vertically scale more agents in terms of functions as well. The core module could also be easily extended to understand new types of signals. We have also reached a milestone of more than a million auto-remediations across our infrastructure.

In our continued efforts to ensure effective resource utilization for our infrastructure, we have extended the functionality of Nyx to enable auto-scaling of our applications. The problem of optimizing cost remains an interesting challenge with our journey into the public cloud as well. The main utility of this addition is to establish an automated process to understand the throughput of one instance in an application and then right-sizing in terms of number of computing instances for the current traffic trend. Because of the variation in traffic trends adding to the problem, an intrinsic way of dynamically quantifying the throughputs is necessary. And this enables all the applications at PayPal to remain truly elastic responding to such varying trends.


Autoscaling in general could be done either in a reactive or predictive way. Reactive autoscaling involves in quickly increasing or decreasing the cluster size to respond to the changes in traffic load. It is effective in cases to react to high velocity changes provided the mechanisms to provision and warm-up the newly added instances are efficient. A simple way to do this is to let the app developers define the thresholds using which instances could be scaled up or down.

Predictive way of autoscaling essentially attempts to model the underlying trend and predict the changes to better scale the applications. It is helpful in cases where the provisioning mechanisms and frameworks would need more time to warm up before serving incoming requests. However, it might not be very useful during surges in incoming requests. Other variations include an ensemble of these two approaches but with efficiency numbers that are very context specific.

Autoscaling @ PayPal

The autoscaling engine for applications hosted on PayPal’s infrastructure is a variant of the ensemble approach. A baseline of assumptions was established after understanding the various constraints for the system:

1. Complex topology – Emphasizing on the point that we host a huge number of applications with a wide variety of characteristics and dependencies. An assumption to correlate overall incoming traffic to scale every app in the structure seems straightforward but not the optimized solution to effectively right size every application cluster.

2. Developer isolation – Relying on app development team to establish throughput numbers proves to be ineffective. Because the assumption to right size the environment is to calculate the throughput of an instance for the current traffic trend and code version continuously. To do this, the app development team would need to define more processes to simulate production environments that voids the purpose of isolating them from the infrastructure related details.

3. Frequency of right sizing – Autoscaling could be triggered on an application pool to better serve the incoming traffic trend over a defined period. Say we would want a pool to be right sized for the next week with a decreasing traffic trend. However, we could observe varying patterns in each day itself with peaks occurring at specified periods within a day. We could either scale up and down for a given day or scale once to address the maximum traffic for the upcoming week itself. But there will be few resources under-utilized this way. We also need to take in consideration the underlying provisioning and decommissioning mechanisms to determine the frequency of right sizing. Hence, we need an effective way of establishing the right sizing plan after the throughput is established for the input traffic pattern.

Examples of incoming traffic pattern of two different applications (one week)

We have split the work into two parts:

  1. Quantifying an instance’s throughput in an application in a dynamic way.
  2. Dynamically determining an autoscaling plan.

We will be discussing only about the first part in this post.

Unit instance throughput calculation

The first step in right sizing is to determine the unit throughput of an instance in the application cluster. A standard way to approach this problem is to perform stress test to determine the throughput values of an instance serving the application. However, there are many drawbacks in this approach. The development team is required to mock their production environments and generate test scripts as well to automate the process. It is very hard and would add a lot of overhead to the development team to replicate the stochastic nature of production environments. So we had to define a simpler way to isolate them from these problems.

Our approach 

Our approach involves in safely utilizing the production environment itself to determine the throughput of one instance in an application cluster. To put it very briefly, our throughput test run selects one instance from the cluster running in production serving live traffic behind a load balancer. We then gradually divert an increased amount of traffic to this chosen instance relative to the other instances. Various metrics are monitored for this instance to observe the state at which it becomes an outlier in comparison to the other instances. This state is defined to be the maximum threshold beyond which there would be a negative impact to the end user experience. And we could determine the maximum throughput based on the amount of traffic diverted to this instance right before this state. We will discuss in detail about every step involved in this process:

  1. Defining a reference cluster and choosing a candidate instance
  2. Pre-test setup
  3. Load increase and monitoring
  4. Restoration

Entire workflow of determining unit instance’s throughput value.

Defining a reference cluster and choosing a candidate instance

The throughput test runs as one of the Nyx agents. The agent starts with picking up one of applications running behind a datacenter’s load balancer. In this exercise, the state of an application is defined using four metrics: cpu, memory, latency and error count. The workflow begins with identifying outlier instances in terms of any of these metrics using the DBSCAN algorithm. After removing outliers, the remaining instances forms our reference cluster. We need to ensure that all the instances that formed the reference cluster remain clustered together throughout our exercise. One candidate instance is chosen from this cluster to run our exercise.

Snapshot of app state after defining a reference cluster and choosing a candidate instance

Pre-test setup

The load balancer running on top of these instances route traffic in a round robin fashion. The load balancing method is changed to ratio-member to increase traffic to the candidate instance in steps. Also, certain safety checks are put in place to ensure smoother testing. We lock the entire pool from external modifications to avoid mutations while running the tests. We also have quick stop and restore options ready as we are dealing with live traffic. We ensure that the system is ready for a hard stop and restore during any point of the workflow should there be any problem.

Load increase and monitoring

After every step increase in weight of traffic load being diverted to the candidate instance, the four metrics are monitored continuously for any change in pattern with a hard check on latency. PayPal infrastructure defines acceptable thresholds for metrics like CPU, memory and we ensure relativistic thresholds using DBSCAN as well. We monitor the candidate instance for a specific period post the increase in traffic for any deviations in the above-mentioned metrics. We also store the state of the candidate instance post every increase during the entire monitoring session.

If there are no deviations then we could determine that the increase in traffic could be handled by the candidate instance and we could proceed with further increase. If there is a significant change in (or deviations beyond set max thresholds) for any of the monitored metrics then we define this state to be the end of our test. We also have a threshold on the load balancer weight to define a finite horizon over the step increases of load on the candidate instance.

Snapshot of host latencies at traffic 2x

Snapshot of host latencies at traffic 3x


The result of our throughput test could be two: the chosen candidate fails to handle the increase in traffic or the chosen candidate exhausts the number of weight increases on the load balancer. If the verdict is that the candidate instance fails to handle the load after a weight increase, the state before which the failure occurred is used to observe the throughput. In the second case, the last state at which the weight updates get exhausted determines the throughput. All the configuration changes are restored on the load balancer and we enter the post-run monitoring phase to ensure that the candidate instance reverts to its original state.

One of the exercises could be visualized below. We could observe the change in transaction numbers for the candidate instance during the exercise window. And the point at which the change in latency occurs (in this case) is when we end the monitoring phase and enter restoration.

TPM graph during the exercise – Candidate host clearly serving more transactions compared to its peers without affecting end-user experience.

Calculating required capacity

We will be discussing in the next post about determining auto-scaling plans post the unit compute throughput calculation. However, let us take a brief look about how we could establish the final capacity of an application cluster which is fairly straightforward at this point.


Peak throughput for the application = 1200

Unit compute throughput = 100

If AF = 1.5 then Final required instances (rounded) = (1200 / 100) * 1.5 = 18


Let us consider a case where the candidate instance could take all the traffic diverted towards it. However, the test exits after reaching a hard limit on the number of weight increases that could be done on the candidate instance. In this case, we could intuitively observe that the numbers obtained as a result of this exercise might not necessarily reflect the actual maximum throughput of an instance. However, we have observed after many runs in production that such situations are the result of input traffic itself being very low or it is a “well written” application. If our tests were to define an end condition only when the candidate instance becomes an outlier in these conditions, we might be running these tests forever. We could also encounter sudden increases in latency for such applications during the tests where we might cause a momentary disruption because of very high stress loads on the candidate instances. Hence, we need to introduce a definite stopping point while running this test for the previously mentioned situations. And we could clearly see that the throughput numbers obtained as a result of this exercise clearly reflects near perfect values.

These exercises are running in production 24×7 with at most 10 applications getting profiled at a time to avoid causing CPU spikes on the load balancers. And since the applications are constantly profiled, new code versions result in new throughput numbers which solves the problem of dealing with stale numbers.

Our approach works as we receive enough traffic at any time to saturate one instance for it to be pushed out as an outlier in comparison to other instances in the pool. So, the advantage of this process is that despite the time at which the exercise is run, the throughput number remains the same (peak traffic time or normal) for a given version of the code.

We will look at dynamically determining the auto-scaling plan in the next post.