Nyx – Lightsout management at PayPal
Increased adoption of cloud-based infrastructure across the industry has brought tremendous improvements in how applications are run and managed. But most current practices for managing these applications are imperative in nature. In an ever-evolving environment with an increasing demand to manage these applications better, a declarative approach is needed. The ideal declarative system determines the base state of each managed application, continuously monitors it for any induced mutations and restores it to the desired base state.
PayPal has one of the world’s largest cloud deployments, with a wide spectrum of systems varying from legacy to state-of-the-art that enable our global infrastructure to be truly agile. Our availability mandate is 99.99% uptime. The idea of a declarative lights-out management system to dynamically manage these heterogeneous systems at scale raises a lot of interesting challenges.
Lights out management – A brief history
Lights-out management systems were initially built into the server hardware itself to enable remote management. Administrators could gain access to the server boxes regardless of the state of the underlying operating system and perform maintenance routines. The last decade has seen a steady stream of technological improvements, paving the way for an intelligent everything. Extending that notion of intelligence to a lights-out management system is a very interesting challenge.
Nyx – PayPal’s lights-out management system
Our journey towards this ambitious idea started with the basic steps of formalizing the scenario and systematically defining each of the problems. We named this early-stage project Nyx. The first step is an intuitive breakdown of each of these problems to understand its components.
At PayPal, scale comes in its most evolved form. The number of systems deployed on our private cloud and the amount of data being aggregated about those systems are enormous. Identifying the various states of our systems and manually invoking corrective actions is an arduous task. Our first task in this process was to establish a baseline for the various dimensions of data we collect about the systems. We then act upon the systems based on effective analysis of that data and perform dynamic corrective actions. Let us understand this in detail.
Determining the base state
Characterizing the base state for such a varied set of systems exposes a duality in the process. Certain properties, like the desired code version for a system or node health, are very straightforward to determine. A change in such properties can easily be detected and the necessary corrective actions triggered. But consider the case of a more extensive property like a service-level agreement (SLA) and the dynamic maintenance of the desired SLA for all systems.
In this case, the first step is to define the SLA value and convert it into another dimension representing parameters such as compute flavors (whether homogeneous or heterogeneous sizing is needed), horizontal scaling level, response time and more, as the requirement dictates. In such cases, it is not straightforward to determine the base state for a system under consideration and to proactively take corrective actions.
As a start, we considered the intensive properties of a system and defined a model to represent a base state. This section gives you an overview of each of these components, and the next section discusses identifying outliers given these base states.
The reachability status of every node in a system can be used collectively as one of the parameters to determine its quality of service. A node’s reachability is determined using the responses obtained from three utilities: ping, telnet and a load balancer check. When the reachability routine starts, these utilities are invoked in a cascade, one after the other in the given order.
Concluding the reachability status from the plain ping utility alone is not reliable, because there is a relatively high chance of packets getting dropped along the way. Telnet is used as an added guarantee on top of ping’s result. This is still not good enough: in the event of a network partition, for example, there is a fair chance of Nyx itself getting quarantined from the rest of the servers. As a defensive check against false positives, the result from the corresponding load balancer on each segment of our network is verified against the earlier ping and telnet results. Only when the node is clearly marked “down” in the load balancer and the previous utilities’ results are also negative is the node determined to be unreachable and alarmed.
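The cascaded verdict described above can be sketched as follows; this is a minimal illustration in Python, not Nyx’s actual code, and the function names are ours:

```python
def is_unreachable(ping_ok: bool, telnet_ok: bool, lb_up: bool) -> bool:
    """A node is alarmed as unreachable only when ping and telnet both
    failed AND the load balancer explicitly marks it down. Requiring all
    three to agree guards against false positives, e.g. a network
    partition quarantining the monitor itself."""
    return not (ping_ok or telnet_ok or lb_up)

def reachability_alarm(host, checks):
    """Run the cascaded checks (ping, telnet, load balancer) in order
    and combine their verdicts. Each check is a callable host -> bool."""
    ping_ok, telnet_ok, lb_up = (check(host) for check in checks)
    return is_unreachable(ping_ok, telnet_ok, lb_up)
```

For example, a node whose ping and telnet probes fail but that the load balancer still marks as up is not alarmed, since the probe failures may reflect a partition rather than a dead node.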
The code version check is a primary use case for this system. A regular deployment job executed on ‘n’ nodes updates a central database with the active version of code that was deployed. This central database is considered our source of truth for the desired code version (What It Should Be – WISB). Anomalies during a deployment job, or other unexpected events, can mutate an entire system or a specific node so that the actual code version running on the nodes (the ground truth, What It Really Is – WIRI) diverges. The difference between the WISB and WIRI for a given node can be used to determine whether we need to execute any remediation on the node.
For the code version, the ideal condition is when the WISB and the WIRI remain equal. Our system identifies the difference between the WISB and WIRI for every node in a system and, on finding a difference, executes a code deployment with the correct version of code. An immediate piece of future work for this use case is to identify the intersecting set of problems across multiple deployment failures, effectively classify the underlying cause of failure and execute case-specific remediation.
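The WISB/WIRI comparison amounts to a diff between two version maps. A rough sketch (the data shapes and command format here are illustrative assumptions, not Nyx’s real schema):

```python
def code_version_commands(wisb: dict, wiri: dict) -> list:
    """Compare the desired version per node (WISB, from the source-of-
    truth database) against the actually running version (WIRI, the
    ground truth) and emit a redeploy command for every divergent node."""
    return [
        {"node": node, "action": "deploy", "version": wisb[node]}
        for node in wisb
        if wiri.get(node) != wisb[node]
    ]
```

A node missing from WIRI entirely (e.g. one that never reported its version) also triggers a command, since `wiri.get(node)` then differs from the desired version.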
State changes happen unexpectedly, before or after a system is provisioned and starts serving traffic. Runtime data from every node of a system, including error and latency numbers, gives information about the performance of the entire system. But it is very hard to determine an outlier from such variables. This section discusses the algorithms used to find outliers in runtime metrics using these multiple variables, along with certain drawbacks:
Cluster based outlier identification
For the runtime metrics, it is relatively more complex to determine the outliers. We have used the data-clustering algorithm DBSCAN (density-based spatial clustering of applications with noise), as its unique properties make it efficient for our case.
Consider a set of points in some space to be clustered. For the purpose of DBSCAN clustering, the points are classified as core points, (density-)reachable points and outliers, as follows:
- A point p is a core point if at least minPts points are within distance ε of it, and those points are said to be directly reachable from p. No points are directly reachable from a non-core point.
- A point q is reachable from p if there is a path p1, …, pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi (so all the points on the path must be core points, with the possible exception of q).
- All points not reachable from any other point are outliers.
Source: Wikipedia. Author: Chire
In this diagram, minPts = 3. Point A and the other red points are core points, because each has at least three points within an ε radius. Because they are all reachable from one another, they form a single cluster. Points B and C are not core points but are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point: it is neither a core point nor density-reachable.
The reason for using DBSCAN instead of k-means is that the number of clusters present in the data for each metric is not known in advance. DBSCAN finds the number of clusters starting from the estimated density distribution. This is ideal for our situation, where an outlier is identified as a value that remains apart from any cluster of nodes (which are considered to be in a healthy state) and the number of such normal clusters is not known beforehand.
The algorithm is executed for each node’s individual variables: CPU usage, memory usage and latency. The outliers marked in the previous step can then be acted upon to reinstate the nodes. (The epsilon value for the algorithm was calculated based on the domain under consideration.) The notion of an outlier would be reversed in a situation where the nodes forming a cluster are actually the misbehaving machines and the points not included in any cluster are the ones behaving normally. We avoid such misinterpretation, and the false remediation it would cause, with the help of an overall threshold as described below. In such exceptional cases of unexpected behavior, the system requires manual intervention.
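To make the noise-point definition concrete, here is a deliberately minimal density-based outlier check in the spirit of DBSCAN (production systems would use a library implementation; this brute-force sketch and its parameter values are ours):

```python
from math import dist

def dbscan_outliers(points, eps, min_pts):
    """Return the noise points: a point is a core point if at least
    min_pts points (itself included) lie within eps of it; noise points
    are neither core points nor within eps of any core point."""
    core = [p for p in points
            if sum(1 for q in points if dist(p, q) <= eps) >= min_pts]
    return [p for p in points
            if p not in core and all(dist(p, c) > eps for c in core)]
```

Run over, say, (CPU%, latency) pairs per node, the dense cluster of healthy nodes absorbs its border points, while a node far from any cluster comes back as noise and is a candidate for remediation.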
Threshold based outlier identification
This approach is fairly intuitive from an implementation standpoint. Identifying outliers based on threshold values can work for systems of smaller scale, but determining and maintaining thresholds for thousands of applications is cumbersome, which makes it an unsustainable solution on its own.
Our next priority in this routine is developing an ensemble of these two methods. The threshold value patterns generated by the current system will serve as training data for the ensemble model to better classify the outliers.
The signaling agents run periodically, with one function per agent: unreachability, code version and outliers. These agents execute their routines and push the status of each node onto the common messaging bus, from which the rest of the system acts.
The messaging bus serves as the common information aggregation point for every component in the system. We have used Kafka and ZeroMQ for the messaging system, enabling a simple producer–consumer architecture.
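The producer–consumer flow can be illustrated with an in-process queue standing in for the Kafka/ZeroMQ topic (the message schema here is an assumption for illustration, not Nyx’s wire format):

```python
import json
import queue

bus = queue.Queue()  # in-process stand-in for a Kafka/ZeroMQ topic

def publish(agent: str, node: str, status: str) -> None:
    """Signaling agents are producers: serialize a status record and
    put it on the bus."""
    bus.put(json.dumps({"agent": agent, "node": node, "status": status}))

def consume() -> dict:
    """Core systems and executors are consumers pulling records off
    the bus."""
    return json.loads(bus.get_nowait())
```

In the real system each agent type would publish to its own topic, and the core would subscribe to all of them.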
Like the signaling agents, there are various types of reactors that listen to the messaging bus for commands: the restart, replace and deploy executors. Each executor listens on the messaging bus for its respective command type to move forward as jobs ready for execution. Across the entire flow of generating signals, converting them to commands, aggregating commands, running the safety checks and scheduling jobs, information about every action is persisted in the control plane database for auditing.
The Nyx core component includes a set of subsystems with distinctive roles and follows a symmetric design approach. The initializer fires up the bare minimum components to coordinate and waits for the ensemble to come up. Once it does, a leader election is initiated among the systems, followed by splitting the entire list of pools across the followers to ensure proper load balancing among the control plane systems. Each Nyx system maintains an in-memory view of the applications that belong to it. The leader takes care of Nyx itself by monitoring the machines, handling request proxies and the auxiliary functions needed for Nyx to function. The followers, on the other hand, take care of the applications being monitored and maintain the desired state.
Another notable characteristic of the Nyx systems is self-healing. When any follower in the group goes down, the pool-splitting action is re-triggered, the load is split among the available machines and the in-memory model is rebuilt accordingly. When the leader itself goes down, the other systems in the group trigger the leader election again, followed by pool splitting. These core machines listen to the messaging bus for updates from the signaling agents and convert the signals into executable commands.
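The pool-splitting step re-run on membership changes can be sketched as a simple deterministic assignment (the round-robin strategy here is our illustrative assumption; the actual balancing policy is not specified):

```python
def split_pools(pools: list, followers: list) -> dict:
    """Distribute the monitored pools across the available followers
    round-robin. Re-invoked whenever a follower joins or leaves the
    ensemble, after which each follower rebuilds its in-memory view."""
    assignment = {f: [] for f in followers}
    for i, pool in enumerate(sorted(pools)):
        assignment[followers[i % len(followers)]].append(pool)
    return assignment
```

When a follower dies, calling `split_pools` again with the surviving members rebalances the full pool list over them, which is exactly the self-healing behavior described above.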
These individual commands, each targeting a single node, are then aggregated by pool using a command compactor module in the core. This ensures that actions on machines belonging to the same pool are executed together rather than scheduled individually, before being passed over to the safety check routines.
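The compaction step is essentially a group-by over pool and action; a hedged sketch (field names are illustrative):

```python
from collections import defaultdict

def compact(commands: list) -> list:
    """Aggregate per-node commands into one command per (pool, action)
    pair, so that a whole pool is acted on in a single scheduled job."""
    grouped = defaultdict(list)
    for cmd in commands:
        grouped[(cmd["pool"], cmd["action"])].append(cmd["node"])
    return [{"pool": pool, "action": action, "nodes": nodes}
            for (pool, action), nodes in grouped.items()]
```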
A better way to avoid unexpected interruptions is to be aware of the conditions under which the worst is about to happen. All the components described above are well suited to remediating ailing systems, but executing preventive actions is always the more efficient way to ensure availability. As a straightforward example, it is far less expensive to spin up a few instances when increasing traffic is observed than to spin them up after the old instances go down from overload.
We have used two simple ways to identify a probable anomaly in the near future for the various runtime metrics. The signals generated by all the signaling agents are pushed onto a separate queue for prediction. The first component in the prediction flow is a regression module that tries to fit the best possible curve to the values observed in the recent past window. The values for the immediate next window of all metrics are predicted from this fit (with a margin of error).
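As a minimal stand-in for that regression module, here is a least-squares linear fit over the past window extrapolated one step ahead (the actual curve family Nyx fits is not specified; a straight line is our simplifying assumption):

```python
def predict_next(window: list, horizon: int = 1) -> list:
    """Fit a least-squares line to the recent metric window and
    extrapolate the next `horizon` values."""
    n = len(window)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(window) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, window))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    # Predict the points just past the end of the observed window.
    return [intercept + slope * (n - 1 + h) for h in range(1, horizon + 1)]
```

The predicted values would then be compared against the configured thresholds, raising a preventive action before the violation actually occurs.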
The second component is a collection of basic feed-forward networks (from the Java Encog library), trained on runtime metrics data collected from our systems over the past four years. The training data for each metric was given to four networks: “hour of the day”, “day of the week”, “day of the month” and “month of the year”. Each network is essentially given time-series data on which it is trained to predict the possible value of each metric in its next window.
The predicted values from both components are used together to check for possible threshold violations as defined in the core systems, and preventive actions are generated accordingly.
Safety checks and balances
Systems focused on remediation need to make sure they do not induce further damage to an already ailing application. We introduced various safety checks throughout the Nyx system to avoid such risks. Let us walk through each of the safety checks in the system:
1. Consider a situation where six machines out of ten in a pool need a simple restart because of identified outliers in CPU usage. However, they are serving a considerable amount of business-critical traffic at the moment. Bringing down more than 50% of the machines in a pool, even for a brief period of time, may not be a good idea. So we enforce a safety threshold on actions taken on every system, and this command would be shrunk to include only the nodes within the safety range. All the other nodes in the command are persisted for auditing purposes but dropped from execution. Every executed command is thus ensured to be within its safety limits, and the core system waits for another set of signals from the dropped machines to act upon, with the safety limits still in place.
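The command-shrinking step can be sketched in a few lines (the 50% default mirrors the example above; the function shape is our illustration):

```python
def apply_safety_limit(nodes: list, pool_size: int,
                       max_fraction: float = 0.5):
    """Shrink a command so that at most max_fraction of the pool is
    acted on at once. The dropped nodes are returned separately: they
    are persisted for auditing and will be re-signalled later rather
    than executed now."""
    limit = int(pool_size * max_fraction)
    return nodes[:limit], nodes[limit:]
```

In the six-of-ten example, only five nodes make it into the executed command and the sixth is deferred until a later signaling round.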
2. Checks to detect redundant jobs being executed on the same set of nodes are also in place. The core system checks our internal jobs database to ensure that the frequency of jobs executed on the same set of machines is not high. If this check fails, the command is persisted in the database, dropped, and a notification is sent: if the same job keeps failing on the same set of machines, the underlying problem needs to be investigated, so our administrators are notified with a history of actions. Otherwise, the processed command is pushed back onto the messaging bus for the executors to act upon.
3. Mutual exclusion and atomicity are two other safeguards in place to prevent concurrent activities from being executed on the same application. For instance, we don’t want to execute a restart on a node where a deployment is already in progress. We achieve this by using distributed locks.
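A toy in-process version of that per-application lock illustrates the idea (the real system uses distributed locks across the control plane, e.g. via a coordination service; this class is purely illustrative):

```python
import threading

class ActionLocks:
    """In-process stand-in for the distributed locks that serialize
    actions on a single application."""
    def __init__(self):
        self._held = set()
        self._guard = threading.Lock()

    def try_acquire(self, app: str) -> bool:
        """Return True if no other action holds the lock for `app`."""
        with self._guard:
            if app in self._held:
                return False  # e.g. a deployment is already in progress
            self._held.add(app)
            return True

    def release(self, app: str) -> None:
        with self._guard:
            self._held.discard(app)
```

A restart command would call `try_acquire` before executing and simply re-queue itself if a deployment already holds the lock.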
4. After executing a command for a system, a simple check measures its effects by looking for signals in the next batch corresponding to the same system. The overall health of the system is also measured before and after we execute commands on it. A warning alert is raised if signals are still observed after the commands were executed; the redundant jobs check then kicks in to prevent execution of the same commands on the system after a couple of tries.
To prevent multiple failures at the same time during execution of generated commands, a basic strategy is applied for each type of command. Consider a command that acts upon 10 nodes in a system. They are split into ‘x’ groups, each with a specified number of nodes. The command is executed in parallel for all the nodes within a group, but the groups themselves are cascaded one after the other. The success rate of each group is measured, and execution of the subsequent group is triggered only if a certain success-rate threshold was crossed on the previous group. This information is also used by the safety check components to decide their triggers during the execution flow.
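The cascaded rollout can be sketched as follows (group size, the 0.8 threshold and the sequential `run` calls are illustrative; the real system runs a group’s nodes in parallel):

```python
def cascade_execute(nodes: list, group_size: int, run,
                    success_threshold: float = 0.8) -> list:
    """Execute `run(node) -> bool` over nodes in cascaded groups.
    The next group starts only if the previous group's success rate
    met the threshold; otherwise the rollout halts."""
    attempted = []
    for i in range(0, len(nodes), group_size):
        group = nodes[i:i + group_size]
        results = [run(node) for node in group]
        attempted.extend(group)
        if sum(results) / len(group) < success_threshold:
            break  # halt the rollout; safety checks take over from here
    return attempted
```

A group that mostly fails therefore stops the remaining groups from ever being touched, bounding the blast radius of a bad command.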
Ground truth – Test the waters
We ran the system in manual mode for a few weeks to determine the reliability of the generated commands. An administrator reviews each generated command in the UI and either schedules the job or cancels it. Recently, we turned auto-execute mode on, and things are looking good. Here is a screenshot from our 1k remediations milestone.
We have collected enormous amounts of data about the conditions under which a job was executed or cancelled during manual mode. Going forward, our plan is to feed this data into a learning system that can better classify the actions that need to be taken on generated jobs based on their features, thereby taking the first step towards a truly intelligent lights-out management system.