It is not uncommon for services at PayPal to span 1,000 VMs or more. These services run on very small VMs, each producing very low throughput, while the sheer number of nodes takes a toll on the network and routing infrastructure. Several of these services are interconnected into a complicated mesh, so a user request travels through many network hops. As the number of these services adds up, latency gradually increases and the user experience deteriorates.
While it is good for a service to have a critical mass of VMs spread across many data centers for redundancy, additional VMs beyond that critical mass have diminishing returns. A service spanning hundreds of VMs carries an inherent cost in management and monitoring and in ineffective caching, but more importantly in agility. It may take from a few minutes up to an hour to roll out a new version of a service across 100 VMs; rolling out to 1,000 VMs takes ten times longer.
We have been riding the single-thread performance increases of Moore’s law for many decades, but that trend slowed around 2005. Barring breakthroughs in alternate technologies like quantum computing, we are near the limits of transistor density and clock speed. Newer processors no longer necessarily make single-threaded applications run faster. Instead, the drive for power savings pushed the microprocessor industry toward more cores at lower clock speeds, which means applications now have to adapt to using more CPUs per VM.
The industry cries “micro-services” out loud. But what are micro-services? When is a service considered a micro-service? When does it become a nano-service? And how will engineers and organizations know where the right service boundary lies? We hear that micro-services are services that do only one thing. Does this mean that if an organization does ten thousand things, we need ten thousand services? Perhaps. The truth is, services grow organically and are aligned more with organizations than with their function in the grand scheme. A re-org splits an engineering team into two and, quite often, the one service owned by that team is now split into two services. This means even micro-services should be structured to be modular and able to adapt to organizational re-structuring. In essence, you may want a micro-service to be built upon a multitude of loosely-coupled nano-services.
Lastly, even our micro-services or nano-services get complicated and hard to maintain. We need modern programming techniques that allow us to quickly build and assemble these services. Ideally, there should be good visibility into that one thing a service does and you should not have to dig into layers and layers of code to figure it out.
This set of problems and requirements drives us to look for a next-generation framework or infrastructure. Our requirements include:
- Scalable, both horizontally to hundreds of nodes and vertically across many processors, handling billions of requests per day
- Low latency, controllable at a very fine grain
- Resilient to failure
- Flexibility in adjusting the service boundaries
- A programming model AND culture encouraging scalability and simplicity, including clean failure and error handling
Based on these requirements, we quickly narrowed down our options and turned to the Actor Model of Computation for its ultimate scalability. Being message-based also allows us to control latency at fine granularity, as opposed to deeply stack-based systems. There are two prominent players: Erlang and Akka. Erlang has some very compelling properties, especially runtime upgradability, and can be a good choice. But PayPal already has a significant investment in the Java Virtual Machine (JVM). Growing a new stack from scratch in this environment, with its many operational hooks and security requirements, is extremely difficult. Having many of these hooks ready as JVM libraries, for better or worse, significantly helps with cost and time-to-market. Therefore, we decided to use Akka, with Spray as the HTTP library, as Spray fully honors the Akka actor and execution model.
The unique mix of functional programming and the actor model in Akka (and definitely Erlang, too) allows us to write code that is easy to reason about, easy to test, and, compared to the traditional JVM model, especially easy when handling errors and failure scenarios. This is a great benefit, allowing faster, more resilient, and simpler code with streamlined error handling and fewer bugs.
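As a small illustration of this style of error handling, here is a sketch using the Scala standard library’s `Try` container. The `parsePort` function is hypothetical, but it shows the idea: each step returns a value that carries success or failure, so failures flow through composition instead of through nested try/catch blocks.

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical parsing step: returns a Try instead of throwing,
// so the caller composes over success and failure uniformly.
def parsePort(s: String): Try[Int] =
  Try(s.toInt).filter(p => p > 0 && p < 65536)

// A failure short-circuits the chain; recover supplies a fallback.
val port     = parsePort("8080").getOrElse(80)
val fallback = parsePort("not-a-number").recover { case _: Throwable => 80 }.get

// Out-of-range values fail the filter predicate rather than slipping through.
parsePort("70000") match {
  case Success(p) => println(s"port $p")
  case Failure(e) => println(s"bad port: $e")
}
```

The same composition style carries over to actor messages and `Future` results, which is where the streamlined error handling pays off at scale.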
Because Akka and Spray come in the form of libraries and provide ultimate flexibility in whatever you want to build, they unfortunately also allow every service to build its ecosystem from scratch. This would lack standardization and manageability across the many services we want to grow, turning them into “specialty” services, each with its own way of deployment and management. We looked at Play as an alternative. While it is extremely simple, it does not natively follow the Akka message-based APIs and conventions, but rather allows the use of Akka on top of Play.
A Mini-Introduction to squbs
A new stack called “squbs” (spelled in all lower case, with the pronunciation rhyming with “cubes”) makes use of the loose coupling already provided by actors. It creates a modular layer (for your nano-services) called “cubes” that are symmetric to other cubes. Unlike libraries with concrete dependencies at the API layer, cubes ride on the actor system and only expose the messaging interface already provided by Akka. The interdependency between cubes is loose and symmetric. It is not hard to see the roots of the name “squbs” in these concepts and properties.
Only a few principles came together in designing squbs:
- It must be extremely lightweight with no measurable performance deficit over a similar Akka application built from scratch.
- New APIs over the Akka APIs are based on absolute necessity. Developers should not need to learn any squbs API or message protocol to build services or applications on squbs. The knowledge base needed to build squbs applications should be the Akka knowledge base which developers can acquire from training, documentation, and forums available on the Internet.
- It should be open source from the core up, with hooks for plugging in PayPal operationalization that cannot be open sourced.
With this, squbs became the standard for building Akka-based reactive applications at PayPal.
Culture & Language
Programming for the reactive, functional landscape is very different from the traditional Java programming we have done for the last 20 years. It requires immutability, leaving behind Java Beans, nulls, and most mutable APIs in the Java ecosystem, adopting error containers like the Scala Try (a Java version is available in open source), and a keen awareness of dangerous closures over mutable state and of any blocking behavior.
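The danger of closing over mutable state deserves a concrete sketch. In the minimal example below (the `requestId` variable is made up for illustration), a `Future` that captures a `var` may observe whatever value the variable holds when the future body happens to run, while capturing an immutable snapshot pins the value down:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Dangerous: the future closes over a var that keeps changing underneath it.
var requestId = 1
val risky = Future { requestId }   // may observe 1 or 2, depending on timing
requestId = 2

// Safe: capture an immutable snapshot before crossing the async boundary.
val snapshot = requestId           // a fixed value from here on
val safe = Future { snapshot }     // always observes the snapshot
```

The same hazard applies to closures sent to actors or scheduled on dispatchers, which is why immutable messages are the norm in this programming model.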
While squbs supports both Scala and Java use cases, the Akka APIs are clearly more at home in their native Scala ecosystem. We also debated which path is easier for an engineering team with a Java background: stay with Java and adopt a culture very different from how we have programmed Java for the last 20 years (imperative programming with pervasive mutability), setting our own culture, standards, and guidelines against what Java programmers are used to; or learn Scala and adopt the pre-existing Scala culture, well suited to this programming model and rich in support libraries that are readily immutable and functional.
The culture adoption is hard to describe. Before teams get their hands dirty, the answer is almost always Java. This should come as no surprise, as it is their bread-and-butter language. There is also resistance from managers and architects who do not get their hands dirty: resistance to any change, a pull to do what you know and not go beyond. Any such change is perceived to introduce tremendous risk to their projects – in many cases without even knowing they are already trying to create a sub-culture under the Java brand name: a set of people who speak Java in a very different dialect and will have to maintain this functional, reactive dialect of Java in each of their teams.
Teams and members who actually do get this far and roll up their sleeves tend to see very quickly that their work is much easier in Akka’s native tongue: Scala. The libraries, the culture, and everything they need is readily available. It feels natural, with no tweaks. Moreover, learning Scala to the point of building reactive services barely scratches the surface of the ocean depths of Scala.
How do you take programmers who only know how to write linear code and have them build high-performance, actor-based systems? Since we do not have the luxury of hiring only the top 5% of übercoders, we have to make sure our developers can be trained to do the job.
Luckily, the vast majority of services do similar things. They receive requests or messages, read from and write to a database, call other services, call a rules engine, fetch data from a cache, write to a cache, all in combination. So much for our micro-service that does one thing. Some others perform one form or another of stream processing, sometimes ETL.
Because it is useful to create common patterns that teams can readily adopt, we defined and built our own set of programming patterns, which, over time, will manifest as application templates. This allows developers to see the problem and marry it to a well-defined, well-studied pattern, ensuring short-term success for their projects.
A common pattern is the “Orchestrator Pattern”, which orchestrates requests or messages by talking to a multitude of resources, all asynchronously. These resources may be in the same system, manifested as an actor; in another cube that happens to share the same address space; or a remote resource altogether.
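The shape of the pattern can be sketched with standard-library `Future`s; in a real squbs service the resources would be actors, cubes, or remote calls, and the resource names here (`fetchUser`, `fetchCart`, `fetchOffers`) are purely illustrative. The key move is to start all the asynchronous calls first, so they run concurrently, and only then combine the results:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Hypothetical asynchronous resources; in practice these could be
// actor asks, calls into another cube, or remote service calls.
def fetchUser(id: Int): Future[String]       = Future(s"user-$id")
def fetchCart(id: Int): Future[List[String]] = Future(List("book", "pen"))
def fetchOffers(id: Int): Future[Int]        = Future(2)

def orchestrate(id: Int): Future[String] = {
  // Kick off all three calls before combining, so they run in parallel.
  val user   = fetchUser(id)
  val cart   = fetchCart(id)
  val offers = fetchOffers(id)
  for (u <- user; c <- cart; o <- offers)
    yield s"$u: ${c.size} items, $o offers"
}
```

Had the three calls been made inside the for-comprehension directly, they would run sequentially; pre-starting the futures is what makes the orchestration truly asynchronous.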
Another trait we have is graciously called the “Perpetual Stream”. It is nothing more than Akka Streams, except that it encourages very long-running streams that start with the service and stop when the service instance is stopped. Providing this pattern and utility in squbs allows streams to hook into the system state and ensures no messages are dropped at shutdown. It provides the additional benefit of modeling the whole flow of the service in a central place, giving clear oversight and understanding of the service’s functionality.
The adoption of Akka and squbs has already delivered very high-scale results. With as few as 8 VMs with 2 vCPUs each, applications have been able to serve over a billion hits a day. Our systems stay responsive even at 90% CPU, which is very uncharacteristic of our older architectures. This provides for transaction densities never seen before. Batches or micro-batches do their jobs in one-tenth of the time they took before. With wider adoption, we expect this kind of technology to reduce cost and allow for much better organizational growth without growing the compute infrastructure accordingly.
Needless to say, Akka and squbs are still players at the “infrastructure” level. squbs is an open source project by eBay and PayPal. It was designed to be open source from the very beginning and is free of pollution and deep library dependencies. We believe its customization hooks allow squbs to fit into any operational environment and would benefit any organization that wants to adopt Akka-based technologies at larger scale.
Visit us at https://github.com/paypal/squbs and leave your comments and questions on the Gitter channel you will find on that GitHub site.