
Powering Transactions Search with Elastic – Learnings from the Field


Introduction

We see a lot of transactions at PayPal. Millions every day.

These transactions originate externally (a customer using PayPal to pay for a purchase on a website) as well as internally, as money moves through our system. Regular reports of these transactions are delivered to merchants in the form of a csv or a pdf file. Merchants use these reports to reconcile their books.

Recently, we set out to build a REST API that could return transaction data back to merchants. We also wanted to offer the capability to filter on different criteria such as name, email or transaction amount. Support for light aggregation/insight use cases was a stretch goal. This API would be used by our partners, merchants and external developers.

We chose Elastic as the data store. Elastic has proven, over the past 6 years, to be an actively developed product that constantly evolves to adapt to user needs. With a great set of core improvements introduced in version 2.x (memory mapped doc values, auto-regulated merge throttling), we didn’t need to look any further.

Discussed below are our journey and the key learnings we picked up along the way in setting up and using Elastic for this project.

Will It Tip Over

Once live, the system would have tens of terabytes of data spanning 40+ billion documents. Each document would have over a hundred attributes. There would be tens of millions of documents added every day. Each one of the planned Elastic blades has 20TB of SSD storage, 256 GB RAM and 48 cores (hyper).

While we knew Elastic had all the great features we were looking for, we were not too sure if it would be able to scale to work with this volume (and velocity) of data. There are a host of non-functional requirements that arise naturally in financial systems which have to be met. Let’s limit our discussion in this post to performance – response time to be specific.

Importance of Schema

Getting the fundamentals right is the key.

When we initially set up Elastic, we turned on strict validation of fields in the documents. While this gave us a feeling of security akin to what we’re used to with relational systems (strict field and type checks), it hurt performance.

We inspected the content of the Lucene index Elastic created using Luke. With our initial index setup, Elastic was creating sub-optimal indexes. For example, in places where we had defined nested arrays (marked index=”no”), Elastic was still creating child hidden documents in Lucene, one per element in the array. This document explains why, but those documents were still using up space even though we couldn’t query the property via the index. Due to this, we switched the “dynamic” index setting from strict to false. Avro helped ensure that each document conformed to a valid schema when we prepared the documents for ingestion.
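For illustration, here is roughly what that looks like through the elasticsearch-py client; this is a minimal sketch assuming Elasticsearch 2.x, and the index name and field names are made up for the example rather than taken from our production mapping.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Illustrative mapping only. With "dynamic": false, fields outside the mapping
# are kept in _source but are not indexed, and no surprise Lucene structures
# get built for properties we never intend to query.
mapping = {
    "mappings": {
        "transaction": {
            "dynamic": False,
            "properties": {
                "transaction_id": {"type": "string", "index": "not_analyzed"},
                "payer_email":    {"type": "string", "index": "not_analyzed"},
                "amount":         {"type": "double"},
                "created_time":   {"type": "date"},
            },
        }
    }
}

es.indices.create(index="transactions-2016-w01", body=mapping)
```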

A shard should have no more than 2 billion parent plus nested child documents, if you plan to run force merge on it eventually (Lucene doc_id is an integer). This can seem high but is surprisingly easy to exceed, especially when de-normalizing high cardinality data into the source. An incorrectly configured index could have a large number of hidden Lucene documents being created under the covers.
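A quick back-of-the-envelope calculation shows how fast de-normalization eats into that limit; the volumes below are illustrative, not our production figures.

```python
MAX_LUCENE_DOCS = 2 ** 31 - 1        # Lucene doc_id is a signed 32-bit integer
parents_per_day = 20 * 1000 * 1000   # illustrative: transactions indexed per day
nested_per_parent = 10               # illustrative: de-normalized child docs per transaction

lucene_docs_per_day = parents_per_day * (1 + nested_per_parent)
days_until_full = MAX_LUCENE_DOCS / float(lucene_docs_per_day)
print("{:,} Lucene docs/day -> one shard fills in ~{:.0f} days".format(
    lucene_docs_per_day, days_until_full))
```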

Too Many Things to Measure

With the index schema in place, we needed a test harness to measure the performance of the cluster. We wanted to measure Elastic performance under different load conditions, configurations and query patterns. Taken together, the dimensions total more than 150 test scenarios. Sampling each by hand would be nearly impossible. JMeter and BeanShell scripting really helped here to auto-generate scenarios from code and have JMeter sample each hundreds of times (a sketch of scenario generation follows the list below). The results are then fed into Tableau to help make sense of the benchmark runs.

  • Indexing Strategy
    • 1 month data per shard, 1 week data per shard, 1 day data per shard
  • # of shards to use
  • Benchmark different forms of the query (constant score, filter with match all etc.)
  • User’s Transaction Volume
    • Test with accounts having 10 / 100 / 1000 / 10000 / 1000000 transactions per day
  • Query time range
    • 1 / 7 / 15 / 30 days
  • Store source documents in Elastic? Or store source in a different database and fetch only the matching IDs from Elastic?
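Here is a rough sketch of how a scenario matrix like the one above can be generated in code; the dimension values are illustrative, and in practice the combinations were fed to JMeter rather than sampled from plain Python.

```python
from itertools import product

# Illustrative dimensions only; the real matrix had more (shard counts, etc.).
index_strategies = ["month_per_shard", "week_per_shard", "day_per_shard"]
query_forms      = ["constant_score", "filtered_match_all"]
txns_per_day     = [10, 100, 1000, 10000]
range_days       = [1, 7, 15, 30]

scenarios = [
    {"index": i, "query": q, "volume": v, "days": d}
    for i, q, v, d in product(index_strategies, query_forms, txns_per_day, range_days)
]
print(len(scenarios), "scenarios to sample")   # 3 * 2 * 4 * 4 = 96 before the other dimensions
```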

Establishing a Baseline

The next step was to establish a baseline. We chose to start with a single node with one shard. We loaded a month’s worth of data (2 TB).

Tests showed we could search and get back 500 records from across 15 days in about 10 seconds when using just one node. This was good news since it could only get better from here. It also showed that an Elastic shard (a collection of Lucene segments) can handle 2 billion documents indexed into it, more than what we’ll end up using.

One takeaway was that a high segment count increases response time significantly. This might not be so obvious when querying across multiple shards but is very obvious when there’s only one. Use force merge if you have the option (offline index builds). Using a high enough value for refresh_interval and translog.flush_threshold_size enables Elastic to collect enough data into a segment before a commit. The flip side was that it increased the latency for the data to become available in the search results (we used 4GB & 180 seconds for this use case). We could clock over 70,000 writes per second from just one client node.
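For reference, applying those two settings through the elasticsearch-py client looks roughly like this; the values match the ones quoted above, while the index name is illustrative and the setting names assume the 2.x line.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Let Elastic accumulate more data per segment before committing. The trade-off,
# as noted above, is that documents take longer to show up in search results.
es.indices.put_settings(
    index="transactions-2016-w01",
    body={
        "index": {
            "refresh_interval": "180s",
            "translog.flush_threshold_size": "4gb",
        }
    },
)
```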

Nevertheless, data from the recent past is usually hot and we want all the nodes to chip in when servicing those requests. So next, we shard.

Sharding the Data

The same one month’s data (2 TB) was loaded onto 5 nodes with no replicas. Each Elastic node had one shard on it. We chose 5 nodes so as to leave a few nodes unused. They would come in handy in case the cluster started to falter and needed additional capacity, and to test recovery scenarios. Meanwhile, the free nodes were used to load data into Elastic and acted as JMeter slave nodes.

With 5 nodes in play,

  • Response time dropped to 6 seconds (40% gain) for a query that scanned 15 days
  • Filtered queries were the most consistent performers
  • As a query scanned more records due to an increase in the date range, the response time also grew with it linearly
  • A force merge to 20 segments resulted in a response time of 2.5 seconds. This showed a good amount of time was being spent in processing results from individual segments, which numbered over 300 in this case. While tuning the segment merge process is largely automatic starting with Elastic 2.0, we can influence the segment count. This is done using the translog settings discussed before. Also, remember we can’t run a force merge on a live index taking reads or writes, since it can saturate available disk IO
  • Be sure to set the “throttle.max_bytes_per_sec” param to 100 MB or more if you’re using SSDs; the default is too low (see the settings sketch after this list)
  • Having the source documents stored in Elastic did not affect the response time by much, maybe 20ms. It’s surely more performant than having them off-cluster in, say, Couchbase or Oracle. This is due to Lucene storing the source in a separate data structure that’s optimized for Elastic’s scatter-gather query format and is memory mapped (see the fdx and fdt files section under Lucene’s documentation). Having SSDs helped, of course
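A sketch of the merge-related knobs mentioned above; the index name is illustrative, the throttle setting shown applies to the 1.x line (2.x auto-regulates merge IO, as noted in the introduction), and older clients expose force merge as optimize().

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Raise the store/merge throttle for SSD-backed nodes (pre-2.0 setting).
es.cluster.put_settings(
    body={"transient": {"indices.store.throttle.max_bytes_per_sec": "200mb"}}
)

# Only force merge an index that is no longer taking reads or writes;
# the merge can saturate the available disk IO.
es.indices.forcemerge(index="transactions-2016-w01", max_num_segments=20)
```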

 


Final Setup

The final configuration we used had 5-9 shards per index depending on the age of the data. Each index held a week’s worth of data. This got us a reasonable shard count across the cluster but is something we will continue to monitor and tweak, as the system grows.

We saw response times around the 200 ms mark to retrieve 500 records after scanning 15 days’ worth of data with this setup. The cluster had 6 months of data loaded into it at this point.

Shard counts impact not just read times; they impact JVM heap usage due to Lucene segment metadata as well as recovery time, in case of node failure or a network partition. We’ve found it helps Elastic during rebalance if there are a number of smaller shards rather than a few large ones.

We also plan to spin up multiple instances per node to better utilize the hardware. Don’t forget to look at your kernel IO scheduler (see hardware recommendations) and the impact of NUMA and zone reclaim mode on Linux.


Conclusion

Elastic is a feature rich platform to build search and data intensive solutions. It removes the need to design for specific use cases, the way some NoSQL databases require. That’s a big win for us as it enables teams to iterate on solutions faster than would’ve been possible otherwise.

As long as we exercise due care when setting up the system and validate our assumptions on index design, Elastic proves to be a reliable workhorse to build data platforms on.

Benching Microbenchmarks


In under one week, Statistics for Software flew past 10 Myths for Enterprise Python to become the most visited post in the history of the PayPal Engineering blog. And that’s not counting the Japanese translation. Taken as an indicator of increased interest in software quality, this really floats all boats.

That said, there were enough emails and comments to call for a quick followup about one particularly troubling area.

Statistics for benchmarks

The saying in software goes that there are lies, damned lies, and software benchmarks.

A tachometer (an instrument indicating how hard an engine is working)

Too much software is built without the most basic instrumentation.

Yes, quantiles, histograms, and other fundamentals covered in Statistics for Software can certainly be applied to improve benchmarking. One of the timely inspirations for the post was our experience with a major network appliance vendor selling 5-figure machines, without providing or even measuring latency in quantiles. Just throughput averages.

To fix this, we gave them a Jupyter notebook that drove test traffic, and a second notebook that provided the numbers they should have measured. We’ve amalgamated elements of both into a single notebook on PayPal’s Github. Two weeks later they had a new firmware build that sped up our typical traffic’s 99th percentile by two orders of magnitude. Google, Amazon, and their other customers will probably get the fixes in a few weeks, too. Meanwhile, we’re still waiting on our gourmet cheese basket.
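For the flavor of what those notebooks reported, here is a minimal sketch of summarizing latency by quantile instead of by average; the latencies are synthetic and numpy is assumed.

```python
import numpy as np

# Synthetic latencies standing in for measured test traffic (milliseconds).
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.7, size=100000)

# Throughput averages hide the tail; quantiles are what the customer feels.
for q in (50, 90, 95, 99, 99.9):
    print("p{:<5} {:8.2f} ms".format(q, np.percentile(latencies_ms, q)))
print("mean   {:8.2f} ms".format(latencies_ms.mean()))
```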

Even though our benchmarks were simple, they were specific to the use case, and utilized robust statistics. But even the most robust statistics won’t solve the real problem: systematic overapplication of one or two microbenchmarks across all use cases. We must move forward, to a more modern view.

Performance as a feature

Any framework or application branding itself as performant must include measurement instrumentation as an active interface. One cannot simply benchmark once and claim performance forever.1 Applications vary widely. There is no performance-critical situation where measurement is not also necessary. Instead, we see a glut of microframeworks, throwing out even the most obvious features in the name of speed.

Speed is not a built-in property. Yes, Formula 1 race cars are fast and yes, F1 designers are very focused on weight reduction. But they are not shaving off grams to set weight records. The F1 engineers are making room for more safety, metrics, and alerting. Once upon a time, this was not possible, but technology has come a long way since last century. So it goes with software.

To honestly claim performance on a featuresheet, a modern framework must provide a fast, reliable, and resource-conscious measurement subsystem, as well as a clear API for accessing the measurements. These are good uses of your server cycles. PayPal’s internal Python framework does all of this on top of SuPPort, faststat, and lithoxyl.

Benching the microbenchmark

An ECHO-brand ping pong paddle and ball.

Enough with the games already. They’re noisy and not even that fun.2

Microbenchmarks were already showing signs of fatigue. Strike one was the frequent lack of reproducibility. Strike two came when software authors began gaming the system, changing what was written to beat the benchmark. Now, microbenchmarks have officially struck out. Echoes and ping-pongs are worth less than their namesakes.

Standard profiling and optimization techniques, such as those chronicled in Enterprise Software with Python, still have their place for engineering performance. But those measurements are provisional and temporary. Today, we need software that provides idiomatic facilities for live measurement of every individual system’s true performance.


  1. I’m not naming names. Yet. You can follow me on Twitter in case that changes. 
  2. Line art by Frank Santoro.

Enterprise Overhaul: Resolving DNS


Everyone assumes all software engineers are great with numbers. If only they knew the truth. How many people’s phone numbers can you recite? No peeking and emergency numbers don’t count! Don’t worry if you couldn’t name that many. Here’s the real embarrassing test of the day: How many sites’ IP addresses can you name? No pinging and local subnets don’t count!

Most telephones still looked like this when DNS was invented. Not pictured: the phonebook.

Back in the mid-1980s, the first Domain Name System (DNS) implementations started putting our IP addresses into server-based contact lists and the Internet has never looked the same since. These days, we may associate DNS with large-scale networks, but it’s important to remember that DNS really came from a very human distaste for numbers. Thirty years later, we engineers use it so much in normal Internet usage that it’s easy to take for granted.

DNS may be mature, but the fact of networks is that it always takes at least two to tango. As new technologies and deployments emerge, the implications of integrating with DNS must still be revisited. Your datacenter is not the Internet, even if it’s in the cloud. Continuing the enterprise themes of our previous posts, this post looks at how to resolve a few of the DNS pitfalls preying on precious reliability and performance.

A protocol precaution

The client side of DNS, resolution, is virtually all UDP. This is interesting because UDP is designed as a lightweight, unreliable transport. However, in many of the most common use cases, DNS calls precede TCP-backed HTTP and other protocols based on reliable transports. This fundamental difference changes many things. Looking upstream, UDP does not load-balance like TCP. Because UDP is not connection-oriented or congestion-controlled, DNS traffic will act very differently at scale.

So our first lesson is to stay true to the stateless nature of UDP and avoid putting stateful load balancers in front of DNS infrastructure. Instead, configure clients and servers to conform to the built-in load-handling architecture of DNS. The Internet’s DNS “deployment” is load balanced via its inherent hierarchy and IP Anycast.

Client integration

Back on the client side, you can do a lot to optimize and robustify your application’s DNS integration. The first step is to take a hard look at your stack. Whether you’re running Python, Java, JavaScript, or C++, the defaults may not be for you, especially when working with traffic within the datacenter.

For example, while not supported here at PayPal, it’s safe to say Tornado is a popular Python web framework, with many asynchronous networking features. But, silently and subtly, DNS is not one of them. Tornado’s default DNS resolution behavior will block the entire IO event loop, leading to big issues at scale.

And that’s just one example of library DNS defaults jeopardizing application reliability. Third-party packages and sometimes even builtins in Java, Node.js, Python, and other stacks are full of hidden DNS faux pas.

For instance, the average off-the-shelf HTTP client seems like a neutral-enough component. Where would we be without reliable standbys like wget? And that is how the trouble starts. The DNS defaults in most tools are designed to make for good Internet citizens, not reliable and performant enterprise foundations.

The hops Internet-connected applications make for you. It’s no wonder the default timeout is 5000 milliseconds.

The first difference is name resolution timeouts. By default, resolv.conf, Netty, and c-ares (gevent, Node.js, curl) are all configured to a whopping 5 seconds. But this is your enterprise, your service, and your datacenter. Look at the SLA of your service and the reliability of your DNS. If your service can’t take an extra 500 milliseconds some percentage of the time, then you should lower that timeout. I’ve usually recommended 200 milliseconds or less (see the resolver sketch below). If your infrastructure can’t resolve DNS faster than that, do one or more of the following:

  1. Put the authoritative DNS servers topologically closer.
  2. Add caching DNS servers, maybe even on the same machine.
  3. Build application-level DNS caching.

Option #1 is purely a network issue, and a matter for network operations to discuss. For brevity’s sake, option #2 is outside the scope of this article. But option #3 is the one we recommend most, because it is bureaucracy-free and relatively easy to implement, even with enterprise considerations.
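Before diving into option #3, the timeout advice above can be made concrete with a small sketch using the dnspython package (an assumption for the example; your stack may expose the knob elsewhere), here with a 200-millisecond resolution budget and a stale-address fallback:

```python
import dns.exception
import dns.resolver   # the dnspython package

resolver = dns.resolver.Resolver()
resolver.timeout = 0.2    # per-nameserver wait, in seconds
resolver.lifetime = 0.2   # total budget for the whole resolution

try:
    answer = resolver.query("internal-service.example.com", "A")
    addresses = [rr.address for rr in answer]
except dns.exception.Timeout:
    # Fall back to a cached (possibly mildly stale) address rather than
    # letting slow DNS inflate the request's response time.
    addresses = []
```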

Application-level DNS caching

When designing an enterprise application-level DNS cache, we must recognize that we are not discussing standard-issue web components like scrapers and browsers. Most enterprise services talk to a fixed set of relatively few machines. Even the most powerful and complex production PayPal services communicate with fewer than 200 addresses, partly due to the prevalence of load balancing LTMs in our architecture.

For our gevent-based Python stack, we use an asynchronous DNS cache that refreshes those addresses every five minutes. Plus, the stack warms up our application’s DNS cache by kicking off preresolution of many known DNS-addressed hosts at startup, ensuring that the first requests are as fast as later ones.

Some may be asking, why use a custom, application-level DNS cache when virtually every operating system caches DNS automatically? In short, when the OS cache expires, the next DNS resolution will block, so stacks without an asynchronous DNS cache stall on that resolution. Our DNS cache allows us to use mildly stale addresses while the cache is refreshing, making us robust to many DNS issues. For our use cases, both the chances and consequences of connecting to the wrong server are so minute that it’s not worth inflating outlier response times by inlining DNS. This arrangement also makes services much more robust to network glitches and DNS outages, and it allows for more logging and instrumentation around the explicit DNS resolution so you can see when DNS is performing badly.
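A stripped-down, gevent-flavored sketch of that idea might look like the following; it is a conceptual illustration rather than our production implementation, with the five-minute refresh and the startup warm-up mirroring the description above and the hostnames left to the caller.

```python
import time
import gevent
from gevent import socket


class DNSCache(object):
    """Serve mildly stale addresses while a background greenlet refreshes them."""

    def __init__(self, hostnames, refresh_secs=300):
        self._cache = {}
        self.refresh_secs = refresh_secs
        for host in hostnames:       # warm up so the first requests are as fast as later ones
            self._refresh(host)
        gevent.spawn(self._refresh_loop)

    def _refresh(self, host):
        try:
            infos = socket.getaddrinfo(host, None)
            self._cache[host] = ([info[4][0] for info in infos], time.time())
        except socket.gaierror:
            pass   # keep the previous (stale) entry instead of failing the caller

    def _refresh_loop(self):
        while True:
            gevent.sleep(self.refresh_secs)
            for host in list(self._cache):
                self._refresh(host)

    def resolve(self, host):
        if host not in self._cache:
            self._refresh(host)
        addresses, _refreshed_at = self._cache[host]
        return addresses
```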

Denecessitizing DNS?

The overhaul wouldn’t be complete without exploring one final scenario. What’s it like to not use DNS at all? It may sound odd, given the number of technologies built on DNS in the last 30 years. But even today, PayPal production services still communicate to each other using a statically generated IP-address-based system, like a souped-up hosts file. This design decision long predates my tenure here, and for a long time I considered it technical debt. But after collaborating with architects here and at other enterprise datacenters, I’ve come to appreciate the advantages of skipping DNS. DNS was designed for multi-authority, federated, eventually-consistent networks, like the Internet. Even the biggest datacenters are not the Internet. A datacenter is topologically smaller, has only one operational authority, and must meet much tighter reliability requirements.

A little peek at PayPal’s midtier-to-midtier traffic. Each shrunken line of text is a service endpoint. It looks like a lot, but each endpoint only talks to a few others.

Whether or not your system uses DNS, when you own the entire network it’s still best practice to maintain a central, version-controlled, “single source of truth” repository for networking configurations. After all, even DNS server configurations have to come from somewhere. If it were possible to efficiently and reliably push that same information to every client, would you?

Explicit preresolution of all service names reduces the window of inconsistency while saving the datacenter billions of network requests. If you already have a scalable deployment system, could it also fill the network topology gap, saving you the trouble of overhauling, scaling, and maintaining an Internet system for enterprise use? There’s a lot packed in a question like that, but it’s something to consider when designing your service ecosystem.

In short

So, to sum it all up, here are the key takeaways:

  • Beware the pitfalls of stateful load-balancing for DNS and UDP.
  • Tighten up your timeouts according to your SLAs.
  • Consider an in-application DNS cache with explicit resolution.
  • The fastest and most reliable request is the request you don’t have to make.
  • A datacenter is not the Internet.

It may be obvious now, but it bears repeating. If you’re not careful, out-of-box solutions will fill your inbox with avoidable problems. Quality enterprise engineering means taking a microscope to libraries, with deliberate overhauling for your organization’s needs.

If you found this interesting, have some experience, and know a thing or two about security, have we got the job for you! See this description and contact mahmoud@paypal.com and kurose@paypal.com with your résumé/portfolio.

 

Introducing SuPPort


In our last post, Ten Myths of Enterprise Python, we promised a deeper dive into how our Python Infrastructure works here at PayPal and eBay. There is a problem, though. There are only so many details we can cover, and at the end of the day, it’s just so much better to show than to tell.

So without further ado, we’re pleased to introduce SuPPort, an in-development distillation of our PayPal Python Infrastructure.

Started in 2010, Python Infrastructure initially powered PayPal’s internal price-setting interfaces, then grew to support payment orchestration interfaces, and now in 2015 it supports dozens of projects at PayPal and eBay, having handled billions of production-critical requests for a wide range of teams and tiers. So what does it mean to distill this functionality into SuPPort?

SuPPort is an event-driven server framework designed for building scalable and maintainable services and clients. It’s built on top of several open-source technologies, so before we dig into the workings of SuPPort, we ought to showcase its foundations:

Some or all of these may be new to many developers, but all-in-all they comprise a powerful set of functionality. With power comes complexity, and while Python as a language strives for technical convergence, there are many ways to approach the problem of developing, scaling, and maintaining components. SuPPort is one way gevent and the libraries above have been used to build functional services and products with anywhere from 100 requests per day to 100 requests per second and beyond.

Enterprise Ideals, Flexible Features

Many motivations have gone into building up a Python stack at PayPal, but as in any enterprise environment, we continuously aim for qualities like interoperability, introspectability, and infallibility, each explored in the sections below.

Of course organizations of all sizes want these features as well, but the key difference is that large organizations like PayPal usually end up building more, all while demanding a higher degree of redundancy and risk mitigation from their processes. This often results in great cost in terms of both hardware and developer productivity. Fortunately for us, Python can be very efficient in both respects.

So, let’s take a stroll through a selection of SuPPort’s feature set in the context of these criteria! Note that if you’re unfamiliar with evented programming, nonblocking sockets, and gevent in particular, some of this may seem quite foreign. The gevent tutorial is a good entry point for the intermediate Python programmer, which can be supplemented with this well-illustrated introduction to server architectures.

Interoperability

Python usage here at PayPal has spread to virtually every imaginable use case: administrative interfaces, midtier services, operations automation, developer tools, batch jobs; you name it, Python has filled a gap in that area. This legacy has resulted in a few rather interesting abstractions exposed in SuPPort.

BufferedSocket

PayPal has hundreds of services across several tiers. Interoperating between these means having to implement over half a dozen network protocols. The BufferedSocket type eliminated our inevitable code duplication, handling a lot of the nitty-gritty of making a socket into a parser-friendly data source, while retaining timeouts for keeping communications responsive. A must-have primitive for any gevent protocol implementer.

ConnectionManager

Errors happen in live environments. DNS requests fail. Packets are lost. Latency spikes. TCP handshakes are slow. SSL handshakes are slower. Clients rarely handle these problems gracefully. This is why SuPPort includes the ConnectionManager, which provides robust error handling code for all of these cases with consistent logging and monitoring. It also provides a central point of configuration for timeouts and host fallbacks.

Introspectability

As part of a large organization, we can afford to add more machines, and are even required to keep a certain level of redundancy and idle hardware. And while DevOps is catching on in many larger-scale environments, there are many cases in enterprise environments where developers are not allowed to attend to their production code.

SuPPort currently comes with all the same general-purpose introspection capabilities that PayPal Python developers enjoy, meaning that we get you as much structured information about your application as possible without actually requiring login privileges. Of course almost every aspect of this is configurable, to suit a wide variety of environments from development to production.

Context management

Python famously has no global scope: all values are namespaced in module scope. But there are still plenty of aspects of the runtime that are global. Some are out of our control, like the OS-assigned process ID, or the VM-managed garbage collection counters. Other aspects are in our control, and best practice in concurrent programming is to keep these as well-managed as possible.

SuPPort uses a system of Contexts to explicitly manage nonlocal state, eliminating difficult-to-track implicit global state for many core functions. This has the added benefit of creating opportunities to centrally manage and monitor debugging data and statistics (some charts of which are shown below), all made available through the MetaApplication, detailed further down.

Accept SSL Charts

Charting quantiles and recent timings for incoming SSL connections from a remote service.

SuPPort Stats table

Values are also available in table-based and JSON formats, for easy human and machine readability.

MetaApplication

While not exclusively a web server framework, SuPPort leverages its strong roots in the web to provide both a web-based user interface and API full of useful runtime information.

As you can see below, there is a lot of information exposed through this default interface. This is partly because restricted environments often do not allow local login on machines, and partly because of the relative convenience of a browser for most developers. Not pictured is the feature that the same information is available in JSON format for easy programmatic consumption. Because this application is such a rich source of information, we recommend using SuPPort to run it on a separate port which can be firewalled accordingly, as seen in this example.

MetaApplication Screenshot #1

A screenshot of the MetaApplication, showing load averages and other basic information, as well as subroutes to further info.

MetaApplication Screenshot 2

Another shot of the MetaApplication, showing process and runtime info

 

Infallibility

At the end of the day, reliability over long periods of time is what earns a stack approval and adoption. At this point, the SuPPort architecture has a billion production requests under its belt here at PayPal, but on the way we put it through the proverbial paces. At various points, we have tested and confirmed several edge behaviors. Here are just a few key characteristics of a well-behaved application:

  • Gracefully sheds traffic under load (no unbounded queues here)
  • Can and has run at 90%+ CPU load for days at a time
  • Is free from framework memory leaks
  • Is robust to memory leakage in user code

To illustrate, a live service handling millions of requests per day had a version of OpenSSL installed which was leaking memory on every handshake. Thanks to preemptive worker cycling on excessive process memory usage, no intervention was required and no customers were impacted. The worker cycling was noted in the logs, the leak was traced to OpenSSL, and operations was notified. The problem was fixed with the next regularly scheduled release rather than being handled as a crisis.

No monkeypatching

One of the first and sometimes only ways that people experience gevent is through monkeypatching. At the top of your main module you issue a call to gevent that automatically swaps out virtually all system libraries with their cooperatively concurrent ones. This sort of magic is relatively rare in Python programming, and rightfully so. Implicit activities like this can have unexpected consequences. SuPPort is a no-monkeypatching approach to gevent. If you want to implement your own network-level code, it is best to use gevent.socket directly. If you want gevent-incompatible libraries to work with gevent, best to use SuPPort’s gevent-based threading capabilities.
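A minimal sketch of the contrast, assuming placeholder hostnames and nothing beyond gevent itself; the point is simply that the cooperative socket module is reached for explicitly rather than patched in globally.

```python
# Monkeypatched style: implicit and process-wide, and therefore avoided in SuPPort.
# from gevent import monkey; monkey.patch_all()

# Explicit style: use gevent's cooperative socket module directly.
import gevent
from gevent import socket

def fetch_banner(host, port):
    conn = socket.create_connection((host, port), timeout=2.0)
    try:
        return conn.recv(128)   # yields to the event loop while waiting on the network
    finally:
        conn.close()

jobs = [gevent.spawn(fetch_banner, h, 22) for h in ("host-a", "host-b", "host-c")]
gevent.joinall(jobs, timeout=5)
print([job.value for job in jobs])
```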

Using threads with gevent

“Threads? In my gevent? I thought the whole point of greenlets and gevent was to eliminate evil, evil threads!” –Countless strawmen

Originating in Stackless and ported over in 2004 by Armin Rigo (of PyPy fame), greenlets are mature and powerful concurrency primitives. We wanted to add that power to the process- and thread-based world of POSIX. There’s no point running from standard OS capabilities; threads have their place. Many architectures adopt a thread-per-request or process-per-request model, but the last thing we want is the number of threads going up as load increases. Threads are expensive; each thread adds a bit of contention to the mix, and in many environments the memory overhead alone, typically 4-8MB per thread, presents a problem. At just a few kilobytes apiece, greenlet’s microthreads are three orders of magnitude less costly.

Furthermore, thread usage in our architecture is hardly about parallelism; we use worker processes for that. In the SuPPort world, threads are about preemption. Cooperative greenlets are much more efficient overall, but sometimes you really do need guarantees about responsiveness.

One excellent example of how threads provide this responsiveness is the ThreadQueueServer detailed below. But first, there are two built-in Threadpools with decorators worth highlighting, io_bound and cpu_bound:

io_bound

This decorator is primarily used to wrap opaque clients built without affordances for cooperative concurrent IO. We use this to wrap cx_Oracle and other C-based clients that are built for thread-based parallelization. Another major use case for io_bound is getting input from standard input (stdin) and files.

A rough sketch of what threads inside a worker look like. The outer box is a process, inner boxes are threads/threadpools, and each text label refers to a coroutine/greenlet.
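For intuition only, here is a toy io_bound-style decorator built on gevent’s ThreadPool; it is a conceptual sketch rather than SuPPort’s actual implementation, and the cx_Oracle call inside is just an example of the kind of blocking client such a decorator would wrap.

```python
import functools
from gevent.threadpool import ThreadPool

_io_pool = ThreadPool(10)   # sized generously; these threads mostly sleep on IO


def io_bound(func):
    """Run a blocking, gevent-unaware callable in a thread so the event loop keeps moving."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return _io_pool.apply(func, args, kwargs)
    return wrapper


@io_bound
def fetch_rows(dsn, query):
    import cx_Oracle                 # C-based client with no cooperative IO affordances
    conn = cx_Oracle.connect(dsn)
    try:
        return conn.cursor().execute(query).fetchall()
    finally:
        conn.close()
```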

cpu_bound

The cpu_bound decorator is used to wrap expensive operations that would halt the event loop for too long. We use it to wrap long-running cryptography and serialization tasks, such as decrypting private SSL certificates or loading huge blobs of XML and JSON. Because the majority of use cases’ implementations do not release the Global Interpreter Lock, the cpu_bound ThreadPool is actually just a pool of one thread, to minimize CPU contention from multiple unparallelizable CPU-intensive tasks.

It’s worth noting that some deserialization tasks are not worth the overhead of dispatching to a separate thread, such as when the data to be deserialized is very short or a result is already cached. For these cases, we have the cpu_bound_if decorator, which conditionally dispatches to the thread, yielding slightly higher responsiveness for low-complexity requests.

Also note that both of these decorators are reentrant, making dispatch idempotent. If you decorate a function that itself eventually calls a decorated function, performance won’t pay the thread dispatch tax twice.

ThreadQueueServer

The ThreadQueueServer exists as an enhanced approach to pulling new connections off of a server’s listening socket. It’s SuPPort’s way of incorporating an industry-standard practice, commonly associated with nginx and Apache, into the gevent WSGI server world.

If you’ve read this far into the post, you’re probably familiar with the standard multi-worker preforking server architecture; a parent process opens a listening socket, forks one or more children that inherit the socket, and the kernel manages which worker gets which incoming client connection.

Preforking architecture

Basic preforking architecture. The OS balances traffic between workers, monitored by an arbiter.

The problem with this approach is that it generally results in inefficient distribution of connections, and can lead to some workers being overloaded while others have cycles to spare. Plus, all worker processes are woken up by the kernel in a race to accept a single inbound connection, in what’s commonly referred to as the thundering herd.

The solution implemented here uses a thread that sleeps on accept, removing connections from the kernel’s listen queue as soon as possible, then explicitly pushing accepted connections to the main event loop. The ability to inspect this user-space connection queue enables not only even distribution but also intelligent behavior under high load, such as closing incoming connections when the backlog gets too long. This fail-fast approach prevents the kernel from holding open fully-established connections that cannot be reached in a reasonable amount of time. This backpressure takes the wait out of client failure scenarios leading to a more responsive extrinsic system, as well.
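A heavily simplified sketch of that pattern, not SuPPort’s actual ThreadQueueServer, is shown below; the backlog limit and port are arbitrary, and in SuPPort the drain side runs on the gevent event loop rather than a plain loop.

```python
import socket
import threading
from queue import Queue, Full   # `from Queue import Queue, Full` on Python 2

MAX_BACKLOG = 128   # beyond this, fail fast instead of queueing unreachable work


def accept_loop(listener, conn_queue):
    """OS thread that sleeps on accept() and feeds a user-space queue we can inspect."""
    while True:
        conn, addr = listener.accept()
        try:
            conn_queue.put_nowait((conn, addr))
        except Full:
            conn.close()   # shed load early while the client can still react


def serve(handle, port=8080):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", port))
    listener.listen(MAX_BACKLOG)

    conn_queue = Queue(maxsize=MAX_BACKLOG)
    threading.Thread(target=accept_loop, args=(listener, conn_queue), daemon=True).start()

    while True:
        conn, addr = conn_queue.get()   # in SuPPort, drained by the event loop
        handle(conn, addr)
```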

What’s next for SuPPort

The sections above highlight just a small selection of the features already in SuPPort, and there are many more to cover in future posts. In addition to those, we will also be distilling more code from our internal codebase out into the open. Among these we are particularly excited about:

  • Enhanced human-readable structured logging
  • Advanced network security functionality based on OpenSSL
  • Distributed online statistics collection
  • Additional generalizations for TCP client infrastructure

And of course, more tests! As soon as we get a couple more features distilled out, we’ll start porting out more than the skeleton tests we have now. Suffice it to say, we’re really looking forward to expanding our set of codified concurrent software learnings, and incorporating as much community feedback as possible, so don’t forget to subscribe to the blog and the SuPPort repo on GitHub.

Mahmoud Hashemi, Kurt Rose, Mark Williams, and Chris Lane

10 Myths of Enterprise Python


2016 Update: Whether you enjoy myth busting, Python, or just all enterprise software, you will also likely enjoy Enterprise Software with Python, presented by the author of the article below, and published by O’Reilly.

PayPal enjoys a remarkable amount of linguistic pluralism in its programming culture. In addition to the long-standing popularity of C++ and Java, an increasing number of teams are choosing JavaScript and Scala, and Braintree‘s acquisition has introduced a sophisticated Ruby community.

One language in particular has both a long history at eBay and PayPal and a growing mindshare among developers: Python.

Python has enjoyed many years of grassroots usage and support from developers across eBay. Even before official support from management, technologists of all walks went the extra mile to reap the rewards of developing in Python. I joined PayPal a few years ago, and chose Python to work on internal applications, but I’ve personally found production PayPal Python code from nearly 15 years ago.

Today, Python powers over 50 projects, including:

  • Features and products, like RedLaser
  • Operations and infrastructure, both OpenStack and proprietary
  • Mid-tier services and applications, like the one used to set PayPal’s prices and check customer feature eligibility
  • Monitoring agents and interfaces, used for several deployment and security use cases
  • Batch jobs for data import, price adjustment, and more
  • And far too many developer tools to count

In the coming series of posts I’ll detail the initiatives and technologies that led the eBay/PayPal Python community to grow from just under 25 engineers in 2011 to over 260 in 2014. For this introductory post, I’ll be focusing on the 10 myths I’ve had to debunk the most in eBay and PayPal’s enterprise environments.

Myth #1: Python is a new language

What with all the start-ups using it and kids learning it these days, it’s easy to see how this myth still persists. Python is actually over 23 years old, originally released in 1991, 5 years before HTTP 1.0 and 4 years before Java. A now-famous early usage of Python was in 1996: Google’s first successful web crawler.

If you’re curious about the long history of Python, Guido van Rossum, Python’s creator, has taken the care to tell the whole story.

Myth #2: Python is not compiled

While not requiring a separate compiler toolchain like C++, Python is in fact compiled to bytecode, much like Java and many other compiled languages. Further compilation steps, if any, are at the discretion of the runtime, be it CPython, PyPy, Jython/JVM, IronPython/CLR, or some other process virtual machine. See Myth #6 for more info.
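You can watch that compilation from any interactive session; the listing below is from a CPython 2.7 interpreter and the exact bytecode varies slightly between versions.

```python
>>> import dis
>>> def add(a, b):
...     return a + b
...
>>> dis.dis(add)   # the bytecode CPython compiled when the function was defined
  2           0 LOAD_FAST                0 (a)
              3 LOAD_FAST                1 (b)
              6 BINARY_ADD
              7 RETURN_VALUE
```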

The general principle at PayPal and elsewhere is that the compilation status of code should not be relied on for security. It is much more important to secure the runtime environment, as virtually every language has a decompiler, or can be intercepted to dump protected state. See the next myth for even more Python security implications.

Myth #3: Python is not secure

Python’s affinity for the lightweight may not make it seem formidable, but the intuition here can be misleading. One central tenet of security is to present as small a target as possible. Big systems are anti-secure, as they tend to overly centralize behaviors, as well as undercut developer comprehension. Python keeps these demons at bay by encouraging simplicity. Furthermore, CPython addresses these issues by being a simple, stable, and easily-auditable virtual machine. In fact, a recent analysis by Coverity Software resulted in CPython receiving their highest quality rating.

Python also features an extensive array of open-source, industry-standard security libraries. At PayPal, where we take security and trust very seriously, we find that a combination of hashlib, PyCrypto, and OpenSSL, via PyOpenSSL and our own custom bindings, cover all of PayPal’s diverse security and performance needs.

For these reasons and more, Python has seen some of its fastest adoption at PayPal (and eBay) within the application security group. Here are just a few security-based applications utilizing Python for PayPal’s security-first environment:

  • Creating security agents for facilitating key rotation and consolidating cryptographic implementations
  • Integrating with industry-leading HSM technologies
  • Constructing TLS-secured wrapper proxies for less-compliant stacks
  • Generating keys and certificates for our internal mutual-authentication schemes
  • Developing active vulnerability scanners

Plus, myriad Python-built operations-oriented systems with security implications, such as firewall and connection management. In the future we’ll definitely try to put together a deep dive on PayPal Python security particulars.

Myth #4: Python is a scripting language

Python can indeed be used for scripting, and is one of the forerunners of the domain due to its simple syntax, cross-platform support, and ubiquity among Linux, Macs, and other Unix machines.

In fact, Python may be one of the most flexible technologies among general-use programming languages. To list just a few:

  1. Telephony infrastructure (Twilio)
  2. Payments systems (PayPal, Balanced Payments)
  3. Neuroscience and psychology (many, many examples)
  4. Numerical analysis and engineering (numpy, numba, and many more)
  5. Animation (LucasArts, Disney, Dreamworks)
  6. Gaming backends (Eve Online, Second Life, Battlefield, and so many others)
  7. Email infrastructure (Mailman, Mailgun)
  8. Media storage and processing (YouTube, Instagram, Dropbox)
  9. Operations and systems management (Rackspace, OpenStack)
  10. Natural language processing (NLTK)
  11. Machine learning and computer vision (scikit-learn, Orange, SimpleCV)
  12. Security and penetration testing (so many, including at eBay/PayPal)
  13. Big Data (Disco, Hadoop support)
  14. Calendaring (Calendar Server, which powers Apple iCal)
  15. Search systems (ITA, Ultraseek, and Google)
  16. Internet infrastructure (DNS) (BIND 10)

Not to mention web sites and web services aplenty. In fact, PayPal engineers seem to have a penchant for going on to start Python-based web properties, YouTube and Yelp, for instance. For an even bigger list of Python success stories, check out the official list.

Myth #5: Python is weakly-typed

Python’s type system is characterized by strong, dynamic typing. Wikipedia can explain more.

Not that it is a competition, but as a fun fact, Python is more strongly-typed than Java. Java has a split type system for primitives and objects, with null lying in a sort of gray area. On the other hand, modern Python has a unified strong type system, where the type of None is well-specified. Furthermore, the JVM itself is also dynamically-typed, as it traces its roots back to an implementation of a Smalltalk VM acquired by Sun.
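A quick interpreter session shows what “strong, dynamic” means in practice; this is stock CPython 2.7 behavior, nothing PayPal-specific.

```python
>>> 1 + "1"          # dynamic, but strongly typed: no silent coercion
Traceback (most recent call last):
  ...
TypeError: unsupported operand type(s) for +: 'int' and 'str'
>>> type(None)       # even "nothing" has a well-specified type
<type 'NoneType'>
```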

Python’s type system is very nice, but for enterprise use there are much bigger concerns at hand.

Myth #6: Python is slow

First, a critical distinction: Python is a programming language, not a runtime. There are several Python implementations:

  1. CPython is the reference implementation, and also the most widely distributed and used.
  2. Jython is a mature implementation of Python for usage with the JVM.
  3. IronPython is Microsoft’s Python for the Common Language Runtime, aka .NET.
  4. PyPy is an up-and-coming implementation of Python, with advanced features such as JIT compilation, incremental garbage collection, and more.

Each runtime has its own performance characteristics, and none of them are slow per se. The more important point here is that it is a mistake to assign performance assessments to a programming language. Always assess an application runtime, most preferably against a particular use case.

Having cleared that up, here is a small selection of cases where Python has offered significant performance advantages:

  1. Using NumPy as an interface to Intel’s MKL SIMD
  2. PyPy‘s JIT compilation achieves faster-than-C performance
  3. Disqus scales from 250 to 500 million users on the same 100 boxes

Admittedly these are not the newest examples, just my favorites. It would be easy to get side-tracked into the wide world of high-performance Python and the unique offerings of runtimes. Instead of addressing individual special cases, attention should be drawn to the generalizable impact of developer productivity on end-product performance, especially in an enterprise setting.

A comparison of C++ to Python. The Python is approximately a tenth the size.

C++ vs Python. Two languages, one output, as apples to apples as it gets.

Given enough time, a disciplined developer can execute the only proven approach to achieving accurate and performant software (a small profiling sketch follows the list):

  1. Engineer for correct behavior, including the development of respective tests
  2. Profile and measure performance, identifying bottlenecks
  3. Optimize, paying proper respect to the test suite and Amdahl’s Law, and taking advantage of Python’s strong roots in C.
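Step 2 is the one most often skipped, so here is a minimal sketch of it using the standard library’s cProfile; handle_batch and its records are stand-ins for whatever code path you are actually investigating.

```python
import cProfile
import pstats

def handle_batch(records):
    # stand-in for the code path under investigation
    return sorted(records, key=lambda r: r["amount"])

records = [{"amount": i % 997} for i in range(100000)]

profiler = cProfile.Profile()
profiler.enable()
handle_batch(records)
profiler.disable()

# Rank by cumulative time to find the bottlenecks worth optimizing (Amdahl's Law).
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```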

It might sound simple, but even for seasoned engineers, this can be a very time-consuming process. Python was designed from the ground up with developer timelines in mind. In our experience, it’s not uncommon for Python projects to undergo three or more iterations in the time it takes C++ and Java projects to do just one. Today, PayPal and eBay have seen multiple success stories wherein Python projects outperformed their C++ and Java counterparts, with less code (see right), all thanks to fast development times enabling careful tailoring and optimization. You know, the fun stuff.

Myth #7: Python does not scale

Scale has many definitions, but by any definition, YouTube is a web site at scale. More than 1 billion unique visitors per month, over 100 hours of uploaded video per minute, and going on 20 percent of peak Internet bandwidth, all with Python as a core technology. Dropbox, Disqus, Eventbrite, Reddit, Twilio, Instagram, Yelp, EVE Online, Second Life, and, yes, eBay and PayPal all have Python scaling stories that prove scale is more than just possible: it’s a pattern.

The key to success is simplicity and consistency. CPython, the primary Python virtual machine, maximizes these characteristics, which in turn makes for a very predictable runtime. One would be hard pressed to find Python programmers concerned about garbage collection pauses or application startup time. With strong platform and networking support, Python naturally lends itself to smart horizontal scalability, as manifested in systems like BitTorrent.

Additionally, scaling is all about measurement and iteration. Python is built with profiling and optimization in mind. See Myth #6 for more details on how to vertically scale Python.

Myth #8: Python lacks good concurrency support

Occasionally, while debunking performance and scaling myths, someone tries to get technical: “Python lacks concurrency,” or, “What about the GIL?” If dozens of counterexamples are insufficient to bolster one’s confidence in Python’s ability to scale vertically and horizontally, then an extended explanation of a CPython implementation detail probably won’t help, so I’ll keep it brief.

Python has great concurrency primitives, including generators, greenlets, Deferreds, and futures. Python has great concurrency frameworks, including eventlet, gevent, and Twisted. Python has had some amazing work put into customizing runtimes for concurrency, including Stackless and PyPy. All of these and more show that there is no shortage of engineers effectively and unapologetically using Python for concurrent programming. Also, all of these are officially supported and/or used in enterprise-level production environments. For examples, refer to Myth #7.

The Global Interpreter Lock, or GIL, is a performance optimization for most use cases of Python, and a development ease optimization for virtually all CPython code. The GIL makes it much easier to use OS threads or green threads (greenlets usually), and does not affect using multiple processes. For more information, see this great Q&A on the topic and this overview from the Python docs.
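For the skeptical, a small sketch of that division of labor, using only the standard library’s concurrent.futures (a backport is needed on Python 2) and nothing PayPal-specific: processes for CPU-bound work, threads for IO-bound work.

```python
import socket
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_heavy(n):
    return sum(i * i for i in range(n))

def io_heavy(host):
    return socket.gethostbyname(host)   # blocking C call; the GIL is released while it waits

if __name__ == "__main__":
    # CPU-bound work sidesteps the GIL entirely by fanning out to processes...
    with ProcessPoolExecutor() as procs:
        totals = list(procs.map(cpu_heavy, [10 ** 6] * 8))

    # ...while IO-bound work scales fine on threads (or greenlets), GIL and all.
    with ThreadPoolExecutor(max_workers=8) as threads:
        addresses = list(threads.map(io_heavy, ["example.com"] * 8))

    print(len(totals), len(addresses))
```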

Here at PayPal, a typical service deployment entails multiple machines, with multiple processes, multiple threads, and a very large number of greenlets, amounting to a very robust and scalable concurrent environment (see figure below). In most enterprise environments, parties tend to prefer a fairly high degree of overprovisioning, for general prudence and disaster recovery. Nevertheless, in some cases Python services still see millions of requests per machine per day, handled with ease.

Sketch of a PayPal Python server worker

A rough sketch of a single worker within our coroutine-based asynchronous architecture. The outermost box is the process, the next level is threads, and within those threads are green threads. The OS handles preemption between threads, whereas I/O coroutines are cooperative.

Myth #9: Python programmers are scarce

There is some truth to this myth. There are not as many Python web developers as PHP or Java web developers. This is probably mostly due to a combined interaction of industry demand and education, though trends in education suggest that this may change.

That said, Python developers are far from scarce. There are millions worldwide, as evidenced by the dozens of Python conferences, tens of thousands of StackOverflow questions, and companies like YouTube, Bank of America, and LucasArts/Dreamworks employing Python developers by the hundreds and thousands. At eBay and PayPal we have hundreds of developers who use Python on a regular basis, so what’s the trick?

Well, why scavenge when one can create? Python is exceptionally easy to learn, and is a first programming language for children, university students, and professionals alike. At eBay, it only takes one week to show real results for a new Python programmer, and they often really start to shine as quickly as 2-3 months, all made possible by the Internet’s rich cache of interactive tutorials, books, documentation, and open-source codebases.

Another important factor to consider is that projects using Python simply do not require as many developers as other projects. As mentioned in Myth #6 and Myth #9, lean, effective teams like Instagram are a common trope in Python projects, and this has certainly been our experience at eBay and PayPal.

Myth #10: Python is not for big projects

Myth #7 discussed running Python projects at scale, but what about developing Python projects at scale? As mentioned in Myth #9, most Python projects tend not to be people-hungry. While Instagram reached hundreds of millions of hits a day at the time of their billion dollar acquisition, the whole company was still only a group of a dozen or so people. Dropbox in 2011 only had 70 engineers, and other teams were similarly lean. So, can Python scale to large teams?

Bank of America actually has over 5,000 Python developers, with over 10 million lines of Python in one project alone. JP Morgan underwent a similar transformation. YouTube also has engineers in the thousands and lines of code in the millions. Big products and big teams use Python every day, and while it has excellent modularity and packaging characteristics, beyond a certain point much of the general development scaling advice stays the same. Tooling, strong conventions, and code review are what make big projects a manageable reality.

Luckily, Python starts with a good baseline on those fronts as well. We use PyFlakes and other tools to perform static analysis of Python code before it gets checked in, as well as adhering to PEP8, Python’s language-wide base style guide.

Finally, it should be noted that, in addition to the scheduling speedups mentioned in Myth #6 and #7, projects using Python generally require fewer developers, as well. Our most common success story starts with a Java or C++ project slated to take a team of 3-5 developers somewhere between 2-6 months, and ends with a single motivated developer completing the project in 2-6 weeks (or hours, for that matter).

A miracle for some, but a fact of modern development, and often a necessity for a competitive business.

A clean slate

Mythology can be a fun pastime. Discussions around these myths remain some of the most active and educational, both internally and externally, because implied in every myth is a recognition of Python’s strengths. Also, remember that the appearance of these seemingly tedious and troublesome concerns is a sign of steadily growing interest, and with steady influx of interested parties comes the constant job of education. Here’s hoping that this post manages to extinguish a flame war and enable a project or two to talk about the real work that can be achieved with Python.

Keep an eye out for future posts where I’ll dive deeper into the details touched on in this overview. If you absolutely must have details before then, or have corrections or comments, shoot me an email at mahmoud@paypal.com. Until then, happy coding!