Tag Archives: Performance

Beam Me Up – Profiling a Beam-over-Spark Application


As we move forward with adopting Apache Beam for some of our streaming needs, our Beam applications need to be tested for stability. Such tests aim to ensure that performance does not degrade over time and that applications can maintain their desired performance characteristics (e.g., latency) as they run over long periods. When we ran a Beam-over-Spark application (Beam 0.7.0-SNAPSHOT; Spark 1.6.2) for several hours, the batch processing time increased unexpectedly (i.e., regardless of traffic seasonality).

In this post we share the steps and methods we used to diagnose the performance degradation we witnessed in our application’s (batch) processing time, a diagnosis which ultimately led to identifying the root cause and resolving it.

Houston, We Have a Problem

One of the key metrics in Spark is the lastCompletedBatch_processingDelay, which indicates the time required to process a batch of events. If the batch processing duration exceeds the time gap between consecutive batches (as configured by the batch duration parameter), the application can no longer keep up with the input rate and a backlog of pending batches is formed. If the backlog keeps growing, at some point the application is likely to run out of memory and crash.
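The failure mode is easy to see in a toy queueing model (pure Python; the intervals and counts are made up for illustration): a single worker processes batches that arrive at a fixed interval, and any per-batch processing deficit accumulates into a growing backlog.

```python
def pending_batches(n, batch_interval_ms, processing_ms):
    """Single worker: batch i arrives at i * batch_interval_ms and takes
    processing_ms to process. Returns how many batches are still
    unfinished at the moment the last batch arrives."""
    finish = 0.0
    finish_times = []
    for i in range(n):
        arrival = i * batch_interval_ms
        finish = max(arrival, finish) + processing_ms  # wait for the worker
        finish_times.append(finish)
    last_arrival = (n - 1) * batch_interval_ms
    return sum(1 for f in finish_times if f > last_arrival)

print(pending_batches(100, 1000, 900))   # 1  -- keeps up, queue stays flat
print(pending_batches(100, 1000, 1100))  # 10 -- a 100 ms deficit per batch piles up
```

With a 1-second batch interval, a steady 1.1-second processing time accumulates one pending batch roughly every ten intervals, which is exactly the unbounded growth described above.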


Diving In

Since a Beam-over-Spark application consists of multiple layers, we first sought to narrow down the search space. We therefore eliminated all non-trivial Beam operators and IO operations, such as Kafka reads and writes, and were left with an identity-map-like pipeline over an in-memory source: a stream of integers produced by Beam's GenerateSequence construct.

If the performance problem persisted even in this simplified, stripped-down application, it would likely indicate an issue in one of the fundamental layers rather than in our application's code (which at that point had been reduced to a trivial identity-map-like transformation).


Interestingly enough, the problem did indeed persist even in the stripped down version of the application.

Spark UI

We next turned to the Spark UI to gather basic information.

Among the task metrics, task deserialization took the longest, ranging from 13 to 19 ms. However, at that point it was not clear whether this had anything to do with the issues we were seeing.


A second sampling, taken some hours later, showed that task deserialization times had kept growing, reaching 41-79 ms, while the other task metrics remained stable and showed no signs of degradation. Task deserialization became the prime suspect.


Two main causes could account for the increase in task deserialization times:

  1. Task size grows and the system needs to deserialize more data.
  2. The manner in which deserialization takes place adversely affects deserialization times, and takes longer over time.

Looking through the driver's logs, task sizes remained within a stable range of a few kilobytes, indicating that the extra time was not being spent on larger objects getting deserialized; it probably had more to do with how the deserialization was being performed than with what was being deserialized.

CPU Profiling

Flame graphs are a known method for profiling applications’ performance. A recent post by our very own Aviem Zur goes into details and explains how flame graphs can be leveraged to profile Spark applications.

We used flame graphs to profile our application’s driver and executor JVMs (not at the same time, to avoid collisions at the db level).
Driver profiling was performed using:


Executor profiling was performed using:


While there did not seem to be an apparent bottleneck which could account for the performance degradation we had been seeing, the TorrentBroadcast.readBroadcastBlock() method caught our eye since we could not easily explain why reading a block would take what we considered a relatively long time.

Tracking down bottlenecks

We also tried comparing two flame graphs of the same component, as suggested in Brendan Gregg's blog. The first flame graph was generated close to the application's start time, and the second shortly after the processing delay had grown considerably. However, the diff flame graph did not reveal new insights. At this point, it seemed that looking deeper into TorrentBroadcast.readBroadcastBlock() could be beneficial.

Spark Logs

We then turned to the logs to search for potential issues in the TorrentBroadcast.readBroadcastBlock() method. To make the logs easier to read and to keep multiple task threads on the same executor from interleaving their output, we set the Spark executor (and driver) thread count to 1.

The logs showed evidence indicating that broadcast variable reading times were growing:

DEBUG BlockManager: Getting local block broadcast_691
INFO TorrentBroadcast: Reading broadcast variable 691 took 12 ms

Fast forward into the future:

DEBUG BlockManager: Getting local block broadcast_211582
INFO TorrentBroadcast: Reading broadcast variable 211582 took 31 ms

Since broadcast variables are used by the Spark driver to transfer tasks to the executor nodes, this seemed quite relevant to the problem at hand. We decided to explore the (task) broadcast deserialization flow further and look for possible delays.

23:55:04,152 DEBUG BlockManager: Getting local block broadcast_211582

23:55:04,161 TRACE BlockManager: Put for block broadcast_211582_piece0 took 0 ms to get into synchronized block
23:55:04,181 INFO MemoryStore: Block broadcast_211582_piece0 stored as bytes in memory (estimated size 6.7 KB, free 3.3 MB)

23:55:04,183 DEBUG BlockManager: Putting block broadcast_211582_piece0 without replication took 22 ms
23:55:04,183 INFO TorrentBroadcast: Reading broadcast variable 211582 took 31 ms

We noticed a time gap of 181 - 161 = 20 ms, roughly 64% of the total 31 ms (!), between the following two consecutive printouts:

23:55:04,161 TRACE BlockManager: Put for block broadcast_211582_piece0 took 0 ms to get into synchronized block
23:55:04,181 INFO MemoryStore: Block broadcast_211582_piece0 stored as bytes in memory (estimated size 6.7 KB, free 3.3 MB)

In terms of code, seemingly not much happens between these two printouts, which left ~64% of the deserialization time unaccounted for.
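Gaps like this one are easy to hunt for mechanically. A small sketch, assuming the HH:MM:SS,mmm timestamp format shown in the log excerpts above:

```python
from datetime import datetime, timedelta

LOG = """\
23:55:04,152 DEBUG BlockManager: Getting local block broadcast_211582
23:55:04,161 TRACE BlockManager: Put for block broadcast_211582_piece0 took 0 ms to get into synchronized block
23:55:04,181 INFO MemoryStore: Block broadcast_211582_piece0 stored as bytes in memory (estimated size 6.7 KB, free 3.3 MB)
23:55:04,183 INFO TorrentBroadcast: Reading broadcast variable 211582 took 31 ms
"""

def gaps_ms(lines):
    """Milliseconds elapsed between each pair of consecutive log lines."""
    stamps = [datetime.strptime(line.split()[0], "%H:%M:%S,%f")
              for line in lines]
    return [(b - a) // timedelta(milliseconds=1)
            for a, b in zip(stamps, stamps[1:])]

print(gaps_ms(LOG.splitlines()))  # [9, 20, 2] -- the 20 ms gap is the suspect
```

Running this over a longer stretch of the executor log quickly surfaces which adjacent printouts consistently account for the bulk of the elapsed time.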

A closer look at the second printout (MemoryStore.scala) revealed some interesting insights:

logInfo("Block %s stored as %s in memory (estimated size %s, free %s)".format(

Where blocksMemoryUsed is defined as follows:

private def blocksMemoryUsed: Long = memoryManager.synchronized {
    memoryUsed - currentUnrollMemory
}

and currentUnrollMemory is defined as follows:

def currentUnrollMemory: Long = memoryManager.synchronized {
    unrollMemoryMap.values.sum + pendingUnrollMemoryMap.values.sum
}

This essentially means the unrollMemoryMap and pendingUnrollMemoryMap are scanned in full every time this particular log line is printed. If these maps were to grow to extreme sizes, that could definitely explain the delays we were seeing. Moreover, it could account for the over-time degradation aspect of our issue, since the delays would grow along with the maps' size.
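The pattern is easy to reproduce in miniature (pure Python; the entry counts and values are made up): an accessor that re-sums a map on every log line costs O(n) per call, so as the map leaks, every single printout gets slower.

```python
import timeit

unroll_memory_map = {}

def current_unroll_memory():
    # Mirrors the Scala above: a full scan of the map on every call.
    return sum(unroll_memory_map.values())

for entries in (3_000, 150_000):  # a small map vs. a badly leaked one
    unroll_memory_map.clear()
    unroll_memory_map.update((task_id, 64) for task_id in range(entries))
    per_call_ms = timeit.timeit(current_unroll_memory, number=100) / 100 * 1e3
    print(f"{entries:>7} entries -> {per_call_ms:.3f} ms per 'log line'")
```

The absolute timings will vary by machine, but the per-call cost scales linearly with the map size, matching the slow creep we observed in broadcast read times.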

We briefly considered raising the log level to avoid that particular printout, but a quick usage search identified more usages of currentUnrollMemory in places other than log printouts, making such a workaround ineffective.

To further establish our hypothesis and see if one of the above mentioned map data structures (or both) did indeed grow large – enter memory dumps.

Memory Dumps

A memory dump of one of the executors close to when the application started looked like so:

Memory Dump

A memory dump of the same executor some hours later looked like so:

Memory Dump

The unrollMemoryMap grew from 3,477 to 145,555 entries, bringing the overall scala.collection.mutable.DefaultEntry instance count to 145,596.

Memory Dump

Having this information at our disposal, we started sifting through Spark’s JIRA.

SPARK-17465 described the phenomenon we were witnessing: the purging of the unrollMemoryMap and pendingUnrollMemoryMap maps was broken, and a fix shipped in versions 1.6.3, 2.0.1, and 2.1.0.

Running our application with Spark 1.6.3 for a few days verified that the issue had indeed been resolved.


The resolution to our performance issue boiled down to upgrading Spark from 1.6.2 to 1.6.3 (a Beam Spark runner for Spark 2.x is a work in progress).

We hope that the methods described in this post can help others dealing with performance issues of a similar nature and inspire the community to share their stories from the trenches.

In addition, as part of our profiling session we filed and addressed several Apache Beam tickets.

Python by the C side


C shells by the C shore

Mahmoud's note: This will be my last post on the PayPal Engineering blog. If you've enjoyed this sort of content, subscribe to my blog (pythondoeswhat.com) or follow me on Twitter. It's been fun!

All the world is legacy code, and there is always another, lower layer to peel away. These realities cause developers around the world to go on regular pilgrimage, from the terra firma of Python to the coasts of C. From zlib to SQLite to OpenSSL, whether pursuing speed, efficiency, or features, the waters are powerful, and often choppy. The good news is, when you’re writing Python, C interactions can be a day at the beach.


A brief history

As the name suggests, CPython, the primary implementation of Python used by millions, is written in C. Python core developers embraced and exposed Python’s strong C roots, taking a traditional tack on portability, contrasting with the “write once, debug everywhere” approach popularized elsewhere. The community followed suit with the core developers, developing several methods for linking to C. Years of these interactions have made Python a wonderful environment for interfacing with operating systems, data processing libraries, and everything the C world has to offer.

This has given us a lot of choices, and we’ve tried all of the standouts:

Approach            | Vintage | Representative User | Notable Pros                                          | Notable Cons
C extension modules | 1991    | Standard library    | Extensive documentation and tutorials. Total control. | Compilation, portability, reference management. High C knowledge.
SWIG                | 1996    | crfsuite            | Generate bindings for many languages at once          | Excessive overhead if Python is the only target.
ctypes              | 2003    | oscrypto            | No compilation, wide availability                     | Accessing and mutating C structures cumbersome and error prone.
Cython              | 2007    | gevent, kivy        | Python-like. Highly mature. High performance.         | Compilation, new syntax and toolchain.
cffi                | 2013    | cryptography, pypy  | Ease of integration, PyPy compatibility               | New/high-velocity.

There’s a lot of history and detail that doesn’t fit into a table, but every option falls into one of three categories:

  1. Writing C
  2. Writing code that translates to C
  3. Writing code that calls into libraries that present a C interface

Each has its merits, so we’ll explore each category, then finish with a real, live, worked example.

Writing C

Python’s core developers did it and so can you. Writing C extensions to Python gives an interface that fits like a glove, but also requires knowing, writing, building, and debugging C. The bugs are much more severe, too, as a segmentation fault that kills the whole process is much worse than a Python exception, especially in an asynchronous environment with hundreds of requests being handled within the same process. Not to mention that the glove is also tailored to CPython, and won’t fit quite right, or at all, in other execution environments.

At PayPal, we’ve used C extensions to speed up our service serialization. And while we’ve solved the build and portability issue, we’ve lost track of our share of references and have moved on from writing straight C extensions for new code.

Translating to C

After years of writing C, certain developers decide that they can do better. Some of them are certainly onto something.

Going Cythonic

Cython is a superset of the Python programming language that has been turning type-annotated Python into C extensions for nearly a decade (longer if you count its predecessor, Pyrex). Apart from its maturity, the points that matter to us are:

  • Every Python file is a valid Cython file, enabling incremental, iterative optimization
  • The generated C is highly portable, building on Windows, Mac, and Linux
  • It’s common practice to check in the generated C, meaning that builders don’t need to have Cython installed.

Not to mention that the generated C often makes use of performance tricks that are too tedious or arcane to write by hand, partially motivated by scientific computing’s constant push. And through all that, Cython code maintains a high level of integration with Python itself, right down to the stack trace and line numbers.

PayPal has certainly benefitted from those efforts through high-performance Cython users like gevent, lxml, and NumPy. While our first go with Cython didn't stick in 2011, since 2015 all of our native extensions have been written or rewritten to use Cython. It wasn't always this way, however.

A sip, not a SWIG

An early contributor to Python at PayPal got us started using SWIG, the Simplified Wrapper and Interface Generator, to wrap PayPal C++ infrastructure. It served its purpose for a while, but every modification was a slog compared to more Pythonic techniques. It wasn’t long before we decided it wasn’t our cup of tea.

Long ago, SWIG may have rivaled extension modules as Python programmers' method of choice. These days it seems to suit the needs of C library developers looking for a fast and easy way to wrap their C bindings for multiple languages. It also says something that searching for SWIG usage in Python turns up as many SWIG-replacement libraries as SWIG usage itself.

Calling into C

So far all our examples have involved extra build steps, portability concerns, and quite a bit of writing languages other than Python. Now we’ll dig into some approaches that more closely match Python’s own dynamic nature: ctypes and cffi.

Both ctypes and cffi leverage C’s Foreign Function Interface (FFI), a sort of low-level API that declares callable entrypoints to compiled artifacts like shared objects (.so files) on Linux/FreeBSD/etc. and dynamic-link libraries (.dll files) on Windows. Shared objects take a bit more work to call, so ctypes and cffi both use libffi, a C library that enables dynamic calls into other C libraries.

Shared libraries in C have some gaps that libffi helps fill. A Linux .so, Windows .dll, or OS X .dylib is only going to provide symbols: a mapping from names to memory locations, usually function pointers. Dynamic linkers do not provide any information about how to use these memory locations. When dynamically linking shared libraries to C code, header files provide the function signatures; as long as the shared library and application are ABI compatible, everything works fine. The ABI is defined by the C compiler, and is usually carefully managed so as not to change too often.

However, Python is not a C compiler, so it has no way to properly call into C even with a known memory location and function signature. This is where libffi comes in. If symbols define where to call the API, and header files define what API to call, libffi translates these two pieces of information into how to call the API. Even so, we still need a layer above libffi that translates native Python types to C and vice versa, among other tasks.


ctypes is an early and Pythonic approach to FFI interactions, most notable for its inclusion in the Python standard library.

ctypes works, it works well, and it works across CPython, PyPy, Jython, IronPython, and most any Python runtime worth its salt. Using ctypes, you can access C APIs from pure Python with no external dependencies. This makes it great for scratching that quick C itch, like a Windows API that hasn’t been exposed in the os module. If you have an otherwise small module that just needs to access one or two C functions, ctypes allows you to do so without adding a heavyweight dependency.

For a while, PayPal Python code used ctypes after moving off of SWIG. We found it easier to call into vanilla shared objects built from C++ with an extern C rather than deal with the SWIG toolchain. ctypes is still used incidentally throughout the code for exactly this: unobtrusively calling into certain shared objects that are widely deployed. A great open-source example of this use case is oscrypto, which does exactly this for secure networking. That said, ctypes is not ideal for huge libraries or libraries that change often. Porting signatures from headers to Python code is tedious and error-prone.
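A minimal sketch of that style of use, calling strlen out of the C standard library (POSIX-flavored; the library lookup would differ on Windows):

```python
import ctypes
import ctypes.util

# Locate and load the C standard library; fall back to the current
# process's own symbols if find_library comes up empty.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)

# The hand-ported signature -- this is the tedious, error-prone step
# described above; ctypes cannot read it out of a header file for us.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"paypal"))  # 6
```

For one or two functions this is perfectly pleasant; for a header with hundreds of declarations, every argtypes/restype pair is another chance for a transcription bug.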


cffi, our most modern approach to C integration, comes out of the PyPy project. They were seeking an approach that would lend itself to the optimization potential of PyPy, and they ended up creating a library that fixes many of the pains of ctypes. Rather than handcrafting Python representations of the function signatures, you simply load or paste them in from C header files.

For all its convenience, cffi’s approach has its limits. C is really almost two languages, taking into account preprocessor macros. A macro performs string replacement, which opens a Fun World of Possibilities, as straightforward or as complicated as you can imagine. cffi’s approach is limited around these macros, so applicability will depend on the library with which you are integrating.

On the plus side, cffi does achieve its stated goal of outperforming ctypes under PyPy, while remaining comparable to ctypes under CPython. The project is still quite young, and we are excited to see where it goes next.
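A minimal sketch of the difference (assuming cffi is installed and a POSIX platform): the signature is pasted in as ordinary C text rather than rebuilt out of Python objects.

```python
from cffi import FFI

ffi = FFI()
# Declarations are plain C, typically copied straight from a header file.
ffi.cdef("size_t strlen(const char *s);")

C = ffi.dlopen(None)  # the C standard library, via the current process
print(C.strlen(b"hello"))  # 5
```

Because the cdef text is just C, keeping bindings in sync with an evolving header is mostly a copy-paste exercise instead of a porting one.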

A Tale of 3 Integrations: PKCS11

We promised an example, and we almost made it three.

PKCS11 is a cryptography standard for interacting with many hardware and software security systems. The 200-plus-page core specification includes many things, including the official client interface: A large set of C header-style information. There are a variety of pre-existing bindings, but each device has its own vendor-specific quirks, so what are we waiting for?


As stated earlier, ctypes is not great for sprawling interfaces. The drudgery of converting function signatures invites transcription bugs. We somewhat automated it, but the approach was far from perfect.

Our second approach, using cffi, worked well for our first version's supported feature subset, but unfortunately PKCS11 uses its own CK_DECLARE_FUNCTION macro instead of regular C syntax for defining functions. Therefore, cffi's approach of skipping #define macros results in syntactically invalid C code that cannot be parsed. On the other hand, other macro symbols are compiler or operating system intrinsics (e.g., __cplusplus, _WIN32, __linux__), so even if cffi attempted to evaluate every macro, we would immediately run into problems.

So in short, we’re faced with a hard problem. The PKCS11 standard is a gnarly piece of C. In particular:

  1. Many hundreds of important constant values are created with #define
  2. Macros are defined, then re-defined to something different later on in the same file
  3. pkcs11f.h is included multiple times, even once as the body of a struct

In the end, the solution that worked best was to write a rigorous parser for the particular conventions used by the slow-moving standard and generate Cython, which generates C, which finally gives us access to the complete client, with an added performance bonus in certain cases. Biting this bullet took all of a day and a half, we've been very satisfied with the result, and it's all thanks to a special trick up our sleeves.

Parsing Expression Grammars

Parsing expression grammars (PEGs) combine the power of a true parser generating an abstract syntax tree, not unlike the one used for Python itself, with the convenience of regular expressions. One might think of PEGs as recursive regular expressions. There are several good libraries for Python, including parsimonious and parsley. We went with the former for its simplicity.

For this application, we defined two grammars, one for pkcs11f.h and one for pkcs11t.h:


    file = ( comment / func / " " )*
    func = func_hdr func_args
    func_hdr = "CK_PKCS11_FUNCTION_INFO(" name ")"
    func_args = arg_hdr " (" arg* " ); #endif"
    arg_hdr = " #ifdef CK_NEED_ARG_LIST" (" " comment)?
    arg = " " type " " name ","? " " comment
    name = identifier
    type = identifier
    identifier = ~"[A-Z_][A-Z0-9_]*"i
    comment = ~"(/\*.*?\*/)"ms


    file = ( comment / define / typedef / struct_typedef / func_typedef / struct_alias_typedef / ignore )*
    typedef = " typedef" type identifier ";"
    struct_typedef = " typedef struct" identifier " "? "{" (comment / member)* " }" identifier ";"
    struct_alias_typedef = " typedef struct" identifier " CK_PTR"? identifier ";"
    func_typedef = " typedef CK_CALLBACK_FUNCTION(CK_RV," identifier ")(" (identifier identifier ","? comment?)* " );"
    member = identifier identifier array_size? ";" comment?
    array_size = "[" ~"[0-9]"+ "]"
    define = "#define" identifier (hexval / decval / " (~0UL)" / identifier / ~" \([A-Z_]*\|0x[0-9]{8}\)" )
    hexval = ~" 0x[A-F0-9]{8}"i
    decval = ~" [0-9]+"
    type = " unsigned char" / " unsigned long int" / " long int" / (identifier " CK_PTR") / identifier
    identifier = " "? ~"[A-Z_][A-Z0-9_]*"i
    comment = " "? ~"(/\*.*?\*/)"ms
    ignore = ( " #ifndef" identifier ) / " #endif" / " "

Short, but dense, in true grammatical style. Looking at the whole program, it’s a straightforward process:

  1. Apply the grammars to the header files to get our abstract syntax tree.
  2. Walk the AST and sift out the semantically important pieces, function signatures in our case.
  3. Generate code from the function signature data structures.
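A drastically simplified stand-in for those three steps, using a regular expression in place of the full PEG (the header snippet and stub format are invented for illustration; the real grammar above also handles comments and argument lists):

```python
import re

# A made-up fragment in the pkcs11f.h style described earlier.
HEADER = """
CK_PKCS11_FUNCTION_INFO(C_Initialize)
CK_PKCS11_FUNCTION_INFO(C_Finalize)
"""

# Steps 1-2 (simplified): "parse" the header and keep the function names.
names = re.findall(r"CK_PKCS11_FUNCTION_INFO\((\w+)\)", HEADER)

# Step 3 (simplified): generate code from the extracted signatures.
stubs = "\n".join(f"def {name}(*args): ..." for name in names)

print(names)  # ['C_Initialize', 'C_Finalize']
print(stubs)
```

The real pipeline emits Cython declarations and wrappers rather than Python stubs, but the shape — parse, sift, generate — is the same.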

Using only 200 lines of code to bring such a massive standard to bear, along with the portability and performance of Cython, through the power of PEGs, ranks as one of the high points of Python in practice at PayPal.

Wrapping up

It’s been a long journey, but we stayed afloat and we’re happy to have made it. To recap:

  • Python and C are hand-in-glove made for one another.
  • Different C integration techniques have their applications, our stances are:
    • ctypes for dynamic calls to small, stable interfaces
    • cffi for dynamic calls to larger interfaces, especially when targeting PyPy
    • Old-fashioned C extensions if you’re already good at them
    • Cython-based C extensions for the rest
    • SWIG pretty much never
  • Parsing Expression Grammars are great!

All of this encapsulates perfectly why we love Python so much. Python is a great starter language, but it also has serious chops as a systems language and ecosystem. That bottom-to-top, rags-to-riches, books-to-bits story is what makes it the ineffable, incomparable language that it is.

C you around!

Kurt and Mahmoud

Powering Transactions Search with Elastic – Learnings from the Field



We see a lot of transactions at PayPal. Millions every day.

These transactions originate externally (a customer using PayPal to pay for a purchase on a website) as well as internally, as money moves through our system. Regular reports of these transactions are delivered to merchants in the form of a CSV or PDF file. Merchants use these reports to reconcile their books.

Recently, we set out to build a REST API that could return transaction data back to merchants. We also wanted to offer the capability to filter on different criteria such as name, email or transaction amount. Support for light aggregation/insight use cases was a stretch goal. This API would be used by our partners, merchants and external developers.

We chose Elastic as the data store. Elastic has proven, over the past six years, to be an actively developed product that constantly evolves to adapt to user needs. With a great set of core improvements introduced in version 2.x (memory-mapped doc values, auto-regulated merge throttling), we didn't need to look any further.

Discussed below is our journey and key learnings we had along the way, in setting up and using Elastic for this project.

Will It Tip Over

Once live, the system would have tens of terabytes of data spanning 40+ billion documents. Each document would have over a hundred attributes. There would be tens of millions of documents added every day. Each one of the planned Elastic blades has 20TB of SSD storage, 256 GB RAM and 48 cores (hyper).

While we knew Elastic had all the great features we were looking for, we were not too sure if it would be able to scale to work with this volume (and velocity) of data. There are a host of non-functional requirements that arise naturally in financial systems which have to be met. Let’s limit our discussion in this post to performance – response time to be specific.

Importance of Schema

Getting the fundamentals right is the key.

When we initially set up Elastic, we turned on strict validation of fields in the documents. While this gave us a feeling of security akin to what we're used to with relational systems (strict field and type checks), it hurt performance.

We used Luke to inspect the content of the Lucene index Elastic created. With our initial index setup, Elastic was creating sub-optimal indexes. For example, in places where we had defined nested arrays (marked index="no"), Elastic was still creating hidden child documents in Lucene, one per element in the array. This document explains why, but those documents were still taking up space even though the property couldn't be queried via the index. Because of this, we switched the "dynamic" index setting from strict to false. Avro helped ensure that documents conformed to a valid schema when we prepared them for ingestion.
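A sketch of the shape of that mapping change (the type, index, and field names here are invented; the setting names follow the Elastic 2.x mapping API): with "dynamic": false, unmapped fields are kept in _source but neither indexed nor rejected.

```python
# Index mapping sketch for Elastic 2.x (field names are illustrative).
transactions_mapping = {
    "transaction": {
        # "strict" rejected unmapped fields; False simply ignores them,
        # leaving schema validation to Avro at ingestion time.
        "dynamic": False,
        "properties": {
            "txn_id": {"type": "string", "index": "not_analyzed"},
            "amount": {"type": "double"},
            # Kept in _source for retrieval, never indexed or queryable:
            "line_items": {"type": "object", "enabled": False},
        },
    }
}
```

Disabling object fields entirely ("enabled": false) is one way to keep payload-only data out of the Lucene index while still returning it with hits.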

A shard should have no more than 2 billion parent plus nested child documents, if you plan to run force merge on it eventually (Lucene doc_id is an integer). This can seem high but is surprisingly easy to exceed, especially when de-normalizing high cardinality data into the source. An incorrectly configured index could have a large number of hidden Lucene documents being created under the covers.

Too Many Things to Measure

With the index schema in place, we needed a test harness to measure the performance of the cluster. We wanted to measure Elastic performance under different load conditions, configurations, and query patterns. Taken together, the dimensions amount to more than 150 test scenarios; sampling each by hand would be near impossible. JMeter and BeanShell scripting really helped here, letting us auto-generate scenarios from code and have JMeter sample each one hundreds of times. The results were then fed into Tableau to help make sense of the benchmark runs.

  • Indexing Strategy
    • 1 month data per shard, 1 week data per shard, 1 day data per shard
  • # of shards to use
  • Benchmark different forms of the query (constant score, filter with match all etc.)
  • User’s Transaction Volume
    • Test with accounts having 10 / 100 / 1000 / 10000 / 1000000 transactions per day
  • Query time range
    • 1 / 7 / 15 / 30 days
  • Store source documents in Elastic? Or store source in a different database and fetch only the matching IDs from Elastic?

Establishing a Baseline

The next step was to establish a baseline. We chose to start with a single node with one shard. We loaded a month’s worth of data (2 TB).

Tests showed we could search and get back 500 records from across 15 days in about 10 seconds using just one node. This was good news, since it could only get better from here. It also showed that an Elastic (Lucene) shard can handle 2 billion documents indexed into it, more than we'll end up using.

One takeaway was that a high segment count increases response time significantly. This might not be so obvious when querying across multiple shards, but it is very obvious when there's only one. Use force merge if you have the option (offline index builds). Using a high enough value for refresh_interval and translog.flush_threshold_size lets Elastic collect enough data into a segment before a commit. The flip side is increased latency before the data becomes available in search results (we used 4 GB and 180 seconds for this use case). We could clock over 70,000 writes per second from just one client node.
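The relevant knobs, as an index-settings sketch (setting names per the Elastic 2.x settings API; the values are the ones quoted above):

```python
# Settings sketch for bulk-loading phases: fewer refreshes and larger
# translog flushes mean fewer, larger Lucene segments per commit.
bulk_load_settings = {
    "index": {
        "refresh_interval": "180s",              # data visible to search later
        "translog.flush_threshold_size": "4gb",  # collect more before commit
    }
}
```

These would typically be applied while loading and then relaxed (e.g., refresh_interval back to the default 1s) once the index starts serving live search traffic.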

Nevertheless, data from the recent past is usually hot and we want all the nodes to chip in when servicing those requests. So next, we shard.

Sharding the Data

The same one month's data (2 TB) was loaded onto 5 nodes with no replicas, each node holding one shard. We chose 5 nodes so as to leave a few nodes unused; they would come in handy in case the cluster started to falter and needed additional capacity, and for testing recovery scenarios. Meanwhile, the free nodes were used to load data into Elastic and acted as jMeter slave nodes.

With 5 nodes in play,

  • Response time dropped to 6 seconds (40% gain) for a query that scanned 15 days
  • Filtered queries were the most consistent performers
  • As a query scanned more records due to an increase in the date range, the response time also grew with it linearly
  • A force merge to 20 segments resulted in a response time of 2.5 seconds. This showed a good amount of time was being spent in processing results from individual segments, which numbered over 300 in this case. While tuning the segment merge process is largely automatic starting with Elastic 2.0, we can influence the segment count. This is done using the translog settings discussed before. Also, remember we can’t run a force merge on a live index taking reads or writes, since it can saturate available disk IO
  • Be sure to set the indices.store.throttle.max_bytes_per_sec param to 100 MB or more if you're using SSDs; the default is too low
  • Having the source documents stored in Elastic did not affect the response time by much, maybe 20ms. It’s surely more performant than having them off cluster on say Couchbase or Oracle. This is due to Lucene storing the source in a separate data structure that’s optimized for Elastic’s scatter gather query format and is memory mapped (see fdx and fdt files section under Lucene’s documentation). Having SSDs helped, of course



Final Setup

The final configuration we used had 5-9 shards per index depending on the age of the data. Each index held a week’s worth of data. This got us a reasonable shard count across the cluster but is something we will continue to monitor and tweak, as the system grows.

We saw response times around the 200 ms mark to retrieve 500 records after scanning 15 days’ worth of data with this setup. The cluster had 6 months of data loaded into it at this point.

Shard counts impact not just read times; they impact JVM heap usage due to Lucene segment metadata as well as recovery time, in case of node failure or a network partition. We’ve found it helps Elastic during rebalance if there are a number of smaller shards rather than a few large ones.

We also plan to spin up multiple instances per node to better utilize the hardware. Don’t forget to look at your kernel IO scheduler (see hardware recommendations) and the impact of NUMA and zone reclaim mode on Linux.



Elastic is a feature-rich platform for building search and data-intensive solutions. It removes the need to design for specific use cases the way some NoSQL databases require. That's a big win for us, as it enables teams to iterate on solutions faster than would otherwise have been possible.

As long as we exercise due care when setting up the system and validate our assumptions on index design, Elastic proves to be a reliable workhorse to build data platforms on.

Benching Microbenchmarks


In under one week, Statistics for Software flew past 10 Myths for Enterprise Python to become the most visited post in the history of the PayPal Engineering blog. And that’s not counting the Japanese translation. Taken as an indicator of increased interest in software quality, this really floats all boats.

That said, there were enough emails and comments to call for a quick followup about one particularly troubling area.

Statistics for benchmarks

The saying in software goes that there are lies, damned lies, and software benchmarks.

A tachometer (an instrument indicating how hard an engine is working)

Too much software is built without the most basic instrumentation.

Yes, quantiles, histograms, and other fundamentals covered in Statistics for Software can certainly be applied to improve benchmarking. One of the timely inspirations for the post was our experience with a major network appliance vendor selling 5-figure machines, without providing or even measuring latency in quantiles. Just throughput averages.
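To make the quantile point concrete, here is a minimal percentile helper of the kind the post argues for: report p50/p99 of measured latencies rather than a throughput average. The sample data is invented for illustration.

```javascript
// Nearest-rank percentile: the smallest sample with at least p% of the
// distribution at or below it. Quantiles expose tail latency that an
// average would hide entirely.

function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const latenciesMs = [12, 15, 11, 240, 13, 14, 12, 16, 13, 12];
console.log("p50:", percentile(latenciesMs, 50)); // 13
console.log("p99:", percentile(latenciesMs, 99)); // 240
```

Note how one 240ms outlier leaves the median untouched but dominates the 99th percentile; that is exactly the behavior a throughput average conceals.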

To fix this, we gave them a Jupyter notebook that drove test traffic, and a second notebook provided the numbers they should have measured. We’ve amalgamated elements of both into a single notebook on PayPal’s Github. Two weeks later they had a new firmware build that sped up our typical traffic’s 99th percentile by two orders of magnitude. Google, Amazon, and their other customers will probably get the fixes in a few weeks, too. Meanwhile, we’re still waiting on our gourmet cheese basket.

Even though our benchmarks were simple, they were specific to the use case, and utilized robust statistics. But even the most robust statistics won’t solve the real problem: systematic overapplication of one or two microbenchmarks across all use cases. We must move forward, to a more modern view.

Performance as a feature

Any framework or application branding itself as performant must include measurement instrumentation as an active interface. One cannot simply benchmark once and claim performance forever.1 Applications vary widely. There is no performance-critical situation where measurement is not also necessary. Instead, we see a glut of microframeworks, throwing out even the most obvious features in the name of speed.

Speed is not a built-in property. Yes, Formula 1 race cars are fast and yes, F1 designers are very focused on weight reduction. But they are not shaving off grams to set weight records. The F1 engineers are making room for more safety, metrics, and alerting. Once upon a time, this was not possible, but technology has come a long way since last century. So it goes with software.

To honestly claim performance on a featuresheet, a modern framework must provide a fast, reliable, and resource-conscious measurement subsystem, as well as a clear API for accessing the measurements. These are good uses of your server cycles. PayPal’s internal Python framework does all of this on top of SuPPort, faststat, and lithoxyl.
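As a sketch of how cheap live measurement can be, here is a tiny running-stats accumulator: constant memory, one arithmetic update per sample. This is not faststat's or lithoxyl's API, just an illustration of the idea.

```javascript
// A constant-memory running-stats accumulator: O(1) time and space per
// sample, so it can stay on in production as an always-available interface.

class RunningStats {
  constructor() {
    this.n = 0;
    this.mean = 0;
    this.min = Infinity;
    this.max = -Infinity;
  }
  add(x) {
    // incremental (Welford-style) mean update, no sample buffer kept
    this.n += 1;
    this.mean += (x - this.mean) / this.n;
    if (x < this.min) this.min = x;
    if (x > this.max) this.max = x;
  }
}

const stats = new RunningStats();
[5, 7, 9].forEach((ms) => stats.add(ms));
// stats.mean === 7, stats.min === 5, stats.max === 9
```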

Benching the microbenchmark

An ECHO-brand ping pong paddle and ball.

Enough with the games already. They’re noisy and not even that fun.2

Microbenchmarks were already showing signs of fatigue. Strike one was the frequent lack of reproducibility. Strike two came when software authors began gaming the system, changing what was written to beat the benchmark. Now, microbenchmarks have officially struck out. Echoes and ping-pongs are worth less than their namesakes.

Standard profiling and optimization techniques, such as those chronicled in Enterprise Software with Python, still have their place for engineering performance. But those measurements are provisional and temporary. Today, we need software that provides idiomatic facilities for live measurement of every individual system's true performance.

  1. I’m not naming names. Yet. You can follow me on Twitter in case that changes. 
  2. Line art by Frank Santoro.

Outbound SSL Performance in Node.js


When browsing the internet, we all know that employing encryption via SSL is extremely important. At PayPal, security is our top priority. We use end-to-end encryption, not only for our public website, but for all our internal service calls as well. SSL encryption will, however, affect Node.js performance at scale. We've spent time tuning our outbound connections to get the most out of them. Here are some of the SSL configuration adjustments we've found can dramatically improve outbound SSL performance.

SSL Ciphers

Out of the box, Node.js SSL uses a very strong set of cipher algorithms. In particular, Diffie-Hellman and Elliptic Curve algorithms are hugely expensive and basically cripple Node.js performance when you begin making a lot of outbound SSL calls at default settings. To get an idea about just how slow this can be, here is a CPU sample taken from a service call:

918834.0ms 100.0% 0.0 node (91770)
911376.0ms 99.1% 0.0   start
911376.0ms 99.1% 0.0    node::Start
911363.0ms 99.1% 48.0    uv_run
909839.0ms 99.0% 438.0    uv__io_poll
876570.0ms 95.4% 849.0     uv__stream_io
873590.0ms 95.0% 32.0       node::StreamWrap::OnReadCommon
873373.0ms 95.0% 7.0         node::MakeCallback
873265.0ms 95.0% 15.0         node::MakeDomainCallback
873125.0ms 95.0% 61.0          v8::Function::Call
873049.0ms 95.0% 13364.0       _ZN2v88internalL6InvokeEbNS0
832660.0ms 90.6% 431.0          _ZN2v88internalL21Builtin
821687.0ms 89.4% 39.0            node::crypto::Connection::ClearOut
813884.0ms 88.5% 37.0             ssl23_connect
813562.0ms 88.5% 54.0              ssl3_connect
802651.0ms 87.3% 35.0               ssl3_send_client_key_exchange
417323.0ms 45.4% 7.0                 EC_KEY_generate_key
383185.0ms 41.7% 12.0                ecdh_compute_key
1545.0ms 0.1% 4.0                    tls1_generate_master_secret
123.0ms 0.0% 4.0                     ssl3_do_write

Let’s focus on key generation:

802651.0ms 87.3% 35.0 ssl3_send_client_key_exchange
417323.0ms 45.4% 7.0 EC_KEY_generate_key
383185.0ms 41.7% 12.0 ecdh_compute_key

87% of the time for the call is spent in keygen!

These ciphers can be changed to be less compute intensive. This is done in the https (or agent) options. For example:

var agent = new https.Agent({
    "key": key,
    "cert": cert,
    "ciphers": "AES256-GCM-SHA384"
});

The key here is exclusion of the expensive Diffie-Hellman algorithms (i.e. DH, EDH, ECDH). With something like that, we can see a dramatic change in the sample:

57945.0ms 32.5% 16.0 ssl3_send_client_key_exchange
28958.0ms 16.2% 9.0 generate_key
26827.0ms 15.0% 2.0 compute_key

You can learn more about cipher strings in the OpenSSL documentation.

SSL Session Resume

If your server supports SSL session resume, then you can pass sessions in the (undocumented as yet) https (or agent) option session. You can also wrap your agent‘s createConnection function:

var createConnection = agent.createConnection;

agent.createConnection = function (options) {
    options.session = session;
    return createConnection.call(agent, options);
};

Session resume will decrease the cost of your connections by performing an abbreviated handshake on connection.
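The wrapping pattern above can be packaged as a helper. Below it is exercised against a stub agent so the mechanics are visible without a live TLS server; `cachedSession` stands in for a session captured from an earlier handshake.

```javascript
// Wrap an agent's createConnection so every new connection carries the
// cached TLS session, enabling the abbreviated handshake on resume.

function enableSessionReuse(agent, cachedSession) {
  var createConnection = agent.createConnection;
  agent.createConnection = function (options) {
    options.session = cachedSession;
    return createConnection.call(agent, options);
  };
}

// Stub agent for demonstration: records the options it was called with.
var seen = null;
var stubAgent = {
  createConnection: function (options) {
    seen = options;
    return "socket";
  }
};

enableSessionReuse(stubAgent, "SESSION-BYTES");
stubAgent.createConnection({ host: "example.com" });
// seen.session is now "SESSION-BYTES"
```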

Keep Alive

Enabling keepalive in the agent avoids repeated SSL handshakes by reusing connections. A keepalive agent, such as agentkeepalive, can ‘fix’ Node’s keepalive troubles but is unnecessary in Node 0.12.

Another thing to keep in mind is agent maxSockets, where high numbers can result in a negative performance impact. Scale your maxSockets based on the volume of outbound connections you are making.

Slab Size

tls.SLAB_BUFFER_SIZE determines the allocation size of the slab buffers used by tls clients (and servers). The size defaults to 10 megabytes.

This allocation will grow your rss and increase garbage collection time. This means performance hits at high volume. Adjusting this to a lower number can improve memory and garbage collection performance. In 0.12, however, slab allocation has been improved and these adjustments are no longer necessary.

Recent SSL Changes in 0.12

Testing out Fedor’s SSL enhancements.

Test Description

Running an http server that acts as a proxy to an SSL server, all running on localhost.


Running 10s test @
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 69.38ms 30.43ms 268.56ms 95.24%
Req/Sec 14.95 4.16 20.00 58.65%
3055 requests in 10.01s, 337.12KB read
Requests/sec: 305.28
Transfer/sec: 33.69KB

v0.11.10-pre (build from master)

Running 10s test @
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 75.87ms 7.10ms 102.87ms 71.55%
Req/Sec 12.77 2.43 19.00 64.17%
2620 requests in 10.01s, 276.33KB read
Requests/sec: 261.86
Transfer/sec: 27.62KB

There isn’t a lot of difference here, but that’s due to the default ciphers, so let’s adjust agent options for ciphers. For example:

var agent = new https.Agent({
    "key": key,
    "cert": cert,
    "ciphers": "AES256-GCM-SHA384"
});


Running 10s test @ http://localhost:3000/
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 59.85ms 6.77ms 95.71ms 77.29%
Req/Sec 16.39 2.36 22.00 61.97%
3339 requests in 10.00s, 368.46KB read
Requests/sec: 333.79
Transfer/sec: 36.83KB

v0.11.10-pre (build from master)

Running 10s test @ http://localhost:3000/
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 38.99ms 5.96ms 71.87ms 86.22%
Req/Sec 25.43 5.70 35.00 63.36%
5160 requests in 10.00s, 569.41KB read
Requests/sec: 515.80
Transfer/sec: 56.92KB

As we can see, there is a night and day difference with Fedor’s changes: almost 2x the performance between 0.10 and 0.12!

Wrap Up

One might ask “why not just turn off SSL, then it’s fast!”, and this may be an option for some. In fact, this is typically the answer I get when I ask others how they overcame SSL performance issues. But if anything, enterprise SSL requirements will increase rather than decrease; and although a lot has been done to improve SSL in Node.js, performance tuning is still needed. Hopefully some of the above tips will help in tuning for your SSL use case.