
Benching Microbenchmarks


In under one week, Statistics for Software flew past 10 Myths for Enterprise Python to become the most visited post in the history of the PayPal Engineering blog. And that’s not counting the Japanese translation. Taken as an indicator of increased interest in software quality, this really floats all boats.

That said, there were enough emails and comments to call for a quick followup about one particularly troubling area.

Statistics for benchmarks

The saying in software goes that there are lies, damned lies, and software benchmarks.

A tachometer (an instrument indicating how hard an engine is working)

Too much software is built without the most basic instrumentation.

Yes, quantiles, histograms, and other fundamentals covered in Statistics for Software can certainly be applied to improve benchmarking. One of the timely inspirations for the post was our experience with a major network appliance vendor selling five-figure machines without measuring, let alone reporting, latency quantiles. Just throughput averages.

To fix this, we gave them two Jupyter notebooks: one that drove test traffic, and a second that produced the numbers they should have been measuring. We’ve amalgamated elements of both into a single notebook on PayPal’s GitHub. Two weeks later they had a new firmware build that sped up our typical traffic’s 99th percentile by two orders of magnitude. Google, Amazon, and their other customers will probably get the fixes in a few weeks, too. Meanwhile, we’re still waiting on our gourmet cheese basket.
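The core of that exercise is simple enough to fit in a few lines. Here is a minimal sketch, in the spirit of the notebook (function names and percentile choices are illustrative, not the notebook’s actual code), showing why quantiles beat a throughput average for latency:

```python
import random
import statistics

def latency_quantiles(samples_ms, quantiles=(0.5, 0.95, 0.99)):
    """Summarize latency samples with percentiles, not just the mean."""
    ordered = sorted(samples_ms)
    summary = {"mean": statistics.fmean(ordered)}
    for q in quantiles:
        # Nearest-rank percentile: simple and robust to outliers.
        idx = min(len(ordered) - 1, int(q * len(ordered)))
        summary[f"p{int(q * 100)}"] = ordered[idx]
    return summary

# Simulated latencies: mostly fast, with the heavy tail real traffic has.
random.seed(42)
samples = [random.expovariate(1 / 5.0) for _ in range(10_000)]
print(latency_quantiles(samples))
```

On a heavy-tailed distribution like this, the mean sits well above the median while p99 is an order of magnitude worse still — exactly the behavior a throughput average hides.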

Even though our benchmarks were simple, they were specific to the use case and used robust statistics. But even the most robust statistics won’t solve the real problem: the systematic overapplication of one or two microbenchmarks across all use cases. We must move forward to a more modern view.

Performance as a feature

Any framework or application branding itself as performant must include measurement instrumentation as an active interface. One cannot simply benchmark once and claim performance forever.1 Applications vary widely. There is no performance-critical situation where measurement is not also necessary. Instead, we see a glut of microframeworks, throwing out even the most obvious features in the name of speed.

Speed is not a built-in property. Yes, Formula 1 race cars are fast and yes, F1 designers are very focused on weight reduction. But they are not shaving off grams to set weight records. The F1 engineers are making room for more safety, metrics, and alerting. Once upon a time, this was not possible, but technology has come a long way since last century. So it goes with software.

To honestly claim performance on a feature sheet, a modern framework must provide a fast, reliable, and resource-conscious measurement subsystem, as well as a clear API for accessing the measurements. These are good uses of your server cycles. PayPal’s internal Python framework does all of this on top of SuPPort, faststat, and lithoxyl.
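To make the idea concrete, here is a minimal sketch of such a subsystem — a bounded reservoir of call durations with a quantile accessor. This is a hypothetical API for illustration, not the actual interface of faststat or lithoxyl:

```python
import random
import threading
import time

class Reservoir:
    """Fixed-size random sample of observed durations (hypothetical API)."""
    def __init__(self, size=1000):
        self.size = size
        self.values = []
        self.count = 0
        self._lock = threading.Lock()

    def add(self, value):
        with self._lock:
            self.count += 1
            if len(self.values) < self.size:
                self.values.append(value)
            else:
                # Reservoir sampling: memory stays bounded under any load.
                j = random.randrange(self.count)
                if j < self.size:
                    self.values[j] = value

    def quantile(self, q):
        ordered = sorted(self.values)
        return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

def timed(reservoir):
    """Decorator that records the wall-clock duration of each call."""
    def wrap(func):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                reservoir.add(time.perf_counter() - start)
        return inner
    return wrap
```

A service would wrap its handlers with `@timed(handler_times)` and expose `handler_times.quantile(0.99)` through a debug endpoint — measurement as a live interface, not a one-time benchmark.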

Benching the microbenchmark

An ECHO-brand ping pong paddle and ball.

Enough with the games already. They’re noisy and not even that fun.2

Microbenchmarks were already showing signs of fatigue. Strike one was the frequent lack of reproducibility. Strike two came when software authors began gaming the system, tuning their code to beat the benchmark rather than to serve real workloads. Now, microbenchmarks have officially struck out. Echoes and ping-pongs are worth less than their namesakes.

Standard profiling and optimization techniques, such as those chronicled in Enterprise Software with Python, still have their place in engineering performance. But those measurements are provisional and temporary. Today, we need software that provides idiomatic facilities for live measurement of every individual system’s true performance.

  1. I’m not naming names. Yet. You can follow me on Twitter in case that changes. 
  2. Line art by Frank Santoro.