Software development begins as a quest for capability, doing what could not be done before. Once that what is achieved, the engineer is left with the how. In enterprise software, the most frequently asked questions are, “How fast?” and more importantly, “How reliable?”
Questions about software performance cannot be answered, or even appropriately articulated, without statistics.
Yet most developers can’t tell you much about statistics. Much like math, statistics simply don’t come up for typical projects. Between coding the new and maintaining the old, who has the time?
Engineers must make the time. I understand fifteen minutes can seem like a big commitment these days, so maybe bookmark it. Insistent TLDR seekers can head for our instrumentation section or straight to the summary.
For the dedicated few, class is in session. It’s time to broach the topic, learn what works, and take the guesswork out of software. A few core practices go a long way in generating meaningful systems analysis. And a few common mistakes set projects way back. This guide aims to lighten software maintenance and speed up future development through answers made possible by the right kinds of applied statistics.
To begin, let’s consider a target component we want to measure and improve. At PayPal this is often one of our hundreds of HTTP server applications. If you were to look at our internal frameworks, you would find hundreds of code paths instrumented to generate streams of measurements: function execution times, service latencies, request lengths, and response codes, to name a few.
We discuss more about instrumentation below, but for now we assume these measurement points exist and focus on numerical measurements like execution times.
The correct starting point minimizes bias. We assume the least, treating our measurements as values of random variables. So opens the door to the wide world of descriptive statistics, a whole area devoted to describing the behavior of randomness. While this may sound obvious, remember that much of statistics is inferential, dedicated to modeling future outcomes. They may go hand-in-hand, but knowing the difference and proper names drastically speeds up future research. Lesson #1 is that it pays to learn the right statistics terminology.
Measurement is all about balance. Too little data and you might as well have not bothered. Or worse, you make an uninformed decision that takes your service down with it. Then again, too much data can take down a service. And even if you keep an eye on memory consumption, you need to consider the resources required to manage excess data. “Big data” looms large, but it is not a necessary step toward understanding big systems. We want dense, balanced data.
So what tools does the engineer have for balanced measurement? When it comes to metadata, there is never a shortage of volume, dimensions, or techniques. To avoid getting lost, we focus on descriptive summary statistics that can be collected with minimal overhead. With that direction, let’s start out by covering some familiar territory.
Statistical moments may not sound familiar, but virtually everyone uses them, including you. By age 10, most students can compute an average, otherwise known as the arithmetic mean. Our everyday mean is known to statisticians as the first moment. The mean is nice to calculate and can be quite enlightening, but there’s a reason this post doesn’t end here.
The mean is just the most famous of four commonly used moments. These moments are used to create a progression that adds up to a standard data description:
- Mean – The central tendency of the data
- Variance – The looseness of the grouping around the mean
- Skewness – The symmetry or bias toward one end of the scale
- Kurtosis – The amount of data in “the wings” of the set
The mean, variance, skewness, and kurtosis. In all images blue is less than yellow is less than pink.
These four standardized moments represent the most widely-applied metrics for describing the shape of a distribution. Each successive measure adds less practical information than the last, so skewness and kurtosis are often omitted. And while many are just hearing about these measures for the first time, omission may not be a bad thing.
The mean, variance, skewness, and kurtosis are almost never the right tools for a performance-minded engineer. Moment-based measures are not trustworthy messengers of the critical engineering metadata we seek.
Moment-based measures are not robust. Non-robust statistics simultaneously:
- Bend to skew by outliers
- Dilute the meaning of those outliers
An outlier is any data point distant from the rest of the distribution. “Outlier” may sound remote and improbable, but they are everywhere, making moment-based statistics uninformative and even dangerous. Outliers often represent the most critical data for a troubleshooting engineer.
So why do so many still use moments for software? The short answer, if you’ll pardon the pun: momentum. The mean and variance have two advantages: easy implementation and broad usage. In reality, that familiarity leads to damaging assumptions that ignore specifics like outliers. For software performance, the mean and variance are useful only in the context of robust statistical measures.
So, enough buildup. Lesson #2 is avoid relying solely on non-robust statistics like the mean to describe your data. Now, what robust techniques can we actually rely on?
If you’ve ever looked at census data or gotten standardized test results, you’re already familiar with quantiles. Quantiles are points that subdivide a dataset’s range into equal parts. Most often, we speak of quartiles (4 parts) and percentiles (100 parts). Measures of central tendency tend to be the most popular, and the same goes for the 50th percentile, the median.
While experienced engineers are happier to see the median than the mean, the data doesn’t really get interesting until we get into the extremes. For software performance and reliability, that means the 95th, 98th, 99th, and 99.9th percentiles. We call these the extremes, but as in high-traffic systems, they are far from uncommon. We also look for the range formed by the minimum and maximum values, sometimes called the 0th and 100th percentiles.
The challenge with quantiles and ranges is efficient computation. For instance, the traditional way to calculate the median is to choose the middle value, or the average of the two middle values, from a sorted set of all the data. Considering all the data at once is the only way to calculate exact quantiles.
Keeping all our data in memory would be prohibitively expensive for our use cases. Instead, using the principle of “good enough”, an engineer has ways of estimating quantiles much more efficiently. Statistics is founded on utilitarian compromise.
As a field, statistics formed to tackle the logistical issue of approximating an unreachable population. Even today, it’s still not possible to poll every person, taste every apple, or drive every car. So statistics continues to provide.
In the realm of software performance, data collection is automatable, to the point of making measurement too automatic. The problem becomes collation, indexing, and storage. Hard problems, replete with hard-working people.
Here we’re trying to make things easier. We want to avoid those hard problems. The easiest way to avoid too much data is to throw data away. Discarded data may not need storage, but if you’re not careful, it will live on, haunting data in the form of biases. We want pristine, meaningful data, indistinguishable from the population, just a whole lot smaller. So statistics provides us random sampling.
Reservoir sampling only holds on to a little bit of data, letting most of it pass through uncollected.
The twist is that we want to sample from an unknown population, considering only one data point at a time. This use case calls for a special corner of computer science: online algorithms, a subclass of streaming algorithms. “Online” implies only individual points are considered in a single pass. “Streaming” implies the program can only consider a subset of the data at a time, but can work in batches or run multiple passes. Fortunately, Donald Knuth helped popularize an elegant approach that enables random sampling over a stream: Reservoir sampling. Let’s jump right in.
First we designate a
counter, which will be incremented for every data point seen. The
reservoir is generally a list or array of predefined
size. Now we can begin adding data. Until we encounter
size elements, elements are added directly to
reservoir is full, incoming data points have a
size / counter chance to replace an existing sample point. We never look at the value of a data point and the random chance is guaranteed by definition. This way
reservoir is always representative of the dataset as a whole, and is just as likely to have a data point from the beginning as it is from the end. All this, with bounded memory requirements, and very little computation. See the instrumentation section below for links to Python implementations.
In simpler terms, the reservoir progressively renders a scaled-down version of the data, like a fixed-size thumbnail. Reservoir sampling’s ability to handle populations of unknown size fits perfectly with tracking response latency and other metrics of a long-lived server process.
Once you have a reservoir, what are the natural next steps? At PayPal, our standard procedure looks like:
- Look at the min, max, and other quantiles of interest (generally median, 95th, 99th, 99.9th percentiles).
- Visualize the CDF and histogram to get a sense for the shape of the data, usually by loading the data in a Jupyter notebook and using Pandas, matplotlib, and occasionally bokeh.
Reservoirs are perfect for getting that histogram silhouette of your performance data. CDF overlays and QQ plots can help with comparison.
And that’s it really. Beyond this we are usually adding more dimensions, like comparisons over time or between datacenters. Having the range, quantiles, and sampled view of the data really eliminates so much guesswork that we end up saving time. Tighten a timeout, implement a retry, test, and deploy. Maybe add higher-level acceptance criteria like an Apdex score. Regardless of the performance problem, we know when we’ve fixed it and we have the right numbers and charts to back us up.
Accuracy is a matter of having the right resolution.
Reservoir sampling does have its shortcomings. In particular, like an image thumbnail, accuracy is only as good as the resolution configured. In some cases, the data near the edges gets a bit blocky. Good implementations of reservoir sampling will already track the maximum and minimum values, but for engineers interested in the edges, we recommend keeping an increased set of the exact outliers. For example, for critical paths, we sometimes explicitly track the n highest response times observed in the last hour.
Depending on your runtime environment, resources may come at a premium. Reservoir sampling requires very little processing power, provided you have an efficient PRNG. Even your Arduino has one of those. But memory costs can pile up. Generally speaking, accuracy scales with the square root of size. Twice as much accuracy will cost you four times as much memory, so there are diminishing returns.
At PayPal, the typical reservoir is allocated 16,384 floating point slots, for a total of 64 kilobytes. At this rate, humans run out of memory before the servers do. Tracking 500 variables only takes 8 megabytes. As a developer, remembering what they all are is a different story.
Usually, reservoirs get us what we want and we can get on with non-statistical development. But sometimes, the situation calls for a more tailored approach.
At various points at PayPal, we’ve worked with q-digests, biased quantile estimators, and plenty of other advanced algorithms for handling performance data. After a lot of experimentation, two approaches remain our go-tos, both of which are much simpler than one might presume.
The first approach, histogram counters, establishes ranges of interest, called bins or buckets, based on statistics gathered from a particular reservoir. While reservoir counting is data agnostic, looking only at a random value to decide where to put the data, bucketed counting looks at the value, finds the bucket whose range includes that value, and increments the bucket’s associated counter. The value itself is not stored. The code is simple, and the memory consumption is even lower, but the key advantage is the execution speed. Bucketed counting is so low overhead that it allows statistics collection to permeate much deeper into our code than other algorithms would allow.
The second approach, Piecewise Parabolic Quantile Estimation (P2 for short), is an engineering classic. A product of the 1980s electronics industry, P2 is a pragmatic online algorithm originally designed for simple devices. When we look at a reservoir’s distribution and decide we need more resolution for certain quantiles, P2 lets us specify the quantiles ahead of time, and maintains those values on every single observation. The memory consumption is very low, but due to the math involved, P2 uses more CPU than reservoir sampling and bucketed counting. Furthermore, we’ve never seen anyone attempt combination of P2 estimators, but we assume it’s nontrivial. The good news is that for most distributions we see, our P2 estimators are an order of magnitude more accurate than reservoir sampling.
These approaches both take something learned from the reservoir sample and apply it toward doing less. Histograms provide answers in terms of a preconfigured value ranges, P2 provides answers at preconfigured quantile points of interest. Lesson #3 is to choose your statistics to match the questions you need to answer. If you don’t know what those questions are, stick to general tools like reservoir sampling.
There are a lot of ways to combine statistics and engineering, so it’s only fair that we offer some running starts.
We focused a lot on statistical fundamentals, but how do we generate relevant datasets in the first place? Our answer is through structured instrumentation of our components. With the right hooks in place, the data will be there when we need it, whether we’re staying late to debug an issue or when we have a spare cycle to improve performance.
Much of PayPal’s Python services’ robustness can be credited to a reliable remote logging infrastructure, similar to, but more powerful than, rsyslog. Still, before we can send data upstream, the process must collect internal metrics. We leverage two open-source projects, fast approaching major release:
- faststat – Optimized statistical accumulators
- lithoxyl – Next-generation logging and application instrumentation
Lithoxyl is a high-level library designed for application introspection. It’s intended to be the most Pythonic logging framework possible. This includes structured logging and various accumulators, including reservoir sampling, P2, and others. But more importantly, Lithoxyl creates a separate instrumentation aspect for applications, allowing output levels and formats to be managed separately from the instrumentation points themselves.
Faststat operates on a lower level. True to its name, Faststat is a compiled Python extension that implements accumulators for the measures described here and many more. This includes everything from geometric/harmonic means to Markov-like transition tracking to a metametric that tracks the time between stat updates. At just over half a microsecond per point, Faststat’s low-overhead allows it to permeate into some of the deepest depths of our framework code. Faststat lacks output mechanisms of its own, so our internal framework includes a simple web API and UI for browsing statistics, as well as a greenlet that constantly uploads faststat data to a remote accumulation service for alerting and archiving.
One of the many advantages to investing in instrumentation early is that you get a sense for the performance overhead of data collection. Reliability and features far outweigh performance in the enterprise space. Many critical services I’ve worked on could be multiple times faster without instrumentation, but removing this aspect would render them unmaintainable, which brings me to my next point.
Good work takes cycles. All the methods described here are performance-minded, but you have to spend cycles to regain cycles. An airplane could carry more passengers without all those heavy dials and displays up front. It’s not hard to imagine why logging and metrics is, for most services, second only to the features themselves. Always remember and communicate that having to choose between features and instrumentation does not bode well for the reliability record of the application or organization.
For those who really need to move fast and would prefer to reuse or subscribe, there are several promising choices out there, including New Relic and Prometheus. Obviously we have our own systems, but those offerings do have percentiles and histograms.
Without detracting from the main course above, this section remains as dessert. Have your fill, but keep in mind that the concepts above fulfill 95% of the needs of an enterprise SOA environment like PayPal.
We focused here on descriptive, non-parametric statistics for numerical applications. Statistics is much more than that. Having the vocabulary means the difference between the right answer and no answer. Here are some areas to expand into, and how they have applied to our work:
Non-parametric statistics gave us quantiles, but offers so much more. Generally, non-parametric describes any statistical construct that does not make assumptions about probability distribution, e.g. normal or binomial. This means it has the most broadly-applicable tools in the descriptive statistics toolbox. This includes everything from the familiar histogram to the sleeker kernel density estimation (KDE). There’s also a wide variety of nonparametric tests aimed at quantitatively discovering your data’s distribution and expanding into the wide world of parametric methods.
Parametric statistics contrast with non-parametric statistics in that the data is presumed to follow a given probability distribution. If you’ve established or assumed that your data can be modeled as one of the many published distributions, you’ve given yourself a powerful set of abstractions with which to reason about your system. We could do a whole article on the probability distributions we expect from different parts of our Python backend services (hint: expect a lot of fish and phones). Teasing apart the curves inherent in your system is quite a feat, but we never drift too far from the real observations. As with any extensive modeling exercise, heed the cautionary song of the black swan.
Inferential statistics contrast with descriptive statistics in that the goal is to develop models and predict future performance. Applying predictive modeling, like regression and distribution fitting, can help you assess whether you are collecting sufficient data, or if you’re missing some metrics. If you can establish a reliable model for your service and hook it into monitoring and alerting, you’ll have reached SRE nirvana. In the meantime, many teams make do with simply overlaying charts with the last week. This is often quite effective, diminishing the need for mathematical inference, but does require constant manual interpretation, doesn’t compose well for longer-term trend analysis, and really doesn’t work when the previous week isn’t representative (i.e., had an outage or a spike).
Categorical statistics contrast with numerical statistics in that the data is not mathematically measurable. Categorical data can be big, such as IPs and phone numbers, or small, like user languages. Our key non-numerical metrics are around counts, or cardinality, of categorical data. Some components have used HyperLogLog and Count-Min sketches for distributable streaming cardinality estimates. While reservoir sampling is much simpler, and can be used for categorical data as well, HLL and CMS offer increased space efficiency, and more importantly: proven error bounds. After grasping reservoir sampling, but before delving into advanced cardinaltiy structures, you may want to have a look at boltons ThresholdCounter, the heavy hitters counter used extensively in PayPal’s Python services. Regardless, be sure to take a look at this ontology of basic statistical data types.
Multivariate statistics allow you to analyze multiple output variables at a time. It’s easy to go overboard with multiple dimensions, as there’s always an extra dimension if you look for it. Nevertheless, a simple, practical exploration of correlations can give you a better sense of your system, as well as inform you as to redundant data collection.
Measurements are like a blanket over real behavior.
Multimodal statistics abound in real world data: multiple peaks or multiple distributions packed into a single dataset. Consider response times from an HTTP service:
- Successful requests (200s) have a “normal” latency.
- Client failures (400s) complete quickly, as little work can be done with invalid requests.
- Server errors (500s) can either be very quick (backend down) or very slow (timeouts).
Here we can assume that we have several curves overlaid, with 3 obvious peaks. This exaggerated graph makes it clear that maintaining a single set of summary statistics can do the data great injustice. Two peaks really narrows down the field of effective statistical techniques, and three or more will present a real challenge. There are times when you will want to discover and track datasets separately for more meaningful analysis. Other times it makes more sense to bite the bullet and leave the data mixed.
Time-series statistics transforms measurements by contextualizing them into a single, near-universal dimension: time intervals. At PayPal, time series are used all over, from per-minute transaction and error rates sent to OpenTSDB, to the Python team’s homegrown $PYPL Pandas stock price analysis. Not all data makes sense as a time series. It may be easy to implement certain algorithms over time series streams, but be careful about overapplication. Time-bucketing contorts the data, leading to fewer ways to safely combine samples and more shadows of misleading correlations.
Moving metrics, sometimes called rolling or windowed metrics, are another powerful class of calculation that can combine measurement and time. For instance, the exponentially-weighted moving average (EWMA), famously used by UNIX for its load averages:
10:53PM up 378 days, 1:01, 3 users, load average: 1.37, 0.22, 0.14
This output packs a lot of information into a small space, and is very cheap to track, but it takes some knowledge and understanding to interpret correctly. EWMA is simultaneously familiar and nuanced. It’s fun to consider whether you want time series-style disrcete buckets or the continuous window of a moving statistic. For instance, do you want the counts for yesterday, or the past 24 hours? Do you want the previous hour or the last 60 minutes? Based on the questions people ask about our applications, PayPal Python services keep few moving metrics, and generally use a lot more time series.
Even a little survival modeling will make your code cleaner.
Survival analysis is used to analyze the lifetimes of system components, and must make an appearance in any engineering article about reliability. Invaluable for simulations and post-mortem investigations, even a basic understanding of the bathtub curve can provide insight into lifespans of running processes. Failures are rooted in causes at the beginning, middle, and end of expected lifetime, which when overlaid, create a bathtub aggregate curve. When the software industry gets to a point where it leverages this analysis as much as the hardware industry, the technology world will undoubtedly have become a cleaner place.
It’s been a journey, but I hope you learned something new. If nothing else, you know how we do it. Remember the three lessons:
- Statistics is full of very specific terminology and techniques that you may not know that you need. It pays to take time to learn them.
- Averages and other moment-based measures don’t cut it for software performance and reliability. Never underestimate the outlier.
- Use quantiles, counting, and other robust metrics to generate meaningful data suitable for exploring the specifics of your software.
If you would like to learn more about statistics, especially branching out from descriptive statistics, I recommend Allen Downey‘s video course Data Exploration in Python, as well as his free books Think Stats and Think Bayes. If you would like to learn more about enterprise software development, or have developers in need of training, take a look at my new video course, Enterprise Software with Python. It includes even more software fundamentals informed by PayPal Python development, and is out now on oreilly.com and Safari.
I hope you found this a useful enough guide that you’ll save yourself some time, either by applying the techniques described or subscribing and sharing the link with other developers. Everyone needs a reintroduction to statistics after a few years of real work. Even if you studied statistics in school, real work, real data, and real questions have a way of making one wonder who passed those statistics exams. As interconnectivity increases, a rising tide of robust measurement floats all software engineering boats.