From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 2


Part 2: Lambda Architecture meets reality

Part 1 can be found here.

Fast Data

Fast forward to December 2015. We have a cross data center Kafka clusters, we have Spark adoption through the roof. All of this, however, was to fix our traditional batch platform. I’m not going to pretend we never thought about real-time stuff. We’d been gearing up toward the Lambda architecture all along, but truly we were not working specifically for the sake of the near real-time analytics.

The beauty of our current stack and skill set is that streaming just comes with it. All we needed to do is to write a Spark Streaming app connected directly to our Kafka and feed the same Teradata repo with micro batches.


Most of the stuff just worked out of the box, with some caveats.
Spark streaming requires a different thinking about resource allocation on Hadoop, dynamic memory management on Spark cache helps a lot. Long running streaming jobs also require a different approach to monitoring and SLA measurements. Our operational ecosystem needed some upgrades. For example, we had to create our own monitoring scripts to parse Spark logs to determine whether a job is still alive.
The only real nasty surprise we got, was from the downstream Tableau infrastructure Product and Marketing people were using. The Tableau server could not refresh its dashboards as fast as we were updating the metrics in Teradata.

Lambda architecture in practice

Let’s get a closer look at our version of the Lambda architecture.

There are two main scenarios inherent in streaming analytical systems, which our Lambda architecture implementation helps to overcome:

Upstream delays in the pipe

We need to remember that asynchronous near real time analytical systems are pretty sensitive to traffic fluctuations. Conceptually our pipe looks like series of buffers spread out across network zones, data centers, and nodes. Some buffers are pushing, some pulling. In total, it is highly available and protected from traffic spikes. Still, fairly often communication between different buffers can be degraded or completely lost. We will not lose events, persistent buffers take care of that. The correctness of the metric calculation, however, will suffer because of the delays. Batch system, on the other hand, provides greater correctness, but we have to artificially delay metric calculation to account for potential delays in the upstream pipe.

We tell our internal customers that they can have “close enough” numbers almost immediately and exact numbers within hours.

We use Spark for both batch and streaming. Same code base. The streaming part runs on micro-batch schedule – i.e. seconds or minutes, the Batch part runs on macro-batch schedule – hours, days. Correspondingly, streaming code calculates short duration metrics like 1 minute totals, and batch code calculates longer duration metrics like 1 hour or 1 day totals, both batches consume the same raw events.

The chained export process adds the latest available calculations into SQL data store, regardless of which part has produced the output.

Imagine some metric, say “number of clicks per page”. Micro-batch will pull new events from Kafka every minute and push totals to SQL data store. User might query last hour totals – an aggregate of the last 60 minutes.  At the same time, one or two nodes has been taken off-line due to some network issue. Some thousands of events might be still buffered (persisted) at the local node storage. Then after 30 minutes, nodes are back on-line and transmitting whatever events left in the buffer. However, at the time when the user saw the “last hour” total, late coming events were not available. Batch, which is scheduled to run on prior 4 hours of data, will include delayed records and will re-calculate hourly totals correctly. Another batch, which is scheduled to run daily, will make sure that whatever might be missed by 4-hour batch is added etc.

Interruptions or pauses in streaming calculations

Another common issue with near real time calculations is availability of the calculating process itself. What happens when our micro-batch goes down for whatever reason? (Could be some systemic Hadoop failure, bug, or planned downtime).

Spark Streaming micro-batch is a long-running Yarn process. This means it will allocate a certain amount of resources and will keep those resources forever (or until it is killed). Keeping this in mind, the amount of resources allocated (number and size of the containers as scheduled by Yarn), must be fine-tuned to fit the volume of the micro-batch pretty closely. Ideally, the duration of the single run should be very close to the micro-batch iteration time to minimize idle time between iterations.

In our case, we usually fine tune the number and size of Spark executors by determining our peak volume and running micro-batch experimentally for a number of days. Normally we would also oversubscribe resources to allow for some fluctuations in traffic. Let’s say we’ve decided to run our micro-batch every 1 minute. Yarn queue, in which our streaming job will run, will have max size to fit up to 2-3 min worth of traffic in one run, just in case of a traffic spike or quick job restart. What it means is our micro-batch will pull at most 3 minutes’ worth of traffic from Kafka in a single iteration and it will spend no more than 1 minute processing it.

What happens when micro-batch is down for an hour for cluster maintenance? When micro-batch starts after long pause and tries to pull all the events since last iteration, there will be 20 times more data to process than we have resources allocated in the queue. (Total delay is 60 min / max expected for single iteration 3 minute = 20 times). If we try to process all the backlog using our small queue it will run forever and create even more backlog.

Our approach for streaming calculations is to skip forward to the last 3 min after any interruption. This will create gaps in 1 minute totals. Any rolled up totals in this case will be off.

For this scenario as well, macro-batches, totaling raw events independently, will create correct hourly/daily/weekly totals despite any gap in 1 minute totals.


Bottom line, combining streaming micro-batch calculations with traditional macro-batch calculations will cover all user needs from rough numbers for trending during the day to exact numbers for a weekly report.

What’s next?

There are still two reactive dominoes to tumble before we achieve the full reactive nirvana across our stack.

The most obvious step is to relocate user facing dashboards to a nimbler analytical platform. We had already been using Druid for a while and developed a Spark – Druid connector. Now we can push calculation results to users at whatever velocity we need.

Familiarity with FP and Scala and understanding of the Akka role in the success of Spark and Kafka, helped us to re-evaluate our event collector architecture as well. It is written using a traditional Java Servlets container and uses lots of complicated multithreading code to consume multiple streams of events asynchronously.

Reactive approach (specifically reactive Akka Streams) and Scala is exactly what we need to drastically simplify our Java code base and to wring every last ounce of scalability from our hardware.

Thanks to our friends at the PayPal framework team, Reactive programming is now becoming mainstream at PayPal: SQUBS a new reactive way for PayPal to build applications


Is my claim that Reactive Programming is changing the world too dramatic? The cost of scalability is a key factor to consider. Frameworks like Akka are democratizing one of the most complex, hard to test, and debug domains of software development. With a solid foundation, Sparks and Kafkas of the world are evolving much faster than the prior generation of similar tools. This means more time is being spent on business critical features than on fixing inevitable (in the past) concurrency bugs.

Consider our case.

In the past, real-time analytic systems (like CEP – complex event processing) were an extreme luxury, literally costing millions of dollars in license fees just to set it up. Companies would invest in those systems for very important core business reasons only. For PayPal or Credit Card companies it would be fraud detection. Product roll-outs or marketing campaigns however, would very rarely reach the threshold of such importance.
Today we get the streaming platform practically for free as a package deal.

So yes! Reactive programming is changing the world.