Tag Archives: Kafka

Gimel: PayPal’s Analytics Data Processing Platform


It is no secret that the volume, velocity and variety of data is exploding. As a result, the case to move workloads to the Hadoop big data ecosystem is becoming easier by the day.

With Hadoop and the big data world come the additional complexities of managing the semantics of reading from and writing to various data stores. SQL has not been the lingua franca of the analytics world so far. In this world, you need to deal with a lot of choices – compute engines like Map Reduce, Spark, Flink, Hive, etc. and data stores like HDFS/Hive, Kafka, Elasticsearch, HBase, Aerospike, Cassandra, Couchbase, etc. The code required in order to read from and write to these data stores from different compute engines is fragile and cumbersome and differs significantly between data stores.

In the big data world, it is also extremely difficult to catalog datasets across various data stores in a way that makes it easy to add to the catalog or to search and find datasets.

Besides being unable to find the correct datasets, developers, analysts and data scientists don’t want to deal with the code to read and write data, and obviously want to focus their attention and time towards dataset development, analysis and exploration.

We built the Gimel Data Platform – the centralized metadata catalog PCatalog and the unified Data API – to address these challenges and help commoditize data access in the big data world.

What is the Gimel Data Platform

At the core, the Gimel Data Platform consists of two components:

PCatalog: A centralized metadata catalog powered by Gimel Discovery Services, which register datasets found across all the data stores in the company and makes them searchable through a PCatalog portal as well as programmatically through RESTful APIs.

Data API: A unified API to read from and write to data stores, including data streaming systems like Kafka. In addition to providing the API, the Gimel platform also enables SQL access to the API so developers, analysts and data scientists can now use Gimel SQL to access data stored on any data store. The Data API is built to work in batch mode or in streaming mode.


Gimel ecosystem

Gimel ecosystem

The Gimel Data Platform is tightly integrated with the Gimel Compute Platform, which offers a framework and a set of APIs to access the compute infrastructure — the Gimel Platform is accessed through the Gimel SDK and via Jupyter notebooks.

Additionally, the Gimel ecosystem, as seen from the image above, handles logging, monitoring, and alerting automatically so that there is a robust audit trail maintained for the user as well as for the operator.

By making SQL a first-class citizen, the Gimel Data Platform significantly reduces the time it takes to prototype and build analytic and data pipelines.

User Experience with Gimel

Regardless of the type of user or use case, the experience is simplified into at most 3 steps:

Find your datasets: Log into PCatalog portal and search for your datasets

Once you search for and find your dataset, you can take the logical name of the dataset for the next step.

PCatalog search

PCatalog search

Access your datasets: Navigate to Jupyter notebooks & analyze data

Use the logical name of the dataset you found in the earlier step in your analysis.

Gimel query across data stores

Gimel query across data stores

Schedule and productionalize

Productionalize your analysis

Productionalize your analysis

In a nutshell, PCatalog and its services provide the required dataset name, which is a logical pointer to the physical object. The Gimel library, which requires only the name of the dataset, abstracts the logic to access any supported data store while providing the ability for the user to express logic in SQL.

Implementation Details

Let’s take a look into the following pieces:

PCatalog Portal

The PCatalog portal provides multiple views into the entire metadata catalog. These views, some of which are only visible to administrators, give a comprehensive view of all the physical properties of any dataset, including the data store it belongs to.

Dataset Overview

PCatalog - Dataset overview

PCatalog – Dataset overview

Storage System Attributes (owned and set by Storage System Owner)

PCatalog - System Attributes

PCatalog – System Attributes

Physical Object Level Attributes (owned and set by Object Owner)

PCatalog - Object Attributes

PCatalog – Object Attributes

Dataset Readiness for Consumption on various Hadoop Clusters

PCatalog - Dataset Deployment Status

PCatalog – Dataset Deployment Status

A peek inside Gimel SQL

As a user, you may be reading from one dataset & writing to another dataset but the object’s (physical) structure and location is not your interest. We will look at two examples of reading from one data store and writing into another.

Reading from Kafka, persisting into HDFS/Hive in Batch mode

The processing logic is all possible via SQL, powered by Gimel:

Kafka to HDFS/Hive

Kafka to HDFS/Hive

Reading from Kafka, streaming into HBase

As you can see below, even this streaming access is easily achieved through SQL:

Kafka to HBase

Kafka to HBase

Regardless of the data store or whether it is in batch mode or streaming, you can express all the logic as SQL!

Anatomy of Data API (how does it go from SQL to native code execution)

Anatomy of Data API

Anatomy of Data API

Catalog provider modes

Under the hood, CatalogProvider provides an interface for the Gimel Library to get the dataset details from PCatalog Services. The type of the dataset is a critical configuration provided by CatalogProvider. A DataSet or DataStream Factory in turn returns a StorageDataSet. (For example, KafkaDataSet or ElasticDataSet.)

Thus, the read/write implementations with in each of these StorageDataSets can perform the core read/write operations.

CatalogProvider Modes

CatalogProvider Modes

Gimel is Open Sourced

Gimel is now open sourced and ready for your contributions. Just go over to gimel.io and provide us your inputs. For instructions to get started on your local machine, you can go to try.gimel.io.

We are convinced that Gimel Data Platform will speed up time to market, facilitate innovation, and enable a large population of analysts familiar with SQL to step into the big data world easily.

We cannot wait for you to try it!



Deepak Mohanakumar Chandramouli, Romit Mehta, Prabhu Kasinathan, Sid Anand

Deepak has over 13 years of experience in Data Engineering & 5 years of expertise building scalable data solutions in the Big Data space. He worked on building Apache Spark based Foundational Big Data Platform during the incubation of PayPal’s Data lake. He has applied experience in implementing spark based solutions across several types of No-SQL, Key-Value, Messaging, Document based & relational systems. Deepak has been leading the initiative to enable access to any type of storage on Spark via – unified Data API, SQL, tools & services, thus simplifying analytics & large-scale computation-intensive applications. Github: Dee-Pac

Romit has been building data and analytics solutions for a wide variety of companies across networking, semi-conductors, telecom, security and fintech industries for over 19 years. At PayPal he is leading the product management of core big data and analytics platform products which include a compute framework, a data platform and a notebooks platform. As part of this role, Romit is working to simplify application development on big data technologies like Spark and improve analyst and data scientist agility and ease of accessing data spread across a multitude of data stores via friendly technologies like SQL and notebooks. Github: theromit

Prabhu Kasinathan is engineering lead at PayPal, focusing on building highly scalable and distributed Analytics Platform to handle petabyte scale data at PayPal. His team focus on creating frameworks, APIs, productivity tools and services for big data platform to support multi-tenancy and large scale computation-intensive applications. Github: prabhu1984

Sid Anand currently serves as PayPal’s Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions including Agari’s Data Architect, a Technical Lead in Search @ LinkedIn, Netflix’s Cloud Data Architect, Etsy’s VP of Engineering, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. Github: r39132

From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 2


Part 2: Lambda Architecture meets reality

Part 1 can be found here.

Fast Data

Fast forward to December 2015. We have a cross data center Kafka clusters, we have Spark adoption through the roof. All of this, however, was to fix our traditional batch platform. I’m not going to pretend we never thought about real-time stuff. We’d been gearing up toward the Lambda architecture all along, but truly we were not working specifically for the sake of the near real-time analytics.

The beauty of our current stack and skill set is that streaming just comes with it. All we needed to do is to write a Spark Streaming app connected directly to our Kafka and feed the same Teradata repo with micro batches.


Most of the stuff just worked out of the box, with some caveats.
Spark streaming requires a different thinking about resource allocation on Hadoop, dynamic memory management on Spark cache helps a lot. Long running streaming jobs also require a different approach to monitoring and SLA measurements. Our operational ecosystem needed some upgrades. For example, we had to create our own monitoring scripts to parse Spark logs to determine whether a job is still alive.
The only real nasty surprise we got, was from the downstream Tableau infrastructure Product and Marketing people were using. The Tableau server could not refresh its dashboards as fast as we were updating the metrics in Teradata.

Lambda architecture in practice

Let’s get a closer look at our version of the Lambda architecture.

There are two main scenarios inherent in streaming analytical systems, which our Lambda architecture implementation helps to overcome:

Upstream delays in the pipe

We need to remember that asynchronous near real time analytical systems are pretty sensitive to traffic fluctuations. Conceptually our pipe looks like series of buffers spread out across network zones, data centers, and nodes. Some buffers are pushing, some pulling. In total, it is highly available and protected from traffic spikes. Still, fairly often communication between different buffers can be degraded or completely lost. We will not lose events, persistent buffers take care of that. The correctness of the metric calculation, however, will suffer because of the delays. Batch system, on the other hand, provides greater correctness, but we have to artificially delay metric calculation to account for potential delays in the upstream pipe.

We tell our internal customers that they can have “close enough” numbers almost immediately and exact numbers within hours.

We use Spark for both batch and streaming. Same code base. The streaming part runs on micro-batch schedule – i.e. seconds or minutes, the Batch part runs on macro-batch schedule – hours, days. Correspondingly, streaming code calculates short duration metrics like 1 minute totals, and batch code calculates longer duration metrics like 1 hour or 1 day totals, both batches consume the same raw events.

The chained export process adds the latest available calculations into SQL data store, regardless of which part has produced the output.

Imagine some metric, say “number of clicks per page”. Micro-batch will pull new events from Kafka every minute and push totals to SQL data store. User might query last hour totals – an aggregate of the last 60 minutes.  At the same time, one or two nodes has been taken off-line due to some network issue. Some thousands of events might be still buffered (persisted) at the local node storage. Then after 30 minutes, nodes are back on-line and transmitting whatever events left in the buffer. However, at the time when the user saw the “last hour” total, late coming events were not available. Batch, which is scheduled to run on prior 4 hours of data, will include delayed records and will re-calculate hourly totals correctly. Another batch, which is scheduled to run daily, will make sure that whatever might be missed by 4-hour batch is added etc.

Interruptions or pauses in streaming calculations

Another common issue with near real time calculations is availability of the calculating process itself. What happens when our micro-batch goes down for whatever reason? (Could be some systemic Hadoop failure, bug, or planned downtime).

Spark Streaming micro-batch is a long-running Yarn process. This means it will allocate a certain amount of resources and will keep those resources forever (or until it is killed). Keeping this in mind, the amount of resources allocated (number and size of the containers as scheduled by Yarn), must be fine-tuned to fit the volume of the micro-batch pretty closely. Ideally, the duration of the single run should be very close to the micro-batch iteration time to minimize idle time between iterations.

In our case, we usually fine tune the number and size of Spark executors by determining our peak volume and running micro-batch experimentally for a number of days. Normally we would also oversubscribe resources to allow for some fluctuations in traffic. Let’s say we’ve decided to run our micro-batch every 1 minute. Yarn queue, in which our streaming job will run, will have max size to fit up to 2-3 min worth of traffic in one run, just in case of a traffic spike or quick job restart. What it means is our micro-batch will pull at most 3 minutes’ worth of traffic from Kafka in a single iteration and it will spend no more than 1 minute processing it.

What happens when micro-batch is down for an hour for cluster maintenance? When micro-batch starts after long pause and tries to pull all the events since last iteration, there will be 20 times more data to process than we have resources allocated in the queue. (Total delay is 60 min / max expected for single iteration 3 minute = 20 times). If we try to process all the backlog using our small queue it will run forever and create even more backlog.

Our approach for streaming calculations is to skip forward to the last 3 min after any interruption. This will create gaps in 1 minute totals. Any rolled up totals in this case will be off.

For this scenario as well, macro-batches, totaling raw events independently, will create correct hourly/daily/weekly totals despite any gap in 1 minute totals.


Bottom line, combining streaming micro-batch calculations with traditional macro-batch calculations will cover all user needs from rough numbers for trending during the day to exact numbers for a weekly report.

What’s next?

There are still two reactive dominoes to tumble before we achieve the full reactive nirvana across our stack.

The most obvious step is to relocate user facing dashboards to a nimbler analytical platform. We had already been using Druid for a while and developed a Spark – Druid connector. Now we can push calculation results to users at whatever velocity we need.

Familiarity with FP and Scala and understanding of the Akka role in the success of Spark and Kafka, helped us to re-evaluate our event collector architecture as well. It is written using a traditional Java Servlets container and uses lots of complicated multithreading code to consume multiple streams of events asynchronously.

Reactive approach (specifically reactive Akka Streams) and Scala is exactly what we need to drastically simplify our Java code base and to wring every last ounce of scalability from our hardware.

Thanks to our friends at the PayPal framework team, Reactive programming is now becoming mainstream at PayPal: SQUBS a new reactive way for PayPal to build applications


Is my claim that Reactive Programming is changing the world too dramatic? The cost of scalability is a key factor to consider. Frameworks like Akka are democratizing one of the most complex, hard to test, and debug domains of software development. With a solid foundation, Sparks and Kafkas of the world are evolving much faster than the prior generation of similar tools. This means more time is being spent on business critical features than on fixing inevitable (in the past) concurrency bugs.

Consider our case.

In the past, real-time analytic systems (like CEP – complex event processing) were an extreme luxury, literally costing millions of dollars in license fees just to set it up. Companies would invest in those systems for very important core business reasons only. For PayPal or Credit Card companies it would be fraud detection. Product roll-outs or marketing campaigns however, would very rarely reach the threshold of such importance.
Today we get the streaming platform practically for free as a package deal.

So yes! Reactive programming is changing the world.

Carrier Payments Big Data Pipeline using Apache Storm


Carrier payments is a frictionless payment method enabling users to place charges for digital goods directly on their monthly mobile phone bill. There is no account needed, just the phone number. Payment authorization happens by verification of a four digit PIN sent via SMS to a user’s mobile phone. After the successful payment transaction, charges will appears on user’s monthly mobile phone bill.

Historically fraud has been handled on the mobile carrier side through various types of spending caps (daily, weekly, monthly, etc.). While these spending caps were able to keep fraud at bay in the early years, as this form of payment has matured, so have the fraud attempts. Over the last year we saw an increase in fraudulent activity based around social engineering. Through social engineering fraudsters are able to trick victims into revealing their PINs.

To tackle the increasing fraud activity we decided to build carrier payments risk platform using an hadoop eco system. First step in building the risk platform is to create big data pipeline that will collect data from various sources, clean or filter the data and structure it for data repositories. The accuracy of our decision-making directly depends on variety, quantity and speed of data collection.

Data comes from various sources. Payment application generates transaction data. Refund job generates data about refunds processed. Customer service department provides customer complaints regarding wrong charge or failed payments. Browser fingerprint can help establish another identity of transacting user. All these sources generate data in different formats and at varying frequencies. Putting together a data pipeline for this is challenging.

We wanted to build a data pipeline that processes data reliably and has the ability to replay in case of failure. It should be easy to add or remove new data sources or sinks into pipeline. Data pipeline should be flexible to allow schema changes. We should be able to scale the data pipeline.

System Architecture

First challenge was to provide a uniform and flexible entry point to data pipeline. We built a micro service with single REST end point that accept data in JSON format. This makes it easy for any system to send data to pipeline. The schema consists of property “ApplicationRef” and JSON object “data”. Sending data as JSON object allows easy schema changes.

Micro service writes data into kafka which also acts as buffer for data. There is separate kafka topic for any source that sends data to pipeline. Assigning separate kafka topic brings logical separation to data, makes the kakfa consumer logic simpler and allow us to provide more resources to topics that handle large volume of data.

We set up storm cluster to process data after reading it from kafka. To do real time computation on storm, you create “topologies”. Topology is a directed acyclic graph or DAG of computation. Topology contains logic for reading data from kafka, transforming it and writing to hive. Typically each topology consists of two nodes of computation – kafkaspout and hivebolt. Some of the topologies contain intermediate node of computation for transforming complex data. We created separate topologies to process data coming from various sources. At the moment we have 4 topologies – purchase topology, customer service topology, browser topology and refund topology.

Storm allows deploying and configuring topologies independently. This makes it possible for us to add more data sources in future or change existing ones without much hassle. This also allows us to allocate resources to topologies depending on data volume and processing complexities.


Storm cluster set up and parallelism

We set up high availability storm cluster. There are two nimbus nodes with one of them acting as leader node. We have two supervisor nodes each having 2 core Intel® Xeon® CPU ES-2670 0 @ 2.60GHz.

Each topology is configured with 2-number workers, 4-number executors and 4-number tasks. The number of executors is the number of threads spawned by each worker (JVM process). Together these settings define storm’s parallelism. Changing any of the above settings can easily change storm’s parallelism.


Storm message guarantee set up

Storm’s basic abstraction provides an at-least-once processing guarantee. Messages are replayed only in case of failure. This message semantics are ideal for our use case. The only problem was, we didn’t wish to keep trying failed message indeterminately. We added the following settings to our spoutconfig to stop trying failed message if there are at least five successful messages processed after a failed one. We also increased elapsed time between retrials of failed message.

spoutConfig.maxOffsetBehind = 5
spoutConfig.retryInitialDelayMs = 1000
spoutConfig.retryDelayMultiplier = 60.0
spoutConfig.retryDelayMaxMs = 24 * 60 * 60 * 1000


Our data pipeline is currently processing payment transaction, browser, refund and customer service data at >100 tuples/min rate. The biggest convenience factor with this architecture was staggered release of topologies. It was very easy to add new topology in storm without disturbing already running topologies. Time taken to release new topology is less than a day. Architecture is proved to be very reliable and we haven’t experience any issue or loss of data so far.





From Big Data to Fast Data in Four Weeks or How Reactive Programming is Changing the World – Part 1


Part 1: Reactive Manifesto’s Invisible Hand

Let me first setup the context for my story. I’ve been with PayPal for 5-years. I’m an architect. I’m part of the team responsible for PayPal Tracking domain. Tracking is commonly and historically understood as the measurement of customer visits to web pages. With the customer’s permission our platform collects all kinds of signals from PayPal web pages, mobile apps and services, for variety of reasons. Most prominent among them are measuring new product adoptions, A/B testing, and fraud analysis.

We collect several terabytes of data on our Hadoop systems every day. This is not an extremely huge amount, but it is big enough to be one of the primary motivations for adopting Hadoop at PayPal four years ago.


This particular story starts in the middle of December 2015, a week or so before holidays. I was pulled into a meeting to discuss a couple of new projects.

At the time, PayPal was feverishly preparing to release a brand spanking new mobile wallet app. It had been in development for a long time and promised a complete change in PayPal experience on smartphones. Lots of attention was being paid to this roll out from all levels of leadership – basically, a Big Deal!

Measuring new product adoption happens to be one of the main “raison d’etre” of the tracking platform, so following roll outs is nothing new for us. This time, however, we were ready to provide adoption metrics in near real time. By the launch date in the end of February, Product managers were able to report the new app adoption progress “as of 30 min ago” instead of the customary “as of yesterday”.

This was our first full production Fast Data application. You might be wondering why the title says four weeks, though from mid-December to the end of February it is ten weeks.

It did in fact take a small team – Sudhir R., Loga V., Shunmugavel C., Krishnaprasad K. – only four weeks to make it happen.

The first full volume live test happened during the Super Bowl, when PayPal aired its ad during the game. And small volume tests happened even earlier in January when the ad was first showcased in some news channels.

Just one year back there would have been no chance we could deliver on something like that in four weeks’ time, but in December 2015 we were on a roll. We had a production quality prototype running within two weeks and a full volume live test happened two weeks later. In another six weeks, PayPal’s largest release of the last several years was measured by this analytical workflow.

So what changed during a single year?
Allow me to get really nerdy for a second. On a fundamental level, what happened is that we embraced (at first without even realizing it) the Reactive Manifesto in two key components of our platform.

What has changed between December 2014 and December 2015

The original Tracking platform design followed common industry standard from 4-5 years ago. We collected data in real-time, buffered it in memory and small files, and finally uploaded consolidated files onto Hadoop cluster. At this point a variety of MapReduce jobs would crunch the data and publish the “analyst friendly” version to Teradata repository, from which our end users could query the data using SQL.

Sounds straightforward? It is anything but. There are several pitfalls and bottlenecks to consider.
First of all is data movement. We are a large company, with lots of services and lots of different types of processing distributed across multiple datacenters. On top of all of this we are also paranoid about security of anything and everything.

We ended up keeping a dedicated Hadoop cluster in a live environment for the sole purpose of collecting data. Then we would copy files to our main analytical Hadoop cluster in a different data center.


It worked, but … . All this waiting for big enough files and scheduling of file transfers added up to hours. We have tinkered with different tools, open source and commercial, looking for the goldilocks solution to address all the constraints and challenges. At the end of the day most of those tools failed to account for the reality on the ground.

Until, that is, we discovered Kafka. That was the first reactive domino to fall.

Kafka and moving data across data centers

There are three brilliant design decisions which made Kafka the indispensable tool it is today:

  • multi-datacenter topology out of the box;
  • short-circuiting internal data movement from network socket to OS file system cache and back to network socket;
  • sticking with strictly sequential I/O reads and writes

It fits our needs almost perfectly:

  • Throughput? Moving data between datacenters? Check.
  • Can configure topology to satisfy our complicated environment? Flexible persisted buffer between real-time stream and Hadoop? Check.
  • Price-performance? High availability? Check.


Challenges? Few.
The only real trouble we’ve had with Kafka itself is the level of maturity of its producer API and MirrorMaker. There are lots of knobs and dials to figure out. It changes from release to release. Some producer API behaviors are finicky. We had to invest some time into making sure we are not losing data between our client and Kafka broker when the broker goes down or connection is lost.

MirrorMaker requires some careful configuration to make sure we are not double compressing events. It took us some experimentation in a live environment to get to a proper capacity for the MirrorMaker instances. Even with all the hiccups, considering we were using pre-GA release open-source tool, it was a pretty rewarding and successful endeavor.

Most of our troubles came not from Kafka. Our network infrastructure and especially firewall capacities were not ready for the awesome throughput Kafka has unleashed. We were able to move very high volume of data at sub second latency between geographically removed data centers, but every so many hours the whole setup would slow down to a crawl.

It took us some time to understand that it was not a Kafka or MirrorMaker problem (as we naturally suspected at first). As it happened we were the first team to attempt that kind of volume of streaming traffic between our data centers. Prevailing traffic patterns at the time were all bulk oriented: file transfer, DB import/export etc.. Sustained real-time traffic has immediately uncovered bottlenecks and temporary saturations, which were hidden before. Good learning experience for us and our Network engineers.

Personally I was very impressed with Kafka capabilities and naturally wanted to know how it was written. That led me to Scala. Long story short, I had to rediscover functional programming and with it a totally different way of thinking about concurrency. That led me to Akka and Actors model – Hello Reactive Programming!

From MapReduce to Spark

Another major challenge that plagued us at the time was MapReduce inefficiency, both in physical sense and in a developer’s productivity sense. There are, literally, books written on MapReduce optimization. We know why it is inefficient and how to deal with it – in theory.

In practice however, developers tend to use MapReduce in the simplest way possible, splitting workflows into multiple digestible steps, saving intermediate results to HDFS, and incurring horrifying amount of I/O in the process.


MapReduce is just too low level and too dumb. Mixing complex business logic with MapReduce low level optimization techniques is asking too much. In architectural lingo: the buildability of such system is very problematic. Or in English: it takes way too long to code and test. Predictable results are tons of long running, brittle, hard to manage, and inefficient workflows.

Plus it is Java! Who in their right mind would use Java for analytics? Analytics is a synonym for SQL, right? We ended up writing convoluted pipes in MapReduce to cleanup and reorganize data and then dumped it to Teradata so somebody else can do agile analytical work using SQL, while we were super busy figuring out MapReduce. Pig and Hive helped to a degree, but only just.

We have tried to explore alternatives like “SQL on Hadoop” etc. Lots of cool tools, lots of vendors. The Hadoop ecosystem is rich, but hard to navigate. None has addressed our needs comprehensively. Goldilocks solution for BigData was out of reach it seemed.

Until, that is, we stumbled into Spark. And that was the second reactive domino to fall.

By now, it has become clear to us that using MapReduce is a dead-end. But Spark? Switching to Scala? What about the existing, hard earned skill set? People just started to get comfortable with MapReduce and Pig and Hive, it is very risky to switch. All valid arguments. Nevertheless, we jumped.

What makes Spark such a compelling system for the big data?
RDD (Resilient Distributed Dataset) concept has addressed most of the MapReduce inefficiency issues, both physical and productivity.

Resilient – resiliency is a key. The main reason behind MapReduce excessive I/O disorder is the need to preserve intermediate data to protect against node crashes. Spark solved the same with minimum I/O by keeping a history of transformations (functions) instead of actual transformed data chunks. Intermediate data can still be check pointed to HDFS if the process is very long and the developer can decide if a checkpoint is needed case by case. Once a dataset is read into cache, it can stay there as long as needed or as long as the cache size allows it. And again, the developer can control the behavior case by case.

Distributed Dataset- One thing that always bugged me in MapReduce is its inability to reason about my data as a dataset. Instead you are forced to think in single key-value pair, small chunk, block, split, or file. Coming from SQL, it felt like going backwards 20 years. Spark has solved this perfectly. A developer can now reason about his data as a Scala immutable collection (RDD API) or data frame (DataFrame API) or both interchangeably (DataSet API). This flexibility and abstraction from the distributed nature of the data leads to a huge productivity gains for developers.

And finally, no more Map + Reduce shackles. You can chain whatever functions you need, Spark will optimize your logic into DAG (dynamic acyclic graph – similar to how SQL is interpreted) and run it in one swoop keeping data in memory as much as physically possible.


“Minor” detail –  nobody knew Scala here. So to get all the efficiency and productivity benefits we had to train people in Scala and do it quickly.

It is a bit more than just a language though. Scala, at the first glance, has enough of a similarity to Java that any Java coder can fake it. (Plus, however cumbersome, Java can be used as Spark language and so can Python).

The real problem was to change the mindset of the Java MapReduce developer to a functional way of thinking. We needed a clean break from old Java/Python MR habits. What habits? Most of all mutability: Java beans, mutable collections and variables, loops etc. We were going to switch to Scala and do it right.

It has started slow. We have designed a curriculum for our Java MapReduce developers. We started with couple of two-hour sessions, one for Functional programming/Scala basics and one for Spark basics. After some trial and error the FP/Scala part was extended to a four hour learning by example session, while Spark barely needed two hours if at all. As it happens, once you understand the functional side of Scala, especially immutable collections, collection functions, and lambda expressions, Spark comes naturally.

It snowballed very quickly, within a couple of months we had the majority of our MapReduce engineers re-trained and sworn off MapReduce forever. Spark was in.

Spark, naturally, is written in Scala and Akka – 2:0 for team Reactive.

Stay tuned for Part 2: “Lambda Architecture meets reality” coming soon

Part 2 Preview

  • Fast Data stack
  • Micro and Macro batches
  • Invisible hand keeps on waving