It is no secret that the volume, velocity, and variety of data are exploding. As a result, the case for moving workloads to the Hadoop big data ecosystem grows stronger by the day.
With Hadoop and the big data world come the additional complexities of managing the semantics of reading from and writing to various data stores. SQL has not, so far, been the lingua franca of this world. Instead, you must deal with many choices: compute engines such as MapReduce, Spark, Flink, and Hive, and data stores such as HDFS/Hive, Kafka, Elasticsearch, HBase, Aerospike, Cassandra, and Couchbase. The code required to read from and write to these data stores from different compute engines is fragile and cumbersome, and it differs significantly from one data store to the next.
In the big data world, it is also extremely difficult to catalog datasets across various data stores in a way that makes it easy to add to the catalog or to search and find datasets.
Beyond the difficulty of finding the correct datasets, developers, analysts, and data scientists do not want to deal with the code to read and write data; they would rather focus their time and attention on dataset development, analysis, and exploration.
We built the Gimel Data Platform – the centralized metadata catalog PCatalog and the unified Data API – to address these challenges and help commoditize data access in the big data world.
What is the Gimel Data Platform?
At the core, the Gimel Data Platform consists of two components:
PCatalog: A centralized metadata catalog powered by Gimel Discovery Services, which registers datasets found across all the data stores in the company and makes them searchable through the PCatalog portal as well as programmatically through RESTful APIs.
Data API: A unified API to read from and write to data stores, including data streaming systems like Kafka. In addition to providing the API, the Gimel platform also enables SQL access to the API so developers, analysts and data scientists can now use Gimel SQL to access data stored on any data store. The Data API is built to work in batch mode or in streaming mode.
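To make this concrete, here is a sketch of what a batch read and write through the Data API can look like in Spark/Scala. The shape of the calls follows Gimel's published Scala API, but the dataset names are hypothetical placeholders, not real catalog entries:

```scala
import org.apache.spark.sql.SparkSession
import com.paypal.gimel.DataSet

// Initialize the Data API against an existing SparkSession
val spark = SparkSession.builder().appName("gimel-demo").getOrCreate()
val dataSet = DataSet(spark)

// Read by logical dataset name -- no Kafka connection details in user code
val kafkaDf = dataSet.read("pcatalog.flights_kafka_flights_log")

// Write to another store by its logical name;
// Gimel resolves the physical details from PCatalog
dataSet.write("pcatalog.flights_hive_flights_log", kafkaDf)
```

Note that the same two calls, `read` and `write`, apply whatever the underlying store happens to be; only the logical dataset name changes.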
The Gimel Data Platform is tightly integrated with the Gimel Compute Platform, which offers a framework and a set of APIs to access the compute infrastructure. The Gimel Platform is accessed through the Gimel SDK and via Jupyter notebooks.
Additionally, the Gimel ecosystem, as seen in the image above, handles logging, monitoring, and alerting automatically, so a robust audit trail is maintained for both the user and the operator.
By making SQL a first-class citizen, the Gimel Data Platform significantly reduces the time it takes to prototype and build analytic and data pipelines.
User Experience with Gimel
Regardless of the type of user or use case, the experience is simplified to at most three steps:
- Find your datasets
- Access your datasets and analyze your data (or alternatively, access your datasets to build your prototype application)
- Schedule your analyses (or alternatively, productionalize your application)
Find your datasets: Log into PCatalog portal and search for your datasets
Once you search for and find your dataset, note its logical name; you will use it in the next step.
Access your datasets: Navigate to Jupyter notebooks & analyze data
Use the logical name of the dataset you found in the earlier step in your analysis.
Schedule and productionalize
In a nutshell, PCatalog and its services provide the required dataset name, which is a logical pointer to the physical object. The Gimel library, which requires only the name of the dataset, abstracts the logic to access any supported data store while providing the ability for the user to express logic in SQL.
Let’s take a look at the following pieces:
The PCatalog portal provides multiple views into the entire metadata catalog. These views, some of which are only visible to administrators, give a comprehensive view of all the physical properties of any dataset, including the data store it belongs to.
- Storage system attributes (owned and set by the storage system owner)
- Physical object-level attributes (owned and set by the object owner)
- Dataset readiness for consumption on various Hadoop clusters
A peek inside Gimel SQL
As a user, you may be reading from one dataset and writing to another, but the physical structure and location of the underlying objects are not your concern. We will look at two examples of reading from one data store and writing into another.
Reading from Kafka, persisting into HDFS/Hive in Batch mode
The processing logic is all possible via SQL, powered by Gimel:
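A minimal sketch of what such a statement can look like in Gimel SQL follows; the dataset names are hypothetical, and the entries in your own PCatalog will differ:

```sql
-- Batch: read from a Kafka dataset and persist into a Hive dataset,
-- both referenced purely by their PCatalog logical names
INSERT INTO pcatalog.flights_hive_flights_log
SELECT * FROM pcatalog.flights_kafka_flights_log
```

The statement carries no broker lists, topic names, or HDFS paths; Gimel resolves all of those from the catalog.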
Reading from Kafka, streaming into HBase
As you can see below, even this streaming access is easily achieved through SQL:
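As a sketch (again with hypothetical dataset names, and with the window property as an assumption about Gimel's Kafka throttling configuration), the streaming case reads almost identically to the batch case:

```sql
-- Streaming: micro-batch from Kafka into HBase; the window setting below is an
-- assumed Gimel configuration property, shown only to illustrate tuning via SET
SET gimel.kafka.throttle.streaming.window.seconds=10;
INSERT INTO pcatalog.flights_hbase_flights_log
SELECT * FROM pcatalog.flights_kafka_flights_log
```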
Regardless of the data store or whether it is in batch mode or streaming, you can express all the logic as SQL!
Anatomy of the Data API (how it goes from SQL to native code execution)
Catalog provider modes
Under the hood, the CatalogProvider gives the Gimel Library an interface for fetching dataset details from the PCatalog Services. The type of the dataset is a critical piece of configuration provided by the CatalogProvider; based on it, a DataSet or DataStream factory returns the appropriate StorageDataSet (for example, a KafkaDataSet or an ElasticDataSet).
The read/write implementations within each of these StorageDataSets then perform the core read/write operations against the underlying store.
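Schematically, the resolution from a storage type to a concrete StorageDataSet can be pictured as a simple factory. This is a simplified sketch that mirrors the class names in the text, not Gimel's actual signatures:

```scala
// Simplified sketch of the resolution flow -- illustrative, not Gimel's real API
trait StorageDataSet {
  def read(datasetName: String): String // the real API returns a DataFrame
}

class KafkaDataSet extends StorageDataSet {
  def read(datasetName: String) = s"reading $datasetName from Kafka"
}

class ElasticDataSet extends StorageDataSet {
  def read(datasetName: String) = s"reading $datasetName from Elasticsearch"
}

object DataSetFactory {
  // In Gimel, the CatalogProvider supplies the storage type from PCatalog metadata
  def apply(storageType: String): StorageDataSet = storageType match {
    case "kafka"   => new KafkaDataSet
    case "elastic" => new ElasticDataSet
    case other     => throw new IllegalArgumentException(s"unsupported store: $other")
  }
}
```

The key point is that user code never touches the concrete classes; it sees only the logical dataset name, while the factory and catalog metadata select the implementation.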
Gimel is Open Source
We are convinced that the Gimel Data Platform will speed up time to market, facilitate innovation, and enable the large population of analysts familiar with SQL to step into the big data world easily.
We cannot wait for you to try it!
Deepak Mohanakumar Chandramouli, Romit Mehta, Prabhu Kasinathan, Sid Anand
Deepak has over 13 years of experience in data engineering and 5 years of expertise building scalable data solutions in the big data space. He worked on building the Apache Spark-based foundational big data platform during the incubation of PayPal’s data lake. He has applied experience implementing Spark-based solutions across several types of NoSQL, key-value, messaging, document-based, and relational systems. Deepak has been leading the initiative to enable access to any type of storage from Spark via a unified Data API, SQL, tools, and services, thus simplifying analytics and large-scale, computation-intensive applications. Github: Dee-Pac
Romit has been building data and analytics solutions for a wide variety of companies across the networking, semiconductor, telecom, security, and fintech industries for over 19 years. At PayPal he leads product management for the core big data and analytics platform products, which include a compute framework, a data platform, and a notebooks platform. In this role, Romit works to simplify application development on big data technologies like Spark and to improve analyst and data scientist agility in accessing data spread across a multitude of data stores via friendly technologies like SQL and notebooks. Github: theromit
Prabhu Kasinathan is an engineering lead at PayPal, focused on building a highly scalable, distributed analytics platform to handle petabyte-scale data. His team focuses on creating frameworks, APIs, productivity tools, and services for the big data platform to support multi-tenancy and large-scale, computation-intensive applications. Github: prabhu1984
Sid Anand currently serves as PayPal’s Chief Data Engineer, focusing on ways to realize the value of data. Prior to joining PayPal, he held several positions including Agari’s Data Architect, a Technical Lead in Search @ LinkedIn, Netflix’s Cloud Data Architect, Etsy’s VP of Engineering, and several technical roles at eBay. Sid earned his BS and MS degrees in CS from Cornell University, where he focused on Distributed Systems. In his spare time, he is a maintainer/committer on Apache Airflow, a co-chair for QCon, and a frequent speaker at conferences. Github: r39132