
The New API Transactions Dashboard


The new Transactions dashboard, launched recently, is also referred to as “API call history”. It provides histories of the transactions (API calls) made by applications in the sandbox and live environments. For each transaction it shows the date, type, status, and amount, as well as the details of the API call itself, such as the request and response messages.

The new dashboard has many features:

  • Displays history of all PayPal REST APIs.
  • Shows API call details like HTTP status code, request, response and headers to help with diagnostics.
  • Provides the ability to browse and find details quickly through pagination and filters.
  • Has 10x better performance.

Live Transactions Dashboard

An example of the live transactions dashboard is shown below:

Live Transactions Dashboard

The live dashboard displays the following:

  • HTTP status
  • Resource URI that was invoked
  • Transaction ID
  • Transaction Date

You can browse transactions for all your applications and filter transactions based on the application name. The dashboard also lets you view the details of a transaction. When you click a transaction in the table above, a popup like the one below appears:

Live Transactions Metadata

It has the following fields:

  • Metadata – Metadata about the transaction
  • Request – HTTP request, including headers
  • Response – HTTP response, including headers

Sandbox Transactions Dashboard

An example of the sandbox transactions dashboard is shown below:

Sandbox Transactions Dashboard

The sandbox transaction dashboard is similar to the live transaction dashboard, except that you can also filter the transactions based on the sandbox account. Since a developer can have multiple sandbox accounts associated with multiple sandbox applications, filtering on the basis of either a sandbox account email or an application can help you quickly find the transaction you are looking for.

Miscellaneous Information

The new dashboard can be accessed via the following links:

Author: Rahul Panjrath
About the author: I have been a Software Engineer @Braintree|PayPal since April 2014, part of the team responsible for https://developer.paypal.com. Coding is my passion and I love to break things daily ;). I graduated from San Jose State University, have been coding in the software industry for almost 9 years, and am still learning new things every day. That’s the best part, and it is what makes me love the work I do. I specialize in web programming, REST API development, TDD, etc.

I can be reached at:

PayPal’s API Style Guide


About the author: Jason Harmon is the former head of the API Design team at PayPal, where he helped development teams design high-quality, usable APIs across the platform. He blogs at apiux.com and has a YouTube channel, API Workshop (https://www.youtube.com/channel/UCKK2ir0jqCvfB-kzBGka_Lg).

Since 2013, PayPal has been developing a new generation of APIs, using REST semantics. While our public API developer community has seen the outward effects of this, internally we’ve been using the same strategy. Since 2013, we’ve defined most of the PayPal platform using REST APIs.

As part of the team guiding this engineering-wide project (we call it PPaaS aka “PayPal as a Service”), our API Design team has had the privilege to work with a huge number of development teams. We consult with development teams on API design to ensure the broadest consistency, sound usability, and a myriad of other concerns.

During the process of collaborating on hundreds of API designs, we’ve developed a detailed set of internal standards. Given the size of our team, it’s important to provide that level of detail, because it gives our developers building APIs clear guidance. However, with all the detail we’ve provided, it can be a little tough to get started learning what good API designs look like. In an attempt to capture the basics and provide an overview of our Standards, we’ve composed our API Style Guide. Rather than keeping this within our internal developer community, we removed any internal or proprietary references and made it something anybody could use as a set of API design guidelines. A handful of other organizations, such as Heroku and The White House, have shared their standards as well.

The hope is that more organizations that are passionate about APIs will share their design guidelines. This can only improve the consistency of the API space. While there are many books on the subject, looking at popular APIs is often the first place new API developers get started. Often this leads to guessing why a design works the way it does. By providing Standards or a Style Guide, new API developers can get a better sense of the rationale behind a functioning design.

We’ve published PayPal’s API Style Guide on GitHub. Most of the examples provided are based on our REST APIs, which you can find out more about on developer.paypal.com.

Deep Learning on Hadoop 2.0


The Data Science team in Boston is working on leveraging cutting-edge tools and algorithms to optimize business actions based on insights hidden in user data. Data Science relies heavily on machine learning algorithms that help us identify and exploit patterns in the data. Obtaining insights from Internet-scale data is a challenging undertaking; hence being able to run the algorithms at scale is a crucial requirement. With the explosion of data and the accompanying thousand-machine clusters, we need to adapt the algorithms to operate in such distributed environments. Running machine learning algorithms in a general-purpose distributed computing environment poses its own set of challenges.

Here we discuss how we have implemented and deployed Deep Learning, a cutting edge machine-learning framework, in one of the Hadoop clusters. We provide the details on how the algorithm was adapted to run in a distributed setting. We also present results of running the algorithm on a standard dataset.

Deep Belief Networks

Deep Belief Networks (DBNs) are graphical models obtained by stacking and training Restricted Boltzmann Machines (RBMs) in a greedy, unsupervised manner [1]. DBNs are trained to extract a deep hierarchical representation of the training data by modeling the joint distribution between the observed vector x and the l hidden layers h^k as follows, where the distribution for each hidden layer is conditioned on the layer immediately preceding it [4]:

DBN Distribution

Equation 1: DBN Distribution
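
For reference, the joint distribution described above is usually written in the following form in the tutorial cited as [4]. The notation here (x = h^0, with l hidden layers) follows that tutorial and is an assumption about what Equation 1 displayed:

```latex
P(x, h^1, \ldots, h^{l}) = \left( \prod_{k=0}^{l-2} P(h^k \mid h^{k+1}) \right) P(h^{l-1}, h^{l}), \qquad x = h^0
```

Here P(h^k | h^{k+1}) is the conditional distribution of layer k given layer k+1, and P(h^{l-1}, h^l) is the joint distribution modeled by the top-level RBM.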

The relationship between the input layer and the hidden layers can be observed in the figure below. At a high level, the first layer is trained as an RBM that models the raw input x. An input is a sparse binary vector representing the data to be classified, for example a binary image of a digit. Subsequent layers are trained using the transformed data (sample or mean activations) from the previous layers as training examples. The number of layers can be determined empirically to obtain the best model performance, and DBNs support arbitrarily many layers.

DBN layers

Figure 1: DBN layers

The following code snippet shows the training that goes into an RBM. The input data supplied to the RBM is processed for a predefined number of epochs. In each epoch, the input data is divided into small batches, and the weights, activations, and deltas are computed for each layer:
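
As a rough illustration of that loop, here is a minimal sketch assuming a binary RBM trained with one-step contrastive divergence (CD-1) in plain Python. The class, function, and parameter names (RBM, train_rbm, lr, batch_size) are hypothetical, and this is not the code we ran on the cluster:

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    """Draw a binary vector from independent Bernoulli probabilities."""
    return [1.0 if rng.random() < p else 0.0 for p in probs]


class RBM:
    """Tiny binary RBM trained with one-step contrastive divergence (CD-1)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.w = [[self.rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.vbias = [0.0] * n_visible
        self.hbias = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.hbias[j] + sum(v[i] * self.w[i][j] for i in range(len(v))))
                for j in range(len(self.hbias))]

    def visible_probs(self, h):
        return [sigmoid(self.vbias[i] + sum(h[j] * self.w[i][j] for j in range(len(h))))
                for i in range(len(self.vbias))]

    def train_batch(self, batch, lr):
        """Apply one CD-1 update averaged over the batch; return the mean squared reconstruction error."""
        n_v, n_h = len(self.vbias), len(self.hbias)
        dw = [[0.0] * n_h for _ in range(n_v)]
        dv = [0.0] * n_v
        dh = [0.0] * n_h
        error = 0.0
        for v0 in batch:
            h0 = self.hidden_probs(v0)                     # positive phase
            v1 = self.visible_probs(sample(h0, self.rng))  # reconstruction
            h1 = self.hidden_probs(v1)                     # negative phase
            for i in range(n_v):
                dv[i] += v0[i] - v1[i]
                error += (v0[i] - v1[i]) ** 2
                for j in range(n_h):
                    dw[i][j] += v0[i] * h0[j] - v1[i] * h1[j]
            for j in range(n_h):
                dh[j] += h0[j] - h1[j]
        scale = lr / len(batch)
        for i in range(n_v):
            self.vbias[i] += scale * dv[i]
            for j in range(n_h):
                self.w[i][j] += scale * dw[i][j]
        for j in range(n_h):
            self.hbias[j] += scale * dh[j]
        return error / len(batch)


def train_rbm(rbm, data, epochs=50, batch_size=2, lr=0.1):
    """Run the epoch / mini-batch loop and return the mean error per epoch."""
    history = []
    for _ in range(epochs):
        batches = [data[k:k + batch_size] for k in range(0, len(data), batch_size)]
        history.append(sum(rbm.train_batch(b, lr) for b in batches) / len(batches))
    return history


if __name__ == "__main__":
    toy = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
    errors = train_rbm(RBM(n_visible=4, n_hidden=2), toy)
    print(f"first epoch error {errors[0]:.3f}, last epoch error {errors[-1]:.3f}")
```

The toy data and hyperparameters are only there to make the sketch runnable; a production implementation would use vectorized linear algebra rather than nested Python loops.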


After all layers are trained, the parameters of the deep network are fine-tuned using a supervised training criterion. The supervised criterion can, for instance, be framed as a classification problem, which then allows the deep network to be used as a classifier. More complex supervised criteria can also be employed, providing interesting results such as scene interpretation, for instance explaining what objects are present in a picture.

Infrastructure

Deep learning has received a great deal of attention, not only because it can deliver results superior to some other learning algorithms, but also because it can be run in a distributed setting, allowing processing of large-scale datasets. Deep networks can be parallelized at two major levels: the layer level and the data level [6]. For layer-level parallelization, many implementations use GPU arrays to compute layer activations in parallel and frequently synchronize them. However, this approach is not suitable for clusters where data can reside across multiple machines connected by a network, because of high network costs. Data-level parallelization, in which the training is parallelized over subsets of the data, is more suitable for these settings. Most of the data at PayPal is stored in Hadoop clusters; hence being able to run the algorithms in those clusters is our priority. Dedicated cluster maintenance and support is also an important factor for us to consider. However, since deep learning is inherently iterative, a paradigm such as MapReduce is not well suited for running these algorithms. With the advent of Hadoop 2.0 and YARN-based resource management, we can write iterative applications because we can finely control the resources the application is using. We adapted IterativeReduce [7], a simple abstraction for writing iterative algorithms in Hadoop YARN, and were able to deploy it in one of the PayPal clusters running Hadoop 2.4.1.

Methodology

We implemented the core deep learning algorithm by Hinton, referenced in [2]. Since our requirement is to run the algorithm on a multi-machine cluster, we adapted it for such a setting. For distributing the algorithm across multiple machines, we followed the guidelines proposed by Grazia et al. [6]. A high-level summary of our implementation is given below:

  1. The master node initializes the weights of the RBM.
  2. The master node pushes the weights and the splits to the worker nodes.
  3. Each worker trains an RBM layer for one dataset epoch, i.e. one complete pass through its entire split, and sends the updated weights back to the master node.
  4. The master node averages the weights from all workers for a given epoch.
  5. Steps 3-4 are repeated for a predefined number of epochs (50 in our case).
  6. After step 5 is done, one layer is trained. The steps are repeated for subsequent RBM layers.
  7. After all layers are trained, the deep network is fine-tuned using error back-propagation.

The figure below describes a single dataset epoch (steps 3-4) while running the deep learning algorithm. We note that this paradigm can be leveraged to implement a host of machine learning algorithms that are iterative in nature.

Single dataset epoch for training

Figure 2: Single dataset epoch for training
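
The train-then-average cycle in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: worker_epoch is a hypothetical stand-in for the real per-worker RBM training (it merely nudges each weight toward the mean of the worker's split), and everything runs in one process rather than across YARN containers:

```python
def average_weights(worker_weights):
    """Element-wise mean of the weight matrices returned by the workers."""
    n = len(worker_weights)
    rows, cols = len(worker_weights[0]), len(worker_weights[0][0])
    return [[sum(w[i][j] for w in worker_weights) / n for j in range(cols)]
            for i in range(rows)]


def worker_epoch(weights, split, lr=0.1):
    """Hypothetical stand-in for one worker's pass over its split.

    A real worker would train an RBM layer here; this toy version just
    moves every weight a small step toward the mean of the split.
    """
    target = sum(split) / len(split)
    return [[w + lr * (target - w) for w in row] for row in weights]


def train_layer(splits, weights, epochs):
    """Master loop: push weights, let each worker train, average the results."""
    for _ in range(epochs):
        updates = [worker_epoch(weights, split) for split in splits]
        weights = average_weights(updates)
    return weights


if __name__ == "__main__":
    splits = [[0.0, 2.0], [4.0]]          # two workers, two data splits
    print(train_layer(splits, [[0.0]], epochs=50))
```

Averaging after every epoch keeps the workers synchronized at the cost of one round of network traffic per epoch, which is the trade-off the methodology above accepts.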

 

The following code snippet shows the steps involved in training a DBN on a single machine. The dataset is first divided into multiple batches. Then multiple RBM layers are initialized and trained sequentially. After the RBMs are trained, they are passed through a fine-tune phase, which uses error back-propagation.

 


We adapted the IterativeReduce [7] implementation for much of the YARN “plumbing”. We did a major overhaul of the implementation to make it usable for our deep learning implementation. The IterativeReduce implementation was written for the Cloudera Hadoop distribution, and we re-platformed it to the standard Apache Hadoop distribution. We also rewrote the implementation to use the standard programming models described in [8]. In particular, we used the YarnClient API for communication between the client application and the ResourceManager. We also used AMRMClient and NMClient for interaction between the ApplicationMaster and the ResourceManager and NodeManager, respectively.

We first submit an application to the YARN resource manager using the YarnClient API:


After the application is submitted, the YARN resource manager launches the application master. The application master is responsible for allocating and releasing the worker containers as necessary. The application master uses AMRMClient to communicate with the resource manager.


The application master uses the NMClient API to run commands in the containers it received from the resource manager.


Once the application master launches the worker containers it requires, it sets up a port to communicate with the workers. For our deep learning implementation, we added the methods required for parameter initialization, layer-by-layer training, and fine-tune signaling to the original IterativeReduce interface. IterativeReduce uses Apache Avro IPC for Master-Worker communication.

The following code snippet shows the series of steps carried out by the master and worker nodes during distributed training. The master sends the initial parameters to the workers, and each worker then trains its RBM on its portion of the data. After a worker is done training, it sends the results back to the master, which then combines the results. After the iterations are completed, the master completes the process by starting the back-propagation fine-tune phase.


Results

We evaluated the performance of the deep learning implementation using the MNIST handwritten digit recognition dataset [3]. The dataset contains manually labeled handwritten digits ranging from 0 to 9. The training set consists of 60,000 images and the test set consists of 10,000 images.

In order to measure the performance, the DBN was first pre-trained and then fine-tuned on the 60,000 training images. After the above steps, the DBN was then evaluated on the 10,000 test images. No pre-processing was done on the images during training or evaluation. The error rate was obtained as a ratio between total number of misclassified images and the total number of images on the test set.
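
The error-rate definition above amounts to a one-line computation. The sketch below uses hypothetical label lists and a hypothetical function name purely for illustration:

```python
def error_rate(predicted, actual):
    """Fraction of test examples whose predicted label differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    misclassified = sum(p != a for p, a in zip(predicted, actual))
    return misclassified / len(actual)


if __name__ == "__main__":
    predicted = [7, 2, 1, 0, 4, 1, 4, 9]   # toy predictions
    actual = [7, 2, 1, 0, 4, 1, 4, 4]      # toy ground truth
    print(f"{error_rate(predicted, actual):.2%}")
```

On a 10,000-image test set, for example, 166 misclassified images would correspond to an error rate of 1.66%.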

We achieved a best classification error rate of 1.66% when using RBMs with 500-500-2000 hidden units on a 10-node distributed cluster. This error rate is comparable with the error rate of 1.2% reported by the authors of the original algorithm (with 500-500-2000 hidden units) [2], and with some of the results with similar settings reported in [3]. We note that the original implementation ran on a single machine, whereas our implementation runs in a distributed setting. The parameter-averaging step contributes to a slight reduction in performance, although the benefit of distributing the algorithm over multiple machines far outweighs the reduction. The table below summarizes how the error rate varies with the number of hidden units in each layer while running on a 10-node cluster.

MNIST performance evaluation

Table 1: MNIST performance evaluation

Further thoughts

We successfully deployed a distributed deep learning system, which we believe will prove useful in solving some of our machine learning problems. Furthermore, the iterative reduce abstraction can be leveraged to distribute any other suitable machine learning algorithm. Being able to utilize the general-purpose Hadoop cluster will prove highly beneficial for running scalable machine learning algorithms on large datasets. We note that there are several improvements we would like to make to the current framework, chiefly around reducing network latency and having more advanced resource management. Additionally, we’d like to optimize the DBN framework so that we can minimize inter-node communication. With fine-grained control of cluster resources, the Hadoop YARN framework gives us the flexibility to do so.

References

[1] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[2] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.

[3] Y. LeCun, C. Cortes, C. J.C. Burges. The MNIST database of handwritten digits.

[4] Deep Learning Tutorial. LISA lab, University of Montreal

[5] G. E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Lecture Notes in Computer Science Volume 7700: 599-619, 2012.

[6] M. Grazia, I. Stoianov, M. Zorzi. Parallelization of Deep Networks. ESANN, 2012

[7] IterativeReduce, https://github.com/jpatanooga/KnittingBoar/wiki/IterativeReduce

[8] Apache Hadoop YARN – Enabling Next Generation Data Applications, http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex

10 Myths of Enterprise Python


2016 Update: Whether you enjoy myth busting, Python, or just all enterprise software, you will also likely enjoy Enterprise Software with Python, presented by the author of the article below, and published by O’Reilly.

PayPal enjoys a remarkable amount of linguistic pluralism in its programming culture. In addition to the long-standing popularity of C++ and Java, an increasing number of teams are choosing JavaScript and Scala, and Braintree’s acquisition has introduced a sophisticated Ruby community.

One language in particular has both a long history at eBay and PayPal and a growing mindshare among developers: Python.

Python has enjoyed many years of grassroots usage and support from developers across eBay. Even before official support from management, technologists of all walks went the extra mile to reap the rewards of developing in Python. I joined PayPal a few years ago, and chose Python to work on internal applications, but I’ve personally found production PayPal Python code from nearly 15 years ago.

Today, Python powers over 50 projects, including:

  • Features and products, like RedLaser
  • Operations and infrastructure, both OpenStack and proprietary
  • Mid-tier services and applications, like the one used to set PayPal’s prices and check customer feature eligibility
  • Monitoring agents and interfaces, used for several deployment and security use cases
  • Batch jobs for data import, price adjustment, and more
  • And far too many developer tools to count

In the coming series of posts I’ll detail the initiatives and technologies that led the eBay/PayPal Python community to grow from just under 25 engineers in 2011 to over 260 in 2014. For this introductory post, I’ll be focusing on the 10 myths I’ve had to debunk the most in eBay and PayPal’s enterprise environments.

Myth #1: Python is a new language

What with all the start-ups using it and kids learning it these days, it’s easy to see how this myth still persists. Python is actually over 23 years old, originally released in 1991, 5 years before HTTP 1.0 and 4 years before Java. A now-famous early usage of Python was in 1996: Google’s first successful web crawler.

If you’re curious about the long history of Python, Guido van Rossum, Python’s creator, has taken the care to tell the whole story.

Myth #2: Python is not compiled

While not requiring a separate compiler toolchain like C++, Python is in fact compiled to bytecode, much like Java and many other compiled languages. Further compilation steps, if any, are at the discretion of the runtime, be it CPython, PyPy, Jython/JVM, IronPython/CLR, or some other process virtual machine. See Myth #6 for more info.
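
The compilation step is easy to observe from within Python itself. The following is a small sketch using the built-in compile function and the standard dis module on CPython; the source string and function name are made up for the example, and the exact bytecode printed varies between interpreter versions:

```python
import dis

SOURCE = "def add(a, b):\n    return a + b\n"


def compile_and_load(source):
    """Compile source text to a code object, then execute it in a fresh namespace."""
    code = compile(source, "<example>", "exec")  # source text -> code object
    namespace = {}
    exec(code, namespace)                        # runs the compiled module body
    return code, namespace


code, namespace = compile_and_load(SOURCE)
add = namespace["add"]

if __name__ == "__main__":
    print(type(code).__name__)                           # the compiled unit
    print(len(add.__code__.co_code), "bytes of bytecode")
    dis.dis(add)                                         # human-readable disassembly
```

CPython performs the same step implicitly on import and caches the result in `.pyc` files.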

The general principle at PayPal and elsewhere is that the compilation status of code should not be relied on for security. It is much more important to secure the runtime environment, as virtually every language has a decompiler, or can be intercepted to dump protected state. See the next myth for even more Python security implications.

Myth #3: Python is not secure

Python’s affinity for the lightweight may not make it seem formidable, but the intuition here can be misleading. One central tenet of security is to present as small a target as possible. Big systems are anti-secure, as they tend to overly centralize behaviors, as well as undercut developer comprehension. Python keeps these demons at bay by encouraging simplicity. Furthermore, CPython addresses these issues by being a simple, stable, and easily-auditable virtual machine. In fact, a recent analysis by Coverity Software resulted in CPython receiving their highest quality rating.

Python also features an extensive array of open-source, industry-standard security libraries. At PayPal, where we take security and trust very seriously, we find that a combination of hashlib, PyCrypto, and OpenSSL, via PyOpenSSL and our own custom bindings, cover all of PayPal’s diverse security and performance needs.

For these reasons and more, Python has seen some of its fastest adoption at PayPal (and eBay) within the application security group. Here are just a few security-based applications utilizing Python for PayPal’s security-first environment:

  • Creating security agents for facilitating key rotation and consolidating cryptographic implementations
  • Integrating with industry-leading HSM technologies
  • Constructing TLS-secured wrapper proxies for less-compliant stacks
  • Generating keys and certificates for our internal mutual-authentication schemes
  • Developing active vulnerability scanners

Plus, myriad Python-built operations-oriented systems with security implications, such as firewall and connection management. In the future we’ll definitely try to put together a deep dive on PayPal Python security particulars.

Myth #4: Python is a scripting language

Python can indeed be used for scripting, and is one of the forerunners of the domain due to its simple syntax, cross-platform support, and ubiquity among Linux, Macs, and other Unix machines.

In fact, Python may be one of the most flexible technologies among general-use programming languages. To list just a few:

  1. Telephony infrastructure (Twilio)
  2. Payments systems (PayPal, Balanced Payments)
  3. Neuroscience and psychology (many, many, examples)
  4. Numerical analysis and engineering (numpy, numba, and many more)
  5. Animation (LucasArts, Disney, Dreamworks)
  6. Gaming backends (Eve Online, Second Life, Battlefield, and so many others)
  7. Email infrastructure (Mailman, Mailgun)
  8. Media storage and processing (YouTube, Instagram, Dropbox)
  9. Operations and systems management (Rackspace, OpenStack)
  10. Natural language processing (NLTK)
  11. Machine learning and computer vision (scikit-learn, Orange, SimpleCV)
  12. Security and penetration testing (so many, and eBay/PayPal)
  13. Big Data (Disco, Hadoop support)
  14. Calendaring (Calendar Server, which powers Apple iCal)
  15. Search systems (ITA, Ultraseek, and Google)
  16. Internet infrastructure (DNS) (BIND 10)

Not to mention web sites and web services aplenty. In fact, PayPal engineers seem to have a penchant for going on to start Python-based web properties, YouTube and Yelp, for instance. For an even bigger list of Python success stories, check out the official list.

Myth #5: Python is weakly-typed

Python’s type system is characterized by strong, dynamic typing. Wikipedia can explain more.

Not that it is a competition, but as a fun fact, Python is more strongly-typed than Java. Java has a split type system for primitives and objects, with null lying in a sort of gray area. On the other hand, modern Python has a unified strong type system, where the type of None is well-specified. Furthermore, the JVM itself is also dynamically-typed, as it traces its roots back to an implementation of a Smalltalk VM acquired by Sun.
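
A quick way to see what “strong, dynamic” means in practice is the sketch below; the helper name try_add is invented for the example. Names can be rebound to values of any type (dynamic), but values are never silently coerced between unrelated types (strong):

```python
def try_add(a, b):
    """Return a + b, or a description of the TypeError that Python raises."""
    try:
        return a + b
    except TypeError as exc:
        return f"TypeError: {exc}"


value = 1        # the name is bound to an int ...
value = "one"    # ... and can be rebound to a str: typing is dynamic

if __name__ == "__main__":
    print(try_add(1, 2))         # ints add
    print(try_add("1", "2"))     # strs concatenate
    print(try_add("1", 2))       # no implicit coercion between str and int
    print(type(None).__name__)   # None has a well-defined type of its own
```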

Python’s type system is very nice, but for enterprise use there are much bigger concerns at hand.

Myth #6: Python is slow

First, a critical distinction: Python is a programming language, not a runtime. There are several Python implementations:

  1. CPython is the reference implementation, and also the most widely distributed and used.
  2. Jython is a mature implementation of Python for usage with the JVM.
  3. IronPython is Microsoft’s Python for the Common Language Runtime, aka .NET.
  4. PyPy is an up-and-coming implementation of Python, with advanced features such as JIT compilation, incremental garbage collection, and more.

Each runtime has its own performance characteristics, and none of them are slow per se. The more important point here is that it is a mistake to assign performance assessments to a programming language. Always assess an application runtime, most preferably against a particular use case.

Having cleared that up, here is a small selection of cases where Python has offered significant performance advantages:

  1. Using NumPy as an interface to Intel’s MKL SIMD
  2. PyPy’s JIT compilation achieves faster-than-C performance
  3. Disqus scales from 250 to 500 million users on the same 100 boxes

Admittedly these are not the newest examples, just my favorites. It would be easy to get side-tracked into the wide world of high-performance Python and the unique offerings of runtimes. Instead of addressing individual special cases, attention should be drawn to the generalizable impact of developer productivity on end-product performance, especially in an enterprise setting.

A comparison of C++ to Python. The Python is approximately a tenth the size.

C++ vs. Python. Two languages, one output, as apples to apples as it gets.

Given enough time, a disciplined developer can execute the only proven approach to achieving accurate and performant software:

  1. Engineer for correct behavior, including the development of respective tests
  2. Profile and measure performance, identifying bottlenecks
  3. Optimize, paying proper respect to the test suite and Amdahl’s Law, and taking advantage of Python’s strong roots in C.
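
The three steps above can be walked through with nothing but the standard library. The sketch below is illustrative only: the two sum-of-squares functions and the profile helper are hypothetical names, with cProfile and pstats standing in for whatever profiler a team prefers:

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Step 1: a straightforward, obviously correct implementation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def fast_sum_of_squares(n):
    """Step 3: an optimized version using the closed form for 0^2 + ... + (n-1)^2."""
    return (n - 1) * n * (2 * n - 1) // 6


def profile(func, *args):
    """Step 2: run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


if __name__ == "__main__":
    # The test guards the optimization: both versions must agree.
    assert slow_sum_of_squares(1000) == fast_sum_of_squares(1000)
    _, report = profile(slow_sum_of_squares, 200_000)
    print(report)
```

Keeping the slow version around as a reference implementation is what makes the optimization in step 3 safe to attempt.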

It might sound simple, but even for seasoned engineers, this can be a very time-consuming process. Python was designed from the ground up with developer timelines in mind. In our experience, it’s not uncommon for Python projects to undergo three or more iterations in the time it takes C++ and Java projects to do just one. Today, PayPal and eBay have seen multiple success stories wherein Python projects outperformed their C++ and Java counterparts, with less code (see the comparison above), all thanks to fast development times enabling careful tailoring and optimization. You know, the fun stuff.

Myth #7: Python does not scale

Scale has many definitions, but by any definition, YouTube is a web site at scale. More than 1 billion unique visitors per month, over 100 hours of uploaded video per minute, and going on 20 percent of peak Internet bandwidth, all with Python as a core technology. Dropbox, Disqus, Eventbrite, Reddit, Twilio, Instagram, Yelp, EVE Online, Second Life, and, yes, eBay and PayPal all have Python scaling stories that prove scale is more than just possible: it’s a pattern.

The key to success is simplicity and consistency. CPython, the primary Python virtual machine, maximizes these characteristics, which in turn makes for a very predictable runtime. One would be hard pressed to find Python programmers concerned about garbage collection pauses or application startup time. With strong platform and networking support, Python naturally lends itself to smart horizontal scalability, as manifested in systems like BitTorrent.

Additionally, scaling is all about measurement and iteration. Python is built with profiling and optimization in mind. See Myth #6 for more details on how to vertically scale Python.

Myth #8: Python lacks good concurrency support

Occasionally, while I’m debunking performance and scaling myths, someone tries to get technical: “Python lacks concurrency,” or, “What about the GIL?” If dozens of counterexamples are insufficient to bolster one’s confidence in Python’s ability to scale vertically and horizontally, then an extended explanation of a CPython implementation detail probably won’t help, so I’ll keep it brief.

Python has great concurrency primitives, including generators, greenlets, Deferreds, and futures. Python has great concurrency frameworks, including eventlet, gevent, and Twisted. Python has had some amazing work put into customizing runtimes for concurrency, including Stackless and PyPy. All of these and more show that there is no shortage of engineers effectively and unapologetically using Python for concurrent programming. Also, all of these are officially supported and/or used in enterprise-level production environments. For examples, refer to Myth #7.
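
Two of those primitives, generators and futures, ship with the standard library, so they are easy to try. The sketch below is a toy illustration, not PayPal’s production architecture: round_robin is a hypothetical minimal scheduler that interleaves generators cooperatively, and square_all fans work out to OS threads with concurrent.futures:

```python
from concurrent.futures import ThreadPoolExecutor


def countdown(name, n):
    """Generator used as a cooperative task: each yield hands control back."""
    while n > 0:
        yield f"{name}:{n}"
        n -= 1


def round_robin(tasks):
    """Minimal cooperative scheduler that interleaves generators."""
    queue = list(tasks)
    trace = []
    while queue:
        task = queue.pop(0)
        try:
            trace.append(next(task))   # run the task until its next yield
            queue.append(task)         # then send it to the back of the line
        except StopIteration:
            pass                       # finished tasks simply drop out
    return trace


def square_all(values, workers=4):
    """Futures: run a function across a pool of OS threads, preserving order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda v: v * v, values))


if __name__ == "__main__":
    print(round_robin([countdown("a", 2), countdown("b", 1)]))
    print(square_all([1, 2, 3]))
```

Libraries such as gevent and Twisted build far more capable schedulers on the same cooperative idea, typically switching tasks on I/O rather than on an explicit yield.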

The Global Interpreter Lock, or GIL, is a performance optimization for most use cases of Python, and a development ease optimization for virtually all CPython code. The GIL makes it much easier to use OS threads or green threads (greenlets usually), and does not affect using multiple processes. For more information, see this great Q&A on the topic and this overview from the Python docs.

Here at PayPal, a typical service deployment entails multiple machines, with multiple processes, multiple threads, and a very large number of greenlets, amounting to a very robust and scalable concurrent environment (see figure below). In most enterprise environments, parties tend to prefer a fairly high degree of overprovisioning, for general prudence and disaster recovery. Nevertheless, in some cases Python services still see millions of requests per machine per day, handled with ease.

Sketch of a PayPal Python server worker

A rough sketch of a single worker within our coroutine-based asynchronous architecture. The outermost box is the process, the next level is threads, and within those threads are green threads. The OS handles preemption between threads, whereas I/O coroutines are cooperative.

Myth #9: Python programmers are scarce

There is some truth to this myth. There are not as many Python web developers as PHP or Java web developers. This is probably mostly due to a combined interaction of industry demand and education, though trends in education suggest that this may change.

That said, Python developers are far from scarce. There are millions worldwide, as evidenced by the dozens of Python conferences, tens of thousands of StackOverflow questions, and companies like YouTube, Bank of America, and LucasArts/Dreamworks employing Python developers by the hundreds and thousands. At eBay and PayPal we have hundreds of developers who use Python on a regular basis, so what’s the trick?

Well, why scavenge when one can create? Python is exceptionally easy to learn, and is a first programming language for children, university students, and professionals alike. At eBay, it only takes one week to show real results for a new Python programmer, and they often really start to shine as quickly as 2-3 months, all made possible by the Internet’s rich cache of interactive tutorials, books, documentation, and open-source codebases.

Another important factor to consider is that projects using Python simply do not require as many developers as other projects. As mentioned in Myth #6 and Myth #7, lean, effective teams like Instagram are a common trope in Python projects, and this has certainly been our experience at eBay and PayPal.

Myth #10: Python is not for big projects

Myth #7 discussed running Python projects at scale, but what about developing Python projects at scale? As mentioned in Myth #9, most Python projects tend not to be people-hungry. While Instagram reached hundreds of millions of hits a day at the time of their billion-dollar acquisition, the whole company was still only a group of a dozen or so people. Dropbox in 2011 only had 70 engineers, and other teams were similarly lean. So, can Python scale to large teams?

Bank of America actually has over 5,000 Python developers, with over 10 million lines of Python in one project alone. JP Morgan underwent a similar transformation. YouTube also has engineers in the thousands and lines of code in the millions. Big products and big teams use Python every day, and while it has excellent modularity and packaging characteristics, beyond a certain point much of the general development scaling advice stays the same. Tooling, strong conventions, and code review are what make big projects a manageable reality.

Luckily, Python starts with a good baseline on those fronts as well. We use PyFlakes and other tools to perform static analysis of Python code before it gets checked in, as well as adhering to PEP8, Python’s language-wide base style guide.

Finally, in addition to the scheduling speedups mentioned in Myth #6 and #7, projects using Python generally require fewer developers as well. Our most common success story starts with a Java or C++ project slated to take a team of 3-5 developers somewhere between 2 and 6 months, and ends with a single motivated developer completing the project in 2-6 weeks (or hours, for that matter).

A miracle for some, but a fact of modern development, and often a necessity for a competitive business.

A clean slate

Mythology can be a fun pastime. Discussions around these myths remain some of the most active and educational, both internally and externally, because implied in every myth is a recognition of Python’s strengths. Also, remember that the appearance of these seemingly tedious and troublesome concerns is a sign of steadily growing interest, and with a steady influx of interested parties comes the constant job of education. Here’s hoping that this post manages to extinguish a flame war and enable a project or two to talk about the real work that can be achieved with Python.

Keep an eye out for future posts where I’ll dive deeper into the details touched on in this overview. If you absolutely must have details before then, or have corrections or comments, shoot me an email at mahmoud@paypal.com. Until then, happy coding!

Building Data Science at Scale

As part of the Boston-based Engineering group, the Data Science team’s charter is to enable science-based personalization and recommendation for PayPal’s global users. As companies of all sizes start to leverage their data assets, data science has become indispensable in creating relevant user experiences. Helping fulfill PayPal’s mission to build the Web’s most convenient payment solution, the team works with various internal partners and strives to deliver best-in-class data science.

Technology Overview

At the backend of the data science platform reside large-scale machine learning engines that continuously learn and predict from transactional, behavioral, and other datasets. An example of a question we might try to answer is: if someone just purchased a piece of software, does that increase their likelihood of purchasing electronics in the near future? It is no wonder that the answer lies in the huge amounts of transaction data, in that what people bought in the past is predictive of what they might consider buying next.
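As a toy illustration of the software-then-electronics question, the sketch below compares how often users buy one category after another against the overall base rate. The data, the category names, and the function `purchase_rates` are invented for the example; it says nothing about the models the platform actually runs.

```python
def purchase_rates(histories, first, then):
    """Return (P(`then` bought after `first`), base rate of `then`).

    `histories` is a list of per-user purchase lists, ordered by time.
    """
    bought_first = 0
    bought_then_after = 0
    for history in histories:
        if first in history:
            bought_first += 1
            # Only purchases made after the first `first` purchase count.
            if then in history[history.index(first) + 1:]:
                bought_then_after += 1
    if bought_first == 0:
        raise ValueError("no user purchased %r" % first)
    base_rate = sum(1 for h in histories if then in h) / len(histories)
    return bought_then_after / bought_first, base_rate


histories = [
    ["software", "electronics"],
    ["software", "books"],
    ["books", "electronics"],
    ["software", "electronics", "books"],
    ["books"],
    ["clothing"],
]
conditional, base = purchase_rates(histories, "software", "electronics")
print(round(conditional, 3), base)  # 0.667 0.5
```

In this made-up sample, software buyers go on to buy electronics more often than the average user does, which is the kind of signal a predictive model can pick up from past transactions.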

We leverage state-of-the-art machine learning technologies to make such predictions. Machine learning itself has been quickly evolving in recent years. New advances including large-scale matrix factorization, probabilistic behavioral models, and deep learning are no strangers to the team. To make things work at large scale, we leverage Apache Hadoop and its rich ecosystem of tools to process large amounts of data and build data pipelines that are part of the data science platform.

Tackling Data Science

Companies take different approaches in tackling data science. While some companies define a data scientist as someone who performs statistical modeling, we at PayPal Engineering have chosen to take a combined science & engineering approach. Our data scientists are “analytically-minded, statistically and mathematically sophisticated data engineers” [1]. Some individuals on the team are of course more science-inclined and others more engineering-inclined, but there is much more of a blend of expertise than a marked distinction between them. This approach to data science allows us to quickly iterate and operationalize high-performing predictive models at scale.

The Venn diagram below, which bears similarity to Conway’s diagram [2], displays the three cornerstones pivotal to the success of the team. Science, which entails machine learning, statistics and analytics, is the methodology by which we generate actionable predictions and insights from very large datasets. The Engineering component, which includes Apache Hadoop and Spark, makes it possible for us to do science and analytics at scale and deliver results with quick turnaround. Last but not least, I cannot overemphasize the importance of understanding the business for a data scientist. None of the best work in this area that I know of is done in isolation. It is through understanding the problem domain that a data scientist may come up with a better predictive model, among other results.

DS-VennDiagram

Scaling Data Science

There are multiple dimensions to scaling data science, which at a minimum involve the team, the technology infrastructure, and the operating process. While each of these is worth thoughtful discussion in its own right, I will focus on the operating process since it is critical to how value is delivered. A typical process for the team could break down into these steps:

  • Identify business use case;
  • Understand business logic and KPIs;
  • Identify and capture datasets;
  • Proof of concept and back-test;
  • Operationalize predictive models;
  • Measure lift, optimize and iterate.
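The back-test and lift-measurement steps in the list above can be sketched in a few lines. This is a simplified illustration, assuming the scores come from a model trained only on data that predates the held-out period and that each outcome is 1 if the user converted during that period; the function `lift_at_top` and the numbers are hypothetical.

```python
def lift_at_top(scores, outcomes, fraction=0.2):
    """Conversion rate among the top-scored users divided by the base rate."""
    if len(scores) != len(outcomes) or not scores:
        raise ValueError("scores and outcomes must be non-empty and aligned")
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:k]) / k
    base_rate = sum(outcomes) / len(outcomes)
    if base_rate == 0:
        raise ValueError("no conversions in the held-out period")
    return top_rate / base_rate


# Model scores for ten users and what they actually did in the held-out period.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.15, 0.1, 0.05, 0.01]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(round(lift_at_top(scores, outcomes), 2))  # 3.33
```

A lift above 1 means that targeting the highest-scored users beats picking users at random; tracking the same number after launch is one way to close the "optimize and iterate" loop.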

Let me explain some of the key elements. First, we see scaling data science as an ongoing collaboration with Product and Business teams. Understanding product and business KPIs and using data science to optimize them is an essential ingredient of our day-to-day work. Second, we follow best practice in data science, whereby each predictive model is fully back-tested before operationalization. This practice guarantees the effectiveness of the data science platform. Third, the most powerful predictive modeling often requires iterative measurement and optimization. As a concrete example, putting this process into practice along with PayPal Media Network, we were able to achieve excellent results based on:

  • Lookalike modeling: Merchants can reach consumers who look like the merchant’s best existing customers.
  • Purchase intent modeling: Merchants can engage consumers who have a propensity to spend within specific categories.

While it is challenging to crunch the data and create tangible value, it is interesting and rewarding work. I hope to discuss more details of all the fun things we do as data scientists in the future.

References:

[1] http://www.forbes.com/sites/danwoods/2011/10/11/emc-greenplums-steven-hillion-on-what-is-a-data-scientist

[2] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram