Tag Archives: PayPal

Acceptance of FIDO 2.0 Specifications by the W3C accelerates the movement to end passwords


What if we could eliminate, or at least significantly mitigate, the risk of passwords? On February 17, 2016, the World Wide Web Consortium (W3C) announced the creation of the Web Authentication Working Group to move the world closer to this goal. The mission of the group is to define a client-side API that provides strong authentication functionality to web applications. The group's technical work will be accelerated by the acceptance of the FIDO 2.0 Web APIs. This specification, whose co-authors include PayPal's Hubert Le Van Gong and Jeff Hodges, helps simplify and improve the security of authentication. As the steward of the Web platform, the W3C is uniquely positioned to focus the attention of Web infrastructure providers and developers on the shortcomings of passwords and the necessity of their replacement.

The FIDO 2.0 protocol employs public-key cryptography, relying on users' devices to generate key pairs during a registration process. The user's device retains the generated private key and delivers the public key to the service provider. The service provider retains this key, associates it with a user's account, and, when a login request is received, issues a challenge that must be signed by the private key holder as a response.

When challenged, the FIDO implementation stack signals the user to authenticate using the mechanism employed at registration time. This might be via PIN, biometric reader, or an alternative modality. A local comparison of the current authentication request is made to the stored registration value. A successful match unlocks the associated private key; the challenge is signed and returned to the service provider.
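To make the flow concrete, here is a minimal Python sketch of the register/challenge/verify pattern described above, using generic elliptic-curve keys from the cryptography package. It illustrates the pattern only; it is not the FIDO 2.0 message format, and the Authenticator and RelyingParty classes are hypothetical stand-ins for the user's device and the service provider.

# Minimal sketch of the register/challenge/verify pattern described above.
# Illustrative only -- not the FIDO 2.0 wire format. Assumes the
# `cryptography` package; Authenticator/RelyingParty are hypothetical names.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.exceptions import InvalidSignature


class Authenticator:
    """Stands in for the user's device: holds the private key."""
    def __init__(self):
        self._private_key = ec.generate_private_key(ec.SECP256R1())

    def public_key(self):
        return self._private_key.public_key()

    def sign(self, challenge: bytes) -> bytes:
        # In a real authenticator this only happens after a local
        # PIN/biometric check unlocks the private key.
        return self._private_key.sign(challenge, ec.ECDSA(hashes.SHA256()))


class RelyingParty:
    """Stands in for the service provider: stores only public keys."""
    def __init__(self):
        self._registered = {}   # account id -> public key

    def register(self, account_id: str, public_key):
        self._registered[account_id] = public_key

    def new_challenge(self) -> bytes:
        return os.urandom(32)

    def verify(self, account_id: str, challenge: bytes, signature: bytes) -> bool:
        try:
            self._registered[account_id].verify(
                signature, challenge, ec.ECDSA(hashes.SHA256()))
            return True
        except InvalidSignature:
            return False


device, service = Authenticator(), RelyingParty()
service.register("alice", device.public_key())   # registration
challenge = service.new_challenge()               # login request
assert service.verify("alice", challenge, device.sign(challenge))

Note that the service provider never stores anything a fraudster could replay as a credential: only public keys and one-time challenges cross the wire.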

This approach dramatically alters the economics of attacks on service providers and their password stores. For each service provider that a user interacts with, a unique private/public key pair is generated. Not only does this prevent service providers from using protocol artifacts to collude in ways the user would not want, it also renders the public key store of little to no value to fraudsters. Attacks at scale through exfiltration of passwords are no longer a viable means of generating revenue – the ultimate goal of fraudsters.

Early versions of the FIDO protocols, UAF and U2F, were developed and deployed by PayPal and others. Much was learned through that process, and the FIDO 2.0 specifications are designed to bring together the best features of U2F and UAF. With the contribution of the FIDO 2.0 APIs to the W3C, they will be incorporated into the Web platform, enabling authoring tools, hosting providers, and others to interoperate with a broad range of devices that will support the FIDO 2.0 protocols.

At PayPal, we are committed to a more secure and privacy-respecting web experience for all internet users. We realize that an easy-to-use, secure, and privacy-respecting means of authentication benefits everyone and having the same protections regardless of the site enhances the overall security of the Web. We look forward to actively participating in the W3C Web Authentication Working Group to continue our pursuit of ubiquitous, simple, secure, and privacy-respecting authentication on the Web.

 

Secure Authentication Proposal Accepted by W3C


Today the World Wide Web Consortium (W3C) accepted a submission of proposed technical work from W3C members PayPal, Google, Microsoft, and Nok Nok Labs. This submission consists of three initial draft specifications developed by the FIDO Alliance to facilitate browser support for replacing passwords as a means of authentication on the Web with something more secure. It is expected that the W3C will take these draft documents as a starting point and, through its standard process, evaluate, enhance, and publish them as W3C Recommendations. The goal is for the final specification to be implemented by Web browsers. With a common framework available in all browsers, Web developers will be able to rely on a secure, easy-to-use, and privacy-respecting mechanism for passwordless authentication.

The catalyst for this work is that the username/password paradigm for authentication has well-known issues (see the references below) that have been exacerbated by its widespread use across websites. Millions of users of various companies across the world have been subjected to account takeovers, fraud, and identity theft as a direct result. While more secure methods of authentication are available, they have proven too expensive and/or too difficult to use to garner widespread adoption. The members of the FIDO Alliance recognized the need for an authentication paradigm shift and have developed a framework and specifications to support eliminating passwords.

From the outset, the FIDO Alliance recognized that significant, multistakeholder support would be required in order to effect Internet-scale change. The organization worked diligently to convince relying parties, technology vendors, and hardware manufacturers of the need to work cooperatively to address the challenge of replacing passwords. Today the FIDO Alliance includes 250 members and, with today's acceptance by the W3C, the organization is delivering on its promise to enable platforms with open, free-to-use specifications for passwordless authentication.

The journey is far from over, but the development of the specifications and their acceptance by the W3C are important steps toward improved, easy-to-use, secure authentication. This is yet another example of how we continually strive to improve security not just for our own customers, but for all users of the Web.

 

References:
http://www.darkreading.com/stolen-passwords-used-in-most-data-breaches/d/d-id/1204615

http://www.verizonenterprise.com/DBIR/2015/

http://blog.trendmicro.com/trendlabs-security-intelligence/password-insecurity-revisited/

 

Recycle, Reuse, Reharm: How hackers use variants of known malware to victimize companies and what PayPal is doing to eradicate that capability


“No need to reinvent the wheel.” We've heard it. We've used it. Is it a mark of laziness, or of leaving something as is because it's effective? In the case of malware, it's most certainly the latter. Several high-profile hacks have received extensive news coverage in recent years. Seeing these attacks happen repeatedly leads cybersecurity experts to look for common threads in attack vectors and execution modes. Through years of data analysis, one trend is clear: while attacks vary, many elements of the malware source code are identical and are successfully reused. Below are some examples of attacks you may recognize but may not have known were related to previously known attack vectors that are still actively exploited:

  • 2011: The cyber espionage attacks launched on RSA and Lockheed Martin, resulting in grave financial and operational damage, used a variant of Poison Ivy, a well-known piece of malware in use since 2006. To date, the cybersecurity community is still fighting its derivatives.
  • 2013 and 2014: The infamous Target and Home Depot hacks were both carried out using variants of BlackPoS, malware installed on point-of-sale systems and designed to glean data from cards when swiped at the register.
  • 2014: The Sony hack was executed using Destover, malware heavily reliant on code taken from Shamoon (used in the 2012 attack on Saudi Aramco) and DarkSeoul (used in the 2013 attack on South Korean banks and TV broadcasters).

How does it work?

There are three aspects of malware reuse that we’ve identified as particularly dangerous:

  • Code Reuse: Incorporate code from one malware into another
  • Semantic Equivalence: Make changes to the code but preserve the functions
  • Different Path, Same Target: Change/add functions to evade heuristics

To demonstrate the process malware developers use, we'll pull an example from BlackPoS. Using code from a January 2015 blog post by Nick Hoffman, we can illustrate how reuse of existing techniques enabled the developers of BlackPoS to create a new variant and continue attacking just a few months after the malware was discovered attacking Target. The process consists of at least three steps:

  • Once it is downloaded, it runs the CreateToolhelp32Snapshot API call, followed by Process32First, in order to discover all the running processes on the computer. This is the exact process used by the Zeus “SpyEye” variant, the PoS malware Alina, and the PoS malware Backoff. The original BlackPoS used a different API call, EnumProcesses, but the variant of BlackPoS discovered in August 2014 switched to this one. It is quite possible the technique was copied from Alina or Backoff.
  • Using the OpenProcess and ReadProcessMemory calls, the malware then reads the memory of processes that do not appear on its whitelist, searching for credit card track data. This technique appeared in the original BlackPoS variant, as well as in Zeus “SpyEye” and Alina. Again, it is likely that it was copied from one of them.
  • The last step searches for certain identifiers of track 1 and track 2 data in the data of the processes read with ReadProcessMemory. The same step is seen in all PoS RAM scraping malware, such as Backoff.


It's that easy. This malware does not include C&C communication, stealth, or persistence capabilities. All it needs in order to create these capabilities is to copy code and techniques from similar malware that's already out there. This process is at the base of all malware families, from PoS to ICS.

Cool story. What is PayPal doing about it?

At PayPal's Center of Excellence in Be'er Sheva, Israel, we approach malware detection with the mentality that by identifying similar elements embedded in the code of existing malware, like those illustrated above, we can predict attributes of future malware and thwart it before it makes it into the wild. Our goal is to render the reuse of existing malware ineffective and minimize scalable attacks. By forcing the total recreation of malware each time an attack is launched, we're lengthening the time between attacks and dramatically increasing the cost of creating a vector.

Our predictive engine creates variants from existing malicious binary samples via an evolutionary algorithm. The results are tested in simulated environments and evaluated, thus supporting a machine learning training process for malware detectors, enabling them to locate hundreds of thousands of future malware variants in the wild, and head off future attacks.
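To give a feel for the idea, here is a heavily simplified, purely illustrative Python sketch of an evolutionary loop that mutates a known sample and folds any variants that slip past a toy signature detector back into its training set. Every piece of it (the byte-flip mutation, the n-gram detector, the thresholds) is a hypothetical stand-in, not the actual engine, which operates on far richer representations of malware behavior.

# Purely illustrative sketch of an evolutionary variant generator used to
# harden a toy signature-based detector. The mutation operator, the detector,
# and the fitness check are hypothetical stand-ins, not the real engine.
import random


def ngrams(data: bytes, n: int = 4):
    return {data[i:i + n] for i in range(len(data) - n + 1)}


class SignatureDetector:
    """Toy detector: flags a sample if enough of its n-grams are known."""
    def __init__(self, match_fraction: float = 0.9):
        self.signatures = set()
        self.match_fraction = match_fraction

    def train(self, sample: bytes):
        self.signatures |= ngrams(sample)

    def detects(self, sample: bytes) -> bool:
        grams = ngrams(sample)
        if not grams:
            return False
        return len(grams & self.signatures) / len(grams) >= self.match_fraction


def mutate(sample: bytes, rate: float = 0.05) -> bytes:
    """Flip a small fraction of bytes to produce a candidate variant."""
    data = bytearray(sample)
    for i in range(len(data)):
        if random.random() < rate:
            data[i] = random.randrange(256)
    return bytes(data)


def harden(detector: SignatureDetector, seed: bytes,
           generations: int = 20, population: int = 50):
    survivors = [seed]
    for _ in range(generations):
        candidates = [mutate(random.choice(survivors)) for _ in range(population)]
        evaders = [c for c in candidates if not detector.detects(c)]
        if not evaders:
            break                     # detector already covers this family
        for variant in evaders:       # fold the evading variants back in
            detector.train(variant)
        survivors = evaders
    return detector


detector = SignatureDetector()
known_sample = bytes(random.randrange(256) for _ in range(1024))
detector.train(known_sample)
harden(detector, known_sample)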

PayPal's Shlomi Boutnaru and Liran Tancman have been researching this method since 2012. It became a reality in 2013 when they launched CyActive, the cybersecurity company that PayPal acquired in 2015. Integration of the CyActive predictive detection model into the existing PayPal infrastructure is currently in progress and is expected to be a game changer in the risk management arena. Shlomi's January 2015 article on TechCrunch highlighted the dangers of malware reuse, coupled with human error, as key tactics expected to be used by hackers in 2015. We're doing everything we can to ensure the threat behind this prediction is eradicated in the foreseeable future.

PayPal Sponsors First of Its Kind Intel Capture the Flag Contest at DEFCON 23


DEFCON routinely presents the coolest and most thought-provoking topics in the hacking community, and this year did not disappoint, partially due to the first PayPal-sponsored Intel Capture the Flag (CTF) virtual manhunt contest. IntelCTF events challenge players to use their open source intelligence (OSINT) forensic skills to identify malicious actors intent on Internet mayhem. Players find strategically placed “flags” that are planted across the Internet as breadcrumbs, allowing them to solve the e-case of whodunit by simply connecting the virtual dots.

This contest (rated Beginner/Intermediate), the first of several scheduled for release in the near future, tasked participants with identifying an actor who defaced a rather “popular” webpage. Teams could use any means necessary to track and identify the perpetrator. The contest touted 17 flags of increasing difficulty and asked questions such as when certain posts were made, how a website was hacked, who owned the proxy service the actor was using, and finally, the defacer's real identity.

Various members of the PayPal Information Security team in Scottsdale, Arizona partnered with several alpha/beta testers to run the six-hour, 24-team event. The event started a bit slowly as players adjusted to the gaming style, but at around the 30-minute mark the competition heated up! The Attribution-Team and Killjoys were neck and neck for flags 11 and 12, and the Attribution-Team ended up capturing the 13th flag before Killjoys in the final hour. After six long hours of competition, the scoring engine was shut down and the event concluded. The Attribution-Team submitted a write-up detailing their investigative process, which the IntelCTF team reviewed to confirm that no cheating or flag brute-forcing had occurred. The IntelCTF team then confirmed the Attribution-Team as the winner, having captured thirteen of seventeen flags before the others and earning the $500 USD prize. Below are the top five contenders:

  1. Attribution-Team – 13 flags and $500 USD prize
  2. Killjoys – 13 flags
  3. BAMFBadgers – 12 flags
  4. I tried… but failed – 12 flags
  5. StenoPlasma – 11 flags


This event was different from past contests because IntelCTF is the first of its kind! Traditional CTFs focus more on web app pentesting, reverse engineering, forensics, and programming challenges, whereas IntelCTF immerses participants in simulated scenarios of tracking down information about the tools, techniques, capability, and identity of malicious individuals. It generated a lot of interest and excitement, and feedback from participants was very positive. Not only did participants thoroughly enjoy the immersive challenge, they also asked when and where IntelCTF would run its next event.

Think you have what it takes to capture the flag? Check out these upcoming IntelCTF challenges and keep your eye out for more PayPal-sponsored information security events.

 

Implementing a Fast and Lightweight Geo Lookup Service

By

Fast Geo lookup of city and zip code for given latitude/longitude

Problem description

Geo lookup is a commonly required feature used by many websites and applications. Most smartphone applications can send latitude and longitude to server applications. Then the server applications use the latitude and longitude to perform Geo lookup.

Geo lookup falls into two categories:

  1. For a given latitude and longitude, retrieve full postal address including street, city, zip code.
  2. For a given latitude and longitude, retrieve nearest city with zip code.

The overwhelming majority of websites and applications require only city and zip code. The scope of this post is server applications that need to retrieve the nearest city and zip code for a given latitude and longitude.

This document describes a lightweight, in-memory, fast lookup of zip codes for a given latitude/longitude. It explains in detail how the inherent nature of the zip code data source can be exploited to generate a cache through pre-processing. The generated cache can then be served from an in-memory system to accomplish a very fast Geo lookup. Developers and product managers can implement it using the technique described in this post.

Illustration and a reference implementation are provided for the US only. However, the approach can be extended internationally.

High level Geo Lookup process

Prerequisite

The data source has one record per zip code. Each record has the following elements: zip code and its corresponding city, state, latitude, and longitude.

Use case

  • Client apps such as smartphone apps send the user's latitude and longitude.
  • The web service reads the latitude and longitude, and its Geo lookup implementation retrieves the nearest zip code for the user's latitude/longitude.

This article describes how to implement and set up a fast Geo lookup service without using a paid data source or paid software.

Third party data and software providers

There are many providers who sell data and software; the software can be either a library or a hosted solution. However, there are some free data sources as well; a few are listed in the references at the end of this post.

Sample records from csv

  • US|94022|Los Altos|California|CA|Santa Clara|085|37.3814|-122.1258|
  • US|94023|Los Altos|California|CA|Santa Clara|085|37.1894|-121.7053|
  • US|94024|Los Altos|California|CA|Santa Clara|085|37.3547|-122.0862|
  • US|94035|Mountain View|California|CA|Santa Clara|085|37.1894|-121.7053|
  • US|94039|Mountain View|California|CA|Santa Clara|085|37.1894|-121.7053|
  • US|94040|Mountain View|California|CA|Santa Clara|085|37.3855|-122.088|

Geo look up algorithms

These data sources typically have around 49,000 records for the US and are available in CSV format. Each record contains latitude, longitude, state, city, and zip code. There is one record per zip code.

The Geo lookup process typically involves the following steps:

  1. User request is received with latitude and longitude by the web service or application.
  2. Query the data source and retrieve the nearest zip code calculated using nearest Geo distance algorithm.

Nearest point calculation involves selecting a set of points (within roughly an x km radius) from the CityZipLatLong database around the given latitude/longitude and determining the point with the minimum distance among the selected points. For the US, assuming a 20 km radius is an optimal choice and 20+ points are selected for the minimum distance calculation, this might take 50+ milliseconds.

The geographical distance between two latitude/longitude points is explained here: http://en.wikipedia.org/wiki/Geographical_distance. You can implement this algorithm yourself or use open source software that supports Geo point data; Elasticsearch is one such open source server that can be used for this purpose.

  1. Implement the distance algorithm yourself and scan for the nearest point (see the sketch after this list).
  2. Ingest the zip code data into a SQL database and implement a SQL client to retrieve the nearest point.
  3. Ingest the data into an Elasticsearch server and use a Geo point query to retrieve the nearest point.
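For option 1, a minimal Python sketch of a brute-force nearest-zip search using the haversine formula is shown below. It assumes a hypothetical zip_records list of (zip, city, latitude, longitude) tuples parsed from the CSV, and a simple spherical Earth model.

# Minimal brute-force nearest-zip lookup using the haversine formula.
# Assumes zip_records is a list of (zip_code, city, lat, lon) tuples parsed
# from the CSV data source; a spherical Earth (radius 6371 km) is assumed.
from math import radians, sin, cos, asin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def nearest_zip(lat, lon, zip_records, radius_km=20.0):
    """Return the zip record closest to (lat, lon) within radius_km, or None."""
    best, best_dist = None, radius_km
    for record in zip_records:
        _zip, _city, rec_lat, rec_lon = record
        d = haversine_km(lat, lon, rec_lat, rec_lon)
        if d <= best_dist:
            best, best_dist = record, d
    return best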

Typical industry implementations

Typical implementations use either paid or in-house software. These implementations use on-demand caching, with the caching done per latitude/longitude. However, on-demand caching has a few disadvantages:

  • It requires calling a service to retrieve the nearest point for every request with a new latitude/longitude.
  • It requires provisioning software (a service) for new points and making on-demand requests, and this service and software must be maintained.
  • The cache size may end up being huge, since latitude/longitude values have 4-digit precision. For example, the number of points in the US would be roughly 250,000 valid latitudes (between 24.0000 and 49.9999) times 500,000 valid longitudes (between -122.0000 and -70.0000), equaling 125 billion.

Alternative perspective of data and requirement

The goal of this approach is to enable a simple and ultra-fast Geo lookup service that does not call a Geo distance calculation service at request time, without affecting the accuracy of the nearest point calculation. This implies that we need to create the cache through pre-processing. In addition, the cache size should be small.

In effect, the implementation should involve the following steps:

  1. One-time pre-processing: generate the Geo lookup cache for all possible latitudes and longitudes. The cache size should be no more than a few GB for the whole world.
  2. In memory lookup during user request.

Data and requirement analysis

  • For the US, we have 49,000 records (one per zip code).
  • The total US area is 9.8 million square km. The typical 95th-percentile distance between any two adjacent zip codes is more than 8 km.
  • Latitude for the US ranges from 24 to 50, while longitude ranges from -124 to -65.
  • A 0.01 difference in latitude is around 1.1 km, and a 0.01 difference in longitude is between roughly 0.7 and 1 km across US latitudes.
    Given this, the percentage of points among all possible latitudes/longitudes in the US that may spill over to an adjacent zip code should be less than 0.5 percent. Given the usage context, two-decimal precision should be acceptable for most websites and applications.
  • With two-decimal precision being good enough, we can determine the worst-case count of latitude/longitude values for the contiguous US states.
  • Two-decimal precision results in about 15 million (2,600 × 6,000) possible latitude/longitude values. The nearest zip code can be calculated for these 15 million latitude/longitude points through one-time pre-processing.

Cache Data storage and optimization

The process of generating the cache involves the following:

  • For each possible 2-decimal-precision latitude/longitude and radius boundary, request the nearest zip code. Given the vast area of the US, with a 20 km radius only about 7 million latitude/longitude points have a nearest city/zip code.
  • The radius boundary can start at 10 km, and if no nearest zip code is found we progressively increase the radius (say to 20 km, 30 km, 40 km) until one is found. Using this approach, we can map every possible latitude/longitude in the US to a nearest zip code. (See Fig. 1 for illustration.) For places like New York City, 10 km is enough to find the nearest zip code; for many locations in Nevada, a 40 km radius is required.
  • With only 49,000 points available in the US data source, an average of 250 latitude/longitude points in the cache will share the same zip code. When this cache is extended to all countries in the world, it becomes necessary to reduce the cache size.
  • One technique is to store one cache record instead of 10 for 10 successive latitude or longitude points that share the same nearest zip code. For example, for a given latitude, if longitude values from -89.16 through -89.24 all have the same nearest zip code, we need to store only the latitude and -89.20. In the fast lookup implementation, for a user request with, say, -89.19, the values for both -89.19 and -89.20 are fetched; if -89.19 is not found in the cache, the value for -89.20 is used.
    • Example illustration
      • Data set 1: Calculated nearest zip code for given latitude (37.12) and different longitudes (-89.16….-89.24) = 12346
      • Data set 2: Calculated nearest zip code for given latitudes (42.12) and different longitudes (-67.46….-67.54) = 52318
      • Instead of storing 10 keyValue entries for each of the above data set, we can store just one keyValue entry per data set. These will be
        • <(37.12:-89.20), 12346> and <(42.12:-67.50), 52318>
        • Good example for these data set 1 and 2 will be locations in Nevada, Arizona, etc. (See territory map below in Fig 1.)
  • Instead of storing latitude/longitude as floats, multiply the values by 100 and store them as integers. As an additional optimization, latitude and longitude can be packed into a single 32-bit integer, with latitude as the 16-bit MSB and longitude as the 16-bit LSB.


Figure 1: US territory map from fema.gov

Pseudo code

Pseudo code for cache generation

  • For each country/region, determine the latitude/longitude boundary. For example, the US boundaries are a latitude range of 24 to 49 and a longitude range of -122 to -69.
    • For each 2-decimal precision latitude/ longitude
      • Set radius = 10Km
      • While (nearest zip code not found AND radius < 50 km)
        • Find nearest zip code
        • If (nearest zip code) found
          • Add latitude/ longitude, zip code data as (K,V) in look up cache.
          • break
        • If not found, set radius = radius + 10

In the above steps, latitude/longitude values with the same zip code can be aggregated, resulting in fewer cache entries. The cache entry record has two fields, stored as a (k, v) pair. Latitude and longitude together form the key. For storage optimization, latitude and longitude are packed into a 32-bit integer, with latitude in the 16-bit MSB and longitude in the 16-bit LSB; both are stored as integers at 100 times their values. Since a 16-bit signed integer can hold -32768 to +32767, the full range of scaled latitude and longitude values (-18000 to +18000) can be stored. The value v in (k, v) holds the sequence ID of the zip data.
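Here is a short Python sketch of that cache-generation pass. It is illustrative only: the coordinate ranges mirror the pseudo code, pack_key implements the 32-bit key described above, and nearest_zip_fn stands in for any (lat, lon, radius) -> zip-sequence-id lookup, for example one built on the haversine sketch earlier.

# Sketch of the cache-generation pass described in the pseudo code above.
# nearest_zip_fn is any callable (lat, lon, radius_km) -> zip sequence id or
# None; the coordinate ranges and step sizes are illustrative assumptions.
def pack_key(lat: float, lon: float) -> int:
    """Pack a 2-decimal lat/lon pair into one 32-bit integer key."""
    lat_i = int(round(lat * 100)) & 0xFFFF      # latitude in the 16-bit MSB
    lon_i = int(round(lon * 100)) & 0xFFFF      # longitude in the 16-bit LSB
    return (lat_i << 16) | lon_i


def build_cache(nearest_zip_fn, lat_range=(24.0, 50.0), lon_range=(-124.0, -65.0)):
    cache = {}
    lat_steps = int((lat_range[1] - lat_range[0]) * 100)
    lon_steps = int((lon_range[1] - lon_range[0]) * 100)
    for i in range(lat_steps + 1):
        lat = round(lat_range[0] + i / 100.0, 2)
        for j in range(lon_steps + 1):
            lon = round(lon_range[0] + j / 100.0, 2)
            radius = 10
            while radius < 50:                      # expand 10, 20, 30, 40 km
                zip_seq_id = nearest_zip_fn(lat, lon, radius)
                if zip_seq_id is not None:
                    cache[pack_key(lat, lon)] = zip_seq_id
                    break
                radius += 10
    return cache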

Pseudo code for zip look up for user request

This is typically executed in a web service.

On server start up

Use an in-memory key-value map to load the lookup cache. Alternative options are services such as Couchbase or memcached.

User request processing

  1. Receive user request with latitude/ longitude
  2. Round the latitude/longitude to 2-decimal precision and multiply each by 100.
    1. Form the 32-bit key: (latitude << 16) | (longitude & 0xFFFF)
    2. Note: latitude is stored in the 16-bit MSB and longitude in the 16-bit LSB. Shift the latitude left by 16 bits and bitwise OR it with the masked longitude.
  3. Look up in GeoLookupCache to retrieve the zip code.
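A minimal Python sketch of this request-time lookup, including an optional fallback for the aggregation optimization described earlier, might look like the following; the geo_lookup_cache dict is assumed to be the pre-built cache loaded at startup.

# Request-time lookup sketch: pack the rounded lat/lon into the 32-bit key
# and look it up in the pre-built in-memory cache (a plain dict here).
def pack_key(lat: float, lon: float) -> int:
    lat_i = int(round(lat * 100)) & 0xFFFF      # latitude in the 16-bit MSB
    lon_i = int(round(lon * 100)) & 0xFFFF      # longitude in the 16-bit LSB
    return (lat_i << 16) | lon_i


def lookup_zip(lat: float, lon: float, geo_lookup_cache: dict):
    """Return the cached zip sequence id for the user's lat/lon, if present."""
    key = pack_key(round(lat, 2), round(lon, 2))
    if key in geo_lookup_cache:
        return geo_lookup_cache[key]
    # Fallback for the aggregation optimization described earlier: retry with
    # the longitude rounded to one decimal (e.g. -89.19 -> -89.20).
    return geo_lookup_cache.get(pack_key(round(lat, 2), round(lon, 1)))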

Reference implementation for Fast Geo lookup for the US

Currently, FastGeoLookup is implemented only for the US. This is a preliminary implementation, provided only as a reference, and it is not tested.

https://github.com/Vish-Ram/GeoLookup

References

Data sources

  1. GeoNames.org: Refer to compliance requirements if data from this web site is used.
  2. OpenGeocode: Refer to compliance requirements if data from this web site is used.
  3. http://en.wikipedia.org/wiki/Geographical_distance.
  4. US territory map from fema.gov.: https://www.fema.gov/status-map-change-requests/status-map-change-requests/status-map-change-requests/status-map-change

Template specialization with Krakenjs Apps


Template specialization is a mechanism to dynamically switch partials in your webpage at render time, based on context information. Very common scenarios where you would need this are when you want different flavors of the same page for:

  • Country specific customization
  • A/B testing different designs
  • Adapting to various devices

PayPal runs into the above cases quite often. Most web applications at PayPal use our open-sourced framework krakenjs on top of express/node.js. So, the mechanism for template specialization was very much inspired by the config-driven approach in kraken apps. If you are not familiar with krakenjs, I'd recommend you take a quick peek at the krakenjs github repo and/or generate a kraken app to understand what I mean by ‘config driven approach’. My example will be using dust templates to demonstrate the feature.

The main problems to solve were:

  1. A way to specify template maps for a set of context rules. I ended up writing a simple rule parsing module, Karka, which, when provided a json-based rule spec, and the context information at run-time, will resolve a partial to another when the rules match.
  2. A way to integrate the above mentioned rules into the page render workflow in the app, so that the view engine will give a chance to switch the partial in the page if a set of rules match.

After some experiments, I arrived at what I call the 3 step recipe to including specialization in the render workflow. The recipe should be applicable to any view engine (with or without kraken) in express applications (Check out my talk at JSConf 2014 on how to approach specialization without kraken)

  1. Intercept the render workflow by adding a wrapper for the view engine.
  2. Using karka (or any rule parser of your choice) generate the template map at request time using the context information, then stash it back into the context.
  3. Using the hook in your templating engine (which will let you know whenever it encounters partials and gives you an opportunity to do any customization before rendering it ), switch the template if a mapping is found.

If you are using kraken + dustjs, the above recipe has already been implemented and available ready-made for use. With some simple wiring you can see it working.

Let's see how to wire it up into an app using kraken@1.0 with the following super simple example.

  1. Generate a kraken 1.0 app (The example below uses generator-kraken@1.1.1)
  2. Add a simple karka rule spec into the file ‘config/specialization.json’ in the generated app.
{
     "whoami": [
         {
             "is": "jekyll",
             "when": {
                 "guy.is": "good",
                 "guy.looks": "respectable",
                 "guy.known.for": "philanthropy"
             }
         },
         {
             "is": "hyde",
             "when": {
                 "guy.is": "evil",
                 "guy.looks": "pre-human",
                 "guy.known.for": "violence"
             }
         }
     ]
}

Interpreting the rule spec:  ‘whoami’  will be replaced by ‘jekyll’  or ‘hyde’ when the corresponding ‘when’ clause is satisfied.

3. Wire the specialization rule-spec to be read by the view engine, by adding the following line into config/config.json file.

"specialization": "import:./specialization"

4. Change your ‘public/templates/index.dust’ to look like the following:

{>"layouts/master" /}

{<body}
 <h1>{@pre type="content" key="greeting"/}</h1>
 {>"whoami" /}
{/body}

Add  ‘public/templates/whoami.dust’ , ‘public/templates/jekyll.dust’ , ‘public/templates/hyde.dust’

{! This is whoami.dust !}
<div>
Who AM I ?????
</div>
{! This is jekyll.dust !}
<div>
 I am the good one
</div>
{! This is hyde.dust !}
<div>
 I am the evil one
</div>

5.  You want your request context to have the following to be able to see the specialization of ‘whoami’ to ‘jekyll’ :

{
    .....
    .....
    guy : {
        is: 'good',
        looks: 'respectable',
        known: {
            for: 'philanthropy'
        }
    }
    ..... 
    .....  
}

So let's try setting this context information in our sample app in controllers/index.js.

 router.get('/', function (req, res) {
     model.guy = {
         is: 'good',
         looks: 'respectable',
         known: {
             for: 'philanthropy'
         }
     };
     res.render('index', model);
 });

What I am doing above is leveraging the model that I pass to express ‘res.render’ to set the context information to see specialization for jekyll.dust. Internally express merges res.locals and the model data to pass to the view engine. So you can instead set the values in ‘res.locals’ as well. You can also try to set the rules for ‘hyde’ in model.guy above to see specialization to hyde.dust.

Now you are ready to see it working. Open a terminal window and start the app.

$cd path/to/app
$node .

Hit ‘http://localhost:8000’ on your browser and you will see the specialized partial, per the context information. Here is the sample I created, with the exact same steps above.

The above example does not demonstrate more practical things like styling partials differently, or specializing while doing a client side render. A more comprehensive example here.

PayPal products span multiple devices and hundreds of locales and hence specialization could be the way to solve customization requirements in the views cleanly.

Deep Learning on Hadoop 2.0


The Data Science team in Boston is working on leveraging cutting-edge tools and algorithms to optimize business actions based on insights hidden in user data. Data science heavily exploits machine learning algorithms that can help us identify and exploit patterns in the data. Obtaining insights from Internet-scale data is a challenging undertaking; hence, being able to run the algorithms at scale is a crucial requirement. With the explosion of data and the accompanying thousand-machine clusters, we need to adapt the algorithms to operate in such distributed environments. Running machine learning algorithms in a general-purpose distributed computing environment poses its own set of challenges.

Here we discuss how we have implemented and deployed Deep Learning, a cutting edge machine-learning framework, in one of the Hadoop clusters. We provide the details on how the algorithm was adapted to run in a distributed setting. We also present results of running the algorithm on a standard dataset.

Deep Belief Networks

Deep Belief Networks (DBN) are graphical models that are obtained by stacking and training Restricted Boltzmann Machines (RBM) in a greedy, unsupervised manner [1]. DBNs are trained to extract a deep hierarchical representation of the training data by modeling the joint distribution between the observed vector x and the l hidden layers h^k as follows, where the distribution for each hidden layer is conditioned on the layer immediately preceding it [4]:

P(x, h^1, …, h^l) = ( ∏ from k=0 to l-2 of P(h^k | h^{k+1}) ) · P(h^{l-1}, h^l),  with x = h^0

Equation 1: DBN Distribution

The relationship between the input layer and the hidden layers can be observed in the figure below. At a high level, the first layer is trained as an RBM that models the raw input x. The input is a sparse binary vector representing the data to be classified, e.g., a binary image of a digit. Subsequent layers are trained using the transformed data (sample or mean activations) from the previous layers as training examples. The number of layers can be determined empirically to obtain the best model performance, and DBNs support arbitrarily many layers.


Figure 1: DBN layers

Training an RBM works as follows. For the input data supplied to the RBM, training runs for a predefined number of epochs. The input data is divided into small batches, and the weights, activations, and deltas are computed for each batch.
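The production code is not reproduced here; purely as an illustration, a compact NumPy sketch of CD-1 (one-step contrastive divergence) training for a single RBM over mini-batches might look like this, with the hidden layer size, learning rate, and epoch count as arbitrary assumptions.

# Illustrative NumPy sketch of contrastive-divergence (CD-1) training for a
# single RBM over mini-batches; hyperparameters are arbitrary assumptions.
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(data, n_hidden=500, epochs=50, batch_size=100, lr=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis = np.zeros(n_visible)
    b_hid = np.zeros(n_hidden)

    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            v0 = data[start:start + batch_size]
            # Positive phase: hidden activations given the data.
            h0 = sigmoid(v0 @ W + b_hid)
            h0_sample = (rng.random(h0.shape) < h0).astype(float)
            # Negative phase: one step of Gibbs sampling (CD-1).
            v1 = sigmoid(h0_sample @ W.T + b_vis)
            h1 = sigmoid(v1 @ W + b_hid)
            # Parameter deltas from the difference of correlations.
            W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
            b_vis += lr * (v0 - v1).mean(axis=0)
            b_hid += lr * (h0 - h1).mean(axis=0)
    return W, b_vis, b_hid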


After all layers are trained, the parameters of the deep network are fine-tuned using a supervised training criterion. The supervised training criterion can, for instance, be framed as a classification problem, which then allows the deep network to be used to solve a classification problem. More complex supervised criteria can be employed, which can provide interesting results such as scene interpretation, for instance explaining what objects are present in a picture.

Infrastructure

Deep learning has received a great deal of attention not only because it can deliver results superior to some other learning algorithms, but also because it can be run in a distributed setting, allowing the processing of large-scale datasets. Deep networks can be parallelized at two major levels – at the layer level and at the data level [6]. For layer-level parallelization, many implementations use GPU arrays to compute layer activations in parallel and frequently synchronize them. However, this approach is not suitable for clusters where data can reside across multiple machines connected by a network, because of high network costs. Data-level parallelization, in which the training is parallelized over subsets of the data, is more suitable for these settings. Most of the data at PayPal is stored in Hadoop clusters; hence, being able to run the algorithms in those clusters is our priority. Dedicated cluster maintenance and support is also an important factor for us to consider. However, since deep learning is inherently iterative in nature, a paradigm such as MapReduce is not well suited for running these algorithms. With the advent of Hadoop 2.0 and YARN-based resource management, however, we can write iterative applications, as we can finely control the resources the application uses. We adapted IterativeReduce [7], a simple abstraction for writing iterative algorithms in Hadoop YARN, and were able to deploy it in one of the PayPal clusters running Hadoop 2.4.1.

Methodology

We implemented the core deep learning algorithm by Hinton, referenced in [2]. Since our requirement is to distribute the algorithm across a multi-machine cluster, we adapted it for such a setting. For distributing the algorithm across multiple machines, we followed the guidelines proposed by Grazia, et al. [6]. A high-level summary of our implementation is given below:

  1. Master node initializes the weights of the RBM
  2. Master node pushes the weights and the splits to the worker nodes.
  3. The worker trains an RBM layer for one dataset epoch, i.e., one complete pass through its entire split, and sends the updated weights back to the master node.
  4. The master node averages the weights from all workers for a given epoch.
  5. Steps 3 and 4 are repeated for a predefined number of epochs (50 in our case).
  6. After step 5 is done, one layer is trained. These steps are repeated for subsequent RBM layers.
  7. After all layers are trained, the deep network is fine-tuned using error back-propagation.

The figure below describes a single dataset epoch (steps 3 and 4) while running the deep learning algorithm. We note that this paradigm can be leveraged to implement a host of machine learning algorithms that are iterative in nature.


Figure 2: Single dataset epoch for training

 

Training a DBN on a single machine involves the following steps. The dataset is first divided into multiple batches. Then multiple RBM layers are initialized and trained sequentially. After the RBMs are trained, they are passed through a fine-tuning phase that uses error back-propagation.

 


We adapted the IterativeReduce [7] implementation for much of the YARN “plumbing”. We did a major overhaul of the implementation to make it usable for our deep learning implementation. The IterativeReduce implementation was written for the Cloudera Hadoop distribution, which we re-platformed to the standard Apache Hadoop distribution. We also rewrote the implementation to use the standard programming models described in [8]. In particular, we used the YarnClient API for communication between the client application and the ResourceManager, and the AMRMClient and NMClient APIs for interaction between the ApplicationMaster and the ResourceManager and NodeManager.

We first submit an application to the YARN resource manager using the YarnClient API.


After the application is submitted, the YARN resource manager launches the application master. The application master is responsible for allocating and releasing the worker containers as necessary. It uses the AMRMClient to communicate with the resource manager.


The application master uses the NMClient API to run commands in the containers it received from the resource manager.


Once the application master launches the worker containers it requires, it sets up a port to communicate with the workers. For our deep learning implementation, we added the methods required for parameter initialization, layer-by-layer training, and fine-tune signaling to the original IterativeReduce interface. IterativeReduce uses Apache Avro IPC for Master-Worker communication.

The master-worker exchange during distributed training proceeds as a series of steps: the master sends the initial parameters to the workers, and each worker then trains its RBM on its portion of the data. When a worker is done training, it sends its results back to the master, which combines them. After the iterations are completed, the master completes the process by starting the back-propagation fine-tuning phase.
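Setting the Avro IPC and YARN plumbing aside, the per-layer exchange reduces to the parameter-averaging loop sketched below in Python. The Worker class and its train_epoch method are hypothetical simplifications, not the IterativeReduce interface; the training call itself would be the per-split RBM update sketched earlier.

# Simplified sketch of the master/worker parameter-averaging loop for one RBM
# layer. Worker is a hypothetical stand-in for the IterativeReduce/Avro IPC
# plumbing described in the text; its training step is stubbed out here.
import numpy as np


class Worker:
    def __init__(self, data_split):
        self.data_split = data_split

    def train_epoch(self, weights):
        """Run one dataset epoch on this split and return updated weights."""
        # Stand-in for per-split RBM training (see the earlier CD-1 sketch).
        return weights + 0.01 * np.random.standard_normal(weights.shape)


def train_layer(workers, n_visible, n_hidden, epochs=50):
    # Step 1: the master initializes the RBM weights.
    weights = 0.01 * np.random.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        # Steps 2-3: push weights to every worker; each trains on its split.
        updates = [w.train_epoch(weights.copy()) for w in workers]
        # Step 4: the master averages the returned weights.
        weights = np.mean(updates, axis=0)
    return weights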


Results

We evaluated the performance of the deep learning implementation using the MNIST handwritten digit recognition dataset [3]. The dataset contains manually labeled handwritten digits ranging from 0 to 9. The training set consists of 60,000 images and the test set consists of 10,000 images.

In order to measure the performance, the DBN was first pre-trained and then fine-tuned on the 60,000 training images. After the above steps, the DBN was then evaluated on the 10,000 test images. No pre-processing was done on the images during training or evaluation. The error rate was obtained as a ratio between total number of misclassified images and the total number of images on the test set.

We were able to achieve the best classification error rate of 1.66% when using RBM with 500-500-2000 hidden units in each RBM, and using a 10-node distributed cluster setting. The error rate is comparable with the error rate of 1.2% reported by authors of the original algorithm (with 500-500-2000 hidden units) [2], and with some of the results with similar settings reported in [3]. We note that original implementation was on a single machine, and our implementation is on a distributed setting. The parameter-averaging step contributes to slight reduction in performance, although the benefit of distributing the algorithm over multiple machines far outweighs the reduction. The table below summarizes the error rate variation per the number of hidden units in each layer while running on a 10-node cluster.


Table 1: MNIST performance evaluation

Further thoughts

We had a successful deployment of a distributed deep learning system, which we believe will prove useful in solving some of the machine learning problems. Furthermore, the iterative reduce abstraction can be leveraged to distribute any other suitable machine learning algorithm. Being able to utilize the general-purpose Hadoop cluster will prove highly beneficial for running scalable machine-learning algorithms on large datasets. We note that there are several improvements we would like to make to the current framework, chiefly around reducing the network latency and having more advanced resource management. Additionally we’d like to optimize the DBN framework so that we can minimize inter-node communication. With fine-grained control of cluster resources, Hadoop Yarn framework provides us the flexibility to do so.

References

[1] G. E. Hinton, S. Osindero, and Y. Teh. A fast learning algorithm for deep belief nets. Neural Computations, 18(7):1527–1554, 2006.

[2] G. E. Hinton and R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786):504–507, 2006.

[3] Y. LeCun, C. Cortes, C. J.C. Burges. The MNIST database of handwritten digits.

[4] Deep Learning Tutorial. LISA lab, University of Montreal

[5] G. E. Hinton. A Practical Guide to Training Restricted Boltzmann Machines. Lecture Notes in Computer Science Volume 7700: 599-619, 2012.

[6] M. Grazia, I. Stoianov, M. Zorzi. Parallelization of Deep Networks. ESANN, 2012

[7] IterativeReduce, https://github.com/jpatanooga/KnittingBoar/wiki/IterativeReduce

[8] Apache Hadoop YARN – Enabling Next Generation Data Applications, http://www.slideshare.net/hortonworks/apache-hadoop-yarn-enabling-nex

10 Myths of Enterprise Python


2016 Update: Whether you enjoy myth busting, Python, or just all enterprise software, you will also likely enjoy Enterprise Software with Python, presented by the author of the article below, and published by O’Reilly.

PayPal enjoys a remarkable amount of linguistic pluralism in its programming culture. In addition to the long-standing popularity of C++ and Java, an increasing number of teams are choosing JavaScript and Scala, and Braintree‘s acquisition has introduced a sophisticated Ruby community.

One language in particular has both a long history at eBay and PayPal and a growing mindshare among developers: Python.

Python has enjoyed many years of grassroots usage and support from developers across eBay. Even before official support from management, technologists of all walks went the extra mile to reap the rewards of developing in Python. I joined PayPal a few years ago, and chose Python to work on internal applications, but I’ve personally found production PayPal Python code from nearly 15 years ago.

Today, Python powers over 50 projects, including:

  • Features and products, like RedLaser
  • Operations and infrastructure, both OpenStack and proprietary
  • Mid-tier services and applications, like the one used to set PayPal’s prices and check customer feature eligibility
  • Monitoring agents and interfaces, used for several deployment and security use cases
  • Batch jobs for data import, price adjustment, and more
  • And far too many developer tools to count

In the coming series of posts I’ll detail the initiatives and technologies that led the eBay/PayPal Python community to grow from just under 25 engineers in 2011 to over 260 in 2014. For this introductory post, I’ll be focusing on the 10 myths I’ve had to debunk the most in eBay and PayPal’s enterprise environments.

Myth #1: Python is a new language

What with all the start-ups using it and kids learning it these days, it's easy to see how this myth still persists. Python is actually over 23 years old, originally released in 1991, 5 years before HTTP 1.0 and 4 years before Java. A now-famous early usage of Python was in 1996: Google's first successful web crawler.

If you’re curious about the long history of Python, Guido van Rossum, Python’s creator, has taken the care to tell the whole story.

Myth #2: Python is not compiled

While not requiring a separate compiler toolchain like C++, Python is in fact compiled to bytecode, much like Java and many other compiled languages. Further compilation steps, if any, are at the discretion of the runtime, be it CPython, PyPy, Jython/JVM, IronPython/CLR, or some other process virtual machine. See Myth #6 for more info.
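You can see the compilation step for yourself with the standard library's dis module, which disassembles the bytecode CPython produces for a function:

# CPython compiles functions to bytecode; dis shows the compiled instructions.
import dis


def add(a, b):
    return a + b


dis.dis(add)
# Prints the bytecode instructions (e.g. LOAD_FAST, BINARY_ADD or BINARY_OP,
# RETURN_VALUE); the exact opcodes vary slightly across CPython versions.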

The general principle at PayPal and elsewhere is that the compilation status of code should not be relied on for security. It is much more important to secure the runtime environment, as virtually every language has a decompiler, or can be intercepted to dump protected state. See the next myth for even more Python security implications.

Myth #3: Python is not secure

Python’s affinity for the lightweight may not make it seem formidable, but the intuition here can be misleading. One central tenet of security is to present as small a target as possible. Big systems are anti-secure, as they tend to overly centralize behaviors, as well as undercut developer comprehension. Python keeps these demons at bay by encouraging simplicity. Furthermore, CPython addresses these issues by being a simple, stable, and easily-auditable virtual machine. In fact, a recent analysis by Coverity Software resulted in CPython receiving their highest quality rating.

Python also features an extensive array of open-source, industry-standard security libraries. At PayPal, where we take security and trust very seriously, we find that a combination of hashlib, PyCrypto, and OpenSSL, via PyOpenSSL and our own custom bindings, cover all of PayPal’s diverse security and performance needs.

For these reasons and more, Python has seen some of its fastest adoption at PayPal (and eBay) within the application security group. Here are just a few security-based applications utilizing Python for PayPal’s security-first environment:

  • Creating security agents for facilitating key rotation and consolidating cryptographic implementations
  • Integrating with industry-leading HSM technologies
  • Constructing TLS-secured wrapper proxies for less-compliant stacks
  • Generating keys and certificates for our internal mutual-authentication schemes
  • Developing active vulnerability scanners

Plus, myriad Python-built operations-oriented systems with security implications, such as firewall and connection management. In the future we’ll definitely try to put together a deep dive on PayPal Python security particulars.

Myth #4: Python is a scripting language

Python can indeed be used for scripting, and is one of the forerunners of the domain due to its simple syntax, cross-platform support, and ubiquity among Linux, Macs, and other Unix machines.

In fact, Python may be one of the most flexible technologies among general-use programming languages. To list just a few:

  1. Telephony infrastructure (Twilio)
  2. Payments systems (PayPal, Balanced Payments)
  3. Neuroscience and psychology (many, many, examples)
  4. Numerical analysis and engineering (numpy, numba, and many more)
  5. Animation (LucasArts, Disney, Dreamworks)
  6. Gaming backends (Eve Online, Second Life, Battlefield, and so many others)
  7. Email infrastructure (Mailman, Mailgun)
  8. Media storage and processing (YouTube, Instagram, Dropbox)
  9. Operations and systems management (Rackspace, OpenStack)
  10. Natural language processing (NLTK)
  11. Machine learning and computer vision (scikit-learn, Orange, SimpleCV)
  12. Security and penetration testing (so many, including eBay/PayPal)
  13. Big Data (Disco, Hadoop support)
  14. Calendaring (Calendar Server, which powers Apple iCal)
  15. Search systems (ITA, Ultraseek, and Google)
  16. Internet infrastructure (DNS) (BIND 10)

Not to mention web sites and web services aplenty. In fact, PayPal engineers seem to have a penchant for going on to start Python-based web properties, YouTube and Yelp, for instance. For an even bigger list of Python success stories, check out the official list.

Myth #5: Python is weakly-typed

Python’s type system is characterized by strong, dynamic typing. Wikipedia can explain more.

Not that it is a competition, but as a fun fact, Python is more strongly-typed than Java. Java has a split type system for primitives and objects, with null lying in a sort of gray area. On the other hand, modern Python has a unified strong type system, where the type of None is well-specified. Furthermore, the JVM itself is also dynamically-typed, as it traces its roots back to an implementation of a Smalltalk VM acquired by Sun.
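A quick way to see the "strong" part is that Python refuses to silently coerce unrelated types, even though names themselves carry no type:

# Dynamic but strong: names carry no type, values do, and Python will not
# silently coerce unrelated types the way a weakly-typed language might.
x = "5"            # x currently refers to a str
x = 5              # now it refers to an int -- dynamic typing

try:
    "5" + 5        # no implicit str/int coercion -- strong typing
except TypeError as e:
    print(e)       # can only concatenate str (not "int") to str

print(type(None))  # <class 'NoneType'> -- even None has a well-specified type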

Python’s type system is very nice, but for enterprise use there are much bigger concerns at hand.

Myth #6: Python is slow

First, a critical distinction: Python is a programming language, not a runtime. There are several Python implementations:

  1. CPython is the reference implementation, and also the most widely distributed and used.
  2. Jython is a mature implementation of Python for usage with the JVM.
  3. IronPython is Microsoft’s Python for the Common Language Runtime, aka .NET.
  4. PyPy is an up-and-coming implementation of Python, with advanced features such as JIT compilation, incremental garbage collection, and more.

Each runtime has its own performance characteristics, and none of them are slow per se. The more important point here is that it is a mistake to assign performance assessments to a programming language. Always assess an application runtime, most preferably against a particular use case.

Having cleared that up, here is a small selection of cases where Python has offered significant performance advantages:

  1. Using NumPy as an interface to Intel’s MKL SIMD
  2. PyPy‘s JIT compilation achieves faster-than-C performance
  3. Disqus scales from 250 to 500 million users on the same 100 boxes

Admittedly these are not the newest examples, just my favorites. It would be easy to get side-tracked into the wide world of high-performance Python and the unique offerings of runtimes. Instead of addressing individual special cases, attention should be drawn to the generalizable impact of developer productivity on end-product performance, especially in an enterprise setting.

C++ vs. Python: two languages, one output, as apples-to-apples as it gets. The Python version is approximately a tenth the size.

Given enough time, a disciplined developer can execute the only proven approach to achieving accurate and performant software:

  1. Engineer for correct behavior, including the development of respective tests
  2. Profile and measure performance, identifying bottlenecks (see the brief cProfile sketch after this list)
  3. Optimize, paying proper respect to the test suite and Amdahl’s Law, and taking advantage of Python’s strong roots in C.
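For step 2, the standard library already ships what you need. A minimal example with cProfile and pstats (the slow_sum function is just a stand-in workload):

# Step 2 in practice: measure before optimizing. cProfile ships with CPython.
import cProfile
import pstats


def slow_sum(n):
    total = 0
    for i in range(n):
        total += i * i
    return total


profiler = cProfile.Profile()
profiler.enable()
slow_sum(1_000_000)
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(5)   # show the top 5 hotspots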

It might sound simple, but even for seasoned engineers, this can be a very time-consuming process. Python was designed from the ground up with developer timelines in mind. In our experience, it's not uncommon for Python projects to undergo three or more iterations in the time it takes C++ or Java projects to do just one. Today, PayPal and eBay have seen multiple success stories wherein Python projects outperformed their C++ and Java counterparts, with less code (see the comparison above), all thanks to fast development times enabling careful tailoring and optimization. You know, the fun stuff.

Myth #7: Python does not scale

Scale has many definitions, but by any definition, YouTube is a web site at scale. More than 1 billion unique visitors per month, over 100 hours of uploaded video per minute, and going on 20 percent of peak Internet bandwidth, all with Python as a core technology. Dropbox, Disqus, Eventbrite, Reddit, Twilio, Instagram, Yelp, EVE Online, Second Life, and, yes, eBay and PayPal all have Python scaling stories that prove scale is more than just possible: it's a pattern.

The key to success is simplicity and consistency. CPython, the primary Python virtual machine, maximizes these characteristics, which in turn makes for a very predictable runtime. One would be hard pressed to find Python programmers concerned about garbage collection pauses or application startup time. With strong platform and networking support, Python naturally lends itself to smart horizontal scalability, as manifested in systems like BitTorrent.

Additionally, scaling is all about measurement and iteration. Python is built with profiling and optimization in mind. See Myth #6 for more details on how to vertically scale Python.

Myth #8: Python lacks good concurrency support

Occasionally, when debunking performance and scaling myths, someone tries to get technical: “Python lacks concurrency,” or, “What about the GIL?” If dozens of counterexamples are insufficient to bolster one's confidence in Python's ability to scale vertically and horizontally, then an extended explanation of a CPython implementation detail probably won't help, so I'll keep it brief.

Python has great concurrency primitives, including generators, greenlets, Deferreds, and futures. Python has great concurrency frameworks, including eventlet, gevent, and Twisted. Python has had some amazing work put into customizing runtimes for concurrency, including Stackless and PyPy. All of these and more show that there is no shortage of engineers effectively and unapologetically using Python for concurrent programming. Also, all of these are officially supported and/or used in enterprise-level production environments. For examples, refer to Myth #7.

The Global Interpreter Lock, or GIL, is a performance optimization for most use cases of Python, and a development ease optimization for virtually all CPython code. The GIL makes it much easier to use OS threads or green threads (greenlets usually), and does not affect using multiple processes. For more information, see this great Q&A on the topic and this overview from the Python docs.
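A small standard-library illustration of that practical guidance: threads for I/O-bound work (the GIL is released while waiting on I/O), processes for CPU-bound work. The URL and workload here are arbitrary examples.

# The GIL rarely matters for I/O-bound work (threads release it while
# waiting); for CPU-bound work, multiple processes sidestep it entirely.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from urllib.request import urlopen


def fetch(url):
    with urlopen(url) as resp:           # I/O-bound: GIL released during I/O
        return len(resp.read())


def crunch(n):
    return sum(i * i for i in range(n))  # CPU-bound: run in separate processes


if __name__ == "__main__":
    urls = ["https://www.paypal.com"] * 4
    with ThreadPoolExecutor(max_workers=4) as pool:
        sizes = list(pool.map(fetch, urls))

    with ProcessPoolExecutor(max_workers=4) as pool:
        totals = list(pool.map(crunch, [10_000_000] * 4))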

Here at PayPal, a typical service deployment entails multiple machines, with multiple processes, multiple threads, and a very large number of greenlets, amounting to a very robust and scalable concurrent environment (see figure below). In most enterprise environments, parties tend to prefer a fairly high degree of overprovisioning, for general prudence and disaster recovery. Nevertheless, in some cases Python services still see millions of requests per machine per day, handled with ease.


A rough sketch of a single worker within our coroutine-based asynchronous architecture. The outermost box is the process, the next level is threads, and within those threads are green threads. The OS handles preemption between threads, whereas I/O coroutines are cooperative.

Myth #9: Python programmers are scarce

There is some truth to this myth. There are not as many Python web developers as PHP or Java web developers. This is probably mostly due to a combined interaction of industry demand and education, though trends in education suggest that this may change.

That said, Python developers are far from scarce. There are millions worldwide, as evidenced by the dozens of Python conferences, tens of thousands of StackOverflow questions, and companies like YouTube, Bank of America, and LucasArts/Dreamworks employing Python developers by the hundreds and thousands. At eBay and PayPal we have hundreds of developers who use Python on a regular basis, so what’s the trick?

Well, why scavenge when one can create? Python is exceptionally easy to learn, and is a first programming language for children, university students, and professionals alike. At eBay, it only takes one week to show real results for a new Python programmer, and they often really start to shine as quickly as 2-3 months, all made possible by the Internet’s rich cache of interactive tutorials, books, documentation, and open-source codebases.

Another important factor to consider is that projects using Python simply do not require as many developers as other projects. As mentioned in Myth #6 and Myth #7, lean, effective teams like Instagram are a common trope in Python projects, and this has certainly been our experience at eBay and PayPal.

Myth #10: Python is not for big projects

Myth #7 discussed running Python projects at scale, but what about developing Python projects at scale? As mentioned in Myth #9, most Python projects tend not to be people-hungry. While Instagram reached hundreds of millions of hits a day at the time of its billion-dollar acquisition, the whole company was still only a group of a dozen or so people. Dropbox in 2011 had only 70 engineers, and other teams were similarly lean. So, can Python scale to large teams?

Bank of America actually has over 5,000 Python developers, with over 10 million lines of Python in one project alone. JP Morgan underwent a similar transformation. YouTube also has engineers in the thousands and lines of code in the millions. Big products and big teams use Python every day, and while it has excellent modularity and packaging characteristics, beyond a certain point much of the general development scaling advice stays the same. Tooling, strong conventions, and code review are what make big projects a manageable reality.

Luckily, Python starts with a good baseline on those fronts as well. We use PyFlakes and other tools to perform static analysis of Python code before it gets checked in, and we adhere to PEP8, Python’s language-wide base style guide.
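As an illustration only, a pre-checkin gate along these lines can be as simple as the sketch below; it assumes pyflakes and pycodestyle (the PEP8 checker) are installed, and the file names are placeholders rather than our actual tooling.

```python
# A tiny pre-checkin gate: fail the check if any linter reports problems.
import subprocess
import sys

FILES = ["service.py", "handlers.py"]  # placeholder module names

def run(tool, files):
    # each tool returns a non-zero exit code when it finds issues
    return subprocess.run([tool, *files]).returncode

if __name__ == "__main__":
    failed = sum(run(tool, FILES) for tool in ("pyflakes", "pycodestyle"))
    sys.exit(1 if failed else 0)
```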

Finally, it should be noted that, in addition to the scheduling speedups mentioned in Myth #6 and #7, projects using Python generally require fewer developers, as well. Our most common success story starts with a Java or C++ project slated to take a team of 3-5 developers somewhere between 2-6 months, and ends with a single motivated developer completing the project in 2-6 weeks (or hours, for that matter).

A miracle for some, but a fact of modern development, and often a necessity for a competitive business.

A clean slate

Mythology can be a fun pastime. Discussions around these myths remain some of the most active and educational, both internally and externally, because implied in every myth is a recognition of Python’s strengths. Also, remember that the appearance of these seemingly tedious and troublesome concerns is a sign of steadily growing interest, and with steady influx of interested parties comes the constant job of education. Here’s hoping that this post manages to extinguish a flame war and enable a project or two to talk about the real work that can be achieved with Python.

Keep an eye out for future posts where I’ll dive deeper into the details touched on in this overview. If you absolutely must have details before then, or have corrections or comments, shoot me an email at mahmoud@paypal.com. Until then, happy coding!

Building Data Science at Scale

By

As part of the Boston-based Engineering group, the Data Science team’s charter is to enable science-based personalization and recommendation for PayPal’s global users. As companies of all sizes are starting to leverage their data assets, data science has become indispensable in creating relevant user experiences. Helping fulfill PayPal’s mission to build the Web’s most convenient payment solution, the team works with various internal partners and strives to deliver best-in-class data science.

Technology Overview

At the backend of the data science platform reside large-scale machine learning engines that continuously learn and predict from transactional, behavioral, and other datasets. An example of a question we might try to answer is: if someone just purchased a piece of software, does that increase their likelihood of purchasing electronics in the near future? The answer lies in the huge amounts of transaction data: what people bought in the past is predictive of what they might consider buying next.

We leverage state-of-the-art machine learning technologies to make such predictions. Machine learning itself has been quickly evolving in recent years. New advances including large-scale matrix factorization, probabilistic behavioral models, and deep learning are no strangers to the team. To make things work at large scale, we leverage Apache Hadoop and its rich ecosystem of tools to process large amounts of data and build data pipelines that are part of the data science platform.
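To make the idea concrete, here is a deliberately tiny matrix-factorization sketch in plain NumPy; real pipelines run at far larger scale on Hadoop and Spark, and the toy data, dimensions, and hyperparameters below are assumptions for illustration only.

```python
# Toy matrix factorization: learn latent user/item factors from a 0/1
# purchase matrix, then rank items for a user by predicted affinity.
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, k = 1000, 500, 16                         # toy sizes; k latent factors
R = (rng.random((n_users, n_items)) < 0.01).astype(float)   # sparse purchase matrix

U = 0.1 * rng.standard_normal((n_users, k))   # user factors
V = 0.1 * rng.standard_normal((n_items, k))   # item factors
lr, reg = 0.05, 0.02                          # learning rate, L2 penalty

for _ in range(20):                           # a few epochs of gradient descent
    E = R - U @ V.T                           # reconstruction error
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)

scores = U[0] @ V.T                           # predicted affinity for user 0
top_items = np.argsort(scores)[::-1][:10]     # items to recommend next
```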

Tackling Data Science

Companies take different approaches in tackling data science. While some companies define a data scientist as someone who performs statistical modeling, we at PayPal Engineering have chosen to take a combined science & engineering approach. Our data scientists are “analytically-minded, statistically and mathematically sophisticated data engineers” [1]. There are, of course, more science-inclined and more engineering-inclined individuals on the team, but there is much more of a blend of expertise than a marked distinction between them. This approach to data science allows us to quickly iterate and operationalize high-performing predictive models at scale.

The Venn diagram below, which bears similarity to Conway’s diagram [2], displays the three cornerstones pivotal to the success of the team. Science, which entails machine learning, statistics, and analytics, is the methodology by which we generate actionable predictions and insights from very large datasets. The Engineering component, which includes Apache Hadoop and Spark, makes it possible for us to do science and analytics at scale and deliver results with quick turnaround. Last but not least, I cannot overemphasize the importance of understanding the business for a data scientist. None of the best work in this area that I know of is done in isolation. It is through understanding the problem domain that a data scientist can arrive at better predictive models, among other results.

DS-VennDiagram

Scaling Data Science

There are multiple dimensions to scaling data science, which at a minimum involves the team, technology infrastructure, and the operating process. While each of these is worth thoughtful discussion in its own right, I will focus on the operating process since it is critical to how value is delivered. A typical process for the team could break down into these steps:

  • Identify business use case;
  • Understand business logic and KPIs;
  • Identify and capture datasets;
  • Proof of concept and back-test;
  • Operationalize predictive models;
  • Measure lift, optimize and iterate.

Let me explain some of the key elements. First, we see scaling data science as an ongoing collaboration with Product and Business teams. Understanding product and business KPIs and using data science to optimize them is an essential ingredient of our day-to-day. Second, we follow the best practice in data science whereby each predictive model is fully back-tested before operationalization. This practice safeguards the effectiveness of the data science platform. Third, the most powerful predictive modeling often requires iterative measurement and optimization. As a concrete example, putting this process into practice along with PayPal Media Network, we were able to achieve excellent results based on:

  • Lookalike modeling: Merchants can reach consumers who look like the merchant’s best existing customers (a minimal sketch follows this list).
  • Purchase intent modeling: Merchants can engage consumers who have a propensity to spend within specific categories.
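As a simplified illustration of the lookalike idea, one could score prospects by their similarity to a merchant’s seed customers; the features and numbers below are placeholders, not our production model.

```python
# Rank prospects by cosine similarity to the centroid of seed customers.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# rows = users, columns = behavioral features (e.g. spend by category)
seed_customers = np.array([[5.0, 0.0, 2.0],
                           [4.0, 1.0, 3.0]])
prospects      = np.array([[4.5, 0.5, 2.5],   # looks like the seeds
                           [0.0, 6.0, 0.0]])  # does not

centroid = seed_customers.mean(axis=0)
scores = [cosine(p, centroid) for p in prospects]
ranked = np.argsort(scores)[::-1]             # best lookalikes first
print(ranked, scores)
```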

While it is challenging to crunch the data and create tangible value, it is interesting and rewarding work. I hope to discuss more details of all the fun things we do as data scientists in the future.

References:

[1] http://www.forbes.com/sites/danwoods/2011/10/11/emc-greenplums-steven-hillion-on-what-is-a-data-scientist

[2] http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Alternative Design for Versioning of Services – Chain of Responsibility

By

Change is inevitable, and interfaces (particularly of remote services) are no exception. Such interfaces go through various revisions for reasons such as enhanced features, bug fixes, and so on. One should use versioning of the form Major.Minor.Patch in order to avoid surprises for clients who have already integrated. Two tasks that vary at the interface layer based on version are: (1) parsing the request parameters and (2) composing the response parameters. What happens in between these two, the business logic processing, is not touched upon in this blog, as it requires a separate discussion altogether. Also, this blog is focused on the code maintainability aspect only and does not get into service routing concerns.

Conventional approach

Service Framework owners have conventionally used multi-level inheritance as the class design to parse the request and compose the response parameters for each version. In the diagram below, every new version demands a new class that must inherit from the class of the most recent version. Each class delegates to its immediate base class (the previous version) after performing its own task, be it parsing the request, composing the response, or both.
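In code, the conventional approach might look roughly like the following; Python is used here for brevity, and the class and field names are illustrative.

```python
# Conventional design: each new version subclasses the previous one,
# so the inheritance chain grows with every release.
class ResponseComposerV1:
    def compose(self, response):
        response["id"] = "123"            # fields introduced in v1
        return response

class ResponseComposerV2(ResponseComposerV1):
    def compose(self, response):
        response["status"] = "ACTIVE"     # field added in v2
        return super().compose(response)  # delegate to the previous version

class ResponseComposerV3(ResponseComposerV2):
    def compose(self, response):
        response["currency"] = "USD"      # field added in v3
        return super().compose(response)  # the hierarchy keeps deepening

print(ResponseComposerV3().compose({}))
```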

Though this design has been tested and proven over many years, it has this tiny hitch called multi-level inheritance. Is there a better alternative?

Better approach (Backward Chain of Responsibility)

One thing we can’t get away from in the conventional approach is the fact that new classes need to be created for every version, for the benefit of isolation. Having everything in one class makes it less modular and less cohesive. So, it is only the type of relationship that can be different from the conventional one (multi-level inheritance). An alternative could be a slight variation of the Gang of Four (GoF) behavioral pattern called Chain of Responsibility. I am calling it “Backward Chain of Responsibility” to emphasize the variation aspect.

The diagram below explains the design well, focusing just on the response composition task. You can also check out a sample application written to demonstrate this design here: RamkumarManavalan/APIVersioningClassDesign

As mentioned earlier, a new class is created for every new version. These classes inherit from a base concrete class that has the previousInChain member (of the same type). The classes are chained together through the previousInChain member in the super class. For instance, the previousInChain data member of the version2 object points to version1; in other words, version1 is positioned before version2 in the chain. Each version class performs its task before calling the super class’s method, which in turn passes control to the previous member in the chain.
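A minimal Python sketch of this backward chain, again focusing on response composition, might look like the following; the field and class names mirror the prose but are otherwise illustrative.

```python
# Backward Chain of Responsibility: every version class inherits from one
# concrete base holding previousInChain, so the hierarchy stays one level deep.
class VersionHandler:
    def __init__(self, previous_in_chain=None):
        self.previous_in_chain = previous_in_chain  # link to the older version

    def compose_response(self, response):
        # pass control backwards down the chain once this version is done
        if self.previous_in_chain is not None:
            self.previous_in_chain.compose_response(response)
        return response

class Version1(VersionHandler):
    def compose_response(self, response):
        response["id"] = "123"             # fields introduced in v1
        return super().compose_response(response)

class Version2(VersionHandler):
    def compose_response(self, response):
        response["status"] = "ACTIVE"      # field added in v2
        return super().compose_response(response)

# chain: version2 -> version1 (each node points to the previous version)
v1 = Version1()
v2 = Version2(previous_in_chain=v1)
print(v2.compose_response({}))   # {'status': 'ACTIVE', 'id': '123'}
```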

Given a version number, you can choose to either start from the beginning of the chain or go directly to the corresponding member in the chain. For instance, if there are 10 versions and version 8 is asked for, you can either start from version 10 and walk down the chain (versions 10 and 9 would simply pass the request along), or you can start directly from version 8. The choice really depends on the frequency of use of the individual versions. In the code example I wrote, I chose to write a factory that starts from the corresponding version class directly, for performance reasons.
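Continuing the sketch above, a possible factory that dispatches directly to the requested version could look like this; the names are assumptions, not the linked sample’s actual API.

```python
# Build the whole chain once, index each node by version, and jump straight
# to the node for the requested version.
def build_handlers(version_classes):
    handlers, previous = {}, None
    for version, cls in sorted(version_classes.items()):
        previous = handlers[version] = cls(previous_in_chain=previous)
    return handlers

HANDLERS = build_handlers({1: Version1, 2: Version2})

def handler_for(version):
    # dispatch directly to the requested version's node in the chain
    return HANDLERS[version]

print(handler_for(2).compose_response({}))  # {'status': 'ACTIVE', 'id': '123'}
print(handler_for(1).compose_response({}))  # {'id': '123'}
```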

Where this design fares better than the conventional approach is in the absence of a multi-level hierarchy. The hierarchy level is fixed at one (1) through the use of delegation/composition.

Sample code: RamkumarManavalan/APIVersioningClassDesign

Other alternative

A discussion with my colleagues on this topic yielded a Decorator chain as an alternative. Though it solves the problem, I still prefer Chain of Responsibility, because the members of the versioning chain are best executed in a fixed order (upward or downward), whereas in a Decorator the order does not really matter. Also, a Decorator wraps something “core,” which is not really present in the versioning use case.

Courtesy: the diagrams were created using an application based on the PlantUML library.