June 5, 2014

Choosing a Commercial R Distribution Over Open Source R

RevR organised a meet-up in San Jose on 2nd June, just before Hadoop Summit 2014, and I attended to catch up on what is happening in the commercial R world. This post covers a subset of the topics discussed at the meet-up, punctuated with my own opinions.

When we think of data science and analytics, R and SAS are the two software solutions that most often come to mind. SAS has been around for decades as the old guard of data science, since before the term was coined and before Big Data became a buzzword in enterprises and Internet-powered businesses.

R is an open source, community-driven project. Hundreds of people from universities and research institutions all over the world contribute to R, which makes it a rich and popular technology for data science, machine learning and predictive analytics. R was created in 1991, building upon earlier work at AT&T, where a data analysis language called S had been in development since 1977. You can read the full story of S and R here:

http://www.research.att.com/articles/featured_stories/2013_09/201309_SandR.html?fbid=Yxy4qyQzmMa

In the overcrowded space of machine learning and predictive analytics tools, also known as data science tools, Revolution Analytics (RevR) is a vendor that sells a commercial version of R known as Revolution R Enterprise (RRE), which is a superset of open source R. It comes bundled with the after-sales support that enterprises need when they rely on a technology to run their business.

A common argument in favour of R is that it has a community of 2 million active users. A large and active community means rapid innovation and an abundance of skilled professionals. R is also taught in universities, which suggests that young data scientists will prefer to solve data science problems in R. However, given the extensive choice of data science tools on the market, both open source and commercial, there is a good chance that young data scientists will instead come to prefer another tool during their studies, such as MLlib or Matlab.

Open source R does not scale very well because it is single-threaded and memory-bound. RRE addresses the threading limitation by linking R with Intel's multi-threaded math kernel libraries. ScaleR, the Big Data predictive analytics library included with RRE, provides fast, scalable, parallel algorithms for analyzing very large data sets, large models with many independent variables, or both. ScaleR is particularly suited to analyzing Big Data datasets such as web clickstreams or tweets. According to RevR, people use RRE when they want performance and innovative features that are not available in open source R.
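To make the "memory bound" point concrete, here is a minimal sketch of the general idea behind chunked, parallelisable algorithms: each chunk of data is reduced to a few summary numbers, and only those summaries are combined. It is written in Python for illustration and is not ScaleR's implementation or API; the function names (chunked, partial_sums, combine, fit_line_in_chunks) are my own, and the model is a plain one-variable least-squares line.

```python
from functools import reduce
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[float, float]  # (x, y) observation
# Per-chunk summary: n, sum(x), sum(y), sum(x*x), sum(x*y)
Sums = Tuple[float, float, float, float, float]


def chunked(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield rows in lists of at most `size`, so only one chunk is in memory."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def partial_sums(chunk: List[Row]) -> Sums:
    """Reduce one chunk to five numbers; chunks are independent of each other."""
    return (
        float(len(chunk)),
        sum(x for x, _ in chunk),
        sum(y for _, y in chunk),
        sum(x * x for x, _ in chunk),
        sum(x * y for x, y in chunk),
    )


def combine(a: Sums, b: Sums) -> Sums:
    """Merge two summaries by element-wise addition."""
    return tuple(p + q for p, q in zip(a, b))  # type: ignore[return-value]


def fit_line_in_chunks(chunks: Iterable[List[Row]]) -> Tuple[float, float]:
    """Fit y = slope * x + intercept from chunk summaries alone."""
    zero: Sums = (0.0, 0.0, 0.0, 0.0, 0.0)
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_sums, chunks), zero)
    denominator = n * sxx - sx * sx
    if denominator == 0:
        raise ValueError("need at least two distinct x values")
    slope = (n * sxy - sx * sy) / denominator
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    rows = ((float(x), 2.0 * x + 1.0) for x in range(1000))
    print(fit_line_in_chunks(chunked(rows, 100)))
```

Because partial_sums never looks outside its own chunk, the map step could in principle be handed to several threads or machines, with combine merging the results afterwards. That is the property that lets this style of algorithm work on data larger than memory.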

Deploying data science algorithms on operational systems is a major challenge for enterprises, because the architecture of the systems where the algorithms are developed often differs from that of the operational systems. For example, if you develop a predictive model in R to calculate the best offers for customers, this model cannot be deployed directly in web applications, which are often built on Java, .NET or some other technology that does not support direct integration with R. To solve this problem, RevR offers DeployR, which allows you to expose models developed in R as web services. Once a model is exposed as a web service, integrating it with web applications is not a very hard task.
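The integration pattern described above can be sketched as follows. This is not DeployR's actual API: the /score endpoint, the JSON fields (monthly_spend, offer) and the stand_in_model rule are all hypothetical and only stand in for a real model running in R behind a web service. The point is that the calling application needs nothing beyond an HTTP client and a JSON parser.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def stand_in_model(features):
    """Placeholder for a model that would really be fitted and run in R."""
    return "premium" if features.get("monthly_spend", 0) >= 100 else "standard"


class ScoreHandler(BaseHTTPRequestHandler):
    """Hypothetical scoring service: POST /score with JSON features."""

    def do_POST(self):
        if self.path != "/score":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"offer": stand_in_model(features)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_service():
    """Start the stand-in service on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), ScoreHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def best_offer(base_url, features):
    """What the web application does: send features, read back the offer."""
    request = Request(
        base_url + "/score",
        data=json.dumps(features).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    opener = build_opener(ProxyHandler({}))  # talk to localhost directly
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["offer"]


if __name__ == "__main__":
    server = start_service()
    url = "http://127.0.0.1:%d" % server.server_address[1]
    print(best_offer(url, {"monthly_spend": 150}))
    server.shutdown()
    server.server_close()
```

The same best_offer call could be written just as easily in Java or .NET, which is why exposing the model over HTTP sidesteps the need for those platforms to integrate with R directly.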

Overall, RRE is an interesting product that should be on the candidate list of enterprises planning to do serious data science work with R. RevR has published a comparison of the various flavours of R on its website, which is a good place to start your R selection journey.

http://www.revolution-computing.com/which-r-is-right-for-me
