Revolution Analytics is a commercial provider of software and services related to enterprise implementations of the open source language R. At its base level, R is a programming language built by statisticians for statistical analysis, data mining and predictive analytics. In a broader sense, it is data analysis software used by data scientists to access data, develop and perform statistical modeling and visualize data. The R community has a growing user base of more than two million worldwide, and more than 4,000 available applications cover specific problem domains across industries. Both the R Project and Revolution Analytics have significant momentum in the enterprise and in academia.
Revolution Analytics provides value by taking the most recent release from the R community and adding scalability and other functionality so that R can work seamlessly in a commercial environment. Revolution R provides a development environment in which data scientists can write and debug R code more effectively, and web service APIs that let R integrate with business intelligence tools, dashboards and visual discovery tools. In addition, Revolution Analytics makes much of its money through professional and support services.
In the big data analytics context, speed and scale are critical drivers of success, and Revolution R appears to deliver on both. It is built with the Intel Math Kernel Library, which multithreads math operations at the processor level so that they use multiple cores simultaneously. In test cases on a single node, open source R was able to scale to only about 400,000 observations in a linear regression model, while Revolution R Enterprise handled millions. With respect to speed, Revolution R 6.1 completed a principal component analysis in about 11 seconds, versus 142 seconds for open source R 2.14.2, which makes it roughly 13 times faster.
Companies are collecting enormous amounts of data, but few have active big data analytics strategies. Our big data benchmark research shows that more than 50 percent of companies in our sample maintain more than 10TB of data, but often they cannot analyze it because of scale issues. Furthermore, our research into predictive analytics shows that integration with the existing architecture is the biggest obstacle to implementing predictive analytics.
Revolution Analytics helps address these challenges in three ways. First, it can perform file-based analytics, where a single node orchestrates commands across a cluster of commodity servers and delivers the results back to the end user. This is an on-premises solution that runs on Linux clusters or Microsoft HPC clusters.
A perhaps more exciting use case is alignment with the Hadoop MapReduce paradigm. Here Revolution Analytics allows direct manipulation of files in HDFS, can submit a job directly to the Hadoop JobTracker, and can directly define and manipulate analytical data frames through HBase database tables. When front-ended with a visualization tool such as Tableau, this ability to work with data directly in Hadoop becomes a powerful tool for big data analytics.
A third use case is the parallelization of computations within the database itself. This in-database approach is gaining a lot of traction for big data analysis, primarily because it is the most efficient way to do analytics on very large structured datasets without moving a lot of data. In this scenario, data scientists build a model within the database using the R library for data exploration, or use a prebuilt model to quickly scour the entire dataset. IBM’s PureData System for Analytics (IBM’s new name for its MPP Netezza appliance) uses the in-database approach with an R instance running on each processing unit in the database, each of which is connected to an R server via ODBC. The analytics are invoked as the data is served up to the processor, so the algorithms run in parallel across all of the data.
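The cluster and in-database scenarios above share one pattern: each worker computes on its own slice of the data, and a coordinator combines the small partial results. Below is a minimal sketch of that pattern for a one-variable linear regression. It is written in Python purely for illustration and is not Revolution Analytics code or an R API. The names `Stats`, `partial_stats` and `fit` are hypothetical, the data is made up, and the "nodes" are simply list slices processed in a loop rather than real servers.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Sufficient statistics for simple linear regression y = a + b*x."""
    n: int = 0
    sx: float = 0.0   # sum of x
    sy: float = 0.0   # sum of y
    sxx: float = 0.0  # sum of x*x
    sxy: float = 0.0  # sum of x*y

    def merge(self, other: "Stats") -> "Stats":
        # Combining two partitions only requires adding their sums,
        # so no raw rows ever have to move between workers.
        return Stats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )


def partial_stats(rows):
    """Work done by one (hypothetical) node on its local partition."""
    s = Stats()
    for x, y in rows:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit(stats):
    """Least-squares slope and intercept from the combined statistics."""
    denom = stats.n * stats.sxx - stats.sx ** 2
    slope = (stats.n * stats.sxy - stats.sx * stats.sy) / denom
    intercept = (stats.sy - slope * stats.sx) / stats.n
    return slope, intercept


# Made-up data following y = 2x + 1, split across three "nodes".
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]

# Coordinator step: merge the per-partition results, then fit once.
combined = Stats()
for part in partitions:
    combined = combined.merge(partial_stats(part))

slope, intercept = fit(combined)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because only five numbers per partition travel back to the coordinator, the fit covers every row without the full dataset ever sitting in one machine's memory. That is the intuition behind doing the analytics where the data lives.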
So why does all of this matter? Our benchmark research into predictive analytics shows that companies that are able to score and update their models more efficiently show higher maturity and gain greater competitive advantage. From an analytical perspective, we can now squeeze more value out of our large data sets. We can analyze all of the data and take a census approach instead of a sampling approach, which in turn allows us to better understand the errors that exist in our models and to identify outliers and patterns that are not linear in nature. The ability to identify outliers is probably the most important of these capabilities, since seeing the data anomalies often leads to the biggest insights and competitive advantage. From a business perspective, we can apply models to understand things such as individual customer behavior, a company’s risk profile or how to exploit market uncertainty for tactical advantage.
I’ve heard that Revolution R isn’t always the easiest software to use and that the experience isn’t exactly seamless. While our research shows that usability is important, it can be argued that in the cutting-edge field of big data analytics, a challenging environment is to be expected. If Revolution Analytics can address some of these usability challenges, it may find its pie growing even faster than it is now. Regardless, I anticipate that Revolution Analytics will continue its fast growth; its headcount is already doubling year over year. Furthermore, I anticipate that in-database analytics will become the de facto approach to big data analytics and that companies that take full advantage of that trend will reap benefits.