Datameer, a Hadoop-based analytics company, had a major presence at the recent Hadoop Summit, led by CEO Stefan Groschupf’s keynote and panel appearance. Besides announcing its latest product release, which is an important advance for the company and its users, Datameer’s outspoken CEO put forth contrarian arguments about the current direction of some of the distributions in the Hadoop ecosystem.
The challenge for the growing ecosystem surrounding Hadoop, the open source processing paradigm, has been in accessing data and building analytics that serve business uses in a straightforward manner. Our benchmark research into big data shows that the two most pressing challenges to big data analytics are staffing (79%) and training (77%). This so-called skills gap is at the heart of the Hadoop debate, since it often takes someone with not just domain skills but also programming and statistical skills to derive value from data in a Hadoop cluster. Datameer is dedicated to addressing this challenge by integrating its software directly with the various Hadoop distributions to provide analytics and access tools, which include visualization and a spreadsheet interface. My analysis of Datameer from last year describes this approach in more detail.
At the conference, Datameer announced version 3.0 of its namesake product with a celebrity twist. Olympic athlete Sky Christopherson presented a keynote describing how the U.S. women’s cycling team, a heavy underdog, used Datameer to help it earn a silver medal in London. Following that introduction, Groschupf, one of the original contributors to Nutch (Hadoop’s predecessor), discussed features of Datameer 3.0 and what the company is calling “Smart” analytics, which include a variety of advanced analytic techniques such as clustering, decision trees, recommendations and column dependencies.
Our benchmark research into predictive analytics shows that classification trees (used by 69% of participants), association rules (49%) and k-nearest neighbor (36%) are the techniques used most often; all are included in the Datameer product. Both on stage and in a private briefing, company spokespeople downplayed the specific techniques in favor of the usability aspects and examples of business use for each of them. Clustering of Hadoop data allows marketing and business analytics professionals to view how data groups together naturally, while decision trees help analysts see how a data set breaks down into successively narrower subsets, one split at a time, rather than as overlapping groups in a Venn diagram. In this regard clustering is more of a bottom-up approach and decision trees more of a top-down approach. For instance, in a cluster analysis, the analyst combines multiple attributes at one time to understand the dimensions along which the data group. This can inform broad decisions about strategic messaging and product development. In contrast, with a decision tree, one can look, for instance, at all sales data to see which industries are most likely to buy a product, then follow the tree to see what size of companies within the industry are the best prospects, and then the subset of buyers within those companies who are the best targets.
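To make the top-down reading of a tree concrete, here is a minimal Python sketch of the drill-down described above: pick the industry with the highest purchase rate, then the best company size within it, then the best buyer role. It is a greedy walk down a single branch, not a tree-learning algorithm and not Datameer's implementation. The `sales` records, field names and helper functions are all hypothetical.

```python
from collections import defaultdict

def best_segment(records, attribute):
    """Return (value, purchase_rate) for the attribute value with the highest purchase rate."""
    totals = defaultdict(int)
    buys = defaultdict(int)
    for r in records:
        totals[r[attribute]] += 1
        buys[r[attribute]] += r["bought"]
    value = max(totals, key=lambda v: buys[v] / totals[v])
    return value, buys[value] / totals[value]

def drill_down(records, attributes):
    """Follow one branch top-down: at each level keep only the best-performing segment."""
    path = []
    subset = records
    for attribute in attributes:
        value, rate = best_segment(subset, attribute)
        path.append((attribute, value, rate))
        subset = [r for r in subset if r[attribute] == value]
    return path

# Hypothetical sales records, invented purely for illustration.
sales = [
    {"industry": "retail", "size": "large", "role": "marketing", "bought": 1},
    {"industry": "retail", "size": "large", "role": "it", "bought": 0},
    {"industry": "retail", "size": "large", "role": "marketing", "bought": 1},
    {"industry": "retail", "size": "small", "role": "it", "bought": 0},
    {"industry": "finance", "size": "large", "role": "it", "bought": 0},
    {"industry": "finance", "size": "small", "role": "marketing", "bought": 1},
    {"industry": "finance", "size": "small", "role": "it", "bought": 0},
]

path = drill_down(sales, ["industry", "size", "role"])
for attribute, value, rate in path:
    print(f"{attribute}: {value} (purchase rate {rate:.2f})")
```

On this toy data the walk narrows from retail, to large companies, to marketing buyers. A real decision tree would also choose which attribute to split on at each level rather than taking the order as given.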
Datameer’s column dependencies can show analysts relationships between different column variables. The output appears much like a correlation matrix, but uses a technique called mutual information. The key benefit of this technique over a traditional correlation approach is that it allows comparison between different types of variables, such as continuous and categorical variables. However, there is a trade-off in usability: the numeric output is not the correlation coefficient with which many analysts are familiar. (I encourage Datameer to give analysts a quick reference of some type to help interpret the numbers associated with this less-known output.) Once the output is understood, it can be useful in exploring specific relationships and testing hypotheses. For instance, a company can test the hypothesis that it is more vertically focused than competitors by looking at industry and deal close rates. If there is no relationship between the variables, the hypothesis may be dismissed and a more horizontal strategy pursued.
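For readers unfamiliar with mutual information, the following Python sketch computes it for two columns of discrete labels using the standard definition, I(X;Y) = sum over (x, y) of p(x, y) * log2(p(x, y) / (p(x) * p(y))). This is not Datameer's implementation, which the company did not detail. The `industry` and `closed` columns are hypothetical, and a continuous column would first need to be binned into categories for this simple version to apply.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two equal-length sequences of discrete labels."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of each (x, y) pair
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy / (p_x * p_y), written in terms of counts
        mi += p_xy * log2(p_xy * n * n / (px[x] * py[y]))
    return mi

# Hypothetical deal data: industry of each account and whether the deal closed.
industry = ["retail"] * 4 + ["finance"] * 4
closed_dependent = [1, 1, 1, 1, 0, 0, 0, 0]    # close rate fully determined by industry
closed_independent = [1, 1, 0, 0, 1, 1, 0, 0]  # close rate identical across industries

print(mutual_information(industry, closed_dependent))    # 1.0
print(mutual_information(industry, closed_independent))  # 0.0
```

As a rough reading guide, a value of 0 means the two columns are independent, and the value can be no larger than the entropy of either column (1 bit here, since each column has two equally likely values). Unlike a correlation coefficient, it is never negative and says nothing about the direction of a relationship.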
The other technique Datameer spoke of is recommendation, also known as next best offer analysis; it is a relatively well-known technique that has been popularized by Amazon and other retailers. Recommendation engines can help marketing and sales teams increase share of wallet through cross-sell and up-sell opportunities. None of these four techniques is new to the world of analytics; the novelty is that Datameer allows this analysis directly on Hadoop, which incorporates new forms of data including Web behavior data and social media data. While many in the Hadoop ecosystem focus on descriptive analysis related to SQL, Datameer’s foray into more advanced analytics pushes the Hadoop envelope.
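One simple way to picture a recommendation engine is item co-occurrence: count how often two products appear in the same basket and suggest the most frequent companions. The Python sketch below illustrates that idea only. Datameer did not say which algorithm it uses, and the `baskets` data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(baskets):
    """Count how often each ordered pair of distinct items appears in the same basket."""
    co = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1
    return co

def recommend(item, co, top_n=3):
    """Return up to top_n items most often bought together with the given item."""
    scores = Counter({b: c for (a, b), c in co.items() if a == item})
    return [b for b, _ in scores.most_common(top_n)]

# Hypothetical purchase baskets, invented for illustration.
baskets = [
    ["laptop", "mouse", "bag"],
    ["laptop", "mouse"],
    ["laptop", "bag", "dock"],
    ["mouse", "pad"],
    ["laptop", "mouse", "dock"],
]

co = build_cooccurrence(baskets)
print(recommend("laptop", co, top_n=1))
```

Here the mouse is bought with the laptop in three of the five baskets, more than any other item, so it becomes the next best offer for a laptop buyer. Production systems typically normalize for item popularity and work at far larger scale.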
Aside from the launch of Datameer 3.0, Groschupf and his team used Hadoop Summit to espouse the position that the SQL approach of many Hadoop vendors is a mistake. The crux of the argument is that Hadoop is a sequential access technology (much like a magnetic cassette tape) in which a large portion of the data must be read before the correct data can be pulled off the disk. Groschupf argues that this is fundamentally inefficient and that current MPP SQL approaches do a much better job of processing SQL-related tasks. To illustrate the difference he characterized Hadoop as a freight train and an analytic appliance database as a Ferrari; each, of course, has its proper uses. Customers thus should decide what they want to do with the data from a business perspective and then choose the appropriate technology.
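The sequential-access argument can be illustrated with a deliberately simplified toy in Python: finding one record by scanning in order versus looking it up through an index. This is an analogy only, under the assumption that each record read costs the same. Real Hadoop jobs parallelize scans across many nodes, and MPP databases use far more than a single in-memory index. All names here are hypothetical.

```python
def scan_lookup(records, key):
    """Sequential access: read records in order until the key is found."""
    reads = 0
    for k, value in records:
        reads += 1
        if k == key:
            return value, reads
    return None, reads

def indexed_lookup(index, key):
    """Indexed access: a single probe into a prebuilt index."""
    return index.get(key), 1

# Hypothetical table of 100,000 rows; the row we want happens to be the last one.
records = [(i, f"row-{i}") for i in range(100_000)]
index = dict(records)

value_scan, reads_scan = scan_lookup(records, 99_999)
value_idx, reads_idx = indexed_lookup(index, 99_999)
print(reads_scan, reads_idx)
```

Both lookups return the same row, but the scan touches every record while the index touches one. The scan, like the freight train, pays off when the job needs most of the data anyway, as in a full aggregation.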
This leads to another point Groschupf made to me: that the big data discussion is shifting away from the technical details to a business orientation. In support of this point, he showed me a comparison of the Google search terms “big data” and “Hadoop.” The latter was more common in the past few years, when it was almost synonymous with big data, but now generic searches for big data are more common. Our benchmark research into business technology innovation shows a similar shift in buying criteria, with about two-thirds (64%) of buyers naming usability as the most important priority. By the way, a number of Ventana Research blogs, including this one, have focused on the trend of outcome-based buying and decision making.
For organizations curious about big data and what they can do to take advantage of it, Datameer can be a low-risk place to start exploring. The company offers a free download version of its product so you can start looking at data immediately. The idea of time-to-value is critical with big data, and this is a key value proposition for Datameer. I encourage users to test the product with an eye to uncovering interesting data that was never available for analysis before. This will help build the big data business use case, especially in a bootstrap funding environment where money, skills and time are short.