Earlier this week I attended Hadoop World in New York City. Hosted by Cloudera, the one-day event was by almost all accounts a smashing success. Attendance was approximately double that of last year. There were five tracks filled mostly with user presentations. According to Mike Olson, CEO of Cloudera, the conference’s tweet stream (#hw2010) was one of the top 10 trending topics of that morning. Cloudera did an admirable job of organizing the event for the Hadoop community rather than co-opting it for its own purposes. Certainly, this was not done out of altruism, but it was done well and in a way that respected the time and interests of those attending.
If you are not familiar with Hadoop, it is an open source software framework used for processing “big data” in parallel across a cluster of industry-standard servers. Hadoop is largely synonymous with MapReduce, but the Hadoop framework has a variety of components including a distributed file system, a scripting language, a limited set of SQL operations and other data management tools.
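For readers new to the MapReduce idea, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any of its APIs; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. It only shows the shape of the model: a map step emits key/value pairs, the pairs are grouped by key, and a reduce step combines each group. On a real cluster, the framework would spread these steps across many servers.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on one machine (illustrative only)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "big elephant"]))
```

Because each map call looks at only one record and each reduce call at only one key, the work can be divided among many machines, which is what makes the approach suitable for very large data sets.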
By the way, the name Hadoop comes from a stuffed toy – a yellow elephant – belonging to Doug Cutting’s son, which made an appearance at the event. Doug created Hadoop and is now part of Cloudera’s management team.
How big is “big data”? In his opening remarks, Mike shared some statistics from a survey of attendees. The average Hadoop cluster among respondents was 66 nodes and 114 terabytes of data. However, there is quite a range. The largest in the survey responses was a cluster of 1,300 nodes and more than 2 petabytes of data. (Presenters from eBay blew this away, describing their production cluster of 8,500 nodes and 16 petabytes of storage.) Over 60 percent of respondents had 10 terabytes or less, and half were running 10 nodes or fewer.
The one criticism of the event I heard repeatedly was that the sessions were too short for the presenters to get into the meat of their applications. John Kreisa, VP of Marketing at Cloudera, told me he agreed and indicated that the sessions likely will be longer next year.
What is it that makes Hadoop an elephant in the room? Over the past 12 to 18 months Hadoop has gone mainstream. A year ago, you could still say it was a fringe technology, but this week’s event and the development of a strong ecosystem around Hadoop make it clear that it is a force to be reckoned with. Many of the analytic database vendors have announced some type of support for Hadoop. Aster Data, Greenplum, Netezza and Vertica were sponsors of the event. Data integration and business intelligence vendors also have announced support for Hadoop, including event sponsors Pentaho and Talend. An ecosystem of development, administration and management tools is emerging as well, as shown by announcements from Cloudera and Karmasphere.
My colleague wrote about Cloudera Version 3 when it was announced back in June. You can expect to see new Cloudera Distributions for Hadoop (CDH) annually. Cloudera Enterprise – the bundling of CDH plus Cloudera’s Management Tools – will be released semi-annually. Version 3.0 is in beta now. Version 3.5 is planned for the first quarter of 2011 and includes real-time activity monitoring and an expanded file browser, among other things.
If you work with big data but don’t know about Hadoop, you should spend some time learning about it. Our research is already finding the need for simpler and more cost-effective methods to manage and use big data for analytics, business intelligence and information applications. If you want to understand some of the ways in which Hadoop is being used, I have another blog post coming that will discuss its value for your business.