If you enjoyed my previous blog post, “Hadoop Is the Elephant in the Room,” perhaps you’d be interested in what your organization might do with Hadoop. As I mentioned, the Hadoop World event this week showcased some of the biggest and most mature Hadoop implementations, such as those of eBay, Facebook, Twitter and Yahoo. Those of you who need 8,500 processors and 16 petabytes of storage like eBay likely already know about Hadoop. But is Hadoop relevant to organizations that have far less data than that, yet still a lot?
For those not yet familiar with Hadoop, it is an open source software project with two key components: the Hadoop Distributed File System (HDFS) and a data processing and job scheduling technique called MapReduce. There are as many as nine other components, depending on which distribution you use, as well as complementary tools and products from proprietary and open source software companies. In this post I’ll concentrate on why you might be interested in learning more about Hadoop and its components rather than explaining what each of the components does.
I see three common use cases for Hadoop:
1) To store and analyze large amounts of data without having to load the data into an RDBMS
2) To convert large amounts of unstructured or semistructured data (such as log files) into structured data so it can be loaded into an RDBMS
3) To perform complex analytics that are hard to express in SQL, such as graph analysis and data mining.
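To make the second use case a little more concrete, here is a minimal sketch of the map-and-reduce idea in plain Python. It does not use Hadoop or any Hadoop API; it only imitates the map, group and reduce steps inside a single process. The log format, the sample lines and the function names (`map_line`, `reduce_counts`, `run_job`) are all assumptions I made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical log format, invented for this illustration:
#   2010-10-12 14:03:22 GET /search?q=hadoop 200
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})$"
)


def map_line(line):
    """Map step: turn one raw log line into zero or one (key, value) pairs."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return []  # skip lines that do not fit the assumed format
    path = match.group("path").split("?", 1)[0]  # drop the query string
    return [((match.group("date"), path), 1)]


def reduce_counts(key, values):
    """Reduce step: collapse all values for one key into a structured row."""
    date, path = key
    return {"date": date, "path": path, "hits": sum(values)}


def run_job(lines):
    """Run map, group by key, then reduce, all in one process."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return [reduce_counts(key, grouped[key]) for key in sorted(grouped)]


# Made-up sample input, including one line that does not parse.
sample_log = [
    "2010-10-12 14:03:22 GET /search?q=hadoop 200",
    "2010-10-12 14:03:25 GET /search?q=hive 200",
    "this line is garbage",
    "2010-10-13 09:15:01 POST /login 302",
]

rows = run_job(sample_log)
for row in rows:
    print(row)
```

Each resulting row has fixed columns (date, path, hits), which is the kind of structured output that could then be loaded into an RDBMS. In a real cluster the map and reduce steps would run in parallel across many machines; this sketch only shows the shape of the computation.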
Generally the factor that prompts organizations to consider Hadoop is data volume. Hadoop is designed to process large batches of data quickly. Several presenters at the conference said it enables them to do analyses that they couldn’t do previously. Often there is no real alternative to Hadoop to complete such analyses in a reasonable timeframe. The other initial attraction is cost savings derived from Hadoop being an open source technology, which holds down or eliminates software licensing and upgrade fees.
Hadoop World offered 45 breakout sessions. By far the largest market segment represented was Web-related businesses such as AOL, eBay, Facebook, Mozilla, StumbleUpon, Twitter and Yahoo. These organizations have to deal with large volumes of log files, search strings and social network data. Other market segments represented included media and advertising, financial services, healthcare and government intelligence.
In the media and advertising space, organizations are using Hadoop to perform best-ad-offer analysis and to analyze the performance of online videos to determine, for example, the factors behind viewer abandonment. I was surprised that only a handful of the 900+ attendees identified themselves as being part of the financial services industry. Bank of America gave a presentation, but it didn’t go into a lot of detail on how it is using Hadoop. Chicago Mercantile Exchange speakers talked about how they analyze the daily streams of transaction data. I also know of at least two firms (not part of the event) that are using Hadoop to analyze trade data and back-test trading algorithms. One chose Hadoop because complex algorithms are easier to express in it than in SQL. The other chose Hadoop to replace an RDBMS because of Hadoop’s cost advantages.
In the healthcare space, one presentation discussed analyzing the intersection of mountains of electronic health records, treatment protocols and clinical outcomes. I also know of pharmaceutical organizations using Hadoop in the drug discovery process. And while I also know of Hadoop being used in the intelligence community, if I told you about it I’d have to kill you. However, it is easy to imagine that the intelligence community would be interested in social network analysis, digital image analysis and other analyses involving large amounts of data and/or complex algorithms that would be difficult to express in SQL.
For more use cases and examples of the popularity of Hadoop, see http://wiki.apache.org/hadoop/PoweredBy where close to 200 organizations have voluntarily listed information about how they use it.
Having discussed the virtues of the technology, I also want to point out some caveats about it. First, Hadoop is not a real-time processing environment but a batch processing environment with response times measured in minutes or hours depending on data volumes. I heard several times at the event that just starting up a Hadoop job takes around 30 seconds. Your mileage may vary, but the point is that Hadoop doesn’t provide subsecond or even few-seconds response times.
Second, Hadoop is not a database environment in the traditional sense. However, it can be used to store large amounts of data, such as source files or detailed data, that generally is not accessed on a frequent basis. Shifting some of this type of data to Hadoop can help reduce the licensing costs of a traditional RDBMS. Frequently accessed data (typically the results of a Hadoop job) would be stored in an RDBMS for any type of ad-hoc or frequent query and analysis.
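As a rough sketch of that division of labor, the snippet below loads a small set of summary rows into a relational table and runs an ad-hoc query against it. It uses Python’s built-in `sqlite3` module with an in-memory database purely as a stand-in for whatever RDBMS you run; the table name `daily_hits`, the sample rows and the helper `hits_for_path` are hypothetical, and the rows merely stand in for the output of a Hadoop job.

```python
import sqlite3

# Hypothetical summary rows, standing in for the output of a Hadoop job.
summary_rows = [
    ("2010-10-12", "/search", 2),
    ("2010-10-12", "/login", 5),
    ("2010-10-13", "/search", 7),
]

# An in-memory SQLite database stands in for the production RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_hits (day TEXT, path TEXT, hits INTEGER)")
conn.executemany("INSERT INTO daily_hits VALUES (?, ?, ?)", summary_rows)
conn.commit()


def hits_for_path(connection, path):
    """Ad-hoc query: total hits for one path across all days."""
    cursor = connection.execute(
        "SELECT COALESCE(SUM(hits), 0) FROM daily_hits WHERE path = ?",
        (path,),
    )
    return cursor.fetchone()[0]


print(hits_for_path(conn, "/search"))
```

The bulky detail data stays in Hadoop, while the small, frequently queried summary lives in the database, where this kind of lookup does not have to wait for a batch job to start.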
Whether your motivation is to achieve scalability, cost savings or complex analytics, Hadoop is a technology worth considering. At this point there are plenty of examples of its use you can draw upon to understand how it could be relevant to your organization.