Last week I attended Spark Summit East 2016 at the New York Hilton Midtown. It revealed several ways in which Spark technology might impact the big data market.
Apache Spark is an open source data processing engine designed for large-scale computing. Spark is often used in conjunction with the open source Apache Hadoop, but it can also be used with other data sources such as Cassandra, MongoDB and Amazon S3. The creators of Spark founded Databricks, which drives the roadmap for Spark and leads community evangelism, including organizing the Spark Summit events. According to Arsalan Tavakoli-Shiraji, VP of customer engagement and business development for Databricks, the company contributes approximately 75 percent of the code to the Spark project.
If you are wondering why Spark matters, consider that in June 2015, IBM announced that it would put more than 3,500 researchers and developers on Spark-related projects and would donate its IBM SystemML machine-learning technology to the Spark open source ecosystem. This scale of investment is likely to attract others in the industry, and interest in Spark is growing elsewhere, too. In his keynote presentation, Matei Zaharia, co-founder and CTO of Databricks, claimed that summit attendees (across multiple locations) grew from 1,100 in 2014 to 3,900 in 2015. I was told by event staff that the New York event alone, including preconference training sessions, exhibitors and sponsors, had approximately 1,500 attendees, doubling attendance from the previous year.
This is where it starts to feel like, in the words of Yogi Berra, déjà vu all over again. In 2010 and 2011 I attended and wrote about events held in the very same venue. At the time, interest was growing in something called “Hadoop,” which, according to our recent benchmark research on big data analytics, has now grown to the point where it is used by 37 percent of organizations to process their big data analytics.
Recently there has been a lot of buzz around Spark, including speculation that it could replace Hadoop. Just search for “Spark replace Hadoop” and see how many pages come up. However, this is off target: Spark won’t replace Hadoop, in particular because Spark is not designed to store data. Rather, it is designed to be used in conjunction with data storage tools such as the Hadoop Distributed File System (HDFS). Spark does, however, address a shortcoming in one part of Hadoop: MapReduce. MapReduce was designed to process very large amounts of data in parallel to speed up processing, but it was designed for batch processing. As big data has grown in popularity, more and more users are accessing it and demanding interactive access regardless of data volume. More than half (54%) of participants in our research rated right-time and real-time analytics as important, and nearly half (48%) rated visual and data discovery important. Spark’s fundamental performance advantage over MapReduce is that it is designed to process data in memory wherever possible.
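The in-memory advantage can be sketched in a few lines of plain Python. This is a toy illustration, not Spark code: it contrasts an iterative job that re-reads its input on every pass, as a chain of MapReduce jobs would, with one that loads the data once and iterates over the cached copy, which is the essence of Spark's approach.

```python
# Toy illustration (plain Python, not Spark): why caching a working set
# in memory helps iterative workloads.

def read_from_storage():
    """Stand-in for an expensive scan of data at rest (e.g., in HDFS)."""
    read_from_storage.scans += 1
    return list(range(1000))

# Batch style: each of three passes re-reads the source, as a chain of
# MapReduce jobs would.
read_from_storage.scans = 0
for _ in range(3):
    total = sum(x * 2 for x in read_from_storage())
scans_batch = read_from_storage.scans  # one scan per pass

# In-memory style: load once, then iterate over the cached copy.
read_from_storage.scans = 0
cached = read_from_storage()
for _ in range(3):
    total = sum(x * 2 for x in cached)
scans_cached = read_from_storage.scans  # a single scan

print(scans_batch, scans_cached)  # prints: 3 1
```

The arithmetic is trivial; the point is the scan count. When the per-pass work is a full read of a very large data set, eliminating repeated scans is where the speedup comes from.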
Spark provides real-time analytic capabilities on big data through several components. Spark Streaming processes streaming data, that is, data being generated constantly. Spark SQL provides interactive query capabilities on structured data. And Spark ML provides machine-learning capabilities that can be used to build analytical models, including predictive analytics. Our research shows that predictive analytics and statistics are the most important area of big data analytics, cited by more than three-fourths (78%) of participants.
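The idea behind Spark Streaming can also be sketched in plain Python. This is a toy illustration rather than the actual API: Spark Streaming treats a continuous feed as a sequence of small batches, updating running state after each one, and the sketch below mimics that micro-batch pattern with hypothetical sample data.

```python
from collections import Counter

# Toy illustration (plain Python, not the Spark Streaming API): a
# continuous feed is processed as a sequence of small batches, with
# running counts updated incrementally after each micro-batch.

micro_batches = [            # hypothetical events arriving over time
    ["error", "ok", "ok"],
    ["ok", "error"],
    ["ok"],
]

running_counts = Counter()
for batch in micro_batches:
    running_counts.update(batch)  # incremental update per micro-batch
    # In a real streaming job, results could be queried or pushed
    # downstream at this point, without waiting for the feed to end.

print(dict(running_counts))  # prints: {'error': 2, 'ok': 4}
```

Contrast this with a batch job, which would have to see the entire input before producing any counts; the streaming model yields up-to-date results after every small batch.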
At the conference Databricks announced Spark 2.0, planned for release in April or May. It will have three primary sets of enhancements. What is presently called Tungsten phase 2 improves performance of Spark with native memory management and better run-time code generation. Structured streaming combines real-time processing with batch and interactive queries. And the merging of APIs for DataFrame and DataSets will enable a single programming paradigm to be used for streaming and structured data.
There is a rich ecosystem developing around Spark. A variety of vendors attended the show including several of the Hadoop distribution vendors (Cloudera, Hortonworks, IBM and MapR) demonstrating and talking about how they support Spark. Analytic vendors including Alpine Data, Platfora, SAP, Stratio and Zoomdata demonstrated products that use Spark to provide interactive analytics over large amounts of data. Full-day training sessions that were part of the agenda were oversubscribed, with packed rooms and registration spots sold out on the event’s website.
The event was clearly focused on a technical audience, and a look through conference keynotes reveals many code snippets. However, this technical focus does not suggest a lack of business purpose in Spark applications. Use cases were also represented in the conference agenda, and future events may attract more business users. The popularity of the event and the acceptance of Spark by the Hadoop community and vendors suggest that Spark is here to stay – at least for the foreseeable future. If you are working with or plan to work with big data, it would be worthwhile to understand how Spark can be incorporated into your architecture, either directly as part of your own programming efforts or via third-party tools to use in conjunction with big data.
SVP & Research Director