In Too Big to Ignore, I wrote about the increasing importance of technologies and systems designed to handle non-relational data. Yes, the structured information on employees, sales, customers, inventory, and the like still matters. But the story doesn’t end with Small Data. There’s a great deal of value to be gleaned from the petabytes of unstructured data lying outside of organizations’ walls. Hadoop is just one tool that can help realize that value.
But no one ever said that Hadoop was perfect or even ideal. The first major iteration of any important technology or application never is.
To that end, data geeks like me could hardly contain their excitement at the announcement that Hadoop 2.0 is now generally available.
The biggest change in Apache Hadoop 2.2.0, the first generally available version of the 2.x series, is the addition of Apache YARN, also known as MapReduce 2.0, as the layer that manages cluster resources. MapReduce is a big feature in Hadoop: it is the batch processor that lines up jobs that go into the Hadoop Distributed File System (HDFS) to pull out useful information. In Hadoop 1.x, resource management was built into MapReduce itself, so every workload on the cluster had to run as a batch MapReduce job.
With the update, YARN lets multiple processing tools, not just MapReduce, work on the data in HDFS at the same time.
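For readers who have never looked inside a MapReduce job, the idea boils down to three steps: map each record to key/value pairs, group the pairs by key, and reduce each group to a result. Here is a toy sketch of a word count in plain Python. It does not use Hadoop at all, it runs on a single machine, and the function names are my own, purely for illustration.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data is big", "small data still matters"]
    print(word_count(sample))
```

On a real cluster the map and reduce steps run in parallel across many machines and read from HDFS. The sketch only shows the shape of the computation.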
Hadoop and Platforms
I asked my friend Scott Kahler about Hadoop 2.0, and he was nothing short of effusive. “Yes, it’s a huge deal. YARN will make Hadoop a distributed app platform and not just a Big-Data processing engine,” Kahler told me. “YARN is enabling things like graph databases (Giraph) and event processing engines (Storm) to get instantiated much easier on common distributed system infrastructure.”
I know a thing or two about platforms, and Hadoop 2.0 underscores the fact that Hadoop is becoming a de facto ecosystem for Big Data developers across the globe. Got an idea for a new app or web service? Build it on top of Hadoop. Take the core product in a different direction. If others find that app or web service useful, expect further development on top of your work.
Simon Says: We’re Just Getting Started
Hadoop naysayers abound. For all I know, Hadoop isn’t the single best way of handling Big Data. Still, it’s hard to argue that the increased functionality of its second major iteration isn’t a big deal. As Hadoop continues to evolve and improve, its benefits begin to exceed its costs.
Yes, many if not most organizations will still resist Big Data for all sorts of reasons. An increasingly developer-friendly Hadoop, though, means great things for enterprises willing to jump into the Big Data fray.
Feedback
What say you?