Report from the 2012 Hadoop Summit

Abstract: One of the hottest technology trends of 2012 is Big Data analytics. Nowhere was that more evident than at the 2012 Hadoop Summit in San Jose, which brought together a powerhouse lineup of commercial and open-source technology companies in the heart of Silicon Valley to discuss the state of Big Data and shared stewardship of the Hadoop platform, and to debate where the industry is headed. This article introduces Hadoop and the concept of Big Data, and makes the case for deploying a Hadoop platform to drive competitive advantage and customer profitability. Several use cases from a wide range of industries are also included.
 
 
Introduction to Hadoop and Big Data
In June of 2003, while a Stanford grad named Doug Cutting was working on the Lucene text-indexing library and an open-source web crawler project called Nutch, he developed a distributed file system that would later be called Hadoop (named after his son's toy elephant). Large internet search companies of the time, like Google and Yahoo, were generating far more data than commercial relational databases could efficiently handle and were forced to develop their own home-grown solutions to deal with the new variety, volume, and velocity of data generated by their sites. Two such solutions at Google were the Google File System (GFS), published in 2003, and MapReduce, published in 2004; Hadoop is derived from both of these systems.
 
Today Hadoop is an open-source Apache software project, written in Java and developed by a decentralized group of technologists around the world. Yahoo! is both one of the largest contributors to and one of the largest end users of core Hadoop, running it on over 42,000 servers for its search engine index. Facebook has one of the largest Hadoop clusters, storing over 32 petabytes of data.
 
Hadoop is essentially a system for distributed, massively parallel data processing. The Hadoop Distributed File System (HDFS) can store any type of what is commonly referred to as 'Big Data': web logs, images, video, audio, sensor data, and more. Justin Borgman from Hadapt said in his 'Analyzing Multi-Structured Data with Hadoop' Summit presentation that the best description he had heard was a 'supervised landfill' that is scalable, inexpensive, and practically limitless in the types and amount of data that can be stored there. Prior to Hadoop, most businesses relied on expensive commercial server and storage options, which gave a competitive edge to larger companies with the budgets to make large-scale technology investments. Because Hadoop is open source, large-scale data analysis has been democratized: a Hadoop cluster can be set up quickly, and open-source MapReduce jobs used to process the data stored on HDFS.
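To make that concrete, here is a minimal sketch of the classic word-count MapReduce job run against files stored on HDFS, written against the standard org.apache.hadoop.mapreduce API; the class name and input/output paths are illustrative only, not taken from any Summit presentation.

```java
// A minimal word-count MapReduce sketch using the org.apache.hadoop.mapreduce API.
// Class names and paths are illustrative assumptions.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit (word, 1) for every word in each input line stored on HDFS.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts for each word across all mappers.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory of web logs
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this is submitted with the hadoop jar command, and the framework distributes the map and reduce tasks across the cluster's nodes.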
 
Hadoop, or the Hadoop Project, is a family of open-source projects that includes core Hadoop (HDFS and MapReduce) and various top-level Apache subprojects. Hive is an open-source data warehouse that sits on top of Hadoop and removes the need to write MapReduce jobs in low-level programming languages by letting users query data with a SQL-like language called HiveQL. Pig is a high-level data-flow language that makes MapReduce programming feel closer to SQL and allows User Defined Functions (UDFs) to be written in Java and Python. HBase is a highly scalable Hadoop database that allows real-time read/write access, and Avro is a schema-based data serialization system.
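As a sketch of how these layers keep analysts out of low-level MapReduce code, the example below shows a minimal Pig User Defined Function in Java; the class name and the idea of normalizing a web-log URL are hypothetical illustrations, not from the article.

```java
// A minimal sketch of a Pig User Defined Function (UDF) in Java, assuming
// Pig's org.apache.pig.EvalFunc API; the class name and the URL-normalizing
// use case are illustrative assumptions.
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class NormalizeUrl extends EvalFunc<String> {
  // Called once per input tuple by the Pig runtime inside the generated MapReduce job.
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null; // Pig treats null as missing data
    }
    String url = ((String) input.get(0)).trim().toLowerCase();
    // Strip a trailing slash so equivalent web-log entries group together
    // in later aggregations.
    return url.endsWith("/") ? url.substring(0, url.length() - 1) : url;
  }
}
```

Registered from a Pig Latin script (REGISTER the jar, then call the function inside a FOREACH ... GENERATE statement), a UDF like this runs inside the MapReduce jobs that Pig generates, without the analyst writing any MapReduce code directly.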
 
The Hadoop Summit and What’s Ahead
The 2012 Hadoop Summit, held June 13th-14th, 2012, is now in its fifth year, hosted by Yahoo! and Hortonworks. Hortonworks is an independent company of programmers and key contributors that was formed in 2011 by Yahoo! and Benchmark Capital with a mission to further global awareness and adoption of Apache Hadoop. Its goal is to make Hadoop the go-to Big Data platform and to see half the world's data processed by Hadoop within the next five years. Although framework alternatives to Hadoop do exist, Hadoop is at the epicenter of the Big Data movement and of what was repeatedly referred to as next-generation architecture. New Hortonworks CEO Rob Bearden introduced the keynotes by explaining that we are now witnessing a data explosion that will continue over the next 12 to 18 months. Companies will have the opportunity to move from a reactive, post-transactional analysis focus to a proactive and interactive strategy with customers and suppliers. This shift is enabled by actionable, real-time intelligence that Hadoop platform solutions derive from Big Data generated by social, mobile, machine, and web sources.
 
Shaun Connolly, VP of Corporate Strategy at Hortonworks, followed up by explaining that we are at a point in history where we stand at the forefront of something really special, even by the innovation standards of Silicon Valley, and how excited he is to be part of the Big Data wave. In 2010 the Big Data market potential was sized at $66 billion, and more recently Bank of America projected the total addressable Big Data market at $100 billion, with the Hadoop-specific portion of that total estimated at $14 billion.
 
Big Data is typically defined by the three 'V's': Volume, Velocity, and Variety. Connolly, echoed by many other speakers and presentations, believes that the scope of Big Data also encompasses traditional structured transactional data as it intersects with social media, user-generated content, audio, video, sensor and observational data, and gaming data. All of this data gains value when it is integrated and aggregated to fuel analytics efforts such as A/B testing, behavioral analysis, and specific use cases. Several themes were stressed throughout the Summit. Technologists are encouraged to view themselves as stewards of the open-source cause and to continue building on top of the Hadoop platform to extend and improve its capabilities and further its adoption. Pig and Hive are excellent examples of projects that sit on top of Hadoop, and new ones that further its scope are emerging, such as Cascading, Avro, Chukwa, Whim, Snappy, Spark, OpenMI, Giraph, and Accumulo.
 
Hortonworks CTO Eric Baldeschwieler discussed how Apache Hadoop 1.0 is an enterprise-ready, stable release, while Hadoop 2.0, which has been in development for three years and is currently in alpha, represents a substantial rewrite. The Hadoop 1.0 releases extend capability with hflush and greater support for HBase and WebHDFS, security enhancements via Kerberos, and MapReduce limits that add stability to Hadoop clusters and prevent individual users from degrading performance for the entire system. Hadoop 2.0, available in alpha, includes YARN and HDFS Federation to support clusters of 10,000 nodes and more. YARN is a pluggable framework that supports multiple compute models, making MapReduce optional and opening the door to real-time and machine-learning workloads. The new release also allows HDFS to run many NameNodes sharing the storage, with rolling upgrades and wire-compatible APIs that eliminate cluster downtime.
 
Bringing Big Data to the Masses
Following on the heels of the Hadoop Summit was the Hortonworks Data Platform (HDP) 1.0 release on June 15th, 2012. HDP is an open-source platform based on core Hadoop components for superior data management. It incorporates an easy-to-use UI, an Ambari-based walkthrough for installing Hadoop clusters, a powerful cluster-monitoring dashboard, and custom alerts. HDP also offers Talend Open Studio for point-and-click data integration, as well as HCatalog for application data sharing. The HDP release represents a concentrated effort by the open-source Hadoop community to further democratize Big Data, one step closer to making this technology accessible to those without deep technical knowledge or programming skills.
 
The HDP platform directly addresses some of the key challenges that currently hamper the mission of expanding the Hadoop platform, its user base, and its adoption. With HDP, the Hadoop ecosystem becomes easier to install and use, reducing reliance on programmers and data scientists who are in short supply and high demand. Other challenges remain, such as slow enterprise-wide adoption of Big Data technology. Corporations have made large investments in infrastructure and are unlikely to abandon legacy RDBMS systems and replace them with Hadoop technology. For this reason it is essential to develop highly integrated solutions that coexist with open-source and commercial legacy systems, combining transactional data with new Big Data sources from the Hadoop platform to expand the scope of analysis far beyond what is possible at most companies today.
 
Who’s Who Snapshot at the 2012 Summit
The exhibition hall served as a meeting point for the 2,100+ Summit attendees. The Community Showcase included 49 vendors and a Dev Café, featuring open-source and commercial vendors alike. Most of the large commercial vendors offer both a community open-source version of their Big Data offerings and more powerful and secure enterprise versions. There was wide representation, from household names like IBM and Facebook to companies such as Datameer, Hadapt, Drawn to Scale, Dataguise, Splunk, Dropbox, MapR, and many others.
 
Numerous exciting announcements were timed to coincide with the Summit. Mountain View-based Zettaset, a firm recently designated a 'cool company in security' by Gartner, announced a new licensing agreement with IBM and Hyve to bring secure Big Data to midmarket clients with InfoSphere BigInsights software. Another data security and protection vendor, Dataguise, announced the release of an enterprise-grade risk assessment solution for Hadoop.
 
Use Cases of Big Data Analytics
Whereas past Summits were geared more towards a developer audience, this year's sessions included many application examples (use cases) demonstrating what is now not only possible but actually in place at companies through the integration of transactional and huge-scale interactional data via HDP and Big Data solutions. Sessions also focused on new technological advancements that sit on top of Hadoop and extend its capabilities in concrete ways. The Summit carefully outlined business cases and the advantages of using these technologies in ways that demonstrate tangible value, no doubt eroding some of the hesitation enterprises have so far shown towards embracing Hadoop-based Big Data solutions. The following is a summary of some of the many topics covered in concurrent sessions at the Summit.
 
Nathan Marz from Twitter gave an interesting technical presentation on 'Real-time analytics with Storm and Hadoop'. Open-source Storm is a fault-tolerant, real-time computation system that was developed by BackType (now part of Twitter) and written in Clojure. Storm handles real-time computations, and Storm clusters run in parallel to Hadoop clusters, allowing results to be precomputed. A Storm topology wires together two kinds of components: spouts, which emit tuples from a data source, and bolts, which consume those tuples, perform computations such as the reach and map functions Marz demonstrated, and update downstream state, for example taking readings from the database every five seconds.
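For flavor, a minimal Storm bolt might look like the sketch below, written against the 2012-era backtype.storm Java API; the running word-count logic and field names are illustrative assumptions and are not taken from Marz's reach example.

```java
// A minimal Storm bolt sketch using the 2012-era backtype.storm API;
// the word-count logic is an illustrative assumption.
import java.util.HashMap;
import java.util.Map;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCountBolt extends BaseBasicBolt {
  private final Map<String, Long> counts = new HashMap<String, Long>();

  // Called once for every tuple a spout (or an upstream bolt) emits.
  @Override
  public void execute(Tuple tuple, BasicOutputCollector collector) {
    String word = tuple.getString(0);
    Long count = counts.get(word);
    count = (count == null) ? 1L : count + 1L;
    counts.put(word, count);
    // Emit the running count downstream for another bolt to persist or query.
    collector.emit(new Values(word, count));
  }

  @Override
  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word", "count"));
  }
}
```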
 
Another interesting session in the Analytics track was 'Analyzing Multi-Structured Data with Hadoop' by Hadapt. While the session highlighted Hadapt's Hadoop-based analytical platform, which gives data analysts direct access to the underlying data without the programming skills needed to write MapReduce jobs, the material also highlighted obstacles the entire Hadoop community faces: a shortage of talent capable of unlocking the technology's potential, and the need for Hadoop to become more accessible and mainstream to enterprises through layers that offer SQL-style query mechanisms and, to some extent, hide the complexity of the Big Data underneath. Since most companies are not yet deploying Big Data solutions enterprise-wide, finding ways to integrate traditional RDBMS architectures with technology layers that provide easy integration and familiar functionality will speed adoption. Technology is bridging that gap, moving closer to a vision of the Hadoop platform as a core engine presented to end users as a familiar data source higher in the technology stack.
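As a rough sketch of what such a SQL-style layer can look like from application code, the example below issues a HiveQL query through Hive's JDBC driver as it existed around 2012 (a HiveServer on port 10000); the driver class, connection string, table, and columns are assumptions for illustration, not details from the Hadapt session.

```java
// A sketch of querying Hadoop data through Hive's SQL-like layer, assuming the
// 2012-era Hive JDBC driver and a HiveServer on localhost:10000; the weblogs
// table and its columns are hypothetical.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement stmt = conn.createStatement();

    // HiveQL looks like SQL, but Hive compiles it into MapReduce jobs
    // that run against files stored on HDFS.
    ResultSet rs = stmt.executeQuery(
        "SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page ORDER BY hits DESC LIMIT 10");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getString(2));
    }
    rs.close();
    stmt.close();
    conn.close();
  }
}
```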
 
A couple of business applications were addressed to answer the question of why Big Data is important and how it can be useful to a company. The music industry is slated to be fertile ground for Big Data. Retail and point-of-sale (POS) data provide the rich information that resides on a receipt: location, items purchased, brand loyalty (measured through repeat purchases), customer rewards number, payment method, items purchased together, and so on. All of this information was used to determine online retail strategy with a client. The business question was: 'Should we run a promotion on Lady Gaga or Justin Bieber fans?' By pulling shopping-cart and browsing-behavior data together in Hadoop with a database, and tightly integrating that solution with an interface that allows SQL querying of Big Data, the process resulted in a solution where business analysts drove the analytics rather than a machine-learning program or a group of Java programmers.
Another business case involved a company that wished to increase customer loyalty and trust. The solution was a micro-segmentation approach that helped allocate marketing spending and detect fraud, leading to a more personalized customer experience and greater customer security, improving two key inputs to the customer loyalty equation.
 
Chris Mattmann gave a presentation on 'Big Data Challenges at NASA', citing the difficulty of comparing petabytes of climate data, planetary data, and other sensor data and of applying algorithms to make sense of such massive volumes. File management is an issue: many of the data elements are images and maps, the instruments can be likened to a fire hose of information, and NASA is required to archive all of it for 50 years in searchable form. NASA data has been expanding over time; in 2012 over 100 gigabytes of imagery are produced each second that must be maintained and analyzed. In one ongoing example, a snow hydrology project pulls MODIS data from a land-processes data archive to measure global surface reflectance and determine amounts of snowfall. These measurements can be thrown off when dust or black carbon covers the snow, resulting in bad predictions, which in turn has critical implications for water management planning.
 
Netflix presented an interesting series of business examples in a talk on 'Hadoop & Cloud @Netflix: Taming the Social Data Firehose'. Many people are familiar with the Netflix Prize of 2008, and since then the size of the data has increased exponentially; in fact, the winning algorithm of the time was never implemented because of challenges that Hadoop and Big Data overcome today. Netflix stores data on NoSQL platforms including SimpleDB, Hadoop/HBase, and Cassandra. Much of the content on the Netflix site is driven by algorithms. For example, there is a 'Popular on Facebook' tab where data based on Facebook friends' preferences is fed into a recommendation algorithm to determine which movies the user is also likely to enjoy. The user is presented with content similar to both his own preferences and his friends' preferences, and these recommendation algorithms are continuously refined through machine-learning processes that combine Markov chains, collaborative filtering, large-scale matching, clustering, and row-selection algorithms. Why so much emphasis on recommendation algorithms? Netflix has determined that recommendations matter: of the 2 billion hours of streaming video watched in Q4 2011, 75% were selected based on recommendations. In terms of technology, Hive is used on the front end, combined with Hadoop, Java, Pig, and Mahout, a machine-learning library, all assembled to crack the recommendation code that will drive more hours of content consumption by customers.
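To make the collaborative-filtering idea concrete, the toy sketch below computes item-to-item cosine similarity over a tiny ratings matrix; it illustrates the general technique only and is in no way Netflix's production algorithm.

```java
// A toy sketch of item-to-item collaborative filtering via cosine similarity;
// the ratings data and the method are illustrative assumptions, not Netflix's
// actual recommendation system.
public class ItemSimilarity {

  // Cosine similarity between two item rating vectors (0 = unrated).
  static double cosine(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int u = 0; u < a.length; u++) {
      dot += a[u] * b[u];
      normA += a[u] * a[u];
      normB += b[u] * b[u];
    }
    return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
  }

  public static void main(String[] args) {
    // Rows are movies, columns are users; values are star ratings.
    double[][] ratings = {
        {5, 4, 0, 1},   // movie A
        {4, 5, 1, 0},   // movie B
        {1, 0, 5, 4},   // movie C
    };
    // Movies with similar rating patterns across users score close to 1,
    // so movie B would be recommended to viewers who liked movie A.
    System.out.println("sim(A,B) = " + cosine(ratings[0], ratings[1]));
    System.out.println("sim(A,C) = " + cosine(ratings[0], ratings[2]));
  }
}
```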
 
Summary
One thing is certain: by the time the sixth annual Hadoop Summit comes around next year, Big Data will be bigger. Business applications that are emerging now will mature as more enterprises incorporate Big Data analytics and HDP solutions into their architecture. New solutions in fields like healthcare, such as disease detection and coordination of patient care, will become more mainstream. Crime detection and prevention will benefit as the industry further harnesses the new technology. Hadoop and Big Data promise not only greatly enhanced marketing and product development; they also hold the power to drive positive global social impact in improved wellness outcomes, security, and many other areas. This, when you think about it, fits perfectly with the spirit of the Summit, which calls for continued stewardship of the Hadoop platform and promotion of associated technology by open-source and commercial entities.