Hadoop, which was named after the toy elephant of its inventor's son, was developed because the existing data storage and processing tools appeared inadequate to handle the large amounts of data that started to appear after the internet bubble. It was Google that first developed the MapReduce paradigm to cope with the flow of data generated by its mission to organize the world's information and make it universally accessible and useful. Yahoo in turn developed Hadoop in 2005 as an implementation of MapReduce. It was released as an open-source tool in 2007 under the Apache license.
Over the years, Hadoop has evolved into a very-large-scale operating system, focused especially on the distributed and parallel processing of the vast amounts of data created nowadays. As with any 'normal' operating system, Hadoop provides a file system, lets you write programs, manages the distribution of those programs across machines and returns the results afterwards.
Hadoop supports data-intensive distributed applications that can run simultaneously on large clusters of normal, commodity hardware. It is licensed under the Apache v2 license. A Hadoop network is reliable and extremely scalable, and it can be used to query massive data sets. Hadoop is written in the Java programming language, meaning it can run on any platform, and is used by a global community of contributors and big data technology vendors who have built layers on top of Hadoop.
The feature that makes Hadoop so useful is the Hadoop Distributed File System (HDFS). This is Hadoop's storage system, which breaks the data it processes down into smaller pieces, called blocks. These blocks are subsequently distributed throughout a cluster. Distributing the data this way allows the map and reduce functions to be executed on smaller subsets instead of on one large data set. This increases efficiency, shortens processing time and enables the scalability necessary for processing vast amounts of data.
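To make this concrete, the short Java sketch below uses Hadoop's FileSystem API to inspect how HDFS has split a file into blocks and on which cluster nodes those blocks are stored. It is a minimal illustration, assuming a configured cluster (core-site.xml and hdfs-site.xml on the classpath); the path /data/example.log is a hypothetical file.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
  public static void main(String[] args) throws Exception {
    // Picks up the cluster settings from core-site.xml / hdfs-site.xml
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.log"); // hypothetical HDFS file
    FileStatus status = fs.getFileStatus(file);
    System.out.println("Block size:  " + status.getBlockSize() + " bytes");
    System.out.println("Replication: " + status.getReplication());

    // Each BlockLocation reports which cluster nodes hold a copy of that block
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block at offset " + block.getOffset()
          + " stored on: " + String.join(", ", block.getHosts()));
    }
    fs.close();
  }
}
```

Because HDFS knows exactly which nodes hold which blocks, Hadoop can ship the computation to the data rather than moving the data to the computation.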
MapReduce is a software framework and programming model that processes and retrieves the vast amounts of data stored in parallel on the Hadoop system. MapReduce libraries have been written in many programming languages, so developers can work with the language of their choice. MapReduce can work with both structured and unstructured data.
MapReduce works in two steps. The first step is the "Map" phase, which divides the data into smaller subsets and distributes those subsets over the different nodes in a cluster. Nodes within the system can do this again in turn, resulting in a multi-level tree structure that divides the data into ever-smaller subsets. At those nodes, the data is processed and the answers are passed back to the "master" node. The second step is the "Reduce" phase: the master node collects all the returned data and combines it into some form of output that can be used again. The MapReduce framework manages all the various tasks in parallel across the system and forms the heart of Hadoop.
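The classic illustration of these two phases is word counting: the map function emits a count of 1 for every word in its subset of the input, and the reduce function sums those counts per word. The sketch below follows the standard Hadoop Java MapReduce API; the class name and input/output paths are placeholders.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on each node against its local subset of the data
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // emit (word, 1) for every word seen
      }
    }
  }

  // Reduce phase: collects the partial results and combines them into the output
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum)); // emit (word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged as a jar, a job like this would typically be submitted with `hadoop jar wordcount.jar WordCount /input /output`, after which Hadoop distributes both the code and the processing across the cluster.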
With the combination of these technologies, massive amounts of data can be easily stored, processed and analyzed in a fraction of a second. In the past years, Hadoop has proven very successful for the Big Data ecosystem, and it looks like this will remain the case in the future. With the development of Hadoop 2.0, it now uses an entirely new job-processing framework called YARN. YARN stands for Yet Another Resource Negotiator, and it is the module that manages the cluster's computational resources for application scheduling. YARN enables multiple data processing engines, such as interactive SQL, real-time streaming, data science and batch processing, to handle data stored in a single platform, creating an entirely new approach to analytics.
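As a rough sketch of what "negotiating resources" means in practice, the snippet below uses YARN's Java client API to ask the ResourceManager for the total capacity and current usage of every node in the cluster. It assumes a reachable Hadoop 2.x cluster with yarn-site.xml on the classpath and is illustrative only.

```java
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterResources {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
    yarn.start();

    // Ask the ResourceManager for a report on every node it manages
    for (NodeReport node : yarn.getNodeReports()) {
      System.out.println(node.getNodeId()
          + "  capacity: " + node.getCapability()  // total memory/vcores on the node
          + "  in use: " + node.getUsed());        // resources currently allocated
    }
    yarn.stop();
  }
}
```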
Hadoop is a powerful tool, and since its creation in 2005 adoption has grown steadily: over 25% of organizations currently use Hadoop to manage their data, up from 10% in 2012. There are several reasons why organizations use Hadoop:
- Low cost;
- Computing power;
- Scalability;
- Storage flexibility;
- Data protection.
It is being used in almost every industry, ranging from retail to government to finance. The infographic below, which was created by Solix, offers a more in-depth look at Hadoop along with some interesting predictions.
I really appreciate that you are reading my post. I am a regular blogger on the topic of Big Data and how organizations should develop a Big Data Strategy. If you wish to read more on these topics, then please click 'Follow' or connect with me via Twitter or Facebook.
You might also be interested in my book: Think Bigger – Developing a Successful Big Data Strategy for Your Business.
This article originally appeared on Datafloq.