Data warehouses are a critical component for enterprises seeking to gain insights from the data they collect, but as the volume of data businesses collect continues to grow, the traditional data warehouse is becoming too expensive to maintain. On top of this, the majority of data being created today is unstructured, which a traditional database cannot collect and store unless the data is first converted into a structured form. As a result, many businesses have turned to Apache Hadoop as a long-term storage and ETL tool. While many articles have acknowledged Hadoop's value as an analysis platform for Big Data, it is also worth considering as a storage platform, i.e. a data hub. Here's why.
A More Affordable Option
As mentioned above, storing large amounts of data in a traditional database is increasingly expensive. The average data warehouse costs $50K-$100K per TB of data. With data from tweets, emails, and Facebook posts, as well as machine-generated log files, sensor readings, and clickstream data pouring in at an exponential rate, the cost of transforming and storing this data is huge, not to mention the cost of expanding the warehouse's hardware to increase capacity.
To reduce storage costs, many companies store only samples of their raw data, selected based on pre-determined assumptions or priorities. This means that as priorities change or new business questions come up, the raw data is no longer available to analyze, leaving room for costly mistakes and missed opportunities that the data could have made visible.
Hadoop, on the other hand, stores data at less than $1K per TB, a cost savings of 50x-100x. In addition, because Hadoop scales out on commodity hardware, capacity can be added incrementally as data grows, further reducing the cost of data storage.
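To make the comparison concrete, here is a back-of-the-envelope sketch using the per-TB figures quoted above. The 500 TB volume is an illustrative assumption, not a figure from any particular deployment:

```python
# Back-of-the-envelope storage cost comparison, using the per-TB
# figures cited in this article. The 500 TB volume is illustrative.
DATA_TB = 500

WAREHOUSE_COST_PER_TB = 50000  # low end of the $50K-$100K range
HADOOP_COST_PER_TB = 1000      # "less than $1K per TB"

warehouse_total = DATA_TB * WAREHOUSE_COST_PER_TB
hadoop_total = DATA_TB * HADOOP_COST_PER_TB

print("Warehouse: $%s" % format(warehouse_total, ","))   # Warehouse: $25,000,000
print("Hadoop:    $%s" % format(hadoop_total, ","))      # Hadoop:    $500,000
print("Savings:   %.0fx" % (warehouse_total / float(hadoop_total)))  # Savings: 50x
```

At the high end of the warehouse range ($100K per TB), the same arithmetic gives the 100x figure.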
Data Pre-Processing
In discussions about Hadoop, some have started to refer to it as the big data hub: a common repository for all types of data, whether structured, semi-structured, or unstructured. Hadoop serves as a landing zone of sorts, collecting all of the data being produced before it is transformed and loaded into the traditional database or left as is for long-term storage. This is incredibly valuable, as there is a wealth of business insight within multi-structured data that Hadoop now allows companies to extract.
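As a sketch of what this pre-processing step can look like, the mapper below turns semi-structured web-server log lines into tab-separated records ready to be bulk-loaded into a warehouse table, while the raw files stay in HDFS. It is written for Hadoop Streaming; the log format and field names are illustrative assumptions, not a prescribed schema:

```python
#!/usr/bin/env python
# Hadoop Streaming mapper: reads raw log lines from stdin and emits
# tab-separated (timestamp, ip, url, status) records for warehouse loading.
# The log format below (Apache common-log style) is an illustrative assumption.
import re
import sys

# e.g. 10.0.0.1 - - [12/Mar/2014:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 512
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3})'
)

def main():
    for line in sys.stdin:
        match = LOG_PATTERN.match(line)
        if not match:
            continue  # unparseable lines stay in the raw store; just skip them here
        print("\t".join([match.group("ts"), match.group("ip"),
                         match.group("url"), match.group("status")]))

if __name__ == "__main__":
    main()
```

A job like this would be launched with something along the lines of `hadoop jar hadoop-streaming.jar -input /raw/logs -output /etl/logs -mapper parse_logs.py -numReduceTasks 0`, after which the structured output can be exported to the warehouse.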
Flexibility
Hadoop is an open source project, but some vendors, like MapR, have started adding innovations on top of the core that make Hadoop enterprise-ready and appropriate to use as a repository for all of a company's data. The enterprise-grade version of Hadoop offers enhanced security, full data protection and disaster recovery, as well as rolling upgrades.
As Hadoop becomes increasingly sophisticated, businesses will have a hard time looking past its ability to save money and to store, analyze, and compute on all types of data as they consider their data storage options. It seems Hadoop may have rocked the boat not only when it comes to data analytics but also in how that data is stored in the first place.