As expected, Anne, your arguments against building a data lake are both persuasive and passionate. You’ve made some great points, my friend, but you’re making this way too easy for me. Before I jump into my rebuttal [my next post], I’d like to clarify a few things that you brought up. I’ve boiled it down to three questions. What say you?
Hadoop does provide a fantastic data storage opportunity, but it does not require us to abandon all of our existing structured data environments. Copying existing structured data to a data lake (especially transactional data) would duplicate both effort and storage, and would create additional risk for the organization. Moving operational data would be an enormous undertaking, as it would require applications throughout the organization to undergo a significant coding and design overhaul, which is not going to be a popular idea in any business unit.
The ideal scenario is to leave existing data where it lives today and use Hadoop as the storage repository for the data that previously could not be stored because of constraints presented by volume, variety or velocity. Organizations can then take advantage of data virtualization tools, which not only eliminate the integration coding challenge but also bring other advantages, such as centralized security and governance. The data is queried, transformed and structured as needed and provisioned to business users through virtual views. No dumping of data – just purposeful access, integration and use.
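To make the "virtual view" idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not how any particular data virtualization product works: two in-memory SQLite databases stand in for the existing operational store and the Hadoop-side store, and the table names, view name and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical stand-ins: "main" plays the existing operational store,
# "lake" plays the Hadoop-side store for data that never fit before.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

# Existing structured data stays where it lives today.
conn.execute("CREATE TABLE main.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.orders VALUES (?, ?)",
    [(1, 120.0), (1, 30.0), (2, 75.5)],
)

# Newly storable, high-volume data lands in the second store.
conn.execute("CREATE TABLE lake.clickstream (customer_id INTEGER, page TEXT)")
conn.executemany(
    "INSERT INTO lake.clickstream VALUES (?, ?)",
    [(1, "/pricing"), (1, "/docs"), (2, "/pricing"), (2, "/home"), (2, "/docs")],
)

# The virtual view: a query definition spanning both stores.
# Nothing is copied; the join and aggregation run only when the view is read.
conn.execute(
    """
    CREATE TEMP VIEW customer_activity AS
    SELECT o.customer_id, o.total_spend, c.page_views
    FROM (SELECT customer_id, SUM(amount) AS total_spend
          FROM main.orders GROUP BY customer_id) AS o
    JOIN (SELECT customer_id, COUNT(*) AS page_views
          FROM lake.clickstream GROUP BY customer_id) AS c
      ON c.customer_id = o.customer_id
    """
)


def customer_activity(connection):
    """Read the virtual view the way a business user's report would."""
    return connection.execute(
        "SELECT customer_id, total_spend, page_views "
        "FROM customer_activity ORDER BY customer_id"
    ).fetchall()


rows = customer_activity(conn)
```

The business user only ever sees `customer_activity`; where each piece of data physically sits is a detail hidden behind the view, which is also the natural place to apply security and governance rules once rather than per source.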
Historically, in business, unstructured data sources were managed within the scope of knowledge management or content management. The vast storage capabilities that Hadoop presents allow documents, emails and other unstructured sources to be centrally stored, and their content is now considered accessible data. While it is true that these sources can now be accessed through Hadoop to glean their content as ingestible data, it is not the storage and access that bring the advantage. The advantage is in the insights derived from the analysis of the data. Regardless of the type of data (structured, semi-structured or unstructured) or how and where it is stored, organizations generate value from any and all data only when they process or analyze it within a specific business context.
Previously in the Data Lake Debate:
- The Introduction – by Jill Dyche
- Pro’s Up First – by Tamara Dull
- Questioning the Pro – by Anne Buff and Tamara Dull
- Negative Puts a Stake in the Ground – by Anne Buff