Tamara, Tamara, Tamara… We have known each other for quite a while and I cannot believe we are having the same conversation AGAIN! Technology is not the answer for every data issue. I get it – Hadoop and the concept of data lakes are hot topics. However, just because they are trending in the world of technology does not mean that they will solve critical business problems, such as how to take full advantage of an organization’s data. That being said, I have a few questions for you about the definitions and your arguments.
Okay, so you weren’t fond of my use of the term “information” in my data definition. That’s fair. It’s confusing and somewhat circular. My point was that the data in a data lake is digital in nature. Can we agree on that?
As for your question, do you think I’m suggesting that an organization create a big black box, slap a label of “data lake” on it, and then start filling it up with any and all data—without any context or purpose? As crazy as that sounds (and there are some who are saying this), it is not what I’m suggesting. What I am saying is now that we have the technology to build a proper data lake, it’s time to consider it—not in a “build it and they will come” haphazard fashion, but in a strategic, methodical manner.
Will all the data that comes into the data lake have context and purpose? Absolutely not. Even though that’s the ideal, it’s not realistic. Context and purpose will need to be added as the data is processed and pushed/pulled downstream to other repositories and applications.
Now that we can, with big data technologies like Hadoop, the question is shifting to “Should we?” Some are saying, “Sure! Grab it all and throw it in the data lake!” while others are convinced that grabbing it all will only result in a big ol’ smelly data swamp. For any given organization, the right answer lies somewhere between these two extremes.
But make no mistake: The data lake is not a geographical cure. If your organization is already doing a crummy job of governing and managing the data in your current systems, then moving any data, existing or new, to a data lake is not going to solve this core shortcoming. Your bad data and data practices will follow you.
The data lake inquirer can now apply her own lens to the data as she sees fit—as she’s “reading” and integrating the data from this complex, ever-evolving data lake. Why is this important? First, this allows the inquirer to be extremely agile and go with the flow, if you will. And second, she can start getting value from her data “now”—instead of waiting for it to go through the more traditional schema-on-write process.
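To make the schema-on-read idea a little more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the sample events, the field names, and the `read_with_schema` helper are assumptions, not part of Hadoop or any particular data lake product. It simply shows raw records being stored as-is, with two different inquirers applying their own lens at read time.

```python
import json

# Hypothetical raw events, stored as-is in the lake; no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "u1", "amount": "19.99", "channel": "web"}',
    '{"user": "u2", "amount": "5.00"}',
    '{"user": "u1", "amount": "3.50", "channel": "store", "coupon": "SPRING"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at read time: select fields, cast them, default missing ones."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


# Two inquirers, two lenses over the same raw data (both lenses are illustrative).
finance_lens = {"user": (str, None), "amount": (float, 0.0)}
marketing_lens = {
    "user": (str, None),
    "channel": (str, "unknown"),
    "coupon": (str, None),
}

finance_view = list(read_with_schema(RAW_EVENTS, finance_lens))
marketing_view = list(read_with_schema(RAW_EVENTS, marketing_lens))

# Each inquirer gets answers right away, without reshaping the stored data.
total_spend = sum(row["amount"] for row in finance_view)
```

Under schema-on-write, by contrast, both lenses would have to be reconciled into one agreed structure before any record could be loaded, which is the waiting the paragraph above describes.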
As for the rest of the business users: Give ’em an app! Just kidding, sort of. Since the data lake opens the door to “more questions and better answers,” provide better solutions for business users (whether they are employees, customers or partners) to ask these questions, and maybe even explore some of the answers themselves (where it’s safe to swim). Some of your best questions (and answers) may be resting with this crowd.
Previously in the Data Lake Debate:
- The Introduction – by Jill Dyche
- Pro’s Up First – by Tamara Dull