While the idea of a data lake sounds like fun, don’t go jumping in just yet. There are critical factors to consider before taking the plunge and declaring that a data lake is essential for any organization to take full advantage of its data. In presenting the following arguments, I not only contend that a data lake is not essential for any organization; I also argue that creating a data lake will in fact be detrimental for those who do so prematurely.
REVISITING DEFINITIONS
While I have accepted the definitions presented in the original argument, including the agreement to consider data as digital in nature (who wouldn’t at this point?), I think it is important to examine additional words in the resolution at hand.
Essential as defined by Merriam-Webster is “extremely important and necessary.”
Any as used in this context and defined by Merriam-Webster is “one, some or all indiscriminately of whatever quantity: all – used to indicate a maximum or whole.”
Full as used in this context and defined by Merriam-Webster is “complete especially in detail, number or duration.”
It is not possible to successfully argue that a storage repository (regardless of size or type) is extremely important and necessary for all organizations to take complete advantage of their data. What executives and business leaders will argue is extremely important when it comes to technology are the tools and services that increase the effectiveness of decision making, increase revenue, save costs or reduce risks, none of which a data lake accomplishes on its own.
WHY A DATA LAKE IS NOT ESSENTIAL
Argument 1. Data storage alone has no impact on the effectiveness of business decisions
The movement and storage of data is a valid part of the larger enterprise data architecture discussion. Access and processing capabilities certainly have an impact on how quickly information can be provisioned to reporting and analytics applications. When it comes to data, however, the greatest impact on effective decision making comes from ensuring the right data is available to the right people at the right time. Many technologies can help make this happen, but the real success factor is strong data management capabilities under the umbrella of a mature data governance program.
In addition to data management and governance, the most successful data initiatives, big data or traditional, are those linked to strategic business initiatives. If a data lake is built and data is collected and stored without purpose, there is no value to the organization. If a man-made fishing lake is built but not stocked with fish, those who come to fish will catch whatever has collected in the lake, most likely that old black boot. A few disappointing fishing trips and your fishing folks will go elsewhere (i.e., back to the rogue databases under their desks).
There is one scenario in which a data lake can prove beneficial: serving as a sandbox for seasoned data scientists, the data lake is an ideal environment. Most organizations, however, are not at a point where building a data discovery playground makes business sense. Even if they have folks with the technical chops to find the next “golden nugget” of knowledge by sifting through gobs of “unstructured data with possibility,” the organization most often does not have the resources to act on the newfound knowledge. There is no value in insight on which you cannot act. Before organizations start down the path of discovery within a data lake, they should first take full advantage of their current data.
Argument 2. Inexpensive storage is not limitless
One of the most common arguments for embracing big data open source technologies is that they are free, or significantly lower in cost than vendor-provided options. While this is true of the software itself, it does not mean that implementing these technologies is free or even low cost. There are many ancillary costs that grow rapidly as the “low-cost” options expand.
Ancillary costs not detailed in the free or low-cost price tags of the data lake:
- Physical footprint – Yes, Hadoop can run on commodity hardware. However, commodity hardware still has a physical presence to consider. We can continue to add boxes or blades as needed, but eventually we will run out of physical space. (At this point, I should also mention the significant increase in asset management headaches.) If you shift this argument to the cloud, the scalability of cloud infrastructure comes with a higher price tag. Either way, the footprint cost rises.
- Resource utilization – Adding machines and widening a Hadoop cluster brings a staggering increase in network traffic as the individual systems communicate across the ecosystem. The data lake environment by its nature creates extensive network chatter even before the data scientist cowboys run wild with their fantastic new queries. Without careful planning, the development of a data lake could bring your network to a crawl. With each new server, the electricity powering our favored commodity hardware rises dramatically, along with the cooling requirements, which of course also consume power. While this may not seem like much at the onset, these costs add up quickly enough to rival the costs of high-end, comprehensive analytic processing platforms or appliances (see the rough cost sketch after this list). And if you argue that you won’t need to worry about network capacity because you will take advantage of in-memory processing, then without some serious budgetary consideration you are in for severe sticker shock.
- Management and support – The manpower to support these disparate commodity hardware systems for a hodge-podge data lake will quickly exceed what already-burdened IT departments can provide. I can assure you: faster, better data access will not come from a tired, frustrated IT unit.
- Specialized skill sets – Beyond the sheer support of these systems, the specific skills needed to access and query data in a Hadoop environment are not abundantly available. Organizations will need to outsource for the skills, provide training to develop them in house, or seriously up the ante to bring these skill sets on board. The best and the brightest are not a dime a dozen. In fact, given the current demand, getting a dozen right now might be an impossible feat.
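To make the “these costs add up” point concrete, here is a minimal back-of-envelope sketch in Python. Every figure in it (hardware price, power draw, cooling overhead, admin-to-node ratio) is a hypothetical placeholder of my own choosing, not a benchmark or vendor quote; the point is how quickly small per-node costs compound, not the specific totals.

```python
import math

# A back-of-envelope model of the ancillary costs listed above.
# Every constant is a hypothetical placeholder, not a benchmark or
# vendor quote; substitute your own figures before drawing conclusions.

NODE_HW_COST = 4_000        # hypothetical: one commodity node, USD
POWER_KW_PER_NODE = 0.4     # hypothetical: average draw per node, kW
COOLING_OVERHEAD = 0.5      # hypothetical: cooling adds ~50% to power
USD_PER_KWH = 0.12          # hypothetical: electricity price
ADMIN_SALARY = 120_000      # hypothetical: fully loaded admin cost/year
NODES_PER_ADMIN = 40        # hypothetical: nodes one admin can support

def year_one_cost(nodes: int) -> float:
    """Estimate year-one cost: hardware + power/cooling + staffing."""
    hardware = nodes * NODE_HW_COST
    kwh = nodes * POWER_KW_PER_NODE * (1 + COOLING_OVERHEAD) * 24 * 365
    power = kwh * USD_PER_KWH
    staffing = math.ceil(nodes / NODES_PER_ADMIN) * ADMIN_SALARY
    return hardware + power + staffing

for n in (10, 50, 200):
    print(f"{n:>4} nodes: ~${year_one_cost(n):,.0f} in year one")
```

Swap in your own numbers and note how much of the total has nothing to do with the “free” software itself.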
Because these additional costs and challenges create very real limits for organizations, it is not true that an organization can store any and all data. Organizations can store data only to the limits they can maintain, and for many, those limits are hit much sooner than expected. Momma always said, “Nothing good in life comes for free.” One day, when we grow up, we will believe her.
SUMMARY
I agree with Tamara: the time is now for organizations to begin planning strategies to take advantage of their data. However, building a data lake is not essential, and without purpose it is not even valuable. For an organization to take full advantage of its data, it must first develop a strategic, enterprise-level understanding and use of the data it already has. As the organization matures in its approach to data within the context of key business initiatives, it will develop a resilient, sustainable data governance program that will clearly inform the value of the data lake, or whatever other hot technology concept is floating around when it is ready.
Previously in the Data Lake Debate:
- The Introduction – by Jill Dyche
- Pro’s Up First – by Tamara Dull
- Questioning The Pro – by Anne Buff and Tamara Dull