Data lakes are among the most complex and sophisticated data storage and processing systems available today. Analytics Magazine notes that data lakes are among the most useful tools an enterprise has at its disposal when trying to out-innovate its competitors. These massive pools of raw data are a decidedly non-traditional approach to storage, and they came about as companies raced to embrace the Big Data analytics trend sweeping the world in the early 2010s. Many promises were made about Big Data, and delivering on them fell to data scientists. Sometimes they succeeded, sometimes they didn't, but the overall sentiment around Big Data remained positive because of its potential to deliver insights to the business world.
The Thrust for Data Lake Creation
According to Forbes, the idea of the data lake was already gaining traction in 2011 as companies began moving their data from off-site repositories to cloud-accessible online storage, a shift further cemented by the falling cost of cloud storage. Big Data was positioned as the most important game-changer since Edison's light bulb, yet cracks were emerging in both the architecture and the implementation. While CEOs and CIOs set ambitious goals for what their Big Data lakes would be able to do, data scientists were finding them difficult to use in real-world applications. Data lakes were designed to be agile, providing analytics on the fly while processing incoming data at remarkable speed. In practice, a handful of problems bogged the system down and made it extremely difficult for data scientists to replicate their test-bed results in a real-world environment. Most engineers understand that a theory is seldom applied in the field the way it is in a lab; data scientists had to learn this the hard way by running into problems with their data lake deployments.
The First Problem – Data Ingestion
A data lake is only as good as the data it takes in. When dealing with an offline test case, efficiency in loading and processing data matters far less than it does in real time on a live system. Big Data is, well… big. Loading large data sets into the system for analysis can be time-consuming, especially if the system isn't built to handle rapidly changing data. There is likely to be a lag between data updating and new insights being produced, and the more convoluted the system, the longer that lag. A clever way of working around this limitation is Change Data Capture (CDC). As Microsoft's discussion of the topic describes, CDC makes it much easier for a data store to absorb changes from a database, because only the changed records are applied rather than reloading the entire affected tables. While CDC takes care of capturing updated records, those records still need to be merged back into the main store, taking into account any schema changes that may occur between database backups.
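To make the idea concrete, here is a minimal sketch of incremental, timestamp-based change capture and merge. It is not a full CDC product (real CDC tools usually read the database's transaction log), and the table and column names (orders, updated_at) are illustrative assumptions:

```python
# A minimal sketch of timestamp-based change capture, assuming an "orders"
# table with an id primary key and an updated_at column; real CDC tools
# typically read the transaction log instead of polling a timestamp.
import sqlite3


def extract_changes(source: sqlite3.Connection, last_watermark: str):
    """Pull only the rows modified since the previous load."""
    cur = source.execute(
        "SELECT id, customer, amount, updated_at FROM orders WHERE updated_at > ?",
        (last_watermark,),
    )
    return cur.fetchall()


def merge_changes(target: sqlite3.Connection, rows):
    """Upsert the changed rows into the lake's landing table."""
    target.executemany(
        """
        INSERT INTO orders (id, customer, amount, updated_at)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            customer = excluded.customer,
            amount = excluded.amount,
            updated_at = excluded.updated_at
        """,
        rows,
    )
    target.commit()
```

The key point is that only the delta travels between systems; the full tables are never reloaded, which keeps the lag between an update and a fresh insight as short as the pipeline allows.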
The Second Problem – Quickly Querying Data
The primary reason data lakes were so attractive to companies was the promise of agile processing that could return real-time (or near real-time) results on large data sets. For this to be possible, the data visualization layer needs to be streamlined to show exactly what the user wants to see. Because of the types of databases adopted during the nascent days of Big Data, we now face the problem of streamlining stores built on Hive or NoSQL that were never meant to process data sets as large as what a data lake holds. The workaround is to use OLAP cubes or in-memory data models, but these take time to develop and test, especially since they need to scale to the level of use a data lake sees.
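The sketch below shows the spirit of that workaround: pre-aggregating raw records into a small in-memory rollup so dashboard queries never touch the full fact data. It assumes pandas is available and uses made-up region/product/revenue columns; it is an illustration of the cube idea, not any particular product's API:

```python
# A minimal in-memory rollup ("cube") sketch; column names are hypothetical.
import pandas as pd

raw = pd.DataFrame(
    {
        "region": ["EU", "EU", "US", "US"],
        "product": ["A", "B", "A", "B"],
        "revenue": [120.0, 80.0, 200.0, 150.0],
    }
)

# Aggregate once, so interactive queries hit this small table instead of
# rescanning the raw events sitting in the lake.
cube = raw.pivot_table(
    index="region",
    columns="product",
    values="revenue",
    aggfunc="sum",
    margins=True,  # adds row/column totals, like a cube's "All" member
)
print(cube)
```

Scaling this idea to data lake volumes is exactly the hard part the paragraph describes: the rollups have to be defined, refreshed, and tested against the full range of queries users will throw at them.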
The Third Problem – Preparation of Data
Most data lakes exist on the premise that disparate bits of data can be dumped into the cloud, and the lake will process, clean, and arrange that data as required. The problem arises when all of this data is handed to a programmer who has only a vague idea of what needs to be linked to what, and of the kinds of insights the business actually wants. Combining the object-oriented design of the data structures with a top-down design for the processing pipelines that relate those structures across tables is a key part of coding a data lake's embedded cleaning and relational system. Sadly, many companies are unable to define these goals from the outset, leading to confusion for the programmers and problems for the data lake when it comes to automated processing of raw data. The way around this hiccup in automation is to have clear goals for what the data lake is supposed to examine.
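As a rough sketch of what that pairing can look like in practice, the example below declares the record shapes up front and then relates them in a top-down pipeline. The Customer and Order types, the join key, and the cleaning rule are all hypothetical stand-ins for whatever the business actually needs linked:

```python
# A rough sketch: declare record shapes as classes, then relate them across
# "tables" in a top-down pipeline. All names and rules here are hypothetical.
from dataclasses import dataclass


@dataclass
class Customer:
    customer_id: str
    name: str


@dataclass
class Order:
    order_id: str
    customer_id: str
    amount: float


def clean_orders(orders: list[Order]) -> list[Order]:
    """Drop records that cannot be linked or are obviously invalid."""
    return [o for o in orders if o.customer_id and o.amount >= 0]


def revenue_by_customer(customers: list[Customer], orders: list[Order]) -> dict[str, float]:
    """Relate the two record types through their shared key."""
    totals = {c.customer_id: 0.0 for c in customers}
    for o in clean_orders(orders):
        if o.customer_id in totals:
            totals[o.customer_id] += o.amount
    return totals
```

None of this can be written sensibly until someone has decided which entities matter and how they relate, which is why unclear goals translate directly into a muddled cleaning and relational layer.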
The Fourth Problem – Standard Operation Across Multiple Platforms
A data lake generates insights through ad hoc analytics: a set of data is selected and assessed, and decisions are made from the results. Data scientists will put the data lake through its paces many times per hour, searching for ways to make the business more competitive or to drive customer adoption, but for the lake to be a truly useful addition to the data scientist's arsenal, it must perform these tasks consistently and efficiently. This can be resolved by creating data pipelines that let data scientists run their queries on subsets of the available data within the lake. They should be able to reuse that process on different data sets and, by comparing results over a series of iterations, make better judgment calls on the metrics they find lacking. Additionally, since the lake is likely to be pulling data from multiple cloud sources, these pipelines must play well with each of those sources.
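A minimal sketch of such a reusable pipeline is shown below: the same query logic runs against any source that can hand back a DataFrame, so a data scientist can swap in a different subset or cloud bucket and compare results across iterations. The loader functions, metric name, and sample data are assumptions made purely for illustration:

```python
# A minimal sketch of a reusable ad hoc pipeline; the loaders, metric name,
# and sample numbers are all illustrative assumptions.
from typing import Callable

import pandas as pd


def run_pipeline(load: Callable[[], pd.DataFrame], metric: str) -> float:
    """Select a data set, assess it, and return one comparable number."""
    df = load()
    return float(df[metric].mean())


# The same pipeline, pointed at two different slices of the lake. In practice
# each loader would read from a different bucket, table, or cloud source.
slice_a = lambda: pd.DataFrame({"latency_ms": [120, 95, 130]})
slice_b = lambda: pd.DataFrame({"latency_ms": [88, 102, 110]})

results = {
    name: run_pipeline(source, "latency_ms")
    for name, source in [("january", slice_a), ("february", slice_b)]
}
print(results)
```

Because the source is just a parameter, the comparison step stays consistent no matter which platform the underlying data lives on, which is the consistency the paragraph is asking for.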
Automation Is Around the Corner
Running a data lake and keeping it from becoming a data swamp is a daunting task, but help is right around the corner. While many companies and startups have focused on building data lakes, others have sought to develop systems that reduce the complexity of running one. For now, understanding how automation can help a data lake keep itself clean is as good as it gets until those products become commercially available. That awareness keeps a data lake from becoming bogged down and unusable because of poor architecture decisions at implementation.