Way back in 1991, when IBM announced the Information Warehouse Framework, one aspect of the content came as a shock to most people who were promoting data warehousing then. (There were not too many of us at the time to be shocked… as far as I know, no one had yet claimed paternity of data warehousing, and the first popular book on the topic was still a year away.) The shock was that the announcement included the concept of access to heterogeneous data, to be supported through an alliance with Information Builders Inc., using their product EDA/SQL. The accepted wisdom in data warehousing, at the time and for many years since, was that heterogeneous data must be cleansed, reconciled and loaded into the warehouse via ETL tooling, and accessed from there.

Information Warehouse was not a great success for IBM, and access to heterogeneous data largely faded from awareness among data warehousing professionals. That made a lot of sense back then. Heterogeneous data was very heterogeneous, very complex and very susceptible to performance problems when accessed in an unplanned manner. “Leave it alone!” was the sensible advice.
Ten years later, I was again faced with the concept of heterogeneous data access as IBM began the work that was later announced as IBM DB2 Information Integrator. This time, the starting point was federated access to data, initially across relational systems, but with a clear direction to include all types of data. Again, the market wasn’t really ready for the concept, although acceptance of the idea was wider and a number of early adopters began to experiment seriously with implementations. Most data warehousing experts still shook their heads in disbelief…
Fast forward another ten years and access to heterogeneous data is back on the agenda big time, this time under the name data virtualization, and the launch of Composite 6 yesterday ups the ante again. (It may be of interest to note that Composite was founded in 2002, right in the middle of the last wave of interest.) While there is lots of fascinating stuff in the release about improved performance, caching and governance, my attention was drawn particularly to the inclusion of “big data” integration support. And my concern was how Composite could understand and reliably use the variety of data types, elements and so on that is typically present in Hadoop files.
My contention is that over the years since 1991, heterogeneous data sources have, generally speaking, become better defined, less complex in terms of structure and content, more easily accessed, and less prevalent. Until the advent of big data, that is. In data management terms, big data is like a giant step backwards from modern suburbia to the Wild West. Schema? Why bother. Metadata? Who needs it; it will be out of date within a day anyway. Governance? The programmers can handle it!
But when I put the question to Dave Besemer, CTO of Composite, the answer I got proved very enlightening. Not just about Composite’s approach but also about what is going on, perhaps somewhat by stealth, in the world of big data. Basically, Dave said that Composite accesses big data only via Hive, which provides the basic structural metadata required for virtualization. And Hive? Well, Hive defines itself on its own website as, wait for it: “…a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL…”
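To make that concrete, here is a minimal sketch of the kind of thing Hive enables. It is my own illustration, not a description of Composite’s internals, and the host name, table, columns and HDFS path are invented for the example. It uses the PyHive client to project a relational structure onto raw files already sitting in Hadoop and then summarize them with HiveQL; the table definition it creates is exactly the sort of structural metadata a virtualization layer can pick up and reuse.

```python
# Illustrative sketch only: a hypothetical Hive server, table and HDFS path,
# not Composite's actual access path. Requires the PyHive package.
from pyhive import hive

# Connect to HiveServer2 (host, port and user are assumptions for the example).
conn = hive.Connection(host="hive-server", port=10000, username="analyst")
cursor = conn.cursor()

# Project a relational structure onto delimited files already stored in HDFS.
# Nothing is loaded or moved; Hive simply records the structure as metadata.
cursor.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_clicks (
        user_id  STRING,
        url      STRING,
        click_ts BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/web_clicks'
""")

# With that structure declared, the files can be summarized in SQL-like HiveQL.
cursor.execute("""
    SELECT url, COUNT(*) AS clicks
    FROM web_clicks
    GROUP BY url
    ORDER BY clicks DESC
    LIMIT 10
""")
for url, clicks in cursor.fetchall():
    print(url, clicks)

cursor.close()
conn.close()
```

The point is that once even this much structure has been declared, the data stops being an opaque pile of files and starts looking like something a federation engine can reason about, optimize against and govern.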
So, as Robert Browning wrote, “God’s in his heaven, all’s right with the world”, at least if you are a data management fan. The big data folks do recognize the value of data management (Hive has been around since 2009), despite some of the NoSQL hype that still turns up in the press. That’s not to say that Hive needs to be put in front of every set of Hadoop files. There’s a whole world of distributed Hadoop data that is so transient and/or so specialized that the only sensible way to use it is via a programmatic interface. But Composite isn’t going after that stuff; they are focusing on the better defined and managed segment of big data. And that makes perfect sense.
But there is still a question in my mind that the broader IT community needs to answer: How are we going to manage and handle the other, much larger segment of big data? Pat Helland’s article “If You Have Too Much Data, then ‘Good Enough’ Is Good Enough” in ACM Queue provides some food for thought.
Oh, and by the way, there are a few data warehouse éminences grises who still proclaim that virtualization is evil and that all data has to go through the data warehouse… Perhaps they’re waiting for the fourth wave?