We live in a time when data volumes are growing faster than Moore’s Law and the variety of structures and sources has expanded far beyond those that IT has experience of managing. It is simultaneously an era when our businesses and our daily lives have become intimately dependent on such data being trustworthy, consistent, timely and correct. And yet, our thinking about and tools for managing data quality in the broadest sense of the word remain rooted in a traditional understanding of what data is and how it works. It is surely time for some new thinking.
A fascinating discussion with Dan Graham of Teradata over a couple of beers last February at Strata in Santa Clara ended up as a picture of something called a “Data Equalizer” drawn on a napkin. As often happens after a few beers, one thing led to another…
The napkin picture led me to look at the characteristics of data in the light of the rapid, ongoing change in the volumes, varieties and velocity we’re seeing in the context of Big Data. A survey of data-centric sources of information revealed almost thirty data characteristics considered interesting by different experts. Such a list is too cumbersome to use, so I narrowed it down based on two criteria. The first was practical usefulness: how does the trait help IT make decisions on how to store, manage and use such data, and what can users expect of the data based on its traits? The second was measurability: can the trait actually be measured?
The outcome was seven fundamental traits of data structure, composition and use that enable IT professionals to examine existing and new data sources and respond to the opportunities and challenges posed by new business demands and novel technological advances. These traits can help answer fundamental questions about how and where data should be stored and how it should be protected. And they suggest how it can be securely made available to business users in a timely manner.
So what is the “Data Equalizer”? It’s a tool that graphically portrays the overall tone and character of a dataset so that IT professionals can quickly evaluate the data management needs of a specific set of data. More generally, it clarifies how technologies such as relational databases and Hadoop can be positioned relative to one another, and how the data warehouse is likely to evolve as the central integrating hub in a heterogeneous, distributed and expanding data environment.
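To make the idea more concrete, the sketch below renders an equalizer-style profile in plain text: one bar per trait, so two data sources can be compared at a glance. It is a minimal illustration only; the trait names and scores are placeholders invented for the example, not the seven traits defined in the white paper.

```python
# Illustrative sketch of the "Data Equalizer" idea: render a dataset's
# profile across seven traits as equalizer-style bars. Trait names and
# scores are placeholders, not the traits from the white paper.

TRAITS = ["structure", "volume", "velocity", "consistency",
          "latency need", "lifespan", "trust"]  # hypothetical labels

def render_equalizer(name: str, scores: dict[str, int], width: int = 10) -> None:
    """Print one bar (0..width) per trait for the given dataset."""
    print(f"--- {name} ---")
    for trait in TRAITS:
        level = max(0, min(width, scores.get(trait, 0)))
        print(f"{trait:>12} | {'#' * level}{'.' * (width - level)}")

if __name__ == "__main__":
    # Two contrived profiles: a core warehouse table vs. a clickstream feed.
    render_equalizer("core customer table",
                     {"structure": 9, "volume": 4, "velocity": 2,
                      "consistency": 9, "latency need": 5, "lifespan": 9, "trust": 9})
    render_equalizer("web clickstream feed",
                     {"structure": 3, "volume": 9, "velocity": 9,
                      "consistency": 4, "latency need": 8, "lifespan": 3, "trust": 5})
```

Side by side, the two profiles make the point of the tool: datasets with very different “shapes” call for different storage, management and access decisions.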
Understanding the fundamental characteristics of data today is becoming an essential first step in defining a data architecture and building an appropriate data store. The emerging architecture for data is almost certainly heterogeneous and distributed: there is simply too large a volume and too wide a variety of data to insist that it all be copied into a single format or store. The long-standing default, a relational database, may not be appropriate for every application or decision-support need in the face of these surging data volumes and the growing variety of data sources. The challenge for the evolving data warehouse will be to retain a core set of information that supports homogeneous and integrated business usage. For this core business information, the relational model will remain central and likely mandatory; it is the only approach with the theoretical basis and practical schema support needed to link such core data to other stores.
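As a rough sketch of that “integrating hub” pattern, the example below keeps the core business entities in a relational table and reaches detailed, loosely structured data in external stores via stored URIs. It assumes SQLite for the relational core and an HDFS-style URI purely for illustration; the table and column names are hypothetical, not taken from the paper.

```python
import sqlite3

# Minimal sketch: a relational "core" table holds the integrated business
# entities, while bulky or loosely structured detail data stays in external
# stores and is reached via stored URIs. Names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE core_customer (
        customer_id     INTEGER PRIMARY KEY,
        name            TEXT NOT NULL,
        segment         TEXT,
        clickstream_uri TEXT   -- pointer into an external (e.g. HDFS) store
    )
""")
conn.execute(
    "INSERT INTO core_customer VALUES (?, ?, ?, ?)",
    (42, "Acme Ltd", "enterprise", "hdfs://cluster/clickstream/customer=42/"),
)

# The warehouse answers integrated questions from the relational core...
row = conn.execute(
    "SELECT name, clickstream_uri FROM core_customer WHERE customer_id = 42"
).fetchone()
# ...and hands the URI to whatever engine manages the detailed external data.
print(f"Fetch detail for {row[0]} from {row[1]}")
conn.close()
```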
Seven Faces of Data: rethinking data’s basic characteristics – new White Paper by Dr. Barry Devlin (sponsored by Teradata)