Recently I happened across a Mary Meeker slide presentation on the current state of the web. The presentation was fairly lengthy but interesting enough to hold my attention through the 100+ slides, and, as with any presentation, there was much to agree and disagree with.
What really caught my attention was a slide of a kind I rarely see these days – a slide that mentioned data quality.
That got me thinking about the “SOON…” side of the slide: how critical data quality will be in the “Re-Imagination” of data, and how much impact “Data Obesity” will have on all of us sooner or later.
Now it’s no secret that many of us have been dealing with a form of data obesity for quite some time – after all, data de-duplication has been a key component of data quality since the keypunch days. Traditionally, though, de-duplication has been thought of mostly in terms of customer and household identification.
However, when it comes to the looming problem of today’s “store everything because storage is cheap” mentality across the internet and its applications, we’re all going to have to think outside the box: how do we put that data on a diet, and how do we find those needles in the haystack? Quite obviously we’re going to need tools to accomplish that – and those tools had better not be centered solely on traditional customer name and address information.
Locating the “fat” will require tools that analyze the data with respect to its origin, language, character sets and many other known and unknown characteristics. We will need to identify redundant information using a wide variety of existing and newly developed matching and de-duplication rules. And it won’t be enough to locate and de-duplicate that information in batch processes – increasingly, we’ll need to do it in real time.
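To make the idea of matching rules beyond exact name-and-address comparison concrete, here is a minimal sketch of fuzzy de-duplication. The records, field names, and the 0.85 similarity threshold are illustrative assumptions, not any vendor’s actual rules:

```python
# Illustrative fuzzy de-duplication: treat records as duplicates when their
# normalized fields are "similar enough", not only when they match exactly.
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Fold case and collapse whitespace so trivially different strings compare equal."""
    return " ".join(value.lower().split())

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Fuzzy match: duplicates when similarity meets the (assumed) threshold."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record of each fuzzy-duplicate group."""
    kept: list[dict] = []
    for record in records:
        if not any(similar(record["name"], k["name"]) and
                   similar(record["city"], k["city"]) for k in kept):
            kept.append(record)
    return kept

records = [
    {"name": "Acme Corp", "city": "New York"},
    {"name": "ACME Corp.", "city": "new york"},   # near-duplicate, caught by fuzzy match
    {"name": "Globex", "city": "Springfield"},
]
print(dedupe(records))  # the ACME near-duplicate is removed
```

A real-time variant would apply the same rules per incoming transaction against an index of already-seen keys rather than a full batch scan, which is where the engineering effort really lies.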
How soon will all of this happen? Well, I’m already working with several corporations that are cleansing and removing “the fat” from millions of global real-time transactions per day. Only 2% of the public had tablets or e-readers three years ago – the figure is 29% now. I’d venture that data obesity is growing nearly as fast.
Bigger is not always better, and we’re all going to find ourselves getting our data lean and mean again. Obesity kills people; sooner or later, data obesity will kill your applications.