I’ve just been reading the 5th annual Digital Universe study from IDC, released by EMC last month. This year’s study seems to have attracted less media attention than previous versions. Perhaps we’ve grown blasé about the huge numbers of bytes involved – 1.8 ZB (zettabytes, or 1.8 trillion gigabytes) in 2011 – or perhaps the fact that the 2011 number is exactly the same as predicted in the previous study is not newsworthy. However, the subtitle of this year’s study, “Extracting Value from Chaos”, should place it close to the top of every BI strategist’s reading list. Here, and in my next blog entry, are a few of the key takeaways, some of which have emerged in previous versions of the study, but all of which together reemphasize that business intelligence needs to undergo a radical rebirth over the next couple of years.
1.8 ZB is a big number, but consider that it’s also a rapidly growing number, more than doubling every two years. That’s faster than Moore’s Law. By 2015, we’re looking at 7.5-8 ZB. More than 90% of this information is already soft (aka unstructured), and that percentage is growing. Knowing that the vast majority of this data is generated by individuals, and that much of it consists of video, image and audio, you may ask: why does this matter to my internal BI environment? The answer is: it matters a lot! Because hidden in that vast, swirling and ever-changing cosmic miasma of data are the occasional nuggets of valuable insight. And whoever gets to them first – you or your competitors – will potentially gain significant business advantage.
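As a rough sanity check on those figures (a sketch based on my reading of “more than doubling every two years”, not on IDC’s actual model), the compounding works out like this:

```python
# Back-of-the-envelope check on the study's growth numbers.
# Assumption (mine, not IDC's exact model): "more than doubling
# every two years" means an annual growth factor of at least sqrt(2).

BASE_YEAR, BASE_ZB = 2011, 1.8   # 1.8 ZB in 2011, per the study
annual_factor = 2 ** 0.5          # exactly one doubling per two years

for year in range(BASE_YEAR, 2016):
    size = BASE_ZB * annual_factor ** (year - BASE_YEAR)
    print(f"{year}: ~{size:.1f} ZB")

# 2015: ~7.2 ZB at exactly one doubling every two years;
# "more than doubling" nudges that toward the study's 7.5-8 ZB.
```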
With such volumes of information and such rapid growth, it is simply impossible to examine (never mind analyse) it all manually. This demands an automated approach. Such tools are emerging – for example, facial recognition of photos on Facebook and elsewhere, or IBM Watson’s extraction of Jeopardy! answers from the contents of the Internet. Conceptually, what such tools do is generate data about data, which, as we know and love in BI, means metadata. According to IDC, metadata is growing at twice the rate of the digital universe as a whole – and since the universe itself more than doubles every two years, that means metadata is more than doubling every year!
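That leap from “twice the rate” to “doubling every year” is just the arithmetic of doubling times: doubling the exponential growth rate halves the doubling period. A minimal sketch, assuming “twice the rate” means twice the exponential rate (my interpretation, not IDC’s definition):

```python
import math

data_factor = 2 ** 0.5          # data: doubles every two years
meta_factor = data_factor ** 2  # metadata: twice the exponential rate

print(math.log(2) / math.log(data_factor))  # 2.0 years per doubling
print(math.log(2) / math.log(meta_factor))  # 1.0 year per doubling
```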
So, while we may well ask what you’re doing about managing and storing soft information, an even more pressing question is: what are you going to do about metadata? Of course, the volumes of metadata are probably still relatively small (IDC hasn’t published an absolute value), but at that growth rate they will get large, fast. And the infrastructure and methodologies we have for handling metadata are far more limited than those we’ve built over the years for data. Not to mention that the value to be found in the chaos can be discovered only through the lens of the metadata that characterizes the data itself.
For BI, this shift in focus from hard to soft information is only one of the changes we have to manage. Another major change involves the nature and sources of the hard data itself. A growing quantity of hard data is being collected from machine sensors as more and more of the physical world goes on the Web. RFID readers are generating ever-increasing volumes of data. (According to VDC Research, nearly 4 billion RFID tags were sold in 2010, a 35% increase over the previous year.) From electricity meters to automobiles, intelligent, connected devices are pumping out ever-growing streams of data that are being used in a wide variety of new applications. And almost all of these applications can be characterized as operational BI. So, the move from traditional, tactical BI to the near real-time world of operational BI is accelerating, with all the challenges that entails.
Next time, I’ll be looking at some of the implications of these changes in sourcing for security and privacy, as well as the interesting fact that, although the stored digital universe is huge, transient data volumes are several orders of magnitude higher.