In my last post, I discussed some of the key points in the 5th annual Digital Universe study from IDC, released by EMC in June. Here, I consider a few more: some of the implications of the changes in sourcing on security and privacy, the importance of considering transient data, where volumes are a number of orders of magnitude higher, and a gentle reminder that bigger is not necessarily the nub of the problem.
Let's start with transient data. IDC notes that "a gigabyte of stored content can generate a petabyte or more of transient data that we typically don't store (e.g., digital TV signals we watch but don't record, voice calls that are made digital in the network backbone for the duration of a call)". Now, as an old data warehousing geek, I find that type of statement typically rings alarm bells: what if we miss some business value in the data that we never stored? How can we ever recheck at a future date the results of an old analysis we made in real time? We used to regularly encounter this problem with DW implementations that focused on aggregated data, often because of the cost of storing the detailed data. Over the years, decreasing storage costs meant that more warehouses moved to storing the detailed data. But now, it seems like we are facing the problem again. However, from a gigabyte to a petabyte is a factor of a million! And, as the study points out, the "growth of the [permanent] digital universe continues to outpace the growth of storage capacity". So, this is probably a bridge too far for hardware evolution.
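To make the factor of a million concrete, here is a minimal back-of-the-envelope sketch. It assumes decimal (SI) units and takes IDC's gigabyte-to-petabyte figure as a flat ratio. The 10 TB warehouse is a purely hypothetical example of mine, not a number from the study.

```python
# Back-of-the-envelope arithmetic only; decimal (SI) units are assumed.
GIGABYTE = 10**9
TERABYTE = 10**12
PETABYTE = 10**15
EXABYTE = 10**18

# IDC's example: one gigabyte stored can generate a petabyte of transient data.
TRANSIENT_PER_STORED = PETABYTE // GIGABYTE


def transient_volume(stored_bytes: int) -> int:
    """Estimate the transient bytes generated for a given stored volume,
    assuming the gigabyte-to-petabyte ratio applies uniformly (a simplification)."""
    return stored_bytes * TRANSIENT_PER_STORED


# Hypothetical warehouse holding 10 TB of stored content.
warehouse_bytes = 10 * TERABYTE
transient_bytes = transient_volume(warehouse_bytes)

print(f"Ratio of transient to stored data: {TRANSIENT_PER_STORED:,}")
print(f"Transient data for a 10 TB warehouse: {transient_bytes // EXABYTE} EB")
```

Under these assumptions, keeping everything for even a modest warehouse would mean storing exabytes, which is why I doubt that cheaper disks alone can rescue us this time.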
The implication (for me) is that our old paradigm about the need to keep raw, detailed data needs to be reconsidered, at least for certain types of data. This leads to the point about "big data" and whether the issue is really about size at all. The focus on size, which is the sound-bite for this study and most of the talk about big data, distracts us from the reality that this expanding universe of data contains some very different types of data to traditional business data and comes from a very different class of sources. Simplistically, we can see two very different types of big data: (1) human-generated content, such as voice and video, and (2) machine metric data, such as website server logs and RFID sensor event data. Both types are clearly big in volume, but in terms of structure, information value per gigabyte, retention needs and more, they are very different beasts. It is also interesting to note that some vendors are beginning to specialize. Infobright, for example, is focusing on what they call "machine-generated data", a class of big data that is particularly suited to their technical strengths.
Finally, a quick comment on security and privacy. The study identifies the issues: "Less than a third of the information in the digital universe can be said to have at least minimal security or protection; only about half the information that should be protected is protected." Given how much information consumers are willing to post on social networking sites or share with businesses in order to get a 1% discount, this is a significant issue that proponents of big data and data warehousing projects must address. As we bring this data from social networking sources into our internal information-based decision-making systems, we will increasingly expose our business to possible charges of misusing information, exposing personal information, and so on.
There are many more thought-provoking observations in the Digital Universe study. It is well worth a read for anybody considering integrating data warehousing and big data.
Doug Laney said:
Hi Barry, Interesting concept about transient data. Along the same lines, at Gartner (then META) in 2003 I did a study on and wrote about what I call "subtransactional data" -- i.e. the typically granular, high-velocity, high-volume data about business activity in the vacuum between discernible business events. Examples of subtransactional data are network activity, in-store/online shopper behavior and other weak signals. My conclusion was that organizations that are able to tap and leverage this data are in a better position to affect business event/process outcomes. E.g. observing what a website visitor is doing on your site enables you to customize the experience, leading to higher-probability and higher-value purchases. -Doug
Alan Musnikow said:
The phrase, "this is a significant issue that proponents of big data and data warehousing projects" seems to be missing at least the verb that should follow it.