Whether we choose to measure it in terabytes, petabytes, exabytes, HoardaBytes, or how harshly reality bites, we have been hoarding data long before the data management industry took a super-sized tip from McDonald’s and put the word “big” in front of its signature sandwich. At least McDonald’s starting phasing out their super-sized menu options in 2004, stating the need to offer healthier food choices, a move that was perhaps motivated in some small way by the success of Morgan Spurlock’s Academy Award Nominated documentary film Super Size Me.
Much like fast food is an enabler for our chronic overeating and the growing epidemic of obesity, big data is an enabler for our data hoarding compulsion and the growing epidemic of data obesity.
Does this data smell bad to you?
Are there alternatives to data hoarding? Perhaps we could put an expiration date on data, which after it has been passed we could at least archive, if not delete, the expired data. One challenge with this approach is that with most data you can not say exactly when it will expire. Even if we could, however, expiration dates for data might be as meaningless as the expiration dates we currently have for food.
As Rose Eveleth reported, “these dates are—essentially—made up. Nobody regulates how long milk or cheese or bread stays good, so companies can essentially print whatever date they want on their products.” Eveleth shared links to many sources that basically recommended ignoring the dates and relying on seeing if the food looks or smells bad.
What about regulatory compliance?
Another enabler of data hoarding is concerns about complying with future regulations. This is somewhat analogous to income tax preparation in the United States where many people hoard boxes of receipts for everything in hopes of using them to itemize their tax deductions. Even though most of the contents are deemed irrelevant when filing an official tax return, some people still store the boxes in their attic just in case of a future tax audit.
How useful would data from this source be?
Although calculating the half-life of data has always been problematic, Larry Hardesty recently reported on a new algorithm developed by MIT graduate student Dan Levine and his advisor Jonathan How. By using the algorithm, Levine explained, “the usefulness of data can be assessed before the data itself becomes available.” Similar algorithms might be the best future alternative to data hoarding, especially when you realize that the “keep it just in case we need it” theory often later faces the “we can’t find it amongst all the stuff we kept” reality.
What Say You?
Have you found alternatives to data hoarding? Please share your thoughts by leaving a comment below.