The Use and Abuse of Big Data

As we begin a new year, we are promised a move from a focus on the meaning and technology of big data to the useful and worthwhile business applications it may offer. A timely move indeed. Hopefully, we’ll begin to hear less about analyzing Twitter streams to optimize advertising and more about applications with the potential to improve people’s lives or the environment. And even more hopefully, people may begin to consider the risks they run when revealing or gathering personal data on our deeply interconnected Web.

With all of the synchronicity that is the Internet, I came across two articles from the New York Times published last week. The first, by Peter Jaret on January 14, describes how patient records, transcribed and digitized from scrawled (why do they write so poorly?) doctors’ notes, anonymized and stored on the Web, can be statistically mined to discover previously unknown side effects of, and interactions between, prescribed drugs. Clearly useful and valuable work. The second article, three days later by Gina Kolata, revealed how easily a genetics researcher was able to identify five individuals and their extended families by combining publicly available information from the anonymized 1000 Genomes Project database, a commercial genealogy Web site, and Google. Kolata quotes Amy L. McGuire, a lawyer and ethicist at Baylor College of Medicine in Houston: “To have the illusion you can fully protect privacy or make data anonymous is no longer a sustainable position.” The underlying genetic data is used in medical research to good effect, of course, but what are the possible consequences for the individuals thus identified if insurance companies, governments or other interested parties make potentially negative assessments based on their once-private genomes?

Such occurrences–and there are many of them–should be deeply disturbing to those of us involved in the business of big data and analytics. Here are doctors, scientists and lawyers–with training in logic, ethics and law–who see the power of analytics to improve the human condition, but who seem to gloss over the wider privacy and security implications of making personal information widely available on the Web. After all, the limits of data anonymization on the Web were being discussed openly as long ago as May 2011 by Pete Warden on the O’Reilly Radar blog. And as far back as 1997, Prof. Latanya Sweeney, now Director of the Data Privacy Lab at Harvard, could show that the combination of gender, ZIP code and birthdate was unique for 87% of the U.S. population.
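To make the point about gender, ZIP code and birthdate concrete, here is a minimal Python sketch of the kind of check involved: it counts what share of records are the only ones carrying a given combination of attributes. The function name `unique_fraction`, the field names and the five toy records are all invented for illustration; this is not Sweeney's method or data, only a rough picture of why adding attributes makes "anonymous" records stand out.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Share of records whose combination of the given fields occurs exactly once."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


# Made-up toy records (hypothetical, not real people or real data).
records = [
    {"gender": "F", "zip": "02138", "birthdate": "1960-01-15"},
    {"gender": "F", "zip": "02138", "birthdate": "1960-01-15"},
    {"gender": "M", "zip": "02138", "birthdate": "1971-07-02"},
    {"gender": "F", "zip": "02139", "birthdate": "1985-11-30"},
    {"gender": "M", "zip": "02138", "birthdate": "1985-11-30"},
]

for fields in (["gender"], ["gender", "zip"], ["gender", "zip", "birthdate"]):
    share = unique_fraction(records, fields)
    print(f"{' + '.join(fields)}: {share:.0%} of records are unique")
```

In this toy set no one is unique by gender alone, one record in five is unique by gender plus ZIP code, and three in five are unique once birthdate is added. A record that is unique on such fields can be matched against any other dataset that carries the same fields alongside a name.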

Eben Moglen, professor of law and legal history at Columbia University and Chairman of the Software Freedom Law Center, warned at re:publica Berlin in May 2012 that “media that spies on and data-mines the public is destroying freedom of thought and only this generation, the last to grow up remembering the ‘old way,’ is positioned to save this, humanity’s most precious freedom.” With media and medicine, government and retail, telecommunications and finance all gathering hoards of information about us, each for their own allegedly good purpose, the reality is now that the abuse of big data (as opposed to its use) is not only possible, but proceeding apace, even in largely democratic, Western states.

So, given that big data anonymity is “no longer a sustainable position,” it should be clear that the analytics possible on today’s high-powered computers are a double-edged sword; it serves us poorly to focus only on a single, razor-sharp edge. As we evaluate and build useful and worthwhile business analytics applications in the coming year, let us step back, if only occasionally, to contemplate whether the profits to be earned or the discoveries to be made are worth the price of human freedom.