In the late 19th century, a New York City planner issued a warning that by 1950 the city would be completely uninhabitable. The problem, as he saw it, was that at the prevailing growth rates the city would not be able to sustain the growing number of horses, and, more importantly, their waste. By 1950, so the city planner reasoned, New York would be fully covered in horse manure.
I’m sure that for most of you, the anecdote will bring a smile to your face. The city planner, most will reason, was proven wrong because horses were replaced by cars, which at least in part solved the problem. Solid reasoning, but not quite. For cars were not invented to counter the problem of rising piles of horse droppings in the streets of America. Cars were invented because people saw an opportunity to overcome the downsides of transportation by horse, not the nuisance of its excrement. There may be a correlation between horse manure and the invention of the car, but there is no causality.
In a similar fashion, there seems to be a lot of ‘horse manure’ in Big Data. People know me as a proponent of Big Data and data-driven strategies in organizations. But I seem to come across more and more articles promoting Big Data as the ideal solution for ‘predictive analytics’: the use of increasingly large datasets to predict events that will take place in the future. The notion of ‘using larger datasets to improve predictive capability’ should raise a big red flag for anyone. I know from experience that predictive analytics is a viable and reliable solution to some problems. And prediction accuracy can be tinkered with to achieve impressive results. But only because some impressively smart people do the tinkering, and hardly ever because they have more data.
In his recent book ‘Big Data: A Revolution That Will Transform How We Live, Work, and Think’, Viktor Mayer-Schönberger (with co-author Kenneth Cukier) offers some compelling examples of predictive analytics, such as how doctors, by analyzing data from hundreds of newborn babies, now know that the telltale sign of an infection in a newborn is not the destabilization of the baby’s primary health functions, but their stabilization. Or how fire departments, by analyzing data from previous fires in combination with real estate data and socio-demographic data about inhabitants, can quite accurately predict the outbreak of fires and better plan their readiness for particular types of fires in particular neighborhoods.

But all of these examples have something important in common: the predictions are not better than usual because of the volume of the dataset, but because of the combination of the right data and a limited, controllable set of variables to tinker with. For people who know what they are talking about, it is relatively easy to establish causality among the correlations presented by Big Data technology. Don’t get me wrong: I’m not downplaying Big Data. But I am downplaying the capability of Big Data to solve problems on its own. It takes in-depth knowledge of the problem to spot, among the range of correlations Big Data offers, the ones that point to potential solutions. Otherwise one may just end up predicting the downfall of a major city by manure.
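To make that last point a bit more concrete, here is a minimal, purely illustrative sketch in Python (using NumPy and scikit-learn; the numbers, variable names and data are all made up for this post and have nothing to do with Mayer-Schönberger’s examples). It compares a modest dataset that contains the two variables actually driving the outcome with a dataset many times larger that is padded with irrelevant columns. The point: the achievable accuracy is set by having the right data, not by the sheer volume of it.

```python
# Toy experiment: does more data beat the right data?
# Assumes NumPy and scikit-learn are installed; everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

def make_data(n_rows, n_noise_features):
    """Binary outcome driven by two informative variables; extra columns are pure noise."""
    signal = rng.normal(size=(n_rows, 2))                # the variables that matter
    noise = rng.normal(size=(n_rows, n_noise_features))  # sheer volume, zero relevance
    y = (signal[:, 0] + 0.5 * signal[:, 1]
         + rng.normal(scale=0.5, size=n_rows) > 0).astype(int)
    return np.hstack([signal, noise]), y

def test_accuracy(n_rows, n_noise_features):
    X, y = make_data(n_rows, n_noise_features)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

# A small dataset with only the right variables...
print("2,000 rows, right variables only:", round(test_accuracy(2_000, 0), 3))
# ...versus a dataset 25 times larger, padded with 200 irrelevant columns.
print("50,000 rows, mostly noise:", round(test_accuracy(50_000, 200), 3))
```

In runs of this sketch the two accuracies come out nearly identical: the extra 48,000 rows and 200 extra columns buy essentially nothing, because the predictive ceiling was fixed the moment the two informative variables were in the data. Knowing which variables those are is the tinkering that matters, and it takes someone who understands the problem, not a bigger dataset.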
Luckily for New York, the invention of the car at the beginning of the twentieth century turned the tide on the horse droppings. But it was a narrow escape. Because had it not been for the stubbornness of one of the leaders of the infant car industry, the problem might just as easily have gotten worse. After all, it was none other than Henry Ford who reputedly said: “If I had listened to my customers, I would have built faster horses.”