This article will make you feel better. And you do need to feel better, if you are one of the many of us who practice analytics—or who must consume and rely on analytics—and find ourselves carrying tension in our shoulders or sometimes losing sleep.
The fear stems from a well-known warning of tragic mishap: “If you torture the data long enough, it will confess,” as stated by University of Chicago economics professor Ronald Coase. There is a general sense that math could be wrong and that analytics is an art.
As John Elder of Elder Research put it, “It’s always possible to get lucky (or unlucky). When you mine data and find something, is it real, or chance?” How can we confidently trust what a computer claims to have learned? How do we avert the dire declension, “Lies, damned lies, and statistics”?
There is a simple, elegant solution from Elder—but first, let me further magnify your fear: Even the very simplest predictive model risks utter failure. Mistaken, misleading conclusions are in fact terribly easy to come by.
A conclusion drawn about one single variable—even without the use of a common multivariate model (such as log-linear regression)—can go awry. In fact, one of the more famous such analytical insights, “an orange used car is least likely to be a lemon,” has recently been debunked by Elder and his colleague Ben Bullard at Elder Research, Inc.
Big data, with all its pomp and circumstance, can actually mean big risk. More data can present more opportunities to inadvertently discover untrue patterns that appear misleadingly strong within your dataset but do not, in fact, hold true in general. To be more specific, “bigger” data could mean longer data (a longer list of examples, which generally helps avert spurious conclusions), but it could also mean wider data (more columns, that is, more variables or factors per example). With wider data, even if you are only considering one variable at a time, such as the color of each car, you are more likely to come across one that just happens to look predictive in your data by sheer chance alone. John Elder has dubbed this peril, which arises when searching across many variables, vast search.
Dr. Elder puts it this way: “Modern predictive analytic algorithms are hypothesis-generating machines, capable of testing millions of ‘ideas.’ The best result stumbled upon in its vast search has a much greater chance of being spurious… The problem is so widespread that it is the chief reason for a crisis in experimental science, where most journal results have been discovered to resist replication; that is, to be wrong!”
A few years ago, Berkeley Professor David Leinweber made waves with his discovery that the annual closing price of the S&P 500 stock market index could have been predicted from 1983 to 1993 by the rate of butter production in Bangladesh. Bangladesh’s butter production mathematically explains 75 percent of the index’s variation over that time. Urgent calls were placed to the Credibility Police, since it certainly cannot be believed that Bangladesh’s butter is closely tied to the U.S. stock market. If its butter production boomed or went bust in any given year, how could it be reasonable to assume that U.S. stocks would follow suit? This stirred up the greatest fears of predictive analytics (PA) skeptics, and vindicated nonbelievers. Eyebrows were raised so vigorously, they catapulted Professor Leinweber onto national television.
Crackpot or legitimate educator? It turns out Leinweber had contrived this analysis as a playful publicity stunt, within a chapter entitled “Stupid Data Miner Tricks” in his book Nerds on Wall Street. His analysis was designed to highlight a common misstep by exaggerating it. It’s dangerously easy to find ridiculous correlations, especially when you’re “predicting” only 11 data points (annual index closings for 1983 to 1993). By searching through a large number of financial indicators across many countries, something or other will show similar trends, just by chance. It will eventually unearth cockamamie relationships. For example, shiver me timbers, a related study showed buried treasure discoveries in England and Wales predicted the Dow Jones Industrial Average a full year ahead from 1992 to 2002.
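To get a feel for how easily such a search turns up a "winner," here is a minimal simulation sketch. It is not Leinweber's analysis: the target and every candidate indicator are random walks generated on the spot, and the function name, the 10,000-candidate count, and the seed are all illustrative assumptions. It reports the best R² found when 11 unrelated data points are "explained" by the luckiest of many unrelated series.

```python
import numpy as np


def best_spurious_r2(n_points=11, n_candidates=10_000, seed=0):
    """Search unrelated random series for the one that best 'explains' a target.

    Everything here is synthetic (an assumption for illustration): the target
    stands in for 11 annual index closings, and the candidates stand in for
    financial indicators that have nothing to do with it.
    """
    rng = np.random.default_rng(seed)

    # Target: a random walk of 11 points.
    target = np.cumsum(rng.normal(size=n_points))

    # Candidates: independent random walks, one per row.
    candidates = np.cumsum(rng.normal(size=(n_candidates, n_points)), axis=1)

    # Pearson correlation of every candidate with the target.
    t = target - target.mean()
    c = candidates - candidates.mean(axis=1, keepdims=True)
    r = (c @ t) / (np.linalg.norm(c, axis=1) * np.linalg.norm(t))

    # R^2 is the share of the target's variation a candidate 'explains'.
    r2 = r ** 2
    return r2.max(), np.median(r2)


best, median = best_spurious_r2()
print(f"best R^2 over the whole search: {best:.3f}")
print(f"median R^2 of a single candidate: {median:.3f}")
```

The gap between the best and the median value is the point: any one candidate is usually unimpressive, but the best of thousands can look remarkably strong even though, by construction, none of them carries any information about the target.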
Leinweber attracted the attention he sought, but his lesson didn’t seem to sink in. “I got calls for years asking me what the current butter business in Bangladesh was looking like and I kept saying, ‘Ya know, it was a joke, it was a joke!’ It’s scary how few people actually get that.” As Black Swan author Nassim Taleb put it in his suitably titled book, Fooled by Randomness, “Nowhere is the problem of induction more relevant than in the world of trading—and nowhere has it been as ignored!” Thus the occasional overzealous yet earnest public claim of economic prediction based on factors like women’s hemlines, men’s necktie width, Super Bowl results, and Christmas day snowfall in Boston.
The culprit that kills machine learning is overlearning (aka overfitting). Overlearning is the pitfall of mistaking noise for information, assuming too much about what has been shown within data. You’ve overlearned if you’ve read too much into the numbers, led astray from discovering the underlying truth.
While many analytics practitioners consider overlearning a risk with predictive models that combine multiple variables, the truth is even well-publicized single-variable results are at risk. A dire need for a new paradigm has emerged.
But is it really that hard? Why would analysts now assert that standard tests of statistical significance break down when vast search is in play?
And what can be done to validate (i.e., test for significance) even after vast search has claimed to have made a discovery?
Now that your interest has been piqued, you may get the answers from one or both of the following in-depth sources:
- PLENARY CONFERENCE SESSION. Presentation at six (6) Predictive Analytics World events in 2014: PAW San Francisco (March), PAW Toronto (May), PAW Chicago (June), PAW Government (September in DC), PAW Boston (October), and PAW London (October): “The Peril of Vast Search (and How Target Shuffling Can Save Science)” by John Elder, CEO & Founder, Elder Research, Inc. Full session description
- TECHNICAL PAPER: “Are Orange Cars Really not Lemons?” by Ben Bullard & John Elder, Elder Research, Inc. This technical paper explores the difficulty introduced above, walking the reader through a detailed example and introducing a solution for addressing the challenge at hand: target shuffling. Partial excerpt of the paper:
A recent article in The Seattle Times, reported that “an orange used car is least likely to be a lemon.” This discovery surfaced in a competition hosted by Kaggle to predict bad buys among used cars using a labeled dataset. Of the 72,983 used cars, 8,976 were bad buys (12.3%). Yet, of the 415 orange cars in the dataset, only 34 were bad (8.2%)…
But how unusual is this low proportion? That is, assuming the true proportion is really equal, what is the likelihood that it could have occurred by chance for a random partition of that size? Such a calculation takes into account the numbers of cars making up both proportions (good and bad Orange vs. good and bad non-Orange). When we apply a 1-sided statistical hypothesis test for equality of proportions between two samples it yields a p-value of 0.00675. In other words, the hypothesis test reveals that if the underlying reality is that the proportion of bad buys among orange cars is really equal to the proportion of bad buys among all non-orange cars, then the probability that one would observe a sample proportion for orange cars that is so much lower than the sample proportion for non-orange cars (given sample sizes of 415 and 72,466, respectively) is only 0.675%.
[…]
[But] what we see is that statistical hypothesis tests only work when the hypothesis comes first, and the analysis second. One cannot use the data to inform the hypothesis and then test that hypothesis on the same data. That leads to overfit and over-confidence in your results, which leads to the model underperforming (or failing entirely) on new data, where it is most needed.
And yet, how do we know what to hypothesize? Isn’t the great strength of data mining that the computer can try out all sorts of things and report back which one might work?
Access the full technical paper by Bullard and Elder
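The one-sided test for equality of proportions described in the excerpt can be sketched as follows. This is a plain pooled two-proportion z-test and not necessarily the exact procedure Bullard and Elder used. The non-orange counts are an assumption, derived by subtracting the orange cars from the dataset totals quoted in the excerpt (the excerpt itself cites a non-orange sample size of 72,466), so the result should land in the same neighborhood as the quoted p-value without necessarily matching it.

```python
from math import erfc, sqrt


def one_sided_two_proportion_z(bad_a, n_a, bad_b, n_b):
    """Pooled z-test of H0: p_a == p_b against the alternative p_a < p_b.

    Returns the z statistic and the lower-tail p-value.
    """
    p_a = bad_a / n_a
    p_b = bad_b / n_b

    # Under H0 both groups share one proportion, estimated from pooled counts.
    pooled = (bad_a + bad_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))

    z = (p_a - p_b) / std_err

    # Standard normal lower tail: P(Z <= z) = 0.5 * erfc(-z / sqrt(2)).
    p_value = 0.5 * erfc(-z / sqrt(2))
    return z, p_value


# Counts quoted in the excerpt.
total_cars, total_bad = 72_983, 8_976
orange_cars, orange_bad = 415, 34

# Assumption: every car that is not orange counts as non-orange.
other_cars = total_cars - orange_cars
other_bad = total_bad - orange_bad

z, p_value = one_sided_two_proportion_z(orange_bad, orange_cars, other_bad, other_cars)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.5f}")
```

A small p-value here only answers the question for a hypothesis chosen in advance. As the excerpt stresses, it does not account for the fact that orange was singled out after looking across all the colors in the same data.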