Principal Components for Modeling
Analysts constructing predictive models frequently encounter the need to reduce the size of the available data, both in terms of variables and observations. One reason is that data sets are now available which are far too large to be modeled directly in their entirety using contemporary hardware and software. Another reason is that some data elements (variables) have an associated cost. For instance, medical tests bring an economic and sometimes human cost, so it would be ideal to minimize their use if possible. Another problem is overfitting: Many modeling algorithms will eagerly consume however much data they are fed, but increasing the size of this data will eventually produce models of increased complexity without a corresponding increase in quality. Model deployment and maintenance, too, may be encumbered by extra model inputs, in terms of both execution time and required data preparation and storage.
Naturally, the goal in data reduction is to decrease the size of the needed data while maintaining (as much as possible) model performance, so this process must be performed carefully.
A Solution: Principal Components
Selection of candidate predictor variables to retain (or to eliminate) is the most obvious way to reduce the size of the data. If model performance is not to suffer, though, then some effective measure of each variable's usefulness in the final model must be employed, which is complicated by the correlations among predictors. Several important procedures have been developed along these lines, such as forward selection, backward selection and stepwise selection.
Another possibility is principal components analysis ("PCA" to its friends), a procedure from multivariate statistics which yields a new set of variables (the same number as before), called the principal components. Conveniently, all of the principal components are simply linear functions of the original variables. As a side benefit, the principal components are completely uncorrelated with one another. The technical details will not be presented here (see the reference, below), but suffice it to say that if 100 variables enter PCA, then 100 new variables (the principal components) come out. You are now wondering, perhaps, where the "data reduction" is? Simple: PCA constructs the new variables so that the first principal component exhibits the largest variance, the second principal component exhibits the second largest variance, and so on.
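To make the properties just described concrete, here is a minimal sketch in plain Python. It is not taken from the reference: the function names (covariance_matrix, top_eigenpair, pca) are invented for illustration, the data are synthetic, and the eigenvectors are found by power iteration with deflation, chosen here only because it needs no external libraries. A production implementation would more likely call a tested linear algebra routine.

```python
import math
import random


def covariance_matrix(rows):
    """Return (means, centered rows, sample covariance matrix)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    return means, centered, cov


def top_eigenpair(matrix, iterations=500):
    """Power iteration: approximate the dominant eigenvalue and unit eigenvector."""
    p = len(matrix)
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    vec = [rng.uniform(-1.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient: the variance of the data along this direction.
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(p) for j in range(p))
    return value, vec


def pca(rows):
    """Return (variances, loadings, scores), largest-variance component first."""
    _, centered, cov = covariance_matrix(rows)
    p = len(cov)
    work = [row[:] for row in cov]
    variances, loadings = [], []
    for _ in range(p):
        value, vec = top_eigenpair(work)
        variances.append(value)
        loadings.append(vec)
        # Deflation: subtract the component just found, exposing the next one.
        for i in range(p):
            for j in range(p):
                work[i][j] -= value * vec[i] * vec[j]
    # Each score is a linear function (weighted sum) of the centered variables.
    scores = [[sum(c[j] * w[j] for j in range(p)) for w in loadings]
              for c in centered]
    return variances, loadings, scores


# Synthetic data: two strongly correlated variables plus one independent one.
rng = random.Random(42)
rows = []
for _ in range(200):
    t = rng.gauss(0, 3)
    rows.append([t + rng.gauss(0, 0.5), 2 * t + rng.gauss(0, 0.5), rng.gauss(0, 1)])

variances, loadings, scores = pca(rows)
print("component variances:", [round(v, 3) for v in variances])
```

Three variables go in and three components come out. The variances are printed in decreasing order, and on data like this, where two of the inputs are nearly redundant, the first component should carry most of the total.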
How well this works in practice depends completely on the data. In some cases, though, a large fraction of the total variance in the data can be compressed into a very small number of principal components. The data reduction comes when the analyst decides to retain only the first n principal components.
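One common way to pick n, sketched below, is to keep adding components until they account for a chosen share of the total variance. The function name components_to_keep, the 90% target and the list of variances are all made up for illustration; the right threshold depends on the data and on the modeling problem.

```python
def components_to_keep(variances, target=0.90):
    """Smallest n whose first n components reach `target` share of total variance.

    `variances` must already be sorted from largest to smallest,
    which is the order in which PCA produces them.
    """
    total = sum(variances)
    running = 0.0
    for n, v in enumerate(variances, start=1):
        running += v
        if running / total >= target:
            return n
    return len(variances)


# Hypothetical component variances (not from any real data set).
example_variances = [5.2, 2.1, 0.9, 0.4, 0.2, 0.1, 0.1]
n_keep = components_to_keep(example_variances, target=0.90)
print("components retained:", n_keep)
```

With these made-up numbers, the first three of seven components hold roughly 91% of the total variance, so the remaining four could be dropped under a 90% rule.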
Note that PCA does not eliminate the need for the original variables: they are all still used in the calculation of the principal components, no matter how few of the principal components are retained. Also, statistical variance (which is what is concentrated by PCA) may not correspond perfectly to "predictive information", although it is often a reasonable approximation.
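The first point can be seen in a small sketch of how a new observation would be scored. The names (component_scores, means, loadings) and all numbers below are hypothetical; in practice the means and loadings would come from the PCA fitted on the training data.

```python
def component_scores(observation, means, loadings, n_keep):
    """Project one observation onto the first n_keep principal components."""
    if len(observation) != len(means):
        raise ValueError("every original variable is required to compute a score")
    centered = [x - m for x, m in zip(observation, means)]
    # Each retained component is a weighted sum over ALL original variables.
    return [sum(c * w for c, w in zip(centered, weights))
            for weights in loadings[:n_keep]]


# Hypothetical training means and loadings for three original variables.
means = [10.0, 20.0, 5.0]
loadings = [
    [0.6, 0.8, 0.0],   # first principal component
    [0.0, 0.0, 1.0],   # second principal component
    [0.8, -0.6, 0.0],  # third principal component
]

new_row = [11.0, 22.0, 4.0]
scores = component_scores(new_row, means, loadings, n_keep=1)
print("retained scores:", scores)
```

Even though only one component is retained here, computing it still requires a value for every original variable, so PCA by itself does not remove the cost of collecting those inputs.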
Last Words
Many statistical and data mining software packages will perform PCA, and it is not difficult to write one's own code. If you haven't tried this technique before, I recommend it: It is truly impressive to see PCA squeeze 90% of the variance in a large data set into a handful of variables.
Note: Related terms from the engineering world: eigenanalysis, eigenvector and eigenfunction.
Reference
For the down-and-dirty technical details of PCA (with enough information to allow you to program PCA), see:
Multivariate Statistical Methods: A Primer, by Manly (ISBN: 0-412-28620-3)
Note: The first edition is adequate for coding PCA, and is at present much cheaper than the second or third editions.
Other Posts by Will Dwinnell
Data Mining and Terrorism... Counterpoint - January 6, 2010
Taking Assumptions With A Grain Of Salt - April 26, 2009
Graphing Considered Dangerous - April 1, 2009
How many software packages is too much? - March 20, 2009
KDD 2008 - March 17, 2009