Resampling Data in Hadoop with RHadoop
On Revolution Analytics partner Cloudera's blog, Uri Laserson has posted an excellent guide to resampling from a large data set in Hadoop. Resampling is an important step in fitting ensemble models (including random forests and other bagging techniques), and Uri provides a step-by-step guide to implementing resampling methods using RHadoop. He provides the complete map-reduce code in the R language, as well as a useful script for installing RHadoop on a Cloudera instance.
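To give a flavor of what resampling looks like in a map-reduce setting, here is a minimal sketch in plain Python. It is not the code from Uri's guide and it does not use RHadoop. It assumes one common approach to bootstrapping data too large to hold in memory: for each bootstrap sample, every record is emitted a Poisson(1)-distributed number of times, which approximates drawing with replacement without knowing the data set size in advance. The function names (`poisson`, `map_record`, `resample`, `reduce_mean`) are hypothetical, and the in-memory grouping stands in for Hadoop's shuffle phase.

```python
import math
import random
from collections import defaultdict


def poisson(rng, lam=1.0):
    """Draw a Poisson(lam) count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def map_record(record, n_samples, rng):
    """Map step: for each bootstrap sample, emit the record Poisson(1) times."""
    for sample_id in range(n_samples):
        for _ in range(poisson(rng)):
            yield sample_id, record


def resample(records, n_samples, seed=0):
    """Apply the map step to every record and group by sample id.

    The grouping here plays the role of the shuffle in a real job.
    """
    rng = random.Random(seed)
    samples = defaultdict(list)
    for record in records:
        for sample_id, value in map_record(record, n_samples, rng):
            samples[sample_id].append(value)
    return dict(samples)


def reduce_mean(samples):
    """Reduce step: one statistic per sample (a stand-in for fitting a model)."""
    return {sid: sum(vals) / len(vals) for sid, vals in samples.items()}


if __name__ == "__main__":
    data = list(range(1000))
    boot = resample(data, n_samples=5, seed=42)
    for sid, mean in sorted(reduce_mean(boot).items()):
        print(sid, len(boot[sid]), round(mean, 2))
```

In an ensemble setting, the reduce step would fit one model per sample id rather than compute a mean. Because each record is handled independently in the map step, the work parallelizes across the cluster without any mapper needing to see the whole data set.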
By the way, if you're new to RHadoop, here's RHadoop creator and project leader Antonio Piccolboni introducing RHadoop at last year's Strata CA conference.
Other Posts by David Smith