Analyst and BI expert Steve Miller takes a look at the facilities in R for doing “by-group” processing of data. The task consisted of:
… read several text files, merge the results, reshape the intermediate data, calculate some new variables, take care of missing values, attend to meta data, execute a few predictive models and graph the results.
Then repeat the models and graphs for groups or sub-populations marked by distinct values of one or more dimension variables of interest.
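For concreteness, here is a minimal sketch of that kind of data-preparation pipeline in base R. The files, column names and derived variable below are simulated stand-ins for illustration, not from Steve's post:

set.seed(1)
paths <- replicate(3, tempfile(fileext = ".txt"))   # stand-ins for the text files
for (i in 1:3) {
  write.table(data.frame(id  = (i - 1) * 5 + 1:5,
                         jan = rnorm(5), feb = rnorm(5)),
              paths[i], sep = "\t", row.names = FALSE)
}

pieces <- lapply(paths, read.delim)                 # read several text files
merged <- do.call(rbind, pieces)                    # combine the results
long   <- reshape(merged, direction = "long",       # reshape wide to long
                  varying = c("jan", "feb"), v.names = "value",
                  timevar = "month", times = c("jan", "feb"))
long$value[is.na(long$value)] <- 0                  # take care of missing values (none in this toy data)
long$scaled <- long$value / max(abs(long$value))    # calculate a new variable
head(long)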
That last step, repeating the models and graphs for each group, is commonly referred to as “by-group processing.” SAS programmers will recognize it from syntax that invokes a procedure on a sorted data set, something like:
proc reg data=dblahblah;
  by vblahblah;
run;
Check out Steve’s post to see how he addressed this in R using Matthew Dowle’s high-performance data.table package (as Steve suggests, the package’s example vignettes are a good place to get started).
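For a flavor of the data.table idiom, here is a minimal sketch on simulated data (the data and model are made up, not Steve's): the by argument splits the table and evaluates the expression once per group, much like SAS's BY statement on a sorted data set.

library(data.table)

set.seed(42)
dt <- data.table(group = rep(c("A", "B", "C"), each = 100),
                 x     = rnorm(300))
dt[, y := 2 * x + rnorm(.N)]              # simulate a response

# "by" evaluates the j expression once per group, so each group gets its own
# regression; the coefficients come back as one table, one row per group.
dt[, as.list(coef(lm(y ~ x))), by = group]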
I’d also recommend the plyr package, which likewise offers tools to split up a data set by various criteria and then process each piece; the plyr: divide and conquer guide is a good place to start. As an added bonus, you can divide and conquer the computations themselves, exploiting multiple processors or nodes in parallel by registering a parallel backend for foreach. (Note for Windows users: the doSMP backend from Revolution R is now available on R-Forge and will be on CRAN soon, too.)
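Here is a comparable sketch with plyr, again on made-up data: ddply() splits a data frame by one or more variables, applies a function to each piece and stacks the results. The parallel part at the end is an assumption for illustration; it registers doParallel as a stand-in for whichever foreach backend (such as doSMP) you have available.

library(plyr)

set.seed(42)
df <- data.frame(group = rep(c("A", "B", "C"), each = 100),
                 x     = rnorm(300))
df$y <- 2 * df$x + rnorm(300)

# ddply() splits df by "group", applies the function to each piece,
# and stacks the per-group results back into one data frame.
coefs <- ddply(df, .(group), function(piece) {
  fit <- lm(y ~ x, data = piece)
  data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
})
coefs

# The same split can be farmed out in parallel via foreach once a backend
# is registered (doParallel is assumed here; the post mentions doSMP for
# Windows users):
library(foreach)
library(doParallel)
registerDoParallel(cores = 2)
foreach(piece = split(df, df$group), .combine = rbind) %dopar%
  coef(lm(y ~ x, data = piece))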
Information Management: By-Group Processing, the R data.table and the Power of Open Source