As I previewed yesterday, REvolution R Enterprise 2.0 is now available to subscribers. In yesterday’s post, I focused mainly on the process of creating the release; today, I’d like to talk about some of its new features.
64-bit Windows support
REvolution R Enterprise 2.0 is the only version of R available for 64-bit Windows systems. This means that it is now possible to analyze much larger data sets on Windows systems than ever before. The reason for this is that, with a few exceptions, all of the computational routines in R are in-memory. This means that the entire data set and any temporary copies and working variables required by the routine must be able to fit into the operating system’s memory at once. As a rough rule of thumb, most statistical routines (like regression or tree models) will require at least three temporary copies of the data. So on a 32-bit Windows system, where the maximum memory available is around 3 gigabytes, the data plus its three copies must fit in that space, which means you can analyze a data set of 750 megabytes, tops. But on a 64-bit system, as long as you have enough disk space available, you can analyze much larger data sets. In fact, your limitation will likely be the amount of time you have to wait rather than the storage you have available. Even better, you can install and use more than 4 gigabytes of RAM (memory chips) on 64-bit Windows systems, and for large data set analysis the more RAM you have installed, the faster it will run. You can expect the best performance when the installed RAM is at least 3-4 times the size of the data. (You don’t need that much, but the analysis will run slower if you have less.)
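To make the rule of thumb concrete, here is a small sketch of the arithmetic in R. The function names (max_data_mb, data_size_mb) are made up for illustration, and the figure of 8 bytes per value assumes a data set made up entirely of double-precision numbers.

```r
# Rule of thumb from the text: the data plus about three temporary copies
# must fit in memory at once, so the usable data size is roughly memory / 4.
# (Hypothetical helper, using 1 gigabyte = 1000 megabytes as the text does.)
max_data_mb <- function(memory_gb, temp_copies = 3) {
  memory_gb * 1000 / (1 + temp_copies)
}

# Approximate in-memory size of an all-numeric data set,
# assuming 8 bytes per value (R's double-precision storage).
data_size_mb <- function(rows, cols) {
  rows * cols * 8 / 1e6
}

limit_32bit <- max_data_mb(3)           # ceiling with about 3 GB of memory
example_mb  <- data_size_mb(1e6, 100)   # one million rows, 100 numeric columns

print(limit_32bit)
print(example_mb)
print(example_mb > limit_32bit)         # does the example exceed the 32-bit ceiling?
```

Under these assumptions, a table of one million rows by 100 numeric columns already sits just above what the 32-bit rule of thumb allows.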
This opens R to a whole new world of possibilities for analyzing data on 64-bit Windows systems. For example, you can now:
- Estimate correlation matrices (and calculate Value at Risk) for much larger financial portfolios
- Use the Bioconductor suite to analyze pharmaceutical and biochemical data from much larger microarrays
- Build predictive models about purchasing behavior on larger databases of customer data, without the need for sampling
Brian Ripley, Professor of Applied Statistics at the University of Oxford and member of the R Core Development Team, reported using REvolution R Enterprise for genetic analysis during the beta test. He said:
“REvolution are to be congratulated on a technical tour de force…This will bring to Windows users the freedom to use R on large problems that users of Unix-like platforms have enjoyed for several years. We did some testing on a 32GB Windows box on behalf of a computational genetics project, and the beta was 100% reliable and comparable in performance to the Rcore 32-bit distribution but able to tackle much larger problems.”
Basically, if you’ve tried to use R to analyze a large dataset on Windows before and gotten an error like “cannot allocate vector of size 858213 Kb”, switching to a 64-bit version of Windows with REvolution R Enterprise 2.0 is likely to help.
ParallelR upgraded
REvolution R Enterprise 2.0 comes with ParallelR, a suite of packages from REvolution Computing that simplify parallel programming in R. If you have a multiprocessor workstation (and most higher-end laptops and desktops sold today have at least 2 processors or cores), then parallel programming is a way of instructing R to use all processors simultaneously to reduce computation time. REvolution R automatically uses multiple processors for some key mathematical routines like matrix multiplication and decomposition, but for general R code only one processor will be used at a time unless you use the features of ParallelR.
There are a few other systems available for parallel programming in R, but after talking to users who had attempted to use them, we found that most attempts by casual users had been abandoned in frustration. This is because these systems were designed primarily for use on clusters (collections of workstations) for distributed computing. This in turn requires complex procedures for setting up the environment: designating the server and clients, nominating processors on each, bypassing security measures so that the instances of R can talk to one another, and so on. We also heard that writing parallel programs in these systems was complicated: the programmer had to deal with a lot of unfamiliar concepts like clients, servers, shared variables and message-passing. When correctly configured, these systems can offer excellent performance, but unless you have the computer-science background and training to rewrite your R programs using these new paradigms, the performance gain is, well, zero.
ParallelR is designed so that the casual R programmer can easily convert “embarrassingly parallel” R programs to run faster on multiprocessor workstations. Embarrassingly parallel problems are those with sequences of steps that can be arbitrarily reordered because no step depends on the results of any other step. Common examples in the statistics world are simulations, bagging and boosting procedures (fitting random forest models, for example), predictions, and fitting the same model to a sequence of dependent variables (a series of regions or segments, for example).
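As a minimal illustration of what "can be arbitrarily reordered" means, the sketch below runs a toy simulation forwards and backwards using only base R. The function simulate_mean and the per-step seeding are assumptions made for this example, not part of ParallelR.

```r
# Each step depends only on its own index, never on another step's result.
simulate_mean <- function(i) {
  set.seed(i)          # per-step seed, so the step is fully self-contained
  mean(rnorm(1000))    # one simulated replicate
}

forward  <- sapply(1:8, simulate_mean)   # steps run in order 1, 2, ..., 8
backward <- sapply(8:1, simulate_mean)   # same steps run in reverse order

# Reordering the steps changes nothing except the order of the results.
print(identical(forward, rev(backward)))
```

By contrast, a loop that keeps a running total, where each step needs the previous step's value, is not embarrassingly parallel and cannot be reordered this way.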
The key innovation is a new function called foreach, which you can use to replace the traditional for loop in R. If you had enough processors available, each iteration of the loop would run at the same time, in parallel. More realistically, a few iterations will run in parallel at any one time, one per available processor. ParallelR handles all the complexity of scheduling each iteration when a processor becomes available and collecting the results, and automatically ensures that the local variables of the loop are replicated so the values from one iteration do not trample those of another. You can see some examples of foreach in action on the REvolution website, or in this recent webcast on backtesting financial models where using a quad-core system in parallel reduced the computational time by almost 75%.
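To give a feel for the conversion, here is a hedged sketch that fits the same regression to several response variables, first with a for loop and then with foreach. It assumes the foreach package as distributed on CRAN, with its %dopar% operator and .combine argument; the interface shipped with ParallelR 2.0 may differ in details. The choice of the built-in mtcars data and the variable names are purely illustrative.

```r
library(foreach)   # assumption: the CRAN 'foreach' package is installed

# Register the sequential backend so the sketch runs anywhere. To actually
# run iterations in parallel, a parallel backend would be registered instead.
registerDoSEQ()

responses <- c("mpg", "hp", "qsec")   # illustrative response variables

# Traditional for loop: results are filled in one iteration at a time.
r2_for <- numeric(length(responses))
for (k in seq_along(responses)) {
  fit <- lm(reformulate("wt", response = responses[k]), data = mtcars)
  r2_for[k] <- summary(fit)$r.squared
}

# foreach version: each iteration returns a value, and .combine = c
# collects the per-iteration values into a single numeric vector.
r2_foreach <- foreach(y = responses, .combine = c) %dopar% {
  fit <- lm(reformulate("wt", response = y), data = mtcars)
  summary(fit)$r.squared
}

print(all.equal(r2_for, r2_foreach))
```

Note that the foreach body returns its result rather than assigning into a shared vector. That is what lets the iterations run independently of one another.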
For really meaty jobs you can speed up performance even more by adding more processors with a cluster. You don’t need to have a dedicated laboratory full of high-powered workstations available: you can always harness those PCs and Macs sitting idle around the office overnight for your heaviest number-crunching problems. ParallelR makes it easy to take the code you’ve already run in parallel on your desktop and extend it to a cluster of machines running R using a feature called sleighs. And if the overnight cleaner accidentally unplugs one of those machines you won’t have wasted a night’s computing time: ParallelR has fault tolerance so your job will still complete even if some of the nodes in the cluster become unavailable.
Enterprise Support and Service
As our subscription-level version of R, REvolution R Enterprise is backed by REvolution Computing and comes with full technical support services from our teams. It also comes ready for use in validated environments, such as for the analysis of FDA-controlled clinical trials.
Looking for more?
In the coming weeks we’ll have more examples and stories about REvolution R Enterprise here, but in the meantime if you’d like more information or want to enquire about subscriptions and editions, please just contact REvolution Computing and we’ll be happy to help.