Getting access to the Hadoop cluster could not have been easier. All I had to do was set up a Cygwin shell configured with OpenSSH, set the proper permissions on the .pem file that was provided to me, and put the file in my Cygwin directory. Now, to fit a model on the Hadoop cluster, all I have to do is run a few lines of R code that invoke my permissions and set my compute context to the Hortonworks cluster. The following script, which I can run from almost any Palo Alto coffee shop, fits a logistic regression model using data on the Hadoop cluster.
#-----------------------------------------------------------------------------------------------
# RUNNING REVOLUTION R ENTERPRISE 7.0 REVOSCALER FUNCTIONS ON A HADOOP CLUSTER
# This script shows code for executing RevoScaleR functions in an alpha-level version
# of Revolution R Enterprise (RRE) V7.0 on a Hadoop cluster. The Hadoop cluster is running
# remotely in an Amazon EC2 cloud. The script assumes that an ssh connection has been established
# with a Linux node running the JobTracker and NameNode for the Hadoop cluster
#-----------------------------------------------------------------------------------------------
# SET UP PERMISSIONS FOR ACCESSING THE HADOOP CLUSTER
mySshUsername <- "user-name"                           # Set user name
mySshHostname <- "xx.xxx.xxx.xxx"                      # Public facing cluster IP address
mySshSwitches <- "-i C:/cygwin/user-name.pem"          # Location of .pem permissions file
myHadoopCluster <- RxHadoopMR(sshUsername = mySshUsername,   # Describe the Hadoop compute context
                              sshHostname = mySshHostname,
                              sshSwitches = mySshSwitches)
myNameNode <- "master.local"                           # Name of name node
myPort <- 8020                                         # Port number of Hadoop name node
bigDataDirRoot <- "/share"                             # Location of the provided data
#------------------------------------------------------------------------------------------------
# POINT TO THE DATA ON THE HADOOP CLUSTER
hdfsFS <- RxHdfsFileSystem(hostName = myNameNode, port = myPort)   # Create file system object
mortCsvDataDir <- file.path(bigDataDirRoot, "mortDefault/CSV")     # Specify path on Hadoop cluster
mortText <- RxTextData(mortCsvDataDir, fileSystem = hdfsFS)        # Set the data source
#-------------------------------------------------------------------------------------------------
# CHANGE THE COMPUTE CONTEXT TO POINT TO THE HADOOP CLUSTER
rxSetComputeContext(myHadoopCluster)                   # Set the compute context
rxGetComputeContext()                                  # Check that the context has been reset
#------------------------------------------------------------------------------------------------
# DATA ANALYSIS
rxSummary(~., data = mortText)                         # Summarize the data
# Fit a logistic regression model
logitObj <- rxLogit(default ~ F(year) + creditScore + yearsEmploy + ccDebt,
                    data = mortText, reportProgress = 1)
summary(logitObj)                                      # Look at the output
#-------------------------------------------------------------------------------------------------
The first section of the script, after the initial comments, sets up permissions and specifies the Hadoop compute context. The second section points to the data on the Hadoop cluster in much the same way that one would point to data on a local machine. Then a call to rxSetComputeContext() switches the compute context to the Hadoop cluster. Following that, an rxSummary() call reads and summarizes the data, which is in a .csv file in the HDFS file system, and an rxLogit() call fits a logistic regression model to this data.
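For readers without access to RevoScaleR or a cluster, the model formula itself can be tried locally in base R. The sketch below is only a stand-in: it simulates a small data frame with the same column names as the formula in the script (the values and the relationship that generates `default` are made up purely for illustration, not taken from the mortgage data), and it uses `glm()` with `factor(year)` in place of RevoScaleR's `F(year)`. The names `simMort` and `localFit` are hypothetical.

```r
# Local base-R analogue of the rxLogit() call.
# Simulated data only; this is not the mortgage default file on the cluster.
set.seed(42)
n <- 2000
simMort <- data.frame(
  year        = sample(2000:2004, n, replace = TRUE),
  creditScore = round(rnorm(n, mean = 700, sd = 50)),
  yearsEmploy = sample(0:15, n, replace = TRUE),
  ccDebt      = round(rnorm(n, mean = 5000, sd = 2000))
)

# Hypothetical relationship, used only to generate a 0/1 response
eta <- -2 - 0.01 * (simMort$creditScore - 700) +
  0.0004 * (simMort$ccDebt - 5000) - 0.05 * simMort$yearsEmploy
simMort$default <- rbinom(n, size = 1, prob = plogis(eta))

# Same right-hand side as the script, with year treated as categorical
localFit <- glm(default ~ factor(year) + creditScore + yearsEmploy + ccDebt,
                data = simMort, family = binomial())
summary(localFit)
```

The point is only to show the shape of the model being requested: one coefficient per non-baseline year, plus one each for the three numeric predictors.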
What happens when the script runs is basically the following. My local instance of Revolution R Enterprise recognizes the call to use the remote compute context and sets up the connection to the Hadoop cluster using the permissions provided. Executing the rxLogit() function causes an instance of R 3.0.1 and Revolution R Enterprise 7 to fire up on the Hadoop JobTracker node. Behind the scenes, this kicks off a Hadoop Map/Reduce job. Since logistic regression is implemented as an iterative algorithm, a separate Map/Reduce job gets kicked off for each iteration. This cycle repeats until the regression converges or the limit on the number of iterations is reached. This file contains some of the output sent back to my R console from running the script. It shows the progress reported on the Map/Reduce jobs and a few other details that the Hadoop-curious may find interesting.
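To make the "one Map/Reduce job per iteration" idea concrete, here is a minimal base-R sketch of an iteratively reweighted least squares fit in which each iteration makes one "map" pass over chunks of the data and then "reduces" the per-chunk results by summing them. It is an illustration under stated assumptions, not RevoScaleR's actual implementation: the data are simulated, the four chunks merely stand in for blocks of a file on HDFS, and the names `mapChunk` and `reduceParts` are hypothetical.

```r
# One "map" pass over data chunks plus one "reduce" per iteration.
# Base R only; an illustration, not how RevoScaleR is implemented.
set.seed(1)
n <- 5000
X <- cbind(1, rnorm(n), rnorm(n))                  # simulated design matrix with intercept
y <- rbinom(n, 1, as.vector(plogis(X %*% c(-1, 0.8, -0.5))))

# Split row indices into 4 chunks, standing in for HDFS blocks
chunks <- split(seq_len(n), rep(1:4, length.out = n))

# "Map": this chunk's contribution to X'WX and X'Wz for the current beta
mapChunk <- function(idx, beta) {
  Xc  <- X[idx, , drop = FALSE]
  eta <- as.vector(Xc %*% beta)
  p   <- plogis(eta)
  w   <- p * (1 - p)                               # IRLS weights
  z   <- eta + (y[idx] - p) / w                    # working response
  list(XtWX = crossprod(Xc, Xc * w), XtWz = crossprod(Xc, w * z))
}

# "Reduce": add up the per-chunk pieces
reduceParts <- function(a, b) {
  list(XtWX = a$XtWX + b$XtWX, XtWz = a$XtWz + b$XtWz)
}

beta    <- rep(0, ncol(X))
maxIter <- 25                                      # iteration limit
tol     <- 1e-8                                    # convergence tolerance
for (iter in seq_len(maxIter)) {
  parts   <- lapply(chunks, mapChunk, beta = beta) # one pass over the data
  total   <- Reduce(reduceParts, parts)
  betaNew <- as.vector(solve(total$XtWX, total$XtWz))
  converged <- max(abs(betaNew - beta)) < tol
  beta <- betaNew
  if (converged) break
}
beta
```

Only small summary matrices leave each chunk, never the rows themselves, which is why this style of computation suits data that lives on a cluster. The loop stops on the same two conditions the paragraph describes: convergence or the iteration limit.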
Soon, running Map/Reduce jobs on Hadoop-scale data sets will be within the reach of anyone with basic R skills and access to Revolution R Enterprise. (Note that when it is released, Revolution R Enterprise 7 will support both Hortonworks 1.3 and Cloudera’s CDH3 and CDH4.)
For more information on Revolution and Hadoop, have a look at the recording of Revolution developer Mario Inchiosa’s recent webinar, and don’t miss the webinar describing Revolution and Hortonworks integration coming up on 9/24.
by Joseph Rickert