While working with a large number of files for data processing, I used the following R commands. Since almost everyone needs to split, append and merge data at some point, I am giving some code for splitting data based on parameters, for appending data, and for merging data.
Splitting Data Based on a Parameter.
The following divides the data into two datasets: the rows in which the column contains "Male", and all the remaining rows.
Input and Subset
Note that the read.table command reads the file (path denoted by ....) and assigns the dataset to the name x in the R environment. R is case-sensitive, so x and X are different objects.
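x <- read.table(....)
# grepl() gives a logical index (TRUE where x$col contains "Male").
# It is safer than x[-rowIndx,] with grep(), which returns zero rows
# when nothing matches.
isMale <- grepl("Male", x$col)
write.table(x[isMale,], file="match")
write.table(x[!isMale,], file="nomatch")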
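To make the split concrete, here is a minimal sketch on a small made-up data frame. The data values are assumptions for illustration only, and the column name col simply follows the snippet above. It also shows the edge case where nothing matches, in which a negative index from grep() drops every row.

```r
# Hypothetical sample data, for illustration only.
x <- data.frame(id  = 1:4,
                col = c("Male", "Female", "Male", "Unknown"),
                stringsAsFactors = FALSE)

# Logical index: TRUE where col contains "Male" (grepl is case-sensitive).
isMale  <- grepl("Male", x$col)
match   <- x[isMale, ]
nomatch <- x[!isMale, ]

# Edge case: a pattern that matches no row at all.
noHits        <- grep("Other", x$col)            # empty integer vector
droppedAll    <- x[-noHits, ]                    # negative empty index keeps nothing
keptEverything <- x[!grepl("Other", x$col), ]    # logical index keeps every row
```

The logical index behaves the same whether or not the pattern matches anything, which is why it is the safer choice for the "nomatch" half.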
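Suppose we need to divide the dataset into multiple datasets, for example one per region.
X17 <- subset(x, REGION == 17)
This is preferred to the following technique, because attach() puts the columns on the search path, where they can be confused with other objects of the same name -
attach(x)
X17 = x[REGION == 17,]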
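When there are many regions, writing one subset() call per value gets repetitive. A possible alternative, sketched here on made-up data, is split(), which returns one data frame per value of the grouping column. The SALES column, the sample values and the output file names are assumptions for illustration.

```r
# Hypothetical sample data, for illustration only.
x <- data.frame(REGION = c(17, 17, 3, 3, 8),
                SALES  = c(10, 20, 30, 40, 50))

# split() returns a named list with one data frame per REGION value.
byRegion <- split(x, x$REGION)

# Write each subset to its own file (a temporary folder is used here).
outDir <- tempdir()
for (region in names(byRegion)) {
  outFile <- file.path(outDir, paste0("region_", region, ".txt"))
  write.table(byRegion[[region]], file = outFile, row.names = FALSE)
}
```

A single region can then be picked out by name, for example byRegion[["17"]].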
Output
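To write the dataset back to a file in the Windows environment you can use write.table. Put the output path in file=. With file="" the table is printed to the console instead of being written to disk.
write.table(x,file="",row.names=TRUE,col.names=TRUE,sep=" ")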
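As a quick check that the output is usable, the sketch below writes a small made-up data frame to a file and reads it back. The file name x_out.csv, the temporary folder and the comma separator are assumptions for illustration.

```r
# Hypothetical sample data, for illustration only.
x <- data.frame(name  = c("a", "b"),
                value = c(1.5, 2.5),
                stringsAsFactors = FALSE)

# Write to a comma-separated file; row names are left out so the file
# has exactly one column per variable.
outFile <- file.path(tempdir(), "x_out.csv")
write.table(x, file = outFile, sep = ",", row.names = FALSE, col.names = TRUE)

# Read the file back with matching settings.
y <- read.table(outFile, header = TRUE, sep = ",", stringsAsFactors = FALSE)
```

Keeping sep and the header setting the same on both sides is what makes the round trip work.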
Append
Let's say you have a large number of data files (say, csv files)
that you need to append (assuming the files have the same structure)
after performing basic operations on them.
>setwd("C:\\Documents and Settings\\admin\\My Documents\\Data")
Note that this changes the working folder to the folder you want.
Also note the double backslashes, which are needed to define the path (forward slashes work as well).
>list.files(path = ".", pattern = NULL, all.files = FALSE, full.names = FALSE,
+ recursive = FALSE, ignore.case = FALSE)
The R output would be something like below
[1] "cal1.csv" "cal2.csv"
[3] "cal3.csv" "cal4.csv"
[5] "cal5.csv" "cal6.csv"
[7] "cal7.csv" "cal8.csv"
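Now you can use the file.append command to successively append each of the other files
to the first file. file.append copies the file contents as they are, so if the csv files have header rows, those header lines are copied too.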
If writing a lot of similar code is tedious, use the & (concatenation) operator
in Excel to create the code. Note the Formula Bar (B7=A7&C7&D7&E7).
Excel is useful here because it is good at click-and-drag repetitive text, and
concatenation is easily done.
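The repetitive calls can also be generated inside R itself. This is a minimal sketch, not the method used in the post: it creates three small made-up files in a temporary folder (the folder name and the file contents are assumptions) and then appends every other file to the first one in a loop.

```r
# Set up a clean temporary folder with small made-up files.
workDir <- file.path(tempdir(), "append_demo")
unlink(workDir, recursive = TRUE)
dir.create(workDir)
for (i in 1:3) {
  writeLines(paste0("row from file ", i),
             file.path(workDir, paste0("cal", i, ".csv")))
}

# List the csv files with their full paths.
files  <- list.files(workDir, pattern = "\\.csv$", full.names = TRUE)
target <- files[1]

# Append each remaining file to the first one; file.append returns TRUE on success.
ok <- vapply(files[-1], function(f) file.append(target, f), logical(1))
```

list.files returns names in alphabetical order, so with ten or more files cal10.csv would sort before cal2.csv. That matters only if the order of the appended rows is important.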
The output would be something like
>file.append("cal1.csv","cal2.csv")
[1] TRUE
>file.append("cal1.csv","cal3.csv")
[1] TRUE
>file.append("cal1.csv","cal4.csv")
[1] TRUE
>file.append("cal1.csv","cal5.csv")
[1] TRUE
>file.append("cal1.csv","cal6.csv")
[1] TRUE
>file.append("cal1.csv","cal7.csv")
[1] TRUE
>file.append("cal1.csv","cal8.csv")
[1] TRUE
Note that all the data here gets appended to the file cal1.csv.
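Because file.append copies raw file contents, csv files that each carry a header row end up with the header repeated in the middle of the combined file. One possible alternative, sketched here on made-up files, is to read each file into a data frame and stack them with rbind. The folder name, the file contents and the output name combined.csv are assumptions for illustration.

```r
# Set up a clean temporary folder with two small made-up csv files.
workDir <- file.path(tempdir(), "rbind_demo")
unlink(workDir, recursive = TRUE)
dir.create(workDir)
write.csv(data.frame(id = 1:2, score = c(10, 20)),
          file.path(workDir, "cal1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, score = c(30, 40)),
          file.path(workDir, "cal2.csv"), row.names = FALSE)

# Read every csv file and stack the rows; each header is used once per file
# as column names and never ends up among the data rows.
files    <- list.files(workDir, pattern = "\\.csv$", full.names = TRUE)
combined <- do.call(rbind, lapply(files, read.csv))

# Write the combined data to a new file, leaving the inputs untouched.
outFile <- file.path(workDir, "combined.csv")
write.csv(combined, outFile, row.names = FALSE)
```

This approach needs all files to have the same column names, which matches the "same structure" assumption made earlier.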
This should be a good starting point for you to try out R.
For a Reference Sheet, here is an excellent reference sheet from Tom Short,
and it is aptly called the Short Refcard
(http://cran.r-project.org/doc/contrib/Short-refcard.pdf)
Note- Experienced analytics people are best served by
Anyways MeRRy ChRistmas !