Big Data Sets You Can Use with R

The world may indeed be awash with data; however, it is not always easy to find a suitable data set when you need one. As the number of people involved with R and data science increases, so does the need for interesting data sets for creating examples, showcasing machine learning algorithms and developing statistical analyses. The most difficult data sets to find are those that would provide the foundation for impressive big data examples: data sets with 100 million rows and hundreds of variables. The problem with big data, however, is that most of it is proprietary and locked away. Consequently, when constructing examples it is often necessary to “make do” with data sets that are considerably smaller than those an analyst is likely to face in practice. To help with this problem, we have added some new data sets to the lists of data sets on inside-r.org that we began keeping almost two years ago. So, if you are looking for a sample data set, or if you are the kind of person who enjoys browsing data repositories as some people enjoy browsing bookstores, have a look at what is available there. The following presents some of the highlights.

The Revolution Analytics collection contains some of the data sets we use at Revolution to show off the Parallel External Memory Algorithms in our RevoScaleR package. The collection includes easily accessible “tarred-up” versions of the Airlines Data Set, Census5PCT2000 data set and an artificial set of mortgage default data.

The Airlines data set that was used in the 2009 American Statistical Association challenge has become the “iris” data set for big data. This file contains information on US domestic flights between 1987 and 2008 and has some nice properties that make it useful for different kinds of analyses. It has over 123 million rows (observations) and 29 columns containing variables of different data types, including factors with lots of levels. The following output from the RevoScaleR function rxGetInfoXdf() displays basic information for the variables in the file.

> rxGetInfoXdf(working.file,getVarInfo=TRUE)
File name: C:\DATA\Airlines_87_08\BigAir3.xdf
Number of observations: 123534969
Number of variables: 31
Number of blocks: 833
Variable information:
Var 1: Year, Type: integer, Low/High: (1987, 2008)
Var 2: Month
       12 factor levels: January February March April May ... August September October November December
Var 3: DayofMonth, Type: integer, Low/High: (1, 31)
Var 4: DayOfWeek
       7 factor levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Var 5: DepTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.5000)
Var 6: CRSDepTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 7: ArrTime, Type: numeric, Storage: float32, Low/High: (0.0167, 29.9167)
Var 8: CRSArrTime, Type: numeric, Storage: float32, Low/High: (0.0000, 24.0000)
Var 9: UniqueCarrier
       29 factor levels: 9E AA AQ AS B6 ... UA US WN XE YV
Var 10: FlightNum
       8160 factor levels: 1 10 100 1000 1001 ... 995 996 997 998 999
Var 11: TailNum, Type: numeric, Storage: float32, Low/High: (0.0000, 715.0000)
Var 12: ActualElapsedTime, Type: integer, Low/High: (-719, 1883)
Var 13: CRSElapsedTime, Type: integer, Low/High: (-1240, 1613)
Var 14: AirTime, Type: integer, Low/High: (-3818, 3508)
Var 15: ArrDelay, Type: integer, Low/High: (-1437, 2598)
Var 16: DepDelay, Type: integer, Low/High: (-1410, 2601)
Var 17: Origin
       347 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM
Var 18: Dest
       352 factor levels: ABE ABI ABQ ABY ACK ... XNA YAK YAP YKM YUM
Var 19: Distance, Type: integer, Low/High: (0, 4983)
Var 20: TaxiIn, Type: integer, Low/High: (0, 1523)
Var 21: TaxiOut, Type: integer, Low/High: (0, 3905)
Var 22: Cancelled, Type: logical, Low/High: (0, 1)
Var 23: CancellationCode
       5 factor levels: NA carrier weather NAS security
Var 24: Diverted, Type: logical, Low/High: (0, 1)
Var 25: CarrierDelay, Type: integer, Low/High: (0, 2580)
Var 26: WeatherDelay, Type: integer, Low/High: (0, 1510)
Var 27: NASDelay, Type: integer, Low/High: (-60, 1392)
Var 28: SecurityDelay, Type: integer, Low/High: (0, 533)
Var 29: LateAircraftDelay, Type: integer, Low/High: (0, 1407)

Note that the 22 .csv files that comprise the Airlines data set are available from RITA, the Research and Innovative Technology Administration of the US Department of Transportation, along with data for more recent time periods.
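
For readers without RevoScaleR, one generic way to work with a set of yearly .csv files that is too large to load at once is to stream each file in chunks and accumulate running totals. The following is a minimal base R sketch of that idea, not the RevoScaleR approach. So that it runs as written, it first creates two tiny synthetic files; the values are made up, the column names are borrowed from the variable listing, and the helper name summarise_delays and the chunk size are assumptions chosen for illustration.

```r
# Create two tiny synthetic "yearly" files as stand-ins for the real CSVs.
# Column names follow the Airlines variable listing; the values are invented.
tmp_dir <- tempfile("airlines_")
dir.create(tmp_dir)
set.seed(1)
for (yr in c(1987, 1988)) {
  fake <- data.frame(
    Year          = yr,
    UniqueCarrier = sample(c("AA", "UA", "WN"), 10, replace = TRUE),
    ArrDelay      = sample(c(-5:60, NA), 10, replace = TRUE)
  )
  write.csv(fake, file.path(tmp_dir, paste0(yr, ".csv")), row.names = FALSE)
}

# Hypothetical helper: read one CSV in chunks of `chunk_rows` lines and
# accumulate the count and sum of non-missing ArrDelay values per carrier.
summarise_delays <- function(path, chunk_rows = 4) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  totals <- list()
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) break            # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    ok <- !is.na(chunk$ArrDelay)
    for (carrier in unique(chunk$UniqueCarrier[ok])) {
      d   <- chunk$ArrDelay[ok & chunk$UniqueCarrier == carrier]
      old <- if (is.null(totals[[carrier]])) c(n = 0, sum = 0) else totals[[carrier]]
      totals[[carrier]] <- old + c(n = length(d), sum = sum(d))
    }
  }
  totals
}

# Merge the per-file totals, then compute the mean arrival delay per carrier.
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
combined <- Reduce(function(a, b) {
  for (k in names(b)) {
    a[[k]] <- if (is.null(a[[k]])) b[[k]] else a[[k]] + b[[k]]
  }
  a
}, lapply(files, summarise_delays))

mean_delay <- sapply(combined, function(x) x[["sum"]] / x[["n"]])
print(round(mean_delay, 1))
```

Only one chunk is held in memory at a time, so the same pattern should scale to the real files with a much larger chunk size, although it will be far slower than a purpose-built external memory algorithm.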

A smaller but still very useful file for machine learning applications contains Medicare data and was used in an R-bloggers post highlighting bigglm and ffbase. This file contains almost 3 million rows and eleven variables.

Graham Williams and others (me included) have made good use of the small version of the Australian weather file included in his rattle R package. However, in an appendix of his book Data Mining with Rattle and R, Graham points the way to the Australian government site, which makes the data available in what Hadley Wickham might call a “tidy” format. (The data are not “clean” but they are in good enough shape to work with.) The following chart was built with rattle from Canberra data collected between March and July of this year. Code to access and clean the file a bit, based on code Graham provides in his book, is available here: Download Code to clean weather data.

The moderately large airline “Edge” data set (3.5 million records), along with the airports and their locations data set, both available without charge from infochimps, provided the occasion for a slightly more elaborate data shaping and cleaning effort using RevoScaleR functions. One way to do this is documented in the RevoScaleR Data Step White Paper.

As a final example of R-friendly data sets, have a look at those that Max Kuhn and Kjell Johnson have wrapped into the R package AppliedPredictiveModeling, which they wrote to support their Springer book of the same name. This package offers a number of interesting small data sets, including segmentationOriginal, which provides measurements on cell body features and has over 2,000 observations and 100 variables.
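
As a quick illustration of how such packaged data sets are accessed, the short sketch below loads segmentationOriginal and prints its dimensions. It assumes the AppliedPredictiveModeling package has already been installed from CRAN, and it assumes the data set carries a Case column marking the authors' training and test rows; the guard simply prints a hint when the package is not available.

```r
# Assumes AppliedPredictiveModeling is installed; the guard avoids an error
# when it is not.
if (requireNamespace("AppliedPredictiveModeling", quietly = TRUE)) {
  data(segmentationOriginal, package = "AppliedPredictiveModeling")
  print(dim(segmentationOriginal))         # number of rows and columns
  print(table(segmentationOriginal$Case))  # assumed Train/Test indicator
} else {
  message("Install with install.packages(\"AppliedPredictiveModeling\")")
}
```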

The data sets on the Inside-R list cover quite a bit of ground; however, I am sure that there is much more out there that should be on the list. We at Revolution Analytics would very much appreciate learning about what we have missed. Many thanks to everyone who has provided data sets or contributed to the Inside-R list in some other way.

by Joseph Rickert