Scalable Data Analysis in R
R is a great environment for interactive analysis on your desktop, but when your data needs outgrow your personal computer, it’s not clear what to do next.
I’ve put together material for a day-long tutorial on scalable data analysis in R. It covers:
- A brief introduction to R for those coming from a Python background;
- The bigmemory package for out-of-core computation on large data matrices, with a simple physical sciences example;
- The standard parallel package, which incorporates what were formerly the snow and multicore packages, using airline data as an example;
- The foreach package, using airline data and simple stock data;
- The Rdsm package for shared memory; and
- A brief introduction to the powerful pbdR packages for extremely large-scale computation.
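To give a flavour of the two styles that the parallel package inherited from snow and multicore, here is a minimal sketch. It is not taken from the tutorial slides: the function `sim_mean`, the worker count of 2, and the seed are hypothetical choices made only for illustration.

```r
library(parallel)  # ships with base R, no installation needed

# Hypothetical workload (an assumption, not from the tutorial):
# the mean of a large standard-normal sample.
sim_mean <- function(i, n = 1e4) {
  mean(rnorm(n))
}

# snow-style: start separate worker processes; works on all platforms.
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 42)  # reproducible random streams on the workers
res_snow <- parLapply(cl, 1:8, sim_mean)
stopCluster(cl)

# multicore-style: fork the current process; forking is unavailable on
# Windows, where mc.cores must be 1.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res_fork <- mclapply(1:8, sim_mean, mc.cores = cores)
```

Both calls return a list with one element per task, just as `lapply` would. The cluster approach is more portable; the forking approach avoids copying data to the workers up front.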
The presentation is written in R Markdown (so it includes the source code) and lives in the presentation directory; you can read the rendered presentation there as Markdown or as a PDF.
The R code from the slides can be found in the R directory.
Some data can be found in the data directory, but as you might expect in a workshop on scalable data analysis, the files are quite large! Mostly you will find scripts for downloading the data; running make in the main directory will pull almost everything down, but a little more work needs to go into automating the production of some of the data products used.
Suggestions are, as always, very welcome.