Beyond Single-Core R: Parallel Data Analysis
I was recently asked to give a short presentation on parallel computing in R for the Greater Toronto R Users Group. My slides can be seen below or on GitHub, where the complete materials can be found.
I covered much of the same ground as in a half-day workshop I gave a couple of years earlier (though, obviously, without the hands-on component):
- How to think about parallelism and scalability in data analysis
- The standard parallel package, including what used to be the snow and multicore facilities, using airline data as an example
- The foreach package, using airline data and simple stock data
- A summary of best practices

There was also some bonus material tacked on at the end, touching on a couple of advanced topics.
I was quite surprised at how little had changed since late 2014, other than further development of SparkR (which I didn’t cover) and the interesting but seemingly little-used future package. I was also struck by how hard it is to find similar materials online that cover a range of parallel computing topics in R. They are rare enough that even this simple effort made it to the HPC task view on CRAN (under “related links”). R continues to grow in popularity for data analysis; is this all desktop computing? Is Spark siphoning off the clustered-dataframe usage?
(This was also my first time using RPres in RStudio. Wow, not a fan; RPres was not ready for general release. And I’m a big fan of RMarkdown.)