Blog7

Posted by Phil Colgan

The split-apply-combine strategy involves breaking a complex data set into more manageable pieces, working on each piece separately, and then combining the results back together. Several languages have tools that support this pattern, but this paper describes the plyr package in R. The paper walks through the specific code used to split up input data of varying dimension and type (arrays, lists, data frames), similar to subsetting data in SQL. plyr is beneficial because it makes working through large datasets simpler. Compared to for-loops, plyr commands are more intuitive and much shorter, which reduces the chance of coding errors and makes the data easier to work with.
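To make the three steps concrete, here is a minimal sketch of the strategy itself, written in plain Python rather than plyr. The data and the function names (`split`, `apply_to_groups`, `combine`) are made up for illustration and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (group, value) pairs invented for illustration.
records = [
    ("a", 1.0),
    ("b", 4.0),
    ("a", 3.0),
    ("b", 6.0),
    ("c", 5.0),
]


def split(rows):
    """Split: bucket the values by their group key."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return groups


def apply_to_groups(groups, func):
    """Apply: run the same function on each piece independently."""
    return {key: func(values) for key, values in groups.items()}


def combine(results):
    """Combine: gather the per-group results into one sorted table."""
    return sorted(results.items())


summary = combine(apply_to_groups(split(records), mean))
print(summary)  # [('a', 2.0), ('b', 5.0), ('c', 5.0)]
```

Each step is its own small function here only to keep the three stages visible. In plyr, the same grouped mean would typically be a single call such as `ddply(df, .(group), summarise, avg = mean(value))`, assuming a data frame `df` with `group` and `value` columns, which is where the shorter, less error-prone code comes from.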

Prompt:

Read Hadley Wickham’s Split-Apply-Combine paper here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.182.5667&rep=rep1&type=pdf

What is the split-apply-combine strategy and how is it used in the process of data analysis? Where have we already seen/used the split-apply-combine strategy in this class? What are some advantages of using the split-apply-combine strategy?