Posted by Jared Flater
SAC, or split-apply-combine, is a data analysis strategy that is useful whenever breaking the data into smaller chunks makes the analysis easier. The split is usually followed by applying the analysis to every chunk and then assembling the analyzed chunks into something useful.
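The three steps can be sketched in base R without any packages. This is only a minimal illustration: the `samples` data frame and its `site` and `value` columns are made up for the example.

```r
# Hypothetical toy data: one measurement per sample, grouped by site.
samples <- data.frame(
  site  = c("A", "A", "B", "B", "C"),
  value = c(1, 3, 10, 20, 7)
)

# Split: one data frame per site.
chunks <- split(samples, samples$site)

# Apply: summarise each chunk independently of the others.
summaries <- lapply(chunks, function(chunk) {
  data.frame(site = chunk$site[1], mean_value = mean(chunk$value))
})

# Combine: stack the per-chunk results back into a single data frame.
result <- do.call(rbind, summaries)
print(result)
```

Each chunk is processed on its own, so a failure in one chunk can be investigated without rerunning the rest.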
The SAC strategy applies to data sets where an action may be better off performed on a small subset of the data. For my use, this matters because of the size of our data, often in the tens of gigabytes. Imagine applying a function that takes days to run through all the samples. What if you get an error at the end of those days? Wouldn’t it make more sense to apply your function to a small set of data that is digestible in hours? Human time is more valuable than computer time, so it’s apparent that the SAC strategy could be used to make better use of time.
Another benefit of Hadley’s plyr package and the SAC strategy is simpler code. When the same transformation has to be applied across many variables in a data set, a for loop is often used. However, plyr allows a much simpler chunk of code to achieve the same result.
For loop example:
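As a minimal sketch of the loop-based approach, assuming a made-up data frame `df` with hypothetical `group` and `height` columns (this is an illustration, not the original listing), a per-group mean might be computed like this:

```r
# Hypothetical toy data: heights recorded for two groups.
df <- data.frame(
  group  = c("a", "a", "b", "b"),
  height = c(10, 20, 30, 50)
)

# Pre-allocate the output, one row per group.
groups <- unique(df$group)
loop_result <- data.frame(group = groups, mean_height = NA_real_)

# Visit each group in turn, pick out its rows, and store the mean.
for (i in seq_along(groups)) {
  rows <- df$group == groups[i]
  loop_result$mean_height[i] <- mean(df$height[rows])
}
print(loop_result)
```

The bookkeeping (pre-allocating the result, indexing by `i`, subsetting rows) is all written by hand.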
plyr example:
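A minimal sketch of the plyr approach, assuming the plyr package is installed and using a made-up data frame `df` with hypothetical `group` and `height` columns (an illustration, not the original listing):

```r
library(plyr)

# Hypothetical toy data: heights recorded for two groups.
df <- data.frame(
  group  = c("a", "a", "b", "b"),
  height = c(10, 20, 30, 50)
)

# ddply splits df by group, applies summarise to each piece,
# and combines the pieces back into one data frame.
plyr_result <- ddply(df, .(group), summarise, mean_height = mean(height))
print(plyr_result)
```

The split, apply, and combine steps all happen inside the single `ddply` call, so there is no loop index or pre-allocated result to manage.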
The advantages of this strategy, or at least a subset of it, became apparent to me as I tackled our homework for next week. We are given two data sets, one for DC and one for Marvel comics, and asked to investigate them. To facilitate my analysis (comparison) of these two publishers, I wanted to combine the two data sets into one data frame while adding a column to identify DC vs. Marvel. I used the rbind function to achieve this:
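A minimal sketch of that step, assuming two small stand-in data frames (`dc` and `marvel`, with made-up `name` and `appearances` columns and placeholder values rather than the real homework data):

```r
# Hypothetical stand-ins for the two homework data sets (placeholder values).
dc <- data.frame(
  name        = c("dc_hero_1", "dc_hero_2"),
  appearances = c(10, 20)
)
marvel <- data.frame(
  name        = c("marvel_hero_1", "marvel_hero_2"),
  appearances = c(30, 40)
)

# Tag each data set with its publisher before combining.
dc$publisher     <- "DC"
marvel$publisher <- "Marvel"

# rbind stacks the rows; both data frames must have the same column names.
comics <- rbind(dc, marvel)
print(comics)
```

With the `publisher` column in place, the combined data frame can be split by publisher for any per-publisher comparison.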