Blog 5 - Tidy data

Posted by Phil Colgan

The paper “Tidy Data” by Hadley Wickham describes a framework and tools for structuring datasets so that they are easier to work with and more informative. Wickham points out that several methods for working with messy data have already been described, but they are often written in incompatible languages, which makes them inaccessible to many data users. In response, Wickham proposes an approach to keeping data “tidy” that is designed to be approachable for statisticians.

Tidy data is formatted in a way that resembles Codd’s 3rd normal form, a topic we have discussed while learning about relational databases. The difference is that tidy data focuses on a single dataset rather than the many connected tables typical of a relational database. Each variable forms a column, each observation forms a row, and each type of observational unit forms a table. This consistent structure lets you extract the variables you need in fewer steps, which leaves less room for error.
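To make the three rules concrete, here is a minimal sketch in Python using pandas. The paper itself does not use pandas, and the table, column names, and values below are made up purely for illustration. The "messy" version stores one variable (the treatment) in the column headers; melting it gives one column per variable and one row per observation.

```python
import pandas as pd

# Hypothetical messy table (illustrative values only): the treatment
# variable is hidden in the column headers instead of having its own column.
messy = pd.DataFrame({
    "person": ["Ann", "Bob", "Cal"],
    "treatment_a": [10, 12, 9],
    "treatment_b": [14, 11, 13],
})

# Reshape so each variable is a column and each observation is a row.
tidy = messy.melt(id_vars="person", var_name="treatment", value_name="result")

# Strip the header prefix so the column holds just the treatment label.
tidy["treatment"] = tidy["treatment"].str.replace("treatment_", "", regex=False)

# Once the data is tidy, pulling out one observation is a single filter.
bob_b = tidy.loc[
    (tidy["person"] == "Bob") & (tidy["treatment"] == "b"), "result"
].iloc[0]
```

In the tidy version, every question about the data becomes a filter on named columns, rather than a lookup that depends on knowing which header holds which treatment.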