The tidyverse continues

Tibbles

Like a dataframe, but does less

Never changes the type of inputs
Never changes the names of variables
Never creates row names
Is not as picky about column names (accepts spaces or names beginning with numbers)
- These must be surrounded by backticks

Tibbles

When you print a tibble, you see

First 10 rows
Type of column
Only the columns that fit on the screen (other columns listed)
- If you want to see all the columns use width = Inf.

Tidy data

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

Why tidy data

Consistent data structure allows easier learning of related tools because they have an underlying uniformity
Placing variables in columns takes advantage of R’s vectorized nature
ggplot2 and dplyr are best friends with tidy data

Very easy to find data that is well organized, but not tidy

## # A tibble: 5 × 5
##   property five_cm ten_cm fifteen_cm twenty_cm
##      <chr>   <dbl>  <dbl>      <dbl>     <dbl>
## 1       pH       7    6.7        6.5       6.4
## 2      NO3      35   22.0       20.0      18.0
## 3     clay      20   20.0       20.0      20.0
## 4     silt      20   20.0       20.0      20.0
## 5     sand      60   60.0       60.0      60.0

Plot the change in pH across depth?

`gather`

When column names are values instead of variables, we need to gather

## # A tibble: 5 × 5
##   property five_cm ten_cm fifteen_cm twenty_cm
##      <chr>   <dbl>  <dbl>      <dbl>     <dbl>
## 1       pH       7    6.7        6.5       6.4
## 2      NO3      35   22.0       20.0      18.0
## 3     clay      20   20.0       20.0      20.0
## 4     silt      20   20.0       20.0      20.0
## 5     sand      60   60.0       60.0      60.0

We need to know the columns that are the values that need to be incorporated
The name of the variable for the values, here it is depth. This is the key.

`gather`

## # A tibble: 5 × 5
##   property five_cm ten_cm fifteen_cm twenty_cm
##      <chr>   <dbl>  <dbl>      <dbl>     <dbl>
## 1       pH       7    6.7        6.5       6.4
## 2      NO3      35   22.0       20.0      18.0
## 3     clay      20   20.0       20.0      20.0
## 4     silt      20   20.0       20.0      20.0
## 5     sand      60   60.0       60.0      60.0

The name of the variable whose values are spread over the cells. This is the value. If you know what it is, name it (NO3ppm). If you have many mixed units, you can use value = value or value = measured etc.

`gather`

tidy_soil<-gather(soil, five_cm, ten_cm, fifteen_cm, twenty_cm, 
                  key=depth, value=value)
tidy_soil

## # A tibble: 20 × 3
##    property      depth value
##       <chr>      <chr> <dbl>
## 1        pH    five_cm   7.0
## 2       NO3    five_cm  35.0
## 3      clay    five_cm  20.0
## 4      silt    five_cm  20.0
## 5      sand    five_cm  60.0
## 6        pH     ten_cm   6.7
## 7       NO3     ten_cm  22.0
## 8      clay     ten_cm  20.0
## 9      silt     ten_cm  20.0
## 10     sand     ten_cm  60.0
## 11       pH fifteen_cm   6.5
## 12      NO3 fifteen_cm  20.0
## 13     clay fifteen_cm  20.0
## 14     silt fifteen_cm  20.0
## 15     sand fifteen_cm  60.0
## 16       pH  twenty_cm   6.4
## 17      NO3  twenty_cm  18.0
## 18     clay  twenty_cm  20.0
## 19     silt  twenty_cm  20.0
## 20     sand  twenty_cm  60.0

filter(tidy_soil, property == "pH")%>%
ggplot(aes(x=depth, y=value))+
  geom_point(size=3)

Almost!

`spread`

Use this if one of your columns is full of what could be headers

## # A tibble: 6 × 4
##    Year Commodity                         `Data Item`    Value
##   <int>     <chr>                               <chr>    <dbl>
## 1  2015      OATS              OATS - ACRES HARVESTED    57000
## 2  2015      OATS                OATS - ACRES PLANTED   125000
## 3  2015      OATS    OATS - PRODUCTION, MEASURED IN $ 11027000
## 4  2015      OATS   OATS - PRODUCTION, MEASURED IN BU  4161000
## 5  2015      OATS OATS - YIELD, MEASURED IN BU / ACRE       73
## 6  2014      OATS              OATS - ACRES HARVESTED    55000

Now column that includes the variable names is key, here it is Data Item.
Column that contains the values is the value column, here they already have it as Values

spread(oats, key = `Data Item`, value = Value)

## # A tibble: 150 × 8
##     Year Commodity `OATS - ACRES HARVESTED` `OATS - ACRES PLANTED`
## *  <int>     <chr>                    <dbl>                  <dbl>
## 1   1866      OATS                   450000                     NA
## 2   1867      OATS                   500000                     NA
## 3   1868      OATS                   560000                     NA
## 4   1869      OATS                   610000                     NA
## 5   1870      OATS                   660000                     NA
## 6   1871      OATS                   750000                     NA
## 7   1872      OATS                   850000                     NA
## 8   1873      OATS                   910000                     NA
## 9   1874      OATS                   960000                     NA
## 10  1875      OATS                  1075000                     NA
## # ... with 140 more rows, and 4 more variables: `OATS - PRODUCTION,
## #   MEASURED IN $` <dbl>, `OATS - PRODUCTION, MEASURED IN BU` <dbl>, `OATS
## #   - YIELD, MEASURED IN BU / ACRE` <dbl>, `OATS - YIELD, MEASURED IN BU /
## #   NET PLANTED ACRE` <dbl>

spread(oats, key = `Data Item`, value = Value)%>%
ggplot(aes(y=`OATS - YIELD, MEASURED IN BU / ACRE`,
           x = `OATS - ACRES HARVESTED`, label = Year))+
        geom_text()

Our turn

## # A tibble: 4 × 7
##     trt block  part   may  june  july august
##   <chr> <chr> <chr> <dbl> <dbl> <dbl>  <dbl>
## 1     0     A  leaf   5.5   7.5  12.5   12.5
## 2   100     A  stem   5.7   6.7   7.7    7.7
## 3   130     A fruit   0.0   2.0   8.2    8.8
## 4   150     A  root  12.0  14.0  22.5   22.0

Which columns need to be incorporated?
What is the key?
What is the value?

gather(biomass, may, june, july, august, key = month, value = g_m2)

## # A tibble: 16 × 5
##      trt block  part  month  g_m2
##    <chr> <chr> <chr>  <chr> <dbl>
## 1      0     A  leaf    may   5.5
## 2    100     A  stem    may   5.7
## 3    130     A fruit    may   0.0
## 4    150     A  root    may  12.0
## 5      0     A  leaf   june   7.5
## 6    100     A  stem   june   6.7
## 7    130     A fruit   june   2.0
## 8    150     A  root   june  14.0
## 9      0     A  leaf   july  12.5
## 10   100     A  stem   july   7.7
## 11   130     A fruit   july   8.2
## 12   150     A  root   july  22.5
## 13     0     A  leaf august  12.5
## 14   100     A  stem august   7.7
## 15   130     A fruit august   8.8
## 16   150     A  root august  22.0

spread(biomass_tidy, key = part, value = g_m2)

## # A tibble: 16 × 7
##      trt block  month fruit  leaf  root  stem
## *  <chr> <chr>  <chr> <dbl> <dbl> <dbl> <dbl>
## 1      0     A august    NA  12.5    NA    NA
## 2      0     A   july    NA  12.5    NA    NA
## 3      0     A   june    NA   7.5    NA    NA
## 4      0     A    may    NA   5.5    NA    NA
## 5    100     A august    NA    NA    NA   7.7
## 6    100     A   july    NA    NA    NA   7.7
## 7    100     A   june    NA    NA    NA   6.7
## 8    100     A    may    NA    NA    NA   5.7
## 9    130     A august   8.8    NA    NA    NA
## 10   130     A   july   8.2    NA    NA    NA
## 11   130     A   june   2.0    NA    NA    NA
## 12   130     A    may   0.0    NA    NA    NA
## 13   150     A august    NA    NA  22.0    NA
## 14   150     A   july    NA    NA  22.5    NA
## 15   150     A   june    NA    NA  14.0    NA
## 16   150     A    may    NA    NA  12.0    NA

Your turn

Look at the NASS soybean data from 1900 to 2015.
Spread the data and make an XY plot of soybean acres planted vs soybean acres harvested.
Gather the data and make a plot of how all of the “Data Item” variables change over time.

The tidyverse continues

tidyr

Today’s lecture brought to you by

Tibbles

Tibbles

Tidy data

Why tidy data

Very easy to find data that is well organized, but not tidy

`gather`

`gather`

`gather`

`spread`

Our turn

Your turn