dplyr verbs

Ranae Dietzel & Andee Kaplan

Verbs, that’s what’s happening!

Filter

filter() allows you to select a subset of rows in a data frame. Just like WHERE statements in SQL.

filter(baseball, team == "HOU" & year == 2005)
##          id year stint team lg   g  ab  r   h X2b X3b hr rbi sb cs bb so
## 1 sprinru01 2005     1  HOU NL  62   0  0   0   0   0  0   0  0  0  0  0
## 2 francjo01 2005     1  HOU NL  31   0  0   0   0   0  0   0  0  0  0  0
## 3 clemero02 2005     1  HOU NL  32  58  2  12   2   0  0   4  0  0  5 18
## 4 vizcajo01 2005     1  HOU NL  98 187 15  46  10   2  1  23  2  0 15 40
## 5 biggicr01 2005     1  HOU NL 155 590 94 156  40   1 26  69 11  1 37 90
## 6 bagweje01 2005     1  HOU NL  39 100 11  25   4   0  3  19  0  0 18 21
## 7 ausmubr01 2005     1  HOU NL 134 387 35 100  19   0  3  47  5  3 51 48
##   ibb hbp sh sf gidp
## 1   0   0  0  0    0
## 2   0   0  0  0    0
## 3   0   1  5  0    1
## 4   4   0  1  2    2
## 5   2  17  4  3   10
## 6   1   1  0  4    2
## 7   8   5  7  1   17

Arrange

arrange() reorders rows. It takes a data frame, and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. Just like ORDER BY in SQL.

astros_ws <- filter(baseball, team == "HOU" & year == 2005)
arrange(astros_ws, desc(hr), desc(rbi), g)
##          id year stint team lg   g  ab  r   h X2b X3b hr rbi sb cs bb so
## 1 biggicr01 2005     1  HOU NL 155 590 94 156  40   1 26  69 11  1 37 90
## 2 ausmubr01 2005     1  HOU NL 134 387 35 100  19   0  3  47  5  3 51 48
## 3 bagweje01 2005     1  HOU NL  39 100 11  25   4   0  3  19  0  0 18 21
## 4 vizcajo01 2005     1  HOU NL  98 187 15  46  10   2  1  23  2  0 15 40
## 5 clemero02 2005     1  HOU NL  32  58  2  12   2   0  0   4  0  0  5 18
## 6 francjo01 2005     1  HOU NL  31   0  0   0   0   0  0   0  0  0  0  0
## 7 sprinru01 2005     1  HOU NL  62   0  0   0   0   0  0   0  0  0  0  0
##   ibb hbp sh sf gidp
## 1   2  17  4  3   10
## 2   8   5  7  1   17
## 3   1   1  0  4    2
## 4   4   0  1  2    2
## 5   0   1  5  0    1
## 6   0   0  0  0    0
## 7   0   0  0  0    0

Select

Often you work with large datasets with many columns but only a few are actually of interest to you. select() allows you show only the columns you are interested. Just like SELECT in SQL.

select(astros_ws, id, hr, rbi, g)
##          id hr rbi   g
## 1 sprinru01  0   0  62
## 2 francjo01  0   0  31
## 3 clemero02  0   4  32
## 4 vizcajo01  1  23  98
## 5 biggicr01 26  69 155
## 6 bagweje01  3  19  39
## 7 ausmubr01  3  47 134

Helper functions: starts_with(), ends_with(), matches() and contains()

Distinct

distinct() finds unique values in a table. Just like SELECT DISTINCT in SQL.

head(distinct(select(baseball, team)))
##   team
## 1  RC1
## 2  WS3
## 3  FW1
## 4  NY2
## 5  CL1
## 6  TRO

Mutate

mutate() adds new columns that are the functions of existing columns

head(mutate(baseball, ba = h/ab))
##          id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb
## 1 ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA
## 2 forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA
## 3 mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA
## 4 startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA
## 5 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA
## 6 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA
##   hbp sh sf gidp        ba
## 1  NA NA NA   NA 0.3250000
## 2  NA NA NA   NA 0.2777778
## 3  NA NA NA   NA 0.2696629
## 4  NA NA NA   NA 0.3602484
## 5  NA NA NA   NA 0.3515625
## 6  NA NA NA   NA 0.3219178

The pipe

The pipe operator in dplyr actually comes from a package called magrittr, and is meant to simplify your code into readable, sequential operators (almost like stringing together actions into a story).

x %>% f()

The above code can be read as “Take x, and then do f”. The %>% operator takes whatever is to the left of it and plugs it in as the first argument in the function to the right. So, this is equivalent to

f(x)

The Treachery of Images

This allows you to string together operations sequentially.

baseball %>%
  filter(team == "HOU" & year == 2005) %>% 
  select(id, hr, rbi, g) %>%
  arrange(desc(hr), desc(rbi)) %>%
  head(3)
##          id hr rbi   g
## 1 biggicr01 26  69 155
## 2 ausmubr01  3  47 134
## 3 bagweje01  3  19  39

Your turn

Use dplyr to do the following

  1. Calculate the career batting average for each player.
  2. Filter out players that have zero at bats before calculating batting average
  3. Arrange your results by batting average to find the player with the best career batting average
  4. Limit your results to only players who have played in at least 100 games.

Joins

dplyr also has joins implemented.

You can control the variables to join on by using by = "common_name" or by = c("column_a" = "column_b", ...) as a parameter in the join function.