Ranae Dietzel & Andee Kaplan
filter()
allows you to select a subset of rows in a data frame. Just like WHERE
statements in SQL.
filter(baseball, team == "HOU" & year == 2005)
## id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so
## 1 sprinru01 2005 1 HOU NL 62 0 0 0 0 0 0 0 0 0 0 0
## 2 francjo01 2005 1 HOU NL 31 0 0 0 0 0 0 0 0 0 0 0
## 3 clemero02 2005 1 HOU NL 32 58 2 12 2 0 0 4 0 0 5 18
## 4 vizcajo01 2005 1 HOU NL 98 187 15 46 10 2 1 23 2 0 15 40
## 5 biggicr01 2005 1 HOU NL 155 590 94 156 40 1 26 69 11 1 37 90
## 6 bagweje01 2005 1 HOU NL 39 100 11 25 4 0 3 19 0 0 18 21
## 7 ausmubr01 2005 1 HOU NL 134 387 35 100 19 0 3 47 5 3 51 48
## ibb hbp sh sf gidp
## 1 0 0 0 0 0
## 2 0 0 0 0 0
## 3 0 1 5 0 1
## 4 4 0 1 2 2
## 5 2 17 4 3 10
## 6 1 1 0 4 2
## 7 8 5 7 1 17
arrange()
reorders rows. It takes a data frame, and a set of column names to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns. Just like ORDER BY
in SQL.
astros_ws <- filter(baseball, team == "HOU" & year == 2005)
arrange(astros_ws, desc(hr), desc(rbi), g)
## id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so
## 1 biggicr01 2005 1 HOU NL 155 590 94 156 40 1 26 69 11 1 37 90
## 2 ausmubr01 2005 1 HOU NL 134 387 35 100 19 0 3 47 5 3 51 48
## 3 bagweje01 2005 1 HOU NL 39 100 11 25 4 0 3 19 0 0 18 21
## 4 vizcajo01 2005 1 HOU NL 98 187 15 46 10 2 1 23 2 0 15 40
## 5 clemero02 2005 1 HOU NL 32 58 2 12 2 0 0 4 0 0 5 18
## 6 francjo01 2005 1 HOU NL 31 0 0 0 0 0 0 0 0 0 0 0
## 7 sprinru01 2005 1 HOU NL 62 0 0 0 0 0 0 0 0 0 0 0
## ibb hbp sh sf gidp
## 1 2 17 4 3 10
## 2 8 5 7 1 17
## 3 1 1 0 4 2
## 4 4 0 1 2 2
## 5 0 1 5 0 1
## 6 0 0 0 0 0
## 7 0 0 0 0 0
Often you work with large datasets with many columns but only a few are actually of interest to you. select()
allows you show only the columns you are interested. Just like SELECT
in SQL.
select(astros_ws, id, hr, rbi, g)
## id hr rbi g
## 1 sprinru01 0 0 62
## 2 francjo01 0 0 31
## 3 clemero02 0 4 32
## 4 vizcajo01 1 23 98
## 5 biggicr01 26 69 155
## 6 bagweje01 3 19 39
## 7 ausmubr01 3 47 134
Helper functions: starts_with()
, ends_with()
, matches()
and contains()
distinct()
finds unique values in a table. Just like SELECT DISTINCT
in SQL.
head(distinct(select(baseball, team)))
## team
## 1 RC1
## 2 WS3
## 3 FW1
## 4 NY2
## 5 CL1
## 6 TRO
mutate()
adds new columns that are the functions of existing columns
head(mutate(baseball, ba = h/ab))
## id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb
## 1 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA
## 2 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA
## 3 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA
## 4 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA
## 5 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA
## 6 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA
## hbp sh sf gidp ba
## 1 NA NA NA NA 0.3250000
## 2 NA NA NA NA 0.2777778
## 3 NA NA NA NA 0.2696629
## 4 NA NA NA NA 0.3602484
## 5 NA NA NA NA 0.3515625
## 6 NA NA NA NA 0.3219178
The pipe operator in dplyr
actually comes from a package called magrittr
, and is meant to simplify your code into readable, sequential operators (almost like stringing together actions into a story).
x %>% f()
The above code can be read as “Take x
, and then do f
”. The %>%
operator takes whatever is to the left of it and plugs it in as the first argument in the function to the right. So, this is equivalent to
f(x)
This allows you to string together operations sequentially.
baseball %>%
filter(team == "HOU" & year == 2005) %>%
select(id, hr, rbi, g) %>%
arrange(desc(hr), desc(rbi)) %>%
head(3)
## id hr rbi g
## 1 biggicr01 26 69 155
## 2 ausmubr01 3 47 134
## 3 bagweje01 3 19 39
Use dplyr
to do the following
dplyr
also has joins implemented.
left_join()
right_join()
inner_join()
full_join()
You can control the variables to join on by using by = "common_name"
or by = c("column_a" = "column_b", ...)
as a parameter in the join function.