Joins in R

Ranae Dietzel & Andee Kaplan

Throwback

Recall joins

Your turn

Discuss the result of the following joins using the tables:

  1. Inner join of A and B on id
  2. Inner join of A and B on id and group
  3. Left join of A and B on id
  4. Right join of A and B on id
  5. Outer join of A and B on id

Joins in R

dplyr also has joins implemented.

You can control the variables to join on by using by = "common_name" or by = c("column_a" = "column_b", ...) as a parameter in the join function.

Revisiting the Star Wars universe

I have the Star Wars DB in an R data object (rather than a SQLite database) available so that we can

  1. Refresh our memories on joins
  2. Learn the dplyr syntax
library(dplyr)
load("data/star_wars.RData")

What is the most highly populated planet documented in the universe?

planet %>%
  mutate(max_pop = max(population, na.rm = TRUE)) %>%
  filter(population == max_pop) %>%
  select(name)
##        name
## 1 Coruscant

Which planet has the most named characters from the films? What species is native to this planet?

person %>%
  group_by(homeworld_id) %>%
  summarise(count = n()) %>%
  inner_join(planet, by = c("homeworld_id" = "planet_id")) %>%
  inner_join(species, by = c("homeworld_id")) %>%
  select(name.x, count, name.y) %>%
  arrange(desc(count)) %>%
  head(1)
## # A tibble: 1 × 3
##   name.x count name.y
##    <chr> <int>  <chr>
## 1  Naboo    11 Gungan

Who has piloted the most spacecraft (vehicles and starships) in the universe (documented)?

pilot %>%
  group_by(pilot_id) %>%
  summarise(count = n()) %>%
  inner_join(person, by=c("pilot_id" = "person_id")) %>%
  arrange(desc(count)) %>%
  select(name, count) %>%
  head(1)
## # A tibble: 1 × 2
##             name count
##            <chr> <int>
## 1 Obi-Wan Kenobi     6

What about just vehicles?

pilot %>%
  filter(craft_type == "vehicle") %>%
  group_by(pilot_id) %>%
  summarise(count = n()) %>%
  inner_join(person, by=c("pilot_id" = "person_id")) %>%
  arrange(desc(count)) %>%
  select(name, count) %>%
  head(2)
## # A tibble: 2 × 2
##               name count
##              <chr> <int>
## 1   Luke Skywalker     2
## 2 Anakin Skywalker     2

Your turn

Use dplyr to answer the following

  1. How many people have never piloted a spacecraft according to this data?
  2. How many spacecraft have no documented pilots? What are their names?