R + Databases

Ranae Dietzel and Andee Kaplan

R + Databases

Situation

  1. You have a wonderful relational DB implemented with all of your beautiful data.
  2. You love using R and R markdown to analyse your data.

Question: How to analyze your relational database data in R?

Options

There are many options for you. Here are 2:

  1. Database backend with dplyr
  2. Run SQL queries with DBI

dplyr

Currently dplyr supports the three most popular open source databases (sqlite, mysql and postgresql), and google’s bigquery. If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle.

The reasons you’d want to use dplyr with a database are

  1. either your data is already in a database (and you don’t want to work with static csv files that someone else has dumped out for you), or
  2. you have so much data that it does not fit in memory and you have to use a database.

Motivation

Let’s do it!

library(dplyr)

# set up the connection
star_wars <- src_sqlite("data/star_wars.db")

# get a table by name
planet <- tbl(star_wars, "planet")

planet %>%
  filter(population > 1e10) %>%
  summarise(count = n())
## Source:   query [?? x 1]
## Database: sqlite 3.8.6 [data/star_wars.db]
## 
##   count
##   <int>
## 1     6

Arbitrary SQL with dplyr

You can also use arbitrary SQL.

# arbitrary sql queries
planet <- tbl(star_wars, sql("SELECT * FROM planet"))

head(planet)
## Source:   query [?? x 10]
## Database: sqlite 3.8.6 [data/star_wars.db]
## 
##   planet_id     name rotation_period orbital_period diameter
##       <int>    <chr>           <int>          <int>    <dbl>
## 1         1 Tatooine              23            304    10465
## 2         2 Alderaan              24            364    12500
## 3         3 Yavin IV              24           4818    10200
## 4         4     Hoth              23            549     7200
## 5         5  Dagobah              23            341     8900
## 6         6   Bespin              12           5110   118000
## # ... with 5 more variables: climate <chr>, gravity <chr>, terrain <chr>,
## #   surface_water <int>, population <int>

Your turn

Connect to the star wars DB from your homework using dplyr (try to use raw SQL as little as possible) and answer

  1. How many planets are not in the movies?
  2. What are their names?
  3. Which species are native to them?

Laziness

dplyr never pulls data into R unless you explicitly ask for it.

It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

c1 <- filter(planet, population > 1e7 & diameter < 10000)
c2 <- select(c1, name, climate, terrain, gravity, diameter, population)
c3 <- mutate(c2, surface_area = 4*pi*(diameter/2)^2)
c4 <- mutate(c3, pop_area = population/surface_area)
c5 <- arrange(c4, pop_area)

This sequence of operations never actually touches the database. It’s not until you ask for the data (e.g. by printing c5) that dplyr generates the SQL and requests the results from the database. Even then it only pulls down 10 rows.

Collect

To pull down all the results use collect().

# dplyr doesn't actually pull all the data
dim(c5)
## [1] NA  8
# unless we tell it to
c5 %>% collect() %>% dim()
## [1] 6 8

DBI

Example

library(DBI)

# connect
sw_con <- dbConnect(dbDriver("SQLite"), "data/star_wars.db")

# list of tables
dbListTables(sw_con)
## [1] "film"          "film_includes" "person"        "pilot"        
## [5] "planet"        "residence"     "species"       "starship"     
## [9] "vehicle"
# get the data
query <- dbSendQuery(sw_con, "SELECT * FROM planet")
planet <- fetch(query)
dim(planet)
## [1] 61 10
# close the connection
dbDisconnect(sw_con)
## [1] TRUE