Ranae Dietzel and Andee Kaplan
R
and R markdown to analyse your data.Question: How to analyze your relational database data in R
?
There are many options for you. Here are 2:
dplyr
DBI
dplyr
Currently dplyr supports the three most popular open source databases (sqlite, mysql and postgresql), and google’s bigquery. If your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle.
The reasons you’d want to use dplyr with a database are
R
and SQL many times before you get the perfect dataset.dplyr
helps by allowing you to write R
code that is automatically translated to SQL.R
function. Instead, dplyr only generates SELECT
statements, the SQL you write most often during analysis.library(dplyr)
# set up the connection
star_wars <- src_sqlite("data/star_wars.db")
# get a table by name
planet <- tbl(star_wars, "planet")
planet %>%
filter(population > 1e10) %>%
summarise(count = n())
## Source: query [?? x 1]
## Database: sqlite 3.8.6 [data/star_wars.db]
##
## count
## <int>
## 1 6
dplyr
You can also use arbitrary SQL.
# arbitrary sql queries
planet <- tbl(star_wars, sql("SELECT * FROM planet"))
head(planet)
## Source: query [?? x 10]
## Database: sqlite 3.8.6 [data/star_wars.db]
##
## planet_id name rotation_period orbital_period diameter
## <int> <chr> <int> <int> <dbl>
## 1 1 Tatooine 23 304 10465
## 2 2 Alderaan 24 364 12500
## 3 3 Yavin IV 24 4818 10200
## 4 4 Hoth 23 549 7200
## 5 5 Dagobah 23 341 8900
## 6 6 Bespin 12 5110 118000
## # ... with 5 more variables: climate <chr>, gravity <chr>, terrain <chr>,
## # surface_water <int>, population <int>
Connect to the star wars DB from your homework using dplyr
(try to use raw SQL as little as possible) and answer
dplyr
never pulls data into R
unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.
c1 <- filter(planet, population > 1e7 & diameter < 10000)
c2 <- select(c1, name, climate, terrain, gravity, diameter, population)
c3 <- mutate(c2, surface_area = 4*pi*(diameter/2)^2)
c4 <- mutate(c3, pop_area = population/surface_area)
c5 <- arrange(c4, pop_area)
This sequence of operations never actually touches the database. It’s not until you ask for the data (e.g. by printing c5
) that dplyr
generates the SQL and requests the results from the database. Even then it only pulls down 10 rows.
To pull down all the results use collect()
.
# dplyr doesn't actually pull all the data
dim(c5)
## [1] NA 8
# unless we tell it to
c5 %>% collect() %>% dim()
## [1] 6 8
DBI
SELECT
(CREATE
, DROP
, INSERT
, UPDATE
)library(DBI)
# connect
sw_con <- dbConnect(dbDriver("SQLite"), "data/star_wars.db")
# list of tables
dbListTables(sw_con)
## [1] "film" "film_includes" "person" "pilot"
## [5] "planet" "residence" "species" "starship"
## [9] "vehicle"
# get the data
query <- dbSendQuery(sw_con, "SELECT * FROM planet")
planet <- fetch(query)
dim(planet)
## [1] 61 10
# close the connection
dbDisconnect(sw_con)
## [1] TRUE