There are two parts to dbplyr SQL translation: translating dplyr
verbs, and translating expressions within those verbs. This vignette
describes how entire verbs are translated;
vignette("translation-function")
describes how individual
expressions within those verbs are translated.
All dplyr verbs generate a SELECT
statement. To
demonstrate we’ll make a temporary database with a couple of tables
library(dplyr)
<- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
con <- copy_to(con, nycflights13::flights)
flights <- copy_to(con, nycflights13::airports) airports
select()
and mutate()
modify the
SELECT
clause:
%>%
flights select(contains("delay")) %>%
show_query()
## <SQL>
## SELECT `dep_delay`, `arr_delay`
## FROM `nycflights13::flights`
%>%
flights select(distance, air_time) %>%
mutate(speed = distance / (air_time / 60)) %>%
show_query()
## <SQL>
## SELECT `distance`, `air_time`, `distance` / (`air_time` / 60.0) AS `speed`
## FROM `nycflights13::flights`
filter()
generates a WHERE
clause:
%>%
flights filter(month == 1, day == 1) %>%
show_query()
## <SQL>
## SELECT *
## FROM `nycflights13::flights`
## WHERE (`month` = 1.0) AND (`day` = 1.0)
arrange()
generates an ORDER BY
clause:
%>%
flights arrange(carrier, desc(arr_delay)) %>%
show_query()
## <SQL>
## SELECT *
## FROM `nycflights13::flights`
## ORDER BY `carrier`, `arr_delay` DESC
summarise()
and group_by()
work
together to generate a GROUP BY
clause:
%>%
flights group_by(month, day) %>%
summarise(delay = mean(dep_delay, na.rm = TRUE)) %>%
show_query()
## `summarise()` has grouped output by "month". You can override using the
## `.groups` argument.
## <SQL>
## SELECT `month`, `day`, AVG(`dep_delay`) AS `delay`
## FROM `nycflights13::flights`
## GROUP BY `month`, `day`
R | SQL |
---|---|
inner_join() |
SELECT * FROM x JOIN y ON x.a = y.a |
left_join() |
SELECT * FROM x LEFT JOIN y ON x.a = y.a |
right_join() |
SELECT * FROM x RIGHT JOIN y ON x.a = y.a |
full_join() |
SELECT * FROM x FULL JOIN y ON x.a = y.a |
semi_join() |
SELECT * FROM x WHERE EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
anti_join() |
SELECT * FROM x WHERE NOT EXISTS (SELECT 1 FROM y WHERE x.a = y.a) |
intersect(x, y) |
SELECT * FROM x INTERSECT SELECT * FROM y |
union(x, y) |
SELECT * FROM x UNION SELECT * FROM y |
setdiff(x, y) |
SELECT * FROM x EXCEPT SELECT * FROM y |
x
and y
don’t have to be tables in the same
database. If you specify copy = TRUE
, dplyr will copy the
y
table into the same location as the x
variable. This is useful if you’ve downloaded a summarised dataset and
determined a subset of interest that you now want the full data for. You
can use semi_join(x, y, copy = TRUE)
to upload the indices
of interest to a temporary table in the same database as x
,
and then perform a efficient semi join in the database.
If you’re working with large data, it maybe also be helpful to set
auto_index = TRUE
. That will automatically add an index on
the join variables to the temporary table.
The verb level SQL translation is implemented on top of
tbl_lazy
, which basically tracks the operations you perform
in a pipeline (see lazy-ops.R
). Turning that into a SQL
query takes place in three steps:
sql_build()
recurses over the lazy op data structure
building up query objects (select_query()
,
join_query()
, set_op_query()
etc) that
represent the different subtypes of SELECT
queries that we
might generate.
sql_optimise()
takes a pass over these SQL objects,
looking for potential optimisations. Currently this only involves
removing subqueries where possible.
sql_render()
calls an SQL generation function
(sql_query_select()
, sql_query_join()
,
sql_query_semi_join()
, sql_query_set_op()
, …)
to produce the actual SQL. Each of these functions is a generic, taking
the connection as an argument, so that the translation can be customised
for different databases.