There are two parts to dbplyr SQL translation: translating dplyr
verbs, and translating expressions within those verbs. This vignette
describes how individual expressions (function calls) are translated;
vignette("translation-verb")
describes how entire verbs are
translated.
library(dbplyr)
library(dplyr)
dbplyr::translate_sql()
powers translation of individual
function calls, and I’ll use it extensively in this vignette to show
what’s happening. You shouldn’t need to use it ordinary code as dbplyr
takes care of the translation automatically.
translate_sql((x + y) / 2)
#> <SQL> (`x` + `y`) / 2.0
translate_sql()
takes an optional con
parameter. If not supplied, this causes dplyr to generate
(approximately) SQL-92 compliant SQL. If supplied, dplyr uses
sql_translation()
to look up a custom environment which
makes it possible for different databases to generate slightly different
SQL: see vignette("new-backend")
for more details. You can
use the various simulate helpers to see the translations used by
different backends:
translate_sql(x ^ 2L)
#> <SQL> POWER(`x`, 2)
translate_sql(x ^ 2L, con = simulate_sqlite())
#> <SQL> POWER(`x`, 2)
translate_sql(x ^ 2L, con = simulate_access())
#> <SQL> `x` ^ 2
Perfect translation is not possible because databases don’t have all
the functions that R does. The goal of dplyr is to provide a semantic
rather than a literal translation: what you mean, rather than precisely
what is done. In fact, even for functions that exist both in databases
and R, you shouldn’t expect results to be identical; database
programmers have different priorities than R core programmers. For
example, in R in order to get a higher level of numerical accuracy,
mean()
loops through the data twice. R’s
mean()
also provides a trim
option for
computing trimmed means; this is something that databases do not
provide.
If you’re interested in how translate_sql()
is
implemented, the basic techniques that underlie the implementation of
translate_sql()
are described in “Advanced R”.
The following examples work through some of the basic differences between R and SQL.
"
and '
mean different things
# In SQLite variable names are escaped by double quotes:
translate_sql(x)
#> <SQL> `x`
# And strings are escaped by single quotes
translate_sql("x")
#> <SQL> 'x'
And some functions have different argument orders:
translate_sql(substr(x, 5, 10))
#> <SQL> SUBSTR(`x`, 5, 6)
translate_sql(log(x, 10))
#> <SQL> LOG(10.0, `x`)
R and SQL have different defaults for integers and reals. In R, 1 is a real, and 1L is an integer. In SQL, 1 is an integer, and 1.0 is a real
translate_sql(1)
#> <SQL> 1.0
translate_sql(1L)
#> <SQL> 1
+
, -
,
*
, /
, ^
acos()
, asin()
,
atan()
, atan2()
, cos()
,
cot()
, tan()
, sin()
cosh()
, coth()
,
sinh()
, tanh()
log()
, log10()
,
exp()
abs()
, ceiling()
,
sqrt()
, sign()
, round()
dbplyr translates %%
to the SQL equivalents but note
that it’s not precisely the same: most databases use truncated division
where the modulo operator takes the sign of the dividend, where R using
the mathematically preferred floored division with the modulo sign
taking the sign of the divisor.
<- tibble(
df x = c(10L, 10L, -10L, -10L),
y = c(3L, -3L, 3L, -3L)
)<- tbl_memdb(df)
mf
%>% mutate(x %% y)
df #> # A tibble: 4 × 3
#> x y `x%%y`
#> <int> <int> <int>
#> 1 10 3 1
#> 2 10 -3 -2
#> 3 -10 3 2
#> 4 -10 -3 -1
%>% mutate(x %% y)
mf #> # Source: SQL [4 x 3]
#> # Database: sqlite 3.38.2 [:memory:]
#> x y `x%%y`
#> <int> <int> <int>
#> 1 10 3 1
#> 2 10 -3 1
#> 3 -10 3 -1
#> 4 -10 -3 -1
dbplyr no longer translates %/%
because there’s no
robust cross-database translation available.
<
, <=
,
!=
, >=
, >
, ==
,
%in%
&
, &&
,
|
, ||
, !
, xor()
All database provide translation for the basic aggregations:
mean()
, sum()
, min()
,
max()
, sd()
, var()
. Databases
automatically drop NULLs (their equivalent of missing values) whereas in
R you have to ask nicely. The aggregation functions warn you about this
important difference:
translate_sql(mean(x))
#> <SQL> AVG(`x`) OVER ()
translate_sql(mean(x, na.rm = TRUE))
#> <SQL> AVG(`x`) OVER ()
Note that, by default, translate()
assumes that the call
is inside a mutate()
or filter()
and generates
a window translation. If you want to see the equivalent
summarise()
/aggregation translation, use
window = FALSE
:
translate_sql(mean(x, na.rm = TRUE), window = FALSE)
#> <SQL> AVG(`x`)
if
and switch()
are translate to
CASE WHEN
:
translate_sql(if (x > 5) "big" else "small")
#> <SQL> CASE WHEN (`x` > 5.0) THEN 'big' WHEN NOT (`x` > 5.0) THEN 'small' END
translate_sql(switch(x, a = 1L, b = 2L, 3L))
#> <SQL> CASE `x` WHEN ('a') THEN (1) WHEN ('b') THEN (2) ELSE (3) END
tolower
, toupper
,
trimws
, nchar
, substr
as.numeric
, as.integer
,
as.character
Any function that dplyr doesn’t know how to convert is left as is.
This means that database functions that are not covered by dplyr can
often be used directly via translate_sql()
.
Any function that dbplyr doesn’t know about will be left as is:
translate_sql(foofify(x, y))
#> <SQL> foofify(`x`, `y`)
Because SQL functions are general case insensitive, I recommend using upper case when you’re using SQL functions in R code. That makes it easier to spot that you’re doing something unusual:
translate_sql(FOOFIFY(x, y))
#> <SQL> FOOFIFY(`x`, `y`)
As well as prefix functions (where the name of the function comes
before the arguments), dbplyr also translates infix functions. That
allows you to use expressions like LIKE
which does a
limited form of pattern matching:
translate_sql(x %LIKE% "%foo%")
#> <SQL> `x` LIKE '%foo%'
Or use ||
for string concatenation (although most
backends will translate paste()
and paste0()
for you):
translate_sql(x %||% y)
#> <SQL> `x` || `y`
SQL functions tend to have a greater variety of syntax than R. That
means there are a number of expressions that can’t be translated
directly from R code. To insert these in your own queries, you can use
literal SQL inside sql()
:
translate_sql(sql("x!"))
#> <SQL> x!
translate_sql(x == sql("ANY VALUES(1, 2, 3)"))
#> <SQL> `x` = ANY VALUES(1, 2, 3)
This gives you a lot of freedom to generate the SQL you need:
<- memdb_frame(x = 1, y = 2)
mf
%>%
mf transmute(factorial = sql("x!")) %>%
show_query()
#> <SQL>
#> SELECT x! AS `factorial`
#> FROM `dbplyr_003`
%>%
mf transmute(factorial = sql("CAST(x AS FLOAT)")) %>%
show_query()
#> <SQL>
#> SELECT CAST(x AS FLOAT) AS `factorial`
#> FROM `dbplyr_003`
If needed, you can also force dbplyr to error if it doesn’t know how
to translate a function with the dplyr.strict_sql
option:
options(dplyr.strict_sql = TRUE)
translate_sql(glob(x, y))
#> Error in `glob()`:
#> ! Don't know how to translate `glob()`
Things get a little trickier with window functions, because SQL’s
window functions are considerably more expressive than the specific
variants provided by base R or dplyr. They have the form
[expression] OVER ([partition clause] [order clause] [frame_clause])
:
The expression is a combination of variable
names and window functions. Support for window functions varies from
database to database, but most support the ranking functions,
lead
, lag
, nth
,
first
, last
, count
,
min
, max
, sum
, avg
and stddev
.
The partition clause specifies how the window
function is broken down over groups. It plays an analogous role to
GROUP BY
for aggregate functions, and
group_by()
in dplyr. It is possible for different window
functions to be partitioned into different groups, but not all databases
support it, and neither does dplyr.
The order clause controls the ordering (when it makes a difference). This is important for the ranking functions since it specifies which variables to rank by, but it’s also needed for cumulative functions and lead. Whenever you’re thinking about before and after in SQL, you must always tell it which variable defines the order. If the order clause is missing when needed, some databases fail with an error message while others return non-deterministic results.
The frame clause defines which rows, or frame, that are passed to the window function, describing which rows (relative to the current row) should be included. The frame clause provides two offsets which determine the start and end of frame. There are three special values: -Inf means to include all preceding rows (in SQL, “unbounded preceding”), 0 means the current row (“current row”), and Inf means all following rows (“unbounded following”). The complete set of options is comprehensive, but fairly confusing, and is summarised visually below.
Of the many possible specifications, there are only three that commonly used. They select between aggregation variants:
Recycled:
BETWEEN UNBOUND PRECEEDING AND UNBOUND FOLLOWING
Cumulative:
BETWEEN UNBOUND PRECEEDING AND CURRENT ROW
Rolling:
BETWEEN 2 PRECEEDING AND 2 FOLLOWING
dplyr generates the frame clause based on whether your using a recycled aggregate or a cumulative aggregate.
To see how individual window functions are translated to SQL, we can
again use translate_sql()
:
translate_sql(mean(G))
#> <SQL> AVG(`G`) OVER ()
translate_sql(rank(G))
#> <SQL> RANK() OVER (ORDER BY `G`)
translate_sql(ntile(G, 2))
#> <SQL> NTILE(2) OVER (ORDER BY `G`)
translate_sql(lag(G))
#> <SQL> LAG(`G`, 1, NULL) OVER ()
If the tbl has been grouped or arranged previously in the pipeline,
then dplyr will use that information to set the “partition by” and
“order by” clauses. For interactive exploration, you can achieve the
same effect by setting the vars_group
and
vars_order
arguments to translate_sql()
translate_sql(cummean(G), vars_order = "year")
#> <SQL> AVG(`G`) OVER (ORDER BY `year` ROWS UNBOUNDED PRECEDING)
translate_sql(rank(), vars_group = "ID")
#> <SQL> RANK() OVER (PARTITION BY `ID`)
There are some challenges when translating window functions between R and SQL, because dplyr tries to keep the window functions as similar as possible to both the existing R analogues and to the SQL functions. This means that there are three ways to control the order clause depending on which window function you’re using:
For ranking functions, the ordering variable is the first
argument: rank(x)
, ntile(y, 2)
. If omitted or
NULL
, will use the default ordering associated with the tbl
(as set by arrange()
).
Accumulating aggregates only take a single argument (the vector
to aggregate). To control ordering, use
order_by()
.
Aggregates implemented in dplyr (lead
,
lag
, nth_value
, first_value
,
last_value
) have an order_by
argument. Supply
it to override the default ordering.
The three options are illustrated in the snippet below:
mutate(players,
min_rank(yearID),
order_by(yearID, cumsum(G)),
lead(G, order_by = yearID)
)
Currently there is no way to order by multiple variables, except by
setting the default ordering with arrange()
. This will be
added in a future release.