etl
is an R package to facilitate Extract - Transform - Load (ETL) operations for medium data. The end result is generally a populated SQL database, but the user interaction takes place solely within R.
etl
Instantiate an etl
object using a string that determines the class of the resulting object, and the package that provides access to that data. The trivial mtcars
database is built into etl
.
library(etl)
<- etl("mtcars") cars
## No database was specified so I created one for you at:
## /tmp/RtmpoOM7zi/file203e470b6031.sqlite3
class(cars)
## [1] "etl_mtcars" "etl" "src_SQLiteConnection"
## [4] "src_dbi" "src_sql" "src"
Pay careful attention to where the SQLite database is stored. The default location is a temporary directory, but you will want to move this to a more secure location if you want this storage to be persistent. See file.copy()
for examples on how to move a file.
etl
works with a local or remote database to store your data. Every etl
object extends a dplyr::src_dbi
object. If, as in the example above, you do not specify a SQL source, a local RSQLite
database will be created for you. However, you can also specify any source that inherits from dplyr::src_dbi
.
Note: If you want to use a database other than a local RSQLite, you must create the
mtcars
database and have permission to write to it first!
# For PostgreSQL
library(RPostgreSQL)
<- src_postgres(dbname = "mtcars", user = "postgres", host = "localhost")
db
# Alternatively, for MySQL
library(RMySQL)
<- src_mysql(dbname = "mtcars", user = "r-user", password = "mypass", host = "localhost")
db <- etl("mtcars", db) cars
At the heart of etl
are three functions: etl_extract()
, etl_transform()
, and etl_load()
.
The first step is to acquire data from an online source.
%>%
cars etl_extract()
## Extracting raw data...
## Warning: The `path` argument of `write_csv()` is deprecated as of readr 1.4.0.
## Please use the `file` argument instead.
This creates a local store of raw data.
These data may need to be transformed from their raw form to files suitable for importing into SQL (usually CSVs).
%>%
cars etl_transform()
Populate the SQL database with the transformed data.
%>%
cars etl_load()
## Loading 12 file(s) into the database...
To populate the whole database from scratch, use etl_create
.
%>%
cars etl_create()
## Initializing DB using SQL script init.sqlite
## Extracting raw data...
## Loading 12 file(s) into the database...
You can also update an existing database without re-initializing, but watch out for primary key collisions.
%>%
cars etl_update()
Under the hood, there are three functions that etl_update
chains together:
getS3method("etl_update", "default")
## function (obj, ...)
## {
## obj <- obj %>% etl_extract(...) %>% etl_transform(...) %>%
## etl_load(...)
## invisible(obj)
## }
## <bytecode: 0x55951e09a9e8>
## <environment: namespace:etl>
etl_create
is simply a call to etl_update
that forces the SQL database to be written from scratch.
getS3method("etl_create", "default")
## function (obj, ...)
## {
## obj <- obj %>% etl_init(...) %>% etl_update(...) %>% etl_cleanup(...)
## invisible(obj)
## }
## <bytecode: 0x55951a4a5988>
## <environment: namespace:etl>
Now that your database is populated, you can work with it as a src
data table just like any other dplyr
source.
%>%
cars tbl("mtcars") %>%
group_by(cyl) %>%
summarise(N = n(), mean_mpg = mean(mpg))
## Warning: Missing values are always removed in SQL.
## Use `mean(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.35.5 [/tmp/RtmpoOM7zi/file203e470b6031.sqlite3]
## cyl N mean_mpg
## <int> <int> <dbl>
## 1 4 11 26.7
## 2 6 7 19.7
## 3 8 14 15.1
etl
Suppose you want to create your own ETL package called pkgname
. All you have to do is write a package that requires etl
, and then you have to write one S3 methods:
etl_extract.etl_pkgname()
You may also wish to write
etl_transform.etl_pkgname()
etl_load.etl_pkgname()
All of these functions must take and return an object of class etl_pkgname
that inherits from etl
. Please see the “Extending etl” vignette for more information.
Packages that use the etl
framework are available on CRAN and/or GitHub:
::dependsOnPkgs("etl") tools
## [1] "fec" "macleish"