The goal of the RAthena
package is to provide a
DBI-compliant interface to Amazon’s Athena (https://aws.amazon.com/athena/) using Boto3
SDK. This allows for an efficient, easy setup connection to Athena using
the Boto3
SDK as a driver.
NOTE: Before using RAthena
you must
have an aws account or have access to aws account with permissions
allowing you to use Athena.
Before installing RAthena
ensure that
Python 3+
is installed onto your machine: https://www.python.org/downloads/. To install
Boto3
either it can installed the pip command or using
RAthena
installation function:
pip install boto3
RAthena Method (after RAthena
has been installed this
method can be used)
::install_boto() RAthena
To install RAthena
you can get it from CRAN with:
install.packages("RAthena")
Or to get the development version from Github with:
::install_github("dyfanjones/rathena") remotes
The most basic way to connect to AWS Athena is to hard-code your access key and secret access key. However this method is not recommended as your credentials are hard-coded.
library(DBI)
<- dbConnect(RAthena::athena(),
con aws_access_key_id='YOUR_ACCESS_KEY_ID',
aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
s3_staging_dir='s3://path/to/query/bucket/',
region_name='eu-west-1')
The next method is to use profile names set up by AWS CLI or created
manually in the ~/.aws
directory. To create the profile
names manually please refer to:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html.
RAthena is compatible with AWS CLI. This allows your aws credentials to be stored and not be hard coded in your connection.
To install AWS CLI please refer to: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html, to configure AWS CLI please refer to: https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html
Once AWS CLI has been set up you will be able to connect to Athena by
only putting the s3_staging_dir
.
Using default profile name:
library(DBI)
<- dbConnect(RAthena::athena(),
con s3_staging_dir = 's3://path/to/query/bucket/')
Connecting to Athena using profile name other than
default
.
library(DBI)
<- dbConnect(RAthena::athena(),
con profile_name = "your_profile",
s3_staging_dir = 's3://path/to/query/bucket/')
library(RAthena)
get_session_token("YOUR_PROFILE_NAME",
serial_number='arn:aws:iam::123456789012:mfa/user',
token_code = "531602",
set_env = TRUE)
# Connect to Athena using temporary credentials
<- dbConnect(athena(),
con s3_staging_dir = 's3://path/to/query/bucket/')
Another method in connecting to Athena is to use Amazon Resource Name (ARN) role.
Setting credentials in environmental variables:
library(RAthena)
assume_role(profile_name = "YOUR_PROFILE_NAME",
role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",
set_env = TRUE)
# Connect to Athena using temporary credentials
<- dbConnect(athena(),
con s3_staging_dir = 's3://path/to/query/bucket/')
Connecting to Athena directly using ARN role:
library(DBI)
<- dbConnect(athena(),
con profile_name = "YOUR_PROFILE_NAME",
role_arn = "arn:aws:sts::123456789012:assumed-role/role_name/role_session_name",
s3_staging_dir = 's3://path/to/query/bucket/')
To change the duration of ARN role session please change the
parameter duration_seconds
. By default
duration_seconds
is set to 3600 seconds (1 hour).
Connect to athena, and send a query and return results back to R.
library(DBI)
<- dbConnect(RAthena::athena(),
con aws_access_key_id='YOUR_ACCESS_KEY_ID',
aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
s3_staging_dir='s3://path/to/query/bucket/',
region_name='eu-west-1')
<- dbExecute(con, "SELECT * FROM one_row")
res dbFetch(res)
dbClearResult(res)
To retrieve query in 1 step.
dbGetQuery(con, "SELECT * FROM one_row")
To create a tables in athena, dbExecute
will send the
query to athena and wait until query has been executed. This makes it
and idea method to create tables within athena.
<-
query "CREATE EXTERNAL TABLE impressions (
requestBeginTime string,
adId string,
impressionId string,
referrer string,
userAgent string,
userCookie string,
ip string,
number string,
processId string,
browserCookie string,
requestEndTime string,
timers struct<modelLookup:string, requestTime:string>,
threadId string,
hostname string,
sessionId string)
PARTITIONED BY (dt string)
ROW FORMAT serde 'org.apache.hive.hcatalog.data.JsonSerDe'
with serdeproperties ( 'paths'='requestBeginTime, adId, impressionId, referrer, userAgent, userCookie, ip' )
LOCATION 's3://elasticmapreduce/samples/hive-ads/tables/impressions/' ;"
dbExecute(con, query)
RAthena has 2 extra function to return extra information around
Athena tables: dbGetParitiions
and dbShow
dbGetPartitions
will return all the partitions (returns
data.frame):
::dbGetPartition(con, "impressions") RAthena
dbShow
will return the table’s ddl, so you will able to
see how the table was constructed in Athena (returns SQL character):
::dbShow(con, "impressions") RAthena
library(DBI)
<- dbConnect(RAthena::athena(),
con s3_staging_dir = 's3://path/to/query/bucket/')
RAthena has created a method to send data.frame from R to Athena.
# Check existing tables
dbListTables(con)
# Upload iris to Athena
dbWriteTable(con, "iris", iris,
partition=c("TIMESTAMP" = format(Sys.Date(), "%Y%m%d")))
# Read in iris from Athena
dbReadTable(con, "iris")
# Check new existing tables in Athena
dbListTables(con)
# Check if iris exists in Athena
dbExistsTable(con, "iris")
Please check out RAthena
method for dbWriteTable
for more information in how to upload data to AWS Athena and AWS S3.
For more information around how to get the most out of AWS Athena when uploading data please check out: Top 10 Performance Tuning Tips for Amazon Athena
Creating a connection to Athena and query and already existing table
iris
that was created in previous example.
library(DBI)
library(dplyr)
<- dbConnect(RAthena::athena(),
con aws_access_key_id='YOUR_ACCESS_KEY_ID',
aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
s3_staging_dir='s3://path/to/query/bucket/',
region_name='eu-west-1')
tbl(con, sql("SELECT * FROM iris"))
# Source: SQL [?? x 5]
# Database: Athena 1.9.210 [eu-west-1/default]
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with more rows
dplyr provides lazy querying with allows to short hand
tbl(con, sql("SELECT * FROM iris"))
to
tbl(con, "iris")
. For more information please look at https://db.rstudio.com/dplyr/.
tbl(con, "iris")
# Source: table<iris> [?? x 5]
# Database: Athena 1.9.210 [eu-west-1/default]
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
# … with more rows
Querying Athena with profile_name
instead of hard coding
aws_access_key_id
and aws_secret_access_key
.
By using profile_name
extra Meta Data is returned in the
query to give users extra information.
<- dbConnect(RAthena::athena(),
con profile_name = "your_profile",
s3_staging_dir='s3://path/to/query/bucket/')
tbl(con, "iris")) %>%
filter(petal_length < 1.3)
# Source: lazy query [?? x 5]
# Database: Athena 1.9.210 [your_profile@eu-west-1/default]
sepal_length sepal_width petal_length petal_width species
<dbl> <dbl> <dbl> <dbl> <chr>
1 4.7 3.2 1.3 0.2 setosa
2 4.3 3 1.1 0.1 setosa
3 5.8 4 1.2 0.2 setosa
4 5.4 3.9 1.3 0.4 setosa
5 4.6 3.6 1 0.2 setosa
6 5 3.2 1.2 0.2 setosa
7 5.5 3.5 1.3 0.2 setosa
8 4.4 3 1.3 0.2 setosa
9 5 3.5 1.3 0.3 setosa
10 4.5 2.3 1.3 0.3 setosa
# … with more rows
tbl(con, "iris") %>%
select(contains("sepal"), contains("petal"))
# Source: lazy query [?? x 4]
# Database: Athena 1.9.210 [your_profile@eu-west-1/default]
sepal_length sepal_width petal_length petal_width
<dbl> <dbl> <dbl> <dbl>
1 5.1 3.5 1.4 0.2
2 4.9 3 1.4 0.2
3 4.7 3.2 1.3 0.2
4 4.6 3.1 1.5 0.2
5 5 3.6 1.4 0.2
6 5.4 3.9 1.7 0.4
7 4.6 3.4 1.4 0.3
8 5 3.4 1.5 0.2
9 4.4 2.9 1.4 0.2
10 4.9 3.1 1.5 0.1
# … with more rows
Upload data using dplyr
function copy_to
and compute
.
library(DBI)
library(dplyr)
<- dbConnect(RAthena::athena(),
con profile_name = "your_profile",
s3_staging_dir='s3://path/to/query/bucket/')
Write data.frame to Athena table
copy_to(con, mtcars,
s3_location = "s3://mybucket/data/")
Write Athena table from tbl_sql
<- tbl(con, "mtcars")
athena_mtcars <- athena_mtcars %>% filter(gear >=4) mtcars_filter
Create athena with unique table name
%>% compute() mtcars_filer
Create athena with specified name and s3 location
%>%
mtcars_filer compute("mtcars_filer",
s3_location = "s3://mybucket/mtcars_filer/")
# Disconnect from Athena
dbDisconnect(con)
Creating work group:
library(RAthena)
library(DBI)
<- dbConnect(RAthena::athena(),
con profile_name = "your_profile",
encryption_option = "SSE_S3",
s3_staging_dir='s3://path/to/query/bucket/')
create_work_group(con, "demo_work_group", description = "This is a demo work group",
tags = tag_options(key= "demo_work_group", value = "demo_01"))
List work groups:
list_work_groups(con)
[[1]]
[[1]]$Name
[1] "demo_work_group"
[[1]]$State
[1] "ENABLED"
[[1]]$Description
[1] "This is a demo work group"
[[1]]$CreationTime
2019-09-06 18:51:28.902000+01:00
[[2]]
[[2]]$Name
[1] "primary"
[[2]]$State
[1] "ENABLED"
[[2]]$Description
[1] ""
[[2]]$CreationTime
2019-08-22 16:14:47.902000+01:00
Update work group:
update_work_group(con, "demo_work_group", description = "This is a demo work group update")
Return work group meta data:
get_work_group(con, "demo_work_group")
$Name
[1] "demo_work_group"
$State
[1] "ENABLED"
$Configuration
$Configuration$ResultConfiguration
$Configuration$ResultConfiguration$OutputLocation
[1] "s3://path/to/query/bucket/"
$Configuration$ResultConfiguration$EncryptionConfiguration
$Configuration$ResultConfiguration$EncryptionConfiguration$EncryptionOption
[1] "SSE_S3"
$Configuration$EnforceWorkGroupConfiguration
[1] FALSE
$Configuration$PublishCloudWatchMetricsEnabled
[1] FALSE
$Configuration$BytesScannedCutoffPerQuery
[1] 10000000
$Configuration$RequesterPaysEnabled
[1] FALSE
$Description
[1] "This is a demo work group update"
$CreationTime
2019-09-06 18:51:28.902000+01:00
Connect to Athena using work group:
<- dbConnect(RAthena::athena(),
con profile_name = "your_profile",
work_group = "demo_work_group")
Delete work group:
delete_work_group(con, "demo_work_group")
pyAthena
- A python wrapper of the python package
Boto3
using the sqlAlchemy framework: https://github.com/laughingman7743/PyAthenapyAthenaJDBC
- A python interface into AWS Athena’s
JDBC drivers: https://github.com/laughingman7743/PyAthenaJDBCAWR.Athena
- A R wrapper of RJDBC for the AWS Athena’s
JDBC drivers: https://github.com/nfultz/AWR.Athenanoctua
- A R wrapper of the R AWS SDK paws
to develop a
DBI interface https://github.com/DyfanJones/noctuaawsathena
- rJava Interface to AWS Athena SDK https://github.com/hrbrmstr/awsathenametis
- Helpers for Accessing and Querying Amazon
Athena using R, Including a lightweight RJDBC shim https://github.com/hrbrmstr/metismetisjars
- JARs for metis
https://github.com/hrbrmstr/metis-jarsmetis.tidy
- Access and Query Amazon Athena via the
Tidyverse https://github.com/hrbrmstr/metis-tidyawsathena
and metis
family of packages are
currently used in production every day to analyze petabytes of internet
scan and honeypot data.
The reason why RAthena
stands slightly apart from
AWR.Athena
is that AWR.Athena
uses the Athena
JDBC drivers and RAthena
uses the Python AWS SDK
Boto3
. The ultimate goal is to provide an extra method for
R users to interface with AWS Athena. As pyAthena
is the
most similar project, this project has used an appropriate name to
reflect this … RAthena
.