The fastai library simplifies training fast and accurate neural nets using modern best practices. See the fastai website to get started. The library is based on research into deep learning best practices undertaken at fast.ai, and includes "out of the box" support for vision, text, tabular, and collab (collaborative filtering) models.
Download and read data:
library(fastai)
library(magrittr)
URLs_WIKITEXT()
path = 'wikitext-2'
train = data.table::fread(paste(path, 'train.csv', sep = '/'), header = FALSE, fill = TRUE)
test = data.table::fread(paste(path, 'test.csv', sep = '/'), header = FALSE, fill = TRUE)
df = rbind(train, test)
rm(train, test)
Import the library, get the model and weights:
tr = reticulate::import('transformers')
pretrained_weights = 'gpt2'
tokenizer = tr$GPT2TokenizerFast$from_pretrained(pretrained_weights)
model = tr$GPT2LMHeadModel$from_pretrained(pretrained_weights)
Tokenize the text and place the results into a list:
tokenize = function(text) {
  toks = tokenizer$tokenize(text)
  tensor(tokenizer$convert_tokens_to_ids(toks))
}
tokenized = list()

for (i in 1:length(df$V1)) {
  tokeniz = tokenize(df$V1[i])
  tokenized = tokenized %>% append(tokeniz)
  if (i %% 100 == 0) {
    print(i)
  }
}
Later, split the data into training and test indices:

tot = 1:nrow(df)
tr_idx = sample(nrow(df), 0.8 * nrow(df))
ts_idx = tot[!tot %in% tr_idx]
splits = list(tr_idx, ts_idx)
Prepare the dataloader and train the model:
Note: the HuggingFace model returns a tuple in outputs, containing the actual predictions and some additional activations (in case we want to use them in a regularization scheme). To work inside the fastai training loop, we need to drop those extra activations using a Callback; callbacks are used to alter the behavior of the training loop. Here we need to implement the after_pred event and replace self$learn$pred (which contains the predictions that will be passed to the loss function) with just its first element. In callbacks, there is a shortcut that lets you access any of the underlying Learner attributes, so we can write self$pred[0] instead of self$learn$pred[0]. That shortcut only works for read access, not write, so we have to write self$learn$pred on the left-hand side of the assignment (otherwise we would set a pred attribute on the Callback itself).
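The after_pred mechanics described above can be sketched in plain Python. Note that the class and attribute names below are illustrative stand-ins, not the actual fastai API; they only demonstrate the "keep the first element of the tuple" behavior that the callback performs:

```python
class DropOutput:
    """Illustrative stand-in for a fastai Callback with an after_pred event."""
    def __init__(self, learn):
        self.learn = learn

    def after_pred(self):
        # The model returned a tuple (predictions, extra activations);
        # keep only the predictions so the loss function sees a tensor.
        self.learn.pred = self.learn.pred[0]


class FakeLearner:
    """Illustrative stand-in for a fastai Learner."""
    pass


learn = FakeLearner()
learn.pred = ("logits", "extra_activations")  # shape of a HuggingFace output
DropOutput(learn).after_pred()
print(learn.pred)  # -> 'logits'
```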
tls = TfmdLists(tokenized, TransformersTokenizer(tokenizer),
                splits = splits,
                dl_type = LMDataLoader())

bs = 8
sl = 100
dls = tls %>% dataloaders(bs = bs, seq_len = sl)
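LMDataLoader cuts the stream of token ids into fixed-length sequences whose targets are the same sequence shifted by one token. The idea can be sketched as follows (a simplified illustration of the chunking, not the fastai implementation, which also shuffles and batches):

```python
def lm_chunks(tokens, seq_len):
    """Cut a token stream into (input, target) pairs of length seq_len,
    where each target is the input shifted one position to the right."""
    xs, ys = [], []
    for i in range(0, len(tokens) - seq_len, seq_len):
        xs.append(tokens[i:i + seq_len])
        ys.append(tokens[i + 1:i + seq_len + 1])
    return xs, ys

xs, ys = lm_chunks(list(range(10)), 3)
print(xs)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(ys)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```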
# Now we are ready to create our Learner: a fastai object grouping data, model,
# and loss function, which handles model training and inference. Since we are in
# a language model setting, we pass perplexity as a metric, and we need to use
# the callback we just defined. Lastly, we use mixed precision to save every bit
# of memory we can (and if you have a modern GPU, it will also make training faster):
learn = Learner(dls, model, loss_func = CrossEntropyLossFlat(),
                cbs = list(TransformersDropOutput()),
                metrics = Perplexity())$to_fp16()

learn %>% fit_one_cycle(1, 1e-4)
epoch train_loss valid_loss perplexity time
------ ----------- ----------- ----------- ------
0 3.245887 3.301065 27.141541 07:40
1 3.065197 3.234682 25.398289 07:43
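The perplexity column is simply the exponential of the cross-entropy validation loss, which we can confirm against the figures in the table above:

```python
import math

# Perplexity is exp(cross-entropy loss); check against the final-epoch
# validation loss reported in the training table.
valid_loss = 3.234682
perplexity = math.exp(valid_loss)
print(round(perplexity, 6))  # close to the reported 25.398289
```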
Generate text:
prompt = "\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn"
prompt_ids = tokenizer$encode(prompt)
inp = tensor(prompt_ids)[NULL]$cuda()
preds = learn$model$generate(inp, max_length = 80L, num_beams = 5L, temperature = 1.5)
tokenizer$decode(as.integer(preds[0]$cpu()$numpy()))
Result:
"\n = Unicorn = \n \n A unicorn is a magical creature with a rainbow tail and a horn
@-@ like head. The unicorn is a member of the <unk> family, a group of <unk>.
The unicorn is a member of the <unk> family, a group of <unk>. The unicorn is a
member of the <unk> family, a group of"