This vignette demonstrates some advanced features of the text2vec package: how to read large collections of text stored on disk rather than in memory, and how to let text2vec functions use multiple cores.

Working with files

In many cases, you will have a corpus of texts that is too large to fit in memory. This section demonstrates how to use text2vec to vectorize large collections of text stored in files.

Imagine we have a collection of movie reviews stored in multiple text files on disk. For this vignette, we will create files on disk using the movie_review dataset:

library(text2vec)
library(magrittr)
data("movie_review")

# remove all internal EOL to simplify reading
movie_review$review = gsub(pattern = '\n', replacement = ' ', 
                            x = movie_review$review, fixed = TRUE)
N_FILES = 10
CHUNK_LEN = nrow(movie_review) / N_FILES
files = sapply(1:N_FILES, function(x) tempfile())
chunks = split(movie_review, rep(1:N_FILES, each = CHUNK_LEN))
for (i in 1:N_FILES ) {
  write.table(chunks[[i]], files[[i]], quote = TRUE, row.names = FALSE,
              col.names = TRUE, sep = '|')
}
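# verify that all chunk files were actually written to disk
stopifnot(all(file.exists(files)))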

# Note what the movie review data looks like
str(movie_review, strict.width = 'cut')
## 'data.frame':    5000 obs. of  3 variables:
##  $ id       : chr  "5814_8" "2381_9" "7759_3" "3630_4" ...
##  $ sentiment: int  1 1 0 0 1 1 0 0 0 1 ...
##  $ review   : chr  "With all this stuff going down at the moment with MJ i've"..

The text2vec package provides functions that make it easy to work with files. You need to follow a few steps.

  1. Construct an iterator over the files with the ifiles() function.
  2. Provide a reader() function to ifiles() that can read those files. You can use a function from base R or any other package to read plain text, XML, or other files and convert them to text. The text2vec package doesn't handle the reading itself. The reader function should return a NAMED character vector:
    • elements of the character vector will be treated as documents
    • names of the elements will be treated as document ids
    • if the user doesn't provide a named character vector, text2vec will generate document ids of the form filename + line_number (assuming that each line is a separate document)
  3. Construct a tokens iterator from the files iterator using the itoken() function.

Let’s see how it works:

library(data.table)
reader = function(x, ...) {
  # read
  chunk = data.table::fread(x, header = TRUE, sep = '|')
  # select column with review
  res = chunk$review
  # assign ids to reviews
  names(res) = chunk$id
  res
}
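# optional sanity check: applying the reader to a single file should
# return a named character vector (reviews as elements, ids as names)
str(reader(files[[1]]), strict.width = 'cut')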
# create iterator over files
it_files  = ifiles(files, reader = reader)
# create iterator over tokens from files iterator
it_tokens = itoken(it_files, preprocessor = tolower, 
                   tokenizer = word_tokenizer, progressbar = FALSE)

vocab = create_vocabulary(it_tokens)
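
Before vectorizing, the vocabulary could optionally be pruned to drop very rare or very frequent terms. A short sketch using prune_vocabulary() with its term_count_min and doc_proportion_max arguments (here we keep the full vocabulary, so the output below is unchanged):

# drop terms that appear fewer than 5 times or in more than half of documents
pruned_vocab = prune_vocabulary(vocab, term_count_min = 5, 
                                doc_proportion_max = 0.5)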

Now we are able to construct the DTM:

dtm = create_dtm(it_tokens, vectorizer = vocab_vectorizer(vocab))
str(dtm, list.len = 5)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:718250] 398 1301 27 1292 560 2833 2260 560 3245 28 ...
##   ..@ p       : int [1:42653] 0 1 2 3 4 5 6 7 8 9 ...
##   ..@ Dim     : int [1:2] 5000 42652
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:5000] "5814_8" "2381_9" "7759_3" "3630_4" ...
##   .. ..$ : chr [1:42652] "0.3" "0.48" "0.89" "00015" ...
##   ..@ x       : num [1:718250] 1 1 1 1 1 1 1 1 1 1 ...
##   .. [list output truncated]

Note that the DTM has document ids. They are inherited from the document names we assigned in the reader function. This is a convenient way to assign document IDs when working with files.
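We can check this by comparing the DTM row names against the ids in the original data:

# row names of the DTM should be a permutation of the original review ids
all(sort(rownames(dtm)) == sort(movie_review$id))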

Falling back to auto-generated ids. Let's see how text2vec handles the case where the user doesn't provide document ids. Here we write only the raw review text to each file, so the default reader returns an unnamed character vector:

for (i in 1:N_FILES ) {
  write.table(chunks[[i]][["review"]], files[[i]], quote = TRUE, row.names = FALSE,
              col.names = TRUE, sep = '|')
}
# read with default reader - readLines
it_files  = ifiles(files)
# create iterator over tokens from files iterator
it_tokens = itoken(it_files, preprocessor = tolower, 
                   tokenizer = word_tokenizer, progressbar = FALSE)
dtm = create_dtm(it_tokens, vectorizer = hash_vectorizer())
str(dtm, list.len = 5)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:717943] 819 4972 3155 4826 172 1013 3717 4032 4334 4870 ...
##   ..@ p       : int [1:262145] 0 0 0 0 0 0 0 0 0 1 ...
##   ..@ Dim     : int [1:2] 5010 262144
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:5010] "file137765f3b220_1" "file137765f3b220_2" "file137765f3b220_3" "file137765f3b220_4" ...
##   .. ..$ : NULL
##   ..@ x       : num [1:717943] 1 1 1 1 1 1 1 2 1 1 ...
##   .. [list output truncated]
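
Two details of this output are worth noting. First, the DTM has 5010 rows rather than 5000: write.table() wrote a one-line header into each of the 10 files, and the default line-by-line reader treats each header as just another document. Second, the 262144 columns correspond to the default hash_size = 2^18 of hash_vectorizer(); a smaller (or n-gram) feature space can be requested explicitly:

# hash into 2^16 buckets and hash both unigrams and bigrams
h_small = hash_vectorizer(hash_size = 2 ^ 16, ngram = c(1L, 2L))
dtm_small = create_dtm(it_tokens, vectorizer = h_small)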

Multicore machines

For many tasks, text2vec can take advantage of multicore machines. The functions create_dtm(), create_tcm(), and create_vocabulary() are good examples. In contrast to GloVe fitting, which uses low-level thread parallelism via OpenMP, these functions use fork-join R parallelization on UNIX-like systems, provided by the parallel package. Keep in mind that such high-level parallelism can involve significant overhead.

In order to take advantage of a multicore machine, the user just has to use the ifiles_parallel and itoken_parallel iterators.
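
Assuming text2vec follows the parallel package's mclapply() defaults (an assumption about the implementation, not a documented text2vec option), the number of fork-join workers can then be controlled via the mc.cores option:

# assumption: the fork-join workers follow parallel::mclapply() defaults,
# which read the number of workers from the mc.cores option;
# on Windows there is no fork, so execution falls back to a single core
options(mc.cores = parallel::detectCores())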

Data in memory

# note that we can control the level of granularity with the `n_chunks` argument
it_token_par = itoken_parallel(movie_review$review, preprocessor = tolower, 
                               tokenizer = word_tokenizer, ids = movie_review$id, 
                               n_chunks = 8)
vocab = create_vocabulary(it_token_par)
v_vectorizer = vocab_vectorizer(vocab)
dtm = create_dtm(it_token_par, vectorizer = v_vectorizer)
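
The resulting DTM is an ordinary sparse matrix and can be post-processed exactly like one built serially, for example by weighting it with text2vec's TfIdf model:

# TF-IDF weighting of the DTM built in parallel
tfidf = TfIdf$new()
dtm_tfidf = fit_transform(dtm, tfidf)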

Data on disk

Processing files from disk is also easy with ifiles_parallel and itoken_parallel:

it_files_par = ifiles_parallel(file_paths = files)
it_token_par = itoken_parallel(it_files_par, preprocessor = tolower, 
                               tokenizer = word_tokenizer)

vocab = create_vocabulary(it_token_par)
# DTM vocabulary vectorization
v_vectorizer = vocab_vectorizer(vocab)
dtm_v = create_dtm(it_token_par, vectorizer = v_vectorizer)

# DTM hash vectorization
h_vectorizer = hash_vectorizer()
dtm_h = create_dtm(it_token_par, vectorizer = h_vectorizer)

# co-occurrence statistics
tcm = create_tcm(it_token_par, vectorizer = v_vectorizer, skip_grams_window = 5)
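
The TCM is the input that GloVe fitting consumes. A minimal sketch, assuming the GlobalVectors API of recent text2vec releases (rank and x_max constructor arguments):

# fit GloVe word embeddings on the co-occurrence matrix;
# constructor arguments assume a recent text2vec GlobalVectors API
glove = GlobalVectors$new(rank = 50, x_max = 10)
word_vectors = glove$fit_transform(tcm, n_iter = 10)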