This vignette discusses the new functionality added in version 1.1.0 of the textTinyR package. I’ll explain some of the functions by using the data and pre-processing steps of this blog-post.
The following code chunks assume that the nltk-corpus is already downloaded and the reticulate package is installed,
NLTK = reticulate::import("nltk.corpus")

text_reuters = NLTK$reuters

nltk = reticulate::import("nltk")

# if the 'reuters' data is not already available then it can be downloaded from within R
nltk$download('reuters')

documents = text_reuters$fileids()

str(documents)

# List of categories
categories = text_reuters$categories()

str(categories)

# Documents in a category
category_docs = text_reuters$fileids("acq")

str(category_docs)

one_doc = text_reuters$raw("test/14843")

one_doc
The collection originally consisted of 21,578 documents but a subset and split is traditionally used. The most common split is Mod-Apte which only considers categories that have at least one document in the training set and the test set. The Mod-Apte split has 90 categories with a training set of 7769 documents and a test set of 3019 documents.
documents = text_reuters$fileids()

# document ids for train - test
train_docs_id = documents[as.vector(sapply(documents, function(i) substr(i, 1, 5) == "train"))]
test_docs_id = documents[as.vector(sapply(documents, function(i) substr(i, 1, 4) == "test"))]

train_docs = lapply(1:length(train_docs_id), function(x) text_reuters$raw(train_docs_id[x]))
test_docs = lapply(1:length(test_docs_id), function(x) text_reuters$raw(test_docs_id[x]))

str(train_docs)
str(test_docs)

# train - test labels [ some documents might have more than one category (overlapping) ]
train_labels = as.vector(sapply(train_docs_id, function(x) text_reuters$categories(x)))
test_labels = as.vector(sapply(test_docs_id, function(x) text_reuters$categories(x)))
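The prefix-based train / test split above can also be written with the vectorised startsWith() function of base R. The following is only a minimal sketch on made-up file ids (toy_ids and toy_labels are hypothetical and merely imitate the "training/..." and "test/..." format of the Reuters file ids), so it runs without Python or the nltk corpus:

```r
# Toy file ids that imitate the format of the Reuters file ids (hypothetical data)
toy_ids = c("test/14826", "test/14828", "training/1", "training/10", "training/100")

# startsWith() is vectorised, so no sapply() loop is needed
toy_train_id = toy_ids[startsWith(toy_ids, "train")]
toy_test_id = toy_ids[startsWith(toy_ids, "test")]

# a list keeps the labels of multi-category documents together
toy_labels = list("test/14826" = "trade", "training/1" = c("grain", "wheat"))

lengths(toy_labels)      # number of categories per document
```

Storing the labels in a list is one way to keep overlapping categories attached to their document.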
First, I’ll perform the following pre-processing steps:
concat = c(unlist(train_docs), unlist(test_docs))

length(concat)

clust_vec = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                   to_lower = T,
                                                   remove_punctuation_vector = F,
                                                   remove_numbers = F,
                                                   trim_token = T,
                                                   split_string = T,
                                                   split_separator = " \r\n\t.,;:()?!//",
                                                   remove_stopwords = T,
                                                   language = "english",
                                                   min_num_char = 3,
                                                   max_num_char = 100,
                                                   stemmer = "porter2_stemmer",
                                                   threads = 4,
                                                   verbose = T)

unq = unique(unlist(clust_vec$token, recursive = F))

length(unq)
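To make the effect of the main parameters more concrete, here is a rough base R sketch of the same kind of pipeline (lower-casing, splitting on separator characters, trimming, length filtering and stopword removal). It is not the textTinyR implementation: tokenize_sketch, toy_docs and toy_stopwords are hypothetical names, the stopword list is drastically shortened, and stemming is left out.

```r
toy_docs = c("The Bank raised rates, again!", "Trade talks: no deal (yet).")

# hypothetical, much-reduced stopword list (for illustration only)
toy_stopwords = c("the", "no")

tokenize_sketch = function(doc, min_num_char = 3, max_num_char = 100) {
  # lower-case, then split on any run of separator characters
  tok = strsplit(tolower(doc), "[ \r\n\t.,;:()?!/]+")[[1]]
  tok = trimws(tok)
  # keep tokens whose length lies inside [min_num_char, max_num_char]
  tok = tok[nchar(tok) >= min_num_char & nchar(tok) <= max_num_char]
  # drop stopwords
  tok[!tok %in% toy_stopwords]
}

toy_tokens = lapply(toy_docs, tokenize_sketch)

toy_tokens
```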
# I'll build also the term matrix as I'll need the global-term-weights
utl = textTinyR::sparse_term_matrix$new(vector_data = concat, file_data = NULL,
                                        document_term_matrix = TRUE)

tm = utl$Term_Matrix(sort_terms = FALSE, to_lower = T, remove_punctuation_vector = F,
                     remove_numbers = F, trim_token = T, split_string = T,
                     stemmer = "porter2_stemmer",
                     split_separator = " \r\n\t.,;:()?!//", remove_stopwords = T,
                     language = "english", min_num_char = 3, max_num_char = 100,
                     print_every_rows = 100000, normalize = NULL, tf_idf = F,
                     threads = 6, verbose = T)

gl_term_w = utl$global_term_weights()

str(gl_term_w)
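The global term weights are inverse-document-frequency style weights: terms that occur in many documents receive a low weight. The sketch below computes such weights on a toy corpus in base R. It assumes the plain definition idf = log(N / df); the exact formula used by textTinyR may differ (for instance through smoothing), so the numbers are illustrative only.

```r
toy_tokens = list(c("bank", "rate", "rate"), c("bank", "trade"), c("bank", "loss"))

n_docs = length(toy_tokens)

# document frequency: in how many documents does each term appear at least once
doc_freq = table(unlist(lapply(toy_tokens, unique)))

# ASSUMPTION: plain idf = log(N / df), not necessarily the textTinyR formula
toy_idf = log(n_docs / as.numeric(doc_freq))
names(toy_idf) = names(doc_freq)

toy_idf
```

A term such as "bank", which appears in every toy document, ends up with a weight of zero under this definition.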
For simplicity, I’ll use the Reuters data as input to the fastTextR::skipgram_cbow function. The data has to be first pre-processed and then saved to a file,
save_dat = textTinyR::tokenize_transform_vec_docs(object = concat, as_token = T,
                                                  to_lower = T,
                                                  remove_punctuation_vector = F,
                                                  remove_numbers = F, trim_token = T,
                                                  split_string = T,
                                                  split_separator = " \r\n\t.,;:()?!//",
                                                  remove_stopwords = T, language = "english",
                                                  min_num_char = 3, max_num_char = 100,
                                                  stemmer = "porter2_stemmer",
                                                  path_2folder = "/path_to_your_folder/",
                                                  threads = 1, # whenever I save data to file set the number threads to 1
                                                  verbose = T)
UPDATE 11-04-2019: There is an updated version of the fastText R package which includes all the features of the ported fasttext library. Therefore the old fastTextR repository is archived. See also the corresponding blog-post.
Then, I’ll load the previously saved data and I’ll use fastTextR to build the word-vectors,
PATH_INPUT = "/path_to_your_folder/output_token_single_file.txt"

PATH_OUT = "/path_to_your_folder/rt_fst_model"

vecs = fastTextR::skipgram_cbow(input_path = PATH_INPUT, output_path = PATH_OUT,
                                method = "skipgram", lr = 0.075, lrUpdateRate = 100,
                                dim = 300, ws = 5, epoch = 5, minCount = 1, neg = 5,
                                wordNgrams = 2, loss = "ns", bucket = 2e+06,
                                minn = 0, maxn = 0, thread = 6, t = 1e-04, verbose = 2)
Before using one of the three doc2vec methods, it is better to reduce the number of word-vectors (the rows of the matrix). I’ll keep only the word-vectors whose terms appear in the Reuters data set (clust_vec$token). This step changes little here, because the word-vectors were built from the Reuters data itself. If they were based on external data, say the Wikipedia data, the matrix would be far larger and many of its terms would be redundant for the Reuters data set, which would increase the computation time considerably when invoking one of the doc2vec methods,
init = textTinyR::Doc2Vec$new(token_list = clust_vec$token,
                              word_vector_FILE = "path_to_your_folder/rt_fst_model.vec",
                              print_every_rows = 5000,
                              verbose = TRUE,
                              copy_data = FALSE) # use of external pointer

pre-processing of input data starts ...
File is successfully opened
total.number.lines.processed.input: 25000
creation of index starts ...
intersection of tokens and wordvec character strings starts ...
modification of indices starts ...
final processing of data starts ...
File is successfully opened
total.number.lines.processed.output: 25000
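The idea of keeping only the word-vectors whose terms occur in the tokenized documents can be illustrated with a few lines of base R. This is just a sketch on a made-up matrix (toy_wv and toy_tokens are hypothetical) and says nothing about how Doc2Vec performs the intersection internally:

```r
set.seed(1)

# toy word-vector matrix: one row per term, 4 dimensions (rownames are the terms)
toy_wv = matrix(rnorm(5 * 4), nrow = 5,
                dimnames = list(c("bank", "rate", "zebra", "trade", "quantum"), NULL))

toy_tokens = list(c("bank", "rate"), c("trade", "bank", "unknownterm"))

# keep only the vectors of terms that occur in the tokenized documents
keep = intersect(rownames(toy_wv), unique(unlist(toy_tokens)))

toy_wv_reduced = toy_wv[keep, , drop = FALSE]

dim(toy_wv_reduced)
```

Terms without a word-vector ("unknownterm") and word-vectors without a matching term ("zebra", "quantum") both drop out.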
If copy_data = TRUE, the pre-processed data can be inspected before invoking one of the ‘doc2vec’ methods,
# res_wv = init$pre_processed_wv()
#
# str(res_wv)
Then, I can use one of the three methods (sum_sqrt, min_max_norm, idf) to obtain the transformed vectors. These methods are based on the following blog-posts; see especially:
www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur
https://erogol.com/duplicate-question-detection-deep-learning/
doc2_sum = init$doc2vec_methods(method = "sum_sqrt", threads = 6)
doc2_norm = init$doc2vec_methods(method = "min_max_norm", threads = 6)
doc2_idf = init$doc2vec_methods(method = "idf", global_term_weights = gl_term_w, threads = 6)

rows_cols = 1:5

doc2_sum[rows_cols, rows_cols]
doc2_norm[rows_cols, rows_cols]
doc2_idf[rows_cols, rows_cols]

> dim(doc2_sum)
[1] 10788   300
> dim(doc2_norm)
[1] 10788   300
> dim(doc2_idf)
[1] 10788   300
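To give an intuition of what the three methods do with the word-vectors of a single document, here is a small base R sketch. It reflects my reading of the method descriptions (sum then divide by the norm, min-max scale then sum, idf-weight then sum) and is not the package code; toy_wv, toy_idf and doc are made-up values.

```r
# toy word vectors (rows = terms) and hypothetical idf weights
toy_wv = rbind(bank = c(1, 2, 2), rate = c(0, 1, 3), trade = c(2, 0, 1))
toy_idf = c(bank = 0.1, rate = 1.2, trade = 0.8)

doc = c("bank", "rate")                  # tokens of a single document

vecs = toy_wv[doc, , drop = FALSE]

# sum_sqrt: sum the word vectors, then divide by the Euclidean norm of the sum
s = colSums(vecs)
d_sum_sqrt = s / sqrt(sum(s^2))

# min_max_norm: min-max scale each word vector, then sum
min_max = function(v) (v - min(v)) / (max(v) - min(v))
d_min_max = colSums(t(apply(vecs, 1, min_max)))

# idf: multiply each word vector by the idf weight of its term, then sum
d_idf = colSums(vecs * toy_idf[doc])
```

In every case a document with any number of tokens is mapped to one vector with as many entries as the word-vectors have dimensions, which is why the matrices above all have 300 columns.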
For illustration, I’ll use the transformed vectors returned by the sum_sqrt method. The approach described can be used as an alternative to Latent semantic indexing (LSI) or topic-modeling in order to discover categories in text data (documents).
First, one can search for the optimal number of clusters using the Optimal_Clusters_KMeans function of the ClusterR package,
scal_dat = ClusterR::center_scale(doc2_sum)     # center and scale the data

opt_cl = ClusterR::Optimal_Clusters_KMeans(scal_dat, max_clusters = 15,
                                           criterion = "distortion_fK",
                                           fK_threshold = 0.85, num_init = 3,
                                           max_iters = 50,
                                           initializer = "kmeans++", tol = 1e-04,
                                           plot_clusters = TRUE,
                                           verbose = T, tol_optimal_init = 0.3,
                                           seed = 1)
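The general idea behind such a search is to fit k-means for a range of k values and to compare a quality measure across them. The sketch below does this with stats::kmeans and the raw within-cluster distortion on made-up data. It is a simplification: the distortion_fK criterion of ClusterR transforms the distortion further, which is not reproduced here.

```r
set.seed(1)

# toy data: three well-separated 2-d blobs of 50 points each
toy = rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
            matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
            matrix(rnorm(100, mean = 6, sd = 0.3), ncol = 2))

# total within-cluster sum of squares (distortion) for k = 1, ..., 6
distortion = sapply(1:6, function(k) {
  stats::kmeans(toy, centers = k, nstart = 50)$tot.withinss
})

round(distortion, 2)
```

On data like this the distortion drops sharply up to k = 3 and flattens afterwards, which is the kind of pattern the plot of Optimal_Clusters_KMeans helps to spot.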
Based on the output of the Optimal_Clusters_KMeans function, I’ll pick 5 as the optimal number of clusters in order to perform k-means clustering,
num_clust = 5

km = ClusterR::KMeans_rcpp(scal_dat, clusters = num_clust, num_init = 3, max_iters = 50,
                           initializer = "kmeans++", fuzzy = T, verbose = F,
                           CENTROIDS = NULL, tol = 1e-04, tol_optimal_init = 0.3, seed = 2)

table(km$clusters)
table(km$clusters)
1 2 3 4 5
713 2439 2393 2607 2636
As a follow-up, one can also perform cluster-medoids clustering using the pearson-correlation metric, which resembles the cosine distance (the latter is frequently used for text clustering),
kmed = ClusterR::Cluster_Medoids(scal_dat, clusters = num_clust,
                                 distance_metric = "pearson_correlation",
                                 minkowski_p = 1, threads = 6, swap_phase = TRUE,
                                 fuzzy = FALSE, verbose = F, seed = 1)
table(kmed$clusters)
1 2 3 4 5
2396 2293 2680 875 2544
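The resemblance between the two metrics can be checked numerically: the Pearson correlation of two vectors equals the cosine similarity of the same vectors after each has been mean-centered. A short base R sketch on random vectors (cosine_sim is a helper defined here, not a ClusterR function):

```r
set.seed(1)

x = rnorm(300)
y = rnorm(300)

cosine_sim = function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Pearson correlation equals the cosine similarity of the mean-centered vectors
pearson = cor(x, y)
centered = cosine_sim(x - mean(x), y - mean(y))

all.equal(pearson, centered)
```

The two measures therefore only differ to the extent that the individual vectors have a non-zero mean.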
Finally, the word-frequencies of the documents can be obtained using the cluster_frequency function, which groups the tokens (words) of the documents by the cluster to which each document belongs,
freq_clust = textTinyR::cluster_frequency(tokenized_list_text = clust_vec$token,
                                          cluster_vector = km$clusters, verbose = T)

Time difference of 0.1762383 secs
> freq_clust
$`3`
WORDS COUNTS
1: mln 8701
2: 000 6741
3: cts 6260
4: net 5949
5: loss 4628
---
6417: vira> 1
6418: gain> 1
6419: pwj> 1
6420: drummond 1
6421: parisian 1
$`1`
WORDS COUNTS
1: cts 1303
2: record 696
3: april 669
4: < 652
5: dividend 554
---
1833: hvt> 1
1834: bang> 1
1835: replac 1
1836: stbk> 1
1837: bic> 1
$`4`
WORDS COUNTS
1: mln 6137
2: pct 5084
3: dlrs 4024
4: year 3397
5: billion 3390
---
10968: heijn 1
10969: "behind 1
10970: myo> 1
10971: "favor 1
10972: wonder> 1
$`5`
WORDS COUNTS
1: < 4244
2: share 3748
3: dlrs 3274
4: compani 3184
5: mln 2659
---
13059: often-fat 1
13060: computerknowledg 1
13061: fibrinolyt 1
13062: hercul 1
13063: ceroni 1
$`2`
WORDS COUNTS
1: trade 3077
2: bank 2578
3: market 2535
4: pct 2416
5: rate 2308
---
13702: "mfn 1
13703: uk> 1
13704: honolulu 1
13705: arap 1
13706: infinitesim 1
freq_clust_kmed = textTinyR::cluster_frequency(tokenized_list_text = clust_vec$token,
                                               cluster_vector = kmed$clusters, verbose = T)

Time difference of 0.1685851 secs
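What cluster_frequency returns can be mimicked on a small scale with base R: pool the tokens of all documents that share a cluster, then count and sort them. This is a sketch with made-up tokens and cluster ids (toy_tokens, toy_clusters), not the package implementation, and it returns tables rather than the data.table-like output shown above.

```r
toy_tokens = list(c("bank", "rate", "rate"), c("trade", "bank"), c("loss", "net", "loss"))

toy_clusters = c(1, 1, 2)          # cluster id of each document

# pool the tokens of all documents that share a cluster, then count and sort
pooled = split(toy_tokens, toy_clusters)

toy_freq = lapply(pooled, function(docs) sort(table(unlist(docs)), decreasing = TRUE))

toy_freq
```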
This is one of the ways that the transformed word-vectors can be used, and it is solely based on tokens (words) and word frequencies. However, a more advanced approach would be to cluster documents based on word n-grams and take advantage of graphs, as explained here, in order to plot the nodes, edges and text.
References:
www.linkedin.com/pulse/duplicate-quora-question-abhishek-thakur
https://erogol.com/duplicate-question-detection-deep-learning/