corpustools: Managing, Querying and Analyzing Tokenized Text

by Kasper Welbers and Wouter van Atteveldt

2019-08-14

Introduction

library(corpustools)

The corpustools package offers various tools for analyzing text corpora. What sets it apart from other text analysis packages is that it focuses on the use of a tokenlist format for storing tokenized texts. By a tokenlist we mean a data.frame in which each token (i.e. word) of a text is a row, and columns contain information about each token. The advantage of this approach is that all information from the full text is preserved, and more information can be added. This format can also be used to work with the output of natural language processing pipelines such as SpaCy, UDPipe and Stanford CoreNLP. Furthermore, by preserving the full text information, it is possible to reconstruct texts with annotations from text analysis techniques. This enables qualitative analysis and manual validation of the results of computational text analysis methods (e.g., highlighting search results or dictionary terms, coloring words based on topic models).

The problem is that the tokenlist format quickly leads to huge data.frames that can be difficult to manage. This is where corpustools comes in. The backbone of corpustools is the tCorpus class (i.e. tokenlist corpus), which builds on the R6 and data.table packages to work efficiently with huge tokenlists. corpustools provides functions to create a tcorpus from full text, to apply basic and advanced text preprocessing, and to use various analysis and visualization techniques.

An example application that combines these functionalities could be as follows. Given full-text data, you create a tcorpus. With the built-in search functionalities you use a Lucene-style Boolean query to find where in these texts certain issues are discussed. You subset the tcorpus to focus your analysis on the text within 100 words of the search results. You then train a topic model on this data, and annotate the tcorpus with the word assignments (i.e. the topics assigned to individual words). This information can then be used in other analyses or visualizations, and a topic browser can be created in which full texts are shown with the topic words highlighted.

This document explains how to create and use a tcorpus. For a quick reference, you can also access the documentation hub from within R.

?tcorpus

Creating a tcorpus

A tcorpus consists of two data.tables (i.e. enhanced data.frames supported by the data.table package): $tokens, with one row per token, and $meta, with one row per document.

There are two ways to create a tcorpus: from full text, or by importing an existing tokenlist.

Creating a tcorpus from full text

The create_tcorpus function creates a tcorpus from full-text input. The full text can be provided as a single character vector or as a data.frame in which the text is given in one of the columns. We recommend using the data.frame approach, because this automatically imports all other columns as document meta data.

As an example we have provided the sotu_texts demo data. This is a data.frame in which each row represents a paragraph of the State of the Union speeches by Bush and Obama.
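
The available columns can be listed as usual:

colnames(sotu_texts)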

## [1] "id"        "date"      "party"     "text"      "president"

We can pass sotu_texts to the create_tcorpus function. Here we also need to specify which column contains the text (text_columns) and which column contains the document id (doc_column). Note that multiple text columns can be given.
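
For example, using the column names shown above:

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')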

Printing tc shows the number of tokens (i.e. words) in the tcorpus, the number of documents, and the columns in the tokens and meta data.tables.

## tCorpus containing 90827 tokens
## grouped by documents (n = 1090)
## contains:
##   - 3 columns in $tokens:    doc_id, token_id, token
##   - 4 columns in $meta:      doc_id, date, party, president

We can also look at the tokens and meta data directly (for changing this data, please read the Managing the tcorpus section below).
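
For example:

head(tc$tokens)
head(tc$meta)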

##       doc_id token_id      token
## 1: 111541965        1         It
## 2: 111541965        2         is
## 3: 111541965        3        our
## 4: 111541965        4 unfinished
## 5: 111541965        5       task
## 6: 111541965        6         to
##       doc_id       date     party    president
## 1: 111541965 2013-02-12 Democrats Barack Obama
## 2: 111541995 2013-02-12 Democrats Barack Obama
## 3: 111542001 2013-02-12 Democrats Barack Obama
## 4: 111542006 2013-02-12 Democrats Barack Obama
## 5: 111542013 2013-02-12 Democrats Barack Obama
## 6: 111542018 2013-02-12 Democrats Barack Obama

Additional options

The create_tcorpus function has some additional parameters.

  • The split_sentences argument can be set to TRUE to perform sentence boundary detection. This adds the sentence column to the tokens data.table, which can be used in several techniques in corpustools (see the sketch after this list).
  • While the tcorpus preserves the full-text information in terms of word order, some information is lost regarding the empty spaces between words (e.g., single spaces versus tabs or empty lines). This information can be preserved with the remember_spaces argument.
  • It can be useful and memory efficient to only focus on the first part of a text. The max_sentences and max_tokens parameters can be set to limit the tcorpus to only the first x sentences/tokens of each text.
  • It is also possible to directly use the udpipe package to tokenize the texts with a natural language processing pipeline. This is discussed in the section on Preprocessing.
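
A sketch that combines some of these options (the values are illustrative):

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text',
                    split_sentences = TRUE, max_sentences = 5, remember_spaces = TRUE)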

Importing a tokenlist

If you already have a tokenlist it can be imported as a tcorpus with the tokens_to_tcorpus function. The tokenlist has to be formatted as a data.frame. As an example we provide the corenlp_tokens demo data, which is the output of the Stanford CoreNLP parser.

This type of data.frame can be passed to the tokens_to_tcorpus function. The names of the columns that describe the token positions (document_id, sentence and token_id) also need to be specified.
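
A sketch, assuming the position columns in corenlp_tokens are named doc_id, sentence and id (check colnames(corenlp_tokens) for the actual names):

tc = tokens_to_tcorpus(corenlp_tokens, doc_col = 'doc_id',
                       sentence_col = 'sentence', token_id_col = 'id')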

## tCorpus containing 36 tokens
## grouped by documents (n = 1) and sentences (n = 6)
## contains:
##   - 13 columns in $tokens:   doc_id, sentence, token_id, token, lemma, CharacterOffsetBegin, CharacterOffsetEnd, POS, NER, Speaker, parent, relation, pos1
##   - 1 column in $meta:   doc_id

To include document meta data, a separate data.frame with meta data has to be passed to the meta argument. This data.frame needs to have a column with the document id, using the name specified in doc_col.

Managing a tCorpus

Most operations for modifying data (e.g., preprocessing, rule-based annotations) have specialized functions, but in some cases it’s necessary to manually modify the tCorpus.

While it is possible to directly work in the tc$tokens and tc$meta data.tables, it is recommended to use the special accessor methods (e.g., set, subset). The tCorpus makes several assumptions about how the data is structured in order to work fast, and directly editing the tokens and meta data can break the tCorpus. Using the accessor functions should ensure that this doesn’t happen. Also, the accessor methods make use of the modify by reference semantics of the data.table package.

Adding, removing and mutating columns

To add, remove or mutate columns in the tc$tokens or tc$meta data.tables, the tc$set() and tc$set_meta() methods can be used (also see ?tCorpus_data). These methods change the tCorpus by reference (for more details about R6 methods and changing by reference, see the section Using the tcorpus R6 methods).

The set method (?set) has two mandatory arguments: column and value. The column argument takes the name of the column to add or mutate, and the value argument takes the value to assign to the column.
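
For example, on a small example corpus:

tc = create_tcorpus('This is an example')
tc$set('new_column', 'any value')
tc$tokens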

##    doc_id token_id   token new_column
## 1:      1        1    This  any value
## 2:      1        2      is  any value
## 3:      1        3      an  any value
## 4:      1        4 example  any value

The value can also be an expression, which will be evaluated in the $tokens data.table. Here we overwrite the new_column with the values of the token column in uppercase.

Optionally, the subset argument can be used to only mutate values for the rows where the subset expression evaluates to TRUE. For example, the following operation changes the new_column values to lowercase if token_id <= 2.
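
For example:

tc$set('new_column', toupper(token))                          # uppercase the tokens
tc$set('new_column', tolower(token), subset = token_id <= 2)  # lowercase only the first two tokens
tc$tokens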

The set method cannot be used to change a column name; for this we have the $set_name() method.

To remove columns, you can use the delete_columns method.

To add, remove and mutate columns in the $meta data.table, the set_meta, set_meta_name and delete_meta_columns methods can be used.
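
A short sketch of these methods (the column names here are arbitrary):

tc$set_name('new_column', 'example')     # rename a tokens column
tc$delete_columns('example')             # remove it again
tc$set_meta('source', 'example data')    # add a meta column
tc$delete_meta_columns('source')         # and remove it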

Subsetting a tCorpus

The generic subset function can be used to subset the tCorpus on both the tokens and meta data. To subset on tokens, the subset argument can be used.
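
For example (the token cutoff is illustrative):

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text', split_sentences = TRUE)
nrow(tc$tokens)
tc2 = subset(tc, subset = token_id <= 20)
nrow(tc2$tokens)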

## [1] 90827
## [1] 23233

To subset on meta, the subset_meta argument can be used.
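
For example, to keep only the Obama speeches:

tc_obama = subset(tc, subset_meta = president == 'Barack Obama')
nrow(tc_obama$tokens)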

## [1] 45956

As an alternative, subsetting can also be performed with the $subset() R6 method. This will modify the tCorpus by reference, which is faster and more memory efficient.
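
A sketch, assuming the tcorpus was created with split_sentences = TRUE so that a sentence column is available:

tc$subset(subset = sentence == 1, subset_meta = president == 'Barack Obama')
nrow(tc$tokens)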

## [1] 12274

Now tc itself has been subsetted, and only contains tokens of the first sentence of Obama documents.

Deduplication

A common problem in text analysis is that there can be duplicate or near duplicate documents in a corpus. For example, a corpus of news articles might have multiple versions of the same news article if it has been updated. It is often appropriate to delete such duplicates. The tCorpus has a flexible deduplicate method that can delete duplicates based on document similarity.

Document similarity is calculated using a similarity measure for the weighted frequencies of words (or any selected feature) in documents. By default, this is the cosine similarity of token frequencies with a tf-idf weighting scheme. To ease computation, a minimum document frequency (default 2) and a maximum document frequency percentage (default 50%) are used.

It is often possible to limit comparisons by including meta data about date and categories. For example, duplicate articles in newspapers should occur within a limited time frame and within the same newspaper. The meta_cols argument can be used to only compare documents with identical meta values, and the date_col can be used to only compare documents within the time interval specified in hour_window. Also, by using date_col, it can be specified whether the duplicate with the first or the last date should be removed.
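
A hedged sketch on a hypothetical small tcorpus tc_dup in which doc 1 and doc 2 are near duplicates; the argument values are illustrative and the keep argument name is an assumption based on the description above:

tc_dup$deduplicate(feature = 'token', date_col = 'date',
                   hour_window = 24, similarity = 0.9, keep = 'first')
tc_dup$meta$doc_id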

## [1] "doc 1" "doc 3"
## [1] "doc 2" "doc 3"

Preprocessing

Splitting texts into a tokenlist is a form of text preprocessing called tokenization. For many text analysis techniques it is furthermore necessary (or at least strongly recommended) to use additional preprocessing techniques (see e.g., Welbers, van Atteveldt & Benoit, 2017; Denny & Spirling, 2018).

The corpustools package supports various preprocessing techniques. We make a rough distinction between basic and advanced preprocessing. By basic preprocessing we mean common lightweight techniques such as stemming, lowercasing and stopword removal. By advanced preprocessing we refer to techniques such as lemmatization, part-of-speech tagging and dependency parsing, which require more sophisticated NLP pipelines. Also, it is often useful to filter out certain terms. For instance, you might want to look only at nouns, drop all terms that occur less than 10 times in the corpus, or perhaps use more stopwords.

Basic preprocessing

The basic preprocessing techniques can be performed on a tcorpus with the preprocess method. The main arguments are:

  • column and new_column: the column with the input tokens (default “token”) and the name of the column in which the preprocessed feature is stored (default “feature”).
  • lowercase and remove_punctuation: lowercasing and removal of punctuation, both enabled by default.
  • remove_stopwords, use_stemming and language: stopword removal and stemming, with language indicating the language to use (default English).
  • frequency filters such as min_freq and min_docfreq (see the section Filtering tokens below).

Lowercasing and removing punctuation are so common that they are applied by default. In the following example we also apply stemming and stopword removal. Note that since we work with English data we do not need to set the language, but for other languages the language argument needs to be used. Also note that we do not specify column and new_column. A tcorpus created with create_tcorpus always has a “token” column with the raw token text.
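
For example:

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')
tc$preprocess(use_stemming = TRUE, remove_stopwords = TRUE)
tc$tokens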

##           doc_id token_id      token  feature
##     1: 111541965        1         It     <NA>
##     2: 111541965        2         is     <NA>
##     3: 111541965        3        our     <NA>
##     4: 111541965        4 unfinished unfinish
##     5: 111541965        5       task     task
##    ---                                       
## 90823: 111552781        3   continue  continu
## 90824: 111552781        4         to     <NA>
## 90825: 111552781        5      bless    bless
## 90826: 111552781        6    America  america
## 90827: 111552781        7          .     <NA>

The tokens data.table now has the new column “feature” which contains the preprocessed token.

Advanced preprocessing with UDPipe

There are several packages for using NLP pipelines in R, such as spacyr for SpaCy, coreNLP for Stanford CoreNLP, and udpipe for UDPipe. Most of these rely on external dependencies (Python for SpaCy, Java for CoreNLP), but UDPipe is written in C++ and is bundled with the udpipe package, so it plays nicely with R. We have built a wrapper around the udpipe package so that UDPipe can be used directly from within corpustools. Unlike basic preprocessing, this feature has to be used in the create_tcorpus function, because it requires the raw text input.

To use a UDPipe model in create_tcorpus, you simply need to use the udpipe_model argument. The value for this argument is the language/name of the model you want to use (if you don’t know the name, just give the language and you’ll get suggestions). The first time you use a language, R will automatically download the model. By default the udpipe model is stored in the current working directory, in a folder called udpipe_models, but you can set the path with the udpipe_model_path parameter.
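
A sketch, assuming the 'english-ewt' model (the standard English model in the UDPipe naming scheme):

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text',
                    udpipe_model = 'english-ewt')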

By default, UDPipe will not perform dependency parsing. You can activate this part of the pipeline with the use_parser = TRUE argument, but it takes longer to compute and not all languages have good dependency parser performance.

The feats column contains a bunch of nested features, such as Person (first, second or third) and Number (singular or plural). If you don’t need this, it’s good to remove the column (tc$delete_columns('feats')). If you want to use (some of) this information, you can use the tc$feats_to_columns method to cast these nested features to columns. Here we specifically select the Tense, Person and Number features. Note that the default behavior is to also delete the feats column.
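
A sketch of this step; the exact argument name for selecting the features to keep is an assumption (see the method documentation):

tc$feats_to_columns(keep = c('Tense', 'Person', 'Number'))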

Create_tcorpus keeps a persistent cache

Parsing lots of articles with an NLP pipeline can take quite a while, and you don’t want to spend several hours parsing only to find that your R session crashed and all progress is lost. By default, create_tcorpus therefore keeps a persistent cache of the three most recent unique uses of create_tcorpus with udpipe (i.e. calling create_tcorpus with the same input and arguments). If you want to keep more caches, you can set the udpipe_cache argument.

Thus, if you run the create_tcorpus function (using udpipe) a second time with the same input and parameters, it will continue where you last left off, or load all data from cache if it already finished the first time. So if your R session crashes or you need to shut down your laptop, simply fire up create_tcorpus again next time.

Using multiple cores

If you are parsing a lot of data and you have CPU cores to spare, you can use them to parse the data faster. Simply set the udpipe_cores argument in create_tcorpus. The following example processes the sotu_texts data with four cores (if you have them).
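
For example:

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text',
                    udpipe_model = 'english-ewt', udpipe_cores = 4)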

Filtering tokens

In basic preprocessing we saw that NA values were introduced where stopwords were filtered out. If you want to filter out more tokens, you can use the tc$feature_subset() method. This works as a regular subset, but the difference is that the corpus is kept intact. That is, the rows of the tokens that are filtered out are not deleted; instead, their values in the feature column are set to NA.

As a (silly) example, we can specifically filter out the token with token_id 5. For this we take the subset of tokens that is not 5.
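
A sketch, assuming the preprocessed features are stored in the 'feature' column:

tc$feature_subset('feature', subset = token_id != 5)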

This sets all values in the ‘feature’ column with token_id 5 to NA. This is just an example that is easy to run. A more useful application is to use this to subset on part-of-speech tags.

It is often useful to filter out tokens based on frequency. For instance, to only include words in a model that are not too rare to be interesting and not too common to be informative. The feature_subset method has four arguments to do so: min_freq, max_freq, min_docfreq and max_docfreq. Here freq indicates how often a term occurred in the corpus, and docfreq indicates the number of unique documents in which a term occurred. For example, the following code filters out all tokens that occur less than 10 times.
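
A sketch of this call (again assuming the 'feature' column):

tc$feature_subset('feature', min_freq = 10)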

For the sake of convenience, the min/max functions are also integrated in the preprocess method. So it would also have been possible to use tc$preprocess(use_stemming = T, remove_stopwords=T, min_freq=10). Also, if you want to inspect the freq and docfreq of terms, you could use the feature_stats(tc, 'feature') function.

Creating a DTM or DFM

The data in a tCorpus can easily be converted to a document-term matrix (dtm), or to the document-feature matrix (dfm) class of the quanteda package.

In general, we recommend using the quanteda dfm class, which can be created from a tCorpus with the get_dfm function. Alternatively, the get_dtm function can be used to create a regular sparse matrix (dgTMatrix from the Matrix package); the DocumentTermMatrix class from the tm package is also supported. Here we only demonstrate get_dfm.

Here we first preprocess the tokens, creating the feature column, and then create a dfm with these features.
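
A sketch (the preprocessing settings are illustrative):

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')
tc$preprocess(use_stemming = TRUE, remove_stopwords = TRUE, min_docfreq = 10)
dfm = get_dfm(tc, 'feature')
dfm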

## Document-feature matrix of: 1,090 documents, 1,422 features (97.82% sparse) and 3 docvars.
##            features
## docs        1 10 100 11 11th 12 15 18 19 1990s
##   111541965 0  0   0  0    0  0  0  0  0     0
##   111541995 0  0   0  0    0  0  0  0  0     0
##   111542001 0  0   0  0    0  0  0  0  0     0
##   111542006 0  0   0  0    0  0  0  0  0     0
##   111542013 0  0   0  0    0  0  0  0  0     0
##   111542018 0  0   0  0    0  0  0  0  0     0
## [ reached max_ndoc ... 1,084 more documents, reached max_nfeat ... 1,412 more features ]

The get_dfm function has several other useful parameters; see ?get_dfm for details.

By preprocessing the tokens within the tCorpus, the features in the dfm can be linked to the original tokens in the tCorpus. In the next section, Why keep the full corpus intact?, we show how this can be used to annotate the tCorpus with results from bag-of-words style text analyses.

Why keep the full corpus intact?

In the tCorpus the original tokens are still intact after preprocessing. This is not very memory efficient, so why do we do it?

Keeping the corpus intact in this way has the benefit that the results of an analysis, as performed with the preprocessed tokens, can be linked to the full corpus. For example, here we show a quick and dirty example of annotating a tCorpus based on a topic model.

The lda_fit R6 method is simply a wrapper for the LDA function in the topicmodels package. However, next to returning the topic model (m), it also adds a column to tc$tokens with the topic assignments.
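
A hedged sketch; the argument for naming the annotation column (create_feature) and the other settings are assumptions based on the output shown below:

m = tc$lda_fit('feature', create_feature = 'topic', K = 5, alpha = 0.001)
head(tc$tokens, 10)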

##        doc_id token_id      token  feature topic
##  1: 111541965        1         It     <NA>    NA
##  2: 111541965        2         is     <NA>    NA
##  3: 111541965        3        our     <NA>    NA
##  4: 111541965        4 unfinished unfinish     1
##  5: 111541965        5       task     task     1
##  6: 111541965        6         to     <NA>    NA
##  7: 111541965        7    restore   restor     1
##  8: 111541965        8        the     <NA>    NA
##  9: 111541965        9      basic    basic     1
## 10: 111541965       10    bargain     <NA>    NA

This is just an example (with a poorly trained topic model) but it demonstrates the purpose. In the tokens we now see that certain words (unfinished, restore) have topic assignments. These topic assignments are based on the topic model trained with the preprocessed versions of the tokens (unfinish, restor). Thus, we can relate the results of a text analysis technique (in this case topic modeling, but the same would work with word scaling, dictionaries, etc.) to the original text. This is important, because in the end we apply these techniques to make inferences about the text, and this approach allows us to validate the results and/or perform additional analyses on the original texts.

We can now also reconstruct the full text with the topic words coloured to visualize the topic assignments. The browse_texts function creates an HTML browser of the full texts, and here we use the category = 'topic' argument to indicate that the tokens$topic column should be used to colour the words. If you use the view = TRUE argument, you’ll also directly see the browser in the viewer panel.
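
For example:

url = browse_texts(tc, category = 'topic', view = TRUE)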

By default, browse_texts only uses the first 500 documents to prevent making huge HTML files. The number of documents can be changed (n = 100), and the selection can be set to random (select = ‘random’), optionally with a seed (seed = 1) to create a reproducible sample for validation purposes.

The browse_texts function can also be used to visualize types of annotations (highlight values between 0 and 1, scale values between -1 and 1, categories such as topics or search results), or to only create a text browser.

Querying the tcorpus

One of the nice features of corpustools is its rather extensive querying capabilities. We have implemented a detailed Lucene-like Boolean query language. Not only can you use the common AND, OR and NOT operators (also with parentheses), but you can also look for words within a given word distance. You can furthermore include any feature in the tokens data.table in a query, such as part-of-speech tags or lemma.

Furthermore, since we are not concerned with competitive performance on huge databases, we can support some features that are often not supported or accessible in search engines, such as returning the exact token positions of every match.

A description of the query language can be found in the documentation of the search_features() function.

?search_features()

search_features()

To demonstrate the search_features() function, we first make a tcorpus of the SOTU speeches. Note that we use split_sentences = T, so that we can also view results at the sentence level.
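
For example:

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text',
                    split_sentences = TRUE)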

The search_features() function takes the tcorpus as the first argument. The query should be a character vector with one or multiple queries. Here we look for two queries, terror* and war*.
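
For example:

hits = search_features(tc, query = c('terror*', 'war*'))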

The first time a query search is performed, the token column is indexed to enable fast binary search (the tCorpus remembers the index).

The output of search_features is a featureHits object.
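
Printing the object gives a compact overview, and summary() shows the hits per query:

hits
summary(hits)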

## 284 hits (in 175 documents)
##      code hits documents
## 1 query_1  184       126
## 2 query_2  100        80

The regular output shows the total number of hits, and the summary shows the hits per query. Note, however, that the codes are now query_1 and query_2 because we didn’t label our queries. There are two ways to label queries in corpustools. One is to provide the code labels with the code argument. The other is to prefix the label in the query string with a hashtag. Here we use the latter.
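
A sketch of the hashtag labeling (the label, followed by #, prefixes the query):

hits = search_features(tc, query = c('Terror# terror*', 'War# war*'))
summary(hits)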

##     code hits documents
## 1 Terror  184       126
## 2    War  100        80

It is useful to know that the featureHits object has a hits and a queries slot. The queries slot contains the query input for provenance, and the hits slot shows the tokens that are matched. Note that queries can match multiple words. The hit_id column indicates the unique matches.
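
For example:

hits$queries
head(hits$hits)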

##     code    query
## 1 Terror  terror*
## 2    War     war*
##      code    feature    doc_id sentence token_id hit_id
## 1: Terror terrorists 111542025       NA       99      1
## 2: Terror terrorists 111542114       NA       84      2
## 3:    War        war 111542119       NA        5      1
## 4: Terror  terrorism 111542119       NA       34      3
## 5:    War        war 111542189       NA       50      2
## 6:    War        war 111542284       NA       54      3

Counting hits and plotting

Say we have the following search results.
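
For illustration, the query patterns below are hypothetical; only the labels correspond to the output shown.

hits = search_features(tc, query = c('Economy# econom*', 'Education# educat* OR school*',
                                     'Terrorism# terroris*', 'War# war OR wars'))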

We can use the count_tcorpus method to count hits. Here we use wide = TRUE to return results in wide format.
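
A sketch, assuming the meta_cols argument for grouping by a meta column:

count_tcorpus(tc, hits = hits, wide = TRUE)
count_tcorpus(tc, hits = hits, meta_cols = 'president', wide = TRUE)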

##    group    N     V Economy Education Terrorism War
## 1: total 1090 90827     182       126        95  97
##         president   N     V Economy Education Terrorism War
## 1:   Barack Obama 554 45956     115        80        14  38
## 2: George W. Bush 536 44871      67        46        81  59

By setting the wide argument to FALSE, the query results are stacked in a long format, with the columns code and count for the labels and scores. Among other things, this makes it easy to prepare data for plotting with ggplot2.

Associations

We can create semantic networks based on the co-occurrence of queries in documents. For more information on using the semnet and semnet_window functions and visualizing semantic networks, see the brief tutorial below.
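
A sketch; we assume here that the featureHits object can be passed to semnet directly, with conditional probability as the similarity measure:

g = semnet(hits, measure = 'con_prob')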

The output g is a network in the igraph format. The data can also be extracted as an edgelist (igraph::get.data.frame(g, 'edges')) or adjacency matrix (igraph::get.adjacency(g, attr = 'weight')).
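
For example:

igraph::get.adjacency(g, attr = 'weight')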

## 4 x 4 sparse Matrix of class "dgCMatrix"
##           Economy Education Terrorism    War
## Economy    .         0.1429    0.0385 0.0385
## Education  0.2063    .         0.0238 0.0635
## Terrorism  0.0737    0.0316    .      0.2842
## War        0.0722    0.0825    0.2784 .

We like the idea of using semantic networks to get a quick indication of the occurrence and co-occurrence of search results, so we also made it the default plot for a featureHits object (output of search_features) and a contextHits object (output of search_contexts). The size of nodes indicates relative frequency, edge width indicates how often queries occur in the same documents, and colors indicate clusters.

Inspect results in full text

We can create HTML browsers that highlight hits in full text. If you run this command in RStudio with view = TRUE, you’ll directly see the browser in the viewer pane.
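
A sketch, assuming the browse_hits function, which takes the tcorpus and the featureHits object:

url = browse_hits(tc, hits, view = TRUE)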

The function also returns the path where the HTML file is saved. A nice thing to know here is that you can use the base function browseURL(url) to open URLs in your default browser.

We can also view hits in keyword-in-context (kwic) listings. This shows the keywords within a window of a given word distance. If keywords represent queries with multiple terms (AND statements, or word proximities), the window is given for each keyword (with a […] separator if the windows do not overlap).
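
A sketch; the query and the argument for limiting the number of results (n) are illustrative assumptions:

get_kwic(tc, query = 'America AND freedom', n = 2)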

##      doc_id code hit_id            feature
## 1 111542576           1 America -> freedom
## 2 111542742           2 America -> freedom
##                                                                                                                                                                                                                                 kwic
## 1 ...misguided idealism. In reality, the future security of <America> depends on it. On September the 11th, 2001 [...] and join the fight against terror. Every step toward <freedom> in the world makes our country safer, so we...
## 2                                               ...In Afghanistan, <America>, our 25 NATO allies, and 15 partner nations are helping the Afghan people defend their <freedom> and rebuild their country. Thanks to the courage of...

Adding query hits as token features

Technically, you can do pretty much anything with query hits that you can do with regular tokens. The format of the tcorpus allows any token-level annotations to be added to the tokens data.frame. For adding query hits, we also have a wrapper that does this: the code_features method.
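
A sketch with an illustrative query:

tc$code_features('Example# unfinished OR restore OR basic OR bargain')
head(tc$tokens, 10)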

##        doc_id sentence token_id      token    code
##  1: 111541965        1        1         It    <NA>
##  2: 111541965        1        2         is    <NA>
##  3: 111541965        1        3        our    <NA>
##  4: 111541965        1        4 unfinished Example
##  5: 111541965        1        5       task    <NA>
##  6: 111541965        1        6         to    <NA>
##  7: 111541965        1        7    restore Example
##  8: 111541965        1        8        the    <NA>
##  9: 111541965        1        9      basic Example
## 10: 111541965        1       10    bargain Example

Note that tc$code_features is an R6 method, and adds the ‘code’ column by reference. If this is new to you, an explanation is provided below in the Using the tcorpus R6 methods section.

search_contexts()

The search_contexts function is very similar to search_features, but it only looks for whether a query occurs in a document or sentence.
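
For example (the query is illustrative):

con = search_contexts(tc, 'war* AND terror*')
con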

## 68 documents

Subset by search_contexts()

You can also subset a tCorpus with a query.
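
For example:

tc2 = subset_query(tc, 'terror*')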

As with regular subset, there is also an R6 method for subset_query that subsets the tCorpus by reference.
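
For example:

tc$subset_query('terror*')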

This subsets tc without making a copy. Subsetting by reference is explained in more detail in the Using the tcorpus R6 methods section.

search_dictionary

The search_dictionary function, and the related code_dictionary method, are alternatives to search_features and search_contexts for working with larger but less complex dictionaries. For example, dictionaries with names of people, organizations or countries, or scores for certain emotions. Most of these dictionaries do not use Boolean operators such as AND or more complicated patterns such as word proximity.

The following code shows an example (not shown in this vignette, because it depends on quanteda). Here we use one of the dictionaries provided by quanteda. The quanteda dictionary class (dictionary2) can be used as input for the search dictionary functions.

Given a tCorpus, the dictionary can be used (by default on the “token” column) with the search_dictionary function.
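
A sketch using the Lexicoder Sentiment Dictionary that ships with quanteda:

library(quanteda)
dict = data_dictionary_LSD2015
hits = search_dictionary(tc, dict)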

The output of this function is the same as search_features, so we can do anything with it as discussed above. For example, we can count and plot the data.

Like code_features, the code_dictionary R6 method can be used to annotate the tokens data with the results. For the LSD2015 dictionary this would add a column with the labels “positive” and “negative”, and the negated terms “neg_positive” and “neg_negative”.

Instead of using these labels, we can also convert them to scores on a scale from -1 (negative) to 1 (positive). While it is possible to create the numeric scores for sentiment within the tCorpus, we’ll show another option here. Instead of using the quanteda dictionary, we can use a data.frame as a dictionary as long as it has a column named “string” that contains the dictionary pattern. All other columns in the data.frame are then added to $tokens. With the melt_quanteda_dict function we can also convert a quanteda dictionary to this type of data.frame. The code is then as follows:
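
A sketch; the column name that melt_quanteda_dict uses for the dictionary labels is assumed to be 'code' here:

dict_df = melt_quanteda_dict(data_dictionary_LSD2015)
dict_df$sentiment = ifelse(dict_df$code %in% c('positive', 'neg_negative'), 1, -1)
tc$code_dictionary(dict_df)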

The browse_texts function can also color words based on a scale value between -1 and 1. This is a nice way to inspect and validate the results of a sentiment dictionary.

We can use agg_tcorpus to aggregate the results. This is a convenient wrapper for data.table aggregation for tokens data. Especially important is the .id argument, which allows us to specify an id column. In the dictionary results, there are matches that span multiple rows (e.g., “not good”), and we only want to count these values once. By using the code_id column that code_dictionary added to indicate unique matches, we only aggregate unique non-NA rows in tokens.

The by argument also lets us group by columns in the tokens or meta data.
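
A sketch of both steps (the sentiment column comes from the dictionary data.frame added above):

agg_tcorpus(tc, sent = mean(sentiment), .id = 'code_id')
agg_tcorpus(tc, sent = mean(sentiment), .id = 'code_id', by = 'president')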

Text analysis techniques

The tCorpus is mainly intended for managing, querying, annotating and browsing tokens in a tokenlist format, but there are some built-in techniques for finding patterns in frequencies and co-occurrences of features. Here we use tokens with basic preprocessing as an example, but note that these techniques can be applied to all features and annotations in a tCorpus.

Semantic networks based on co-occurrence

Semantic networks can be created based on the co-occurrence of features in contexts (documents, sentences) or word windows. For this we use the semnet and semnet_window functions.

It is recommended to first apply preprocessing to filter out terms below a given document frequency, because the number of comparisons and the size of the output matrix/network grows quadratically with vocabulary size.
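
A sketch with illustrative settings; the window.size argument name for semnet_window is an assumption:

tc = create_tcorpus(sotu_texts, doc_column = 'id', text_columns = 'text')
tc$preprocess(use_stemming = TRUE, remove_stopwords = TRUE, min_docfreq = 20)
g = semnet_window(tc, 'feature', window.size = 10)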

We can view the network with the plot_semnet function. However, this network will be a mess if we do not apply some filter on nodes (i.e. features) and edges (i.e. similarity scores). One way is to take only the most frequent words, and use a hard threshold for similarity.

A nice alternative is backbone extraction with the backbone_filter function. We then give an alpha score (similar to a p-value, but see the paper referenced in the function documentation for details) to filter edges. In addition, the max_vertices argument can be used to specify the maximum number of nodes/features. Instead of looking for the most frequent features, this approach keeps lowering the alpha and deleting isolates until the maximum is reached. As a result, the filter focuses on features with strong co-occurrences.
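
For example (the alpha value is illustrative):

gb = backbone_filter(g, alpha = 0.01, max_vertices = 100)
plot_semnet(gb)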

## Used cutoff edge-weight 2.13863525362624e-05 to keep number of vertices under 100
## (For the edges the original weight is still used)

Corpus comparisons

We can compare feature frequencies between two corpora (compare_corpus), or between subsets of a corpus (compare_subset). Here we compare the subset of speeches by Obama with the other speeches (in this demo data that’s only George W. Bush).
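
A sketch, assuming the subset_meta_x argument for selecting the subset:

comp = compare_subset(tc, feature = 'feature', subset_meta_x = president == 'Barack Obama')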

The output comp is a data.frame that shows, for each feature, how often it occurred in the subset (Obama) and in the rest of the corpus (Bush), including (smoothed) probability scores, the ratio of probabilities, and a chi2 value.

For a quick peek, we can plot the most strongly overrepresented (ratio > 1) and underrepresented (ratio < 1) features. By default, the plot method for the output of a vocabulary comparison function (such as compare_subset) plots the log ratio on the x axis, and uses the chi2 score to determine the size of words (chi2 can be larger for more frequent terms, but also reflects how strongly a term is over- or under-represented).

For alternative settings, see ?plot.vocabularyComparison.

Feature associations

Similar to a corpus comparison, we can compare how frequently words occur close to the results of a query to the overall frequency of these words. This gives a nice indication of the context in which the queried words tend to occur.

The feature_associations function requires either a query or hits (results of search_features). With the feature argument we indicate for which feature column we want to compare the values. Important to consider is that this is not necessarily the column on which we want to perform the query. For example, we might want to compare frequencies of stemmed, lowercased words, or topics from a topic model, but we want to perform the query on the raw tokens.

By default, the query is always performed on the ‘token’ column. In the following example we thus query on the tokens, but look at the frequencies of the preprocessed features. We query for "terror*".
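
For example:

fa = feature_associations(tc, feature = 'feature', query = 'terror*')
head(fa)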

The output fa is almost identical to the output of a corpus comparison, but by default feature_associations only returns the top features with the strongest overrepresentation (underrepresentation is not really interesting here).

##     feature freq freq_NOT ratio  chi2
## 837     war   25       61  7.35 103.1
## 74   attack   16       30  9.57  84.4
## 540  offens    8        5 28.41  74.8
## 843  weapon   20       54  6.64  74.2
## 115    camp    7        4 30.97  66.7
## 308   fight   19       57  5.98  62.8

We can also plot the output as a wordcloud.
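
For example:

plot(fa)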

For more details see ?plot.featureAssociations.

Using the tcorpus R6 methods

The downside of preserving all text information of the corpus is that all this information has to be kept in memory. To work with this type of data efficiently, we need to prevent making unnecessary copies of the data. The tCorpus therefore uses the data.table package to manage the tokens and meta data, and uses R6 class methods to modify data by reference (i.e. without copying).

R6 methods are accessed with the dollar symbol, similar to how you access columns in a data.frame. For example, this is how you access the tcorpus subset method.

tc$subset()

R6 methods of the tCorpus modify the data by reference. That is, instead of creating a new tCorpus that can be assigned to a new name, the tCorpus from which the method is used is modified directly. In the following example we use the R6 subset method to subset tc by reference.

tc = create_tcorpus('this is an example')
tc$subset(token_id < 3)
tc$tokens
##    doc_id token_id token
## 1:      1        1  this
## 2:      1        2    is

Note that the subset has correctly been performed. We did not need to assign the output of tc$subset() to overwrite tc. That is, we didn’t need to do tc = tc$subset(token_id < 3). The advantage of this approach is that tc does not have to be copied, which can be slow or memory-intensive with a large corpus.

However, the catch is that you need to be aware of the distinction between deep and shallow copies (one of the nice things about R is that you normally do not have to worry about this). If this is new to you, please read the following section. Otherwise, skip to Copying a tCorpus to see how you can make deep copies if you want to.

Being careful with shallow copies

Here we show two examples of possible mistakes in using corpustools that you might make if you are not familiar with modifying by reference. The following section shows how to do this properly.

Firstly, you can assign the output of a tCorpus R6 method, but this will only be a shallow copy of the tCorpus. The original corpus will still be modified. In the following example, we create a subset of tc and assign it to tc2, but we see that tc itself is still modified.
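
A sketch of this pitfall:

tc = create_tcorpus('this is an example')
tc2 = tc$subset(token_id < 3)           # assigning the result only gives a shallow copy
nrow(tc$tokens) == nrow(tc2$tokens)     # TRUE: tc itself has also been subsetted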

## [1] TRUE

The second thing to be aware of is that you do not copy a tCorpus by assigning it to a different name. In the following example, tc2 is a shallow copy of tc, and by modifying tc2 we also modify tc.
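
For example:

tc = create_tcorpus('this is an example')
tc2 = tc                                # this does not copy the data
tc2$subset(token_id < 3)
nrow(tc$tokens) == nrow(tc2$tokens)     # TRUE: tc has been modified as well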

## [1] TRUE

Copying a tCorpus

The tCorpus R6 methods modify the tCorpus by reference. If you do not want this, you have several options. Firstly, for methods where it makes sense, we provide identical regular functions that do not modify the tCorpus by reference. For example, instead of the subset R6 method you can use the subset function in classic R style.

Alternatively, these R6 methods also have a copy argument, which can be set to TRUE to force a deep copy.

Finally, you can always make a deep copy of the entire tCorpus. This is not the same as simply assigning the tCorpus to a new name.
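
A sketch of the three alternatives (the copy() method name is an assumption; see ?tcorpus):

tc2 = subset(tc, token_id < 3)              # regular subset function: tc is not modified
tc2 = tc$subset(token_id < 3, copy = TRUE)  # R6 method with copy = TRUE forces a deep copy
tc2 = tc$copy()                             # deep copy of the entire tCorpus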

The difference is that in the shallow copy, if you change tc2 by reference (e.g., by using the subset R6 method), it will also change tc. With a deep copy this does not happen, because the entire tCorpus is copied and stored separately in memory.

Of course, making deep copies is ‘expensive’ in speed and memory. Our recommended approach is therefore to not overwrite or remove data from the corpus, but rather add columns or make values NA. The tCorpus was designed to keep all information in one big corpus, and manages to do this fairly efficiently by properly using the data.table package to prevent copying where possible.