A word embedding comprises numeric values that represent the latent meaning
of a word. These numbers can be seen as coordinates in a space that
comprises several hundred dimensions. The more similar two words'
embeddings are, the closer they are positioned in this embedding space,
and thus, the more similar the words are in meaning. Hence, embeddings
reflect the relationships among words, where proximity in the embedding
space represents similarity in latent meaning. The text package uses
already existing language models to map text data to high-quality word
embeddings.
To represent several words, sentences, or paragraphs, the word embeddings of single words may be combined or aggregated into one word embedding. This can be achieved by taking the mean, minimum, or maximum value of each dimension of the embeddings, as illustrated in the sketch below.
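To make this aggregation step concrete, here is a minimal sketch using made-up three-dimensional vectors (real embeddings typically have several hundred dimensions); the word names and values are invented for illustration only.

```r
# Toy word embeddings: one row per word, one column per dimension
# (real embeddings typically have 768 or more dimensions).
embeddings <- rbind(
  happy   = c(0.2, -0.1, 0.5),
  content = c(0.3,  0.0, 0.4),
  calm    = c(0.1, -0.2, 0.6)
)

# Aggregate the three word embeddings into a single embedding
apply(embeddings, 2, mean)  # mean of each dimension
apply(embeddings, 2, min)   # minimum of each dimension
apply(embeddings, 2, max)   # maximum of each dimension
```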
This tutorial focuses on how to retrieve layers and how to
aggregate them to obtain word embeddings in text.
The focus is on the actual functions.
For more detailed information about word embeddings and the language
models in regard to text, please see text: An R-package
for Analyzing and Visualizing Human Language Using Natural Language
Processing and Deep Learning; and for more comprehensive
information about the inner workings of the language models, for example,
see Illustrated BERT or the references given in Table 1.
Table 1 shows some of the more common language models; for more detailed information, see HuggingFace.
Models | References | Layers | Dimensions | Language |
---|---|---|---|---|
‘bert-base-uncased’ | Devlin et al. 2019 | 12 | 768 | English |
‘roberta-base’ | Liu et al. 2019 | 12 | 768 | English |
‘distilbert-base-cased’ | Sanh et al. 2019 | 6 | 768 | English |
‘bert-base-multilingual-cased’ | Devlin et al. 2019 | 12 | 768 | 104 top languages at Wikipedia |
‘xlm-roberta-large’ | Conneau et al. 2019 | 24 | 1024 | 100 languages |
The main function to transform text to word embeddings is
textEmbed(). First, provide a tibble containing the
text variable(s) that you want to transform (note that it is OK to
submit other variables too; the function will only grab the character
variables). Second, set the language model; using a setting
among the options for model ensures that you use a model
that has been tested with text.
Setting the advanced options pretrained_weights
(e.g., pretrained_weights = 'bert-base-uncased'),
tokenizer_class (e.g., tokenizer_class = BertTokenizer), and
model_class (e.g., model_class = BertModel; together with model = NULL)
allows you to set a model directly through
the HuggingFace interface. Make sure that pretrained_weights,
tokenizer_class, and model_class fit together (otherwise you will get an
error).
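As a hedged illustration, the call below sketches how these advanced options might be combined, following the parameter descriptions above; the exact argument names and whether the tokenizer and model classes are passed as bare names or as character strings may differ in your installed version of text, so check its documentation before relying on this.

```r
library(text)

# A minimal sketch of setting a model directly via the HuggingFace interface:
# model = NULL combined with matching pretrained weights, tokenizer class, and
# model class (argument usage assumed from the description above).
wordembeddings_hf <- textEmbed(x = Language_based_assessment_data_8,
                               model = NULL,
                               pretrained_weights = 'bert-base-uncased',
                               tokenizer_class = BertTokenizer,
                               model_class = BertModel)
```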
Third, decide whether you want contextualized and/or decontextualized
word embeddings by setting the contexts and decontexts
parameters to TRUE/FALSE. Contextualized word
embeddings are standard and return word embeddings that take into
account the context in which the word was used; the decontextualized
word embeddings do not take into account the context in which the word was
used (and are used in the plot functions).
Last, select which layers you want to use and how you want to aggregate them.
```r
library(text)

# Transform the text data to BERT word embeddings
wordembeddings <- textEmbed(x = Language_based_assessment_data_8,
                            model = 'bert-base-uncased',
                            contexts = TRUE,
                            layers = 11:12,
                            context_aggregation = "mean",
                            decontexts = TRUE,
                            decontext_layers = 11:12,
                            decontext_aggregation = "mean")

# Save the word embeddings to avoid having to import the text every time
# saveRDS(wordembeddings, "_YOURPATH_/wordembeddings.rds")

# Get the word embeddings again
# wordembeddings <- readRDS("_YOURPATH_/wordembeddings.rds")

# See how the word embeddings are structured
wordembeddings
```
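To get a feel for the returned object, here is a brief sketch of inspecting it; it assumes the output is a named list with one word-embedding tibble per text variable and that Language_based_assessment_data_8 contains a text variable named harmonytexts (the variable name is an assumption, adjust it to your data).

```r
# Inspect the structure of the returned embeddings (assumed to be a named list
# of tibbles, one per text variable; 'harmonytexts' is an assumed variable name).
names(wordembeddings)
wordembeddings$harmonytexts
dim(wordembeddings$harmonytexts)
```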
The textEmbed() function is suitable when you are just
interested in getting good word embeddings to test a research
hypothesis with. That is, the defaults are based on general experience
of what works. Under the hood, textEmbed() uses one function
for retrieving the layers (textEmbedLayersOutput) and
another function for aggregating them
(textEmbedLayerAggreation). So, if you are interested in
examining different layers and different aggregation methods, it is
better to split up the workflow so that you first retrieve all layers
(which takes most of the time) and then test different aggregation methods.
The textEmbedLayersOutput() function is used to retrieve
the layers of hidden states.
```r
library(text)

# Transform the text data to BERT word embeddings
x <- Language_based_assessment_data_8[1:2, 1:2]

wordembeddings_tokens_layers <- textEmbedLayersOutput(x,
                                                      contexts = TRUE,
                                                      decontexts = FALSE,
                                                      model = 'bert-base-uncased',
                                                      layers = 'all',
                                                      return_tokens = TRUE)
wordembeddings_tokens_layers
```
The output from the textEmbedLayerAggreation() function
is the same as that of textEmbed(); but now you have the
possibility to test different ways to aggregate the layers without
having to retrieve them from the language model again. In
textEmbedLayerAggreation(), you can select any combination
of layers that you want to aggregate, and then you can select to
aggregate them using the mean, minimum, or maximum value of each
dimension.
```r
library(text)

# Aggregate layers 11 and 12 by taking the mean of each dimension.
we_11_12_mean <- textEmbedLayerAggreation(word_embeddings_layers = wordembeddings_tokens_layers,
                                          layers = 11:12,
                                          aggregation = "mean")

# Aggregate layers 11 and 12 by taking the minimum of each dimension across the two layers.
we_11_12_min <- textEmbedLayerAggreation(word_embeddings_layers = wordembeddings_tokens_layers,
                                         layers = 11:12,
                                         aggregation = "min")

# Aggregate layers 1 to 12 by taking the maximum value of each dimension across the 12 layers.
we_1_12_max <- textEmbedLayerAggreation(word_embeddings_layers = wordembeddings_tokens_layers,
                                        layers = 1:12,
                                        aggregation = "max")
we_1_12_max
```
Now the word embeddings are ready to be used in downstream tasks, such as predicting numeric variables or plotting words according to different dimensions.
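As a rough illustration of such a downstream task, the sketch below uses textTrain() to predict a numeric rating scale from the word embeddings; it assumes that the list of embeddings contains an element named harmonytexts and that Language_based_assessment_data_8 contains a numeric variable named hilstotal (both names are assumptions here, adjust them to your data, and see the text-package documentation for textTrain()).

```r
library(text)

# A hedged sketch of a downstream task: train a model that predicts a numeric
# variable from word embeddings ('harmonytexts' and 'hilstotal' are assumed
# names; replace them with variables from your own data).
harmony_model <- textTrain(x = wordembeddings$harmonytexts,
                           y = Language_based_assessment_data_8$hilstotal)

# Examine the cross-validated prediction results
harmony_model$results
```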