wordpiece.data: Data for Wordpiece-Style Tokenization

Provides data used by the wordpiece algorithm to tokenize text into somewhat meaningful chunks. The included vocabularies were retrieved from <https://huggingface.co/bert-base-cased/resolve/main/vocab.txt> and <https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt> and parsed into an R-friendly format.
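
To illustrate how a wordpiece vocabulary is typically consumed, here is a minimal sketch of greedy longest-match-first tokenization of a single word, the scheme commonly used with BERT-style vocabularies. It is not the implementation used by the wordpiece R package, and it does not load this package's data. The function name `wordpiece_tokenize`, its parameters, and the contents of `toy_vocab` are hypothetical; only the "##" continuation prefix and the "[UNK]" token follow the conventions of the BERT vocabularies.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split one word into vocabulary pieces, longest match first.

    Hypothetical helper for illustration only; `vocab` is any set of strings.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the continuation prefix.
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for this example, not the packaged data).
toy_vocab = {"[UNK]", "token", "##ize", "##s", "un", "##related"}

print(wordpiece_tokenize("tokenizes", toy_vocab))
print(wordpiece_tokenize("unrelated", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

With a real vocabulary, the set would hold the roughly 30,000 entries of the cased or uncased vocabulary file instead of the six toy entries shown here.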

Version: 2.0.0
Depends: R (≥ 3.5.0)
Suggests: testthat (≥ 3.0.0)
Published: 2022-03-03
Author: Jonathan Bratt [aut], Jon Harmon [aut, cre], Bedford Freeman & Worth Pub Grp LLC DBA Macmillan Learning [cph], Google, Inc [cph] (original BERT vocabularies)
Maintainer: Jon Harmon <jonthegeek at gmail.com>
BugReports: https://github.com/macmillancontentscience/wordpiece.data/issues
License: Apache License (≥ 2)
URL: https://github.com/macmillancontentscience/wordpiece.data
NeedsCompilation: no
Materials: README NEWS
CRAN checks: wordpiece.data results

Documentation:

Reference manual: wordpiece.data.pdf

Downloads:

Package source: wordpiece.data_2.0.0.tar.gz
Windows binaries: r-devel: wordpiece.data_2.0.0.zip, r-release: wordpiece.data_2.0.0.zip, r-oldrel: wordpiece.data_2.0.0.zip
macOS binaries: r-release (arm64): wordpiece.data_2.0.0.tgz, r-oldrel (arm64): wordpiece.data_2.0.0.tgz, r-release (x86_64): wordpiece.data_2.0.0.tgz, r-oldrel (x86_64): wordpiece.data_2.0.0.tgz
Old sources: wordpiece.data archive

Reverse dependencies:

Reverse imports: wordpiece

Linking:

Please use the canonical form https://CRAN.R-project.org/package=wordpiece.data to link to this page.