Radix trees in R

Oliver Keyes

2016-08-03

A radix tree, or trie, is a data structure optimised for storing key-value pairs in a way optimised for searching. This makes them very, very good for efficiently matching data against keys, and retrieving the values associated with those keys.

triebeard provides an implementation of tries for R (and one that can be used in Rcpp development, too, if that’s your thing) so that useRs can take advantage of the fast, efficient and user-friendly matching that they allow.

Radix usage

Suppose we have observations in a dataset that are labelled, with a 2-3 letter code that identifies the facility the sample came from:

labels <- c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
            "AO-1099", "AFT-1101", "QZ-4933")

We know the facility each code maps to, and we want to be able to map the labels to that - not over 10 entries but over hundreds, or thousands, or hundreds of thousands. Tries are a great way of doing that: we treat the codes as keys and the full facility names as values. So let’s make a trie to do this matching, and then, well, match:

library(triebeard)
trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))

longest_match(trie = trie, to_match = labels)

 [1] "Audobon"    "Atlanta"    "Ann Arbor"  "Austin"     "Queensland" "Queensland" "Raleigh"    "Audobon"    "Austin"    
[10] "Queensland"

This pulls out, for each label, the trie value where the associated key has the longest prefix-match to the label. We can also just grab all the values where the key starts with, say, A:

prefix_match(trie = trie, to_match = "A")

[[1]]
[1] "Ann Arbor" "Atlanta"   "Austin"    "Audobon"  

And finally if we want we can match very, very fuzzily using “greedy” matching:

greedy_match(trie = trie, to_match = "AO")

[[1]]
[1] "Ann Arbor" "Atlanta"   "Austin"    "Audobon"  

These operations are very, very efficient. If we use longest-match as an example, since that’s the most useful thing, with a one-million element vector of things to match against:

library(triebeard)
library(microbenchmark)

trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))

labels <- rep(c("AO-1002", "AEO-1004", "AAI-1009", "AFT-1403", "QZ-9065", "QZ-1021", "RF-0901",
                "AO-1099", "AFT-1101", "QZ-4933"), 100000)

microbenchmark({longest_match(trie = trie, to_match = labels)})

Unit: milliseconds
                                                  expr      min       lq     mean   median       uq      max neval
 {     longest_match(trie = trie, to_match = labels) } 284.6457 285.5902 289.5342 286.8775 288.4564 327.3878   100

I think we can call <300 milliseconds for a million matches against an entire set of possible values pretty fast.

Radix modification

There’s always the possibility that (horror of horrors) you’ll have to add or remove entries from the trie. Fear not; you can do just that with trie_add and trie_remove respectively, both of which silently modify the trie they’re provided with to add or remove whatever key-value pairs you provide:

to_match = "198.0.0.1"
trie_inst <- trie(keys = "197", values = "fake range")

longest_match(trie_inst, to_match)
[1] NA

trie_add(trie_inst, keys = "198", values = "home range")
longest_match(trie_inst, to_match)
[1] "home range"

trie_remove(trie_inst, keys = "198")
longest_match(trie_inst, to_match)
[1] NA

Metadata and coercion

You can also extract information from tries without using them. dim, str, print and length all work for tries, and you can use get_keys(trie) and get_values(trie) to extract, respectively, the keys and values from a trie object.

In addition, you can also coerce tries into other R data structures, specifically lists and data.frames:

trie <- trie(keys = c("AO", "AEO", "AAI", "AFT", "QZ", "RF"),
             values = c("Audobon", "Atlanta", "Ann Arbor", "Austin", "Queensland", "Raleigh"))

str(as.data.frame(trie))
'data.frame':   6 obs. of  2 variables:
 $ keys  : chr  "AAI" "AEO" "AFT" "AO" ...
 $ values: chr  "Ann Arbor" "Atlanta" "Austin" "Audobon" ...

str(as.list(trie))

List of 2
 $ keys  : chr [1:6] "AAI" "AEO" "AFT" "AO" ...
 $ values: chr [1:6] "Ann Arbor" "Atlanta" "Austin" "Audobon" ...

Other trie operations

If you have ideas for other trie-like structures, or functions that would be useful with these tries, the best approach is to either request it or add it!