Fuzzy string matching for R

levitate is based on the Python fuzzywuzzy package for fuzzy string matching. An R port, fuzzywuzzyR, already exists, but unlike fuzzywuzzyR, levitate is written entirely in R and has no external dependencies on reticulate or Python. It also offers a few extras in the form of vectorised functions.

View the docs at https://lewinfox.github.io/levitate/.

Why “levitate”?

A common measure of string similarity is the Levenshtein distance, and the name was available on CRAN.

Installation

Install the development version from GitHub:

devtools::install_github("lewinfox/levitate")

Examples

lev_distance()

The edit distance is the number of insertions, deletions or substitutions needed to transform one string into another. Base R provides the adist() function to compute this. levitate provides lev_distance(), which is powered by the stringdist package.

lev_distance("cat", "bat")
#> [1] 1

lev_distance("rat", "rats")
#> [1] 1

lev_distance("cat", "rats")
#> [1] 2
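
As a cross-check, the same distances can be computed with base R's adist(). This is only a comparison that runs without levitate installed, not a description of how lev_distance() is implemented; note that adist() returns a matrix rather than a plain vector.

```r
# utils::adist() returns a matrix with one row per element of x
# and one column per element of y, so a scalar comparison is 1 x 1.
d <- utils::adist("cat", "rats")
d[1, 1]

# drop() removes the matrix dimensions and leaves a plain number.
drop(utils::adist("rat", "rats"))
```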

The function accepts vectorised input. When both inputs have a length() greater than 1, the result is a vector of element-wise distances (first against first, second against second, and so on) unless pairwise = FALSE, in which case a matrix of all combinations is returned.

lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 1 1 2

lev_distance(c("cat", "dog", "clog"), c("rat", "log", "frog"), pairwise = FALSE)
#>      rat log frog
#> cat    1   3    4
#> dog    3   1    2
#> clog   4   1    2
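
The relationship between the two output shapes can be sketched in base R with adist(), which always computes every combination. The element-wise result is then the diagonal of that matrix. The variable names here are illustrative only, and this is not a claim about levitate's internals.

```r
a <- c("cat", "dog", "clog")
b <- c("rat", "log", "frog")

# All combinations: rows follow `a`, columns follow `b`.
all_pairs <- utils::adist(a, b)
dimnames(all_pairs) <- list(a, b)

# Element-wise pairs (a[i] against b[i]) sit on the diagonal.
elementwise <- diag(all_pairs)
elementwise
```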

If either input (or both) is scalar (length 1), the result is a vector. The elements of the vector are named after the longer input unless useNames = FALSE.

lev_distance(c("cat", "dog", "clog"), "rat")
#>  cat  dog clog 
#>    1    3    4

lev_distance("cat", c("rat", "log", "frog", "other"))
#>   rat   log  frog other 
#>     1     3     4     5

lev_distance("cat", c("rat", "log", "frog", "other"), useNames = FALSE)
#> [1] 1 3 4 5

lev_ratio()

lev_ratio() is often more useful than the raw edit distance because its score always lies between 0 and 1, which makes it easier to compare similarity across different strings. Identical strings get a score of 1 and entirely dissimilar strings get a score of 0.

It accepts the same inputs as lev_distance() and vectorises in the same way:

lev_ratio("cat", "bat")
#> [1] 0.6666667

lev_ratio("rat", "rats")
#> [1] 0.75

lev_ratio("cat", "rats")
#> [1] 0.5

lev_ratio(c("cat", "dog", "clog"), c("rat", "log", "frog"))
#> [1] 0.6666667 0.6666667 0.5000000
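
The scores above are consistent with one minus the edit distance divided by the length of the longer string. The following is a minimal base R sketch of that formula, assuming that normalisation. ratio_sketch() is a hypothetical helper, not part of levitate, and it handles scalar inputs only.

```r
# Hypothetical helper: similarity as 1 - distance / length of longer string.
ratio_sketch <- function(a, b) {
  d <- drop(utils::adist(a, b))
  longest <- max(nchar(a), nchar(b))
  # Assumption: two empty strings are treated as identical.
  if (longest == 0) return(1)
  1 - d / longest
}

ratio_sketch("cat", "bat")   # one edit over three characters
ratio_sketch("rat", "rats")  # one edit over four characters
ratio_sketch("cat", "rats")  # two edits over four characters
```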

lev_partial_ratio()

If a and b are different lengths, this function compares every substring of the longer string that has the same length as the shorter string, and returns the highest lev_ratio() among them. For example, when comparing "actor" and "tractor" we compare "actor" with "tract", "racto" and "actor" and return the highest score (in this case 1).

lev_partial_ratio("actor", "tractor")
#> [1] 1

# What's actually happening is the max() of this result is being returned
lev_ratio("actor", c("tract", "racto", "actor"))
#> tract racto actor 
#>   0.2   0.6   1.0
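
The sliding-window idea can be sketched in base R as follows. partial_ratio_sketch() is a hypothetical helper for scalar, non-empty inputs; it uses adist() and the same "one minus distance over length" scoring as an assumption, and is not levitate's implementation.

```r
# Hypothetical helper: best score of the shorter string against every
# same-length window of the longer string.
partial_ratio_sketch <- function(a, b) {
  # Make sure `a` is the shorter string.
  if (nchar(a) > nchar(b)) {
    tmp <- a
    a <- b
    b <- tmp
  }
  n <- nchar(a)
  stopifnot(n > 0)  # empty strings are out of scope for this sketch

  # Every window of length n in the longer string.
  starts <- seq_len(nchar(b) - n + 1)
  windows <- substring(b, starts, starts + n - 1)

  # Both sides have length n, so n is the normalising length.
  d <- as.vector(utils::adist(a, windows))
  max(1 - d / n)
}

partial_ratio_sketch("actor", "tractor")
```

Swapping the arguments gives the same result, because the helper always slides the shorter string along the longer one.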

lev_token_sort_ratio()

The inputs are tokenised and the tokens are sorted alphabetically, then the resulting strings are compared.

x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"

# Because the order of words is different the simple approach gives a low match ratio.
lev_ratio(x, y)
#> [1] 0.3529412

# The sorted token approach ignores word order.
lev_token_sort_ratio(x, y)
#> [1] 0.9354839
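
A minimal base R sketch of the sort-then-compare idea is shown below. The tokenisation rule (lower-casing and splitting on anything that is not a letter or digit) is an assumption made for illustration, and sorted_tokens() and token_sort_ratio_sketch() are hypothetical helpers rather than levitate functions.

```r
# Hypothetical tokeniser: lower-case, split on non-alphanumerics,
# sort the tokens and join them back together with single spaces.
sorted_tokens <- function(s) {
  tok <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  tok <- tok[nzchar(tok)]
  # method = "radix" sorts in the C locale, so the order is reproducible.
  paste(sort(tok, method = "radix"), collapse = " ")
}

token_sort_ratio_sketch <- function(a, b) {
  sa <- sorted_tokens(a)
  sb <- sorted_tokens(b)
  longest <- max(nchar(sa), nchar(sb))
  if (longest == 0) return(1)
  1 - drop(utils::adist(sa, sb)) / longest
}

x <- "Episode IV - Star Wars: A New Hope"
y <- "Star Wars Episode IV - New Hope"

sorted_tokens(x)
sorted_tokens(y)
token_sort_ratio_sketch(x, y)
```

After sorting, the two strings differ only by the extra token "a", so under these assumptions the sketch gives 29/31, the same value as the lev_token_sort_ratio() call above.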

lev_token_set_ratio()

Similar to lev_token_sort_ratio(), this function breaks the inputs down into tokens. It then identifies the tokens common to both strings and creates three new strings:

x <- {common_tokens}
y <- {common_tokens}{remaining_unique_tokens_from_string_a}
z <- {common_tokens}{remaining_unique_tokens_from_string_b}

and performs three pairwise lev_ratio() calculations between them (x vs y, y vs z and x vs z). The highest of those three ratios is returned.

x <- "the quick brown fox jumps over the lazy dog"
y <- "my lazy dog was jumped over by a quick brown fox"

lev_ratio(x, y)
#> [1] 0.2916667

lev_token_sort_ratio(x, y)
#> [1] 0.6458333

lev_token_set_ratio(x, y)
#> [1] 0.7435897
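
The three-string construction can be sketched in base R as follows. The tokeniser, the alphabetical ordering within each group, the single-space joins and the scoring formula are all assumptions made for illustration, and every function defined here is a hypothetical helper, not part of levitate.

```r
# Hypothetical tokeniser: unique lower-case alphanumeric tokens.
tokens <- function(s) {
  out <- strsplit(tolower(s), "[^[:alnum:]]+")[[1]]
  unique(out[nzchar(out)])
}

# Assumed scoring: 1 - edit distance / length of the longer string.
ratio <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - drop(utils::adist(a, b)) / longest
}

# Join token groups with single spaces, trimming stray whitespace.
join <- function(...) trimws(paste(c(...), collapse = " "))

token_set_ratio_sketch <- function(a, b) {
  ta <- tokens(a)
  tb <- tokens(b)
  common <- sort(intersect(ta, tb), method = "radix")
  only_a <- sort(setdiff(ta, tb), method = "radix")
  only_b <- sort(setdiff(tb, ta), method = "radix")

  s0 <- join(common)          # common tokens only
  s1 <- join(common, only_a)  # common tokens + the rest of a
  s2 <- join(common, only_b)  # common tokens + the rest of b

  max(ratio(s0, s1), ratio(s0, s2), ratio(s1, s2))
}

token_set_ratio_sketch(
  "the quick brown fox jumps over the lazy dog",
  "my lazy dog was jumped over by a quick brown fox"
)
```

For these inputs the best of the three comparisons is the common-token string against the common tokens plus "jumps the", which under these assumptions gives 29/39 and agrees with the lev_token_set_ratio() result above.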

Porting code from fuzzywuzzy or fuzzywuzzyR

Results differ between levitate and fuzzywuzzy, not least because stringdist offers several possible similarity measures. Be careful if you are porting code that relies on hard-coded or learned cutoffs for similarity measures.