library(fedmatch)
clean_strings is the way to prepare strings for name matching, either within tier_match (see the Using-tier-match vignette) or as a standalone function. It has several useful options that allow for many different kinds of cleaning. Here are the example strings we'll be using:
name_vec <- corp_data1[, Company]
name_vec
#> [1] "Walmart" "Bershire Hataway" "Apple"
#> [4] "Exxon Mobile" "McKesson " "UnitedHealth Group"
#> [7] "CVS Health" "General Motors" "AT&T"
#> [10] "Ford Motor Company"
First, we can use the basic string cleaning defaults:
clean_strings(name_vec)
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "general motors" "atandt"
#> [10] "ford motor company"
Without any additional arguments, clean_strings does the following: it makes everything lowercase, replaces special characters with their word equivalents (so '&' becomes 'and'), removes punctuation, and trims extra whitespace.
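For example, here's a minimal sketch of those defaults on a made-up string (both the input and the expected result are invented for illustration):
clean_strings("The McDonald's  Corp. & Co.")
#> expected: something like "the mcdonalds corp and co"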
Then, we have a few different options we can use.
sp_char_words is a data.frame with 2 columns: the first column is symbols to replace, and the second is their replacements. fedmatch has a built-in set of symbols:
print(sp_char_words)
#> character replacement
#> 1: \\& and
#> 2: \\$ dollar
#> 3: \\% percent
#> 4: \\@ at
But, you can use any data.frame you'd like, to make whatever replacements you'd like. Note that supplying your own table overrides the built-in set, which is why 'AT&T' becomes 'at t' below rather than 'atandt':
new_sp_char <- data.table::data.table(character = c("o"), replacement = c("apple"))
clean_strings(name_vec, sp_char_words = new_sp_char)
#> [1] "walmart" "bershire hataway"
#> [3] "apple" "exxapplen mapplebile"
#> [5] "mckessapplen" "unitedhealth grappleup"
#> [7] "cvs health" "general mappletapplers"
#> [9] "at t" "fapplerd mappletappler capplempany"
common_words is similar, but it respects word boundaries (so you don't replace every usage of 'Corp' with 'Corporation', for example). fedmatch has a built-in set of 54 words and their replacements:
print(corporate_words[1:5])
#> abbr long.names
#> 1: accep acceptance
#> 2: amer america
#> 3: assoc associates
#> 4: cl company listed
#> 5: cmnty community
But, you can use whatever words you’d like:
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "almart"),
replacement = c("bananas", "oranges")))
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "bananas motors" "atandt"
#> [10] "ford motor company"
(bananas motors sounds like a lovely place to work). Note that the 'almart' in 'walmart' didn't get replaced, because common_words respects word boundaries.
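Under the hood, word-boundary replacement works along the lines of a regex anchored with "\\b". Here's a rough sketch of the idea in base R (the strings are invented for illustration):
gsub("\\bgeneral\\b", "bananas", "general motors generality")
#> [1] "bananas motors generality"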
You can also use a related function, word_frequency, to look for the most common strings in your data:
word_frequency(sample(c("hi", "Hello", "bye "), 1e4, replace = TRUE))
#> Word Count
#> 1: hello 3376
#> 2: bye 3323
#> 3: hi 3301
remove_words and remove_char let you remove rather than replace: remove_words is a boolean that simply removes the words in 'common_words', and remove_char is a vector of individual characters to remove.
clean_strings(name_vec, sp_char_words = new_sp_char, remove_char = c("a", "c"))
#> [1] "w lm rt" "bershire h t w y"
#> [3] "pple" "exxapplen mapplebile"
#> [5] "m kessapplen" "unitedhe lth grappleup"
#> [7] "vs he lth" "gener l mappletapplers"
#> [9] "t t" "fapplerd mappletappler applemp ny"
clean_strings(name_vec, common_words = data.table::data.table(word = c("general", "company"),
replacement = c("bananas", "oranges")),
remove_words = TRUE)
#> [1] "walmart" "bershire hataway" "apple"
#> [4] "exxon mobile" "mckesson" "unitedhealth group"
#> [7] "cvs health" "motors" "atandt"
#> [10] "ford motor"
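Putting word_frequency and remove_words together, one possible cleaning workflow is to look for the most common tokens in your own names and then strip them out. A rough sketch (the filler words "group" and "company" are illustrative choices here, not built-in defaults):
# inspect the most common tokens in your data (output omitted)
word_frequency(name_vec)
# then remove the filler words you choose
filler <- data.table::data.table(word = c("group", "company"),
                                 replacement = c("", ""))
clean_strings(name_vec, common_words = filler, remove_words = TRUE)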
stem is a boolean that lets you stem words, using SnowballC::wordStem. 'Stemming' a word means removing common suffixes:
clean_strings(c("call", "calling", "called"), stem = TRUE)
#> [1] "call" "call" "call"
See the documentation in SnowballC::wordStem for details.
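For reference, calling the stemmer directly should give the same result (SnowballC defaults to the Porter stemmer):
SnowballC::wordStem(c("call", "calling", "called"))
#> [1] "call" "call" "call"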