tier_match is the ultimate wrapper function in fedmatch. tier_match puts together all of the pieces from the package into one function, letting the user perform many matches in one call. The function is excellent both as an exploratory tool, while the user is still figuring out how they want to execute their matches, and as a final matching tool that can be used in production code.
‘Tiers’ of a match are useful because there are hierarchies of matches. An exact name match between two companies is a higher-quality match than a fuzzy match, and fuzzy matches with various levels of cleaning can be different levels of quality.
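The idea of a hierarchy of matches can be sketched in base R, without fedmatch. The example below is only an illustration under stated assumptions: the two small data frames, the column names id1 and id2, and the cutoff of two edits are all made up for this sketch, and the fuzzy step uses base R's adist (generalized Levenshtein distance) rather than any of fedmatch's own methods.

```r
# Illustrative only: a two-tier match in base R, not fedmatch's implementation.
left  <- data.frame(id1 = 1:3,
                    name = c("walmart", "Bershire Hataway", "apple"),
                    stringsAsFactors = FALSE)
right <- data.frame(id2 = 1:3,
                    name = c("walmart", "Bershire Hathaway", "apple computer"),
                    stringsAsFactors = FALSE)

# Tier "a": exact match on the name column.
tier_a <- merge(left, right, by = "name")
tier_a <- data.frame(id1 = tier_a$id1, id2 = tier_a$id2, tier = "a")

# Tier "b": fuzzy match, run only on rows that tier "a" left unmatched.
left_rest  <- left[!left$id1 %in% tier_a$id1, ]
right_rest <- right[!right$id2 %in% tier_a$id2, ]

# Edit distance between every remaining pair of names.
d <- adist(left_rest$name, right_rest$name)
closest <- apply(d, 1, which.min)                   # nearest right-hand row per left row
keep <- d[cbind(seq_len(nrow(d)), closest)] <= 2    # hypothetical cutoff: at most 2 edits

tier_b <- data.frame(id1 = left_rest$id1[keep],
                     id2 = right_rest$id2[closest[keep]],
                     tier = rep("b", sum(keep)))

# Combine the tiers, keeping track of where each match came from.
matches <- rbind(tier_a, tier_b)
print(matches)
```

Here ‘walmart’ is caught by the exact tier, ‘Bershire Hataway’ is only caught by the fuzzy tier, and ‘apple’ stays unmatched because it is too far from ‘apple computer’ under this cutoff. The tier label records how much trust each match deserves.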
To call tier_match, provide a core set of arguments to the function itself, and then pass it a named list of tiers. Each element of this list is itself a list that defines one tier to match on and contains all of the arguments necessary for that tier. These arguments are passed to ‘merge_plus’ in sequence, and the matches from each tier are saved and combined.
tier_list <- list(
  a = build_tier(match_type = "exact"),
  b = build_tier(match_type = "fuzzy"),
  c = build_tier(match_type = "multivar", multivar_settings = build_multivar_settings(
    logit = NULL, missing = FALSE, wgts = 1,
    compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
    top = 1, threshold = NULL
  ))
)
# tier_list
This list will perform three matches: ‘a’, an exact match; ‘b’, a fuzzy match; and ‘c’, a multivar match. We can get a bit fancier and add more settings to each, if we’d like. Remember that each element of each tier has to be an argument for merge_plus.
tier_list_v2 <- list(
  a = build_tier(match_type = "exact", clean = TRUE),
  b = build_tier(match_type = "fuzzy", clean = TRUE,
                 fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard",
                                                       maxDist = .7,
                                                       nthread = 1),
                 clean_settings = build_clean_settings(remove_words = TRUE)),
  c = build_tier(match_type = "multivar",
                 multivar_settings = build_multivar_settings(
                   logit = NULL, missing = FALSE, wgts = 1,
                   compare_type = "stringdist", blocks = NULL, blocks.x = NULL, blocks.y = NULL,
                   top = 1, threshold = NULL
                 ))
)
Let’s take a look at the rest of the syntax for tier_match:
result <- tier_match(corp_data1, corp_data2,
                     by.x = "Company", by.y = "Name",
                     unique_key_1 = "unique_key_1", unique_key_2 = "unique_key_2",
                     tiers = tier_list_v2, takeout = "neither", verbose = TRUE,
                     score_settings = build_score_settings(score_var_x = "Company",
                                                           score_var_y = "Name",
                                                           wgts = 1,
                                                           score_type = "stringdist")
)
#> Matching tier 'a'...
#> Time elapsed: 0.01 secs.
#> Matching tier 'b'...
#> Time elapsed: 0.02 secs.
#> Matching tier 'c'...
#> Time elapsed: 0.06 secs.
There are two types of arguments for tier_match: those that can be passed to merge_plus, and those that are unique to tier_match. If any of the merge_plus arguments are supplied to tier_match directly (rather than in tier_list), those arguments are used in every tier. In this example, we are always matching on ‘Company’ and ‘Name’, so those are placed in the arguments for tier_match directly. The arguments unique to tier_match and their defaults are:
tiers is the tier list created by repeated calls to build_tier(). Required, no default.
takeout is a character string, one of “neither”, “both”, “data1”, or “data2”. It controls whether matched rows are taken out between tiers and, if so, which dataset they are removed from.
verbose is a boolean. If TRUE, prints tier names and the time taken to match each tier.
The other arguments are all present in merge_plus; see the documentation there for details.
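To make the takeout options concrete, here is a minimal base-R sketch of what removing matches between tiers could look like. The helper apply_takeout and the toy data frames are hypothetical and are not part of fedmatch; the sketch only mirrors the description above and makes no claim about how tier_match implements it internally.

```r
# Hypothetical helper (not part of fedmatch): drop already-matched rows
# from one or both datasets before the next tier runs.
apply_takeout <- function(data1, data2, matches, takeout = "both",
                          key1 = "unique_key_1", key2 = "unique_key_2") {
  takeout <- match.arg(takeout, c("neither", "both", "data1", "data2"))
  if (takeout %in% c("both", "data1")) {
    data1 <- data1[!data1[[key1]] %in% matches[[key1]], , drop = FALSE]
  }
  if (takeout %in% c("both", "data2")) {
    data2 <- data2[!data2[[key2]] %in% matches[[key2]], , drop = FALSE]
  }
  list(data1 = data1, data2 = data2)
}

# Toy inputs, invented for this sketch.
d1 <- data.frame(unique_key_1 = 1:4,
                 Company = c("walmart", "apple", "ford", "ge"))
d2 <- data.frame(unique_key_2 = 1:4,
                 Name = c("walmart", "apple computer", "ford motor", "ge"))

# Suppose an exact tier matched keys 1 and 4 in both datasets.
tier_a_matches <- data.frame(unique_key_1 = c(1, 4), unique_key_2 = c(1, 4))

remaining <- apply_takeout(d1, d2, tier_a_matches, takeout = "data1")
nrow(remaining$data1)  # 2 rows left to match in the next tier
nrow(remaining$data2)  # 4 rows: data2 is untouched
```

With “neither”, every tier sees the full datasets, so the same row can be matched in several tiers. Removing matches between tiers means later, looser tiers only work on what the stricter tiers could not resolve.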
The result of tier_match is a list with 4 items: the matched dataset, the unmatched data from each of the two input datasets, and a match evaluation. Here’s what the matches look like:
result$matches[1:5]
#>             Company Country State  SIC Revenue unique_key_1 country state_code
#> 1:          walmart     USA    OH 3300     485            1     USA         OH
#> 2:          walmart     USA    OH 3300     485            1     USA         OH
#> 3:          Walmart     USA    OH 3300     485            1     USA         OH
#> 4: Bershire Hataway     USA       2222     223            2     USA         NE
#> 5:            apple     USA    CA 3384     215            3     USA         CA
#>    SIC_code earnings unique_key_2              Name matchscore Company_score
#> 1:     3380  490,000            1           walmart  1.0000000     1.0000000
#> 2:     3380  490,000            1           walmart  1.0000000     1.0000000
#> 3:     3380  490,000            1           Walmart  1.0000000     1.0000000
#> 4:     2220  220,000            2 Bershire Hathaway  0.9882353     0.9882353
#> 5:       NA  220,000            3    apple computer  0.8714286     0.8714286
#>    tier Company_compare multivar_score
#> 1:    a              NA             NA
#> 2:    b              NA             NA
#> 3:    c       1.0000000      1.0000000
#> 4:    c       0.9882353      0.9882353
#> 5:    b              NA             NA
As you can see, the matches dataset has a column called ‘tier’ that indicates which tier the match was from. It also includes any additional columns added by the matching process. In this example, we see ‘Company_score’, created from the post-hoc scoring; ‘wgt_jaccard_sim’, the Weighted Jaccard similarity, created when using the ‘wgt_jaccard’ setting of fuzzy_match (see the ‘Fuzzy-matching’ vignette for more details); and ‘Company_compare’, created from the multivar matching tier.
We also have a match evaluation, now filled out with more details broken down by tier:
result$match_evaluation
#>    tier matches in_tier_unique_1 in_tier_unique_2 pct_matched_1 pct_matched_2
#> 1:    a       2                2                2           0.2           0.2
#> 2:    b       7                7                7           0.7           0.7
#> 3:    c      10               10                9           1.0           0.9
#> 4:  all      19               10                9           1.0           0.9
#>    new_unique_1 new_unique_2
#> 1:            2            2
#> 2:            5            5
#> 3:            3            2
#> 4:           NA           NA
We can use this evaluation to figure out which tiers did the ‘best’ job matching, getting the most unique matches.
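As a small illustration of that comparison, the sketch below ranks tiers by how many new unique matches each one contributed. To keep it self-contained, it rebuilds a plain data frame from the values printed in the evaluation above instead of reading result$match_evaluation. Treating new_unique_1 as the ranking criterion is an assumption made for this example, not a recommendation from the package.

```r
# Values copied from the match evaluation printed above.
match_evaluation <- data.frame(
  tier = c("a", "b", "c", "all"),
  matches = c(2, 7, 10, 19),
  new_unique_1 = c(2, 5, 3, NA),
  new_unique_2 = c(2, 5, 2, NA),
  stringsAsFactors = FALSE
)

# Drop the "all" summary row, then rank tiers by the number of
# previously unmatched rows of dataset 1 that each tier picked up.
by_tier <- match_evaluation[match_evaluation$tier != "all", ]
by_tier <- by_tier[order(-by_tier$new_unique_1), ]

by_tier$tier[1]  # "b"
```

Tier ‘c’ produces the most matches overall (10), but tier ‘b’ adds the most rows that no earlier tier had matched (5), which is why the two rankings can disagree.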