{domir}'s dominance analysis
The purpose of this vignette is to briefly discuss the conceptual underpinnings of the relative importance method implemented in {domir} and to provide several extensive examples that illustrate these concepts as applied to data.
This vignette is intended to serve as a refresher for users familiar with these concepts as well as a brief introduction to them for those who are not.
By the end of this vignette, the reader should have a sense of what the key relative importance method is attempting to do as well as an understanding of how it is accomplished as applied to data.
The relative importance method implemented in {domir} produces results that are relatively easy to interpret but is itself complex to implement. The discussion below outlines the conceptual origins of the method, what the relative importance method does, and some details about how it is implemented in {domir}.
The focus of the {domir} package is, currently, on dominance analysis (DA). DA can be thought of as an extension of Shapley value decomposition from cooperative game theory (see Grömping, 2007, for a discussion), which seeks to find a solution to the problem of how to subdivide payoffs to players in a cooperative game based on their relative contributions to the payoff.
This methodology can be applied to predictive modeling in a conceptually straightforward way. A predictive model is, in a sense, a game in which independent variables (IVs)/predictors/features cooperate to produce a payoff in the form of predicting the dependent variable (DV)/outcome/response. The component of the decomposition, or proportion of the payoff, ascribed to each IV can then be interpreted as that IV's importance in the context of the model, as it is the contribution the IV makes to predicting the DV.
In application, DA determines the relative importance of IVs in a predictive model based on each IV's contribution to an overall model fit statistic: a value that describes the entire model's predictions on a dataset at once. DA's goal extends beyond just the decomposition of the focal model fit statistic. In fact, DA produces three different results that it uses to compare the contribution each IV makes in the predictive model against the contributions attributed to each other IV. The use of these three results to compare IVs is the reason DA is an extension of Shapley value decomposition. The three different results are discussed in greater detail in the context of an example below, after a brief introduction to the domin function.
domir::domin
The domin function[1] of the {domir} package is an API for applying DA to predictive modeling functions in R. The sections below use this function to illustrate how DA is implemented and to discuss conceptual details of each computation.
The purpose of the {domir} package is to apply DA to predictive models. This section builds on the last by providing an example predictive model with which to illustrate the computation and interpretation of the dominance results produced by DA.
DA was developed originally using linear regression (lm) with the explained variance \(R^2\) metric as a fit statistic (Budescu, 1993). The examples below use this model and fit statistic as both are widely used and understood in statistics and data science.

Consider this model using the mtcars data in the {datasets} package.
library(datasets)
lm_cars <- lm(mpg ~ am + cyl + carb, data = mtcars)

summary(lm_cars)
#>
#> Call:
#> lm(formula = mpg ~ am + cyl + carb, data = mtcars)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -5.8853 -1.1581 0.2646 1.4885 5.4843
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 32.1731 2.4914 12.914 2.59e-13 ***
#> am 4.2430 1.3094 3.240 0.003074 **
#> cyl -1.7175 0.4298 -3.996 0.000424 ***
#> carb -1.1304 0.4058 -2.785 0.009481 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 2.755 on 28 degrees of freedom
#> Multiple R-squared: 0.8113, Adjusted R-squared: 0.7911
#> F-statistic: 40.13 on 3 and 28 DF, p-value: 2.855e-10
The results show that all three IVs are statistically significant at the traditional level (i.e., \(p < .05\)) and that, in total, the predictors—am, cyl, and carb—explain ~80% of the variance in mpg.
I intend to conduct a DA on this model using domir::domin and implement the DA as follows:
library(domir)
domin(mpg ~ am + cyl + carb,
lm, list(summary, "r.squared"),
data = mtcars)
#> Overall Fit Statistic: 0.8113023
#>
#> General Dominance Statistics:
#> General Dominance Standardized Ranks
#> am 0.2156848 0.2658501 2
#> cyl 0.4173094 0.5143698 1
#> carb 0.1783081 0.2197801 3
#>
#> Conditional Dominance Statistics:
#> IVs: 1 IVs: 2 IVs: 3
#> am 0.3597989 0.2164938 0.07076149
#> cyl 0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
#>
#> Complete Dominance Designations:
#> Dmnated?am Dmnated?cyl Dmnated?carb
#> Dmnates?am NA FALSE TRUE
#> Dmnates?cyl TRUE NA TRUE
#> Dmnates?carb FALSE FALSE NA
The domin function above prints out results in four sections:

1. the overall fit statistic value
2. the general dominance statistics
3. the conditional dominance statistics
4. the complete dominance designations

Below I “replay” and discuss each result in turn.
#> Overall Fit Statistic: 0.8113023
The first result domin prints is the overall fit statistic value for the model. In game theory terms, this value is the total payoff all players/IVs produced in the cooperative game/model.

This value serves as the fit statistic “to be decomposed” by the DA and is the limiting value for how much the IVs will be able to explain. The DA will ascribe to the three IVs in this model separate components of this ~\(.8113\) value related to their contributions to predicting mpg.

Other fit statistic value adjustments are reported in this section as well, in particular those associated with the all subsets and constant model adjustments (to be discussed further below; those sections are under development), when they are used in the DA.
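As a quick check on this decomposition, here is a minimal sketch (hand-copying the general dominance values from the printout above) showing that the three components sum back to the overall fit statistic:

# General dominance values copied from the printout above; their sum recovers
# the overall fit statistic that the DA decomposes.
sum(c(am = 0.2156848, cyl = 0.4173094, carb = 0.1783081))
#> [1] 0.8113023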
#> General Dominance Statistics:
#> General Dominance Standardized Ranks
#> am 0.2156848 0.2658501 2
#> cyl 0.4173094 0.5143698 1
#> carb 0.1783081 0.2197801 3
The second result printed reports the general dominance statistics, which describe how the overall fit statistic's value is divided up among the IVs. These also represent the Shapley value decomposition of the fit statistic, showing how each player/IV is ascribed a component of the payoff/fit statistic from the game/model.

The General Dominance column of statistics can be interpreted in terms of the fit metric it decomposes. For example, am has a value of ~\(0.2157\), which means that am is associated with an \(R^2\) of about twenty-two percentage points of mpg's variance given the predictive model and the other IVs.

The Standardized column expresses each general dominance statistic as a percentage of the overall fit statistic value; these percentages therefore sum to 100%. am's contribution to the \(R^2\)'s total value is ~27% (i.e., \(\frac{.2157}{.8113} = .2659\)).

The final Ranks column is most relevant to the focal purpose of determining the relative importance of the IVs in this model, as it provides a rank ordering of the IVs based on their general dominance statistics. It is here that DA moves beyond Shapley value decomposition: the rank ordering based on general dominance statistics allows for applying labels to the relationships between IVs in terms of their importance.
am is ranked second because it has a smaller general dominance statistic than the first ranked cyl. As such, am “is generally dominated by” cyl. By contrast, am is ranked higher than third ranked carb and thus am “generally dominates” carb. These labels are known as general dominance designations.
The general dominance statistics always sum to the value of the overall fit statistic and, because they represent parts of a whole, are widely considered the easiest of the dominance statistics to interpret (i.e., as compared to conditional and complete dominance discussed next).
The general dominance statistics, however simple to interpret, are the least stringent of the dominance statistics/designations (the reasons why are discussed later). Thus, the requirements to assign the “generally dominates” label to a relationship between two IVs are the easiest to fulfill.
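For readers who want to reproduce the Standardized and Ranks columns by hand, here is a minimal sketch using the general dominance values hand-copied from the printout above:

gen_dom <- c(am = 0.2156848, cyl = 0.4173094, carb = 0.1783081) # from the printout above

gen_dom / sum(gen_dom) # Standardized column: each IV's share of the overall R^2
rank(-gen_dom)         # Ranks column: 1 = largest general dominance statistic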
#> Conditional Dominance Statistics:
#> IVs: 1 IVs: 2 IVs: 3
#> am 0.3597989 0.2164938 0.07076149
#> cyl 0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
The third section reported on by domin prints the conditional dominance statistics associated with each IV. Each IV has a separate conditional dominance statistic for each number of IVs that can be in a sub-model; why this matrix is useful relates to the computation of these statistics, which is discussed later. For the time being, it suffices to note that, conceptually, these values can be thought of as Shapley values for a specific number of players in the game. Thus, the IVs: 1 column reports the value of the fit statistic/payoff when the IV is “playing alone” (i.e., by itself in the model with no other IVs). Similarly, the IVs: 2 column reports the average value of the fit statistic/payoff when the IV is “playing with one other” (i.e., with one other IV, irrespective of which), and so on until the column with all IVs in the model (here IVs: 3).
The primary utility of the conditional dominance matrix is that it can be used to designate importance in a way that is more stringent/harder to achieve than the general dominance statistics. Unfortunately, this matrix does not have a ranking like the general dominance statistics, and it may not be obvious how one might use it for determining importance.

To determine importance, the conditional dominance matrix is used ‘row-wise’, comparing the results of an entire row/IV against those of another row/IV. If the value of each entry for a row/IV is greater than the value for another row/IV at the same position (i.e., comparing models with the same number of IVs), then an IV is said to “conditionally dominate” the other IV. The matrix above shows that am “is conditionally dominated by” cyl as its conditional dominance statistics are smaller than cyl's at positions 1, 2, and 3. Conversely, am “conditionally dominates” carb as its conditional dominance statistics are greater than carb's at positions 1, 2, and 3.
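As a minimal sketch of these row-wise comparisons in code (hard-coding the conditional dominance statistics from the printout above):

cond_dom <- rbind( # conditional dominance statistics from the printout above
  am   = c(0.3597989, 0.2164938, 0.07076149),
  cyl  = c(0.7261800, 0.4181185, 0.10762967),
  carb = c(0.3035184, 0.1791172, 0.05228872)
)

all(cond_dom["cyl", ] > cond_dom["am", ])  # TRUE: cyl conditionally dominates am
all(cond_dom["am", ] > cond_dom["carb", ]) # TRUE: am conditionally dominates carb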
Conditional dominance statistics provide more information about each IV than general dominance statistics as they reveal the effect that IV redundancy has on prediction for each IV. To put this a little differently, conditional dominance statistics show more clearly the utility a specific player/IV adds to the payoff/fit metric. As the game/model gets more players/IVs, the contribution any one IV can make becomes more limited. This limiting effect with more IVs is reflected in the trajectory of conditional dominance statistics.
The increase in complexity with conditional dominance over that of general dominance also results in a more stringent set of comparisons. Because the label “conditionally dominates” is only ascribed to a relationship that shows more contribution to the fit metric at all positions of the conditional dominance matrix, it is a more difficult criterion to achieve and is therefore a stronger designation.
Note that conditional dominance implies general dominance, but the reverse is not true. An IV can generally, but not conditionally, dominate another IV.
#> Complete Dominance Designations:
#> Dmnated?am Dmnated?cyl Dmnated?carb
#> Dmnates?am NA FALSE TRUE
#> Dmnates?cyl TRUE NA TRUE
#> Dmnates?carb FALSE FALSE NA
The fourth section reported on by domin prints the complete dominance designations associated with each IV pair. Each IV is compared to each other IV and has two entries in this matrix. The IV noted in the row label represents a “completely dominates” relationship with the IV noted in the column label. By contrast, the IV noted in the column label represents an “is completely dominated by” relationship with the IV in the row label. Each designation is then assigned a logical value or NA (i.e., when no complete dominance designation can be made).
The complete dominance designations are useful beyond the general and conditional dominance results as they are the most stringent sets of comparisons. Complete dominance reflects the relative performance of each IV to another IV in all the sub-models where their relative predictive performance can be compared. The results of this section are then not statistics but are the aggregate of an extensive series of inequality comparisons (i.e., in the mathematical sense: \(>\), \(<\)) of individual sub-models expressed as logical designations.
Complete dominance designations are the most stringent of the designations: all comparable sub-models for an IV must show a larger fit statistic value than those of the comparison IV for the “completely dominates” label to be ascribed to their relationship. The sorts of sub-models that qualify as ‘comparable’ are discussed later.
Also note that complete dominance implies both conditional and general dominance but, again, not the reverse. An IV can conditionally or generally, but not completely, dominate another IV.
The DA methodology currently implemented in {domir} is a relatively assumption-free and model-agnostic, but computationally expensive, methodology that follows from the way Shapley value decomposition was originally formulated.

The sections below begin by providing an analogy for how to think about the computation of DA results, outline exactly how each dominance statistic and designation is determined, and extend the example above by applying each computation in the context of that example.
Shapley value decomposition and DA have traditionally been implemented by treating the cooperative game/predictive model as though it were an experimental design that seeks to evaluate the impact of the players/IVs on the payoff/fit statistic. When designing the experiment applied to the model, it is assumed that the only factors are the IVs and that each has two levels: 1) the IV is included in the sub-model or 2) the IV is excluded from the sub-model; all other potential inputs to the model are held constant.

The specific type of design applied to the model is a full-factorial design in which all possible combinations of the IVs, included or excluded, are estimated as sub-models. When there are \(p\) IVs in the model, there will be \(2^p\) sub-models estimated. The full-factorial design required for computing dominance statistics and designations is why the traditional approach is considered computationally expensive, as each additional IV doubles the number of required sub-models to estimate.
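As a small sketch of what this full-factorial design looks like for three IVs (using the same IV names as the example model), each sub-model corresponds to one row of an include/exclude grid:

# Enumerate the 2^p inclusion/exclusion patterns for p = 3 IVs; each row is a
# candidate sub-model (the all-FALSE row is the empty, intercept-only model).
ivs <- c("am", "cyl", "carb")
design <- expand.grid(rep(list(c(FALSE, TRUE)), length(ivs)))
names(design) <- ivs

nrow(design) # 2^3 = 8 sub-models
apply(design, 1, function(include) paste(ivs[include], collapse = " + "))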
The DA results for the lm model with three IVs discussed above are composed of 8 sub-models and their \(R^2\) values. The domin function, if supplied a predictive modeling function that can record each sub-model's results, can be adapted to capture each sub-model's \(R^2\) value along with the IVs that comprise it.
The code below constructs a wrapper function to export results from each sub-model to an external data frame. The code to produce these results is complex and each line is commented to note its purpose. This wrapper function then replaces lm in the call to domin so that, as the DA is executed, all the sub-models' data are captured for the illustration to come.
lm_capture <-
  function(formula, ...) { # wrapper function that accepts formula and ellipsis arguments
    count <<- count + 1 # increment counter in enclosing environment
    lm_obj <- lm(formula, ...) # estimate 'lm' model and save object
    DA_results[count, "formula"] <<-
      deparse(formula) # record string version of formula passed in 'DA_results' in enclosing environment
    DA_results[count, "R^2"] <<-
      summary(lm_obj)[["r.squared"]] # record R^2 in 'DA_results' in enclosing environment
    return(lm_obj) # return 'lm' class-ed object
  }

count <- 0 # initialize the count indicating the row in which the results will fill-in

DA_results <- # container data frame in which to record results
  data.frame(formula = rep("", times = 2^3 - 1),
             `R^2` = rep(NA, times = 2^3 - 1),
             check.names = FALSE)

lm_da <- domin(mpg ~ am + cyl + carb, # implement the DA with the wrapper
               lm_capture,
               list(summary, "r.squared"),
               data = mtcars)

DA_results
#> formula R^2
#> 1 mpg ~ am 0.3597989
#> 2 mpg ~ cyl 0.7261800
#> 3 mpg ~ am + cyl 0.7590135
#> 4 mpg ~ carb 0.3035184
#> 5 mpg ~ am + carb 0.7036726
#> 6 mpg ~ cyl + carb 0.7405408
#> 7 mpg ~ am + cyl + carb 0.8113023
The printed result from DA_results shows that domin runs 7 sub-models, each a different combination of the IVs. Note that, by default, the sub-model in which all IVs are excluded is assumed to result in a fit statistic value of 0 and is not estimated directly (this can be changed with the consmodel argument to domin).
The \(R^2\) values recorded in DA_results are used to compute the dominance statistics and designations reported on above.
Complete dominance between two IVs is designated by:
\[X_v D X_z \;\text{ if }\; 2^{p-2} = \Sigma^{2^{p-2}}_{j=1} \begin{cases} 1 & \text{if } F_{X_v \cup S_j} > F_{X_z \cup S_j} \\ 0 & \text{if } F_{X_v \cup S_j} \le F_{X_z \cup S_j} \end{cases}\]
Where \(X_v\) and \(X_z\) are two IVs, \(S_j\) is a distinct set of the other IVs in the model not including \(X_v\) and \(X_z\) (which can be the null set, \(\emptyset\), with no other IVs), and \(F\) is a model fit statistic. Conceptually, this computation implies that when all \(2^{p-2}\) comparisons show that \(X_v\)'s fit statistic is greater than \(X_z\)'s, then \(X_v\) completely dominates \(X_z\).
The results from DA_results can then be used to make the comparisons required to determine whether each pair of IVs completely dominates the other. The comparison begins with the results for am and cyl.
formula (with am) | R^2 | formula (with cyl) | R^2 |
---|---|---|---|
mpg ~ am | 0.360 | mpg ~ cyl | 0.726 |
mpg ~ am + carb | 0.704 | mpg ~ cyl + carb | 0.741 |
The rows in the table above are aligned so that comparable models appear in the same row. As applied to this example, the \(S_j\) sets are \(\emptyset\) (i.e., the null set) with no other IVs and the set including carb.
The \(R^2\) values across the comparable models show that cyl has larger \(R^2\) values than am.
formula (with am) | R^2 | formula (with carb) | R^2 |
---|---|---|---|
mpg ~ am | 0.360 | mpg ~ carb | 0.304 |
mpg ~ am + cyl | 0.759 | mpg ~ cyl + carb | 0.741 |
Here the \(S_j\) sets are, again, \(\emptyset\) and the set also including cyl.
The \(R^2\) values across the comparable models show that am has larger \(R^2\) values than carb.
formula (with cyl) | R^2 | formula (with carb) | R^2 |
---|---|---|---|
mpg ~ cyl | 0.726 | mpg ~ carb | 0.304 |
mpg ~ am + cyl | 0.759 | mpg ~ am + carb | 0.704 |
Finally, the \(S_j\) sets are the \(\emptyset\) and the set also including am.
The \(R^2\) values across the comparable models show that cyl has larger \(R^2\) values than carb.
Each of these three sets of comparisons is represented in the Complete_Dominance matrix.
lm_da$Complete_Dominance
#> Dmnated_am Dmnated_cyl Dmnated_carb
#> Dmnates_am NA FALSE TRUE
#> Dmnates_cyl TRUE NA TRUE
#> Dmnates_carb FALSE FALSE NA
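The same comparisons can also be expressed compactly in code; here is a sketch for the am versus cyl pair using the recorded DA_results (the other two pairs follow the same pattern):

# Name each sub-model's R^2 by its formula string for easy lookup.
r2 <- setNames(DA_results[["R^2"]], DA_results$formula)

# cyl completely dominates am if its R^2 is larger for every comparable S_j.
all(r2[c("mpg ~ cyl", "mpg ~ cyl + carb")] >
      r2[c("mpg ~ am", "mpg ~ am + carb")]) # TRUE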
Conditional dominance statistics are computed as:
\[C^i_{X_v} = \Sigma^{\binom{p-1}{i-1}}_{j=1}{\frac{F_{X_v \cup S_j} - F_{S_j}}{\binom{p-1}{i-1}}}\]

Where \(S_j\) is a distinct subset of \(i-1\) IVs not including \(X_v\) and \(\binom{p-1}{i-1}\) is the number of distinct combinations produced by choosing \(i-1\) elements from the \(p-1\) IVs other than \(X_v\) (i.e., the value produced by choose(p-1, i-1)).

In effect, the formula above amounts to an average, at each total number of IVs in the sub-model, of the differences between each model containing \(X_v\) and the comparable model not containing it. As applied to the results from DA_results, am's conditional dominance statistics are computed with the following differences:
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am | 0.36 | mpg ~ 1 | 0 | 0.36 |
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am + cyl | 0.759 | mpg ~ cyl | 0.726 | 0.033 |
mpg ~ am + carb | 0.704 | mpg ~ carb | 0.304 | 0.400 |
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am + cyl + carb | 0.811 | mpg ~ cyl + carb | 0.741 | 0.071 |
Each row of these tables represents a difference to be recorded for the conditional dominance statistic computation. In the one-, two-, and three-IV comparison tables, the model with am is presented first (as the minuend) and the comparable model without am is presented second (as the subtrahend); for the one-IV comparison table, the comparable model is the intercept-only model that, as noted above, is assumed to have a fit statistic value of 0. The difference is presented last.
Within each table, the differences are averaged, resulting in a value of 0.36 at one IV, 0.216 at two IVs, and 0.071 at three IVs.
Next the computations for cyl are reported.
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ cyl | 0.726 | mpg ~ 1 | 0 | 0.726 |
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am + cyl | 0.759 | mpg ~ am | 0.360 | 0.399 |
mpg ~ cyl + carb | 0.741 | mpg ~ carb | 0.304 | 0.437 |
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am + cyl + carb | 0.811 | mpg ~ am + carb | 0.704 | 0.108 |
Again, the differences are averaged, resulting in a value of 0.726 at one IV, 0.418 at two IVs, and 0.108 at three IVs.
Finally, the computations for carb.
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ carb | 0.304 | mpg ~ 1 | 0 | 0.304 |
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am + carb | 0.704 | mpg ~ am | 0.360 | 0.344 |
mpg ~ cyl + carb | 0.741 | mpg ~ cyl | 0.726 | 0.014 |
formula minuend | R^2 minuend | formula subtrahend | R^2 subtrahend | difference |
---|---|---|---|---|
mpg ~ am + cyl + carb | 0.811 | mpg ~ am + cyl | 0.759 | 0.052 |
And again, the differences are averaged, resulting in a value of 0.304 at one IV, 0.179 at two IVs, and 0.052 at three IVs.
These nine values then populate the conditional dominance statistic matrix.
lm_da$Conditional_Dominance
#> IVs_1 IVs_2 IVs_3
#> am 0.3597989 0.2164938 0.07076149
#> cyl 0.7261800 0.4181185 0.10762967
#> carb 0.3035184 0.1791172 0.05228872
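Here is a sketch that reproduces am's row of this matrix directly from DA_results by averaging, at each sub-model size, the \(R^2\) increments from adding am (with the intercept-only model's fit taken as 0):

r2 <- setNames(DA_results[["R^2"]], DA_results$formula)

c(IVs_1 = unname(r2["mpg ~ am"]) - 0,
  IVs_2 = mean(c(r2["mpg ~ am + cyl"] - r2["mpg ~ cyl"],
                 r2["mpg ~ am + carb"] - r2["mpg ~ carb"])),
  IVs_3 = unname(r2["mpg ~ am + cyl + carb"] - r2["mpg ~ cyl + carb"]))
# Should match am's row above: ~0.360, ~0.216, ~0.071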
The conditional dominance matrix's values can then be used in a way similar to the complete dominance designations above, creating a series of logical designations indicating whether each IV conditionally dominates each other IV.

Below, the comparisons begin with am and cyl.
IVs | am | cyl | comparison |
---|---|---|---|
IVs_1 | 0.360 | 0.726 | FALSE |
IVs_2 | 0.216 | 0.418 | FALSE |
IVs_3 | 0.071 | 0.108 | FALSE |
The table above is a transpose of the conditional dominance statistic matrix with an additional comparison column indicating whether the first IV/am’s conditional dominance statistic at that number of IVs is greater than the second IV/cyl’s.
Conditional dominance is determined when all values are TRUE or all are FALSE; in this case, cyl is seen to conditionally dominate am as all values are FALSE.
Next is the comparison between am and carb.
IVs | am | carb | comparison |
---|---|---|---|
IVs_1 | 0.360 | 0.304 | TRUE |
IVs_2 | 0.216 | 0.179 | TRUE |
IVs_3 | 0.071 | 0.052 | TRUE |
Here am conditionally dominates carb as all values are TRUE.
The final comparison is between cyl and carb.
IVs | cyl | carb | comparison |
---|---|---|---|
IVs_1 | 0.726 | 0.304 | TRUE |
IVs_2 | 0.418 | 0.179 | TRUE |
IVs_3 | 0.108 | 0.052 | TRUE |
cyl also conditionally dominates carb as all values are TRUE.
Another way of looking at conditional dominance is by graphing the trajectory of each IV's statistics across all positions in the conditional dominance matrix. If an IV's line crosses another IV's line, then a conditional dominance relationship between those two IVs cannot be determined. A graph of the trajectories of the three IVs in the focal model is shown below.
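One way such a plot could be drawn, as a minimal sketch using base graphics and the stored lm_da object, is:

# Plot each IV's conditional dominance statistics across sub-model sizes;
# non-crossing lines indicate clear conditional dominance relationships.
matplot(1:3, t(lm_da$Conditional_Dominance), type = "b", pch = 15:17, lty = 1,
        xlab = "Number of IVs in sub-model",
        ylab = "Conditional dominance statistic", xaxt = "n")
axis(1, at = 1:3)
legend("topright", legend = rownames(lm_da$Conditional_Dominance),
       pch = 15:17, col = 1:3, lty = 1)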
The graph above confirms that the three IVs' lines never cross and thus that the IVs have a clear set of conditional dominance designations.

As was mentioned in the Concepts in Application section, because we knew all three IVs have complete dominance designations relative to one another, they necessarily also have conditional dominance designations relative to one another.
General dominance is computed as:
\[C_{X_v} = \Sigma^p_{i=1}{\frac{C^i_{X_v}}{p}}\]
Where \(C^i_{X_v}\) is the conditional dominance statistic for \(X_v\) with \(i\) IVs. Hence, the general dominance statistics are the arithmetic average of all the conditional dominance statistics for an IV.
When applied to am’s results, the general dominance statistic value is:
IVs_1 | IVs_2 | IVs_3 | general dominance |
---|---|---|---|
0.36 | 0.216 | 0.071 | 0.216 |
Next, cyl.
IVs_1 | IVs_2 | IVs_3 | general dominance |
---|---|---|---|
0.726 | 0.418 | 0.108 | 0.417 |
And lastly carb.
IVs_1 | IVs_2 | IVs_3 | general dominance |
---|---|---|---|
0.304 | 0.179 | 0.052 | 0.178 |
Taken as a set, these values represent the general dominance statistic/Shapley value decomposition vector:
lm_da$General_Dominance
#> am cyl carb
#> 0.2156848 0.4173094 0.1783081
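Equivalently, because the general dominance statistics are the arithmetic averages of the conditional dominance statistics, the vector above can be reproduced in one line:

rowMeans(lm_da$Conditional_Dominance) # matches lm_da$General_Dominance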
The general dominance statistic vector can also be used in a way similar to that of the complete and conditional dominance designations by ranking each value.
IV | general dominance | ranks |
---|---|---|
am | 0.216 | 2 |
cyl | 0.417 | 1 |
carb | 0.178 | 3 |
The rank ordering above shows that am is generally dominated by cyl, am generally dominates carb, and cyl also generally dominates carb.
Again, because we knew all three IVs have complete and conditional dominance designations relative to one another, they necessarily also have general dominance designations relative to one another.

It is also worth pointing out a subtle feature of the general dominance statistics that tends to be made more explicit in discussions of Shapley value decomposition. This feature is that each general dominance statistic is a weighted average of all \(2^p\) fit statistics.

To see how this is the case, first recall the computations related to obtaining the conditional dominance statistics for the am IV. Looking across all the entries in the three tables, all 8 models are included either as a minuend or a subtrahend. am's general dominance statistic is then just an average of these three conditional dominance statistics. Hence, the general dominance statistics incorporate the value of every model.
The conditional dominance statistics for cyl and carb re-arrange these same models but otherwise use the same information to produce their general dominance statistics.
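To make the weighting concrete, here is a sketch (not part of the package's output) that recomputes am's general dominance statistic as a Shapley-weighted sum of the sub-model \(R^2\) increments recorded in DA_results, where a subset \(S\) of the other IVs receives weight \(\frac{|S|!\,(p - |S| - 1)!}{p!}\):

r2 <- setNames(DA_results[["R^2"]], DA_results$formula)
p  <- 3
w  <- function(s) factorial(s) * factorial(p - s - 1) / factorial(p) # weight for |S| = s

sum(w(0) * (r2["mpg ~ am"] - 0),                                    # S = {}
    w(1) * (r2["mpg ~ am + cyl"] - r2["mpg ~ cyl"]),                # S = {cyl}
    w(1) * (r2["mpg ~ am + carb"] - r2["mpg ~ carb"]),              # S = {carb}
    w(2) * (r2["mpg ~ am + cyl + carb"] - r2["mpg ~ cyl + carb"]))  # S = {cyl, carb}
# ~0.2157: am's general dominance statistic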
Although all dominance designations have been made in the example above, the strongest designation between two IVs is likely of primary interest because, as noted above, the strongest designation implies all weaker designations.
To access the strongest dominance designations, the DA object can be submitted to the summary function.
summary(lm_da)$Strongest_Dominance
#>
#> "am" "am" "cyl"
#> "is completely dominated by" "completely dominates" "completely dominates"
#> "cyl" "carb" "carb"
The result the summary function produces in the Strongest_Dominance element is consistent with expectations in that all three IV interrelationships have complete dominance designations between them.
The DA method implemented by the domir::domin function is relatively assumption-free but does make an assumption about the nature of the model that is dominance analyzed. DA assumes that the predictive model used is “pre-selected” or has passed through model selection procedures and that the user is confident the IVs/players in the model/game are, in fact, reasonable to include. DA is not intended for use as a model selection tool.
“Relative importance” as a concept is used in many different ways in statistics and data science. In many cases, methods that focus on relative importance are probably best used for model selection/identifying trivial IVs for removal. DA, by contrast, is a method that is more focused on importance in a “model evaluation” sense. What I mean by model evaluation is an application where the user describes/interprets IVs’ effects in the context of a finalized predictive model.
[1] Note that the domin function has been superseded by the domir function. Despite its programming designation, this vignette's purpose is to illustrate dominance analysis concepts and it will use domin for this purpose as both functions produce identical results.