Variable selection in DEA requires careful attention before the results of an analysis can be used in a real case, because those results can change significantly depending on the variables included in the model. Variable selection is therefore a key step in every DEA application.
adea provides a measure, called load, of the contribution of a variable to a DEA model. In the ideal case, when all variables contribute equally, all loads are 1. For example, an output variable load of 0.75 means that its contribution is 75% of the average contribution of all outputs. A variable load below 0.6 means that its contribution to the DEA model is negligible.
For more information see (Fernandez-Palacin, Lopez-Sanchez, and Munoz-Marquez 2018) and (Villanueva-Cantillo and Munoz-Marquez 2021).
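To make the idea of a load "relative to the average" concrete, here is a minimal sketch in base R. The raw contribution scores are hypothetical numbers chosen for illustration. adea derives the real contributions from the DEA model itself, and this snippet does not attempt to reproduce that computation.

```r
# Hypothetical raw contribution scores for two outputs (illustrative values,
# not produced by adea).
raw <- c(O1 = 3, O2 = 5)

# Dividing by the mean puts the scores on a scale where the average is 1.
loads <- raw / mean(raw)
print(loads)

# O1 ends up at 0.75, that is, 75% of the average contribution of the outputs.
```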
Let’s load the tokyo_libraries dataset and have a look at it with
data(tokyo_libraries)
head(tokyo_libraries)
#> Area.I1 Books.I2 Staff.I3 Populations.I4 Regist.O1 Borrow.O2
#> 1 2.249 163.523 26 49.196 5.561 105.321
#> 2 4.617 338.671 30 78.599 18.106 314.682
#> 3 3.873 281.655 51 176.381 16.498 542.349
#> 4 5.541 400.993 78 189.397 30.810 847.872
#> 5 11.381 363.116 69 192.235 57.279 758.704
#> 6 10.086 541.658 114 194.091 66.137 1438.746
Two stepwise variable selection functions are provided. The first one drops variables one by one, giving a set of nested models. The following code sets up the input and output variables and makes the call
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]
adea_hierarchical(input, output)
#> Load #Efficients #Variables #Inputs #Outputs
#> 6 inoutput 6 6 4 2
#> 5 inoutput 6 5 3 2
#> 4 inoutput 4 4 3 1
#> 3 inoutput 2 3 2 1
#> 2 inoutput 1 2 1 1
#> 1 inoutput 0 1 0 0
#> Inputs Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 4 Books.I2, Staff.I3, Populations.I4 Borrow.O2
#> 3 Books.I2, Populations.I4 Borrow.O2
#> 2 Books.I2 Borrow.O2
#> 1
The load of the first model is 0.455467, which is below the minimum significance level of 0.6, so Area.I1 can be removed from the model.
When a variable is removed, one would expect the loads of all remaining variables to rise, but after the second model this does not happen. The third model is therefore poorer than the second, and there is no statistical reason to select it.
To avoid this, a second stepwise variable selection function is provided. The new call is
adea_parametric(input, output)
#> Load #Efficients #Variables #Inputs #Outputs
#> 6 0.455467 6 6 4 2
#> 5 0.990164 6 5 3 2
#> 2 1.000000 1 2 1 1
#> Inputs Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 2 Books.I2 Borrow.O2
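One way to read the table above is sketched below in base R. The loads and model sizes are copied from the adea_parametric output, and 0.6 is the threshold quoted earlier. Picking the largest model whose load reaches the threshold is an assumption made for this illustration, not a rule stated by the package.

```r
# Loads and sizes copied from the adea_parametric output above.
models <- data.frame(
  nvariables = c(6, 5, 2),
  load = c(0.455467, 0.990164, 1.000000)
)

# 0.6 is the threshold quoted earlier in this document for a negligible load.
threshold <- 0.6

# Keep the models whose load reaches the threshold, then take the largest one.
acceptable <- models[models$load >= threshold, ]
chosen <- acceptable[which.max(acceptable$nvariables), ]
print(chosen)
```

Under this rule the five-variable model is retained, since it is the largest one whose load is at or above the threshold.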
In both cases, all variables were considered as candidates for removal, but the load.orientation parameter allows selecting which variables are included in the load analysis: input for input variables only, output for output variables only, and inoutput, the default value, for all variables. The next call considers only output variables as candidates for removal:
adea_parametric(input, output, load.orientation = 'output')
#> Load #Efficients #Variables #Inputs #Outputs
#> 6 1 6 6 4 2
#> 5 1 4 5 4 1
#> Inputs Outputs
#> 6 Area.I1, Books.I2, Staff.I3, Populations.I4 Regist.O1, Borrow.O2
#> 5 Area.I1, Books.I2, Staff.I3, Populations.I4 Borrow.O2
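To compare the three settings side by side, the following sketch runs adea_parametric once per load.orientation value. It uses only the call form and the three values shown in this document, assumes the adea package is installed, and takes columns 1 to 4 as inputs and 5 to 6 as outputs, as the column names suggest. The names orientations and results are introduced here for illustration.

```r
library(adea)
data(tokyo_libraries)

# Input/output split following the column names: I1..I4 and O1..O2.
input <- tokyo_libraries[, 1:4]
output <- tokyo_libraries[, 5:6]

# The three load.orientation values named in the text.
orientations <- c("input", "output", "inoutput")

# Run the parametric selection once per orientation and keep the results.
results <- lapply(orientations, function(lo) {
  adea_parametric(input, output, load.orientation = lo)
})
names(results) <- orientations

# Print each result under a heading.
for (lo in orientations) {
  cat("load.orientation =", lo, "\n")
  print(results[[lo]])
}
```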
Both adea_hierarchical and adea_parametric return an object containing a list, called models, with all computed models, which can be accessed as follows
m <- adea_hierarchical(input, output)
m4 <- m$models[[4]]
m4
#> 1 2 3 4 5 6 7 8
#> 0.3026132 0.6425505 0.5733000 0.7164871 0.6733832 1.0000000 0.6967419 0.4476942
#> 9 10 11 12 13 14 15 16
#> 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430 0.7832606 0.5822710
#> 17 18 19 20 21 22 23
#> 0.8451129 0.7867065 1.0000000 0.8485716 0.7285929 0.7849437 1.0000000
where the number in square brackets is the total number of variables in the model.
By default, when the print function is called on an adea model, it prints only the efficiencies. summary gives more detailed output:
summary(m4)
#> Model name:
#> Orientation is input
#> Inputs: Books.I2 Staff.I3 Populations.I4
#> Outputs: Borrow.O2
#> Input loads: 1.193651 0.9031744 0.9031744
#> Output loads: 1
#> Model load: 0.903174350658053
#> #Efficients: 4
#> Efficiencies:
#> 1 2 3 4 5 6 7 8
#> 0.3026132 0.6425505 0.5733000 0.7164871 0.6733832 1.0000000 0.6967419 0.4476942
#> 9 10 11 12 13 14 15 16
#> 1.0000000 0.7051438 0.5336592 0.7583527 0.5915395 0.7215430 0.7832606 0.5822710
#> 17 18 19 20 21 22 23
#> 0.8451129 0.7867065 1.0000000 0.8485716 0.7285929 0.7849437 1.0000000
#> Summary of efficiencies:
#> Mean sd Min. 1st Qu. Median 3rd Qu. Max.
#> 0.7270638 0.1793772 0.3026132 0.6170450 0.7215430 0.8159097 1.0000000
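The loads in this summary can be checked against the description of the load measure with a few lines of base R. The values are copied from the output above, and pairing them with variable names assumes they are listed in the same order as the Inputs and Outputs lines. That the model load equals the smallest variable load is an inference from these numbers, not something stated in the text.

```r
# Values copied from the summary(m4) output above.
input_loads <- c(Books.I2 = 1.193651, Staff.I3 = 0.9031744, Populations.I4 = 0.9031744)
output_loads <- c(Borrow.O2 = 1)

# Input loads average to 1, up to rounding in the printed values.
print(mean(input_loads))

# The smallest load across all variables matches the reported model load.
model_load <- min(c(input_loads, output_loads))
print(model_load)
```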
Fernandez-Palacin, Fernando, María Auxiliadora Lopez-Sanchez, and Manuel Munoz-Marquez. 2018. “Stepwise selection of variables in DEA using contribution loads.” Pesquisa Operacional 38 (1): 31–52. http://dx.doi.org/10.1590/0101-7438.2018.038.01.0031.
Villanueva-Cantillo, Jeyms, and Manuel Munoz-Marquez. 2021. “Methodology for Calculating Critical Values of Relevance Measures in Variable Selection Methods in Data Envelopment Analysis.” European Journal of Operational Research 290 (2): 657–70. https://doi.org/10.1016/j.ejor.2020.08.021.
Universidad de Cádiz, fernando.fernandez@uca.es
Universidad de Cádiz, manuel.munoz@uca.es