Given a dataset of careers and incomes, how large is the difference in income between any pair of careers? Given a dataset of travel-time records, how much longer does a trip take when choosing public transportation mode A instead of B? In this work, we developed a framework named “EDOIF” to answer such questions.
EDOIF is a nonparametric framework based on the principle of “Estimation Statistics”. Its main purpose is to infer orders of empirical distributions from different categories, based on the probability of finding a value in one distribution that is greater than the expectation of another distribution. Given a set of ordered pairs of category and real values, the framework is capable of
You can install our package from CRAN.
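A CRAN installation normally uses the standard `install.packages` call. The line below assumes the package is published on CRAN under the name `EDOIF`, the same name that `library(EDOIF)` loads in the example further down.

```r
# Install the released version from CRAN (package name assumed to be "EDOIF")
install.packages("EDOIF")
```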
For the newest version on GitHub, please run the following command in an R terminal.
This requires the “remotes” package to be installed before installing EDOIF.
library(EDOIF)
#== simulation: generating distributions of five categories:
# Category5 dominates Category4
# Category4 dominates Category3
# Category3 dominates Category2
# Category2 has the same mean as Category1, so neither dominates the other
nInv=150 # number of samples per category
initMean=10
stepMean=20
std=8
simData1<-list() # container for the values and their category labels
simData1$Values<-rnorm(nInv,mean=initMean,sd=std)
simData1$Group<-rep(c("Category1"),times=nInv)
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean,sd=std) )
simData1$Group<-c(simData1$Group,rep(c("Category2"),times=nInv))
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+2*stepMean,sd=std) )
simData1$Group<-c(simData1$Group,rep(c("Category3"),times=nInv) )
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+3*stepMean,sd=std) )
simData1$Group<-c(simData1$Group, rep(c("Category4"),times=nInv) )
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+4*stepMean,sd=std) )
simData1$Group<-c(simData1$Group, rep(c("Category5"),times=nInv) )
#== parameter setting
bootT=1000 # number of bootstrap resamples (sampling with replacement)
alpha=0.05 # Significance level
#== Calling the class constructor
A1<-EDOIF(simData1$Values,simData1$Group, bootT=bootT, alpha=alpha, methodType ="perc")
#== Visualizing results
print(A1) # print the results in text mode
plot(A1, fontSize=10) # plot the results in graphic mode
Graphic mode results
1. A plot of the mean of each of the five categories with its confidence interval at significance level alpha. The horizontal axis represents categories and the vertical axis represents the values within each category's distribution.
2. A dominant-distribution network of the five categories. Each node represents a category and each edge represents a dominant-distribution relation between two categories. If there is an edge from category A to category B, then A dominates B. A larger node implies a higher mean value of the category.
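The text-mode results report a network density of 0.9 for this example. The sketch below shows one way to arrive at that number in base R. It assumes that density is the fraction of category pairs connected by an edge, which is consistent with the reported value but is not taken from the package source. The matrix `dom` and the other variable names are hypothetical.

```r
# Dominance relations read off the text-mode results of this example.
cats <- paste0("Category", 1:5)

# dom[a, b] is TRUE when there is an edge from a to b, i.e. a dominates b.
dom <- matrix(FALSE, nrow = 5, ncol = 5, dimnames = list(cats, cats))
dom[lower.tri(dom)] <- TRUE             # each higher-numbered category dominates the lower ones
dom["Category2", "Category1"] <- FALSE  # the one pair without dominance (p-val 0.4463)

nEdges  <- sum(dom)                     # edges in the network
nPairs  <- choose(length(cats), 2)      # unordered pairs of categories
density <- nEdges / nPairs              # assumed definition of density
print(density)
```

Nine of the ten pairs are ordered here, so the only missing edge is the one between Category1 and Category2, whose simulated means are equal.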
Text mode results
EDOIF (Empirical Distribution Ordering Inference Framework)
=======================================================
Alpha = 0.050000, Number of bootstrap resamples = 1000, CI type = perc
Using Mann-Whitney test to report whether A ≺ B
A dominant-distribution network density:0.900000
Distribution: Category1
Mean:10.840671 95CI:[ 9.706981,12.014179]
Distribution: Category2
Mean:11.044785 95CI:[ 9.806991,12.446037]
Distribution: Category3
Mean:50.462935 95CI:[ 49.208005,51.757706]
Distribution: Category4
Mean:70.299726 95CI:[ 69.103924,71.502505]
Distribution: Category5
Mean:91.190505 95CI:[ 89.895480,92.518455]
=======================================================
Mean difference of Category2 (n=150) minus Category1 (n=150): Category1 ⊀ Category2
:p-val 0.4463
Mean Diff:0.204114 95CI:[ -1.545130,1.930609]
Mean difference of Category3 (n=150) minus Category1 (n=150): Category1 ≺ Category3
:p-val 0.0000
Mean Diff:39.622264 95CI:[ 37.984831,41.378232]
Mean difference of Category4 (n=150) minus Category1 (n=150): Category1 ≺ Category4
:p-val 0.0000
Mean Diff:59.459055 95CI:[ 57.921328,61.127817]
Mean difference of Category5 (n=150) minus Category1 (n=150): Category1 ≺ Category5
:p-val 0.0000
Mean Diff:80.349835 95CI:[ 78.620391,82.133270]
Mean difference of Category3 (n=150) minus Category2 (n=150): Category2 ≺ Category3
:p-val 0.0000
Mean Diff:39.418150 95CI:[ 37.543210,41.241722]
Mean difference of Category4 (n=150) minus Category2 (n=150): Category2 ≺ Category4
:p-val 0.0000
Mean Diff:59.254941 95CI:[ 57.304359,61.098774]
Mean difference of Category5 (n=150) minus Category2 (n=150): Category2 ≺ Category5
:p-val 0.0000
Mean Diff:80.145720 95CI:[ 78.313321,82.040234]
Mean difference of Category4 (n=150) minus Category3 (n=150): Category3 ≺ Category4
:p-val 0.0000
Mean Diff:19.836791 95CI:[ 18.047421,21.762239]
Mean difference of Category5 (n=150) minus Category3 (n=150): Category3 ≺ Category5
:p-val 0.0000
Mean Diff:40.727570 95CI:[ 39.004372,42.627946]
Mean difference of Category5 (n=150) minus Category4 (n=150): Category4 ≺ Category5
:p-val 0.0000
Mean Diff:20.890780 95CI:[ 19.079287,22.625807]
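To make each entry above easier to read, here is a minimal base-R sketch of how a mean difference, a percentile bootstrap confidence interval, and a Mann-Whitney p-value can be computed for one pair of categories. It is not the package's implementation. The data `x` and `y`, the seed, and the one-sided form of the test are assumptions made for illustration only.

```r
set.seed(1)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical stand-ins for two categories (not the data used above)
x <- rnorm(150, mean = 50, sd = 8)
y <- rnorm(150, mean = 70, sd = 8)

bootT <- 1000  # number of bootstrap resamples
alpha <- 0.05  # significance level

# Resample each group with replacement and record the difference of means
bootDiff <- replicate(
  bootT,
  mean(sample(y, replace = TRUE)) - mean(sample(x, replace = TRUE))
)

# Percentile confidence interval of the mean difference
ci <- unname(quantile(bootDiff, probs = c(alpha / 2, 1 - alpha / 2)))

# Mann-Whitney (Wilcoxon rank-sum) test; alternative = "less" asks whether
# x is shifted below y (the sidedness is an assumption of this sketch)
pval <- wilcox.test(x, y, alternative = "less")$p.value

cat(sprintf("Mean Diff:%f CI:[%f,%f] p-val %.4f\n",
            mean(y) - mean(x), ci[1], ci[2], pval))
```

A confidence interval that lies entirely above zero, together with a p-value below alpha, corresponds to the case reported as A ≺ B in the results.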
For more examples, please see the package vignettes.
Amornbunchornvej, Chainarong, Navaporn Surasvadi, Anon Plangprasopchok, and Suttipong Thajchayapong. “A nonparametric framework for inferring orders of categorical data from category-real pairs.” Heliyon 6, no. 11 (2020): e05435, ISSN 2405-8440, https://doi.org/10.1016/j.heliyon.2020.e05435.