README

Given a dataset of careers and incomes, how large a difference of income between any pair of careers would be? Given a dataset of travel time records, how long do we need to spend more when choosing a public transportation mode A instead of B to travel? In this work, we developed a framework to solve these problems named “EDOIF”.

EDOIF is a nonparametric framework based on “Estimation Statistics” principle. Its main purpose is to infer orders of empirical distributions from different categories based on a probability of finding a value in one distribution that is greater than an expectation of another distribution. Given a set of ordered-pair of real-category values the framework is capable of

Installation

install.packages("EDOIF")

For the newest version on github, please call the following command in R terminal.

remotes::install_github("DarkEyes/EDOIF")

This requires a user to install the “remotes” package before installing EDOIF.

Example: Inferring orders of categories based on their empirical distributions

library(EDOIF)

#== simulation: Generating distributuions of five categories: 
# Category5 dominates Category4
# Category4 dominates Category3
# Category3 dominates Category2
# Category2 dominates Category1

nInv=150 # number of samples per categories
initMean=10
stepMean=20
std=8

simData1<-c()
simData1$Values<-rnorm(nInv,mean=initMean,sd=std)
simData1$Group<-rep(c("Category1"),times=nInv)
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean,sd=std) )
simData1$Group<-c(simData1$Group,rep(c("Category2"),times=nInv))
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+2*stepMean,sd=std) )
simData1$Group<-c(simData1$Group,rep(c("Category3"),times=nInv) )
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+3*stepMean,sd=std) )
simData1$Group<-c(simData1$Group, rep(c("Category4"),times=nInv) )
simData1$Values<-c(simData1$Values,rnorm(nInv,mean=initMean+4*stepMean,sd=std) )
simData1$Group<-c(simData1$Group, rep(c("Category5"),times=nInv) )

#== parameter setting
bootT=1000 # number of times of sample with replacement in bootstrap function.
alpha=0.05 # Significance level

#== Calling the class constructor
A1<-EDOIF(simData1$Values,simData1$Group, bootT=bootT, alpha=alpha, methodType ="perc") 

#== Visualizing results
print(A1) # print the results in text mode
plot(A1, fontSize=10) # print the results in graphic mode

Graphic mode results 1. An alpha-confidence-interval of mean plot for five categories. The horizontal axis represents categories and the vertical axis represents values within distributions of categories.

2. A dominant-distribution network of five categories. A node represents categories and an edge represents a dominant-distribution relation between categories. If there is an edge from category A to B, then A dominates B. A larger node size implies a higher mean value of a category.

Citation

Amornbunchornvej, Chainarong, Navaporn Surasvadi, Anon Plangprasopchok, Suttipong Thajchayapong. “A nonparametric framework for inferring orders of categorical data from category-real pairs.” Heliyon 6, no. 11 (2020): e05435, ISSN 2405-8440, https://doi.org/10.1016/j.heliyon.2020.e05435. arXiv

Empirical Distribution Ordering Inference Framework (EDOIF)

Installation

Example: Inferring orders of categories based on their empirical distributions

Citation

Contact