ELCIC is a robust, consistent, and data-driven model selection criterion based on the empirical likelihood function. The framework adopts plug-in estimators obtained by solving external estimating equations, which need not come from the empirical likelihood itself. This avoids potential computational convergence issues and permits versatile applications, such as generalized linear models, generalized estimating equations, and penalized regressions. The criterion is derived from the asymptotic expansion of the marginal likelihood under the variable selection framework, but, more importantly, its model selection consistency is established in a general context.
if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("chencxxy28/ELCIC")
library(ELCIC)
library(MASS)
library(mvtnorm)
This tutorial illustrates the basic usage of the ELCIC package. For technical details, see Chen et al. (2021). Below, we present three applications of ELCIC: variable selection in generalized linear models, model selection in longitudinal data, and model selection in longitudinal data with missingness.
First, let us generate pseudo data from a negative binomial distribution with overdispersion parameter (ov) equal to 8. We provide a function glm.generator to generate responses and covariates. This function supports several outcome distributions, including Gaussian, Poisson, Binomial, and Negative Binomial.
We first test how ELCIC works when the distribution is correctly specified. The outcomes are generated from a Poisson distribution. We then consider seven candidate models with different covariate combinations and compare ELCIC with the most commonly used criteria: AIC, BIC, and GIC. The function ELCIC.glm produces the values of ELCIC, AIC, BIC, and GIC for the candidate models. The selection rates of the four criteria are presented based on 20 Monte Carlo runs (may take 1-2 minutes to run).
set.seed(28582)
inter.total=20
count.matrix<-matrix(0,nrow=4,ncol=7)
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
rownames(count.matrix)<-c("ELCIC","AIC","BIC","GIC")
colnames(count.matrix)<-candidate.sets
samplesize<-100
ov=2
for(iter in 1:inter.total)
{
  #generate data
  simulated.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=samplesize,rho=0.5,dist="poisson")
  y<-simulated.data[["y"]]
  x<-simulated.data[["x"]]
  criterion.all<-ELCIC.glm(x=x,y=y,candidate.sets=candidate.sets,name.var=NULL,dist="poisson")
  #print(iter)
  index.used<-cbind(1:4,apply(criterion.all,1,which.min))
  count.matrix[index.used]<-count.matrix[index.used]+1
}
count.matrix/inter.total
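The tallying step inside the loop relies on indexing a matrix with a two-column matrix of (row, column) pairs. As a minimal sketch of that idiom, the snippet below uses a made-up criterion matrix with three hypothetical candidate models (`m1`, `m2`, `m3`); the numbers are illustrative only and are not output from ELCIC.glm.

```r
# Toy criterion matrix: 4 criteria (rows) x 3 candidate models (columns).
# All values are invented for illustration.
criterion.toy <- matrix(c(10,  8,  9,
                          12, 11, 13,
                           7,  9,  6,
                           5,  4,  8),
                        nrow = 4, byrow = TRUE,
                        dimnames = list(c("ELCIC", "AIC", "BIC", "GIC"),
                                        c("m1", "m2", "m3")))
count.toy <- matrix(0, nrow = 4, ncol = 3, dimnames = dimnames(criterion.toy))

# For each criterion (row), find the column holding the smallest value.
best.col <- apply(criterion.toy, 1, which.min)

# A two-column numeric matrix of (row, column) pairs selects one cell per row,
# so each criterion's chosen model gets exactly one extra count.
index.toy <- cbind(1:4, best.col)
count.toy[index.toy] <- count.toy[index.toy] + 1
count.toy
```

Repeating this update over the Monte Carlo runs and dividing by the number of runs gives a selection rate per criterion and candidate model. The selection rates discussed next come from the simulation loop above, not from this toy matrix.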
Based on the simulation results, ELCIC and BIC perform best, with the highest selection rates among the four criteria. This shows that ELCIC is as powerful as BIC when the distribution is correctly specified. Next, we consider the situation where AIC, BIC, and GIC are based on a misspecified distribution. Accordingly, we generate outcomes from a Negative Binomial distribution with the dispersion parameter equal to 2, while AIC, BIC, and GIC take the Poisson distribution as input (may take 1-2 minutes to run).
set.seed(28582)
inter.total=20
count.matrix<-matrix(0,nrow=4,ncol=7)
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
rownames(count.matrix)<-c("ELCIC","AIC","BIC","GIC")
colnames(count.matrix)<-candidate.sets
samplesize<-100
ov=2
for(iter in 1:inter.total)
{
  #generate data
  simulated.data<-glm.generator(beta=c(0.5,0.5,0.5,0),samplesize=samplesize,rho=0.5,dist="NB",ov=2)
  y<-simulated.data[["y"]]
  x<-simulated.data[["x"]]
  criterion.all<-ELCIC.glm(x=x,y=y,candidate.sets=candidate.sets,name.var=NULL,dist="poisson")
  #print(iter)
  index.used<-cbind(1:4,apply(criterion.all,1,which.min))
  count.matrix[index.used]<-count.matrix[index.used]+1
}
count.matrix/inter.total
The above results show that ELCIC has the highest selection rate. Its better performance is attributed to its distribution-free property, which is highly valuable in real applications where the underlying distribution is hard to specify correctly.
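To see why a Poisson working likelihood is misspecified for negative binomial outcomes, the following sketch compares the sample mean and variance of the two distributions using base R only. It assumes that the dispersion parameter plays the role of `size` in `rnbinom`; whether glm.generator's `ov` uses the same parameterization is an assumption, and the mean `mu = 3` is an arbitrary choice.

```r
set.seed(1)
n    <- 10000
mu   <- 3   # common mean for both samples (arbitrary, for illustration)
size <- 2   # rnbinom's dispersion parameter; assumed analogous to ov

y.pois <- rpois(n, lambda = mu)
y.nb   <- rnbinom(n, size = size, mu = mu)

# Poisson: the variance equals the mean.
# Negative binomial (mu/size form): variance = mu + mu^2 / size.
c(mean = mean(y.pois), var = var(y.pois))
c(mean = mean(y.nb), var = var(y.nb), theoretical.var = mu + mu^2 / size)
```

The negative binomial sample has a variance well above its mean, which a Poisson likelihood cannot represent. Likelihood-based criteria computed under the Poisson assumption therefore rest on a wrong variance model, whereas a distribution-free criterion does not require one.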
Since ELCIC is distribution free, it is well suited to statistical models in the semi-parametric framework. This section evaluates how ELCIC performs in the joint selection of marginal mean and correlation structures in longitudinal data analysis. We assume the data are complete or missing completely at random. The following packages are used to generate different types of outcomes.
library(PoisNor)
library(bindata)
library(geepack)
In the following evaluation, we consider the generalized estimating equations (GEE) framework, a semi-parametric framework widely used in longitudinal data analysis. We jointly select the marginal mean and correlation structures. Note that correctly identifying the working correlation structure improves estimation efficiency for the mean structure. We provide a function gee.generator to generate responses and covariates. This function supports several outcome distributions, including Gaussian, Poisson, and Binomial, and allows both time-dependent and time-independent covariates. We compare ELCIC with another widely adopted information criterion, QIC.
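For readers less familiar with working correlation structures, the sketch below builds the three candidate structures used in this section (independence, exchangeable, and AR(1)) as explicit matrices for three time points. The helper `make.cor` is hypothetical and written only for this illustration; it is not part of ELCIC or geepack.

```r
# Hypothetical helper: build a time x time working correlation matrix.
make.cor <- function(structure, time, rho) {
  switch(structure,
    independence = diag(time),                       # no within-subject correlation
    exchangeable = {
      R <- matrix(rho, time, time)                   # same correlation for every pair
      diag(R) <- 1
      R
    },
    ar1 = rho^abs(outer(1:time, 1:time, "-")),       # correlation decays with time lag
    stop("unknown structure: ", structure)
  )
}

structures <- c("independence", "exchangeable", "ar1")
cor.list <- lapply(structures, make.cor, time = 3, rho = 0.4)
names(cor.list) <- structures
cor.list
```

With `rho = 0.4`, the exchangeable matrix has 0.4 in every off-diagonal cell, while the AR(1) matrix has 0.4 at lag one and 0.16 at lag two. Joint selection amounts to choosing one of these structures together with a set of mean covariates.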
set.seed(28582)
inter.total=20
samplesize<-300
time=3
dist="poisson"
candidate.sets<-list(c(1,2),c(1,3),c(1,4),c(1,2,3),c(1,2,4),c(1,3,4),c(1,2,3,4))
candidate.cor.sets<-c("independence","exchangeable","ar1")
count.matrix.elcic<-count.matrix.qic<-matrix(0,nrow=length(candidate.cor.sets),ncol=length(candidate.sets))
rownames(count.matrix.elcic)<-rownames(count.matrix.qic)<-candidate.cor.sets
colnames(count.matrix.elcic)<-colnames(count.matrix.qic)<-candidate.sets
for(iter in 1:inter.total)
{
  #generate data
  data.corpos<-gee.generator(beta=c(-1,1,0.5,0),samplesize=samplesize,time=time,num.time.dep=2,num.time.indep=1,rho=0.4,x.rho=0.2,dist="poisson",cor.str="exchangeable",x.cor.str="exchangeable")
  y<-data.corpos$y
  x<-data.corpos$x
  id<-data.corpos$id
  r<-rep(1,samplesize*time)
  criterion.elcic<-ELCIC.gee(x=x,y=y,r=r,id=id,time=time,candidate.sets=candidate.sets,name.var=NULL,dist="poisson",candidate.cor.sets=candidate.cor.sets)
  criterion.qic<-QICc.gee(x=x,y=y,id=id,dist=dist,candidate.sets=candidate.sets, name.var=NULL, candidate.cor.sets=candidate.cor.sets)
  index.used.elcic<-which(criterion.elcic==min(criterion.elcic),arr.ind = TRUE)
  count.matrix.elcic[index.used.elcic]<-count.matrix.elcic[index.used.elcic]+1
  index.used.qic.row<-which(candidate.cor.sets==rownames(criterion.qic))
  index.used.qic.col<-which.min(criterion.qic)
  count.matrix.qic[index.used.qic.row,index.used.qic.col]<-count.matrix.qic[index.used.qic.row,index.used.qic.col]+1
  print(iter)
}
count.matrix.elcic/inter.total
count.matrix.qic/inter.total
The results above show that ELCIC has much higher power than QIC in joint selection, especially in selecting the marginal mean structure.
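Joint selection in the loop above picks the single cell with the smallest criterion value over a grid of correlation structures (rows) and mean structures (columns). A minimal sketch of that lookup with `which(..., arr.ind = TRUE)` follows; the criterion values and the column labels `mean1` to `mean3` are made up for illustration and are not output from ELCIC.gee.

```r
# Toy grid: 3 correlation structures (rows) x 3 hypothetical mean models (columns).
# All values are invented for illustration.
criterion.toy <- matrix(c(5.2, 3.9, 4.4,
                          4.8, 2.1, 3.0,
                          5.0, 2.7, 3.5),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("independence", "exchangeable", "ar1"),
                                        c("mean1", "mean2", "mean3")))

# arr.ind = TRUE returns the (row, col) position of the minimum
# instead of a single linear index.
best <- which(criterion.toy == min(criterion.toy), arr.ind = TRUE)
best

# Translate the position back into the selected structures.
rownames(criterion.toy)[best[1, "row"]]
colnames(criterion.toy)[best[1, "col"]]
```

Because the result is already a (row, col) matrix, it can be used directly as a matrix index to increment the matching cell of a count matrix, which is how the ELCIC tallies are accumulated in the simulation.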
The tutorial on model selection in longitudinal data with missingness is coming soon.