Getting started with EICMs

The model

EICMs are an extension of binomial Generalized Linear Models for joint modelling of species communities, that incorporate both the effects of species biotic interactions and the effects of missing covariates. The model is a system of simultaneous equations, where species interactions are modelled explicitly as direct effects of each species on each of the others, and are estimated alongside the effects of missing covariates, modelled as latent factors. The eicm package includes a Penalized Maximum Likelihood fitting function, and a Genetic Algorithm for selecting the most parsimonious species interaction network topology. There are methods for computing confidence intervals and for plotting results.

Basic fitting workflow

Select interaction network topology

Fitting and selecting the species interaction network topology is pretty simple. Assuming all you have is a presence/absence data matrix in occurrences, you just have to run (assuming you trust the defaults):

# fit & select an EICM with 2 latent variables
m <- eicm(occurrences, n.latent=2)

# display estimated coefficients (note they are organized in matrices)
coef(m$selected.model)

This will end up with a model where the species interaction network topology has been selected in respect to a parsimony criterion, and the respective coefficients have been estimated. We started with the most complex (and realistic) scenario:

  • all environmental predictors are unmeasured, so we should/must estimate latent variables;
  • we don't have clues of possible species interactions, so we assume all interactions are possible (the default).

If, however, you need more control on model hyperparameters (and you will need), these are the most important arguments to define:

m <- eicm(occurrences, n.latent=2, regularization=c(6, 0.5), penalty=4, theta.threshold=0.5)

Some explanation on the arguments (but see ?eicm):

  • occurrences: the occurrence data matrix (binary, for the time being)
  • n.latent: how many latent variables we want to estimate
  • regularization: the ridge regularization lambdas for penalized ML fitting. The first element applies to environmental coefficients and latents, the second element to species interaction coefficients
  • penalty: the penalty applied to the number of species interactions to include (multiplier), during network selection
  • theta.threshold: the threshold to exclude species interactions a priori. Species interactions whose preliminary coefficient (in absolute value) is lower than theta.threshold are excluded from the network selection stage

Note that, by default, this function will run in all available CPU cores, which greatly speeds up the network selection stage. Set n.cores, if otherwise desired.

Excluding unlikely interactions

Selecting the network topology for communities with 32 species or less, assuming that all interactions are possible (\(32*31/2=496\)), takes usually less than a day. For larger communities, hence, it becomes important to have a priori exclusion criteria for impossible or unlikely interactions. There are two ways of doing this, either with formulas or by providing a square binary matrix specifying which interactions to include (1) or exclude (0).

Formula syntax

# excluding interactions with the formula syntax
m <- eicm(occurrences, forbidden=list(
    sp3 ~ sp4 + sp5,        # sp3 must not be affected by sp4 nor sp5
    sp4 ~ .,                # sp4 must not be affected by any other
    sp1 ~ . - sp8           # sp1 must not be affected by any other except sp8
    ))

# display species interaction coefficients
# note the zeroed coefficients are those that were excluded
coef(m$fitted.model)$sp

Matrix syntax

The non-zero elements of the square binary species x species matrix are read as follows: species A (column) affects species B (row).

# Excluding interactions with the matrix syntax

# create a square matrix species x species, all zeroes
mask <- matrix(0, nrow=ncol(occurrences), ncol=ncol(occurrences))

# set to 1 those interactions we want to include
mask[4, 2] <- 1     # species #2 may affect species #4
mask[6, 1] <- 1     # species #1 may affect species #6
mask[, 7] <- 1      # species #7 may affect any other
mask[2, ] <- 1      # species #2 may be affected by any other

m <- eicm(occurrences, mask.sp=mask)

# display species interaction coefficients
# note the zeroed coefficients are those that were excluded
coef(m$fitted.model)$sp

You can also easily exclude interactions which depart from rare species, because they will be very difficult to detect anyway, so excluding them from the start may compensate the risk. Note that this does not exclude the possibility that the rare species are affected by others, only that rare species do not affect others.

# exclude interactions departing from species with 20 or less occurrences
m <- eicm(occurrences, mask.sp=mask, exclude.prevalence=20)

Fitting without selecting network

If you have already a predefined interaction network topology and only want to estimate the respective coefficients, you may skip the selection stage (which is the most time-consuming task). We illustrate this with 2 latent variables:

# Load the included true model (32 species, 30 interactions, 2 environmental predictors)
data(truemodel)

# realize the model
occurrences <- predict(truemodel, nrepetitions=1)

# Pre-define a network topology
# For illustrative purposes, let's assume we know the true network topology so we build
#  the mask from the true model coefficients:
# 0: don't estimate interaction
# 1: estimate interaction
mask <- ifelse(truemodel$model$sp == 0, 0, 1)

# And now we estimate their values
# here we discard all the predictors - note that we don't provide any predictor matrix,
# only the presence/absence data.
m <- eicm(occurrences, n.latent=2, mask.sp=mask, do.selection=FALSE,
    regularization=c(1, 0.1))

-> running time of the above code chunk: 1.6 minutes

We can readily plot the estimated species interaction network.

# Plot estimated species interaction network
plot(m, type="network")
Estimated interaction network with a given topology. Blue: positive interactions; red: negative interactions. Line thickness is proportional to interaction strength.

Estimated interaction network with a given topology. Blue: positive interactions; red: negative interactions. Line thickness is proportional to interaction strength.

The package also provides a plotting function that allows to visually compare the estimated model coefficients and network topology with those of the true model, when data is simulated.

# Plot the network topology comparison between true and estimated models
# (plot omitted from vignette)
plot(m, true.model=truemodel, type="network")

# Plot estimated versus true coefficients
# Note that the environmental coefficients here relate to estimated latent variables,
# not the true predictors.
plot(m, true.model=truemodel, nenv.to.plot=2, legend=FALSE)
Estimated versus true coefficients of a fitted EICM model over simulated data, with the true environmental predictors dropped, and estimated as latent variables.

Estimated versus true coefficients of a fitted EICM model over simulated data, with the true environmental predictors dropped, and estimated as latent variables.

# display estimated species interaction coefficients (head only)
# the zeroed coefficients are those that were excluded a priori
head(round(coef(m$fitted.model)$sp, 3))
##       sp001 sp002 sp003 sp004 sp005 sp006 sp007 sp008 sp009 sp010  sp011 sp012
## sp001     0     0     0     0     0     0     0     0     0     0  0.000     0
## sp002     0     0     0     0     0     0     0     0     0     0  0.000     0
## sp003     0     0     0     0     0     0     0     0     0     0  0.000     0
## sp004     0     0     0     0     0     0     0     0     0     0 -2.261     0
## sp005     0     0     0     0     0     0     0     0     0     0  0.000     0
## sp006     0     0     0     0     0     0     0     0     0     0  0.000     0
##       sp013 sp014 sp015 sp016 sp017 sp018 sp019 sp020 sp021 sp022 sp023 sp024
## sp001     0     0     0     0     0     0     0 1.693     0     0     0     0
## sp002     0     0     0     0     0     0     0 0.000     0     0     0     0
## sp003     0     0     0     0     0     0     0 0.000     0     0     0     0
## sp004     0     0     0     0     0     0     0 0.000     0     0     0     0
## sp005     0     0     0     0     0     0     0 0.000     0     0     0     0
## sp006     0     0     0     0     0     0     0 0.000     0     0     0     0
##       sp025 sp026  sp027 sp028 sp029 sp030  sp031 sp032
## sp001 0.000     0  0.000     0     0     0  0.000     0
## sp002 1.801     0 -0.169     0     0     0  0.000     0
## sp003 0.000     0  0.000     0     0     0  0.000     0
## sp004 0.000     0  0.000     0     0     0 -1.698     0
## sp005 0.000     0  0.000     0     0     0  0.000     0
## sp006 0.000     0  0.000     0     0     0  0.000     0

Fitting with NAs in the response

Model fitting, including estimation of latent variables, is straightforward also in the presence of NAs anywhere in the occurrence data matrix. No cases or variables are removed or replaced by predictions, instead, model fitting occurs normally, with the only consequence being that estimated coefficients become less accurate. Let's illustrate this by setting one half of the observations to NA.

# Randomly set 1/2 of the occurrence data to NA
# (8000 records out of 16000)
occurrences[sample(length(occurrences), 8000)] <- NA

# Fit the model
m <- eicm(occurrences, n.latent=2, mask.sp=mask, do.selection=FALSE,
    regularization=c(1, 0.1))

plot(m, true.model=truemodel, nenv.to.plot=2, legend=FALSE)
Estimated versus true coefficients of a fitted EICM model over simulated data in the presence of a large amount of NAs in the response matrix. Note that the true environmental predictors were dropped, and estimated as latent variables, and that their estimation is accurate even in the presence of a large amount of NAs in the occurrence data.

Estimated versus true coefficients of a fitted EICM model over simulated data in the presence of a large amount of NAs in the response matrix. Note that the true environmental predictors were dropped, and estimated as latent variables, and that their estimation is accurate even in the presence of a large amount of NAs in the occurrence data.

-> running time of the above code chunk: 1.2 minutes

It is noteworthy that even when all the environmental predictors were dropped and estimated as latent variables, there is a very good match between estimated and true environmental coefficients even in the presence of a large amount of NAs in the occurrence data (one half of the data was set to NA in this example). The major accuracy drop happens in the estimated interaction coefficients.

Note that, however, there must always be 0s and 1s in the data in a per-species basis, irrespectively of the amount of NAs. Species which only have presences and NAs, cannot have their coefficients estimated, for "obvious" statistical reasons (flat likelihood).

Confidence intervals

Confidence intervals are computed by penalized profile likelihood, with confint. Also note that confidence intervals should not be computed on a model whose terms have been selected.

# Confidence intervals are appended to the model object.
m <- confint(m$fitted.model)
plot(m)     # by default, plots CIs

It may take some time with many parameters, so be sure to do it in multicore (the default).

Prediction

Any model whose species interaction network is a Directed Acyclic Graph can be used for prediction. Note that a selected model may not be a DAG. For reference, let's plot the species interaction network that will be used for prediction.

# load the included parameterized model
data(truemodel)

# for reference, plot the species interaction network
plot(truemodel, type="network")

Unconditional predictions

Unconditional predictions are those that are made only with environmental predictors. If the model was fitted with no regularization, they should be numerically equal to binomial GLM predictions.

# Unconditional predictions
# let's fix environmental predictors at 0, for simplicity.
predict(truemodel, newdata=cbind(env01=0, env02=0))
##      sp001  sp002  sp003  sp004  sp005  sp006  sp007  sp008  sp009  sp010
## [1,] 0.774 0.0306 0.0572 0.4054 0.0259 0.1468 0.0823 0.0781 0.0088 0.3909
##       sp011  sp012  sp013  sp014  sp015  sp016  sp017  sp018 sp019 sp020  sp021
## [1,] 0.6261 0.1067 0.4449 0.2088 0.0463 0.3605 0.8078 0.4079 0.555 0.243 0.0907
##       sp022  sp023  sp024  sp025  sp026  sp027  sp028  sp029  sp030  sp031
## [1,] 0.1964 0.5905 0.0565 0.2592 0.8625 0.0949 0.0702 0.3204 0.6615 0.2058
##      sp032
## [1,] 0.093

Conditional predictions

Conditional predictions incorporate the effects of other species that may have been observed or proven absent. To provide data on observed species, add columns to the new data matrix with the names of the species for which there is available data. In the current version, NAs cannot be mixed with 0s or 1s, and latents are not estimated in new data (but in a future version both will be possible).

# Conditional predictions
# predict probabilities for all species conditional on the
# known presence of sp011 (compare sp014 and sp004 with the above)
predict(truemodel, newdata=cbind(env01=0, env02=0, sp011=1))
## Excluded species 11
##       sp001  sp002  sp003  sp004  sp005  sp006  sp007  sp008  sp009 sp010 sp011
## [1,] 0.7723 0.0314 0.0563 0.2386 0.0252 0.1464 0.0801 0.0785 0.0083 0.339     1
##      sp012  sp013  sp014  sp015  sp016  sp017  sp018 sp019  sp020  sp021  sp022
## [1,] 0.105 0.4516 0.0538 0.0475 0.3923 0.8107 0.4073 0.556 0.2427 0.0932 0.1926
##       sp023  sp024 sp025  sp026  sp027  sp028  sp029  sp030  sp031  sp032
## [1,] 0.5903 0.0574 0.253 0.8666 0.0914 0.0664 0.3212 0.6679 0.2046 0.0899

Indirect effects

Indirect effects of species are propagated through the interaction network. This illustrates it:

# Propagation of indirect species effects
# predict probabilities for all species conditional on the known 
# absence (first line) and known presence (second line) of sp005 and sp023
newdata <- cbind(env01=0, env02=0, sp012=c(0, 1), sp018=c(0, 1))
newdata
##      env01 env02 sp012 sp018
## [1,]     0     0     0     0
## [2,]     0     0     1     1
predict(truemodel, newdata=newdata, nrep=100000)
## Excluded species 12
## Excluded species 18
##        sp001   sp002   sp003   sp004   sp005   sp006   sp007   sp008   sp009
## [1,] 0.77262 0.02899 0.05396 0.40476 0.02502 0.14038 0.08110 0.06911 0.00812
## [2,] 0.77584 0.03003 0.05664 0.40617 0.02570 0.14024 0.08058 0.14566 0.00828
##        sp010   sp011 sp012   sp013   sp014   sp015   sp016   sp017 sp018
## [1,] 0.39308 0.62821     0 0.49569 0.20626 0.04402 0.36151 0.81396     0
## [2,] 0.39337 0.62439     1 0.21377 0.20718 0.04296 0.36176 0.81172     1
##        sp019   sp020   sp021   sp022   sp023   sp024   sp025   sp026   sp027
## [1,] 0.55424 0.24388 0.09287 0.17097 0.59518 0.05296 0.25588 0.96151 0.09423
## [2,] 0.55613 0.24728 0.09269 0.42868 0.59247 0.05267 0.25751 0.27325 0.09308
##        sp028   sp029   sp030   sp031   sp032
## [1,] 0.06650 0.32088 0.66466 0.19945 0.08808
## [2,] 0.06673 0.36258 0.66269 0.20025 0.10449

Notice the effects on sp026 and the effect propagation to those species which depend on it (sp013, sp008). Also compare with unconditional predictions.

Simulating species communities

Simple basic simulation

For testing purposes, you may need to simulate communities with known parameters. We first simulate some occurrence data using a naive model where we set the parameters manually, to illustrate the basics. The core function is as.eicm, which constructs a model from given coefficients.

nenv <- 2           # 2 environmental predictors
nsp <- 20           # 20 species
nsamples <- 400     # 400 samples

# random responses to environment
env.coefs <- matrix(runif((nenv + 1) * nsp, -4, 4), nrow=nsp)

# let's define some true species interactions
sp.coefs <- matrix(0, nrow=nsp, ncol=nsp)
sp.coefs[3, 5] <- 2  # this reads: species #5 affects species #3 with a "strength" of 2
sp.coefs[4, 1] <- 3
sp.coefs[1, 2] <- -2

# random environmental predictors
env <- matrix(rnorm(nenv * nsamples), ncol=nenv, nrow=nsamples)

# define the true model
true <- as.eicm(env=env, env.coefs=env.coefs, sp.coefs=sp.coefs)

# realize the model
occurrences <- predict(true, nrepetitions=1)

# we now have a plain presence/absence data matrix that follows from the true model

Simulating realistic communities

However, in real communities, most species are rare. That is to say, the frequency distribution of species in samples approximately follows a Beta distribution. Simulating realistic communities then requires that this constraint is taken into account. eicm provides a function for this purpose.

# Generate a model that, when realized, shows a frequency distribution that
# follows a Beta distribution, yet accounting for all the desired effects
# (including species interactions)

truemodel <- generateEICM(nspecies=32, nsamples=500, nenv=2,
    ninteractions=30, shape1=1.5, shape2=3)

Now plot some realizations.

# plot frequency histogram of 4 realizations of truemodel:
# should approx. follow a Beta distribution.
par(mfrow=c(2, 2))
for(i in 1:4) {
    occurrences <- predict(truemodel, nrepetitions=1)
    spcounts <- apply(occurrences, 2, sum)

    hist(spcounts / nrow(occurrences), breaks=seq(0, 1, by=0.1), xlim=c(0, 1),
        main="Frequency distribution of one realization", xlab=
        "Frequency in samples",ylab="Number of species")
}
Frequency disrtibutions of four realizations of the same model, generated with generateEICM(...)

Frequency disrtibutions of four realizations of the same model, generated with generateEICM(...)


  1. CIBIO/InBio – Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus de Vairão, 4485-661 Vairão, Portugal; Instituto Superior de Agronomia, Universidade de Lisboa, Tapada da Ajuda, 1349-017 Lisboa, Portugal