The mosquito occurrence data from Golding et al. (2015), published in Parasites & Vectors and available from figshare, is useful for exploring how datasets need to be prepared for running Conditional Random Fields (CRF) models. Here, we download the raw data from figshare (note that an internet connection is needed for this step).
# Download the raw CSV to a temporary file, read it in, then delete the file
temp <- tempfile()
download.file('https://ndownloader.figshare.com/files/2075362', temp)
dataset <- read.csv(temp, as.is = TRUE)
unlink(temp)
We can now convert the categorical dipping_round and field_site variables to factors and remove some unneeded columns.
dataset$dipping_round <- as.factor(dataset$dipping_round)
dataset$field_site <- as.factor(dataset$field_site)
dataset[,c(1,2,5,6)] <- NULL
It is important here to examine the level names of factor variables, as the first level (i.e. the reference level) will be dropped from the dataset during conversion to model matrix format (as in standard lme4 analysis of factor covariates).
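As a minimal sketch of this check, using a small made-up factor rather than the downloaded data (the object names `toy_round` and `toy_round_releveled` are hypothetical), `levels()` lists the levels in order and `relevel()` changes which level comes first:

```r
# Toy factor standing in for dataset$dipping_round (values are invented)
toy_round <- factor(c("2", "3", "5", "6", "3", "2"))

# The first level listed is the one that will act as the reference level
levels(toy_round)

# relevel() moves a chosen level to the front, making it the new reference
toy_round_releveled <- relevel(toy_round, ref = "5")
levels(toy_round_releveled)
```

If a different reference level is preferred, releveling before the model matrix conversion is one way to control which level gets dropped.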
The next step is to convert any factor variables into model matrix format. As mentioned above, this step drops the first level of a factor and creates a column for each remaining level (i.e. dipping_round levels "3", "5" and "6" are each assigned their own column, while dipping_round level "2" is dropped and treated as the reference level). It is also convenient to change the names of the new covariate columns so they are easier to view and interpret (done here using dplyr::rename_all).
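library(dplyr)
# Append indicator columns for each factor ([,-1] drops the intercept column),
# remove the original factor columns, then tidy the generated column names
analysis.data <- dataset %>%
  cbind(., data.frame(model.matrix(~.[,'field_site'], .)[,-1])) %>%
  cbind(., data.frame(model.matrix(~.[,'dipping_round'], .)[,-1])) %>%
  dplyr::select(-field_site, -dipping_round) %>%
  # a formula lambda replaces the deprecated funs() helper
  dplyr::rename_all(~ gsub("\\.|model.matrix", "", .))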
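To make the effect of the conversion easier to see, here is a small sketch on an invented data frame (the object `toy` and its `depth` column are assumptions for illustration, not part of the mosquito dataset). It uses base R only and shows that the reference level receives no column of its own:

```r
# Toy data frame (invented values) with one factor and one numeric column
toy <- data.frame(
  dipping_round = factor(c("2", "3", "5", "6")),
  depth = c(1.2, 0.4, 2.1, 0.9)
)

# model.matrix() builds an intercept plus one 0/1 column per non-reference
# level; dropping column 1 removes the intercept
dummies <- model.matrix(~ dipping_round, data = toy)[, -1]
colnames(dummies)

# Bind the indicator columns to the remaining covariates
toy_expanded <- cbind(toy["depth"], dummies)
```

Rows belonging to the reference level ("2" here) are coded as zero in every indicator column.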
Finally, we need to convert species abundances to binary presence-absence format (as we are only estimating co-occurrences, not co-abundances). It is also highly advisable to scale any continuous variables so they all have mean = 0 and sd = 1.
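One possible way to carry out both steps is sketched below on invented data. The column names `species_a`, `species_b` and `depth` are hypothetical placeholders; with the real data, substitute the species and continuous covariate columns of analysis.data.

```r
# Toy data (invented): two species count columns and one continuous covariate
toy <- data.frame(
  species_a = c(0, 3, 12, 0),
  species_b = c(1, 0, 0, 7),
  depth = c(1.2, 0.4, 2.1, 0.9)
)

species_cols <- c("species_a", "species_b")  # hypothetical column names
continuous_cols <- "depth"                   # hypothetical column name

# Any count above zero becomes a presence (1); zeros stay as absences (0)
toy[species_cols] <- lapply(toy[species_cols],
                            function(x) as.numeric(x > 0))

# Centre and scale each continuous covariate to mean 0 and sd 1
toy[continuous_cols] <- lapply(toy[continuous_cols],
                               function(x) as.vector(scale(x)))
```

Wrapping `scale()` in `as.vector()` keeps each column a plain numeric vector rather than a one-column matrix with attributes.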