Introduction

Interpretation of multi-omics data is very challenging. The IntLIM package aims to integrate multiple types of omics data. Unlike other approaches, IntLIM is focused on understanding how specific multi-omic associations are affected by phenotypic features. To this end, we develop a linear modeling approach that describes how multi-omic associations are affected by phenotype. More information can be found on our publication “IntLIM: integration using linear models of metabolomics and gene expression data”. The workflow involves the following steps: 1) Input multi-omic data files. 2) Filter data sets by analyte abundance and imputed values. 3) Run the linear model to extract pairwise interaction significance. 4) Filter results by p-values, interaction coefficient percentiles, and r-squared values. 5) Plot/visualize specific analyte associations.

Installation

IntLIM is an R package and can be run on version >= 3.2.0.

The function install_github() from the “devtools” package (Wickham and Chang, 2015) installs IntLIM directly from GitHub. To install IntLIM, enter the following:

if(!require("devtools")){
  install.packages("devtools")
}
library("devtools")
install_github("ncats/IntLIM")

IntLIM can be loaded using the library function

library(IntLIM)

Inputting Analyte Level Data

The first step is importing in the multi-omic data. To this end we have provided a sample data file for users wishing to use IntLIM.

IntLIM requires a specific format for multi-omic data. For IntLIM, we require the data corresponding to two analyte types, optional metadata for each of these types, and sample meta data. We also need a CSV meta-file that lists the location of the other files. These need to be in the same folder. The formats are described below. In addition, we provide a sample set of files.

Please be sure that all files noted in the CSV file, including the CSV file, are in the same folder. Do not include path names in the filenames.

Users need to input a CSV file with two required columns: ‘type’ and ‘filenames’.

The CSV file is expected to have the following 2 columns and 6 rows:

  1. type,filenames
  2. analyteType1,myfilename (optional if analyteType2 is provided)
  3. analyteType2,myfilename (optional if analyteType1 is provided)
  4. analyteType1MetaData,myfilename (optional)
  5. analyteType2MetaData,myfilename (optional)
  6. sampleMetaData,myfilename”

The data and meta-data is stored in a series of comma-separated-values (.CSV) files. The 5 files consist of data for two analyte types, metadata for two analyte types, and sample meta data. A meta-file lists the location of the other 5 files. This meta-file is input into IntLIM.

Please be sure to normalize your data appropriately before inputting it into IntLIM.

Input data files should be in a specific format:

File type Description
analyteType1 rows are analytes from type 1 (e.g. metabolites), columns are samples
analytetype2 rows are analytes from type 2 (e.g. genes), columns are samples
analyteType1MetaData rows are analytes, features are columns
analyteType2MetaData rows are analytes, features are columns
sampleMetaData rows are samples, features are columns

For the analyte data files, the first row contains the feature IDs and the first column contains the sample IDs.

For the sampleMetaData, the first column of the sampleMetaData file is assumed to be the sample ID, and those sample IDs should match the first row of analyte data (e.g. it is required that all sample IDs in the analyte data are also in the sampleMetaDatafile).

Additionally, the analyte data files and SampleMetaData need to contain an ‘id’ column that contains the name of the features (analytes) or sample (sample id, name, etc).

A small data set is embedded in a package. This consists of the National Cancer Institute-60 (NCI-60) cell line data with a reduced number of genes for faster calculation. To access it use the following commands. csvfile describes the location of the meta-file describing the location of the other 5 input files.

dir <- system.file("extdata", package="IntLIM", mustWork=TRUE)
csvfile <- file.path(dir, "NCItestinput.csv")
csvfile
## [1] "/tmp/RtmpUXz1do/Rinst1a2c522c51a23/IntLIM/extdata/NCItestinput.csv"

Through the ReadData() function, users input the above .csv files containing analyte data and metadata. Note: It is assumed that all data has been pre-scaled and preprocessed The input data, sample data, and metadata are read into an IntLimData object, which will be used for all further analysis.

inputData <- IntLIM::ReadData(inputFile = csvfile,analyteType1id='id',analyteType2id='id',
                              class.feat = list(PBO_vs_Leukemia = "factor",
                                                drugscore = "numeric"))
## IntLIMData created

The ShowStats() function allows the user to summarize the data (how many total analytes of each types and samples in each data-set as well as samples in common). We have a data-set that consists of 20 cell lines (samples), 1448 genes, and 257 metabolites.

IntLIM::ShowStats(IntLimObject = inputData)
##   Num_Analytes_Type1 Num_Analytes_Type2 Num_Samples
## 1                257               1448          20

Filtering and Observing Data

Optionally, the FilterData() function allows the user to filter out the features (analytes) based on their mean values. Users should input a percentile cutoff and any feature with a mean value below that cutoff will be removed. Furthermore we can filter out analytes by percentage of missing or imputed values. For the analysis, we filter out the genes with the lowest 10% of gene expression and metabolites with more than 80% imputed values. This is done as below. We henceforth have 1303 genes and 212 metabolites for 20 cell line samples.

inputDatafilt <- IntLIM::FilterData(inputData,
                                    analyteType1perc = 0.10, analyteType2perc = 0.10,
                                    analyteMiss = 0.80)

The PlotDistribution() function allows users to produce a boxplot of the distribution of analyte level data. This is done as below.

IntLIM::PlotDistributions(inputData = inputDatafilt)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Prior to running the model, the user can also perform a principal component analysis of the analyte data using the PlotPCA() function. The stype command allows the user to select a continuous column from the sample meta data that color-codes the samples For the sample set, we select drugscore, a continuous metric that measures the overall drug responsiveness of the cell line. We additionally run a separate plot that color-codes by cancer cell type (Prostate-Breast-Ovarian or PBO vs. Leukemia).

IntLIM::PlotPCA(inputData = inputDatafilt,stype = "drugscore")
IntLIM::PlotPCA(inputData = inputDatafilt,stype = "PBO_vs_Leukemia")
## Warning in 1:uniqtypes: numerical expression has 2 elements: only the first used

Run IntLIM

The linear models are run by the RunIntLim() function. The stype command allows the user to select a column from the sample meta data for the continuous or discrete phenotype of interest. The resulting object from the analysis is an IntLimResults object containing slotsfor un-adjusted and False Discovery Rate (FDR)-adjusted p-values for the “(g:p)” interaction coefficient. A significant FDR-adjusted p-value implies that the gene-metabolite association is dependent on the phenotype. The RunIntLim() function is based heavily on the MultiLinearModel functions developed for the ClassComparison package part of oompa.

The following variations of this function are available: - Setting save.covar.pvals to TRUE returns the p-values and coefficients of all terms in the model (not only the interaction term p-value and coefficient). - Setting remove.duplicates to TRUE filters analyte association models for the same type of output and independent variable type to include only the model with the highest p-value. e.g. if m1 ~ m2 has a higher p-value than m2 ~ m1, then the m2 ~ m1 model will be returned. By default, only self-associations (i.e. m1 ~ m1) are removed. - Setting continuous to TRUE signals to the function that a continuous phenotype is being used.

Here, we show examples for a continuous output (drug score) and a discrete output (cancer subtype).

myres.drugscore <- IntLIM::RunIntLim(inputData = inputDatafilt,stype="drugscore",
                              outcome = 1,
                              independent.var.type = 2,
                              save.covar.pvals = TRUE, continuous = TRUE)
## Running the analysis on continuous data with range -1.092312217 0.816074095
## 10 % complete
## 20 % complete
## 30 % complete
## 40 % complete
## 50 % complete
## 60 % complete
## 70 % complete
## 80 % complete
## 90 % complete
## 100 % complete
myres.cancertype <- IntLIM::RunIntLim(inputData = inputDatafilt,stype="PBO_vs_Leukemia",
                              outcome = 1,
                              independent.var.type = 2,
                              save.covar.pvals = TRUE)
## Running the analysis on discrete data with group sizes 6 14
## 10 % complete
## 20 % complete
## 30 % complete
## 40 % complete
## 50 % complete
## 60 % complete
## 70 % complete
## 80 % complete
## 90 % complete
## 100 % complete

The DistPValues() function allows the user to observe a histogram of the p-values. By default, FDR-adjusted p-values are shown. However, users can also choose to plot nominal p-values.

IntLIM::DistPvalues(IntLimResults = myres.drugscore, adjusted = FALSE)

IntLIM::DistPvalues(IntLimResults = myres.cancertype, adjusted = FALSE)

The DistRSquared() function allows the user to observe a histogram of the r-squared values.

IntLIM::DistRSquared(IntLimResults = myres.drugscore)

IntLIM::DistRSquared(IntLimResults = myres.cancertype)

The PValueBoxPlots() function allows the user to observe the p-value distributions of all covariates in the linear models prior to adjustment. Note: To view these values, you must set save.covar.pvals to TRUE when running RunIntLim.

IntLIM::PValueBoxPlots(IntLimResults = myres.drugscore)

IntLIM::PValueBoxPlots(IntLimResults = myres.cancertype)

The InteractionCoefficientGraph() function allows the user to observe the interaction coefficients of all models, sorted in ascending order.

IntLIM::InteractionCoefficientGraph(inputResults = myres.drugscore,
                                    interactionCoeffPercentile = 0.9)

IntLIM::InteractionCoefficientGraph(inputResults = myres.cancertype,
                                    interactionCoeffPercentile = 0.9)

The pvalCoefVolcano() function allows users to observe a volcano plot comparing the interaction coefficient to the –log10(FDR-adjusted p-value).

IntLIM::pvalCoefVolcano(inputResults = myres.drugscore, inputData = inputDatafilt,
                        pvalcutoff = 0.05)
## 289266 pairs found given cutoffs

IntLIM::pvalCoefVolcano(inputResults = myres.cancertype, inputData = inputDatafilt,
                        pvalcutoff = 0.05)
## 289266 pairs found given cutoffs

Filter Results

The ProcessResults() function allows the user to filter the results by FDR p-values (default set at 0.05) and by the R^2 and interaction coefficient percentile.

Here, we find 851 significant drug score dependent pairs and 5,293 cancer type specific pairs.

myres.sig.drugscore <- IntLIM::ProcessResults(inputResults = myres.drugscore,
                                       inputData = inputDatafilt,
                                       pvalcutoff = 0.10,
                                       rsquaredCutoff = 0.2,
                                       interactionCoeffPercentile = 0.5)
## 851 pairs found given cutoffs
myres.sig.cancertype <- IntLIM::ProcessResults(inputResults = myres.cancertype,
                                       inputData = inputDatafilt,
                                       pvalcutoff = 0.10,
                                       rsquaredCutoff = 0.2,
                                       interactionCoeffPercentile = 0.5)
## 5293 pairs found given cutoffs

The PlotPair() function operates differently for discrete and continuous phenotypes. For continuous phenotypes, it allows the user to observe the marginal effects of phenotype for a given pair of analytes. An example is shown below for DLG4 vs. (p-Hydroxyphenyl)lactic acid with respect to drug score.

IntLIM::PlotPair(inputData = inputDatafilt,
                 inputResults = myres.drugscore,
                 outcome = 1,
                 independentVariable = 2,
                 outcomeAnalyteOfInterest = "(p-Hydroxyphenyl)lactic acid",
                 independentAnalyteOfInterest = "DLG4")

## 
## Call:  stats::glm(formula = form, data = dataframe)
## 
## Coefficients:
## (Intercept)            g         type       g:type  
##     -3.5135       0.5237       7.3158      -1.8765  
## 
## Degrees of Freedom: 19 Total (i.e. Null);  16 Residual
## Null Deviance:       44.36 
## Residual Deviance: 35.35     AIC: 78.15

For discrete phenotypes, it allows the user to plot a chosen analyte association for selected groups. An example is shown below for DLG4 vs. (p-Hydroxyphenyl)lactic acid with respect to cancer type.

IntLIM::PlotPair(inputData = inputDatafilt,
                 inputResults = myres.cancertype,
                 outcome = 1,
                 independentVariable = 2,
                 outcomeAnalyteOfInterest = "(p-Hydroxyphenyl)lactic acid",
                 independentAnalyteOfInterest = "DLG4")

The HistogramPairs() function plots the count of analyte pairs by either outcome or independent variable analyte.

IntLIM::HistogramPairs(myres.sig.drugscore, type = "outcome")

IntLIM::HistogramPairs(myres.sig.drugscore, type = "independent")

IntLIM::HistogramPairs(myres.sig.cancertype, type = "outcome")

IntLIM::HistogramPairs(myres.sig.cancertype, type = "independent")

Various writing functions are implemented. The OutputData() and OutputResults() function allows the users to output the data and results of the analyses into zipped CSV files.

IntLIM::OutputData(inputData=inputDatafilt,filename=paste(result_dir,
                                                         "FilteredData.zip", sep = "\\"))
IntLIM::OutputResults(inputResults=myres.sig.drugscore,filename=paste(result_dir,
                                                              "MyResultsDrugscore.csv", sep = "\\"))
IntLIM::OutputResults(inputResults=myres.sig.cancertype,filename=paste(result_dir,
                                                              "MyResultsCancertype.csv", sep = "\\"))

Remove the directory.

unlink(result_dir, recursive = TRUE)

Run Cross-Validation

You can also choose to run end-to-end cross-validation. This combines the steps of reading, filtering, running IntLIM, and processing the results for multiple folds. PlotFoldOverlapUpSet creates an UpSet plot of the significant pairs within each fold and common between folds. If the phenotype is continuous (e.g. drug score), set continuous = TRUE.

The example below runs cross-validation where drug score is the phenotype.

crossValResults <- RunCrossValidation(inputData = inputData, analyteType1perc = 0.10,
                   analyteType2perc = 0.10, analyteMiss = 0.80,
                   stype="drugscore", outcome = c(1),
                   independent.var.type = c(2), save.covar.pvals = TRUE,
                   pvalcutoff = 0.10, rsquaredCutoff = 0.2,
                   folds = 4, continuous = TRUE,
                   interactionCoeffPercentile = 0.5)
IntLIM::PlotFoldOverlapUpSet(crossValResults$processed)

The example below runs cross-validation where cancer type is the phenotype.

crossValResults <- RunCrossValidation(inputData = inputData, analyteType1perc = 0.10,
                   analyteType2perc = 0.10, analyteMiss = 0.80,
                   stype="PBO_vs_Leukemia", outcome = c(1),
                   independent.var.type = c(2), save.covar.pvals = TRUE,
                   pvalcutoff = 0.10, rsquaredCutoff = 0.2,
                   folds = 4, interactionCoeffPercentile = 0.5)
IntLIM::PlotFoldOverlapUpSet(crossValResults$processed)

Run Permutation

You can also choose to run end-to-end permutation. This combines the steps of permuting, running IntLIM, and processing results for multiple permutations. PermutationCountSummary returns a summary of the permutation results and (optionally) a violin plot of the number of significant pairs and analytes. PermutationPairSummary returns a bar chart with a horizontal bar for each significant pair. The bar length measures the number of permutations in which the pair is also significant. If the phenotype is continuous (e.g. drug score), set continuous = TRUE.

The example below runs permutation where drug score is the phenotype.

perm.res <- IntLIM::PermuteIntLIM(data = inputDatafilt, stype = "drugscore", outcome = 1,
                      independent.var.type = 2, pvalcutoff = 0.10, interactionCoeffPercentile = 0.5,
                   rsquaredCutoff = 0.2,
                   num.permutations = 5, continuous = TRUE)
countSummary <- IntLIM::PermutationCountSummary(permResults = perm.res, inputResults = myres.sig.drugscore,
                              plot = TRUE)
pairSummary <- IntLIM::PermutationPairSummary(permResults = perm.res, inputResults = myres.sig.drugscore,
                              plot = TRUE)

The example below runs permutation where cancer type is the phenotype.

perm.res <- IntLIM::PermuteIntLIM(data = inputDatafilt, stype = "PBO_vs_Leukemia", outcome = 1,
                      independent.var.type = 2, pvalcutoff = 0.10, interactionCoeffPercentile = 0.5,
                   rsquaredCutoff = 0.2,  num.permutations = 5)
countSummary <- IntLIM::PermutationCountSummary(permResults = perm.res, inputResults = myres.sig.cancertype,
                              plot = TRUE)
pairSummary <- IntLIM::PermutationPairSummary(permResults = perm.res, inputResults = myres.sig.cancertype,
                              plot = TRUE)

References

Siddiqui JK, Baskin E, Liu M, Cantemir-Stone CZ, Zhang B, Bonneville R, McElroy JP, Coombes KR, Mathé EA. IntLIM: integration using linear models of metabolomics and gene expression data. BMC bioinformatics. 2018 Dec;19(1):81.

Sievert, C., et al. plotly: Create Interactive Web Graphics via’plotly. js’. 2016. R package version 3.6. 0. In.

Wickham, H. and Chang, W. devtools: Tools to make developing R code easier. R package version 2015;1(0).