This vignette complements the more basic vignette ‘Getting started with wrProteo’ of this package and shows in more detail how UPS1 spike-in experiments may be analyzed, using this package wrProteo, wrMisc and wrGraph. All these packages are available on CRAN.
Furthermore, the Bioconductor package limma will be used internally for it’s moderated statistical testing.
## This is R code, you can run this to redo all analysis presented here.
## If not already installed, you'll have to install wrMisc and wrProteo first.
install.packages("wrMisc")
install.packages("wrProteo")
## These packages are used for the graphics
install.packages("wrGraph")
install.packages("RColorBrewer")
## Installation of limma from Bioconductor
if(!requireNamespace("BiocManager", quietly=TRUE)) install.packages("BiocManager")
::install("limma")
BiocManager
## You cat also see all vignettes for this package by typing :
browseVignettes("wrProteo") # ... and the select the html output
As you will see in the interactive window, this package has 2 vignettes, a more general introductory vignette and this UPS-1 dedicated vignette.
Now let’s load the packages needed :
## Let's assume this is a fresh R-session
library(knitr)
library(wrMisc)
library(wrGraph)
library(wrProteo)
# Version number for wrProteo :
packageVersion("wrProteo")
#> [1] '1.6.0'
The main aim of the experimental setup using heterologous spike-in experiments is to provide a framework to test identification and quantitation procedures in proteomics.
By mixing known amounts of a collection of human proteins (UPS1) in various concentrations on top of a constant level yeast total protein extract, one expects to find only the spiked human UPS1 proteins varying between samples. In terms of ROC curves (see also ROC on Wikipedia) the spike-in proteins are expected to show up as true positives (TP). In contrast, all yeast proteins were added in the same quantity to same samples and should thus be observed as constant, ie as true negatives (TN) when looking for proteins changing abundance.
The data used in this vignette were published with the article : Ramus et al 2016 “Benchmarking quantitative label-free LC-MS data processing workflows using a complex spiked proteomic standard dataset” in J Proteomics 2016 Jan 30;132:51-62.
This dataset is available on PRIDE as PXD001819 (and on ProteomeXchange).
Briefly, this experiment aims to evaluate and compare various quantification appoaches of the heterologous spike-in UPS1 (available from Sigma-Aldrich) in yeast protein extracts as constant matrix. 9 different concentrations of the heterologous spike-in (UPS1) were run in triplicates. The proteins were initially digested by Trypsin and then analyzed by LC-MS/MS in DDA mode.
The project Proteomics Sample Metadata Format aims to provide a framework of providing a uniform format for documenting experimental meta-data (sdrf-format). The meta-data for experiments already integrated can be directly read/accessed from wrProteo.
Either you download the meta-data as file ‘sdrf.tsv’ from Pride/PXD001819, or you may read it directly from github/bigbio.
## Read meta-data from github.com/bigbio/proteomics-metadata-standard/
<- readSdrf("PXD001819")
pxd001819meta
## The concentration of the UPS1 spike-in proteins in the samples
if(length(pxd001819meta) >0) {
<- sort(unique(as.numeric(sub(" amol;CN.+","",sub(".+;QY=","",pxd001819meta$characteristics.spiked.compound.)))))
UPSconc else UPSconc <- c(50, 125, 250, 500, 2500, 5000, 12500, 25000, 50000) }
## A few elements and functions we'll need lateron
<- c("ProteomeDiscoverer","MaxQuant","Proline")
methNa names(methNa) <- c("PD","MQ","PL")
## The accession numbers for the UPS1 proteins
<- data.frame(ac=c("P00915", "P00918", "P01031", "P69905", "P68871", "P41159", "P02768", "P62988",
UPS1 "P04040", "P00167", "P01133", "P02144", "P15559", "P62937", "Q06830", "P63165",
"P00709", "P06732", "P12081", "P61626", "Q15843", "P02753", "P16083", "P63279",
"P01008", "P61769", "P55957", "O76070", "P08263", "P01344", "P01127", "P10599",
"P99999", "P06396", "P09211", "P01112", "P01579", "P02787", "O00762", "P51965",
"P08758", "P02741", "P05413", "P10145", "P02788", "P10636-8", "P00441", "P01375"),
species=rep("Homo sapiens",48),
name=NA)
# UPS1ac <- c("P00915", "P00918", "P01031", "P69905", "P68871", "P41159", "P02768", "P62988",
# "P04040", "P00167", "P01133", "P02144", "P15559", "P62937", "Q06830", "P63165",
# "P00709", "P06732", "P12081", "P61626", "Q15843", "P02753", "P16083", "P63279",
# "P01008", "P61769", "P55957", "O76070", "P08263", "P01344", "P01127", "P10599",
# "P99999", "P06396", "P09211", "P01112", "P01579", "P02787", "O00762", "P51965",
# "P08758", "P02741", "P05413", "P10145", "P02788", "P10636-8", "P00441", "P01375" )
## additional functions
<- function(x, annCol="SpecType", replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")), silent=TRUE) {
replSpecType ## rename $annot[,"SpecType"] to more specific names
<- annCol[1] %in% colnames(x$annot)
chCol if(chCol) { chCol <- which(colnames(x$annot)==annCol[1])
<- replBy[,1] %in% unique(x$annot[,chCol]) # check items to replace if present
chIt if(any(chIt)) for(i in which(chIt)) {useLi <- which(x$annot[,chCol] %in% replBy[i,1]); cat("useLi",head(useLi),"\n"); x$annot[useLi,chCol] <- replBy[i,2]}
else if(!silent) message(" replSpecType: 'annCol' not found in x$annot !")
}
x }
<- function(mat, ref, refColumn=3:4, matCluNa="cluNo", lev=NULL, ylab=NULL, tit=NULL) {
plotConcHist ## plot histogram like counts of UPS1 concentrations
if(is.null(tit)) tit <- "Frequency of UPS1 Concentrations Appearing in Cluster"
<- unique(mat[,matCluNa])
gr <- ref[,refColumn]
ref if(length(lev) <2) lev <- sort(unique(as.numeric(as.matrix(ref))))
if(length(ylab) !=1) ylab <- "Frequency"
<- table(factor( as.numeric(ref[which(rownames(ref) %in% rownames(mat)),]), levels=lev))
tbl ::barplot(tbl, las=1, beside=TRUE, main=paste(tit,gr), col=grDevices::gray(0.8), ylab=ylab)
graphics
}
<- function(dat, methInd, tit=NULL, useColumn=c("logp","slope","medAbund","startFr"), lineGuide=list(v=c(-12,-10),h=c(0.7,0.75),col="grey"), xlim=NULL,ylim=NULL,subTit=NULL) {
plotMultRegrPar ## scatter plot logp (x) vs slope (y) for all UPS proteins, symbol by useColumn[4], color by hist of useColumn[3]
## dat (array) UPS1 data
## useColumn (character) 1st as 'logp', 2nd as 'slope', 3rd as median abundance, 4th as starting best regression from this point
<- "plotMultRegrPar"
fxNa #fxNa <- wrMisc::.composeCallName(callFrom,newNa="plotMultRegrPar")
if(length(dim(dat)) !=3) stop("invalid input, expecting as 'dat' array with 3 dimensions (proteins,Softw,regrPar)")
if(any(length(methInd) >1, methInd > dim(dat)[2], !is.numeric(methInd))) stop("invalid 'methInd'")
<- useColumn %in% dimnames(dat)[[3]]
chCol if(any(!chCol)) stop("argument 'useColumn' does not fit to 3rd dim dimnames of 'dat'")
<- colorAccording2(dat[,methInd,useColumn[3]], gradTy="rainbow", revCol=TRUE, nEndOmit=14)
useCol ::plot(dat[,methInd,useColumn[1:2]], main=tit, type="n",xlim=xlim,ylim=ylim) #col=1, bg.col=useCol, pch=20+lmPDsum[,"startFr"],
graphics::points(dat[,methInd,useColumn[1:2]], col=1, bg=useCol, pch=20+dat[,methInd,useColumn[4]],)
graphics::legend("topright",paste("best starting from ",1:5), text.col=1, pch=21:25, col=1, pt.bg="white", cex=0.9, xjust=0.5, yjust=0.5)
graphicsif(length(subTit)==1) graphics::mtext(subTit,cex=0.9)
if(is.list(lineGuide) & length(lineGuide) >0) {if(length(lineGuide$v) >0) graphics::abline(v=lineGuide$v,lty=2,col=lineGuide$col)
if(length(lineGuide$h) >0) graphics::abline(h=lineGuide$h,lty=2,col=lineGuide$col)}
<- graphics::hist(dat[,methInd,useColumn[3]], plot=FALSE)
hi1 ::legendHist(sort(dat[,methInd,useColumn[3]]), colRamp=useCol[order(dat[,methInd,useColumn[3]])][cumsum(hi1$counts)],
wrGraphcex=0.5, location="bottomleft", legTit="median raw abundance") #
}
Multiple algorithms and software implementations have been developed for quantitation label-free proteomics experiments, in particular for extracted ion chromatograms (XIC). For background information you may look at Wikipedia labell-free Proteomics. Here, the use of the output for 3 such implementations for extracting peptide/protein quantifications is shown. These 3 software implementations were run individually using equivalent settings, ie identifcation based on the same fasta-database, starting at a single peptide with 1% FDR, MS mass tolerance for ion precursors at 0.7 ppm, oxidation of methionins and N-terminal acetylation as fixed as well as carbamidomethylation of cysteins as variable modifications.
Since in this context it is crucial to recognize all UPS1 proteins as such, the import-functions make use of the specPref argument, allowing to define custom tags. Most additional arguments to the various import-functions have been kept common for conventient use and for generating output structured the same way. Indeed, simply separating proteins by their species origin is not sufficient since common contaminants like human keratin might get considered by error as UPS1.
MaxQuant is free software provided by the Max-Planck-Institute, see also Tyanova et al 2016. Later in this document data from MaxQuant will by frequently abbreviated as MQ.
Typically MaxQuant exports quantitation data on level of consensus-proteins by default to a folder called txt with a file called “proteinGroups.txt” . So in a standard case (when the file name has not been changed manually) it is sufficient to provide the path to this file. Of course, you can explicitely point to a specific file, as shown below. With the data presented here MaxQuant version 1.6.10 was run. Files compressed as .gz can be read, too (like in the example below).
<- system.file("extdata", package="wrProteo")
path1 <- "proteinGroups.txt.gz"
fiNaMQ <- list(conta="CON_|LYSC_CHICK", mainSpecies="OS=Saccharomyces cerevisiae", spike=UPS1$ac)
specPrefMQ
<- readMaxQuantFile(path1, file=fiNaMQ, specPref=specPrefMQ, refLi="mainSpe")
dataMQ #> -> readMaxQuantFile : Suspicious filename, this function was designed for reading tabulated text files produced by MaxQuant
#> -> readMaxQuantFile : Note: Found 11 proteins marked as 'REV_' (reverse peptide identification) - Removing
#> -> readMaxQuantFile : Display 2845(9.5%) initial '0' values as 'NA'
#> -> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#> -> readMaxQuantFile : Note: 6 proteins with unknown species
#> by species : Homo sapiens: 49, Mus musculus: 1, Saccharomyces cerevisiae: 1047, Sus scrofa: 1,
#> -> readMaxQuantFile : Found 75 composite accession-numbers, truncating (eg P00761;CON__P00761)
#> -> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
#> -> readMaxQuantFile : Normalize using subset of 1047
#> -> readMaxQuantFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
#> -> readMaxQuantFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
The data were imported and median-normalized, the protein annotation was parsed to automatically extract IDs, protein-names and species information.
## the number of lines and colums
dim(dataMQ$quant)
#> [1] 1104 27
## a summary of the quantitation data
summary(dataMQ$quant[,1:8]) # the first 8 cols
#> 12500amol_1 12500amol_2 12500amol_3 125amol_1
#> Min. :17.52 Min. :15.64 Min. :14.90 Min. :15.19
#> 1st Qu.:22.50 1st Qu.:22.49 1st Qu.:22.50 1st Qu.:22.39
#> Median :23.46 Median :23.46 Median :23.46 Median :23.43
#> Mean :23.69 Mean :23.65 Mean :23.67 Mean :23.60
#> 3rd Qu.:24.82 3rd Qu.:24.76 3rd Qu.:24.78 3rd Qu.:24.82
#> Max. :30.31 Max. :30.29 Max. :30.35 Max. :30.25
#> NA's :98 NA's :100 NA's :104 NA's :114
#> 125amol_2 125amol_3 25000amol_1 25000amol_2
#> Min. :14.86 Min. :14.91 Min. :15.78 Min. :18.42
#> 1st Qu.:22.36 1st Qu.:22.40 1st Qu.:22.52 1st Qu.:22.54
#> Median :23.42 Median :23.44 Median :23.53 Median :23.53
#> Mean :23.59 Mean :23.62 Mean :23.74 Mean :23.73
#> 3rd Qu.:24.81 3rd Qu.:24.79 3rd Qu.:24.95 3rd Qu.:24.89
#> Max. :30.25 Max. :30.30 Max. :30.31 Max. :30.13
#> NA's :106 NA's :114 NA's :109 NA's :115
Now we can summarize the presence of UPS1 proteins after treatment by MaxQuant : In sum 48 UPS1 proteins were found, 0 are missing.
ProteomeDiscoverer is commercial software from ThermoFisher (www.thermofisher.com). Later in this document data from ProteomeDiscoverer will by frequently abbreviated as PD.
With the data used here, the identification was performed using the XCalibur module of ProteomeDiscoverer version 2.4 . Quantitation data at the level of consensus-proteins can be exported to tabulated text files, which can be treated by the function shown below. The resultant data were export in tablulated format and the file automatically named ‘_Proteins.txt’ (the option R-headers was checked, however this option is not mandatory). Files compressed as .gz can be read, too (like in the example below).
<- system.file("extdata", package="wrProteo")
path1 <- "pxd001819_PD24_Proteins.txt.gz"
fiNaPd
## Note: data exported from ProteomeDiscoverer frequently do not have proper column-names, providing names here
<- paste(rep(UPSconc, each=3),"amol_",rep(1:3,length(UPSconc)),sep="")
sampNa <- list(conta="Bos tauris|Gallus", mainSpecies="OS=Saccharomyces cerevisiae", spike=UPS1$ac)
specPrefPD
<- readProtDiscovFile(file=fiNaPd, path=path1, sampleNames=sampNa, refLi="mainSpe", specPref=specPrefPD)
dataPD #> -> readProtDiscovFile : setting 'annotCol' to export of 'R-friendly' colnames
#> -> readProtDiscovFile : Note: 10 (out of 1297) lines with unrecognized species
#> -> readProtDiscovFile : Count by 'specPref' : Gallus gallus: 1 ; Homo sapiens: 47 ; Saccharomyces cerevisiae: 1239 ;
#> -> readProtDiscovFile : Removing 1 lines due to duplicated Accessions (typically due to contaminants)
#> -> readProtDiscovFile : Normalize using (custom) subset of 1239 lines
#> -> readProtDiscovFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
#> -> readProtDiscovFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
The data were imported and median-normalized, the protein annotation was parsed to atomatically extract IDs, protein-names and species information.
Note, that the original column-heads in the file exported from ProteomeDiscoverer has simply increasing numbers as names. Thus, much care should be taken on the order when preparing the vector sampleNames with the names to use instead.
## the number of lines and colums
dim(dataPD$quant)
#> [1] 1296 27
## a summary of the quantitation data
summary(dataPD$quant[,1:8]) # the first 8 cols
#> 50amol_1 50amol_2 50amol_3 125amol_1
#> Min. :13.68 Min. :11.60 Min. : 9.609 Min. :10.92
#> 1st Qu.:18.15 1st Qu.:18.19 1st Qu.:18.237 1st Qu.:18.23
#> Median :19.36 Median :19.37 Median :19.380 Median :19.37
#> Mean :19.52 Mean :19.53 Mean :19.539 Mean :19.52
#> 3rd Qu.:20.83 3rd Qu.:20.88 3rd Qu.:20.895 3rd Qu.:20.84
#> Max. :26.33 Max. :26.38 Max. :26.418 Max. :26.41
#> NA's :92 NA's :93 NA's :87 NA's :88
#> 125amol_2 125amol_3 250amol_1 250amol_2
#> Min. :10.51 Min. :11.15 Min. :10.04 Min. :10.21
#> 1st Qu.:18.21 1st Qu.:18.20 1st Qu.:18.15 1st Qu.:18.17
#> Median :19.40 Median :19.40 Median :19.40 Median :19.37
#> Mean :19.51 Mean :19.53 Mean :19.50 Mean :19.47
#> 3rd Qu.:20.82 3rd Qu.:20.83 3rd Qu.:20.78 3rd Qu.:20.78
#> Max. :26.37 Max. :26.47 Max. :26.50 Max. :26.45
#> NA's :99 NA's :86 NA's :87 NA's :77
# there are proteins where the 'OS='-tag won't be visible as Species (orig Fasta-header and Protein-name not accessible) :
which(is.na(dataPD$annot[,"Species"]) & dataPD$annot[,"SpecType"]=="species2") # is NA as Species
#> P02768
#> 25
Confirming the presence of UPS1 proteins by ProteomeDiscoverer:
Now we can summarize the presence of UPS1 proteins after treatment by ProteomeDiscoverer : In sum 48 UPS1 proteins were found, 0 are missing.
Proline is open-source software provided by the Profi-consortium (see also proline-core on github), published by Bouyssié et al 2020. Later in this document data from Proline will by frequently abbreviated as PL.
Protein identification in Proline gets performed by SearchGUI, see also Vaudel et al 2015. In this case X!Tandem (see also Duncan et al 2005) was used as search engine.
Quantitation data at the level of consensus-proteins can be exported from Proline as .xlsx or tabulated text files, both formats can be treated by the import-functions shown below. Here, Proline version 1.6.1 was used.
<- system.file("extdata", package="wrProteo")
path1 <- "pxd001819_PL.xlsx"
fiNaPl
<- c(conta="_conta", mainSpecies="OS=Saccharomyces cerevisiae", spike="_ups")
specPrefPL <- readProlineFile(file.path(path1,fiNaPl), specPref=specPrefPL, normalizeMeth="median", refLi="mainSpe", silent=TRUE)
dataPL #> -> readProlineFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
#> -> readProlineFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
In addition, we need to correct the quantification column-heads (like ‘Levure2ug+ UPS1-100amol-1’) and bring them to a simpler version :
head(colnames(dataPL$raw),8)
#> [1] "Levure2ug+ UPS1-100amol-1" "Levure2ug+ UPS1-100amol-2"
#> [3] "Levure2ug+ UPS1-100amol-3" "Levure2ug+ UPS1-250amol-1"
#> [5] "Levure2ug+ UPS1-250amol-2" "Levure2ug+ UPS1-250amol-3"
#> [7] "Levure2ug+ UPS1-500amol-1" "Levure2ug+ UPS1-500amol-2"
<- cbind(ini=paste0("-",c("100a","250a","500a","1f","5f","10f","25f","50f","100f"),"mol-"),
sub1 paste0(fin=c("50a","125a","250a","500a","2500a","5000a","12500a","25000a","50000a"),"mol_"))
<- cleanListCoNames(dataPL, rem=c("abundance_","Levure2ug+ UPS1"), subst=sub1)
dataPL ## let's check the result
head(colnames(dataPL$raw),8)
#> [1] "50amol_1" "50amol_2" "50amol_3" "125amol_1" "125amol_2" "125amol_3"
#> [7] "250amol_1" "250amol_2"
## the number of lines and colums
dim(dataPL$quant)
#> [1] 1186 27
## a summary of the quantitation data
summary(dataPL$quant[,1:8]) # the first 8 cols
#> 50amol_1 50amol_2 50amol_3 125amol_1
#> Min. :14.29 Min. :14.53 Min. :13.77 Min. :14.41
#> 1st Qu.:19.64 1st Qu.:19.66 1st Qu.:19.70 1st Qu.:19.75
#> Median :21.64 Median :21.55 Median :21.65 Median :21.65
#> Mean :21.66 Mean :21.66 Mean :21.71 Mean :21.75
#> 3rd Qu.:23.64 3rd Qu.:23.61 3rd Qu.:23.63 3rd Qu.:23.70
#> Max. :29.52 Max. :29.42 Max. :29.44 Max. :29.50
#> NA's :44 NA's :40 NA's :40 NA's :46
#> 125amol_2 125amol_3 250amol_1 250amol_2
#> Min. :13.53 Min. :14.76 Min. :14.55 Min. :14.74
#> 1st Qu.:19.76 1st Qu.:19.76 1st Qu.:19.74 1st Qu.:19.72
#> Median :21.64 Median :21.65 Median :21.65 Median :21.61
#> Mean :21.73 Mean :21.73 Mean :21.71 Mean :21.69
#> 3rd Qu.:23.69 3rd Qu.:23.65 3rd Qu.:23.62 3rd Qu.:23.63
#> Max. :29.49 Max. :29.43 Max. :29.42 Max. :29.40
#> NA's :46 NA's :48 NA's :54 NA's :51
Now we can summarize the presence of UPS1 proteins after treatment by Proline : In sum 48 UPS1 proteins were found, 0 are missing.
For easy and proper comparisons we need to make sure all columns are in the same order.
# get all results (MaxQuant,ProteomeDiscoverer, ...) in same order
<- paste0(rep(UPSconc,each=3),"amol")
grp9
## it is more convenient to re-order columns this way in each project
<- corColumnOrder(dataPD, sampNames=sampNa) # already in good order
dataPD #> -> corColumnOrder : Order already correct
<- corColumnOrder(dataMQ, sampNames=sampNa)
dataMQ <- corColumnOrder(dataPL, sampNames=sampNa)
dataPL #> -> corColumnOrder : Order already correct
At import we made use of the argument specPref (specifying _‘mainSpecies’, ‘conta’ and ‘spike’) which allows to build categories based on searching keywords based on the initial annotation. In turn, we obtain the labels : ‘main Spe’ for yeast (ie matrix), ‘species2’ for the UPS1 (ie spike) ‘conta’ for contaminants. Let’s replace the first two generic terms by more specific ones (ie ‘Yeast’ and ‘UPS1’) :
## Need to rename $annot[,"SpecType"]
<- replSpecType(dataPD, replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")))
dataPD #> useLi 1 2 3 4 5 6
#> useLi 24 25 36 37 45 70
<- replSpecType(dataMQ, replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")))
dataMQ #> useLi 1 12 13 14 15 16
#> useLi 4 11 21 27 39 47
<- replSpecType(dataPL, replBy=cbind(old=c("mainSpe","species2"), new=c("Yeast","UPS1")))
dataPL #> useLi 1 2 3 4 5 6
#> useLi 20 22 48 53 54 71
## Need to address missing ProteinNames (UPS1) due to missing tags in Fasta
<- replMissingProtNames(dataPD)
dataPD #> -> replreplMissingProtNames : ..trying to replace 1296 'EntryName'
<- replMissingProtNames(dataMQ)
dataMQ #> -> replreplMissingProtNames : ..trying to replace 5 'EntryName'
<- replMissingProtNames(dataPL) dataPL
## extract names of quantified UPS1-proteins
<- dataPD$annot[which(dataPD$annot[,"SpecType"]=="UPS1"),"Accession"]
NamesUpsPD <- dataMQ$annot[which(dataMQ$annot[,"SpecType"]=="UPS1"),"Accession"]
NamesUpsMQ <- dataPL$annot[which(dataPL$annot[,"SpecType"]=="UPS1"),"Accession"] NamesUpsPL
<- mergeVectors(PD=table(dataPD$annot[,"SpecType"]), MQ=table(dataMQ$annot[,"SpecType"]), PL=table(dataPL$annot[,"SpecType"]))
tabS <- mergeVectors(PD=table(dataPD$annot[,"Species"]), MQ=table(dataMQ$annot[,"Species"]), PL=table(dataPL$annot[,"Species"]))
tabT which(is.na(tabT))] <- 0
tabT[kable(cbind(tabS[,2:1], tabT), caption="Number of proteins identified, by custom tags, species and software")
Yeast | UPS1 | Gallus gallus | Homo sapiens | Mus musculus | Saccharomyces cerevisiae | Sus scrofa | |
---|---|---|---|---|---|---|---|
PD | 1239 | 48 | 1 | 47 | 0 | 1239 | 0 |
MQ | 1047 | 48 | 0 | 49 | 1 | 1047 | 1 |
PL | 1137 | 48 | 0 | 48 | 0 | 1137 | 1 |
The initial fasta file also contained the yeast strain number, this has been stripped off when using default parameters.
No additional normalization is needed, all data were already median normalized to the host proteins (ie Saccharomyces cerevisiae) after importing the initial quantification-output using ‘readMaxQuantFile()’, ‘readProlineFile()’ and ‘readProtDiscovFile()’.
As mentioned in the general vignette of this package, ‘wrProteoVignette1’, it is important to investigate the nature of NA-values. In particular, checking the hypothesis that NA-values originate from very low abundance instances is very important for deciding how to treat NA-values furtheron.
## Let's inspect NA values from ProteomeDiscoverer as graphic
matrixNAinspect(dataPD$quant, gr=grp9, tit="ProteomeDiscoverer")
## Let's inspect NA values from MaxQuant as graphic
matrixNAinspect(dataMQ$quant, gr=gl(length(UPSconc),3), tit="MaxQuant")
## Let's inspect NA values from Proline as graphic
matrixNAinspect(dataPL$quant, gr=grp9, tit="Proline")
A key element to understand the nature of NA-value is to investigate their NA-neighbours. If a given protein has for just one of the 3 replicates an NA, the other two valid quantifications can be considered as NA-neighbours. In the figures above all NA-neighbours are shown in the histogram and their mode is marked by an arrow. One can see, that NA-neighbours are predominantely (but not exclusively) part of the lower quantitation values. This supports the hypothesis that NAs occur most frequently with low abundance proteins.
NA-values represent a challange for statistical testing. In addition, techniques like PCA don’t allow NAs, neither.
The number of NAs varies between samples : Indeed, very low concentrations of UPS1 are difficult to get detected and contribute largely to the NAs (as we will see later in more detail). Since the amout of yeast proteins (ie the matrix in this setup) stays constant across all samples, yeast proteins should always get detected the same way.
## Let's look at the number of NAs. Is there an accumulated number in lower UPS1 samples ?
<- rbind(PD=sumNAperGroup(dataPD$raw, grp9), MQ=sumNAperGroup(dataMQ$raw, grp9), PL=sumNAperGroup(dataPL$raw, grp9) )
tabSumNA kable(tabSumNA, caption="Number of NAs per group of samples", align="r")
50amol | 125amol | 250amol | 500amol | 2500amol | 5000amol | 12500amol | 25000amol | 50000amol | |
---|---|---|---|---|---|---|---|---|---|
PD | 272 | 273 | 257 | 234 | 205 | 207 | 195 | 209 | 220 |
MQ | 318 | 334 | 323 | 337 | 282 | 297 | 302 | 330 | 322 |
PL | 124 | 140 | 157 | 140 | 137 | 139 | 131 | 141 | 131 |
In the section above we investigated the circumstances of NA-instances and provided evidence that NA-values typically represent proteins with low abundance which frequently ended up as non-detectable (NA). Thus, we hypothesize that (in most cases) NA-values might also have been detected in quantities like their NA-neighbours. In consequence, we will model a normal distribution based on the NA-neighbours and use for substituting.
The function testRobustToNAimputation()
from this package (wrProteo) allows to perform NA-imputation and subsequent statistical testing (after repeated imputation) between all groups of samples (see also the general vignette). One of the advantages of this implementation, is that multiple rounds of imputation are run, so that final results (including pair-wise testing) get stabilized to (rare) stochastic effects. For this reason one may also speak of stabilized NA-imputations.
The statistical tests used underneith make use of the shrinkage-procedure provided from the empirical Bayes procedure as implemented to the Bioconductor package limma, see also Ritchie et al 2015. In addition, various formats of multiple testing correction can be added to the results : Benjamini-Hochberg FDR (lateron referred to as BH or BH-FDR, see FDR on Wikipedia, see also Benjamini and Hochberg 1995), local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008), or modified testing by ROTS, etc … In this vignette we will make use of the BH-FDR.
We are ready to launch the NA-imputation and testing for data from ProteomeDiscoverer :
<- testRobustToNAimputation(dataPD, gr=grp9, imputMethod="informed") # ProteomeDiscoverer
testPD #> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density', length of x =976, 'bandw' has been set to 44
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 32920 , n.NA = 2072
#> impute based on 'informed' using mode= 17.34 and sd= 0.5
#> note mean for impuation is below 0.05 quantile !!
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1259 1252 1245 1248 1234 1235 1247 1259 1253 1251 1263 1235 1247 1238 1264 1255 1257 1253 1255 1254 1259 1246 1257 1252 1252 1248 1262 1255 1247 1257 1251 1249 1245 1261 1251 1248 out of 1296
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1213 1206 1203 1203 1202 1194 1229 1237 1232 1229 1250 1213 1225 1216 1241 1240 1238 1235 1235 1235 1238 1229 1238 1234 1233 1230 1236 1235 1231 1240 1236 1232 1229 1240 1236 1232
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 1172, 1169, 1172, 1170, 1174, 1179, 1166, 1175, 1182, 1191, 1171, 1174, 1184, 1184, 1207, 1175, 1180, 1186, 1197, 1208, 1209, 1165, 1175, 1183, 1193, 1205, 1206, 1213, 1160, 1173, 1175, 1186, 1194, 1196, 1204 and 1199
<- testRobustToNAimputation(dataPD, gr=grp9) # ProteomeDiscoverer
testPD2 #> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density', length of x =976, 'bandw' has been set to 44
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 32920 , n.NA = 2072
#> impute based on 'mode2' using mean= 17.34 and sd= 0.5
#>
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1259 1252 1245 1248 1234 1235 1247 1259 1253 1251 1263 1235 1247 1238 1264 1255 1257 1253 1255 1254 1259 1246 1257 1252 1252 1248 1262 1255 1247 1257 1251 1249 1245 1261 1251 1248 out of 1296
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1245 1241 1229 1235 1216 1215 1236 1245 1241 1238 1253 1223 1236 1226 1252 1245 1245 1242 1242 1240 1248 1235 1245 1242 1241 1235 1250 1243 1236 1248 1242 1237 1234 1252 1241 1239
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 1174, 1172, 1176, 1172, 1177, 1183, 1168, 1178, 1182, 1194, 1171, 1176, 1185, 1186, 1209, 1176, 1182, 1187, 1200, 1210, 1210, 1166, 1177, 1184, 1196, 1208, 1207, 1214, 1161, 1177, 1177, 1189, 1197, 1198, 1207 and 1201
<- testRobustToNAimputation(dataPD, gr=grp9, imputMethod="modeadopt") # ProteomeDiscoverer
testPD3 #> -> testRobustToNAimputation -> matrixNAneighbourImpute : Substituting dynamically based on mean per number of NAs
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 32920 , n.NA = 2072
#> impute based on 'modeadopt' using mean= 17.09, 16.3 and 15.51 (for 1, 2 and 3 NAs) and sd= 0.5
#> note mean for impuation is below 0.05 quantile !!
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1259 1252 1245 1248 1234 1235 1247 1259 1253 1251 1263 1235 1247 1238 1264 1255 1257 1253 1255 1254 1259 1246 1257 1252 1252 1248 1262 1255 1247 1257 1251 1249 1245 1261 1251 1248 out of 1296
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1184 1183 1178 1179 1182 1178 1210 1212 1213 1208 1233 1190 1205 1191 1206 1221 1214 1212 1215 1206 1206 1209 1211 1213 1213 1202 1208 1204 1212 1211 1214 1213 1211 1217 1213 1210
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 1169, 1164, 1164, 1160, 1169, 1171, 1164, 1167, 1174, 1185, 1166, 1167, 1178, 1176, 1197, 1167, 1169, 1178, 1188, 1201, 1198, 1158, 1166, 1174, 1186, 1195, 1199, 1201, 1158, 1159, 1166, 1177, 1186, 1189, 1195 and 1191
Then for MaxQuant …
<- testRobustToNAimputation(dataMQ, gr=grp9, imputMethod="informed") # MaxQuant
testMQ #> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density', length of x =1142, 'bandw' has been set to 47
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 26963 , n.NA = 2845
#> impute based on 'informed' using mode= 21.46 and sd= 0.5
#> note mean for impuation is below 0.05 quantile !!
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1051 1043 1022 1051 1017 1019 1027 1051 1046 1048 1067 1022 1032 1028 1061 1040 1050 1050 1056 1030 1061 1032 1047 1040 1045 1037 1060 1049 1027 1050 1047 1051 1027 1066 1034 1039 out of 1104
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1001 997 993 1001 993 990 997 1017 1011 1011 1046 1006 1012 1008 1034 1028 1037 1035 1038 1014 1036 1023 1034 1025 1031 1023 1035 1031 1020 1040 1039 1039 1017 1047 1025 1028
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 963, 958, 965, 953, 956, 965, 939, 951, 955, 956, 941, 944, 947, 944, 975, 940, 948, 950, 952, 969, 974, 932, 937, 941, 946, 974, 969, 977, 926, 937, 938, 940, 958, 961, 972 and 964
<- testRobustToNAimputation(dataMQ, gr=grp9) # MaxQuant
testMQ2 #> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density', length of x =1142, 'bandw' has been set to 47
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 26963 , n.NA = 2845
#> impute based on 'mode2' using mean= 21.46 and sd= 0.5
#>
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1051 1043 1022 1051 1017 1019 1027 1051 1046 1048 1067 1022 1032 1028 1061 1040 1050 1050 1056 1030 1061 1032 1047 1040 1045 1037 1060 1049 1027 1050 1047 1051 1027 1066 1034 1039 out of 1104
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1039 1034 1009 1037 1002 999 1018 1037 1034 1030 1064 1015 1024 1019 1051 1038 1046 1044 1049 1021 1055 1030 1042 1033 1039 1031 1054 1043 1024 1045 1044 1048 1024 1061 1031 1035
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 967, 960, 966, 956, 956, 965, 943, 954, 959, 959, 943, 944, 947, 945, 978, 944, 950, 952, 954, 972, 976, 934, 938, 942, 946, 975, 969, 979, 927, 938, 938, 941, 961, 962, 974 and 964
<- testRobustToNAimputation(dataMQ, gr=grp9, imputMethod="modeadopt") # MaxQuant
testMQ3 #> -> testRobustToNAimputation -> matrixNAneighbourImpute : Substituting based on mode of all 1142 NA-neighbours
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 26963 , n.NA = 2845
#> impute based on 'modeadopt' using mean= 21.46 and sd= 0.5
#>
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1051 1043 1022 1051 1017 1019 1027 1051 1046 1048 1067 1022 1032 1028 1061 1040 1050 1050 1056 1030 1061 1032 1047 1040 1045 1037 1060 1049 1027 1050 1047 1051 1027 1066 1034 1039 out of 1104
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1039 1033 1009 1036 1002 1002 1017 1038 1034 1029 1063 1016 1025 1019 1052 1037 1046 1044 1049 1021 1055 1029 1042 1033 1039 1031 1054 1043 1024 1045 1044 1048 1024 1061 1031 1035
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 967, 960, 966, 956, 956, 966, 943, 954, 959, 958, 943, 944, 947, 945, 978, 944, 950, 952, 954, 972, 976, 934, 938, 942, 946, 975, 969, 979, 927, 938, 938, 941, 961, 962, 974 and 964
And finally for Proline :
<- testRobustToNAimputation(dataPL, gr=grp9) # Proline
testPL #> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density', length of x =413, 'bandw' has been set to 28
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 30782 , n.NA = 1240
#> impute based on 'mode2' using mean= 17.74 and sd= 0.5
#> note mean for impuation is below 0.05 quantile !!
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1156 1154 1148 1155 1142 1145 1149 1156 1154 1153 1167 1161 1163 1159 1167 1150 1158 1155 1156 1150 1169 1147 1155 1153 1154 1148 1167 1153 1149 1156 1154 1154 1148 1167 1151 1149 out of 1186
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1140 1138 1126 1142 1125 1128 1134 1140 1138 1137 1155 1145 1147 1142 1150 1138 1145 1141 1143 1136 1154 1135 1141 1138 1140 1133 1152 1140 1139 1144 1142 1142 1136 1154 1139 1137
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 1113, 1107, 1109, 1116, 1116, 1112, 1108, 1111, 1105, 1113, 1113, 1110, 1104, 1109, 1120, 1113, 1116, 1110, 1118, 1122, 1123, 1109, 1112, 1106, 1115, 1121, 1120, 1126, 1112, 1113, 1108, 1115, 1120, 1122, 1125 and 1123
<- testRobustToNAimputation(dataPL, gr=grp9) #
testPL2 #> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density', length of x =413, 'bandw' has been set to 28
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 30782 , n.NA = 1240
#> impute based on 'mode2' using mean= 17.74 and sd= 0.5
#> note mean for impuation is below 0.05 quantile !!
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1156 1154 1148 1155 1142 1145 1149 1156 1154 1153 1167 1161 1163 1159 1167 1150 1158 1155 1156 1150 1169 1147 1155 1153 1154 1148 1167 1153 1149 1156 1154 1154 1148 1167 1151 1149 out of 1186
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1140 1140 1126 1142 1125 1127 1133 1138 1136 1136 1155 1145 1146 1141 1150 1137 1144 1140 1142 1135 1153 1135 1141 1137 1139 1133 1151 1139 1139 1144 1142 1142 1136 1154 1138 1137
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 1114, 1108, 1109, 1116, 1116, 1112, 1107, 1109, 1103, 1112, 1113, 1110, 1104, 1109, 1120, 1112, 1115, 1109, 1117, 1121, 1122, 1109, 1112, 1105, 1114, 1121, 1120, 1125, 1112, 1113, 1108, 1115, 1120, 1122, 1124 and 1123
<- testRobustToNAimputation(dataPL, gr=grp9, imputMethod="modeadopt") #
testPL3 #> -> testRobustToNAimputation -> matrixNAneighbourImpute : Substituting based on mode of all 413 NA-neighbours
#> -> testRobustToNAimputation -> matrixNAneighbourImpute : n.woNA= 30782 , n.NA = 1240
#> impute based on 'modeadopt' using mean= 17.74 and sd= 0.5
#> note mean for impuation is below 0.05 quantile !!
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at presenceFilt: 1156 1154 1148 1155 1142 1145 1149 1156 1154 1153 1167 1161 1163 1159 1167 1150 1158 1155 1156 1150 1169 1147 1155 1153 1154 1148 1167 1153 1149 1156 1154 1154 1148 1167 1151 1149 out of 1186
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at abundanceFilt: 1140 1139 1126 1143 1123 1126 1133 1140 1137 1136 1155 1145 1145 1140 1149 1137 1144 1140 1141 1135 1153 1135 1141 1137 1139 1133 1151 1139 1139 1144 1142 1141 1136 1154 1138 1137
#> -> testRobustToNAimputation -> combineMultFilterNAimput : at NA> mean: 1113, 1107, 1109, 1116, 1115, 1111, 1107, 1111, 1104, 1112, 1113, 1110, 1103, 1109, 1119, 1112, 1115, 1109, 1117, 1121, 1122, 1109, 1112, 1105, 1114, 1121, 1120, 1125, 1112, 1113, 1108, 1115, 1120, 1122, 1124 and 1123
From these results we’ll use i) the NA-imputed version of our datasets for plotting principal components (PCA) and ii) the (stabilized) testing results for counting TP, FP, etc and to construct ROC curves.
Let’s add the NA-imputed data to our main object :
$datImp <- testPD$datImp # recuperate imputeded data to main data-object
dataPD$datImp <- testMQ$datImp
dataMQ$datImp <- testPL$datImp dataPL
In this section all proteins identified and quantified are compared in a pair-wise fashion based on the t-tests already run in the previous section. As mentioned, the experimental setup is very special, since all proteins that are truly changing are known in advance (the UPS1 spike-in proteins). Tables get constructed by counting based on various thresholds for considering given protein abundances as differential or not.
A traditional 5 percent FDR cut-off is used for Volcano-plots, while ROC-curves allow inspecting the entire range of potential cut-off values.
A very universal and simple way to analyze data is by checking as several pairwise comparisons, in particular, if the experimental setup does not include complete multifactorial plans.
This UPS1 spike-in experiment has 27 samples organized (according to meta-information) as 9 groups. Thus, one obtains in total 36 pair-wise comparisons which will make comparisons very crowded. The publication by Ramus et al 2016 focussed on 3 pairwise comparisons only. In this vignette it is shown how all of them can get considered.
Now, we’ll construct a table showing all possible pairwise-comparisons. Using the function numPairDeColNames() we can easily extract the UPS1 concentrations as numeric content and show the (log-)ratio of the pairwise comparisons (column ‘log2rat’), the final concentrations (columns ‘conc1’ and ‘conc2’, in amol) and the number of differentially abundant proteins passing 5% FDR (using classical Benjamini-Hochberg FDR (columns ‘sig.xx.BH’) or lfdr (Strimmer 2008, columns ‘sig._xx_.lfdr’ ).
## The number of differentially abundant proteins passing 5% FDR (ProteomeDiscoverer and MaxQuant)
<- cbind( sig.PD.BH=colSums(testPD$BH < 0.05, na.rm=TRUE), sig.PD.lfdr=if("lfdr" %in% names(testPD)) colSums(testPD$lfdr < 0.05, na.rm=TRUE),
signCount sig.MQ.BH=colSums(testMQ$BH < 0.05, na.rm=TRUE), sig.MQ.lfdr=if("lfdr" %in% names(testMQ)) colSums(testMQ$lfdr < 0.05, na.rm=TRUE),
sig.PL.BH=colSums(testPL$BH < 0.05, na.rm=TRUE), sig.PL.lfdr=if("lfdr" %in% names(testPL)) colSums(testPL$lfdr < 0.05, na.rm=TRUE) )
<- numPairDeColNames(testPD$BH, stripTxt="amol", sortByAbsRatio=TRUE)
table1 <- cbind(table1, signCount[table1[,1],])
table1 kable(table1, caption="All pairwise comparisons and number of significant proteins", align="c")
index | log2rat | conc1 | conc2 | sig.PD.BH | sig.PD.lfdr | sig.MQ.BH | sig.MQ.lfdr | sig.PL.BH | sig.PL.lfdr | |
---|---|---|---|---|---|---|---|---|---|---|
50000amol-50amol | 34 | 9.966 | 50 | 50000 | 572 | 492 | 377 | 304 | 572 | 487 |
25000amol-50amol | 31 | 8.966 | 50 | 25000 | 587 | 495 | 370 | 320 | 548 | 481 |
125amol-50000amol | 12 | 8.644 | 125 | 50000 | 356 | 289 | 237 | 165 | 415 | 380 |
12500amol-50amol | 29 | 7.966 | 50 | 12500 | 506 | 459 | 290 | 231 | 482 | 389 |
125amol-25000amol | 3 | 7.644 | 125 | 25000 | 327 | 278 | 182 | 157 | 311 | 228 |
250amol-50000amol | 15 | 7.644 | 250 | 50000 | 379 | 318 | 308 | 261 | 325 | 261 |
12500amol-125amol | 1 | 6.644 | 125 | 12500 | 220 | 168 | 84 | 60 | 236 | 191 |
25000amol-250amol | 9 | 6.644 | 250 | 25000 | 234 | 180 | 209 | 161 | 200 | 148 |
50000amol-500amol | 27 | 6.644 | 500 | 50000 | 360 | 310 | 251 | 174 | 363 | 287 |
5000amol-50amol | 35 | 6.644 | 50 | 5000 | 599 | 534 | 375 | 335 | 523 | 468 |
12500amol-250amol | 7 | 5.644 | 250 | 12500 | 131 | 88 | 55 | 47 | 130 | 109 |
25000amol-500amol | 24 | 5.644 | 500 | 25000 | 243 | 172 | 141 | 118 | 197 | 142 |
2500amol-50amol | 32 | 5.644 | 50 | 2500 | 556 | 476 | 306 | 235 | 491 | 474 |
125amol-5000amol | 17 | 5.322 | 125 | 5000 | 281 | 197 | 121 | 88 | 205 | 149 |
12500amol-500amol | 22 | 4.644 | 500 | 12500 | 156 | 119 | 56 | 31 | 129 | 84 |
125amol-2500amol | 5 | 4.322 | 125 | 2500 | 197 | 166 | 76 | 50 | 184 | 147 |
2500amol-50000amol | 14 | 4.322 | 2500 | 50000 | 284 | 238 | 169 | 152 | 254 | 190 |
250amol-5000amol | 20 | 4.322 | 250 | 5000 | 197 | 126 | 162 | 121 | 141 | 102 |
25000amol-2500amol | 6 | 3.322 | 2500 | 25000 | 44 | 32 | 14 | 10 | 69 | 49 |
2500amol-250amol | 10 | 3.322 | 250 | 2500 | 114 | 88 | 31 | 29 | 92 | 56 |
50000amol-5000amol | 21 | 3.322 | 5000 | 50000 | 331 | 272 | 158 | 112 | 324 | 282 |
5000amol-500amol | 28 | 3.322 | 500 | 5000 | 170 | 134 | 101 | 62 | 130 | 100 |
500amol-50amol | 36 | 3.322 | 50 | 500 | 426 | 385 | 216 | 191 | 380 | 340 |
12500amol-2500amol | 4 | 2.322 | 2500 | 12500 | 19 | 13 | 5 | 4 | 48 | 30 |
25000amol-5000amol | 18 | 2.322 | 5000 | 25000 | 52 | 34 | 22 | 18 | 73 | 56 |
2500amol-500amol | 25 | 2.322 | 500 | 2500 | 113 | 83 | 44 | 34 | 97 | 62 |
250amol-50amol | 33 | 2.322 | 50 | 250 | 354 | 268 | 183 | 136 | 315 | 253 |
12500amol-50000amol | 11 | 2.000 | 12500 | 50000 | 229 | 155 | 118 | 117 | 165 | 110 |
125amol-500amol | 23 | 2.000 | 125 | 500 | 21 | 17 | 6 | 2 | 30 | 18 |
12500amol-5000amol | 16 | 1.322 | 5000 | 12500 | 25 | 13 | 5 | 2 | 34 | 21 |
125amol-50amol | 30 | 1.322 | 50 | 125 | 277 | 261 | 120 | 94 | 216 | 175 |
12500amol-25000amol | 2 | 1.000 | 12500 | 25000 | 11 | 9 | 0 | 0 | 29 | 15 |
125amol-250amol | 8 | 1.000 | 125 | 250 | 15 | 12 | 0 | 0 | 2 | 1 |
25000amol-50000amol | 13 | 1.000 | 25000 | 50000 | 136 | 85 | 61 | 50 | 92 | 63 |
2500amol-5000amol | 19 | 1.000 | 2500 | 5000 | 13 | 11 | 2 | 1 | 13 | 10 |
250amol-500amol | 26 | 1.000 | 250 | 500 | 6 | 2 | 2 | 0 | 3 | 2 |
You can see that in numerous cases much more than the 48 UPS1 proteins showed up significant, ie yeast proteins supposed to remain constant also showed up in part as ‘sigificantly changing’. However, some proteins with enthousiastic FDR values have very low log-FC amplitude and will be removed by filtering in the following steps.
par(mar=c(5.5, 4.7, 4, 1))
imageW(table1[,c("sig.PD.BH","sig.MQ.BH","sig.PL.BH" )], col=RColorBrewer::brewer.pal(9,"YlOrRd"),
transp=FALSE, tit="Number of BH.FDR passing proteins by the quantification approaches")
mtext("dark red for high number signif proteins", cex=0.7)
In the original Ramus et al 2016 et al paper only 3 pairwise comparisons were further analyzed :
## Selection in Ramus paper
kable(table1[which(rownames(table1) %in% colnames(testPD$BH)[c(2,21,27)]),], caption="Selected pairwise comparisons (as in Ramus et al)", align="c")
index | log2rat | conc1 | conc2 | sig.PD.BH | sig.PD.lfdr | sig.MQ.BH | sig.MQ.lfdr | sig.PL.BH | sig.PL.lfdr | |
---|---|---|---|---|---|---|---|---|---|---|
50000amol-500amol | 27 | 6.644 | 500 | 50000 | 360 | 310 | 251 | 174 | 363 | 287 |
50000amol-5000amol | 21 | 3.322 | 5000 | 50000 | 331 | 272 | 158 | 112 | 324 | 282 |
12500amol-25000amol | 2 | 1.000 | 12500 | 25000 | 11 | 9 | 0 | 0 | 29 | 15 |
Volcano-plots offer additional insight in how statistical test results relate to log-fold-change of pair-wise comparisons. In addition, we can mark the different protein-groups (or species) by different symbols, see also the general vignette ‘wrProteoVignette1’ (from this package) and the vignette to the package wrGraph. Counting the number of proteins passing a classical threshold for differential expression combined with a filter for minimum log-fold-change is a good way to start.
As mentioned, the dataset from Ramus et al 2016 contains 9 different levels of UPS1 concentrations, in consequence 36 pair-wise comparisons are possible. Again, plotting all possible Volcano plots would make way too crowded plots, instead we’ll try to summarize (see ROC curves), cluster into groups and finally plot only a few representative ones.
Receiver Operator Curves (ROC) curves display sensitivity (True Positive Rate) versus 1-Specificity (False Positive Rate). They are typically used as illustrate and compare the discriminiative capacity of a yes/no decision system (here: differential abundance or not), see eg also the original publication Hand and Till 2001.
The data get constructed by sliding through a panel of threshold-values for the statistical tests instead of just using 0.05. Due to the experimental setup we know that all yeast proteins should stay constant and only UPS1 proteins are expected to change. For each of these threshold values one counts the number of true positives (TP), false positives (FP) etc, allowing then to calculate sensitivity and specificity.
In the case of bechmarking quantitation efforts, ROC curves are used to judge how well heterologous spikes UPS1 proteins can be recognized as differentially abundant while constant yeast matrix proteins should not get classified as differential. Finally, ROC curves let us also gain some additional insights in the question which cutoff may be optimal or if the commonly used 5-percent FDR threshld cutoff allows getting the best out of the testing system.
Below, these calculations of summarizeForROC() are run in batch.
layout(1)
<- lapply(table1[,1], function(x) summarizeForROC(testPD, useComp=x, annotCol="SpecType", spec=c("Yeast","UPS1"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocPD <- lapply(table1[,1], function(x) summarizeForROC(testMQ, useComp=x, annotCol="SpecType", spec=c("Yeast","UPS1"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocMQ <- lapply(table1[,1], function(x) summarizeForROC(testPL, useComp=x, annotCol="SpecType", spec=c("Yeast","UPS1"), tyThr="BH", plotROC=FALSE,silent=TRUE))
rocPL
# we still need to add the names for the pair-wise groups:
names(rocPD) <- names(rocMQ) <- names(rocPL) <- rownames(table1)
#names(rocPD) <- rownames(table1) #colnames(testPD$BH)[table1[,1]]
#names(rocMQ) <- colnames(testMQ$BH)[table1[,1]]
#names(rocPL) <- colnames(testPL$BH)[table1[,1]]
The next step consists in calculating the area under the curve (AUC) for the individual profiles of each pairwise comparison.
## calulate AUC for each ROC
<- cbind(ind=table1[match(names(rocPD), rownames(table1)),"index"], clu=NA,
AucAll PD=sapply(rocPD, AucROC), MQ=sapply(rocMQ, AucROC), PL=sapply(rocPL, AucROC) )
To provide a quick overview, the clustered AUC values are displayed as PCA :
biplot(prcomp(AucAll[,names(methNa)]), cex=0.7, main="PCA of AUC from ROC Curves")
On this PCA one can see a rather disperse/distinct group of low concentration-pairs (separated by PC1, mostly containg at least one 50 or 500fmol), low/med conc pairs (containing 2500, ev 5000) and the rest (med+high to any, high degree of text-superposition). Furthermore, we can see that AUC values from MaxQuant correlate somehow less to Proline and ProteomeDiscoverer (red arrows).
Now we are ready to inspect the 5 clusters in detail :
As mentioned, there are too many pair-wise combinations available for plotting and inspecting all ROC-curves. So we can try to group similar pairwise comparison AUC values into clusters and then easily display representative examples for each cluster/group. Again, we (pre)define that we want to obtain 5 groups (like customer-ratings from 5 to 1 stars), a k-Means clustering approach was chosen.
## number of groups for clustering
<- 5
nGr ## K-Means clustering
<- stats::kmeans(standardW(AucAll[,c("PD","MQ","PL")]), nGr)$cluster
kMAx table(kMAx)
#> kMAx
#> 1 2 3 4 5
#> 6 8 18 3 1
"clu"] <- kMAx AucAll[,
<- reorgByCluNo(AucAll, cluNo=kMAx, useColumn=c("PD","MQ","PL"))
AucAll <- cbind(AucAll, iniInd=table1[match(rownames(AucAll), rownames(table1)), "index"])
AucAll colnames(AucAll)[1:(which(colnames(AucAll)=="index")-1)] <- paste("Auc",colnames(AucAll)[1:(which(colnames(AucAll)=="index")-1)], sep=".")
"cluNo"] <- rep(nGr:1, table(AucAll[,"cluNo"])) # make cluNo descending
AucAll[,
<- AucAll[,"cluNo"] # update
kMAx table(AucAll[,"cluNo"])
#>
#> 1 2 3 4 5
#> 6 1 8 3 18
## note : column 'index' is relative to table1, iniInd to ordering inside objects from clustering
To graphically summarize the AUC values, the clustered AUC values are plotted accompagnied by the geometric mean:
profileAsClu(AucAll[,c(1:length(methNa),(length(methNa)+2:3))], clu="cluNo", meanD="geoMean", tit="Pairwise Comparisons as Clustered AUC from ROC Curves", xlab="Comparison number", ylab="AUC", meLty=1, meLwd=3)
From this figure we can see clearly that there are some pairwise comparisons where all initial analysis-software results yield high AUC values, while other pairwise comparisons less discriminative power.
Again, now we can select a representative pairwise-comparison for each cluster (from the center of each cluster):
<- table(AucAll[,"cluNo"])[rank(unique(AucAll[,"cluNo"]))] # representative for each cluster
AucRep <- round(cumsum(AucRep) -AucRep/2 +0.1)
AucRep
## select representative for each cluster
kable(round(AucAll[AucRep,c("Auc.PD","Auc.MQ","Auc.PL","cluNo")],3), caption="Selected representative for each cluster ", align="c")
Auc.PD | Auc.MQ | Auc.PL | cluNo | |
---|---|---|---|---|
12500amol-500amol | 0.970 | 0.992 | 0.993 | 5 |
5000amol-500amol | 0.966 | 0.908 | 0.977 | 4 |
5000amol-50amol | 0.923 | 0.857 | 0.932 | 3 |
125amol-2500amol | 0.725 | 0.577 | 0.944 | 2 |
125amol-250amol | 0.562 | 0.432 | 0.481 | 1 |
Now we can check if some experimental UPS1 log-fold-change have a bias for some clusters.
<- sapply(5:1, function(x) { y <- table1[match(rownames(AucAll),rownames(table1)),]
ratTab table(factor(signif(y[which(AucAll[,"cluNo"]==x),"log2rat"],1), levels=unique(signif(table1[,"log2rat"],1))) )})
colnames(ratTab) <- paste0("clu",5:1,"\nn=",rev(table(kMAx)))
layout(1)
imageW(ratTab, tit="Frequency of rounded log2FC in the 5 clusters", xLab="log2FC (rounded)", col=RColorBrewer::brewer.pal(9,"YlOrRd"),las=1)
mtext("dark red for high number signif proteins", cex=0.7)
We can see, that the cluster of best ROC-curves (cluster 5) covers practically all UPS1 log-ratios from this experiment without being restricted just to the high ratios.
<- 2:5
colPanel <- 5
gr <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))
j
## table of all proteins in cluster
<- which(AucAll[,"cluNo"]==gr)
useLi <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
tmp as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")
cluNo | Auc.PD | Auc.MQ | Auc.PL | log2rat | sig.PD.BH | sig.MQ.BH | sig.PL.BH | |
---|---|---|---|---|---|---|---|---|
250amol-50000amol | 5 | 0.982 | 1.000 | 0.995 | 7.644 | 379 | 308 | 325 |
25000amol-250amol | 5 | 0.980 | 0.999 | 0.997 | 6.644 | 234 | 209 | 200 |
50000amol-500amol | 5 | 0.982 | 0.999 | 0.995 | 6.644 | 360 | 251 | 363 |
125amol-50000amol | 5 | 0.977 | 0.999 | 0.996 | 8.644 | 356 | 237 | 415 |
25000amol-500amol | 5 | 0.978 | 0.998 | 0.994 | 5.644 | 243 | 141 | 197 |
12500amol-250amol | 5 | 0.975 | 0.996 | 0.997 | 5.644 | 131 | 55 | 130 |
50000amol-50amol | 5 | 0.974 | 0.998 | 0.990 | 9.966 | 572 | 377 | 572 |
25000amol-50amol | 5 | 0.972 | 0.996 | 0.989 | 8.966 | 587 | 370 | 548 |
12500amol-500amol | 5 | 0.970 | 0.992 | 0.993 | 4.644 | 156 | 56 | 129 |
2500amol-50000amol | 5 | 0.963 | 0.999 | 0.992 | 4.322 | 284 | 169 | 254 |
25000amol-2500amol | 5 | 0.950 | 1.000 | 0.993 | 3.322 | 44 | 14 | 69 |
50000amol-5000amol | 5 | 0.961 | 0.995 | 0.986 | 3.322 | 331 | 158 | 324 |
12500amol-2500amol | 5 | 0.959 | 0.998 | 0.982 | 2.322 | 19 | 5 | 48 |
125amol-25000amol | 5 | 0.960 | 0.972 | 0.995 | 7.644 | 327 | 182 | 311 |
25000amol-5000amol | 5 | 0.947 | 0.996 | 0.977 | 2.322 | 52 | 22 | 73 |
12500amol-50amol | 5 | 0.933 | 0.990 | 0.986 | 7.966 | 506 | 290 | 482 |
12500amol-50000amol | 5 | 0.935 | 0.976 | 0.977 | 2.000 | 229 | 118 | 165 |
12500amol-5000amol | 5 | 0.926 | 0.973 | 0.934 | 1.322 | 25 | 5 | 34 |
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)
## representative ROC
<- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
jR plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)
#fcPar1 <- c(text="arrow: expected ratio at",loc="toright")
## This required package 'wrGraph' at version 1.2.5 (or higher)
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4,ncol=2))
try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
<- 4
gr <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))
j
## table of all proteins in cluster
<- which(AucAll[,"cluNo"]==gr)
useLi <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
tmp as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")
cluNo | Auc.PD | Auc.MQ | Auc.PL | log2rat | sig.PD.BH | sig.MQ.BH | sig.PL.BH | |
---|---|---|---|---|---|---|---|---|
250amol-5000amol | 4 | 0.972 | 0.925 | 0.979 | 4.322 | 197 | 162 | 141 |
5000amol-500amol | 4 | 0.966 | 0.908 | 0.977 | 3.322 | 170 | 101 | 130 |
2500amol-250amol | 4 | 0.929 | 0.886 | 0.963 | 3.322 | 114 | 31 | 92 |
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)
## representative ROC
<- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
jR plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4,ncol=2))
try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
<- 3
gr <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))
j
## table of all proteins in cluster
<- which(AucAll[,"cluNo"]==gr)
useLi <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
tmp as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")
cluNo | Auc.PD | Auc.MQ | Auc.PL | log2rat | sig.PD.BH | sig.MQ.BH | sig.PL.BH | |
---|---|---|---|---|---|---|---|---|
12500amol-125amol | 3 | 0.879 | 0.918 | 0.992 | 6.644 | 220 | 84 | 236 |
125amol-5000amol | 3 | 0.891 | 0.894 | 0.967 | 5.322 | 281 | 121 | 205 |
2500amol-500amol | 3 | 0.900 | 0.877 | 0.959 | 2.322 | 113 | 44 | 97 |
5000amol-50amol | 3 | 0.923 | 0.857 | 0.932 | 6.644 | 599 | 375 | 523 |
12500amol-25000amol | 3 | 0.888 | 0.882 | 0.934 | 1.000 | 11 | 0 | 29 |
2500amol-5000amol | 3 | 0.811 | 0.948 | 0.915 | 1.000 | 13 | 2 | 13 |
25000amol-50000amol | 3 | 0.850 | 0.882 | 0.926 | 1.000 | 136 | 61 | 92 |
2500amol-50amol | 3 | 0.862 | 0.846 | 0.882 | 5.644 | 556 | 306 | 491 |
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)
## representative ROC
<- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
jR plotROC(rocPD[[jR]],rocMQ[[jR]],rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4,ncol=2))
try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
<- 2
gr <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))
j
## table of all proteins in cluster
<- which(AucAll[,"cluNo"]==gr)
useLi <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
tmp as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")
cluNo | Auc.PD | Auc.MQ | Auc.PL | log2rat | sig.PD.BH | sig.MQ.BH | sig.PL.BH | |
---|---|---|---|---|---|---|---|---|
125amol-2500amol | 2 | 0.725 | 0.577 | 0.944 | 4.322 | 197 | 76 | 184 |
## frequent concentrations :
layout(matrix(1:2), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)
## representative ROC
<- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
jR plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4,ncol=2))
try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
<- 1
gr <- match(rownames(AucAll)[AucRep[6-gr]], colnames(testPD$t))
j
## table of all proteins in cluster
<- which(AucAll[,"cluNo"]==gr)
useLi <- cbind(round(as.data.frame(AucAll)[useLi,c("cluNo","Auc.PD","Auc.MQ","Auc.PL")],3),
tmp as.data.frame(table1)[match(names(useLi),rownames(table1)), c(2,5,7,9)])
kable(tmp, caption="AUC details for best pairwise-comparisons ", align="c")
cluNo | Auc.PD | Auc.MQ | Auc.PL | log2rat | sig.PD.BH | sig.MQ.BH | sig.PL.BH | |
---|---|---|---|---|---|---|---|---|
250amol-500amol | 1 | 0.589 | 0.561 | 0.561 | 1.000 | 6 | 2 | 3 |
125amol-500amol | 1 | 0.539 | 0.451 | 0.677 | 2.000 | 21 | 6 | 30 |
125amol-250amol | 1 | 0.562 | 0.432 | 0.481 | 1.000 | 15 | 0 | 2 |
125amol-50amol | 1 | 0.493 | 0.326 | 0.459 | 1.322 | 277 | 120 | 216 |
500amol-50amol | 1 | 0.452 | 0.330 | 0.469 | 3.322 | 426 | 216 | 380 |
250amol-50amol | 1 | 0.377 | 0.367 | 0.385 | 2.322 | 354 | 183 | 315 |
## frequent concentrations :
layout(matrix(1:2, ncol=1), heights=c(1,2.5))
plotConcHist(mat=tmp, ref=table1)
## representative ROC
<- match(rownames(AucAll)[AucRep[6-gr]], names(rocPD))
jR plotROC(rocPD[[jR]], rocMQ[[jR]], rocPL[[jR]], col=colPanel, methNames=methNa, pointSi=0.8, xlim=c(0,0.45),
txtLoc=c(0.12,0.1,0.033), tit=paste("Cluster",gr," Example: ",names(rocPD)[jR]), legCex=1)
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4,ncol=2))
try(VolcanoPlotW(testPD, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[1], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testMQ, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[2], expFCarrow=TRUE, silent=TRUE),silent=TRUE)
try(VolcanoPlotW(testPL, useComp=j, FCthrs=1.5, FdrThrs=0.05, annColor=c(4,2,3), ProjNa=methNa[3], expFCarrow=TRUE, silent=TRUE),silent=TRUE)}
We know from the experimental setup that there were 48 UPS1 proteins) present in the commercial mix added to a constant background of yeast-proteins. The lowest concentrations are extremely challenging and it is no surprise that many of them were not detected at the lowest concentration(s). In order to choose among the various concentrations of UPS1, let’s look how many NAs are in each group of replicates (ie before NA-imputation), and in particular, the number of NAs among the UPS1 proteins.
Previsouly we’ve looked at the total number of NAs, now let’s focus just on the UPS1 proteins. Obviously, instances of non-quantified UPS1 proteins make the following comparisons using these samples rather insecure, since NA-imputation is just an ‘educated guess’.
<- rbind(PD=sumNAperGroup(dataPD$raw[which(dataPD$annot[,"SpecType"]=="UPS1"),], grp9),
tab1 MQ=sumNAperGroup(dataMQ$raw[which(dataMQ$annot[,"SpecType"]=="UPS1"),], grp9),
PL= sumNAperGroup(dataPL$raw[which(dataPL$annot[,"SpecType"]=="UPS1"),], grp9) )
kable(tab1, caption="The number of NAs in the UPS1 proteins", align="c")
50amol | 125amol | 250amol | 500amol | 2500amol | 5000amol | 12500amol | 25000amol | 50000amol | |
---|---|---|---|---|---|---|---|---|---|
PD | 79 | 73 | 69 | 43 | 1 | 0 | 0 | 0 | 0 |
MQ | 112 | 113 | 109 | 98 | 19 | 10 | 3 | 4 | 1 |
PL | 27 | 32 | 34 | 20 | 0 | 0 | 0 | 0 | 0 |
One can see that starting the 5th level of UPS1 concentrations almost all UPS1 proteins were found in nearly all samples. In consequence we’ll avoid using all of them at all times, but this should be made depending on the very protein and quantification method.
Let’s look graphically at the number of NAs in each of the UPS1 proteins along the quantification methods :
<- function(dat, newOrd=UPS1$ac, relative=FALSE) { # count number of NAs per UPS protein and order as UPS
countRawNA <- rowSums(is.na(dat$raw[match(newOrd,rownames(dat$raw)),]))
out if(relative) out/nrow(dat$raw) else out }
<- cbind(PD=countRawNA(dataPD), MQ=countRawNA(dataMQ), PL=countRawNA(dataPL) )
sumNAperMeth <- sub("_UPS","",dataPL$annot[UPS1$ac,"EntryName"])
UPS1na par(mar=c(6.8, 3.5, 4, 1))
imageW(sumNAperMeth, rowNa=UPS1na, tit="Number of NAs in UPS proteins", xLab="", yLab="",
transp=FALSE, col=RColorBrewer::brewer.pal(9,"YlOrRd"))
mtext("dark red for high number of NAs",cex=0.7)
Typically the number of NAs is similar when comparing the different quantitation approaches, it tends to be a bit higher with MaxQuant. This means that some UPS1 proteins which are easier to (detect and) quantify than others. We can conclude, the capacity to successfully quantify a given protein depends on its abundance and its composition.
Plotting the principal components (PCA) typically allows to gain an overview on how samples are related to each other. This type of experiment is special for the fact that the majority of proteins is expected to remain constant (yeast matrix), while only the UPS1 proteins vary. Since we are primarily intereseted in the UPS1 proteins, the regular plots of PCA are not shown here, but PCA of the lines identified as UPS1.
Principal component analysis (PCA) cannot handle NA-values. Either all lines with any NAs have to be excluded, or data after NA-imputation have to be used. Here, the option of plotting data after NA-imputation was chosen (in the context of filtering UPS1 lines only one whould loose too many lines, ie proteins). Below plots are be made using the function plotPCAw()
from the package wrGraph. Via indexing we choose only the lines./proteins with the annoation ‘UPS1’.
plotPCAw(testPD$datImp[which(testPD$annot[,"SpecType"]=="UPS1"),], sampleGrp=grp9, tit="PCA on ProteomeDiscoverer, UPS1 only (NAs imputed)",rowTyName="proteins", useSymb2=0)
plotPCAw(testMQ$datImp[which(testMQ$annot[,"SpecType"]=="UPS1"),], sampleGrp=grp9, tit="PCA on MaxQuant, UPS1 only (NAs imputed)",rowTyName="proteins", useSymb2=0)
plotPCAw(testPL$datImp[which(testPL$annot[,"SpecType"]=="UPS1"),], sampleGrp=grp9, tit="PCA on Proline, UPS1 only (NAs imputed)",rowTyName="proteins", useSymb2=0)
Based on PCA plots one can see that the concentrations 125 - 500 aMol are very much alike and detecting differences may perform better when not combining them, as also confirmed by ROC part later. In the Screeplot we can see that the first principal component captures almost all variability. Thus, displaying the 3rd principal component (as done above) finally has no importance.
In order to have more data available for linear regression modelling it was decided to use UPS1 abundance values after NA-Imputation for linear regressions. Previously it was shown that NA values originate predominantly from absent or very low abundance quantitations, which justified relplacing NA values by low abundance values in a shrinkage like fashion.
As general indicator for data-quality and -usability let’s look at the intra-replicate variability. Here we plot all intra-group CVs (defined by UPS1-concentration), either the CVs for all quantified proteins or the UPS1 proteins only.
In the figure below the complete series (including yeast) is shown on the left side, the human UPS1 proteins only on the right side. Briefly, vioplots show a kernel-estimate for the distribution, in addition, a box-plot is also integrated (see vignette to package wrGraph).
## combined plot : all data (left), Ups1 (right)
layout(1:3)
<- list(length=18)
sumNAinPD 2*(1:length(unique(grp9))) -1] <- as.list(as.data.frame(log2(rowGrpCV(testPD$datImp, grp9))))
sumNAinPD[2*(1:length(unique(grp9))) ] <- as.list(as.data.frame(log2(rowGrpCV(testPD$datImp[which(testPD$annot[,"SpecType"]=="UPS1"),], grp9))))
sumNAinPD[names(sumNAinPD)[2*(1:length(unique(grp9))) -1] <- sub("amol","",unique(grp9))
names(sumNAinPD)[2*(1:length(unique(grp9))) ] <- paste(sub("amol","",unique(grp9)),"Ups",sep=".")
vioplotW(sumNAinPD, halfViolin="pairwise", tit="CV Intra Replicate, ProteomeDiscoverer", cexNameSer=0.6)
#> -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
mtext("left part : all data\nright part: UPS1",adj=0,cex=0.8)
<- list(length=18)
sumNAinMQ 2*(1:length(unique(grp9))) -1] <- as.list(as.data.frame(log2(rowGrpCV(testMQ$datImp, grp9))))
sumNAinMQ[2*(1:length(unique(grp9))) ] <- as.list(as.data.frame(log2(rowGrpCV(testMQ$datImp[which(testMQ$annot[,"SpecType"]=="UPS1"),], grp9))))
sumNAinMQ[names(sumNAinMQ)[2*(1:length(unique(grp9))) -1] <- sub("amol","",unique(grp9)) # paste(unique(grp9),"all",sep=".")
names(sumNAinMQ)[2*(1:length(unique(grp9))) ] <- paste(sub("amol","",unique(grp9)),"Ups",sep=".") #paste(unique(grp9),"Ups1",sep=".")
vioplotW(sumNAinMQ, halfViolin="pairwise", tit="CV intra replicate, MaxQuant",cexNameSer=0.6)
#> -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
mtext("left part : all data\nright part: UPS1",adj=0,cex=0.8)
<- list(length=18)
sumNAinPL 2*(1:length(unique(grp9))) -1] <- as.list(as.data.frame(log2(rowGrpCV(testPL$datImp, grp9))))
sumNAinPL[2*(1:length(unique(grp9))) ] <- as.list(as.data.frame(log2(rowGrpCV(testPL$datImp[which(testPL$annot[,"SpecType"]=="UPS1"),], grp9))))
sumNAinPL[names(sumNAinPL)[2*(1:length(unique(grp9))) -1] <- sub("amol","",unique(grp9))
names(sumNAinPL)[2*(1:length(unique(grp9))) ] <- paste(sub("amol","",unique(grp9)),"Ups",sep=".")
vioplotW(sumNAinPL, halfViolin="pairwise", tit="CV Intra Replicate, Proline", cexNameSer=0.6)
#> -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
mtext("left part : all data\nright part: UPS1",adj=0,cex=0.8)
The distribution of intra-group CV-values showed (without major surprise) that the highest UPS1 concentrations replicated best. This phenomenon also correlates with the content of NAs in the original data. When imputing NA-values it is a challange to respect the variability of the respective data (NA-neighbours) before NA-imputation. Many NA-values can be observed when looking at very low UPS1-doses and too few initial quantitations values may remain for meaningful comparisons. Of course, with an elevanted content of NAs the mechanism of NA-substitution will also contribute to masking (in part) the true variability.
In consequence pair-wise comparisons using one of the higher UPS1-concentrations group are expected to have a decent chance to rather specifically reveil a high number of UPS1 proteins.
Once can see that lower concentrations of UPS1 usually have worse CV (coefficient of variance) in the respective samples,
First, we construct a container for storing various measures and results which we will look at lateron.
## prepare object for storing all results
<- array(NA, dim=c(length(UPS1$ac),length(methNa),7), dimnames=list(UPS1$ac,c("PD","MQ","PL"),
datUPS1 c("sco","nPep","medAbund", "logp","slope","startFr","cluNo")))
Now we’ll calculate the linear models, extract slope & pval for each UPS1 protein. The functions used also allow plotting the resulting regression results, but plotting each UPS1 protein would make very crowded figures. Instead, we’ll plot representative examples only aftr clustering the regression-results.
<- list(length=length(NamesUpsPD))
lmPD <- FALSE
doPl 1:length(NamesUpsPD)] <- lapply(NamesUpsPD[1:length(NamesUpsPD)], linModelSelect, dat=dataPD,
lmPD[expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=doPl, silent=TRUE)
names(lmPD) <- NamesUpsPD
## We make a little summary of regression-results (ProteomeDiscoverer)
<- match(names(lmPD), UPS1$ac)
linIn 1,c("logp","slope","startFr")] <- cbind(log10(sapply(lmPD, function(x) x$coef[2,4])),
datUPS1[linIn,sapply(lmPD, function(x) x$coef[2,1]), sapply(lmPD, function(x) x$startLev) )
## need correct rownames in dataPD$datImp !!!
1,"medAbund"] <- apply(wrMisc::.scale01(dataPD$datImp)[match(UPS1$ac,rownames(dataPD$datImp)),],1,median,na.rm=TRUE) datUPS1[,
<- list(length=length(NamesUpsMQ))
lmMQ 1:length(NamesUpsMQ)] <- lapply(NamesUpsMQ[1:length(NamesUpsMQ)], linModelSelect, dat=dataMQ,
lmMQ[expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=doPl, silent=TRUE)
names(lmMQ) <- NamesUpsMQ
## We make a little summary of regression-results (MaxQuant)
<- match(names(lmMQ), UPS1$ac)
linIn 2,c("logp","slope","startFr")] <- cbind( log10(sapply(lmMQ, function(x) x$coef[2,4])),
datUPS1[linIn,sapply(lmMQ, function(x) x$coef[2,1]), sapply(lmMQ, function(x) x$startLev) )
2,"medAbund"] <- apply(wrMisc::.scale01(dataMQ$datImp)[match(UPS1$ac,rownames(dataMQ$datImp)),],1,median,na.rm=TRUE) datUPS1[,
<- list(length=length(NamesUpsPL))
lmPL 1:length(NamesUpsPL)] <- lapply(NamesUpsPL[1:length(NamesUpsPL)], linModelSelect, dat=dataPL,
lmPL[expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=doPl, silent=TRUE)
names(lmPL) <- NamesUpsPL
<- match(names(lmPL), UPS1$ac)
linIn 3,c("logp","slope","startFr")] <- cbind(log10(sapply(lmPL, function(x) x$coef[2,4])),
datUPS1[linIn,sapply(lmPL, function(x) x$coef[2,1]), sapply(lmPL, function(x) x$startLev) )
3,"medAbund"] <- apply(wrMisc::.scale01(dataPL$datImp)[match(UPS1$ac,rownames(dataPL$datImp)),],1,median,na.rm=TRUE) datUPS1[,
To get a general view, let’s look where regressions typically have their best starting-site (ie how many low concentrations points are usually better omitted):
## at which concentration of UPS1 did the best regression start ?
<- sapply(1:5, function(x) apply(datUPS1[,,"startFr"],2,function(y) sum(x==y)))
stTab colnames(stTab) <- paste("lev",1:5,sep="_")
kable(stTab, caption = "Frequency of starting levels for regression")
lev_1 | lev_2 | lev_3 | lev_4 | lev_5 | |
---|---|---|---|---|---|
PD | 2 | 21 | 12 | 3 | 10 |
MQ | 4 | 12 | 8 | 6 | 18 |
PL | 6 | 7 | 10 | 5 | 20 |
Next, we’ll inspect the relation between regression-slopes and p-values (for H0: slope=0) :
layout(matrix(1:4,ncol=2))
<- "fill according to median abundance (blue=low - green - red=high)"
subTi
plotMultRegrPar(datUPS1,1,xlim=c(-25,-1),ylim=c(0.1,1.6),tit="ProteomeDiscoverer UPS1, p-value vs slope",subTit=subTi)
plotMultRegrPar(datUPS1,2,xlim=c(-25,-1),ylim=c(0.1,1.6),tit="MaxQuant UPS1, p-value vs slope",subTit=subTi)
plotMultRegrPar(datUPS1,3,xlim=c(-25,-1),ylim=c(0.1,1.6),tit="Proline UPS1, p-value vs slope",subTit=subTi)
We can observe, that sope and (log)p-value of the resultant regressions do not necessarily correlate well. Thus, considering only one of these resultant values may not be sufficient.
When judging results for indivual UPS1 proteins one may see that both the value of the slope as well as the p-value (for H0:slope=0) are important to consider. For example, there are some cases where the quantitations lign up well giving a good p-value but with slopes < 0.4. This is definitely not the type of dose-response characteristics we are looking for. In consequence, let’s construct a combined score for these components slope and p-value for easier consideration of both elements at once :
for(i in 1:(dim(datUPS1)[2])) datUPS1[,i,"sco"] <- -datUPS1[,i,"logp"] - (datUPS1[,i,"slope"] -1)^2 # cut at > 8
Next, let’s bring together all linear-model scores, the number of peptides and meadian protein abundance for each of UPS1 proteins in one object to facilite further steps.
1,2] <- rowSums(dataPD$count[match(UPS1$ac,dataPD$annot[,1]),,"NoOfPeptides"], na.rm=TRUE)
datUPS1[,2,2] <- rowSums(dataMQ$count[match(UPS1$ac,dataMQ$annot[,1]),,1], na.rm=TRUE)
datUPS1[,3,2] <- rowSums(dataPL$count[match(UPS1$ac,dataPL$annot[,1]),,"NoOfPeptides"], na.rm=TRUE) datUPS1[,
Now we can explore the regression score and its context to other parameters, below it’s done graphically.
layout(matrix(1:4, ncol=2))
par(mar=c(5.5, 2.2, 4, 0.4))
<- RColorBrewer::brewer.pal(9,"YlOrRd")
col1 imageW(datUPS1[,,1], col=col1, tit="Linear regression score", xLab="",yLab="",transp=FALSE)
mtext("red for bad score", cex=0.75)
imageW(log(datUPS1[,,2]), tit="Number of peptides", xLab="",yLab="", col=col1, transp=FALSE)
mtext("dark red for high number of peptides", cex=0.75)
## ratio : regression score vs no of peptides
imageW(datUPS1[,,1]/log(datUPS1[,,2]), col=rev(col1), tit="Regression score / Number of peptides", xLab="",yLab="", transp=FALSE)
mtext("dark red for high (good) lmScore/peptide ratio)", cex=0.75)
## score vs abundance
imageW(datUPS1[,,1]/datUPS1[,,3], col=rev(col1), tit="Regression score / median Abundance", xLab="",yLab="", transp=FALSE)
mtext("dark red for high (good) lmScore/abundance ratio)", cex=0.75)
From the heatmap-like plots we can see that some proteins are rather consistently quantified better by any of the methods. Some of the varaibility may be explained by the number of peptides (in case of MaxQuant ‘razor-peptides’ were used), see plot of ‘regression score / number of peptides’.
In contrast, UPS-protein median abundance does not correlate or explain this phenomenon (see last plot ‘regression score / median abundance’). So we cannot support the hypothesis that highly abundant proteins get quantified better.
Using the linear regression score defined above we can rank UPS1 proteins and display representative ones in order to avoid crowed and repetitive figures.
Now, we can try to group the regression scores into groups and easily display representative examples for each group. Here, we (pre)define that we want to obtain 5 groups (like ratings from 1 -5 starts), a k-Means clustering approach was chosen.
## number of groups for clustering
<- 5
nGr
## clustering using kMeans
<- stats::kmeans(standardW(datUPS1[,,"sco"], byColumn=FALSE), nGr)$cluster
kMx "cluNo"] <- matrix(rep(kMx,dim(datUPS1)[2]), nrow=length(kMx))
datUPS1[,,
<- apply(datUPS1[,,"sco"], 1, function(x) prod(x)^(1/length(x))) # geometric mean across analysis soft
geoM <- lrbind(by(cbind(geoM,datUPS1[,,"sco"], clu=kMx), kMx, function(x) x[order(x[,1],decreasing=TRUE),])) # organize by clusters
geoM2 <- tapply(geoM2[,"geoM"], geoM2[,"clu"], median)
tmp "clu"] <- rep(rank(tmp, ties.method="first"), table(kMx))
geoM2[,<- geoM2[order(geoM2[,"clu"],geoM2[,"geoM"],decreasing=TRUE),] # order as decreasing median.per.cluster
geoM2 "clu"] <- rep(1:max(kMx), table(geoM2[,"clu"])[rank(unique(geoM2[,"clu"]))]) # replace cluster-names to increasing
geoM2[,
profileAsClu(geoM2[,2:4], geoM2[,"clu"], tit="Clustered Regression Results for UPS1 Proteins", ylab="Linear regression score")
<- datUPS1[match(rownames(geoM2), rownames(datUPS1)),,] # bring in new order
datUPS1 "cluNo"] <- geoM2[,"clu"] # update cluster-names
datUPS1[,,
### prepare annotation of UPS proteins
<- dataPL$annot[match(rownames(datUPS1), dataPL$annot[,1]), c(1,3)]
annUPS1 2] <- substr(sub("_UPS","",sub("generic_ups\\|[[:alnum:]]+-{0,1}[[:digit:]]\\|","",annUPS1[,2])),1,42) annUPS1[,
## index of representative for each cluster (median position inside cluster)
<- tapply(geoM2[,"geoM"], geoM2[,"clu"], function(x) floor(length(x)/2))+ c(0,cumsum(table(geoM2[,"clu"])[-5])) UPSrep
Previously we organized all UPS1 proteins according to their regression characteristics into 5 clusters and each cluster was ordered for descending scores. Now we can use the median position within each cluster as representative example for this cluster.
<- 1
gr <- which(datUPS1[,1,"cluNo"]==gr)
useLi <- c("Protein",paste(colnames(datUPS1), rep(c("slope","logp"), each=ncol(datUPS1)), sep=" "))
colNa try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
caption="Regression details for cluster of best UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)
Protein | PD slope | MQ slope | PL slope | PD logp | MQ logp | PL logp | |
---|---|---|---|---|---|---|---|
P00915 | Carbonic anhydrase 1 (Chain 2-261) | 1.07 | 1.4 | 1.02 | -23.6 | -16.9 | -23.9 |
P10636-8 | Microtubule-associated protein tau (Isofor | 1.08 | 1.42 | 0.937 | -18.2 | -17.9 | -22 |
P41159 | Leptin (Chain 22-167) | 1.22 | 1.11 | 0.904 | -21.5 | -19.1 | -17 |
P00167 | Cytochrome b5 (Chain 1-134, N-terminal His | 1.06 | 1.18 | 0.972 | -22.2 | -18.2 | -17.3 |
P06732 | Creatine kinase M-type (Chain 1-381) | 1.15 | 1.54 | 0.89 | -18.2 | -18.1 | -18.9 |
P55957 | BH3-interacting domain death agonist (Chai | 1.08 | 1.06 | 0.969 | -19.8 | -18.3 | -16.6 |
P00918 | Carbonic anhydrase 2 (Chain 2-260) | 1.17 | 1.29 | 0.925 | -18.1 | -16.7 | -20 |
P01375 | Tumor necrosis factor, soluble form (Chain | 1.15 | 0.944 | 1.17 | -22.4 | -15.1 | -17.5 |
P05413 | Fatty acid-binding protein, heart (Chain 2 | 1.07 | 1.2 | 0.864 | -18.2 | -16.9 | -19.1 |
P08758 | Annexin A5 (Chain 2-320) | 1.09 | 1.17 | 0.925 | -19.7 | -16.7 | -16.7 |
P01344 | Insulin-like growth factor II (Chain 25-91 | 0.996 | 0.889 | 0.903 | -17.6 | -18.2 | -16.5 |
P06396 | Gelsolin (Chain 28-782) | 1.12 | 1.41 | 0.399 | -21.7 | -19 | -13.2 |
P00709 | Alpha-lactalbumin (Chain 20-142) | 0.994 | 1.19 | 0.944 | -19.9 | -15.8 | -16.5 |
P02787 | Serotransferrin (Chain 20-698) | 1.1 | 1.46 | 0.813 | -22.7 | -17.6 | -12.5 |
P02144 | Myoglobin (Chain 2-154) | 1.18 | 1.22 | 0.575 | -17.9 | -19.1 | -13.9 |
## Plotting the best regressions, this required package wrGraph version 1.2.5 (or higher)
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4, ncol=2))
<- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
tit try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
<- 2
gr <- which(datUPS1[,1,"cluNo"]==gr)
useLi try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
caption="Regression details for cluster of 2nd best UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)
Protein | PD slope | MQ slope | PL slope | PD logp | MQ logp | PL logp | |
---|---|---|---|---|---|---|---|
P02768 | Serum albumin (Chain 26-609) | 1 | 1.46 | 0.808 | -16.5 | -15.4 | -19.7 |
P01579 | Interferon Gamma (Chain 23-166) | 1.07 | 1.05 | 0.831 | -18.1 | -13.7 | -19.2 |
P01008 | Antithrombin-III (Chain 33-464) | 0.959 | 1.24 | 0.864 | -17.2 | -16.6 | -16.5 |
P02788 | Lactotransferrin (Chain 20-710) | 1.3 | 1.47 | 0.688 | -16.2 | -17.9 | -16.1 |
O00762 | Ubiquitin-conjugating enzyme E2 C (Chain 1 | 1.08 | 1.19 | 0.917 | -18.4 | -12.9 | -18.7 |
P04040 | Catalase (Chain 2-527) | 0.997 | 1.35 | 0.75 | -13.5 | -17.5 | -18.2 |
P12081 | Histidyl-tRNA synthetase, cytoplasmic (Cha | 0.885 | 1.28 | 0.863 | -16.6 | -15.2 | -13.8 |
P16083 | Ribosyldihydronicotinamide dehydrogenase [ | 1.24 | 1.75 | 0.883 | -13.6 | -16.3 | -16.1 |
P62937 | Peptidyl-prolyl cis-trans isomerase A (Cha | 0.979 | 1.26 | 0.895 | -18.2 | -12.3 | -15.2 |
P01127 | Platelet-derived growth factor B chain (Ch | 1.1 | 1.48 | 1.02 | -16.4 | -11.3 | -16.9 |
P01133 | Pro-Epidermal growth factor (EGF) (Chain 9 | 0.982 | 1.06 | 0.89 | -17.2 | -13.6 | -13.1 |
Q06830 | Peroxiredoxin 1 (Chain 2-199) | 0.78 | 0.89 | 0.698 | -15 | -12.2 | -16.6 |
O76070 | Gamma-synuclein (Chain 1-127) | 1.06 | 1.04 | 0.698 | -16.4 | -15.2 | -12.1 |
P51965 | Ubiquitin-conjugating enzyme E2 E1 (Chain | 1 | 1.16 | 0.465 | -18.9 | -11.4 | -14 |
P61626 | Lysozyme C (Chain 19-148) | 1.17 | 0.658 | 1.16 | -17.2 | -11.6 | -13.7 |
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4, ncol=2))
<- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
tit try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
<- 3
gr <- which(datUPS1[,1,"cluNo"]==gr)
useLi try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
caption="Regression details for 3rd cluster UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)
Protein | PD slope | MQ slope | PL slope | PD logp | MQ logp | PL logp | |
---|---|---|---|---|---|---|---|
P68871 | Hemoglobin subunit beta (Chain 2-147) | 0.689 | 1.51 | 0.804 | -13.4 | -13.8 | -14.9 |
P01112 | GTPase HRas (Chain 1-189) | 0.805 | 0.703 | 0.835 | -12.8 | -11.4 | -15.6 |
P62988 | Ubiquitin (Chain 1-76, N-terminal His tag) | 0.869 | 1.06 | 0.0609 | -13.2 | -12.1 | -14.9 |
P99999 | Cytochrome c (Chain 2-105) | 0.908 | 1.32 | 0.438 | -12 | -12.9 | -12.2 |
P15559 | NAD(P)H dehydrogenase [quinone] 1 (Chain 2 | 0.0732 | 1.04 | 0.113 | -10.1 | -14.8 | -13 |
P08263 | Glutathione S-transferase A1 (Chain 2-222) | 0.403 | 1.12 | 0.21 | -10.4 | -16.4 | -10.6 |
P02753 | Retinol-binding protein 4 (Chain 19-201) | 0.329 | 0.548 | 0.849 | -10.3 | -9.58 | -15.9 |
P09211 | Glutathione S-transferase P (Chain 2-210) | 0.719 | 0.897 | 0.629 | -10.5 | -10.9 | -12.3 |
P63165 | Small ubiquitin-related modifier 1 (Chain | 1.3 | 0.844 | 0.704 | -9.04 | -11.1 | -13.5 |
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4, ncol=2))
<- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
tit try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
<- 4
gr <- which(datUPS1[,1,"cluNo"]==gr)
useLi try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
caption="Regression details for 3rd cluster UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)
Protein | PD slope | MQ slope | PL slope | PD logp | MQ logp | PL logp | |
---|---|---|---|---|---|---|---|
P61769 | Beta-2-microglobulin (Chain 21-119) | 1.26 | 0.625 | 0.664 | -17.5 | -8.74 | -12.2 |
Q15843 | NEDD8 (Chain 1-81) | 1.17 | 0.852 | 1 | -12.6 | -7.2 | -14.3 |
P10599 | Thioredoxin (Chain 2-105, N-terminal His | 0.94 | 0.333 | 0.959 | -15.5 | -5.14 | -15.3 |
P10145 | Interleukin-8, IL-8 (Chain 28-99) | 0.881 | 1.09 | 0.983 | -10.9 | -8.31 | -10.7 |
P02741 | C-reactive protein (Chain 19-224) | 0.265 | 1.03 | 0.872 | -12.2 | -7.2 | -11.2 |
P00441 | Superoxide dismutase [Cu-Zn] (Chain 2-154) | 0.758 | 1.18 | 0.505 | -11.6 | -7.5 | -10.3 |
P63279 | SUMO-conjugating enzyme UBC9 (Chain 1-158) | 0.853 | 1.08 | 0.549 | -12.7 | -8.57 | -7.9 |
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4, ncol=2))
<- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
tit try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
<- 5
gr <- which(datUPS1[,1,"cluNo"]==gr)
useLi try(kable(cbind(annUPS1[useLi,2], signif(datUPS1[useLi,,"slope"],3), signif(datUPS1[useLi,,"logp"],3)),
caption="Regression details for 5th cluster UPS1 proteins ", col.names=colNa, align="l"),silent=TRUE)
Protein | PD slope | MQ slope | PL slope | PD logp | MQ logp | PL logp | |
---|---|---|---|---|---|---|---|
P01031 | Complement C5 (C5a anaphylatoxin) (Chain 6 | 0.465 | 1.16 | 0.888 | -8.08 | -6.34 | -12.7 |
P69905 | Hemoglobin subunit alpha (Chain 2-142) | 0.507 | 0.306 | 1.01 | -9.24 | -1.41 | -8.73 |
if(packageVersion("wrGraph") >= "1.2.5"){
layout(matrix(1:4, ncol=2))
<- paste0(methNa,", ",annUPS1[UPSrep[gr],1])
tit try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPD, tit=tit[1], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataMQ, tit=tit[2], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE)
try(tm <- linModelSelect(annUPS1[UPSrep[gr],1], dat=dataPL, tit=tit[3], expect=grp9, startLev=1:5, cexXAxis=0.7, logExpect=TRUE, plotGraph=TRUE, silent=TRUE),silent=TRUE) }
The choice of the ‘best suited’ approach to quantify and compare proteomics data is not trivial at all. Particular attention has to be given to the choice of the numerous ‘small’ parameters which may have a very strong impact on the final outcome, as it has been experienced when preparing the data for this vignette or at other places (eg Chawade et al 2015). Thus, knowing and understanding well the software/tools one has chosen is of prime importance ! Of course, this also concerns for the protein identifcation part/software chosen.
The total number of proteins identified varies considerably between methods, this information may be very important to the user in real-world settings but is only taken in consideration in part in the comparisons presented.
ROC curves allow us to gain more insight in choosing cutoff values for statistical testing. Frequently the ideal threshold maximizing sensitivity and specificity lies quite distant to the common 5-percent threshold. This indicates that many times the common 5-percent threshold may not be the ‘optimal’ compromise as thershold for calling differential abundant proteins.
The author wants to acknowledge the support by the IGBMC (CNRS UMR 7104, Inserm U 1258, UdS), CNRS, Université de Strasbourg and Inserm and of course all collegues from the IGBMC proteomics platform. The author wishes to thank the CRAN -staff for all their help with new entries and their efforts in maintaining this repository of R-packages. Furthermore, many very fruitful discussions with colleages on national and international level have helped to formulate ideas and improve the tools presented here.
Thank you for you interest, this package is constantly evolving, new featues/functions may get added to the next version.
For completeness :
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=C LC_CTYPE=French_France.1252
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
#> [5] LC_TIME=French_France.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rmarkdown_2.11 knitr_1.37 wrGraph_1.3.0 wrProteo_1.6.0 wrMisc_1.8.1
#>
#> loaded via a namespace (and not attached):
#> [1] qvalue_2.24.0 tidyselect_1.1.1 xfun_0.29 bslib_0.3.1
#> [5] purrr_0.3.4 reshape2_1.4.4 splines_4.1.2 colorspace_2.0-2
#> [9] vctrs_0.3.8 generics_0.1.2 htmltools_0.5.2 yaml_2.2.2
#> [13] utf8_1.2.2 rlang_1.0.1 R.oo_1.24.0 sm_2.2-5.7
#> [17] jquerylib_0.1.4 pillar_1.7.0 R.utils_2.11.0 glue_1.5.1
#> [21] DBI_1.1.2 RColorBrewer_1.1-2 readxl_1.3.1 lifecycle_1.0.1
#> [25] plyr_1.8.6 stringr_1.4.0 cellranger_1.1.0 munsell_0.5.0
#> [29] gtable_0.3.0 R.methodsS3_1.8.1 evaluate_0.14 fastmap_1.1.0
#> [33] fdrtool_1.2.17 fansi_1.0.2 highr_0.9 Rcpp_1.0.8
#> [37] scales_1.1.1 limma_3.48.3 jsonlite_1.7.3 ggplot2_3.3.5
#> [41] digest_0.6.29 stringi_1.7.6 dplyr_1.0.8 grid_4.1.2
#> [45] cli_3.1.1 tools_4.1.2 magrittr_2.0.2 sass_0.4.0
#> [49] tibble_3.1.6 crayon_1.5.0 pkgconfig_2.0.3 ellipsis_0.3.2
#> [53] assertthat_0.2.1 R6_2.5.1 compiler_4.1.2