Example : Analyzing Label-free Quantitative Proteomics Data

Label-free Quantitative Proteomics Introduction

Multiple algorithms and software implementations have been developed for quantitation label-free proteomics experiments (LFQ), in particular for extracted ion chromatograms (XIC). For more background information you may look at Wikipedia labell-free Proteomics.

The tools presented here are designed for use with label-free XIC (ie LFQ) data. Several of the programs for extracting initial quantitations also allow getting spectral counting (PSM) data which can also get imported into R, however their use is not further discussed in this vignette. In general it is preferable to use XIC for comparing peptde of protein quantities between different protein extracts/samples.

This package provides support for importing quantitation results from Proteome Discoverer, MaxQuant and Proline. All quantitation import functions offer special features for further separating annotation related information, like species, for later use.

In most common real-world cases people typically analyze data using only one quantitation algorithm/software. Below in this vignette, we’ll use only the quantitation data generated using MaxQuant (ProteomeDiscoverer, Proline, OpenMS and MassChroQ are supported, too). The other vignette to this package (“UPS-1 spike-in Experiments”) shows in detail the import functions available for ProteomeDiscoverer and Proline and how further comparsions can be performed in bench-mark studies. All these import functions generate an equivalent output format, separating (selected) annotation data ($annot) from normalized log2-quantitation data ($quant) and initial quantitation ($raw).

Normalization (discussed below in more detail) is an important part of ‘preparing’ the data for subsequant analysis. The import functions in this package allow performin an initial normalization step (with choice among multiple algorithims), too. Further information about the proteins identifed can be considered during normalization: For example, it is possible to exclude contaminants like keratins which are frequently found among the higher abundant proteins which may potentially introduce bias at global normalization.

Technical replicates are very frequently produced in proteomics, they allow to assess the variability linked to repeated injection of the same material. Biological replicates, however, make additional information accessible, allowing the interpretation of experiments in a more general way.

Meta-Data (SDRF Format)

Before importing quantification results, let’s try to gather information about the experimental setup :

The project Proteomics Sample Metadata Format aims to provide a framework of providing a uniform format for documenting experimental meta-data (sdrf-format). The meta-data for experiments already integrated can be directly read/accessed from wrProteo.

Either you download the meta-data as file ‘sdrf.tsv’ from Pride/PXD001819, or you may read it directly from github/bigbio. If your dataset has been added to github, you don’t necessarily need to copy/paste the full path, you only need the main PRIDE accession number, readSdrfAnnot can compose the full url for you.

## Read meta-data from  github.com/bigbio/proteomics-metadata-standard/
pxd001819meta <- readSdrf("PXD001819")

str(pxd001819meta)
#> 'data.frame':    27 obs. of  24 variables:
#>  $ source.name                          : chr  "Sample 1" "Sample 1" "Sample 1" "Sample 2" ...
#>  $ characteristics.organism.            : chr  "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" "Saccharomyces cerevisiae" ...
#>  $ characteristics.organism.part.       : chr  "not available" "not available" "not available" "not available" ...
#>  $ characteristics.disease.             : chr  "not available" "not available" "not available" "not available" ...
#>  $ characteristics.cell.type.           : chr  "not applicable" "not applicable" "not applicable" "not applicable" ...
#>  $ characteristics.mass.                : chr  "2 mg" "2 mg" "2 mg" "2 mg" ...
#>  $ characteristics.spiked.compound.     : chr  "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...
#>  $ characteristics.biological.replicate.: int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ material.type                        : chr  "lysate" "lysate" "lysate" "lysate" ...
#>  $ assay.name                           : chr  "run 1" "run 2" "run 3" "run 4" ...
#>  $ technology.type                      : chr  "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" "proteomic profiling by mass spectrometry" ...
#>  $ comment.label.                       : chr  "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" "AC=MS:1002038;NT=label free sample" ...
#>  $ comment.instrument.                  : chr  "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" "AC=MS:1001742;NT=LTQ Orbitrap Velos" ...
#>  $ comment.precursor.mass.tolerance.    : chr  "5 ppm" "5 ppm" "5 ppm" "5 ppm" ...
#>  $ comment.fragment.mass.tolerance.     : chr  "0.8 Da" "0.8 Da" "0.8 Da" "0.8 Da" ...
#>  $ comment.cleavage.agent.details.      : chr  "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" "NT=trypsin/P;AC=MS:1001313" ...
#>  $ comment.modification.parameters.     : chr  "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" "NT=Carbamidomethyl;TA=C;MT=fixed;AC=UNIMOD:4" ...
#>  $ comment.modification.parameters..1   : chr  "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" "NT=Oxidation;MT=variable;TA=M;AC=UNIMOD:35" ...
#>  $ comment.modification.parameters..2   : chr  "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" "NT=Acetyl;AC=UNIMOD:67;PP=Protein N-term;MT=variable" ...
#>  $ comment.technical.replicate.         : int  1 2 3 1 2 3 1 2 3 1 ...
#>  $ comment.fraction.identifier.         : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ comment.file.uri.                    : chr  "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_12500amol_R1.raw" "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_12500amol_R2.raw" "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_12500amol_R3.raw" "ftp://ftp.pride.ebi.ac.uk/pride/data/archive/2015/12/PXD001819/UPS1_125amol_R1.raw" ...
#>  $ comment.data.file.                   : chr  "UPS1_12500amol_R1.raw" "UPS1_12500amol_R2.raw" "UPS1_12500amol_R3.raw" "UPS1_125amol_R1.raw" ...
#>  $ factor.value.spiked.compound.        : chr  "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=12500 amol;CN=UPS1;CV=Standards Research Group" "CT=mixture;QY=125 amol;CN=UPS1;CV=Standards Research Group" ...

The second vignette (entitled “UPS1 spike-in Experiments”) to this package shows more details how this information can be used to in further analysis.

Import From Dedicated Quantification Algorithms/Software

Data From MaxQuant

MaxQuant is free software provided by the Max-Planck-Institute, see Tyanova et al 2016. Typically MaxQuant exports by default quantitation data on level of consensus-proteins as a folder called txt with a file always called ‘proteinGroups.txt’. Data exported from MaxQuant can get imported (and normalized) using readMaxQuantFile(), in a standard case one needs only to provide the path to the file ‘proteinGroups.txt’ which can be found the combined/txt/ folder produced by MaxQuant. Gz-compressed files can be read, too (as in the example below the file ‘proteinGroups.txt.gz’).

path1 <- system.file("extdata", package="wrProteo")
dataMQ <- readMaxQuantFile(path1, specPref=NULL, normalizeMeth="median")
#>  -> readMaxQuantFile : Suspicious filename, this function was designed for reading tabulated text files produced by MaxQuant
#>  -> readMaxQuantFile : Note: Found 11 proteins marked as 'REV_' (reverse peptide identification) - Removing
#>  -> readMaxQuantFile : Display 2845(9.5%) initial '0' values as 'NA'
#>  -> readMaxQuantFile : Could not find peptide counts columns (argument 'pepCountCol') matching to 'Unique peptides MS\.MS\.count '
#>  -> readMaxQuantFile : Note: 6 proteins with unknown species
#>     by species : Homo sapiens: 49,  Mus musculus: 1,  Saccharomyces cerevisiae: 1047,  Sus scrofa: 1,
#>  -> readMaxQuantFile : Found 75 composite accession-numbers, truncating (eg P00761;CON__P00761)
#>  -> readMaxQuantFile : Use column 'Accession' as identifyer (has fewest, ie 0 duplicated entries) as rownames
#> -> readMaxQuantFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'
#> -> readMaxQuantFile -> vioplotW -> asSepList : argument 'fxArg' is depreciated, please use argument 'exclElem' (still using since 'exclElem=NULL'

The data were imported and median-normalized, the protein annotation was parsed, IDs, protein-names and species information automatically extracted.

## the number of lines and colums
dim(dataMQ$quant)
#> [1] 1104   27
## a summary of the quantitation data
summary(dataMQ$quant[,1:8])        # the first 8 cols
#>   12500amol_1     12500amol_2     12500amol_3      125amol_1    
#>  Min.   :17.50   Min.   :15.63   Min.   :14.89   Min.   :15.21  
#>  1st Qu.:22.49   1st Qu.:22.47   1st Qu.:22.48   1st Qu.:22.40  
#>  Median :23.45   Median :23.45   Median :23.45   Median :23.45  
#>  Mean   :23.67   Mean   :23.64   Mean   :23.65   Mean   :23.62  
#>  3rd Qu.:24.80   3rd Qu.:24.74   3rd Qu.:24.76   3rd Qu.:24.84  
#>  Max.   :30.29   Max.   :30.26   Max.   :30.33   Max.   :30.27  
#>  NA's   :98      NA's   :100     NA's   :104     NA's   :114    
#>    125amol_2       125amol_3      25000amol_1     25000amol_2   
#>  Min.   :14.88   Min.   :14.92   Min.   :15.73   Min.   :18.35  
#>  1st Qu.:22.39   1st Qu.:22.41   1st Qu.:22.44   1st Qu.:22.46  
#>  Median :23.45   Median :23.45   Median :23.45   Median :23.45  
#>  Mean   :23.62   Mean   :23.63   Mean   :23.65   Mean   :23.65  
#>  3rd Qu.:24.84   3rd Qu.:24.80   3rd Qu.:24.86   3rd Qu.:24.80  
#>  Max.   :30.28   Max.   :30.31   Max.   :30.20   Max.   :30.02  
#>  NA's   :106     NA's   :114     NA's   :109     NA's   :115

To treat the data from this experiment we also need to declare who is replicate of whom. In this case we’ll use the UPS1 concentrations used when making the serial dilution as names for the groups of replicates.

UPSconc <- c(50,125,250,500,2500,5000,12500,25000,50000)  
sampNa <- paste0(rep(UPSconc, each=3),"amol_",rep(1:3,length(UPSconc))) 
grp9 <- paste0(rep(UPSconc,each=3),"amol") 
head(grp9)
#> [1] "50amol"  "50amol"  "50amol"  "125amol" "125amol" "125amol"

Data From ProteomeDiscoverer

Proteome Discoverer is commercial software from ThermoFisher (www.thermofisher.com). Data exported from Proteome Discoverer can get imported (typically the xx_Proteins.txt file) using readProtDiscovFile(), for details please see the vignette “UPS-1 spike-in Experiments” to this package.

Data From Proline

Proline is free software provided by the Profi-consortium, see Bouyssié et al 2020. Data exported from Proline (xlsx or tsv format) can get imported using readProlineFile(), for details please see the vignette “UPS-1 spike-in Experiments” to this package.

Data From OpenMS

OpenMS is free open software provided by the deNBI Center for integrative Bioinformatics, see Rost et al 2016. Data exported as csv get summarized from peptide to protein level and further normalized using readOpenMSFile(), for details please see the help-page to this function.

Data From MassChroQ

MassChroQ is free open software provided by the PAPPSO, see Valot et al 2011. Inital quantifications are on peptide basis and should be normalized and summarized using the R-package MassChroqR, which is also distributed by the PAPPSO. Quantifications at protein-level can be saved as matrix into an RData-file or written to tsv,csv or txt files for following import into the framework of this package using readMassChroQFile(), for details please see the help-page to this function.

Normalization

As mentioned, the aim of normalization is to remove bias in data not linked to the original (biological) question. The import functions presented above do already by default run global median normalization. When choosing a normalization procedure one should reflect what additional information may be available to guide normalization. For example, it may be very useful to exclude classical protein contaminants since they typically do not reflect the original biolocial material. In overall, it is important to inspect results from normalization, graphical display of histograms, boxplots or violin-plots to compare distributions. Multiple options exist for normalizing data, please look at the documentation provided with the import-functions introduced above. Plese note, that enrichment experiments (like IP) can quickly pose major problems to the choice of normalization approaches. The function normalizeThis() from the package wrMisc can be used to run additional normalization, if needed.

Different normalization procedures intervene with different ‘aggressiveness’, ie also with different capacity to change the initial data. In general, it is suggested to start normalizing using ‘milder’ procedures, like global median and switch to more intervening methods if initial results seem not satisfactory. Beware, heavy normalization procedures may also alter the main information you want to analyze. Ie, some biologically true positive changes may start to fade or disappear when inappropriate normalization gets performed. Please note, that normalization should be performed before NA-imputation to avoid introducing new bias in the group of imputed values.

Imputation of NA-values

In proteomics the quantitation of very low abundances is very challenging. Proteins which are absent or very low abundances typically appear in the results as 0 or NA. Frequantly this may be linked to the fact that no peak is detected in a MS-1 chromatogram (for a given LC elution-time) while other samples had a strong peak at the respective place which led to successful MS-2 identification. Please note, that the match-between-runs option in the various softwar options allows to considerably reduce the number of NAs. To simplify the treatment all 0 values are transformed to NA, anyway they would not allow log2 transformation either.

Before replacing NA-values it is important to verify that such values may be associated to absent or very low abundances. To do so, we suggest to inspect groups of replicate-measurements using matrixNAinspect(). In particular, with multiple technical replicates of the same sample it is supposed that any variability observed is not linked to the sample itself. So for each NA that occurs in the data we suggest to look what was reported for the same protein with the other (technical) replicates. This brings us to the term of ‘NA-neighbours’ (quantifications for the same protein in replicates). When drawing histograms of NA-neighbours one can visually inspect and verify that NA-neighbours are typically low abundance values, however, but not necessarily the lowest values observed in the entire data-set.

## Let's inspect NA values as graphic
matrixNAinspect(dataMQ$quant, gr=grp9, tit="Histogram of Protein Abundances and NA-Neighbours")

So only if the hypothesis of NA-neighbours as typically low abundance values gets confirmed by visual inspection of the histograms, one may safely proceed to replacing them by low random values.

If one uses a unique (very) low value for NA-replacements, this will quickly pose a problem at the level of t-tests to look for proteins changing abundance between two or more groups of samples. Therefore it is common practice to draw random values from a Normal distribution representing this lower end of abundance values. Nevertheless, the choice of the parameters of this Normal distribution is very delicate.

This package proposes several related strategies/options for NA-imputation. First, the classical imputation of NA-values using Normal distributed random data is presented. The mean value for the Normal data can be taken from the median or mode of the NA-neighbour values, since (in case of technical replicetes) NA-neighbours tell us what these values might have been and thus we model a distribution around. Later in this vignette, a more elaborate version based on repeated implementations to obtain more robust results will be presented.

The function matrixNAneighbourImpute() proposed in this package offers automatic selection of these parameters, which have been tested in a number of different projects. However, this choice should be checked by critically inspecting the histograms of ‘NA-neighbours’ (ie successful quantitation in other replicate samples of the same protein) and the final resulting distribution. Initially all NA-neighbours are extracted. It is also worth mentioning that in the majority of data-sets encountered, such NA-neighbours form skewed distributions.

The successful quantitation of instances with more than one NA-values per group may be considered even more representative, but of course less sucessfully quntified values remain. Thus a primary choice is made: If the selection of (min) 2 NA-values per group has more than 300 values, this distribution will be used as base to model the distribution for drawing random values. In this case, by default the 0.18 quantile of the 2 NA-neighbour distribution will be used as mean for the new Normal distribution used for NA-replacements. If the number of 2 NA-neighbours is >= 300, (by default) the 0.1 quantile all NA-neighbour values will used. Of course, the user has also the possibility to use custom choices for these parameters.

The final replacement is done on all NA values. This also includes proteins with are all NA in a given condition as well a instances of mixed successful quantitation and NA values.

## MaxQuant simple NA-imputation (single round)
dataMQimp <- matrixNAneighbourImpute(dataMQ$quant, gr=grp9, tit="Histogram of Imputed and Final Data") 
#> -> matrixNAneighbourImpute -> stableMode : Method='density',  length of x =1142, 'bandw' has been set to 47
#>  -> matrixNAneighbourImpute :  n.woNA= 26963 , n.NA = 2845
#>     impute based on 'mode2' using mean= 21.42 and sd= 0.5
#>

However, imputing using normal distributed random data also brings the risk of occasional extreme values. In the extreme case it may happen that a given protein is all NA in one group, and by chance the random values turn out be rather high. Then, the final group mean of imputed values may be even higher than the mean of another group with successful quantitations. Of course in this case it would be a bad interpretation to consider the protein in question upregulated in a sample where all values for this protein were NA. To circumvent this problem there are 2 options : 1) one may use special filtering schemes to exclude such constellations from final results or 2) one could repeat replacement of NA-values numerous times.

Filtering can be performed using `presenceFilt() (package wrMisc).

The function testRobustToNAimputation() allows such repeated replacement of NA-values. For details, see also the following section.

Filtering

The main aim of filtering in omic-data analysis is to remove proteins/genes/lines which are for obvious reasons not suitable for further analysis. Invalid or low quality measures are not suitable for good results and may thus be removed. Frequently additional information is used to justy the procedure of removing certain proteins/genes/lines.

One very common element in filtering is the observation that very low abundance measures are typically less precise than medium or high abundance values. Thus, a protein/gene with all abundance measures at the very low end of the global scale may as well just seem to change abundance due to the elevated variance of low abundance measures. However, most statitical tests are not very well prepared for elevated variance of low abundance measures. In consequence, it is common to remove or disqualify such proteins/genes/lines which are at high risk of yielding false positive results.

In the context of proteomics the number of samples with NAs (non-quantified peptides/proteins) for a given protein/peptide represents also an interesting starting point. If almost all values finally compared are a result of (random) imputation, any apparent change in abundanc of such proteins/peptides lay rather reflect rare stochastic events of NA-imputation. Please note, that rather aggressive filtering may severly reduce the ability to identify on/off situations which very well may occur in most biological settings.

Initial information for filtering is already collected by the import-functions (readMaxQuantFile(), readProtDiscovFile(), readProlineFile(), readOpenMSFile() etc..). Then information for filtering can be used by the function combineMultFilterNAimput() which is integrated to testRobustToNAimputation() (see section below) to conveniently include filtering-aspects.

Statistical Testing

The t-test remains the main statistical test used, as in many other coases of omics, too. Statistical testing in the context of proteomics data poses challenges similar to transcriptomics : Many times the number of replicate-samples is fairly low and the inter-measurement variability quite high. In some unfortunate cases proteins with rather constant quantities may appear as false positives when searching for proteins who’s abundance changes between two groups of samples : If the apparent variability is by chance too low, the respective standard-deviations will be low and a plain t-test may give very enthusiastic p-values.
Besides stringent filtering (previous section of this vignette), the use of shrinkage when estimating the intra-group/replicate variance from the Bioconductor package limma turns out very helpful, see also Ritchie et al 2015. In this package the function eBayes() has been used and adopted to proteomics.

The function testRobustToNAimputation() allows running multiple cycles of NA-imputation and statistical testing with the aim of providing stable imputation and testing results. It performs NA-imputation and statistical testing (after repeated imputation) between all groups of samples the same time (as it would be inefficient to separate these two tasks). The tests underneath apply shrinkage from the empirical Bayes procedure from the bioconductor package limma. In addition, various formats of multiple test correction can be directly added to the results : Benjamini-Hochberg FDR, local false discovery rate (lfdr, using the package fdrtool, see Strimmer 2008 doi: 10.1093/bioinformatics/btn209), or modified testing by ROTS, etc …

The fact that a single round of NA-imputation may provoke false positives as well as false negatives, made it necessary to combine this (iterative) process of NA-imputation and subsequent testing in one single function.

## Impute NA-values repeatedly and run statistical testing after each round of imputations
testMQ <- testRobustToNAimputation(dataMQ, gr=grp9) 
#> -> testRobustToNAimputation -> matrixNAneighbourImpute -> stableMode : Method='density',  length of x =1142, 'bandw' has been set to 47
#> -> testRobustToNAimputation -> matrixNAneighbourImpute :  n.woNA= 26963 , n.NA = 2845
#>     impute based on 'mode2' using mean= 21.42 and sd= 0.5
#> 
#> -> testRobustToNAimputation -> combineMultFilterNAimput :    at presenceFilt:  1050 1049 1047 1030 1051 1037 1050 1022 1040 1046 1034 1050 1039 1027 1047 1061 1022 1060 1061 1032 1066 1056 1017 1045 1048 1019 1051 1028 1040 1051 1032 1027 1043 1027 1067 1051  out of 1104
#> -> testRobustToNAimputation -> combineMultFilterNAimput :    at abundanceFilt: 1045 1042 1042 1022 1040 1031 1044 1009 1033 1036 1031 1045 1035 1024 1044 1055 1016 1053 1053 1025 1061 1049 1002 1038 1032 1000 1048 1019 1037 1040 1029 1017 1034 1024 1063 1035
#> -> testRobustToNAimputation -> combineMultFilterNAimput :    at NA> mean:   949, 978, 938, 973, 955, 975, 952, 966, 942, 960, 974, 938, 964, 961, 938, 976, 944, 968, 979, 947, 962, 954, 956, 945, 960, 966, 941, 945, 944, 967, 934, 943, 960, 927, 943 and 955

## the data after repeated NA-imputation
head(testMQ$datImp[,1:8])
#>        12500amol_1 12500amol_2 12500amol_3 125amol_1 125amol_2 125amol_3
#> A5Z2X5    23.72700    23.71202    23.65306  23.76069  24.03258  23.70469
#> P00761    28.83921    28.78712    28.86386  28.70094  28.70978  28.76617
#> P01966    21.15232    21.01268    20.95126  21.35190  21.43183  21.41475
#> P02768    24.65579    24.20806    24.56327  15.69777  16.19941  14.92077
#> P04264    21.43762    21.36715    21.42670  21.39359  21.44080  21.25315
#> P13645    21.48840    21.26437    21.42793  21.48669  21.37913  21.30669
#>        25000amol_1 25000amol_2
#> A5Z2X5    23.78891    23.73787
#> P00761    28.75811    28.67850
#> P01966    21.65894    22.21233
#> P02768    25.66044    25.59727
#> P04264    21.41681    21.41142
#> P13645    21.46798    21.42185

PCA

Brielfy, principal components analysis (PCA) searches to decompose the data along all the axises defined by all samples. Then, the axis-combinations with the highest degree of correlation are searched. In principle one could also run PCA along the rows, ie the proteins, but their number is typically so high that the resultant plots get too crowded.

In the context of high throughput experiments, like proteomics, PCA allows to distinguish important information how the different samples are related (ie similar). This covers of course the real differences between different biological conditions, but also additional bias introduced as (technical) artifacts. Thus, such plots serve as well for quality control (in particular to identify outlyer-samples, eg due to degraded material) as well as for the biological interpretation.

Normally one could immediately check the normalized data by PCA before running statistical tests. As stated in other places, PCA can’t handle missing values (ie _NA_s). Thus, all proteins having one NA in just one sample won’t be considered during PCA. This would mask a significant number of proteins in numerous of proteomics experiments. Thus, it may be preferable to run PCA after NA-imputation. However, since in this package statistical testing was coupled to the repeated NA-imputation, it may be better to use the NA-imputations made for the statistical testing (in the section above).

Here we’ll use the function plotPCAw() form the package wrGraph

# limit to UPS1 
plotPCAw(testMQ$datImp, sampleGrp=grp9, tit="PCA on Protein Abundances (MaxQuant,NAs imputed)", rowTyName="proteins", useSymb2=0)

Please note, the vignette dedicated to spike-in experiments (“UPS-1 spike-in Experiments”) presents a slightly different way of making PCA-plots for this specific type of experiment/data-set.

MA-plot

MA-plots are mainly used for diagnostic purposes. Basically an MA-plot displays the log-Fold-Change versus the average abundance. We’ll use the function MAplotW() from the package wrGraph.

# By default this plots at the first of all pairwise questions
MAplotW(testMQ)
#>  -> MAplotW : Successfully extracted  1104 Mvalues and  1104 Avalues plus annotation
#>  -> MAplotW :  filtered (based on 'filtFin') from 1104 to  949 lines

Now for the second group of pair-wise comparisons, plus adding names of proteins passing threshold:

MAplotW(testMQ, useComp=2, namesNBest="passFC") 
#>  -> MAplotW : Successfully extracted  1104 Mvalues and  1104 Avalues plus annotation
#>  -> MAplotW :  filtered (based on 'filtFin') from 1104 to  978 lines

Volcano-Plot

A Volcano-plot allows to compare the simple fold-change (FC) opposed to the outcome of a statistcal test. Frequently we can obsereve, that a some proteins show very small FC but enthousiastic p-values and subsequently enthousiastic FDR-values. However, generally such proteins with so small FC don’t get considered as reliable results, therefore it is common practice to add an additional FC-threshold, typically a 1.5 or 2 fold-change.

The number of proteins retained by pair-wise comparison :

## by default the first pairwise comparison is taken
## using the argument 'namesNBest' we can add names from the annotation
VolcanoPlotW(testMQ, useComp=2, namesNBest="passFDR")
#>  -> VolcanoPlotW : Using element 'BH' as FDR-values for plot
#>  -> VolcanoPlotW : Successfully extracted  1104 Mvalues and  1104 pValues plus anotation
#>  -> VolcanoPlotW :  filtered (based on 'filtFin') from 1104 to  978 lines

Additional Note : Vlcano-plots may also help identifying bias in the data, in particular, to the question if normalization gave satisfactory results. Based on the hypothesis of no global change used for normalization, normally, one would expect about the same number of down-regulated as up-regulated proteins.

In fact, this experiment is somehow unusual since one set of samples got a strong increase in abundance for 48 UPS1 proteins while the other proteins remained constant. Thus, on the global scale there may be a (small) imbalance of abundances and the global median will reflect this, which can create some bias. So, in this special case it might be better to perform normalization only based on the yeast proteins (which are assumed as constant), as it has been performed in the vignette ‘UPS-1 spike-in Experiments’, a vignette which is entirely dedicated to UPS1 spike-in experiments.

Reporting Results

Tables with results can be either directed created using VolcanoPlotW() or, as shown below, using the function extractTestingResults().

For example, let’s look at the first of the pair-wise comparisons: The moderated t-test expressed as Benjamini-Hochberg FDR gave 99 proteins with FDR < 0.05 for the comparison 12500amol-125amol. Since unfortunately many verly low fold-change instances are amongst the results, one should add an additional filter for too low FC values. This is common practice in most omics analysis when mostly technical replicates are run and/or the number of replicates is rather low.

res1 <- extractTestingResults(testMQ, compNo=1, thrsh=0.05, FCthrs=2)

After FC-filtering for 2-fold (ie change of protein abundance to double or half) 10 proteins remain.

kable(res1[,-1], caption="5%-FDR (BH) Significant results for 1st pairwise set", align="c")

5%-FDR (BH) Significant results for 1st pairwise set
	EntryName	GeneName	FDR	logFC.12500amol-125amol	av.50amol	av.125amol	av.250amol	av.500amol	av.2500amol	av.5000amol	av.12500amol	av.25000amol	av.50000amol
P02768	ALBU_HUMAN_UPS	ALB	0.0000006	-7.23269	24.4757	15.6060	25.6408	20.9337	15.7827	26.7534	22.8387	15.8079	15.4552
P00441	SODC_HUMAN_UPS	SOD1	0.0034339	-1.82887	23.2691	20.7394	24.0250	21.0722	21.0158	26.1026	22.5683	20.8232	20.8702
P02144	MYG_HUMAN_UPS	MB	0.0021343	-2.92105	23.6045	19.3249	24.6781	20.7939	21.0690	26.1203	22.2459	22.0370	21.6413
P06732	KCRM_HUMAN_UPS	CKM	0.0347885	-4.45680	23.2128	17.1243	24.8089	20.3459	22.0490	25.9574	21.5811	19.8151	21.5380
P12081	SYHC_HUMAN_UPS	HARS	0.0000363	-1.65047	23.2348	20.1249	24.5195	20.5875	19.9998	25.9750	21.7754	20.2658	21.2098
P16083	NQO2_HUMAN_UPS	NQO2	0.0117988	-4.94696	24.2852	16.6461	25.5389	20.3294	18.1795	27.1439	21.5930	19.9387	16.8189
P40518	ARPC5_YEAST	ARC15	0.0001276	1.38588	22.4895	23.8360	22.3499	22.3387	23.4078	22.4893	22.4502	23.4559	23.3897
P53978	EF3B_YEAST	HEF3	0.0456776	1.20205	24.2554	24.3327	24.1287	24.0771	24.2989	24.0777	23.1307	24.1583	24.3194
Q03677	MTND_YEAST	ADI1	0.0078128	1.48044	21.1238	22.6964	21.0997	21.5038	22.5139	20.6300	21.2160	21.9042	22.6562
Q06830	PRDX1_HUMAN_UPS	PRDX1	0.0001784	-2.60795	23.9618	19.3147	25.3420	21.3417	19.1224	26.4670	21.9227	19.2315	19.4075

Please note that the column-name ‘BH’ referrs to Benjamini-Hochberg FDR (several other options of multiple testing correction exist, too). We can see that many UPS1 proteins are, as expected, among the best-ranking differential changes. However, not all UPS1 proteins do show up in the results as expected, and furthermore, a number of yeast proteins (however expected to remain constant !) were reported as differential, too.

The function extractTestingResults() also allows to write the data shown above directly to a csv-file.

Further Steps

In case of standard projects one typically would like to find out more about the biological context of the proteins retained at statistical analysis, their function and their interactors. Such a list of significant proteins from a given project could be tested lateron for enrichment of GO-functions or for their inter-connectivity in PPI networks like String. There are multiple tools available on Bioconductor and CRAN as well as outside of R to perform such analysis tasks.

In case of UPS1 spike-in experiments the subsequent analysis is different. Suggestions for in depth-analysis of UPS1 spike-in are shown in the vignette ‘UPS-1 spike-in Experiments’ of this package.

Protein Annotation

In most ‘Omics’ activities getting additional annotation may get sometimes a bit tricky. In Proteomics most mass-spectrometry software will use the informaton provided in the Fasta-file as annotation (typically as provided from UniProt). But this lacks for example chromosomal location information. There are are many repositories with genome-, gene- and protein-annotation and most of them are linked, but sometimes the links get broken when data-base updates are not done everywhere or are not followed by new re-matching. The Fasta-files used initially for mass-spectrometry peak-identification may be (slightly) not up to date (sometimes gene- or protein-IDs do change or may even disappear) and thus will contribute to a certain percentage of entries hard to link.

Globally two families of strategies for adding annotation exist :

Strategies using online-based ressources for getting the most up-to-date information/annotation (like biomaRt). Despite the advantage of most up-to-date information there may be some downsides : Interogating databases may require more time to run all queries via interet-connections and this strategy is vulnerable to broken links (eg linked to the delay of updates between different types of databases that may need to get combined). Furthermore, the results typically may change a bit when the queries get repeated. When combining multiple interconnected ressources it may be very tricky to document the precise version of all individual ressources used.
Strategies based on using (local copies of) defined versions of databases. Frequently, these databases can get downloaded/installed locally and thus allow faster queries and guarantee of repeatability and comparability to other tools or studies. A number of databases are available on Bioconductor formatted for R. Besides, the tables UCSC are another option. Note, that tracking version-numbers may be much easier using this approach.

In the context of adding chromosomal annotation to a list of proteins here the following concept is developed : Annotation-tables from UCSC are available for a decent number of species and can be downloaded for conventient off-line search. Howwever, in the context of less common species we realized that the UniProt tables from UCSC had many times low yield in final matching. For this reason we propose the slightly more complicated route that provided finally a much higher success-rate to find chromosomal locations for a list of UniProt IDs. First one needs to acces/download from UCSC the table corresponding to the species of question fields ‘clade’,‘genome’,‘assembly’). For ‘group’ choose ‘Genes and Gene Predictions’ and for ‘track’ choose ‘Ensembl Genes’, as table choose ‘ensGene’. In addition, it is possible to select either the entire genome-annotation or user-specified regions. In terms of ‘output-format’ one may choose ‘GTF’ (slightly more condensed, no headers) or ‘all filds from selected table’.

The following strategy for adding genomic location data using this package is presented here :

Locate (& download) organism annotation from UCSC, read into R (readUCSCtable() ) -> from R export (non-redundant) ‘enst’-IDs (still using readUCSCtable() ), get corresponding UniProt-IDs at UniProt site, save as file and import result into R (readUniProtExport() ) -> (in R) combine with initial UCSC table (readUniProtExport() ) .

The function readUCSCtable() is able to read such files downloaded from UCSC, compressed .gz files can be read, too (like in the example below). In the example below we’ll just look at chromosome 11 of the human genome - to keep this example small.

path1 <- system.file("extdata", package="wrProteo")
gtfFi <- file.path(path1, "UCSC_hg38_chr11extr.gtf.gz")
UcscAnnot1 <- readUCSCtable(gtfFi)
#>  -> readUCSCtable :  File 'UCSC_hg38_chr11extr.gtf.gz' is gtf : TRUE
#>  -> readUCSCtable :  simplifed from 2000 to 223 non-redundant gene_id

# The Ensemble transcript identifyers and their chromosomal locations :
head(UcscAnnot1)
#>      gene_id              chr     start    end      strand frame
#> [1,] "ENST00000270115.8"  "chr11" "537527" "554912" "+"    "."  
#> [2,] "ENST00000308020.6"  "chr11" "450279" "491399" "+"    "."  
#> [3,] "ENST00000311189.8"  "chr11" "532242" "535576" "-"    "."  
#> [4,] "ENST00000312165.5"  "chr11" "278570" "285304" "+"    "."  
#> [5,] "ENST00000325113.8"  "chr11" "196738" "200258" "+"    "."  
#> [6,] "ENST00000325147.13" "chr11" "202924" "207382" "-"    "."

However, this annotation does not provide protein IDs. In order to obtain the corresponding protein IDs an additional step is required : Here we will use the batch search/conversion tool from UniProt. In order to do so, we can export directly from readUCSCtable() a small text-file which can be fed into the UniProt batch-search tool.

# Here we'll redo reading the UCSC table, plus immediatley write the file for UniProt conversion 
#  (in this vignette we write to tempdir() to keep things tidy)
expFi <- file.path(tempdir(),"deUcscForUniProt2.txt")
UcscAnnot1 <- readUCSCtable(gtfFi, exportFileNa=expFi)
#>  -> readUCSCtable :  File 'UCSC_hg38_chr11extr.gtf.gz' is gtf : TRUE
#>  -> readUCSCtable :  simplifed from 2000 to 223 non-redundant gene_id
#>  -> readUCSCtable :  Write to file : ENST00000270115, ENST00000308020, ENST00000311189, ENST00000312165 ...
#>  -> readUCSCtable :  Exporting file  'C:/Users/wraff/AppData/Local/Temp/RtmpEvsYl2/deUcscForUniProt2.txt'  for conversion on https://www.uniprot.org/uploadlists

Now everything is ready to go to UniProt for retrieving the corresponding UniProt-IDs. Since we exported Ensemble transcript IDs (ENSTxxx), select converting from ‘Ensembl Transcript’ to ‘UniProtKB’. Then, when downloading the conversion results, choose tab-separated file format (compression is recommended), this may take several seconds (depending on the size).

It is suggested to rename the downloaded file so one can easily understand its content. Note, that the function readUniProtExport() can also read .gz compressed files. To continue this vignette we’ll use a result which has been downloaded from UniProt and renamed to ‘deUniProt_hg38chr11extr.tab’. One may also optionally define a specific genomic region of interest using the argument ‘targRegion’, here the entire chromosome 11 was chosen.

deUniProtFi <- file.path(path1, "deUniProt_hg38chr11extr.tab")
deUniPr1 <- readUniProtExport(UniP=deUniProtFi, deUcsc=UcscAnnot1, targRegion="chr11:1-135,086,622")
#>  -> readUniProtExport :  low yield matching 'enst00000410108', 'enst00000342878' and 'enst00000325113,enst00000525282' and 'ENST00000270115', 'ENST00000308020' and 'ENST00000311189' convert all to lower case and remove version numbers ('xxx.2') for better matching
#>  -> readUniProtExport :  intergrating genomic information for 88 entries (14 not found)
#>  -> readUniProtExport :  88 IDs in output
str(deUniPr1)
#> 'data.frame':    88 obs. of  14 variables:
#>  $ EnsTraID     : chr  "enst00000410108" "enst00000342878" "enst00000325113,enst00000525282" "enst00000342593" ...
#>  $ UniProtID    : chr  "B8ZZS0" "Q8TD33" "Q96PU9" "F8W6Z3" ...
#>  $ Entry.name   : chr  "B8ZZS0_HUMAN" "SG1C1_HUMAN" "ODF3A_HUMAN" "F8W6Z3_HUMAN" ...
#>  $ Status       : chr  "unreviewed" "reviewed" "reviewed" "unreviewed" ...
#>  $ Protein.names: chr  "BET1-like protein" "Secretoglobin family 1C member 1 (Secretoglobin RYD5)" "Outer dense fiber protein 3 (Outer dense fiber of sperm tails protein 3) (Sperm tail protein SHIPPO 1) (Transcr"| __truncated__ "Outer dense fiber protein 3" ...
#>  $ Gene.names   : chr  "BET1L" "SCGB1C1" "ODF3 SHIPPO1 TISP50" "ODF3" ...
#>  $ Length       : int  152 95 254 127 102 111 92 58 88 148 ...
#>  $ chr          : chr  "chr11" "chr11" NA "chr11" ...
#>  $ start        : num  167784 193078 NA 197303 202924 ...
#>  $ end          : num  207383 194575 NA 200033 207382 ...
#>  $ strand       : chr  "-" "+" NA "+" ...
#>  $ frame        : chr  "." "." NA "." ...
#>  $ avPos        : num  187584 193826 NA 198668 205153 ...
#>  $ inTarg       : logi  FALSE FALSE NA FALSE FALSE FALSE ...

The resulting data.frame (ie the column ‘UniProtID’) may be used to complement protein annotation after importing mass-spectrometry peak- and protein-identification results. Obviously, using recent Fasta-files from UniProt for protein-identification will typically give better matching at the end.

You may note that sometimes Ensemble transcript IDs are written as ‘enst00000410108’ whereas at other places it may be written as ‘ENST00000410108.5’. The function readUniProtExport() switches to a more flexible search mode stripping of version-numbers and reading all as lower-caps, if initial direct matching reveals less than 4 hits.

Finally, it should be added, that of course several other ways of retrieving annotation exist, in particular, using the annotation-packages provided by Bioconductor.

Acknowledgements

The author would like to acknowledge the support by the IGBMC (CNRS UMR 7104, Inserm U 1258, UdS), CNRS, Université de Strasbourg and Inserm. All collegues from the proteomics platform at the IGBMC work very commited to provide high quality mass-spectrometry data (including some of those used here). The author wishes to thank the CRAN-staff for all their help with new entries and their efforts in maintaining this repository of R-packages. Furthermore, many very fruitful discussions with colleages on national and international level have helped to improve the tools presented here.

Session-Info

For completeness :

#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=C                   LC_CTYPE=French_France.1252   
#> [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
#> [5] LC_TIME=French_France.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] rmarkdown_2.11 knitr_1.37     wrGraph_1.3.0  wrProteo_1.6.0 wrMisc_1.8.1  
#> 
#> loaded via a namespace (and not attached):
#>  [1] qvalue_2.24.0      tidyselect_1.1.1   xfun_0.29          bslib_0.3.1       
#>  [5] purrr_0.3.4        reshape2_1.4.4     splines_4.1.2      colorspace_2.0-2  
#>  [9] vctrs_0.3.8        generics_0.1.2     htmltools_0.5.2    yaml_2.2.2        
#> [13] utf8_1.2.2         rlang_1.0.1        R.oo_1.24.0        sm_2.2-5.7        
#> [17] jquerylib_0.1.4    pillar_1.7.0       R.utils_2.11.0     glue_1.5.1        
#> [21] DBI_1.1.2          RColorBrewer_1.1-2 lifecycle_1.0.1    plyr_1.8.6        
#> [25] stringr_1.4.0      munsell_0.5.0      gtable_0.3.0       R.methodsS3_1.8.1 
#> [29] evaluate_0.14      fastmap_1.1.0      fdrtool_1.2.17     fansi_1.0.2       
#> [33] highr_0.9          Rcpp_1.0.8         scales_1.1.1       limma_3.48.3      
#> [37] jsonlite_1.7.3     ggplot2_3.3.5      digest_0.6.29      stringi_1.7.6     
#> [41] dplyr_1.0.8        grid_4.1.2         cli_3.1.1          tools_4.1.2       
#> [45] magrittr_2.0.2     sass_0.4.0         tibble_3.1.6       crayon_1.5.0      
#> [49] pkgconfig_2.0.3    ellipsis_0.3.2     assertthat_0.2.1   R6_2.5.1          
#> [53] compiler_4.1.2

Getting started with wrProteo

Wolfgang Raffelsberger

2022-02-28

Introduction

Calculating Molecular Masses From Composition Formulas

Molecular masses based on (summed) chemical formulas

Molecular masses based on amino-acid sequence

Reading Fasta Files (from Uniprot)

Writing Sequences As Fasta Files