canprot: Chemical metrics of differentially expressed proteins

Calculating chemical metrics

Specify the amino acid composition of a protein in a data frame or matrix with column names corresponding to the 3-letter abbreviations of the amino acids. This can be done as follows for the dipeptide alanylglycine:

AG <- data.frame(Ala = 1, Gly = 1)

Use the functions ZCAA and H2OAA to calculate the carbon oxidation state (Z_C) and stoichiometric hydration state (n_H₂O) of the molecule:

ZCAA(AG)

## [1] 0.4

H2OAA(AG)

## [1] 0
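As a cross-check, Z_C can also be obtained directly from an elemental formula. The following is a minimal sketch in base R, not the canprot implementation. It assumes the usual expression Z_C = (Z - n_H + 3n_N + 2n_O + 2n_S) / n_C for a molecule with net charge Z, and the helper name zc_from_formula is hypothetical. It is applied to C5H10N2O3, the formula of alanylglycine.

```r
# Hypothetical helper (not part of canprot): carbon oxidation state from
# element counts, assuming H = +1, N = -3, O = -2, S = -2 and net charge Z
zc_from_formula <- function(C, H, N = 0, O = 0, S = 0, Z = 0) {
  (Z - H + 3 * N + 2 * O + 2 * S) / C
}

# Alanylglycine, C5H10N2O3
zc_alagly <- zc_from_formula(C = 5, H = 10, N = 2, O = 3)
zc_alagly
## [1] 0.4
```

The result agrees with the value returned by ZCAA(AG) above.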

By default, n_H₂O is calculated from a chemical reaction to form the protein from the basis species glutamine, glutamic acid, cysteine, H₂O, and O₂, abbreviated as QEC. To see how this works, consider the formation reaction of alanylglycine, which can be written using functions in the CHNOSZ package:

CHNOSZ::basis("QEC")

##           C  H N O S ispecies logact state
## C5H10N2O3 5 10 2 3 0     1585   -3.2    aq
## C5H9NO4   5  9 1 4 0     1586   -4.5    aq
## C3H7NO2S  3  7 1 2 1     1583   -3.6    aq
## H2O       0  2 0 1 0        1    0.0   liq
## O2        0  0 0 2 0     2637  -80.0   gas

CHNOSZ::subcrt("alanylglycine", 1)$reaction

##      coeff          name   formula state ispecies
## 2531     1 alanylglycine C5H10N2O3    cr     2531
## 1585    -1     glutamine C5H10N2O3    aq     1585

Alanylglycine has the same formula as glutamine, so there is no water in the reaction, and n_H₂O is zero.

Protein Data 1: CHNOSZ package

For a more practical example, let’s try an actual protein, chicken egg-white lysozyme, which has the name LYSC_CHICK in UniProt with accession number P00698. The amino acid compositions of this and selected other proteins are available in the CHNOSZ package. Here we get the amino acid composition and also print the protein length:

AA <- CHNOSZ::pinfo(CHNOSZ::pinfo("LYSC_CHICK"))
CHNOSZ::protein.length(AA)

## [1] 129

AA

##   protein organism     ref  abbrv chains Ala Cys Asp Glu Phe Gly His
## 6    LYSC    CHICK UniProt P00698      1  12   8   7   2   3  12   1
##   Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 6   6   6   8   2  14   2   3  11  10   7   6   6   3

This data frame has some other identifying information (protein and organism names, reference, accession number in the abbrv column) as well as chains to indicate the number of polypeptide chains. However, the important part here is the set of 20 columns with amino acid frequencies; this means we can use the data frame with the functions in canprot to calculate chemical metrics:

ZCAA(AA)

##          6 
## 0.01631321

H2OAA(AA)

##          6 
## -0.8790698

Now we can look at the formation reaction of LYSC_CHICK from the QEC basis species to see where the value of n_H₂O comes from.

CHNOSZ::subcrt("LYSC_CHICK", 1)$reaction

##      coeff          name             formula state ispecies
## 3444   1.0    LYSC_CHICK C613H959N193O185S10    aq     3444
## 1585 -66.4     glutamine           C5H10N2O3    aq     1585
## 1586 -50.2 glutamic acid             C5H9NO4    aq     1586
## 1583 -10.0      cysteine            C3H7NO2S    aq     1583
## 1    113.4         water                 H2O   liq        1
## 2637  60.8        oxygen                  O2   gas     2637

This shows that 113.4 water molecules are released in the reaction. Because n_H₂O counts how many waters go into forming the protein, it is the negative of this value divided by the length of the protein (129): -113.4 / 129 = -0.8790698.
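The coefficients of the formation reaction can be reproduced by mass balance alone. The sketch below uses only base R and is not how CHNOSZ or canprot compute the result internally. It assumes the element counts of the QEC basis species and of LYSC_CHICK shown in the output above, and solves a linear system for the number of each basis species that makes up one protein molecule. The object names (basis, protein, n_basis, nH2O) are chosen for illustration.

```r
# Element counts (C, H, N, O, S) of the QEC basis species, one row per species
basis <- rbind(
  glutamine       = c(C = 5, H = 10, N = 2, O = 3, S = 0),
  "glutamic acid" = c(C = 5, H = 9,  N = 1, O = 4, S = 0),
  cysteine        = c(C = 3, H = 7,  N = 1, O = 2, S = 1),
  water           = c(C = 0, H = 2,  N = 0, O = 1, S = 0),
  oxygen          = c(C = 0, H = 0,  N = 0, O = 2, S = 0)
)

# Element counts of LYSC_CHICK (C613H959N193O185S10) and its length
protein <- c(C = 613, H = 959, N = 193, O = 185, S = 10)
len <- 129

# Solve t(basis) %*% n = protein for the number of each basis species
# contained in one protein molecule; the signs are opposite to the
# reaction coefficients printed by subcrt(), where basis species that
# are consumed appear with negative coefficients
n_basis <- setNames(solve(t(basis), protein), rownames(basis))
round(n_basis, 1)

# Stoichiometric hydration state per residue
nH2O <- n_basis[["water"]] / len
nH2O
## [1] -0.8790698

# The same calculation for alanylglycine (C5H10N2O3) gives one glutamine
# and no water, so its n_H2O is zero
n_alagly <- solve(t(basis), c(C = 5, H = 10, N = 2, O = 3, S = 0))
```

Solving the system once per protein is simple but slow for large datasets, which is one reason to prefer precomputed per-amino-acid values.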

H2OAA works not by writing the formation reaction for each protein but rather by using precomputed values of n_H₂O for each amino acid. The two methods give equivalent results, as described in Dick et al. (2020).

Note that calculating Z_C of a protein from the Z_C values of its amino acids requires weighting by the number of carbon atoms in each amino acid. Using the unweighted mean of Z_C of amino acids is a common mistake that leads to artificially higher values for the protein.
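The effect of the weighting can be seen with the alanylglycine example. This sketch uses base R only and assumes the formulas of alanine (C3H7NO2) and glycine (C2H5NO2); the variable names are chosen for illustration and are not canprot objects.

```r
# Carbon number and Z_C of the two amino acids in alanylglycine,
# with Z_C = (-H + 3N + 2O) / C for these uncharged, sulfur-free molecules
nC <- c(Ala = 3, Gly = 2)
ZC <- c(Ala = (-7 + 3 + 4) / 3, Gly = (-5 + 3 + 4) / 2)  # 0 and 1
count <- c(Ala = 1, Gly = 1)

# Carbon-weighted mean: matches the value for the whole molecule
zc_weighted <- sum(count * nC * ZC) / sum(count * nC)
zc_weighted
## [1] 0.4

# Unweighted mean: overestimates Z_C in this case
zc_unweighted <- sum(count * ZC) / sum(count)
zc_unweighted
## [1] 0.5
```

Removing H₂O to form the peptide bond does not change Z_C, so the carbon-weighted mean over the free amino acids equals the value for the dipeptide.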

Protein Data 2: canprot package

canprot has an extensive list of amino acid compositions of human proteins assembled from UniProt, together with proteins from other organisms that have been identified in the differential expression studies used in the package (look in these directories in extdata/aa: archaea, bacteria, cow, dog, human, mouse, rat, yeast). If you have a UniProt ID for a human protein, such as P24298, use protcomp to get the amino acid composition:

(pc <- protcomp("P24298"))

## $uniprot
## [1] "P24298"
## 
## $aa
##              protein organism  ref abbrv chains Ala Cys Asp Glu Phe Gly
## 5457 sp|P24298|ALAT1    HUMAN <NA>  <NA>      1  51  10  20  34  21  38
##      His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 5457   9  18  16  55  12  12  33  30  37  25  18  41   1  15

ZCAA(pc$aa)

##       5457 
## -0.1482091

Next let’s use a file with amino acid compositions for non-human proteins, in this case proteins identified in a study of the response of an archaeal organism to salt and temperature stress (Jevtić et al., 2019). Note that high Z_C is a characteristic of many proteins in halophiles (Dick et al., 2020).

aa_file <- system.file("extdata/aa/archaea/JSP+19_aa.csv.xz", package = "canprot")
pc <- protcomp("D4GP79", aa_file = aa_file)

## [1] "protcomp: adding 1268 proteins from /tmp/RtmpaFI4jY/Rinst70bb45388775/canprot/extdata/aa/archaea/JSP+19_aa.csv.xz"

ZCAA(pc$aa)

##         28 
## -0.1003584

Other chemical metrics

There are also functions for calculating the grand average of hydropathicity (GRAVY, which is higher for proteins with more hydrophobic amino acids) and isoelectric point (pI) of proteins. There are some limitations of this implementation (see Dick et al. (2020) for details), but values for representative proteins are equal to those computed with the ProtParam tool (Gasteiger et al., 2005) in UniProt (see ?pI for numerical tests).

proteins <- c("LYSC_CHICK", "RNAS1_BOVIN", "AMYA_PYRFU")
AA <- CHNOSZ::pinfo(CHNOSZ::pinfo(proteins))
pI(AA)

##    6    9    3 
## 9.32 8.64 5.46

GRAVY(AA)

##          6          9          3 
## -0.4720930 -0.6629032 -0.3250000
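To show what GRAVY measures, here is a minimal base-R sketch that is not the canprot implementation. It assumes that GRAVY is the count-weighted mean of the Kyte-Doolittle hydropathy values of the residues, and uses the amino acid counts of LYSC_CHICK copied from the output shown earlier. The object names hydropathy, lysc, and gravy are chosen for illustration.

```r
# Kyte-Doolittle hydropathy values for the 20 standard amino acids
hydropathy <- c(
  Ala =  1.8, Cys =  2.5, Asp = -3.5, Glu = -3.5, Phe =  2.8,
  Gly = -0.4, His = -3.2, Ile =  4.5, Lys = -3.9, Leu =  3.8,
  Met =  1.9, Asn = -3.5, Pro = -1.6, Gln = -3.5, Arg = -4.5,
  Ser = -0.8, Thr = -0.7, Val =  4.2, Trp = -0.9, Tyr = -1.3
)

# Amino acid counts of LYSC_CHICK (129 residues in total)
lysc <- c(
  Ala = 12, Cys = 8, Asp = 7, Glu = 2, Phe = 3,
  Gly = 12, His = 1, Ile = 6, Lys = 6, Leu = 8,
  Met = 2, Asn = 14, Pro = 2, Gln = 3, Arg = 11,
  Ser = 10, Thr = 7, Val = 6, Trp = 6, Tyr = 3
)

# Count-weighted mean hydropathy; indexing by name keeps the two
# vectors aligned even if their orders differ
gravy <- sum(lysc * hydropathy[names(lysc)]) / sum(lysc)
gravy
```

The result agrees with the first value printed by GRAVY(AA) above to the displayed precision.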

Retrieving differentially expressed proteins

See ?pdat_ for the functions to get the lists of differentially expressed proteins from different cancer types and experimental conditions. Run one of the functions with default arguments to see the list of datasets:

pdat_3D()

##  [1] "PLC+10=cancer"              "MHG+12_P5=cancer"          
##  [3] "MHG+12_P2=cancer"           "MVC+12_perinecrotic=cancer"
##  [5] "MVC+12_necrotic=cancer"     "YYW+13=cancer"             
##  [7] "ZMH+13_Matr.12h"            "ZMH+13_Matr.24h"           
##  [9] "HKX+14=cancer"              "KDS+14_hESC"               
## [11] "KDS+14_hiPSC"               "KDS+14_hPSC"               
## [13] "RKP+14=cancer"              "SAS+14=cancer"             
## [15] "WRK+14=cancer"              "MTK+15=cancer"             
## [17] "YLW+16=cancer"              "KJK+18=cancer"             
## [19] "TGD18_NHF"                  "TGD18_CAF=cancer"          
## [21] "EWK+19=cancer"              "GADS19"                    
## [23] "HLC19=cancer"               "LPK+19_preadipocytes"      
## [25] "LPK+19_adipocytes"          "LPK+19_macrophages"        
## [27] "DKM+20"

The letters (from authors’ surnames) and 2-digit year are the bibliographic keys; see system.file("vignettes/cpdat.bib", package = "canprot") for their BibTeX entries. Text after an underscore indicates different experimental groups, and one or more equals signs are used to tag datasets with different attributes; here, =cancer means that the experiments involve cancer cells. Let’s look at one of these datasets, which lists differentially expressed proteins in mesenchymal stromal cells grown as aggregates (i.e. 3D cell culture) compared to those grown in monolayers (Doron et al., 2020).
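The naming convention can be made concrete with a small parser. This is a sketch in base R under the convention described above; parse_dataset is a hypothetical helper, not a canprot function, and it assumes that the bibliographic key never contains an underscore or an equals sign.

```r
# Hypothetical helper (not part of canprot): split a dataset name into
# its bibliographic key, experimental group, and attribute tags
parse_dataset <- function(x) {
  parts <- strsplit(x, "=", fixed = TRUE)[[1]]
  tags <- parts[-1]                   # zero or more tags after "=" signs
  key <- sub("_.*$", "", parts[1])    # text before the first underscore
  group <- if (grepl("_", parts[1], fixed = TRUE)) {
    sub("^[^_]*_", "", parts[1])      # text after the first underscore
  } else {
    NA_character_                     # no experimental group given
  }
  list(key = key, group = group, tags = tags)
}

parsed <- parse_dataset("MVC+12_perinecrotic=cancer")
parsed$key
## [1] "MVC+12"
parsed$group
## [1] "perinecrotic"
parsed$tags
## [1] "cancer"
```

For a name without a group or tags, such as "DKM+20", the helper returns NA for the group and an empty character vector for the tags.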

pdat <- pdat_3D("DKM+20")

## [1] "check_IDs: updating 1 old UniProt ID: A0A0C4DFX3"
## [1] "pdat_3D: bone marrow-derived MSCs aggregates [DKM+20]"

str(pdat)

## List of 4
##  $ dataset    : chr "DKM+20"
##  $ pcomp      :List of 2
##   ..$ uniprot: chr [1:174] "P02751" "Q09666" "P12111" "Q01995" ...
##   ..$ aa     :'data.frame':  174 obs. of  25 variables:
##   .. ..$ protein : chr [1:174] "sp|P02751|FINC" "sp|Q09666|AHNK" "sp|P12111|CO6A3" "sp|Q01995|TAGL" ...
##   .. ..$ organism: chr [1:174] "HUMAN" "HUMAN" "HUMAN" "HUMAN" ...
##   .. ..$ ref     : chr [1:174] NA NA NA NA ...
##   .. ..$ abbrv   : chr [1:174] NA NA NA NA ...
##   .. ..$ chains  : int [1:174] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ Ala     : int [1:174] 98 211 235 11 182 139 36 188 129 96 ...
##   .. ..$ Cys     : int [1:174] 63 9 30 1 38 18 1 14 9 362 ...
##   .. ..$ Asp     : int [1:174] 119 608 183 10 156 66 29 157 43 173 ...
##   .. ..$ Glu     : int [1:174] 140 348 183 15 164 75 96 269 66 201 ...
##   .. ..$ Phe     : int [1:174] 51 152 149 7 86 27 12 61 22 84 ...
##   .. ..$ Gly     : int [1:174] 199 641 303 20 277 391 22 78 381 307 ...
##   .. ..$ His     : int [1:174] 51 99 45 3 58 9 3 70 15 48 ...
##   .. ..$ Ile     : int [1:174] 110 288 153 8 134 24 13 116 32 149 ...
##   .. ..$ Lys     : int [1:174] 78 827 158 17 173 57 65 179 50 111 ...
##   .. ..$ Leu     : int [1:174] 129 401 282 14 140 48 29 246 61 141 ...
##   .. ..$ Met     : int [1:174] 27 256 37 10 37 13 7 65 10 52 ...
##   .. ..$ Asn     : int [1:174] 99 146 130 7 88 28 18 93 41 188 ...
##   .. ..$ Pro     : int [1:174] 189 718 209 10 200 278 22 46 232 175 ...
##   .. ..$ Gln     : int [1:174] 129 68 147 14 72 49 28 153 33 100 ...
##   .. ..$ Arg     : int [1:174] 125 44 189 11 83 71 57 157 72 131 ...
##   .. ..$ Ser     : int [1:174] 191 323 217 12 178 60 37 153 52 173 ...
##   .. ..$ Thr     : int [1:174] 257 104 167 6 168 45 32 115 42 166 ...
##   .. ..$ Val     : int [1:174] 192 605 292 16 263 47 23 109 55 108 ...
##   .. ..$ Trp     : int [1:174] 39 32 6 3 16 6 4 46 5 13 ...
##   .. ..$ Tyr     : int [1:174] 100 10 62 6 89 13 4 49 16 93 ...
##  $ up2        : logi [1:174] TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ description: chr "bone marrow-derived MSCs aggregates"

We now have the UniProt IDs of the proteins, their amino acid compositions, and whether each protein is up- or down-expressed in the experiments (in the up2 list element). With this in hand, use get_comptab to calculate median differences of chemical metrics between the up- and down-regulated proteins. Column names with median1 and median2 indicate the median values for the down- and up-regulated proteins, respectively, and diff is the difference between them (median2 minus median1).

(get_comptab(pdat))

## DKM+20 (bone marrow-derived MSCs aggregates): n1 57, n2 117

##   dataset                         description n1  n2 ZC.median1
## 1  DKM+20 bone marrow-derived MSCs aggregates 57 117 -0.1028213
##   ZC.median2     ZC.diff nH2O.median1 nH2O.median2   nH2O.diff
## 1  -0.133829 -0.03100768   -0.7563636       -0.776 -0.01963636

ZC.diff and nH2O.diff are negative, meaning that 3D growth in this experiment results in higher expression of proteins with lower median Z_C and n_H₂O.
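The median difference itself is a simple calculation. The sketch below uses base R with made-up Z_C values for six proteins; the numbers are for illustration only and do not come from any dataset, and the code is not the get_comptab implementation.

```r
# Made-up Z_C values and up/down flags for six proteins (illustration only)
ZC <- c(-0.10, -0.15, -0.12, -0.20, -0.08, -0.18)
up2 <- c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

median1 <- median(ZC[!up2])  # down-regulated proteins
median2 <- median(ZC[up2])   # up-regulated proteins

# Negative difference: up-regulated proteins have lower median Z_C
ZC_diff <- median2 - median1
ZC_diff
## [1] -0.03
```

Medians are used rather than means so that a few proteins with extreme values do not dominate the comparison.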

The lapply function in R makes it easy to compute the metrics for multiple datasets.

datasets <- pdat_3D()
pdats <- lapply(datasets, pdat_3D)
comptabs <- lapply(pdats, get_comptab)

Now we can plot the median differences of Z_C and n_H₂O for all of these datasets. The points are lettered according to the order of datasets, and the dashed line shows the 50% probability contour; that is, approximately half the datasets are inside the contoured area, and half are outside. The plot shows that in many 3D cell culture experiments, the proteins that are up-expressed relative to growth in monolayers have both lower carbon oxidation state and lower stoichiometric hydration state than the down-expressed proteins.

diffplot(comptabs)
title("3D cell culture vs monolayers")

This is the essence of the vignettes described below. These analyses have been used for interpreting the effects of salinity on protein expression (Dick et al., 2020) and for describing the chemical features of differentially regulated proteins in multiple cancer types and experimental cell culture conditions (Dick, 2021).

Building the analysis vignettes

There is an analysis vignette for each dataset of differentially expressed proteins. To save package space and checking time, prebuilt analysis vignettes are not included in the package.

Use the mkvig() function to compile the vignettes on demand and view them in the browser. For example, mkvig("3D") compiles the vignette for three-dimensional cell culture and then opens it in the browser. Each of the vignettes is also available as a demo, which can be run from the browser (help.start() > Packages > canprot > Code demos). Building the vignettes requires pandoc as a system dependency.

Here is the list of available vignettes:

##  [1] "3D"           "breast"       "colorectal"   "glucose"     
##  [5] "hypoxia"      "liver"        "lung"         "osmotic_bact"
##  [9] "osmotic_euk"  "osmotic_halo" "pancreatic"   "prostate"    
## [13] "secreted"

The vignettes can be viewed online at https://chnosz.net/canprot/doc/index.html. Additional vignettes based on data from the Human Protein Atlas (HPA) and The Cancer Genome Atlas (TCGA) are available in the JMDplots package on GitHub (see also https://chnosz.net/JMDplots/doc/index.html).

References

Dick JM. 2021. Water as a reactant in the differential expression of proteins in cancer. Computational and Systems Oncology 1(1): e1007. doi: 10.1002/cso2.1007

Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020

Doron G, Klontzas ME, Mantalaris A, Guldberg RE, Temenoff JS. 2020. Multiomics characterization of mesenchymal stromal cells cultured in monolayer and as aggregates. Biotechnology and Bioengineering 117(6): 1761–1778. doi: 10.1002/bit.27317

Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. 2005. Protein identification and analysis tools on the ExPASy server. In: Walker JM, editor. The Proteomics Protocols Handbook. Totowa, NJ: Humana Press. pp. 571–607. doi: 10.1385/1-59259-890-0:571

Jevtić Ž, Stoll B, Pfeiffer F, Sharma K, Urlaub H, Marchfelder A, Lenz C. 2019. The response of Haloferax volcanii to salt and temperature stress: A proteome study by label-free mass spectrometry. Proteomics 19(20): 1800491. doi: 10.1002/pmic.201800491