The canprot package has lists of differentially expressed proteins compiled from various literature sources and functions to calculate chemical metrics of proteins.
Specify the amino acid composition of a protein in a data frame or matrix with column names corresponding to the 3-letter abbreviations of the amino acids. This can be done as follows for the dipeptide alanylglycine:
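A minimal base R sketch of such an object. The object names AG and AG_matrix are assumptions for illustration, not names taken from the package:

```r
# Amino acid composition of alanylglycine: one alanine and one glycine.
# Column names are the 3-letter abbreviations of the amino acids.
AG <- data.frame(Ala = 1, Gly = 1)

# The same composition as a one-row matrix with named columns
AG_matrix <- matrix(c(1, 1), nrow = 1, dimnames = list(NULL, c("Ala", "Gly")))

print(AG)
```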
Use the functions ZCAA and H2OAA to calculate the carbon oxidation state (ZC) and stoichiometric hydration state (nH2O) of the molecule:
## [1] 0.4
## [1] 0
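The value of 0.4 for ZC can be checked by hand from the chemical formula of alanylglycine (C5H10N2O3). The sketch below is not the package's implementation. It assumes the usual formal oxidation numbers (H +1, N -3, O -2, S -2) and a hypothetical helper named ZC_formula:

```r
# Carbon oxidation state from elemental counts and net charge Z,
# assuming oxidation numbers H = +1, N = -3, O = -2, S = -2.
# ZC_formula is a hypothetical helper, not a canprot function.
ZC_formula <- function(C, H, N, O, S = 0, Z = 0) {
  (-H + 3 * N + 2 * O + 2 * S + Z) / C
}

# Alanylglycine, C5H10N2O3: (-10 + 6 + 6) / 5
ZC_AG <- ZC_formula(C = 5, H = 10, N = 2, O = 3)
print(ZC_AG)  # 0.4
```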
By default, nH2O is calculated from a chemical reaction to form the protein from the basis species glutamine, glutamic acid, cysteine, H2O, and O2, abbreviated as QEC. To see how this works, consider the formation reaction of alanylglycine, which can be written using functions in the CHNOSZ package:
## C H N O S ispecies logact state
## C5H10N2O3 5 10 2 3 0 1585 -3.2 aq
## C5H9NO4 5 9 1 4 0 1586 -4.5 aq
## C3H7NO2S 3 7 1 2 1 1583 -3.6 aq
## H2O 0 2 0 1 0 1 0.0 liq
## O2 0 0 0 2 0 2637 -80.0 gas
## coeff name formula state ispecies
## 2531 1 alanylglycine C5H10N2O3 cr 2531
## 1585 -1 glutamine C5H10N2O3 aq 1585
Alanylglycine has the same formula as glutamine, so there is no water in the reaction, and nH2O is zero.
For a more practical example, let’s try an actual protein, chicken egg-white lysozyme, which has the name LYSC_CHICK in UniProt with accession number P00698. The amino acid compositions of this and selected other proteins are available in the CHNOSZ package. Here we get the amino acid composition and also print the protein length:
## [1] 129
## protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly His
## 6 LYSC CHICK UniProt P00698 1 12 8 7 2 3 12 1
## Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 6 6 6 8 2 14 2 3 11 10 7 6 6 3
This data frame has some other identifying information (protein and organism names, reference, accession number in the abbrv column) as well as chains to indicate the number of polypeptide chains. However, what matters here are the 20 columns with amino acid frequencies; this means we can use the data frame with the functions in canprot to calculate chemical metrics:
## 6
## 0.01631321
## 6
## -0.8790698
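The first value (ZC) can be reproduced from the amino acid counts alone. The following is a base R sketch, not the method used by canprot. It assumes a neutral, non-ionized polypeptide without disulfide bonds, built by summing the formulas of the free amino acids and removing one H2O per peptide bond. All object names are hypothetical:

```r
# Elemental formulas of the 20 free amino acids (columns: C, H, N, O, S)
aa_formula <- matrix(c(
   3,  7, 1, 2, 0,  # Ala
   3,  7, 1, 2, 1,  # Cys
   4,  7, 1, 4, 0,  # Asp
   5,  9, 1, 4, 0,  # Glu
   9, 11, 1, 2, 0,  # Phe
   2,  5, 1, 2, 0,  # Gly
   6,  9, 3, 2, 0,  # His
   6, 13, 1, 2, 0,  # Ile
   6, 14, 2, 2, 0,  # Lys
   6, 13, 1, 2, 0,  # Leu
   5, 11, 1, 2, 1,  # Met
   4,  8, 2, 3, 0,  # Asn
   5,  9, 1, 2, 0,  # Pro
   5, 10, 2, 3, 0,  # Gln
   6, 14, 4, 2, 0,  # Arg
   3,  7, 1, 3, 0,  # Ser
   4,  9, 1, 3, 0,  # Thr
   5, 11, 1, 2, 0,  # Val
  11, 12, 2, 2, 0,  # Trp
   9, 11, 1, 3, 0   # Tyr
), ncol = 5, byrow = TRUE)
aa_names <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu",
              "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr")
dimnames(aa_formula) <- list(aa_names, c("C", "H", "N", "O", "S"))

# Amino acid counts of LYSC_CHICK, copied from the table above
counts <- setNames(c(12, 8, 7, 2, 3, 12, 1, 6, 6, 8, 2, 14, 2, 3, 11, 10, 7, 6, 6, 3),
                   aa_names)
plength <- sum(counts)  # protein length

# Sum the amino acid formulas, then remove one H2O for each peptide bond
water <- c(C = 0, H = 2, N = 0, O = 1, S = 0)
formula <- drop(counts %*% aa_formula) - (plength - 1) * water

# Carbon oxidation state of the neutral molecule
ZC <- unname((-formula["H"] + 3 * formula["N"] + 2 * formula["O"] +
              2 * formula["S"]) / formula["C"])
print(formula)
print(ZC)
```

The resulting formula is C613H959N193O185S10, and ZC is 10 / 613, or about 0.01631321.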
Now we can look at the formation reaction of LYSC_CHICK from the QEC basis species to see where the value of nH2O comes from.
## coeff name formula state ispecies
## 3444 1.0 LYSC_CHICK C613H959N193O185S10 aq 3444
## 1585 -66.4 glutamine C5H10N2O3 aq 1585
## 1586 -50.2 glutamic acid C5H9NO4 aq 1586
## 1583 -10.0 cysteine C3H7NO2S aq 1583
## 1 113.4 water H2O liq 1
## 2637 60.8 oxygen O2 gas 2637
This shows that 113.4 water molecules are released in the reaction. nH2O is the negative of this value (because we are counting how many water molecules go into forming the protein) divided by the length of the protein (129): -113.4 / 129 = -0.8790698.
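The coefficients in the reaction follow from mass balance alone, so they can be recovered by solving a small linear system. The sketch below uses base R rather than CHNOSZ. The object names are hypothetical, and the formulas of the basis species and of the protein are copied from the output above:

```r
# Elemental composition of the QEC basis species (columns: C, H, N, O, S)
basis <- rbind(
  glutamine       = c(5, 10, 2, 3, 0),
  `glutamic acid` = c(5,  9, 1, 4, 0),
  cysteine        = c(3,  7, 1, 2, 1),
  water           = c(0,  2, 0, 1, 0),
  oxygen          = c(0,  0, 0, 2, 0)
)
colnames(basis) <- c("C", "H", "N", "O", "S")

# Formula of LYSC_CHICK: C613H959N193O185S10
protein <- c(C = 613, H = 959, N = 193, O = 185, S = 10)

# Number of each basis species whose formulas sum to the protein formula.
# Negative numbers correspond to species released in the formation reaction.
n_basis <- setNames(solve(t(basis), protein), rownames(basis))
print(round(n_basis, 1))

# Stoichiometric hydration state per residue
nH2O <- unname(n_basis["water"]) / 129
print(nH2O)
```

The solution is 66.4 glutamine, 50.2 glutamic acid, 10 cysteine, -113.4 water, and -60.8 oxygen, which are the reaction coefficients with opposite sign. Replacing the protein formula with that of alanylglycine (C5H10N2O3) gives 1 glutamine and zero for everything else, consistent with nH2O = 0.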
H2OAA works not by writing the formation reaction for each protein but rather by using precomputed values of nH2O for each amino acid. The two methods give equivalent results, as described in Dick et al. (2020).
It is important to note that calculating ZC of proteins from those of amino acids requires weighting by number of carbon atoms in each amino acid. Using the unweighted mean of ZC of amino acids is a common mistake that leads to artificially higher values for the protein.
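Alanylglycine gives a small worked example of the difference. Alanine (C3H7NO2) has ZC = 0 and glycine (C2H5NO2) has ZC = 1. Removing H2O to form the peptide bond does not change ZC. The variable names in this sketch are hypothetical:

```r
# ZC and number of carbon atoms of the two amino acids
ZC_aa <- c(Ala = 0, Gly = 1)
nC_aa <- c(Ala = 3, Gly = 2)
counts <- c(Ala = 1, Gly = 1)

# Unweighted mean over amino acids (the common mistake)
ZC_unweighted <- sum(counts * ZC_aa) / sum(counts)

# Mean weighted by the number of carbon atoms in each amino acid
ZC_weighted <- sum(counts * nC_aa * ZC_aa) / sum(counts * nC_aa)

print(c(unweighted = ZC_unweighted, weighted = ZC_weighted))
```

The unweighted mean is 0.5, while the carbon-weighted mean is 0.4, which is the value calculated earlier for the dipeptide.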
canprot has an extensive list of amino acid compositions of human proteins assembled from UniProt, together with those of proteins from other organisms that have been identified in the differential expression studies used in the package (see these directories under extdata/aa: archaea, bacteria, cow, dog, human, mouse, rat, yeast). If you have a UniProt ID for a human protein, such as P24298, use protcomp to get the amino acid composition:
## $uniprot
## [1] "P24298"
##
## $aa
## protein organism ref abbrv chains Ala Cys Asp Glu Phe Gly
## 5457 sp|P24298|ALAT1 HUMAN <NA> <NA> 1 51 10 20 34 21 38
## His Ile Lys Leu Met Asn Pro Gln Arg Ser Thr Val Trp Tyr
## 5457 9 18 16 55 12 12 33 30 37 25 18 41 1 15
## 5457
## -0.1482091
Next let’s use a file with amino acid compositions for non-human proteins, in this case proteins identified in a study of the response of an archaeal organism to salt and temperature stress (Jevtić et al., 2019). Note that high ZC is a characteristic of many proteins in halophiles (Dick et al., 2020).
aa_file <- system.file("extdata/aa/archaea/JSP+19_aa.csv.xz", package = "canprot")
pc <- protcomp("D4GP79", aa_file = aa_file)
## [1] "protcomp: adding 1268 proteins from /tmp/RtmpaFI4jY/Rinst70bb45388775/canprot/extdata/aa/archaea/JSP+19_aa.csv.xz"
## 28
## -0.1003584
There are also functions for calculating the grand average of hydropathicity (GRAVY, which is higher for proteins with more hydrophobic amino acids) and isoelectric point (pI) of proteins. This implementation has some limitations (see Dick et al. (2020) for details), but values for representative proteins are equal to those computed with the ProtParam tool (Gasteiger et al., 2005) in UniProt (see ?pI for numerical tests).
proteins <- c("LYSC_CHICK", "RNAS1_BOVIN", "AMYA_PYRFU")
AA <- CHNOSZ::pinfo(CHNOSZ::pinfo(proteins))
pI(AA)
## 6 9 3
## 9.32 8.64 5.46
## 6 9 3
## -0.4720930 -0.6629032 -0.3250000
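The last pair of output lines appears to hold the GRAVY values for the same three proteins. As a check that does not depend on the package, the sketch below computes GRAVY for LYSC_CHICK as the count-weighted mean of Kyte-Doolittle hydropathy values, using the amino acid counts listed earlier. The object names are hypothetical, and this is not necessarily how canprot implements the calculation:

```r
# Kyte-Doolittle hydropathy values for the 20 amino acids
hydropathy <- c(
  Ala =  1.8, Arg = -4.5, Asn = -3.5, Asp = -3.5, Cys =  2.5,
  Gln = -3.5, Glu = -3.5, Gly = -0.4, His = -3.2, Ile =  4.5,
  Leu =  3.8, Lys = -3.9, Met =  1.9, Phe =  2.8, Pro = -1.6,
  Ser = -0.8, Thr = -0.7, Trp = -0.9, Tyr = -1.3, Val =  4.2
)

# Amino acid counts of LYSC_CHICK
counts <- c(
  Ala = 12, Cys = 8, Asp = 7, Glu = 2, Phe = 3, Gly = 12, His = 1,
  Ile = 6, Lys = 6, Leu = 8, Met = 2, Asn = 14, Pro = 2, Gln = 3,
  Arg = 11, Ser = 10, Thr = 7, Val = 6, Trp = 6, Tyr = 3
)

# GRAVY: sum of hydropathy values over all residues divided by protein length
gravy <- sum(counts * hydropathy[names(counts)]) / sum(counts)
print(gravy)
```

The result, -60.9 / 129 or about -0.4720930, matches the first value in the output above.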
See ?pdat_ for the functions that get the lists of differentially expressed proteins from different cancer types and experimental conditions. Run one of the functions with default arguments to see the list of datasets:
## [1] "PLC+10=cancer" "MHG+12_P5=cancer"
## [3] "MHG+12_P2=cancer" "MVC+12_perinecrotic=cancer"
## [5] "MVC+12_necrotic=cancer" "YYW+13=cancer"
## [7] "ZMH+13_Matr.12h" "ZMH+13_Matr.24h"
## [9] "HKX+14=cancer" "KDS+14_hESC"
## [11] "KDS+14_hiPSC" "KDS+14_hPSC"
## [13] "RKP+14=cancer" "SAS+14=cancer"
## [15] "WRK+14=cancer" "MTK+15=cancer"
## [17] "YLW+16=cancer" "KJK+18=cancer"
## [19] "TGD18_NHF" "TGD18_CAF=cancer"
## [21] "EWK+19=cancer" "GADS19"
## [23] "HLC19=cancer" "LPK+19_preadipocytes"
## [25] "LPK+19_adipocytes" "LPK+19_macrophages"
## [27] "DKM+20"
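Dataset names like these can be taken apart with base R string functions. The sketch below assumes the naming convention explained in the next paragraph (bibliographic key, optional group after an underscore, optional tags after equals signs). parse_dataset is a hypothetical helper, not a canprot function:

```r
# Split a dataset name into bibliographic key, experimental group, and tags
parse_dataset <- function(x) {
  pieces <- strsplit(x, "=", fixed = TRUE)[[1]]
  base <- pieces[1]                      # part before the first "="
  has_group <- grepl("_", base, fixed = TRUE)
  list(
    key   = sub("_.*$", "", base),       # text before the first underscore
    group = if (has_group) sub("^[^_]*_", "", base) else NA_character_,
    tags  = pieces[-1]                   # zero or more attribute tags
  )
}

str(parse_dataset("MVC+12_perinecrotic=cancer"))
str(parse_dataset("DKM+20"))
```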
The letters (from authors’ surnames) and 2-digit year are the bibliographic keys; see system.file("vignettes/cpdat.bib", package = "canprot") for their BibTeX entries. Text after an underscore indicates different experimental groups, and one or more equals signs are used to tag datasets with different attributes; here, =cancer means that the experiments involve cancer cells. Let’s look at one of these datasets, which lists differentially expressed proteins in mesenchymal stromal cells grown as aggregates (i.e. 3D cell culture) compared to those grown in monolayers (Doron et al., 2020).
## [1] "check_IDs: updating 1 old UniProt ID: A0A0C4DFX3"
## [1] "pdat_3D: bone marrow-derived MSCs aggregates [DKM+20]"
## List of 4
## $ dataset : chr "DKM+20"
## $ pcomp :List of 2
## ..$ uniprot: chr [1:174] "P02751" "Q09666" "P12111" "Q01995" ...
## ..$ aa :'data.frame': 174 obs. of 25 variables:
## .. ..$ protein : chr [1:174] "sp|P02751|FINC" "sp|Q09666|AHNK" "sp|P12111|CO6A3" "sp|Q01995|TAGL" ...
## .. ..$ organism: chr [1:174] "HUMAN" "HUMAN" "HUMAN" "HUMAN" ...
## .. ..$ ref : chr [1:174] NA NA NA NA ...
## .. ..$ abbrv : chr [1:174] NA NA NA NA ...
## .. ..$ chains : int [1:174] 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ Ala : int [1:174] 98 211 235 11 182 139 36 188 129 96 ...
## .. ..$ Cys : int [1:174] 63 9 30 1 38 18 1 14 9 362 ...
## .. ..$ Asp : int [1:174] 119 608 183 10 156 66 29 157 43 173 ...
## .. ..$ Glu : int [1:174] 140 348 183 15 164 75 96 269 66 201 ...
## .. ..$ Phe : int [1:174] 51 152 149 7 86 27 12 61 22 84 ...
## .. ..$ Gly : int [1:174] 199 641 303 20 277 391 22 78 381 307 ...
## .. ..$ His : int [1:174] 51 99 45 3 58 9 3 70 15 48 ...
## .. ..$ Ile : int [1:174] 110 288 153 8 134 24 13 116 32 149 ...
## .. ..$ Lys : int [1:174] 78 827 158 17 173 57 65 179 50 111 ...
## .. ..$ Leu : int [1:174] 129 401 282 14 140 48 29 246 61 141 ...
## .. ..$ Met : int [1:174] 27 256 37 10 37 13 7 65 10 52 ...
## .. ..$ Asn : int [1:174] 99 146 130 7 88 28 18 93 41 188 ...
## .. ..$ Pro : int [1:174] 189 718 209 10 200 278 22 46 232 175 ...
## .. ..$ Gln : int [1:174] 129 68 147 14 72 49 28 153 33 100 ...
## .. ..$ Arg : int [1:174] 125 44 189 11 83 71 57 157 72 131 ...
## .. ..$ Ser : int [1:174] 191 323 217 12 178 60 37 153 52 173 ...
## .. ..$ Thr : int [1:174] 257 104 167 6 168 45 32 115 42 166 ...
## .. ..$ Val : int [1:174] 192 605 292 16 263 47 23 109 55 108 ...
## .. ..$ Trp : int [1:174] 39 32 6 3 16 6 4 46 5 13 ...
## .. ..$ Tyr : int [1:174] 100 10 62 6 89 13 4 49 16 93 ...
## $ up2 : logi [1:174] TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ description: chr "bone marrow-derived MSCs aggregates"
We now have the UniProt IDs of the proteins, their amino acid compositions, and whether each protein is up- or down-expressed in the experiments (in the up2 list element). With this in hand, use get_comptab to calculate median differences of chemical metrics between the up- and down-regulated proteins. Column names with median1 and median2 indicate the median values for the down- and up-regulated proteins, respectively, and diff is the difference between them (median2 minus median1).
## DKM+20 (bone marrow-derived MSCs aggregates): n1 57, n2 117
## dataset description n1 n2 ZC.median1
## 1 DKM+20 bone marrow-derived MSCs aggregates 57 117 -0.1028213
## ZC.median2 ZC.diff nH2O.median1 nH2O.median2 nH2O.diff
## 1 -0.133829 -0.03100768 -0.7563636 -0.776 -0.01963636
ZC.diff and nH2O.diff are negative, meaning that 3D growth in this experiment results in higher expression of proteins with lower median ZC and nH2O.
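The median difference itself is simple to compute. The following base R sketch uses made-up ZC values for six proteins, not data from the package, and a hypothetical helper named median_diff:

```r
# Median difference of a metric between up- (group 2) and
# down-regulated (group 1) proteins; up2 is TRUE for up-regulated proteins
median_diff <- function(values, up2) {
  median1 <- median(values[!up2])
  median2 <- median(values[up2])
  c(median1 = median1, median2 = median2, diff = median2 - median1)
}

# Hypothetical ZC values for illustration only
ZC <- c(-0.10, -0.12, -0.08, -0.15, -0.14, -0.13)
up2 <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE)

result <- median_diff(ZC, up2)
print(result)
```

In the table above, the same arithmetic gives ZC.diff: -0.133829 minus -0.1028213 is about -0.0310.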
The lapply function in R makes it easy to compute the metrics for multiple datasets.
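The pattern can be sketched with base R alone. The list below holds two made-up datasets with hypothetical names and values. With the package, each element would instead be the output of one of the pdat_ functions:

```r
# Two hypothetical datasets, each with a metric and an up/down indicator
datasets <- list(
  study_A = list(ZC = c(-0.10, -0.12, -0.15, -0.14),
                 up2 = c(FALSE, FALSE, TRUE, TRUE)),
  study_B = list(ZC = c(-0.20, -0.18, -0.16, -0.12),
                 up2 = c(FALSE, FALSE, TRUE, TRUE))
)

# Apply the same calculation to every dataset
ZC_diff <- lapply(datasets, function(d) {
  median(d$ZC[d$up2]) - median(d$ZC[!d$up2])
})

# Collect the results in one table, one row per dataset
comptab <- data.frame(dataset = names(ZC_diff), ZC.diff = unlist(ZC_diff),
                      row.names = NULL)
print(comptab)
```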
Now we can make a plot of the median differences of ZC and nH2O for all of these datasets. The points are lettered according to the order of datasets, and the dashed line shows the 50% probability contour; that is, approximately half the datasets are inside the contoured area, and half are outside. This plot shows that in many 3D cell culture experiments, the proteins that are up-expressed relative to growth in monolayers have both lower carbon oxidation state and lower stoichiometric hydration state than the down-expressed proteins.
This is the essence of the vignettes described below. These analyses have been used for interpreting the effects of salinity on protein expression (Dick et al., 2020) and for describing the chemical features of differentially regulated proteins in multiple cancer types and experimental cell culture conditions (Dick, 2021).
There is an analysis vignette for each dataset of differentially expressed proteins. To save package space and checking time, prebuilt analysis vignettes are not included in the package.
Use the mkvig() function to compile the vignettes on demand and view them in the browser. For example, mkvig("3D") compiles the vignette for three-dimensional cell culture and then opens it in the browser. Each of the vignettes is also available as a demo, which can be run from the browser (help.start() > Packages > canprot > Code demos). Building the vignettes requires pandoc as a system dependency.
Here is the list of available vignettes:
## [1] "3D" "breast" "colorectal" "glucose"
## [5] "hypoxia" "liver" "lung" "osmotic_bact"
## [9] "osmotic_euk" "osmotic_halo" "pancreatic" "prostate"
## [13] "secreted"
The vignettes can be viewed online at https://chnosz.net/canprot/doc/index.html. Additional vignettes based on data from the Human Protein Atlas (HPA) and The Cancer Genome Atlas (TCGA) are available in the JMDplots package on GitHub (see also https://chnosz.net/JMDplots/doc/index.html).
Dick JM. 2021. Water as a reactant in the differential expression of proteins in cancer. Computational and Systems Oncology 1(1): e1007. doi: 10.1002/cso2.1007
Dick JM, Yu M, Tan J. 2020. Uncovering chemical signatures of salinity gradients through compositional analysis of protein sequences. Biogeosciences 17(23): 6145–6162. doi: 10.5194/bg-17-6145-2020
Doron G, Klontzas ME, Mantalaris A, Guldberg RE, Temenoff JS. 2020. Multiomics characterization of mesenchymal stromal cells cultured in monolayer and as aggregates. Biotechnology and Bioengineering 117(6): 1761–1778. doi: 10.1002/bit.27317
Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. 2005. Protein identification and analysis tools on the ExPASy server. In: Walker JM, editor. The Proteomics Protocols Handbook. Totowa, NJ: Humana Press. pp. 571–607. doi: 10.1385/1-59259-890-0:571
Jevtić Ž, Stoll B, Pfeiffer F, Sharma K, Urlaub H, Marchfelder A, Lenz C. 2019. The response of Haloferax volcanii to salt and temperature stress: A proteome study by label-free mass spectrometry. Proteomics 19(20): 1800491. doi: 10.1002/pmic.201800491