The purpose of this document is for users interested in the
underlying model works and how the code under the hood of the dsb method
is implemented in a step by step manner as outlined in the methods
section of the paper. All of the steps outlined below are
carried out in a single step with the DSBNormalizeProtein()
function.
suppressMessages(library(mclust))
suppressMessages(library(magrittr))
suppressMessages(library(ggplot2))
#> Warning: replacing previous import 'lifecycle::last_warnings' by
#> 'rlang::last_warnings' when loading 'tibble'
#> Warning: replacing previous import 'lifecycle::last_warnings' by
#> 'rlang::last_warnings' when loading 'pillar'
library(dsb)
cell is a random subset of ~2000 cells from our PBMC data and neg are a subset of background drops.
= dsb::cells_citeseq_mtx
cell = dsb::empty_drop_citeseq_mtx neg
Log transform background droplets
# log transformation
= log(cell + 10)
dlog = log(neg + 10) nlog
In step I, protein counts in empty droplets are used to estimate the expected ambient background noise for each antibody. Each protein’s counts in cell-containing droplets are thus rescaled using this expected noise measurement.
How do we know this removes the ambient component?
In our experiments outlined in the paper using unstained spike-in
control cells added after staining prior to droplet encapsulation, we
found that this captures the ambient component of protein specific
noise. Background levels in the empty droplets and cells were highly
correlated using several definitions of empty / background droplets.
Calculate the mean and standard deviation of the log transformed cells and background droplets:
# calc mean and sd of background drops
= apply(nlog, 1 , sd)
sd_nlog = apply(nlog, 1 , mean) mean_nlog
Standardize ADT levels in the cell-containing droplets based on ambient noise. Subtract background mean \(\mu_{N}\) and divide by the background standard deviation \(\sigma_{N}\) of ADTs detected droplets detected in the empty / background droplets. This makes the value the signal above expected background in units of background standard deviations.
\[ y = (log(x_{i} + P) - µ_{N}) / \sigma_{N} \]
= apply(dlog, 2, function(x) (x - mean_nlog) / sd_nlog) norm_adt
Below we view the distribution of ambient corrected data compared to a simple log transformation as well as the raw data.
# check structure of denoised data with zero centering of background population
= 'deepskyblue3'
r = list(theme_bw(), geom_density_2d(color = r),
plist geom_vline(xintercept = 0, linetype = 'dashed'),
geom_hline(yintercept = 0, linetype = 'dashed'),
xlab('CD3'), ylab('CD19')
)
= qplot(as.data.frame(t(norm_adt))$CD3_PROT,
p1 as.data.frame(t(norm_adt))$CD19_PROT) +
+
plist ggtitle('dsb step 1')
# raw data
= qplot(as.data.frame(t(cell))$CD3_PROT,
p2 as.data.frame(t(cell))$CD19_PROT) +
+
plist ggtitle('RAW data')
# log transformed data
= qplot(as.data.frame(t(dlog))$CD3_PROT,
p3 as.data.frame(t(dlog))$CD19_PROT) +
+
plist ggtitle('log transformed')
# examine distributions.
::plot_grid(p1, p2, p3, nrow = 1) cowplot
In the second step of dsb, we remove technical cell to cell variations.
There are 2 parts to this second step This section describes the first part of the second step: dsb fits a Gaussian mixture with 2 mixing components to model the ADT levels in each cell. This captures the positive-staining proteins and the background, non-staining proteins. Intuitively, we can think of this as in each cell the model is assigning each protein to a positive and negative population, we then take the mean of the proteins in the background. Why do we use a 2 component mixture instead of, say, 3, or 4? Is a 2-component mixture an overly simplistic fit? No–there are several reasons this is the case discussed in detail in the paper. briefly:
How do we know this fitted mean captures only technical variations?
While it’s intuitive that µ1 should measure background noise intrinsic to each cell, unobserved factors may contribute to some biological variability in µ1, however, dsb does not use the fitted background mean alone (by default). µ1 is instead combined with isotype control levels (see the paper and below).
# fit 2 component Gaussian mixture to each cell
= apply(norm_adt, 2, function(x) {
cmd = Mclust(x, G=2, warn = TRUE , verbose = FALSE)
g return(g)
})
cmd
is a list of model fitting results for each cell.
For a random cell (the second cell here), plot the distribution of
proteins in mixture 1 vs mixture 2 of the Gaussian mixture model with k
= 2 mixing components:
Gaussian mixture applied to a single cell
= colnames(norm_adt)[2]
cell.name # get density
= MclustDR(cmd[[2]])
cell2 plot(cell2, what = "density", main = 'test')
title(cex.main = 0.6, paste0('single cell: ',cell.name,
'\n k = 2 component Gaussian Mixture '))
Most of the proteins in this fairly large panel are in the background distribution (N1). 14 proteins form the positive staining distribution (N2) in this cell:
table(cell2$classification)
#>
#> 1 2
#> 73 14
What this means is that the more highly parameterized models (those with a larger number of k values) tend to overfit the background distribution. For example, for this cell. We can compare this k = 2 component mixture to a Gaussian with k = 1 to k= 6 component mixtures. Below the distribution of protein populations inferred to belong to each sub-population is visualized as above for the k = 3 and 6 component mixtures as the Gaussian density distribution that we approximated with the model.
models shown below fit to the same cell as above
# fit a mixture with between 1 and 6 components.
= lapply(as.list(1:6), function(k) Mclust(norm_adt[ ,2], G = k,
m.list warn = FALSE, verbose = FALSE ))
# extract densities for k = 3 and k = 6 component models
= MclustDR(m.list[[3]])
dr_3 = MclustDR(m.list[[6]])
dr_6
# visualize distributiion of protein populations with different k
# in Gaussian mixture for a single cell
= paste0('single cell: ', cell.name,
plot.title '\n k-component Gaussian Mixture ')
plot(dr_3,what = "density")
title(cex.main = 0.6, plot.title)
plot(dr_6,what = "density")
title(cex.main = 0.6, plot.title)
The clear positive distribution in all cases is similar and is far from the lower peak; increasing the k value (the number of mixing components) finds many local densities near the background. The variation in background means is low relative to the magnitude of the difference between the background peaks and the clear positive peak. We observed this trend even in datasets with only 14 proteins
We can use information criteria to determine the optimal model for the cell. Note that Mclust uses a convention where there is not a - sign in the information criteria calculation for BIC, (which is typically used in a regression context), meaning the model that has the highest relative BIC is the optimal model. We can see that the 2 component mixture has the optimal BIC, and it is very similar to the 3 component model.
In the paper we show the distribution of the fitted values for k = 1-6 across cells. We further find that regardless of the optimal model, the µ1 parameter estimated is highly concordant between k = 2 or k = 3 component models. Note all of this is to help justify a single fitted parameter value in the dsb model–this parameter µ1 is further combined with other measures of noise, the isotype control antibodies to form the technical component as outlined below. This makes dsb insensitive to cells which may have true 3 or 4 modal distributions in protein levels because the background mean tends to be very robustly extracted by a 2 component mixture.
plot(
sapply(m.list, function(x) x$bic), type = 'b',
ylab = 'Mclust BIC -- higher is better',
xlab = 'mixing components k in Gaussian Mixture model',
main = paste0('single cell: ', cell.name),
col =r, pch = 18, cex = 1.4, cex.main = 0.8
)
formation of the noise matrix
First dsb extracts the k=2 component mixture µ1 value:
# extract mu 1 as a vector
.1 = unlist(
mulapply(
function(x){x$parameters$mean[1] }
cmd,
)
)
# check distribution of the fitted value for µ1 across cells
hist(mu.1, col = r,breaks = 30)
As indicated above, we do not use µ1 alone as a correction factor. Instead we utilize the correlation structure between isotype controls and µ1 to derive a technical factor for each cell that anchors the component of the background mean µ1 toward the component associated with technical noise. See the paper for further details.
Create noise matrix for calculating the technical component with µ1 and isotype values.
# define isotype controls
= c("Mouse IgG2bkIsotype_PROT", "MouseIgG1kappaisotype_PROT",
isotype.control.name.vec "MouseIgG2akappaisotype_PROT", "RatIgG2bkIsotype_PROT" )
# construct noise matrix
= rbind(mu.1, norm_adt[isotype.control.name.vec, ])
noise_matrix
# transpose to fit PC
= t(noise_matrix)
tmat
# view the noise matrix on which to calculate pc1 scores.
head(tmat)
#> mu.1 Mouse IgG2bkIsotype_PROT
#> GAACGGATCGAATCCA_H1B1ln2.1 0.424416035 2.0081678
#> ATAGACCAGTCCATAC_H1B1ln3.1 0.129457815 0.1938776
#> GAAGCAGAGTATCTCG_H1B1ln4.1 0.008699038 -0.8412382
#> CTACCCAAGCTACCGC_H1B1ln6.1 0.239007906 0.1938776
#> AGCCTAAAGGATATAC_H1B1ln2.1 -0.170614869 0.1938776
#> CGGACTGGTCCGCTGA_H1B1ln6.1 -0.039058826 -0.8412382
#> MouseIgG1kappaisotype_PROT
#> GAACGGATCGAATCCA_H1B1ln2.1 0.7409978
#> ATAGACCAGTCCATAC_H1B1ln3.1 -0.6802394
#> GAAGCAGAGTATCTCG_H1B1ln4.1 -0.6802394
#> CTACCCAAGCTACCGC_H1B1ln6.1 -0.6802394
#> AGCCTAAAGGATATAC_H1B1ln2.1 -0.6802394
#> CGGACTGGTCCGCTGA_H1B1ln6.1 0.7409978
#> MouseIgG2akappaisotype_PROT RatIgG2bkIsotype_PROT
#> GAACGGATCGAATCCA_H1B1ln2.1 2.1388457 -0.4511315
#> ATAGACCAGTCCATAC_H1B1ln3.1 1.4410413 0.7393731
#> GAAGCAGAGTATCTCG_H1B1ln4.1 -0.1319484 -0.4511315
#> CTACCCAAGCTACCGC_H1B1ln6.1 -0.1319484 -0.4511315
#> AGCCTAAAGGATATAC_H1B1ln2.1 1.4410413 0.7393731
#> CGGACTGGTCCGCTGA_H1B1ln6.1 -1.0293939 -0.4511315
calculate eigenvector through the noise matrix
Next we calculate PC1 score for cells based on their noise variables.
rotation = the loadings, x = the position for each observation along
each principal component (the PC scores). These PC scores are the dsb
technical component.
# calculate principal component 1
= prcomp(tmat, scale = TRUE)
g
# get the dsb technical component for each cell -- PC1 scores (position along PC1) for each cell
head(g$x)
#> PC1 PC2 PC3 PC4
#> GAACGGATCGAATCCA_H1B1ln2.1 -2.0901957 2.0749807 -0.64175456 0.05764503
#> ATAGACCAGTCCATAC_H1B1ln3.1 -0.6273177 1.1997987 1.50456704 0.11944273
#> GAAGCAGAGTATCTCG_H1B1ln4.1 1.0445432 0.5002462 0.08893046 -0.05208374
#> CTACCCAAGCTACCGC_H1B1ln6.1 0.2679178 0.4873479 -0.24739624 0.58365532
#> AGCCTAAAGGATATAC_H1B1ln2.1 -0.2224726 1.1748842 1.54873673 0.06409935
#> CGGACTGGTCCGCTGA_H1B1ln6.1 0.8920804 -0.7322072 -0.69982410 -0.78758017
#> PC5
#> GAACGGATCGAATCCA_H1B1ln2.1 -0.9566846
#> ATAGACCAGTCCATAC_H1B1ln3.1 -0.3505210
#> GAAGCAGAGTATCTCG_H1B1ln4.1 0.6115644
#> CTACCCAAGCTACCGC_H1B1ln6.1 0.5164147
#> AGCCTAAAGGATATAC_H1B1ln2.1 -0.8441784
#> CGGACTGGTCCGCTGA_H1B1ln6.1 0.4273887
head(g$rotation)
#> PC1 PC2 PC3 PC4
#> mu.1 -0.6297845 0.03875738 -0.06871119 0.086093199
#> Mouse IgG2bkIsotype_PROT -0.4949070 -0.03401375 -0.32125325 0.630195454
#> MouseIgG1kappaisotype_PROT -0.3896649 -0.31341086 -0.39186390 -0.728176340
#> MouseIgG2akappaisotype_PROT -0.3140363 0.83497701 0.28022795 -0.255256528
#> RatIgG2bkIsotype_PROT -0.3286045 -0.44936395 0.81239775 -0.006706201
#> PC5
#> mu.1 0.7679427
#> Mouse IgG2bkIsotype_PROT -0.5035476
#> MouseIgG1kappaisotype_PROT -0.2571707
#> MouseIgG2akappaisotype_PROT -0.2459898
#> RatIgG2bkIsotype_PROT -0.1733667
= g$x[ ,1]
dsb.technical.component
hist(dsb.technical.component, breaks = 40, col = r)
Note – the direction of the technical component is arbitrary; only the relative values are important as these are used in a linear model and the residuals are calculated (see below).
Fit linear model regressing protein expression on the technical component
We now fit a linear model to each cell regressing each protein’s expression on the dsb technical component. dsb uses the residual of this fit plus the intercept for each protein as the dsb normalized values.
# constructs a matrix of 1 by number of columns in norm_adt1
= as.matrix(dsb.technical.component)
covariate = matrix(1, ncol(norm_adt), 1)
design
# fit a linear model to solve the coefficient for each protein with QR decomposition.
<- limma::lmFit(norm_adt, cbind(design, covariate))
fit
# this subset just extracts the beta for each protein.
<- fit$coefficients[, -(1:ncol(design)), drop = FALSE]
beta is.na(beta)] <- 0
beta[
# beta is the coefficient for each prot.
plot(beta, col = r , pch = 16, xlab = 'protein')
calculate residual of linear model fit Finally we ‘regress out’ the technical component. These are the final dsb normalized values.
= as.matrix(norm_adt) - beta %*% t(covariate) denoised_adt_2
Note, the code above was modified from the limma function
removeBatchEffect
. Internally dsb uses this function to
regress out the technical component for convenience because it is very
robust and efficient.
# regress out the technical component using limma
# note this is how limma (and dsb) calculates this
= limma::removeBatchEffect(norm_adt, covariates = dsb.technical.component) denoised_adt
We can confirm these steps above are equivalent to what is output from the dsb function.
# default dsb call
= DSBNormalizeProtein(cell_protein_matrix = cell,
denoised_adt_3 empty_drop_matrix = neg,
denoise.counts = TRUE,
isotype.control.name.vec = isotype.control.name.vec)
#> [1] "correcting ambient protein background noise"
#> [1] "some proteins with low background variance detected check raw and normalized distributions. protein stats can be returned with return.stats = TRUE"
#> [1] "CD185_PROT"
#> [1] "fitting models to each cell for dsb technical component and removing cell to cell technical noise"
You can confirm that these are all the same
all.equal(denoised_adt, denoised_adt_2, denoised_adt_3)
Because of the transformation in step 1 , the scale of the values can be interpreted similar to signal to ambient noise ratios. With step 1 + 2, we can interpret the values as signal to noise ratios with technical, non-biological cell-to cell variations removed. This means we can set principled cutoffs, rather than just drawing arbitrary lines as what is positive and negative, we can say with confidence any cells with a protein level below, 3 or 3.5 (you can pick based on the distribution of values in your dataset) are negative for the protein.
These dsb normalized values can be used directly (for example, without further standardizing) protein based or joint protein and RNA clustering (see main vignette). We have also found these values are highly interpretable for use in machine learning and other regression approaches for modeling protein expression in single cell data measuring ADTs.
qplot(as.data.frame(t(denoised_adt))$CD3_PROT,
as.data.frame(t(denoised_adt))$CD19_PROT) +
+
plist ggtitle('dsb step 1 + 2 normalized and denoised') +
geom_vline(xintercept = 3.5, color = 'red') +
geom_hline(yintercept = 3.5, color = 'red')