fdrci: FDR selection and adjustment - HGSOC

Joshua Millstein

6/4/2022

Introduction

In the context of high grade serous ovarian cancer (HGSOC), we might be interested in more fully understanding the relationship between age, cancer stage and gene expression in tumor tissue. These questions could be addressed in data hosted by Gene Expression Omnibus (GEO) and collected as part of our study (Millstein et al. 2020) involving 513 gene expression features measured in formalin-fixed paraffin-embedded (FFPE) tumor tissue from 3,769 women with HGSOC.

Gene expression is often approximately normally distributed in population data, so it would make sense to apply linear regression in a separate model for each gene, where expression is the outcome and age and stage are the predictors. If we are interested in the question of whether there is effect modification, for example, if the effect of stage on expression varies across different age groups, we could include an interaction term between age and stage. A p-value for the interaction could be computed for each gene using an F-test, and multiplicity in testing could be address by computing FDR, such as the Benjamini and Hochberg (BH) approach (Benjamini and Hochberg 1995).

However, we might be concerned about the assumption of normality of error, and with many genes, this is could be hard to verify. To avoid this assumption, we can take a non-parametric permutation-based approach (Millstein and Volfson 2013) by randomly permuting the age*stage product term to generate sets of results distributed approximately under the null hypothesis. The fdrci R package, which includes an approach to account for multiplicity in FDR CIs (Millstein et al. 2022), could then be applied. Essentially, the investigator chooses a series of possible discovery thresholds, and at each threshold FDR is computed along with a p-value that tests \(\text{H}_0: FDR = 1\) against the alternative, \(\text{H}_1: FDR < 1\). The Benjamini and Yekutieli (BY) approach (Benjamini and Yekutieli 2005) via the BH approach is used to select thresholds corresponding to rejected null hypotheses and to adjust selected FDR CIs for multiplicity. Thus, the investigator is free to choose post hoc from any one (or multiple) of the selected discovery thresholds while maintaining adequate control of FDR CI coverage.

Organization

  1. Download HGSOC data from GEO
  2. Conduct analysis of effect modification between age and stage
  3. Generate multiple replicates of results under the null by permutation
  4. Compute FDR using the fdrci package with the “BH” option to account for multiplicity
  5. Final thoughts

First we load the libraries

# download and install fdrci from github or CRAN
# https://uscbiostats.github.io/fdrci
# library(devtools)
# install_github("USCbiostats/fdrci")
library(fdrci)
# if (!requireNamespace("BiocManager", quietly = TRUE))
#  install.packages("BiocManager")
# BiocManager::install("GEOquery")
library(GEOquery)
#> Warning: package 'GEOquery' was built under R version 4.1.2
#> Loading required package: Biobase
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     Filter, Find, Map, Position, Reduce, anyDuplicated, append,
#>     as.data.frame, basename, cbind, colnames, dirname, do.call,
#>     duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#>     lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#>     pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,
#>     tapply, union, unique, unsplit, which.max, which.min
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Setting options('download.file.method.GEOquery'='auto')
#> Setting options('GEOquery.inmemory.gpl'=FALSE)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:Biobase':
#> 
#>     combine
#> The following objects are masked from 'package:BiocGenerics':
#> 
#>     combine, intersect, setdiff, union
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#library(stringr)
library(foreach)
#> Warning: package 'foreach' was built under R version 4.1.2
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.1.2

set.seed(052422)

Download data from GEO

The GEO id for the dataset is GSE132342. We can use it along with the R package GEOquery to get the Nanostring panel of gene expression features as well as some limited patient information including age group (based on quartiles of the training dataset with groups aged <53, 53-59, 60-66, and >=67), stage (dichotomized into early (FIGO stage I/II) and advanced (FIGO stage III/IV)).

gset <- getGEO("GSE132342", GSEMatrix =TRUE, getGPL=TRUE)
#> Found 1 file(s)
#> GSE132342_series_matrix.txt.gz
gset = gset$GSE132342_series_matrix.txt.gz

pdat = phenoData(gset)
pdat.p = pData(pdat)
pdat.m = varMetadata(pdat)

fdat = featureData(gset)
fdat.n = featureNames(fdat)
fdat.m = varMetadata(fdat)
fdat.p = pData(fdat) # includes gene id's, target sequence, and 
#                      gene names (mostly HUGO gene symbols)

edat = exprs(gset) # normalized gene expression, genes in rows 
#                    samples in columns

rownames(edat) = fdat.p$Customer.Identifier

# Transform and combine phenotype data with expression data
pemat <- cbind(pdat.p, t(edat)) %>%
  mutate(time.os = as.numeric(`time.last.fu:ch1`),
         event.os = as.numeric(`status:ch1`),
         site = factor(`site:ch1`),
         stage = `Stage:ch1`,
         age = factor(`age:ch1`)) %>% 
  filter(stage %in% c(1, 2)) %>% 
  select(time.os, event.os, site, stage, age, fdat.p$Customer.Identifier[1]:fdat.p$Customer.Identifier[513])

Age*stage effect modification analysis

Here we fit a linear model for each of the 513 gene expression features. The outcome is expression and the covariates are site (an indicator of the component study where a portion of the data were collected), age group, stage group, and an interaction between age and stage as defined by a factor variable we create by combining levels of age and stage. The statistic of interest is a p-value from an F-test of the interaction term. The output is a dataframe, rslts, that includes the gene name in the first column and the p-value in the second column.

# create interaction term manually
pemat[, "ageStage"] = factor(paste(as.character(pemat[,"age"]), as.character(pemat[,"stage"]), sep="_"), levels=c("q4_2", "q1_1", "q1_2", "q2_1", "q2_2", "q3_1", "q3_2", "q4_1"))
# Run interaction scan
rslts <- foreach(j = 1:length(fdat.p$Customer.Identifier), .combine = rbind) %do% {
  fmla = paste(fdat.p$Customer.Identifier[j], "~ site + age + stage + ageStage")
  fit = lm(fmla, data=pemat)
  c(fdat.p$Customer.Identifier[j], anova(fit)["ageStage", "Pr(>F)"])
}
colnames(rslts) = c("gene", "p_value")
rslts = as.data.frame(rslts)
rslts[,"p_value"] = as.numeric(rslts[,"p_value"])
rslts.obs = rslts

Compute BH FDR

We can use the p.adjust() function to compute a q-value, which essentially treats each p-value as a discovery threshold for FDR (Storey and Tibshirani 2003). Typically, a priori values such as 0.05 or 0.1 are then used to determine which q-values correspond to false null hypotheses, here, genes that should be considered discoveries. We see that no genes meet q < 0.05 but two genes meet the q < 0.1 level.

rslts.obs[,"q_value"] = p.adjust(rslts.obs[,"p_value"], method="BH")
knitr::kable(rslts.obs[rslts.obs$q_value < 0.1,], caption = "Parametic approach: BH FDR < 0.1", digits=4, row.names=FALSE)
Parametic approach: BH FDR < 0.1
gene p_value q_value
IGFBP4 3e-04 0.0884
CAV1 1e-04 0.0705

This result raises a troubling issue for this more conventional approach. The investigator will not know a priori which threshold, 0.1 or 0.05, is more appropriate, yet to be formally correct they need to choose before the analysis is conducted. If they are lucky enough to choose 0.1, the results yield two significant genes, however, if they happen to choose 0.05, then they must conclude that there are no discoveries. This highlights an important motivation for the proposed approach, which allows the investigator to choose the discovery threshold post hoc.

Repeatly conduct analysis following permutation

Another motivation for the proposed approach is it’s non-parametric construction, which allows the investigator to relax distributional assumptions. Rather than evaluate the observed results against a parametric distribution, we approximate the distribution under the null hypothesis using the data itself.

Here the null hypothesis is that the age-stage interaction term is independent of gene expression conditional on the covariates site, age, and stage. Below, we generate p-values consistent with this null by randomly permuting the ageStage interation term. Thus, 100 sets of results distributed approximately under the null are generated.

n.perm = 100
perml = vector('list', n.perm)
tmpvar = factor(paste(as.character(pemat[,"age"]), as.character(pemat[,"stage"]), sep="_"),
                levels=c("q4_2", "q1_1", "q1_2", "q2_1", "q2_2", "q3_1", "q3_2", "q4_1"))
for(perm in 1:n.perm){
  rslts <- foreach(j = 1:length(fdat.p$Customer.Identifier), .combine = rbind) %do% {
    pemat[, "ageStage"] = sample(tmpvar)
    fmla = paste(fdat.p$Customer.Identifier[j], "~ site + age + stage + ageStage")
    fit = lm(fmla, data=pemat)
    c(fdat.p$Customer.Identifier[j], anova(fit)["ageStage", "Pr(>F)"])
  }
  colnames(rslts) = c("gene", "p_value")
  rslts = as.data.frame(rslts)
  rslts[,"p_value"] = as.numeric(rslts[,"p_value"])
  perml[[perm]] = rslts
}

Compute FDR using the fdrci package

To use the fdrci package to compute FDR, the investigator must provide candidate discovery thresholds. In this case the test statistic is the p-value itself, and we choose values that correspond to the series, \(-log_{10}(p) = 2,2.1,2.2,...,5\). Each row of the output table includes the permutation-based FDR estimate, 95% CIs, \(\pi_0\), over-dispersion estimate, number of discoveries in the observed data, and total number of discoveries in the permuted data (summed over all permutation results).

First we will estimate FDR and 95% CIs without accounting for multiplicity

We see that for all of our candidate discovery thresholds the upper FDR CI bound is less than one, indicating evidence of true alternative hypotheses within all of these sets. However, note that there is only one test corresponding to \(FDR < 0.1\). If all parametric assumptions are satisfied this approach should be less conservative than BH FDR. The fact that it is more conservative here may indicated mild departures from parametric assumptions.

We are still faced with the issue of selecting one or more FDR CIs without accounting for the selection process. Thus, we recompute using the “BH” option.

mytbl = fdrTbl(rslts.obs$p_value,perml,"p_value",nrow(rslts.obs),2,5)
mytbl = mytbl[!is.na(mytbl$fdr), ]
knitr::kable(mytbl, caption = "FDR: no multiplicity adjustment",
             digits=3, row.names=FALSE)
FDR: no multiplicity adjustment
threshold fdr ll ul pi0 odp S Sp
2.0 0.474 0.253 0.886 0.989 1.076 11 527
2.1 0.461 0.227 0.937 0.991 1.131 9 419
2.2 0.377 0.193 0.735 0.989 1.000 9 343
2.3 0.349 0.172 0.708 0.990 1.000 8 282
2.4 0.324 0.152 0.691 0.991 1.000 7 229
2.5 0.252 0.117 0.540 0.990 1.009 7 178
2.6 0.195 0.089 0.427 0.989 1.054 7 138
2.7 0.187 0.082 0.426 0.990 1.000 6 113
2.8 0.214 0.078 0.585 0.994 1.000 4 86
2.9 0.184 0.067 0.505 0.994 1.000 4 74
3.0 0.209 0.065 0.668 0.995 1.000 3 63
3.1 0.249 0.060 1.000 0.997 1.000 2 50
3.2 0.189 0.046 0.787 0.997 1.000 2 38
3.3 0.140 0.033 0.594 0.997 1.016 2 28
3.4 0.120 0.026 0.546 0.997 1.105 2 24
3.5 0.200 0.024 1.000 0.998 1.112 1 20
3.6 0.130 0.017 0.994 0.998 1.000 1 13
3.7 0.110 0.014 0.852 0.998 1.000 1 11
3.8 0.080 0.010 0.640 0.998 1.000 1 8

Next we use the BH option for the FDR selection and adjustment approach described in our paper (Millstein et al. 2022).

We are not restricted to the 0.1 threshold, so for example, we may decide that FDR around 0.2 is low enough to be worth following up, in which case, we could use the threshold of \(-log_{10}(p) = 2.6\), yielding \(FDR = 0.195 (0.089, 0.427)\) and 7 discoveries.

mytbl1 = fdrTbl(rslts.obs$p_value,perml,"p_value",nrow(rslts.obs),2,5,correct="BH")
mytbl1 = mytbl1[!is.na(mytbl1$fdr), ]
knitr::kable(mytbl1, caption = "FDR: with multiplicity adjustment",
             digits=3, row.names=FALSE)
FDR: with multiplicity adjustment
threshold fdr ll ul pi0 odp S Sp
2.0 0.474 0.251 0.892 0.989 1.076 11 527
2.1 0.461 0.225 0.944 0.991 1.131 9 419
2.2 0.377 0.192 0.741 0.989 1.000 9 343
2.3 0.349 0.170 0.714 0.990 1.000 8 282
2.4 0.324 0.151 0.697 0.991 1.000 7 229
2.5 0.252 0.116 0.545 0.990 1.009 7 178
2.6 0.195 0.088 0.431 0.989 1.054 7 138
2.7 0.187 0.081 0.430 0.990 1.000 6 113
2.8 0.214 0.077 0.591 0.994 1.000 4 86
2.9 0.184 0.066 0.511 0.994 1.000 4 74
3.0 0.209 0.065 0.677 0.995 1.000 3 63
3.1 0.249 0.059 1.000 0.997 1.000 2 50
3.2 0.189 0.045 0.800 0.997 1.000 2 38
3.3 0.140 0.032 0.604 0.997 1.016 2 28
3.4 0.120 0.026 0.556 0.997 1.105 2 24
3.5 0.200 NA NA 0.998 1.112 1 20
3.6 0.130 0.017 1.000 0.998 1.000 1 13
3.7 0.110 0.014 0.873 0.998 1.000 1 11
3.8 0.080 0.010 0.655 0.998 1.000 1 8
knitr::kable(rslts.obs[rslts.obs$p_value < 10^(-2.6),], caption = "Parametic approach: BH FDR < 0.2", digits=4, row.names=FALSE)
Parametic approach: BH FDR < 0.2
gene p_value q_value
HNF1B 0.0021 0.1552
SORBS3 0.0017 0.1418
ANXA4 0.0017 0.1418
IGFBP4 0.0003 0.0884
CAV1 0.0001 0.0705
QPRT 0.0010 0.1418
NUP85 0.0012 0.1418

Why is there no difference between the adjusted and unadjusted results?

The proposed selection and adjustment approach relies on the BH method in selection of discovery threshold FDR confidence intervals. Recall that one of the properties of the BH approach is that if all p-values are less than the target type I error, , then the adjusted significance level is the same as the unadjusted level. We are observing this dynamic here. That is, all candidate thresholds have sufficient evidence of \(FDR < 1\) to justify selection, thus there is no adjustment in this case.

Final thoughts

If we look at violin plots of expression of the most significant gene, Caveolin-1 (CAV1), against age and stage groups, we see clear differences by stage, with expression in stage = 1 tumors lower than expression in stage = 2 tumors. However, there is also a suggestion of trends with age that appears to reverse across stage groups. For stage = 1 tumors, expression tends to decrease with age, whereas for stage = 2 tumors, expression tends to increase with age.

p1 = ggplot( pemat, aes( x=stage, y=CAV1, fill=age) ) + 
  xlab("stage.age") + ylab("CAV1 expression") + theme_bw() +
  geom_violin(position=position_dodge(width = 0.8)) + 
  geom_jitter(aes(group=age), size=.02, position=position_jitterdodge(dodge.width=0.8, jitter.width=0.05)) + 
  labs(title="") + 
  geom_boxplot(width=0.1, notch=TRUE, outlier.shape = NA, position=position_dodge(width = 0.8)) +
  theme(axis.text.x=element_text(angle=90, hjust=1))
p1

sessionInfo()
#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] ggplot2_3.3.6       foreach_1.5.2       dplyr_1.0.7        
#> [4] GEOquery_2.62.2     Biobase_2.54.0      BiocGenerics_0.40.0
#> [7] fdrci_2.3          
#> 
#> loaded via a namespace (and not attached):
#>  [1] tidyselect_1.1.1  xfun_0.31         bslib_0.3.1       purrr_0.3.4      
#>  [5] colorspace_2.0-3  vctrs_0.3.8       generics_0.1.0    htmltools_0.5.2  
#>  [9] yaml_2.2.1        utf8_1.2.2        rlang_1.0.2       R.oo_1.24.0      
#> [13] jquerylib_0.1.4   pillar_1.6.4      R.utils_2.11.0    glue_1.4.2       
#> [17] withr_2.4.2       lifecycle_1.0.1   stringr_1.4.0     munsell_0.5.0    
#> [21] gtable_0.3.0      R.methodsS3_1.8.1 codetools_0.2-18  evaluate_0.14    
#> [25] labeling_0.4.2    knitr_1.36        tzdb_0.3.0        fastmap_1.1.0    
#> [29] curl_4.3.2        fansi_0.5.0       highr_0.9         readr_2.1.2      
#> [33] scales_1.2.0      limma_3.50.3      jsonlite_1.7.2    farver_2.1.0     
#> [37] hms_1.1.1         digest_0.6.28     stringi_1.7.5     grid_4.1.1       
#> [41] cli_3.0.1         tools_4.1.1       magrittr_2.0.1    sass_0.4.1       
#> [45] tibble_3.1.5      crayon_1.4.1      tidyr_1.2.0       pkgconfig_2.0.3  
#> [49] ellipsis_0.3.2    data.table_1.14.2 xml2_1.3.2        rmarkdown_2.14   
#> [53] iterators_1.0.14  R6_2.5.1          compiler_4.1.1
Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal Article. Journal of the Royal Statistical Society: Series B (Methodological) 57.1: 289–300.
Benjamini, Yoav, and Daniel Yekutieli. 2005. “False Discovery Rateadjusted Multiple Confidence Intervals for Selected Parameters.” Journal Article. Journal of the American Statistical Association 100 (469): 71–81.
Millstein, Joshua, Francesca Battaglin, Hiroyuki Arai, Wu Zhang, Priya Jayachandran, Shivani Soni, Aparna R Parikh, Christoph Mancao, and Heinz-Josef Lenz. 2022. Fdrci: FDR Confidence Interval Selection and Adjustment for Large-Scale Hypothesis Testing. Bioinformatics Advances.
Millstein, Joshua, Timothy Budden, Ellen L Goode, Michael S Anglesio, Aline Talhouk, Maria P Intermaggio, HS Leong, et al. 2020. “Prognostic Gene Expression Signature for High-Grade Serous Ovarian Cancer.” Annals of Oncology 31 (9): 1240–50.
Millstein, Joshua, and Dmitri Volfson. 2013. “Computationally Efficient Permutation-Based Confidence Interval Estimation for Tail-Area FDR.” Frontiers in Genetics 4: 179.
Storey, John D, and Robert Tibshirani. 2003. “Statistical Significance for Genomewide Studies.” Proceedings of the National Academy of Sciences 100 (16): 9440–45.