snp_modifyBuild().
Better snp_ldsplit():
- also return $cost2, the sum of squared sizes of the blocks,
- for equivalent splits (with the same cost), now return the one that also minimizes cost2,
- now return unique splits only (e.g. could get equivalent splits with different max_size).
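The tie-breaking rule above can be illustrated with a small base-R sketch. It is not the package's implementation: split_costs() is a hypothetical helper, the correlation matrix is a toy example, and counting each pair of variants once is an assumption. The cost is taken to be the sum of squared correlations outside blocks, as this changelog describes it for max_cost.

```r
# Hypothetical helper (not part of {bigsnpr}): cost and cost2 of a given split.
# `corr` is a dense correlation matrix, `block_id` gives the block of each variant.
split_costs <- function(corr, block_id) {
  # pairs of variants in different blocks, each pair counted once (assumption)
  outside <- outer(block_id, block_id, "!=") & upper.tri(corr)
  sizes <- as.vector(table(block_id))
  list(
    cost  = sum(corr[outside]^2),  # sum of squared correlations outside blocks
    cost2 = sum(sizes^2)           # sum of squared sizes of the blocks
  )
}

# Toy example: only variants 1 and 2 are correlated
corr <- diag(4)
corr[1, 2] <- corr[2, 1] <- 0.8

res_a <- split_costs(corr, c(1, 1, 2, 2))  # blocks of sizes 2 and 2
res_b <- split_costs(corr, c(1, 1, 1, 2))  # blocks of sizes 3 and 1
```

Both splits leave no correlation outside the blocks, so they have the same cost; the balanced split has the smaller cost2 and would be the one preferred under the rule described above.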
Change defaults: delta from c(0.001, 0.005, 0.02, 0.1, 0.6, 3) to c(0.001, 0.01, 0.1, 1), nlambda from 20 to 30, maxiter from 500 to 1000.
snp_modifyBuild(): more variants should be mapped + add some QC on the mapping (a position is not mapped to more than one, the chromosome is the same, and possibly check whether we can go back to the initial position -> cf. https://doi.org/10.1093/nargab/lqaa054).
snp_ldsplit(): max_r2, the maximum squared correlation allowed outside blocks, and max_cost, the maximum cost of reported solutions (i.e. the sum of all squared correlations outside blocks). Using max_r2 offers an extra guarantee that the splitting is very good, and makes the function much faster by discarding lots of possible splits.
LDpred2-grid does not use OpenMP for parallelism anymore; it now simply uses multiple R processes.
LDpred2-grid and LDpred2-auto can now make use of set.seed() to get reproducible results. Note that LDpred2-inf and lassosum2 do not use any sampling.
Use scipen = 50 when writing files to turn off scientific format (e.g. for physical positions stored as double).
$add_columns().
Fix snp_readBGI() when using an outdated version of package {bit64}.
snp_cor() and bed_cor() now use less memory.
Remove parameter info from snp_cor() and bed_cor() because this correction is not useful after all.
snp_cor() and bed_cor() now return NaNs when e.g. the standard deviation is 0 (and warn about it). Before, these values were not reported (i.e. treated as 0).
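A short base-R sketch of why such correlations are undefined rather than zero. The vectors are toy data and the computation is the textbook Pearson formula, not the package's internal code.

```r
# Toy data: x is constant (e.g. a monomorphic variant), so its standard deviation is 0
x <- c(1, 1, 1)
y <- c(0, 1, 2)

# Pearson correlation = covariance / (sd(x) * sd(y)), which is 0 / 0 here
r <- cov(x, y) / (sd(x) * sd(y))
is.nan(r)
```

Reporting NaN makes these cases visible, whereas a value of 0 would be indistinguishable from a genuinely uncorrelated pair.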
snp_readBGI().
Fix snp_manhattan() when non-ordered (chr, pos) are provided.
snp_ancestry_summary() now allows estimating ancestry proportions after PCA projection (instead of directly using the allele frequencies).
Add function bed_cor() (similar to snp_cor() but with bed files/objects directly).
Add functions snp_ld_scores() and bed_ld_scores().
Add function snp_ancestry_summary() to estimate ancestry proportions from a cohort using only its summary allele frequencies.
Add function snp_scaleAlpha(), which is similar to snp_scaleBinom(), but has a parameter alpha that controls the relation between the scaling and the allele frequencies.
snp_cor() now also uses the upper triangle (@uplo = "U") when the sparse correlation matrix is diagonal, so that it is easier to use with e.g. as_SFBM().
Add parameter type in snp_asGeneticPos() to also be able to use interpolated genetic maps from here.
Add parameter return_flip_and_rev to snp_match() for whether to return internal boolean variables "_FLIP_" and "_REV_".
Add $perc_kept in the output of snp_ldsplit(), the percentage of initial non-zero values kept within the blocks defined.
snp_prodBGEN().
Add function snp_prodBGEN() to compute a matrix product between BGEN files and a matrix (or a vector). This removes the need to read an intermediate FBM object with snp_readBGEN() to compute the product. Moreover, when using dosages, they are not rounded to two decimal places anymore.
Trade new parameter num_iter_change for a simpler allow_jump_sign.
Change defaults in LDpred2-auto to use 500 burn-in iterations (was 1000 before) followed by 200 iterations (500 before). Such a large number of iterations is usually not really needed.
as_SFBM(corr0, compact = TRUE). Make sure to reinstall {bigsnpr} after updating to {bigsparser} v0.5.
Prepare for incoming paper on (among other things) improved robustness of LDpred2-auto:
- add parameter shrink_corr to shrink off-diagonal elements of the LD matrix,
- add parameter num_iter_change to control when starting to shrink the variants that change sign too much,
- also return corr_est, the “imputed” correlations between variants and phenotypes, which can be used for post-QCing variants by comparing those to beta / sqrt(n_eff * beta_se^2 + beta^2).
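The post-QC comparison mentioned for corr_est might look like the following base-R sketch. All numbers are made up for illustration, the corr_est vector only stands in for what LDpred2-auto would return, and the 0.02 cutoff is an arbitrary assumption rather than a recommended value.

```r
# Toy summary statistics (made-up values, for illustration only)
beta    <- c(0.10, -0.05, 0.02)
beta_se <- c(0.02, 0.02, 0.02)
n_eff   <- 10000

# Reference quantity given in the changelog entry above
corr_gwas <- beta / sqrt(n_eff * beta_se^2 + beta^2)

# Hypothetical corr_est values, standing in for the output of LDpred2-auto
corr_est <- c(0.048, -0.024, 0.080)

# Flag variants where the two correlations disagree (0.02 is an arbitrary cutoff)
to_check <- which(abs(corr_est - corr_gwas) > 0.02)
```

Variants listed in to_check would be candidates for closer inspection or removal before rerunning the model.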
Replace parameter s by delta in snp_lassosum2(). This new parameter delta better reflects that the lassosum model also uses L2-regularization (therefore, elastic-net regularization).
Now detect strong divergence in lassosum2 and LDpred2-grid, and return missing values for the corresponding effect sizes.
Fix snp_ldsc() when using blocks with different sizes.
Add parameter info to snp_cor() to correct correlations when they are computed from imputed dosage data.
snp_readBGEN() now also returns frequencies and imputation INFO scores.
Add parameter rsid to snp_asGeneticPos() to also allow matching with rsIDs.
Add function snp_lassosum2() to train the lassosum models using the exact same input data as LDpred2.
Add parameter report_step in snp_ldpred2_auto() to report some of the internal sampling betas.
Fix snp_readBGEN() when using BGEN files containing ~.
thr_r2 in snp_cor().
snp_ldsplit(). Instead, report the best splits for a range of numbers of blocks desired.
snp_ldsplit() now makes more sense. Also fix a small bug that prevented splitting the last block in some cases.
Add function snp_ldsplit() for optimally splitting variants in nearly independent blocks of LD.
Add option file.type = "--gzvcf" for using gzipped VCF in snp_plinkQC().
snp_assocBGEN(); prefer reading small parts with snp_readBGEN() as a temporary bigSNP object and do the association test with e.g. big_univLinReg().
Add function snp_thr_correct() for correcting for winner’s curse in summary statistics when using p-value thresholding.
Use a better formula for the scale in LDpred2, useful when there are some variants with very large effects (e.g. explaining more than 10% of phenotypic variance).
Simplify LDpred2; there was not really any need for initialization and ordering of the Gibbs sampler.
Add parameter return_sampling_betas in snp_ldpred2_grid() to return all sampling betas (after burn-in), which is useful for assessing the uncertainty of the PRS at the individual level (see https://doi.org/10.1101/2020.11.30.403188).
$postp_est, $h2_init and $p_init in LDpred2-auto.
snp_readBGEN() to make sure of the expected format.
Add function snp_fst() for computing Fst.
Fix error could not find function "ldpred2_gibbs_auto".
Use as_SFBM(corr0) instead of bigsparser::as_SFBM(as(corr0, "dgCMatrix")). This should also use less memory and be faster.
Add parameter sparse to enable getting also a sparse solution in LDpred2-auto.
bigsparser::as_SFBM().
01 or 1 for chromosomes in BGI files.
snp_match(). Also now remove duplicates by default.
All 3 LDpred2 functions now use an SFBM as input format for the correlation matrix.
Allow for multiple initial values for p in snp_ldpred2_auto().
Add function coef_to_liab() for e.g. converting heritability to the liability scale.
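As a rough illustration of what such a conversion involves, here is a base-R sketch of the commonly used liability-scale transformation for case-control studies. The helper liab_coef_sketch(), its argument names and the input values are hypothetical, and this is not claimed to be the code or exact signature of coef_to_liab().

```r
# Hypothetical helper (not the package function): multiplicative coefficient that
# converts an observed-scale heritability to the liability scale.
# K_pop: prevalence in the population; K_gwas: proportion of cases in the study.
liab_coef_sketch <- function(K_pop, K_gwas = 0.5) {
  z <- dnorm(qnorm(K_pop))  # standard normal density at the liability threshold
  (K_pop * (1 - K_pop) / z)^2 / (K_gwas * (1 - K_gwas))
}

h2_obs  <- 0.2  # made-up observed-scale heritability
h2_liab <- h2_obs * liab_coef_sketch(K_pop = 0.1, K_gwas = 0.5)
```

The coefficient depends only on the two prevalences, so it can be computed once and applied to any estimate obtained on the observed scale.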
Change the default of parameter alpha of function snp_cor() to 1.
Add functions snp_ldpred2_inf(), snp_ldpred2_grid() and snp_ldpred2_auto() for running the new LDpred2-inf, LDpred2-grid and LDpred2-auto.
Add functions snp_ldsc() and snp_ldsc2() for performing LD score regression.
Add function snp_asGeneticPos() for transforming physical positions to genetic positions.
Add function snp_simuPheno() for simulating phenotypes.
snp_pcadapt(), bed_pcadapt(), snp_readBGEN() and snp_fastImputeSimple().
Parallelization of clumping algorithms has been modified. Before, chromosomes were processed in parallel. Now, chromosomes are processed sequentially, but computations within each chromosome are performed in parallel thanks to OpenMP. This should prevent major slowdowns for very large sample sizes (due to swapping).
Use OpenMP to parallelize other functions as well (possibly only sequential until now).
Can now run snp_cor() in parallel.
Parallelization of snp_fastImpute() has been modified. Before this version, chromosomes were imputed in parallel. Now, chromosomes are processed sequentially, but computation of correlation between variants and XGBoost models are performed using parallelization.
Add function snp_subset() as alias of method subset() for subsetting bigSNP objects.
Use bed_light internally to make parallel algorithms faster because they have to transfer less data to clusters. Also define differently functions used in big_parallelize() for the same reason.
Fix error object 'obj.bed' not found in snp_readBed2().
Add parameter backingfile to subset.bigSNP().
Add parameter byrow to bed_counts().
Add memory-mapping on PLINK (.bed) files with missing values + new functions:
bed()
bed_MAF()
bed_autoSVD()
bed_clumping()
bed_counts()
bed_cprodVec()
bed_pcadapt()
bed_prodVec()
bed_projectPCA()
bed_projectSelfPCA()
bed_randomSVD()
bed_scaleBinom()
bed_tcrossprodSelf()
download_1000G()
snp_modifyBuild()
snp_plinkKINGQC()
snp_readBed2()
sub_bed()
Add 3 parameters to autoSVD(): alpha.tukey, min.mac and max.iter.
Remove option for changing ploidy (that was only partially supported).
Automatically apply snp_gc() to pcadapt.
snp_fastImputeSimple(): fast imputation via mode, mean or sampling according to allele frequencies.
Fix snp_readBGEN() that could not handle duplicated variants or individuals.
snp_grid_PRS() now stores not only the FBM, but also the input parameters as attributes (the whole result basically).
Add 3 SCT functions snp_grid_*() to improve from Clumping and Thresholding (preprint coming soon).
Add snp_match() function to match between summary statistics and some SNP information.
is.size.in.bp is deprecated.
Add parameter read_as for snp_readBGEN(). It is now possible to sample BGEN probabilities as random hard calls using read_as = "random". Default remains reading probabilities as dosages.
For memory-mapping, now use mio instead of boost.
snp_clumping() (and snp_autoSVD()) now has a size that is inversely proportional to thr.r2.
snp_pruning() is deprecated (and will be removed someday); now always use snp_clumping().
Add function snp_assocBGEN() for computing quick association tests from BGEN files. Could be useful for quick screening of useful SNPs to read in bigSNP format. This function might be improved in the future.
Add function snp_readBGEN() to read UK Biobank BGEN files in bigSNP format.
Add parameter is.size.in.bp to snp_autoSVD() for the clumping part.
Change the threshold of outlier detection in snp_autoSVD() (it now detects fewer outliers). See the documentation details if you don’t have any information about SNPs.
Add function snp_gene (as a gist) to get genes corresponding to ‘rs’ SNP IDs thanks to package {rsnps} from rOpenSci. See README.
snp_fastImpute. Also store information in an FBM (instead of a data frame) so that imputation can be done by parts (you can stop the imputation by killing the R processes and come back to it later). Note that the defaults used in the Bioinformatics paper were alpha = 0.02 and size = 500 (instead of 1e-4 and 200 now, respectively). These new defaults are more stringent on the SNPs that are used, which makes the imputation faster (30 min instead of 42-48 min), without impacting accuracy (still 4.7-4.8% of errors).