similarity_graph. If you are more interested in the high-dimensional
graph/fuzzy simplicial set representation of your input data, and don’t
care about the low-dimensional approximation, the similarity_graph
function offers a similar API to umap, but neither the initialization nor
optimization of low-dimensional coordinates will be performed. The return
value is the same as that which would be returned in the results list as
the fgraph member if you had provided ret_extra = c("fgraph"). Compared to
getting the same result via running umap, this function is a bit more
convenient to use, makes your intention clearer if you would be
discarding the embedding, and saves a small amount of time. A
t-SNE/LargeVis similarity graph can be returned by setting
method = "largevis".

umap_transform with pre-generated nearest neighbors (also the error
message was completely useless). Thank you to AustinHartman for reporting
this (https://github.com/jlmelville/uwot/issues/97).

fuzzy_simplicial_set refactored to behave more like that of previous
versions. This change was breaking the behavior of the CRAN package
bbknnR.

dens_weight. If set to a value between 0
and 1, an attempt is made to include the relative local densities of the
input data in the output coordinates. This is an approximation to the
densMAP method. A large value of dens_weight will use a larger range of
output densities to reflect the input data. If the data is too spread
out, reduce the value of dens_weight. For more information see the
documentation at the uwot repo.

binary_edge_weights. If set to TRUE, instead of smoothed knn distances,
non-zero edge weights all have a value of 1. This is how PaCMAP works and
there are practical and theoretical reasons to believe this won’t have a
big effect on UMAP, but you can try it yourself.

ret_extra:

"sigma": the return value will contain a sigma entry, a vector of the
smooth knn distance scaling normalization factors, one for each
observation in the input data. A small value indicates a high density of
points in the local neighborhood of that observation. For lvish the
equivalent bandwidths calculated for the input perplexity are returned.

rho will be exported, which is the distance to the nearest neighbor after
the number of neighbors specified by the local_connectivity. Only applies
for umap and tumap.

"localr": exports a vector of the local radii, the sum of sigma and rho,
used to scale the output coordinates when dens_weight is set. Even if not
using dens_weight, visualizing the output coordinates using a color scale
based on the value of localr can reveal regions of the input data with
different densities.

umap and tumap only: new
data type for precomputed nearest neighbor data passed as the
nn_method parameter: you may use a sparse distance matrix of format
dgCMatrix with dimensions N x N where N is the number of observations in
the input data. Distances should be arranged by column, i.e. a non-zero
entry in row j of the i-th column indicates that the j-th observation in
the input data is a nearest neighbor of the i-th observation with the
distance given by the value of that element. Note that this is a
different format to the sparse distance matrix that can be passed as
input to X: notably, the matrix is not assumed to be symmetric. Unlike
other input formats, you may have a different number of neighbors for
each observation (but there must be at least one neighbor defined per
observation).

umap_transform can also take a sparse distance matrix as its nn_method
parameter if precomputed nearest neighbor data is used to generate an
initial model. The format is the same as for the nn_method with umap.
Because distances are arranged by columns, the expected dimensions of the
sparse matrix are N_model x N_new where N_model is the number of
observations in the original data and N_new is the number of observations
in the data to be transformed.

n_components = 100 or higher, RSpectra is recommended and will likely
out-perform irlba even if you have installed a good linear algebra
library.

init = "laplacian" returned the wrong coordinates because of a slightly
subtle issue around how to order the eigenvectors when using the random
walk transition matrix rather than normalized graph laplacians.

init_sdev parameter was ignored when the init parameter was a
user-supplied matrix. Now the input will be scaled.

bandwidth parameter has been changed to give results more like the
current version (0.5.2) of the Python UMAP implementation. This is likely
to be a breaking change for non-default settings of bandwidth, but this
is not a parameter which is actually exposed by the Python UMAP public
API any more, so it
is on the road to deprecation in uwot too and I don’t recommend
you change this.

batch. If TRUE, then results are reproducible when n_sgd_threads > 1 (as
long as you use set.seed). The price to be paid is that the optimization
is slightly less efficient (because coordinates are not updated as
quickly and hence gradients are staler for longer), so it is highly
recommended to set n_epochs = 500 or higher. Thank you to Aaron Lun who
not only came up with a way to implement this feature, but also wrote an
entire C++ implementation of UMAP which does it
(https://github.com/jlmelville/uwot/issues/83).

opt_args. The default optimization method when batch = TRUE is Adam. You
can control its parameters by passing them in the opt_args list. As Adam
is a momentum-based method it requires extra storage of previous gradient
data. To avoid the extra memory overhead you can also use
opt_args = list(method = "sgd") to use a stochastic gradient descent
method like that used when batch = FALSE.

epoch_callback. You may now pass a function which will be invoked at the
end of each epoch. Mainly useful for producing an image of the state of
the embedding at different points during the optimization. This is
another feature taken from umappp.

pca_method, used when the pca parameter is supplied to reduce the initial
dimensionality of the data. This controls which method is used to carry
out the PCA and can be set to one of:

"irlba", which uses irlba::irlba to calculate a truncated SVD. If this
routine deems that you are trying to extract 50% or more of the singular
vectors, you will see a warning to that effect logged to the console.

"rsvd", which uses irlba::svdr for truncated SVD. This method uses a
small number of iterations which should give an accuracy/speed-up
trade-off similar to that of the scikit-learn TruncatedSVD method. This
can be much faster than using "irlba" but potentially at a cost in
accuracy. However, for the purposes of dimensionality reduction as input
to nearest neighbor search, this doesn’t seem to matter much.

"bigstatsr", which uses the bigstatsr package. Note that this is not a
dependency of uwot. If you want to use bigstatsr, you must install it
yourself. On platforms without easy access to fast linear algebra
libraries (e.g. Windows), using bigstatsr may give a speed up to PCA
calculations.

"svd", which uses base::svd. Warning: this is likely to be very slow for
most datasets and exists as a fallback for small datasets where the
"irlba" method would print a warning.

"auto" (the default), which uses "irlba" to calculate a truncated SVD,
unless you are attempting to extract 50% or more of the singular vectors,
in which case "svd" is used.

ret_nn = TRUE. If the
names exist in more than one of the input data parameters listed above,
but are inconsistent, no guarantees are made about which names will be
used. Thank you jwijffels for reporting this.

umap_transform, the learning rate is now down-scaled by a factor of 4,
consistent with the Python implementation of UMAP. If you need the old
behavior back, use the (newly added) learning_rate parameter in
umap_transform to set it explicitly. If you used the default value in
umap when creating the model, the correct setting in umap_transform is
learning_rate = 1.0.

nn_method = "annoy" and verbose = TRUE would lead to an error with
datasets with fewer than 50 items in them.

umap_transform (this was incorrectly documented to work).

umap_transform was wrong in other ways: it has now been corrected to
indicate that there should be neighbor data for each item in the test
data, but the neighbors and distances should refer to items in training
data (i.e. the data used to build the model).

n_neighbors parameter is now correctly ignored in model generation if
pre-calculated nearest neighbor data is provided.

grain_size didn’t do anything.

This release is mainly to allow for some internal changes to keep
compatibility with RcppAnnoy, used for the nearest neighbor calculations.
umap and tumap now note that the contents of the model list are subject
to change and not intended to be part of the uwot public API. I recommend
not relying on the structure of the model, especially if your package is
intended to appear on CRAN or Bioconductor, as any breakages will delay
future releases of uwot to CRAN.

metric = "correlation", a distance based on the Pearson correlation
(https://github.com/jlmelville/uwot/issues/22). Supporting this required
a change to the internals of how nearest neighbor data is stored.
Backwards compatibility with models generated by previous versions using
ret_model = TRUE should have been preserved.

nn_method, for umap_transform: pass a list containing pre-computed
nearest neighbor data (identical to that used in the umap function). You
should not pass anything to the X parameter in this case. This extends
the functionality for transforming new points to the case where nearest
neighbor data between the original data and new data can be calculated
external to uwot. Thanks to Yuhan Hao for contributing the PR
(https://github.com/jlmelville/uwot/issues/63 and
https://github.com/jlmelville/uwot/issues/64).

init, for umap_transform:
provides a variety of options for initializing the output coordinates,
analogously to the same parameter in the umap function (but without as
many options currently). This is intended to replace init_weighted, which
should be considered deprecated, but won’t be removed until uwot 1.0
(whenever that is). Instead of init_weighted = TRUE, use
init = "weighted"; replace init_weighted = FALSE with init = "average".
Additionally, you can pass a matrix to init to act as the initial
coordinates.

umap_transform: previously, setting n_epochs = 0 was ignored: at least
one iteration of optimization was applied. Now, n_epochs = 0 is
respected, and will return the initialized coordinates without any
further optimization.

verbose = TRUE: the progress bar calculations were taking up a detectable
amount of time and this has now been fixed. With very small data sets
(< 50 items) the progress bar will no longer appear when building the
index.

n_threads is now NULL to provide a bit more protection from changing
dependencies.

grain_size parameter has been undeprecated. As the version that
deprecated this never made it to CRAN, this is unlikely to have affected
many people.

grain_size parameter is now ignored and remains only to avoid breaking
backwards compatibility.

ret_extra, a vector which can contain any combination of: "model" (same
as ret_model = TRUE), "nn" (same as ret_nn = TRUE) and "fgraph" (see
below).

ret_extra vector contains "fgraph", the returned list will contain an
fgraph item representing the fuzzy simplicial input graph as a sparse
N x N matrix. For lvish, use "P" instead of "fgraph"
(https://github.com/jlmelville/uwot/issues/47). Note that
there is a further sparsifying step where edges with a very low
membership are removed if there is no prospect of the edge being sampled
during optimization. This is controlled by n_epochs: the smaller the
value, the more sparsifying will occur. If you are only interested in the
fuzzy graph and not the embedded coordinates, set n_epochs = 0.

unload_uwot, to unload the Annoy nearest neighbor indices in a model.
This prevents the model from being used in umap_transform, but allows for
the temporary working directory created by both save_uwot and load_uwot
to be deleted. Previously, both load_uwot and save_uwot were attempting
to delete the temporary working directories they used, but would always
silently fail because Annoy is making use of files in those directories.

init = "spca", fixed values of a and b (rather than allowing them to be
calculated through setting min_dist and spread) and approx_pow = TRUE.
Using the tumap method with init = "spca" is probably the most robust
approach.

n_epochs = 0. This used to behave like n_epochs = NULL and gave a default
number of epochs (dependent on the number of vertices in the dataset).
Now it more usefully carries out all calculations except optimization, so
the returned coordinates are those specified by the init parameter. This
is an easy way to access e.g. the spectral or PCA initialization
coordinates. If you want the input fuzzy graph (ret_extra vector contains
"fgraph"), this will also prevent the graph having edges with very low
membership being removed. You still get the old default epochs behavior
by setting n_epochs = NULL or to a negative value.

save_uwot and load_uwot have been updated with a verbose
parameter so it’s easier to see what
temporary files are being created.

save_uwot has a new parameter, unload, which if set to TRUE will delete
the working directory for you, at the cost of unloading the model, i.e.
it can’t be used with umap_transform until you reload it with load_uwot.

save_uwot now returns the saved model with an extra field, mod_dir, which
points to the location of the temporary working directory, so you should
now assign the result of calling save_uwot to the model you saved, e.g.
model <- save_uwot(model, "my_model_file"). This field is intended for
use with unload_uwot.

load_uwot also returns the model with a mod_dir item for use with
unload_uwot.

save_uwot and load_uwot were not correctly handling relative paths.

load_uwot in uwot 0.1.4 to work with newer versions of RcppAnnoy
(https://github.com/jlmelville/uwot/issues/31) failed in the typical case
of a single metric for the nearest neighbor search using all available
columns, giving an error message along the lines of:
Error: index size <size> is not a multiple of vector size <size>.
This has now been fixed, but required changes to both save_uwot and
load_uwot, so existing saved models must be regenerated. Thank you to
reporter OuNao.

n_threads caused a crash. This was particularly insidious
if running with a system with only one default thread available as the
default n_threads becomes 0.5. Now n_threads (and n_sgd_threads) are
rounded to the nearest integer.

ERROR: there is already an InterruptableProgressMonitor instance defined.

verbose = TRUE, the a, b curve parameters are now logged.

Even with a fix for the bug mentioned above, if the nearest neighbor
index file is larger than 2GB in size, Annoy may not be able to read the
data back in. This should only occur with very large or high-dimensional
datasets. The nearest neighbor search will fail under these conditions.
A work-around is to set n_threads = 0, because the index will not be
written to disk and re-loaded under these circumstances, at the cost of a
longer search time. Alternatively, set the pca parameter to reduce the
dimensionality or lower n_trees, both of which will reduce the size of
the index on disk. However, either may lower the accuracy of the nearest
neighbor results.
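The two work-arounds described above might look like the following in practice. This is only a sketch: iris is far too small to trigger the 2GB problem and is used purely so the calls run quickly, the variable names are illustrative, and the particular pca and n_trees values are arbitrary assumptions rather than recommendations. nn_method = "annoy" is set explicitly because small datasets may otherwise not use Annoy at all.

```r
library(uwot)

X <- iris[, 1:4]

# Work-around 1: single-threaded nearest neighbor search (n_threads = 0),
# trading a longer search time for not round-tripping the index via disk.
set.seed(42)
emb_single <- umap(X, nn_method = "annoy", n_threads = 0)

# Work-around 2: make the index smaller instead, by reducing the input
# dimensionality with PCA and building fewer trees. Either setting may
# lower the accuracy of the nearest neighbor results.
set.seed(42)
emb_small <- umap(X, nn_method = "annoy", pca = 2, n_trees = 10)
```

Both calls return an ordinary embedding matrix with one row per observation, so the rest of an analysis does not need to change.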
Initial CRAN release.
tmpdir, which allows the user to specify the temporary directory where
nearest neighbor indexes will be written during Annoy nearest neighbor
search. The default is base::tempdir(). Only used if n_threads > 1 and
nn_method = "annoy".

Fixed an issue with lvish where there was an off-by-one error when
calculating input probabilities.

Added a safe-guard to lvish to prevent the gaussian precision, beta,
becoming overly large when the binary search fails during perplexity
calibration.

The lvish perplexity calibration uses the log-sum-exp trick to avoid
numeric underflow if beta becomes large.
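The log-sum-exp trick mentioned above can be illustrated in base R. This is a minimal sketch and not the code used inside lvish: the function names log_sum_exp and row_entropy and the squared distances are hypothetical, and it assumes Gaussian weights of the form exp(-beta * d2) as used in t-SNE-style perplexity calibration.

```r
# Numerically stable log(sum(exp(x))): subtract the maximum before
# exponentiating, so the largest term is exactly exp(0) = 1.
log_sum_exp <- function(x) {
  m <- max(x)
  m + log(sum(exp(x - m)))
}

# Shannon entropy (in nats) of p_j = exp(-beta * d2_j) / Z.
# Because log(p_j) = -beta * d2_j - log(Z), the entropy can be written
# as H = log(Z) + beta * sum_j p_j * d2_j, which only needs log(Z).
row_entropy <- function(d2, beta) {
  log_w <- -beta * d2
  log_z <- log_sum_exp(log_w)
  p <- exp(log_w - log_z)
  log_z + beta * sum(p * d2)
}

d2 <- c(1, 2, 4, 8)  # made-up squared distances to four neighbors

# Naive normalization with a large beta: every weight underflows to 0,
# so the probabilities are 0 / 0.
naive_p <- exp(-1e4 * d2) / sum(exp(-1e4 * d2))

# The stable version still gives a finite entropy and perplexity.
h_stable <- row_entropy(d2, beta = 1e4)
perplexity <- exp(h_stable)
```

With beta this large all the probability mass sits on the closest neighbor, so the stable calculation reports a perplexity of 1, while the naive probabilities are all NaN.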
pcg_rand. If TRUE (the default), then a random number generator from the
PCG family is used during the stochastic optimization phase. The old
PRNG, a direct translation of an implementation of the Tausworthe
“taus88” PRNG used in the Python version of UMAP, can be obtained by
setting pcg_rand = FALSE. The new PRNG is slower, but is likely superior
in its statistical randomness. This change in behavior will break
backwards compatibility: you will now get slightly different results even
with the same seed.

fast_sgd. If TRUE, then the following combination of parameters is set:
n_sgd_threads = "auto", pcg_rand = FALSE and approx_pow = TRUE. These
will result in a substantially faster optimization phase, at the cost of
being slightly less accurate and results not being exactly repeatable.
fast_sgd = FALSE by default but if you are only interested in
visualization, then fast_sgd gives perfectly good results. For more
generic dimensionality reduction and reproducibility, keep
fast_sgd = FALSE.

init_sdev which specifies how large the standard deviation of each column
of the initial coordinates should be. This will scale any input
coordinates (including user-provided matrix coordinates). init = "spca"
can now be thought of as an alias of init = "pca", init_sdev = 1e-4. This
may be too aggressive scaling for some datasets. The typical UMAP spectral
initializations tend to result in standard deviations of around
2 to 5, so this might be more appropriate in some cases. If spectral
initialization detects multiple components in the affinity graph and
falls back to scaled PCA, it uses init_sdev = 1.

init_sdev, the init options sspectral, slaplacian and snormlaplacian have
been removed (they weren’t around for very long anyway). You can get the
same behavior by e.g. init = "spectral", init_sdev = 1e-4. init = "spca"
is sticking around because I use it a lot.

init = "spca".

<random> header. This breaks backwards compatibility even if you set
pcg_rand = FALSE.

metric = "cosine" results were incorrectly using the unmodified Annoy
angular distance.

categorical metric (fixes https://github.com/jlmelville/uwot/issues/20).

n_components (e.g. approximately 50% faster optimization time with MNIST
and n_components = 50).

pca_center, which controls whether to
center the data before applying PCA. It would be typical to set this to
FALSE if you are applying PCA to binary data (although note you can’t use
this setting with metric = "hamming").

metric is "manhattan" and "cosine". It’s still not applied when using
"hamming" (data still needs to be in binary format, not real-valued).

pca and pca_center parameter values for a given data block by using a
list for the value of the metric, with the column ids/names as an unnamed
item and the overriding values as named items, e.g. instead of
manhattan = 1:100, use manhattan = list(1:100, pca_center = FALSE) to
turn off PCA centering for just that block. This functionality exists
mainly for the case where you have mixed binary and real-valued data and
want to apply PCA to both data types. It’s normal to apply centering to
real-valued data but not to binary data.

umap_transform, where negative sampling was over the size of the test
data (should be the training data).

verbose = TRUE, log the Annoy recall accuracy, which may help tune values
of n_trees and search_k.

n_sgd_threads, which controls the number
of threads used in the stochastic gradient descent. By default this is
now single-threaded and should give reproducible results when using
set.seed. To get back the old, less consistent, but faster settings, set
n_sgd_threads = "auto".

alpha is now learning_rate.

gamma is now repulsion_strength.

laplacian and normlaplacian.

init options: sspectral, snormlaplacian and slaplacian. These are like
spectral, normlaplacian, laplacian respectively, but scaled so that each
dimension has a standard deviation of 1e-4. This is like the difference
between the pca and spca options.

pca: set this to a positive integer to reduce a matrix or data frame to
that number of columns using PCA. Only works if metric = "euclidean". If
you have > 100 columns, this can substantially improve the speed of the
nearest neighbor search. t-SNE implementations often set this value to
50.

metric:
instead of specifying a single metric name
(e.g. metric = "euclidean"), you can pass a list, where the name of each
item is the metric to use and the value is a vector of the names of the
columns to use with that metric, e.g.
metric = list("euclidean" = c("A1", "A2"), "cosine" = c("B1", "B2", "B3"))
treats columns A1 and A2 as one block, using the Euclidean distance to
find nearest neighbors, whereas B1, B2 and B3 are treated as a second
block, using the cosine distance.

categorical.

y may now be a data frame or matrix if multiple target data is available.

target_metric, to specify the distance metric to use with numerical y.
This has the same capabilities as metric.

scale = "Z" to Z-scale each column of input (synonym for scale = TRUE or
scale = "scale").

scale = "colrange" to scale columns in the range (0, 1).

y, you may pass
nearest neighbor data directly, in the same format as that supported by
X-related nearest neighbor data. This may be useful if you don’t want to
use Euclidean distances for the y data, or if you have missing data (and
have a way to assign nearest neighbors for those cases, obviously). See
the Nearest Neighbor Data Format section for details.

ret_nn: when TRUE returns nearest neighbor matrices as a nn list: indices
in item idx and distances in item dist. Embedded coordinates are in
embedding. Both ret_nn and ret_model can be TRUE, and should not cause
any compatibility issues with supervised embeddings.

nn_method can now take precomputed nearest neighbor data. Must be a list
of two matrices: idx, containing integer indexes, and dist containing
distances. By no coincidence, this is the format returned by ret_nn.

n_components = 1 was broken
(https://github.com/jlmelville/uwot/issues/6).

init parameter were being modified, in defiance of basic R pass-by-copy
semantics.

metric = "cosine" is working again for n_threads greater than 0
(https://github.com/jlmelville/uwot/issues/5).

August 5 2018. You can now use an existing embedding to add new points
via umap_transform. See the example section below.
August 1 2018. Numerical vectors are now supported for supervised dimension reduction.
July 31 2018. (Very) initial support for supervised dimension reduction:
categorical data only at the moment. Pass in a factor vector (use NA for
unknown labels) as the y parameter and edges with bad (or unknown) labels
are down-weighted, hopefully leading to better separation of classes.
This works remarkably well for the Fashion MNIST dataset.
July 22 2018. You can now use the cosine and Manhattan distances with the
Annoy nearest neighbor search, via metric = "cosine" and
metric = "manhattan", respectively. Hamming distance is not supported
because RcppAnnoy doesn’t yet support it.
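The three features in the dated 2018 entries above (a non-Euclidean Annoy metric, supervised embedding with a factor y containing NA, and adding new points with umap_transform) can be combined in one short sketch. The train/test split of iris, the number of hidden labels and all variable names are illustrative assumptions. The call also assumes a uwot version that provides ret_model and nn_method, both of which appear in other entries of this changelog.

```r
library(uwot)

set.seed(42)

# Hold out 50 of the 150 iris rows to act as "new" points later.
train_idx <- sample(nrow(iris), 100)
X_train <- iris[train_idx, 1:4]
X_test <- iris[-train_idx, 1:4]

# Supervised labels: a factor, with NA marking unknown labels.
# Here 30 training labels are hidden at random.
y_train <- iris$Species[train_idx]
y_train[sample(length(y_train), 30)] <- NA

# Cosine distance through the Annoy nearest neighbor search.
# ret_model = TRUE keeps what umap_transform needs to embed new data.
model <- umap(
  X_train,
  y = y_train,
  metric = "cosine",
  nn_method = "annoy",
  ret_model = TRUE
)

# Add the held-out points to the existing embedding.
emb_test <- umap_transform(X_test, model)
```

The training coordinates are in model$embedding and emb_test holds one row per held-out observation, so the two can be plotted together to see where the new points land.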