Nearest neighbour searches in manifoldsR
kNN graphs are at the core of many manifold learning methods. This vignette introduces the provided search methods, together with the advantages and disadvantages of each. The wrappers can also be used to avoid recomputing the kNN graph when you wish to test different parameters.
# number neighbours
k <- 10L
# number samples
n_samples <- 1000L
Synthetic data
We will generate some synthetic data and test the different methods out.
cluster_data <- manifold_synthetic_data(
type = "cluster",
n_samples = n_samples
)
Exhaustive search
This is the simplest search type. It uses an exhaustive search index under the hood, with SIMD acceleration over a flat structure for cache locality. On small data sets, it can even be faster than the approximate searches discussed below.
exhaustive <- generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "exhaustive"
)
exhaustive
#> NearestNeighbours
#> n_samples: 1000
#> k_neighbours: 10
There is a set of getters for these classes, shown below. An important note: the indices are 0-based (they come from Rust), so if you wish to use them in R, you need to account for this!
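Since the indices are 0-based, add 1 before using them to subset R objects. A minimal sketch with a toy index matrix (the real matrices come from the getters below):

```r
# Toy 0-based neighbour index matrix: row i lists the neighbours of sample i.
idx_0based <- rbind(c(0L, 2L), c(2L, 0L), c(1L, 0L))

# Shift to 1-based before subsetting R objects:
idx_1based <- idx_0based + 1L

labels <- c("a", "b", "c")
labels[idx_1based[1, ]]  # neighbours of the first sample: "a" "c"
```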
# flat representation of the indices
indices_flat <- get_idx_flat(exhaustive)
# as N x k nearest neighbours matrix
indices_mat <- get_idx_mat(exhaustive)
The distances can also be extracted easily:
# flat representation of the distances
dist_flat <- get_dist_flat(exhaustive)
# as N x k nearest neighbours matrix
dist_mat <- get_dist_mat(exhaustive)
Approximate searches
This package is designed to deal with large datasets and leverages the ann-search-rs crate under the hood, which provides various approximate nearest neighbour searches. Four approximate methods are exposed in this package, in addition to the exhaustive version above:
| Method | Features | Class | Use case |
|---|---|---|---|
| BallTree | Simple method, good for small data sets (≤ 100,000) | Tree-based | Small data sets, small dimensionality |
| Annoy | Classic in single cell, scales well to data sets of up to 500,000 | Tree-based | Large data sets, small dimensionality |
| HNSW | Powerful on large datasets (≥ 500,000), slightly slower index build type, fast queries | Graph-based | Large data sets, small to large dimensionality |
| NNDescent | Another graph-based version, good all-rounder when certain sizes are reached (≥ 100,000) | Graph-based | Medium data sets, small to large dimensionality |
manifoldsR defaults to BallTree, but you can experiment with the parameters and methods. The package was written with massive scale in mind; if you are looking at smaller data sets (like in the vignettes), BallTree does the job (fast). If you have high-dimensional data and/or massive numbers of samples, the other methods will shine.
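As a rough rule of thumb, the thresholds from the table above can be encoded in a small helper. Note that choose_knn_method is a hypothetical convenience function, not part of manifoldsR:

```r
# Hypothetical helper: pick a kNN method from the sample count,
# following the rough thresholds quoted in the table above.
choose_knn_method <- function(n_samples) {
  if (n_samples <= 100000L) {
    "balltree"  # small data: tree build is cheap
  } else if (n_samples <= 500000L) {
    "annoy"     # medium data, small dimensionality
  } else {
    "hnsw"      # large data: slower build, very fast queries
  }
}

choose_knn_method(1000L)    # "balltree"
choose_knn_method(250000L)  # "annoy"
choose_knn_method(1e6)      # "hnsw"
```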
Speed
Let’s have a look at how they perform on different data set sizes.
microbenchmark::microbenchmark(
exhaustive = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "exhaustive",
.verbose = FALSE
)
},
annoy = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "annoy",
.verbose = FALSE
)
},
hnsw = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "hnsw",
.verbose = FALSE
)
},
balltree = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "balltree",
.verbose = FALSE
)
},
nndescent = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "nndescent",
.verbose = FALSE
)
},
times = 1L # single comparison for speed
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> exhaustive 4.847738 4.847738 4.847738 4.847738 4.847738 4.847738 1
#> annoy 17.371247 17.371247 17.371247 17.371247 17.371247 17.371247 1
#> hnsw 35.922275 35.922275 35.922275 35.922275 35.922275 35.922275 1
#> balltree 3.586603 3.586603 3.586603 3.586603 3.586603 3.586603 1
#> nndescent 16.986848 16.986848 16.986848 16.986848 16.986848 16.986848 1
On small datasets, the exhaustive search beats everything. The other methods first need to build an index, a cost that only pays off once there are enough samples. Until then, modern CPUs with SIMD get through the O(N^2) distance computations in no time. What about 50,000 samples?
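Before scaling up, a quick back-of-the-envelope count shows why index building eventually pays off: the exhaustive search must evaluate every pairwise distance, which grows quadratically (this sketch ignores SIMD and constant factors):

```r
# Brute-force kNN evaluates every pair once: N * (N - 1) / 2 distances.
n <- c(1e3, 5e4, 1e5)
pairs <- n * (n - 1) / 2
data.frame(n_samples = n, distance_evals = pairs)
```

At 100,000 samples that is already roughly 5 billion distance evaluations, which is why the approximate indices pull ahead.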
benchmark_data <- manifold_synthetic_data(
type = "cluster",
n_samples = 50000L
)
microbenchmark::microbenchmark(
exhaustive = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "exhaustive",
.verbose = FALSE
)
},
annoy = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "annoy",
.verbose = FALSE
)
},
hnsw = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "hnsw",
.verbose = FALSE
)
},
balltree = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "balltree",
.verbose = FALSE
)
},
nndescent = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "nndescent",
.verbose = FALSE
)
},
times = 1L # single comparison for speed
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> exhaustive 8.390668 8.390668 8.390668 8.390668 8.390668 8.390668 1
#> annoy 4.198860 4.198860 4.198860 4.198860 4.198860 4.198860 1
#> hnsw 4.018074 4.018074 4.018074 4.018074 4.018074 4.018074 1
#> balltree 1.454881 1.454881 1.454881 1.454881 1.454881 1.454881 1
#> nndescent 2.274216 2.274216 2.274216 2.274216 2.274216 2.274216 1
Here we observe a first pattern: the approximate nearest neighbour searches are now faster than the exhaustive search. The gap becomes more pronounced with more samples:
benchmark_data <- manifold_synthetic_data(
type = "cluster",
n_samples = 100000L
)
# this will take some time...
microbenchmark::microbenchmark(
exhaustive = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "exhaustive",
.verbose = FALSE
)
},
annoy = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "annoy",
.verbose = FALSE
)
},
hnsw = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "hnsw",
.verbose = FALSE
)
},
balltree = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "balltree",
.verbose = FALSE
)
},
nndescent = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "nndescent",
.verbose = FALSE
)
},
times = 1L # single comparison for speed
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> exhaustive 32.355299 32.355299 32.355299 32.355299 32.355299 32.355299 1
#> annoy 12.737893 12.737893 12.737893 12.737893 12.737893 12.737893 1
#> hnsw 8.421055 8.421055 8.421055 8.421055 8.421055 8.421055 1
#> balltree 4.636256 4.636256 4.636256 4.636256 4.636256 4.636256 1
#> nndescent 5.494115 5.494115 5.494115 5.494115 5.494115 5.494115 1
Precision
In the end, these are approximate methods, so we need to understand how well they recover the true neighbours. A typical metric is Recall@k. Let’s have a look at a smaller data set here:
samples <- 25000L
benchmark_data <- manifold_synthetic_data(
type = "cluster",
n_samples = samples
)
exhaustive <- generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "exhaustive"
)
indices <- c("hnsw", "annoy", "nndescent", "balltree")
for (idx in indices) {
idx_i <- generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = idx,
.verbose = FALSE
)
recall_i <- sum(get_idx_flat(exhaustive) == get_idx_flat(idx_i)) /
(k * samples)
print(
sprintf("ANN method %s achieves a Recall of %.3f.", idx, recall_i)
)
}
#> [1] "ANN method hnsw achieves a Recall of 0.994."
#> [1] "ANN method annoy achieves a Recall of 1.000."
#> [1] "ANN method nndescent achieves a Recall of 0.992."
#> [1] "ANN method balltree achieves a Recall of 0.980."
As you can see, all of these methods are able to (mostly) identify the right neighbours (in the right order!). If you wish to understand more about how these methods work and how they compare, please check this out.
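Note that the recall computation above counts a neighbour as recovered only if it appears at exactly the same position as in the exhaustive result. An order-insensitive variant compares the neighbour sets row by row instead; here is a sketch using toy index matrices (not manifoldsR output):

```r
# Order-insensitive Recall@k: fraction of true neighbours recovered
# per query, regardless of their position in the result list.
recall_at_k <- function(true_idx, approx_idx) {
  stopifnot(dim(true_idx) == dim(approx_idx))
  hits <- vapply(
    seq_len(nrow(true_idx)),
    function(i) length(intersect(true_idx[i, ], approx_idx[i, ])),
    numeric(1)
  )
  sum(hits) / length(true_idx)
}

# Toy example: 2 queries, k = 3; the second row is a permutation
# plus one wrong neighbour.
true_idx   <- rbind(c(1, 2, 3), c(4, 5, 6))
approx_idx <- rbind(c(1, 2, 3), c(6, 4, 9))
recall_at_k(true_idx, approx_idx)  # (3 + 2) / 6, about 0.833
```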