Nearest neighbour searches
2026-05-08
Nearest neighbour searches in manifoldsR
kNN graphs are at the core of many manifold learning methods. This vignette introduces the methods provided by the package, together with their advantages and disadvantages. The wrappers can also be used to avoid recomputing the kNN graph when you wish to test different parameters.
# number neighbours
k <- 10L
# number samples
n_samples <- 1000L
Synthetic data
We will generate some synthetic data and test the different methods out.
cluster_data <- manifold_synthetic_data(
type = "cluster",
n_samples = n_samples
)
Exhaustive search
This is the simplest search type. It uses an exhaustive search index under the hood, with SIMD acceleration over a flat structure for cache locality. On small data sets, it can at times even be faster than the approximate searches we will discuss subsequently.
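For intuition, the brute-force approach can be sketched in a few lines of base R. This is a naive O(N^2) version (it excludes each point from its own neighbour list), not the package's SIMD-accelerated Rust implementation:

```r
# naive exhaustive kNN: compute all pairwise distances, sort each row
knn_brute_force <- function(x, k) {
  d <- as.matrix(dist(x)) # full N x N Euclidean distance matrix
  diag(d) <- Inf          # exclude each point from its own neighbour list
  t(apply(d, 1, function(row) order(row)[seq_len(k)]))
}

set.seed(1)
toy <- matrix(rnorm(100 * 3), ncol = 3)
idx <- knn_brute_force(toy, k = 5L) # 100 x 5 matrix of neighbour indices
```

Both the distance matrix and the per-row sort are quadratic in N, which is exactly the cost the index-based methods below try to avoid.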
exhaustive <- generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "exhaustive"
)
exhaustive
#> NearestNeighbours
#> n_samples: 1000
#> k_neighbours: 10
There is a set of getters for these classes, see below. An important note: the indices are 0-based (coming from Rust), so if you wish to use them in R, you need to be aware of this!
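Because the indices come back 0-based, they must be shifted before being used to subset R objects. A minimal base R sketch with a hypothetical toy matrix (not the package's getters):

```r
# toy data and hypothetical 0-based neighbour indices (as Rust would return them)
data_mat <- matrix(seq_len(6), nrow = 3)
idx0 <- matrix(c(1L, 2L, 0L, 2L, 0L, 1L), nrow = 3, byrow = TRUE)

# shift to 1-based before using the indices to subset in R
idx1 <- idx0 + 1L
nn_first <- data_mat[idx1[1, ], , drop = FALSE] # rows of the neighbours of sample 1
```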
# flat representation of the indices
indices_flat <- get_idx_flat(exhaustive)
# as N x k nearest neighbours matrix
indices_mat <- get_idx_mat(exhaustive)
The distances can also be extracted easily:
# flat representation of the distances
dist_flat <- get_dist_flat(exhaustive)
# as N x k nearest neighbours matrix
dist_mat <- get_dist_mat(exhaustive)
An alternative exact method is kMkNN (k-means for k-nearest neighbours). It leverages k-means and the triangle inequality to reduce the number of distance computations - a (very) fast, exact index that works well on low-dimensional data with strong structure.
kmknn <- generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "kmknn"
)
kmknn
#> NearestNeighbours
#> n_samples: 1000
#> k_neighbours: 10
all(get_idx_flat(kmknn) == get_idx_flat(exhaustive))
#> [1] TRUE
Approximate searches
This package is designed to deal with large datasets and leverages the ann-search-rs crate under the hood, which provides various approximate nearest neighbour searches. The five methods exposed in this package (in addition to the exhaustive version above) are:
| Method | Features | Class | Use case |
|---|---|---|---|
| BallTree | Simple method, good for small data sets (≤ 100,000) | Tree-based | Small data sets, small dimensionality |
| Annoy | Classic in single cell, scales well to data sets of up to 500,000 | Tree-based | Large data sets, small dimensionality |
| HNSW | Powerful on large datasets (≥ 500,000); slightly slower index build, fast queries. The implementation here uses a benign race condition during index building. | Graph-based | Large data sets, small to large dimensionality |
| NNDescent | Another graph-based method; a good all-rounder once data sets reach a certain size (≥ 100,000) | Graph-based | Medium data sets, small to large dimensionality |
| IVF | A fast cluster-based index powering libraries like FAISS. | Cluster-based | Large data sets, any dimensionality. You can aggressively squeeze out speed with the n_probe parameter if need be. Be careful to set n_list depending on your data structure: if you have few, well-defined clusters, it is better to set it lower for this method. |
manifoldsR defaults to kMkNN, but you can play around with the parameters & methods. The package was written with massive scale in mind, but if you are looking at smaller data sets with clear structure and low dimensionality (as in this vignette), kMkNN does the job (fast). If you have high-dimensional data and/or massive numbers of samples, some of the other methods will shine.
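To build intuition for how IVF trades accuracy for speed via n_list and n_probe, here is a minimal base R sketch of the idea (not the package's implementation; all names below are illustrative): cluster the data with k-means into n_list cells, then search exhaustively only within the n_probe cells whose centroids are closest to the query.

```r
set.seed(42)
x <- matrix(rnorm(1000 * 2), ncol = 2)
n_list <- 10L  # number of k-means cells (the "inverted lists")
n_probe <- 3L  # cells scanned per query
k <- 5L

km <- stats::kmeans(x, centers = n_list, nstart = 5, iter.max = 50)
cells <- split(seq_len(nrow(x)), km$cluster)

query <- x[1, ]
# rank centroids by distance to the query, keep the n_probe closest cells
centroid_dist <- sqrt(rowSums(sweep(km$centers, 2, query)^2))
candidates <- unlist(cells[order(centroid_dist)[seq_len(n_probe)]], use.names = FALSE)
# exhaustive search restricted to the candidate points
cand_dist <- sqrt(rowSums(sweep(x[candidates, , drop = FALSE], 2, query)^2))
nn <- candidates[order(cand_dist)[seq_len(k)]]
```

A larger n_probe recovers more of the true neighbours at the cost of speed, while n_list should roughly track the number of real clusters in the data: with few well-defined clusters, a smaller n_list keeps cells aligned with the actual structure.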
Speed
Let’s have a look at how they perform on different sizes:
microbenchmark::microbenchmark(
exhaustive = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "exhaustive",
.verbose = FALSE
)
},
kmknn = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "kmknn",
.verbose = FALSE
)
},
annoy = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "annoy",
.verbose = FALSE
)
},
hnsw = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "hnsw",
.verbose = FALSE
)
},
balltree = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "balltree",
.verbose = FALSE
)
},
nndescent = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "nndescent",
.verbose = FALSE
)
},
ivf = {
generate_knn_graph(
data = cluster_data$data,
k = k,
knn_method = "ivf",
nn_params = params_nn(
n_list = 25L
),
.verbose = FALSE
)
},
times = 1L # single comparison for speed
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> exhaustive 6.739509 6.739509 6.739509 6.739509 6.739509 6.739509 1
#> kmknn 6.033993 6.033993 6.033993 6.033993 6.033993 6.033993 1
#> annoy 28.330005 28.330005 28.330005 28.330005 28.330005 28.330005 1
#> hnsw 18.956202 18.956202 18.956202 18.956202 18.956202 18.956202 1
#> balltree 3.957832 3.957832 3.957832 3.957832 3.957832 3.957832 1
#> nndescent 42.345866 42.345866 42.345866 42.345866 42.345866 42.345866 1
#> ivf 5.836295 5.836295 5.836295 5.836295 5.836295 5.836295 1
On small datasets, the exhaustive search beats everything. The other methods need to build an index first, a cost that only pays off once there are enough samples. Otherwise, modern CPUs with SIMD can do the necessary O(N^2) work in no time on smaller datasets. What about 50k cells?
benchmark_data <- manifold_synthetic_data(
type = "cluster",
n_samples = 50000L
)
microbenchmark::microbenchmark(
exhaustive = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "exhaustive",
.verbose = FALSE
)
},
kmknn = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "kmknn",
.verbose = FALSE
)
},
annoy = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "annoy",
.verbose = FALSE
)
},
hnsw = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "hnsw",
.verbose = FALSE
)
},
balltree = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "balltree",
.verbose = FALSE
)
},
nndescent = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "nndescent",
.verbose = FALSE
)
},
ivf = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "ivf",
nn_params = params_nn(
n_list = 25L
),
.verbose = FALSE
)
},
times = 1L # single comparison for speed
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> exhaustive 11.703229 11.703229 11.703229 11.703229 11.703229 11.703229 1
#> kmknn 2.661916 2.661916 2.661916 2.661916 2.661916 2.661916 1
#> annoy 3.467459 3.467459 3.467459 3.467459 3.467459 3.467459 1
#> hnsw 2.577452 2.577452 2.577452 2.577452 2.577452 2.577452 1
#> balltree 1.563946 1.563946 1.563946 1.563946 1.563946 1.563946 1
#> nndescent 2.914105 2.914105 2.914105 2.914105 2.914105 2.914105 1
#> ivf 2.851967 2.851967 2.851967 2.851967 2.851967 2.851967 1
We start to observe the first pattern: the approximate nearest neighbour searches are faster than the exhaustive search. The delta becomes more pronounced with more samples:
benchmark_data <- manifold_synthetic_data(
type = "cluster",
n_samples = 100000L
)
# this will take some time...
microbenchmark::microbenchmark(
exhaustive = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "exhaustive",
.verbose = FALSE
)
},
kmknn = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "kmknn",
.verbose = FALSE
)
},
annoy = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "annoy",
.verbose = FALSE
)
},
hnsw = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "hnsw",
.verbose = FALSE
)
},
balltree = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "balltree",
.verbose = FALSE
)
},
nndescent = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "nndescent",
.verbose = FALSE
)
},
ivf = {
generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "ivf",
nn_params = params_nn(
n_list = 25L
),
.verbose = FALSE
)
},
times = 1L # single comparison for speed
)
#> Unit: seconds
#> expr min lq mean median uq max neval
#> exhaustive 46.791012 46.791012 46.791012 46.791012 46.791012 46.791012 1
#> kmknn 7.943377 7.943377 7.943377 7.943377 7.943377 7.943377 1
#> annoy 8.783292 8.783292 8.783292 8.783292 8.783292 8.783292 1
#> hnsw 6.630387 6.630387 6.630387 6.630387 6.630387 6.630387 1
#> balltree 5.942741 5.942741 5.942741 5.942741 5.942741 5.942741 1
#> nndescent 6.636090 6.636090 6.636090 6.636090 6.636090 6.636090 1
#> ivf 10.882657 10.882657 10.882657 10.882657 10.882657 10.882657 1
Precision
In the end, these are approximate methods, so we need to understand how good they are at recovering the true neighbours. A typical metric is Recall@k. Let’s have a look at a smaller data set here:
samples <- 25000L
benchmark_data <- manifold_synthetic_data(
type = "cluster",
n_samples = samples
)
exhaustive <- generate_knn_graph(
data = benchmark_data$data,
k = k,
knn_method = "exhaustive"
)
indices <- c("hnsw", "annoy", "nndescent", "balltree", "ivf", "kmknn")
for (idx in indices) {
idx_i <- generate_knn_graph(
data = benchmark_data$data,
k = k,
nn_params = params_nn(n_list = 25L),
knn_method = idx,
.verbose = FALSE
)
recall_i <- sum(get_idx_flat(exhaustive) == get_idx_flat(idx_i)) /
(k * samples)
print(
sprintf("ANN method %s achieves a Recall of %.3f.", idx, recall_i)
)
}
#> [1] "ANN method hnsw achieves a Recall of 0.998."
#> [1] "ANN method annoy achieves a Recall of 1.000."
#> [1] "ANN method nndescent achieves a Recall of 1.000."
#> [1] "ANN method balltree achieves a Recall of 0.983."
#> [1] "ANN method ivf achieves a Recall of 1.000."
#> [1] "ANN method kmknn achieves a Recall of 1.000."
As you can see, all of these methods are able to (mostly) identify the right neighbours (in the right order!). If you wish to understand more about how these methods work and how they compare, please check this out.