Bixverse R <> Rust interface

library(bixverse)
Why Rust?
Honest answer? This package started from the author's desire to play with Rust. To make the playground useful, the initial experiments covered things relevant to day-to-day work in bioinformatics and computational biology: correlations, PCA, other matrix factorisation methods, and so on. It quickly became apparent that Rust delivered remarkable speed gains over R (sometimes 100x), with millions of hypergeometric tests completing in mere seconds. On top of that, memory safety and the fantastic work by the rextendr team made interfacing R and Rust straightforward and, frankly, quite fun.
Correlations and covariance
One of the most typical operations in computational biology is computing correlations and covariances, and these were the first functions implemented, largely to understand faer and the Rust/R interface. The speed difference becomes apparent very quickly. For 1000 x 1000 matrices on an M1 Max MacBook Pro (using Apple’s Accelerate framework for R BLAS):
Covariance
| Implementation | Median (ms, 20 runs) |
|---|---|
| base R | 515.73 |
| Rust | 18.66 |
A 27x increase in speed.
Pearson correlation
| Implementation | Median (ms, 20 runs) |
|---|---|
| base R | 512.83 |
| Rust | 20.90 |
A 24x increase in speed.
Spearman correlation
| Implementation | Median (ms, 20 runs) |
|---|---|
| base R | 603.00 |
| Rust | 29.47 |
A 20x increase in speed.
The gains are even more pronounced on larger matrices. For a 1000 x 5000 matrix:
Pearson correlation (1000 x 5000)
| Implementation | Median (ms, 20 runs) |
|---|---|
| base R | 12994.37 |
| Rust | 419.88 |
A 30x increase in speed.
Given that matrix algebra underlies a large proportion of computational bioinformatics workflows, this makes a compelling case for Rust under the hood. The rest of this vignette shows how to use the exposed Rust functions directly, with runnable benchmarks you can verify on your own machine.
Using the Rust functions in more general R code
bixverse exposes a number of R/Rust functions that are useful beyond the higher-level package functionality. This vignette covers:
- Correlations (via rs_cor()).
- Covariance (via rs_covariance()).
- Covariance to correlation (via rs_cov2cor()).
- Distance calculations (via rs_dist()).
- Correlations between two matrices (via rs_cor2()).
- Mutual information between columns (via rs_mutual_info()).
- Set similarities (via rs_set_similarity_list() or rs_set_similarity()).
These may be useful for your own package development.
Note
The vignette was built on a GitHub runner which is a 2-core machine with ~8 GB of memory. The performance differences tend to be (even) more pronounced on machines with more cores and computational power.
Set-up
Any function prefixed with rs_ calls directly into Rust and can panic if misused, so check inputs before passing them through. There are a LOT of these functions in the package, and they are all exported. Why? So that other people can integrate the code into their own R packages, or, if they feel brave, use the Rust crate directly, see here.
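Because a panic in Rust aborts with a much less friendly error than a regular R condition, a thin validating wrapper is a cheap safeguard when embedding these functions in your own package. The sketch below is a hypothetical helper (safe_rs_cor is not part of bixverse); it fails early on the R side before handing anything to Rust:

```r
# hypothetical defensive wrapper -- not part of bixverse itself
safe_rs_cor <- function(x, spearman = FALSE) {
  # rs_cor() expects a numeric matrix without missing values;
  # raise a normal R error instead of risking a panic in Rust
  stopifnot(is.matrix(x), is.numeric(x), !anyNA(x))
  rs_cor(x, spearman = spearman)
}
```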
Correlations, covariance, and cosine distance
Pearson’s correlation
Let’s check the speed.
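The setup chunk generating random_data is not shown here; a minimal sketch, assuming a 1000 x 1000 matrix of standard normal draws (matching the dimensions benchmarked earlier in this vignette), together with a quick agreement check against base R:

```r
set.seed(42)
# assumed input: 1000 x 1000 matrix of N(0, 1) draws
random_data <- matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000)

# the Rust result should match base R to floating-point precision
all.equal(cor(random_data), rs_cor(random_data, spearman = FALSE))
```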
microbenchmark::microbenchmark(
r = cor(random_data),
rust = rs_cor(random_data, spearman = FALSE),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 135.129510 135.15287 135.361154 135.312126 135.345664 136.244074 10
#> rust 5.417325 5.56965 6.340263 5.889052 6.975938 8.116911 10
Spearman’s correlation
What about ranked correlations?
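The comparison chunk is not shown above; reusing the random_data matrix from before, an agreement check against base R might look like:

```r
# rank-based correlation: spearman = TRUE switches the Rust kernel
r_spearman    <- cor(random_data, method = "spearman")
rust_spearman <- rs_cor(random_data, spearman = TRUE)
all.equal(r_spearman, rust_spearman)
```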
And speed
microbenchmark::microbenchmark(
r = cor(random_data, method = "spearman"),
rust = rs_cor(random_data, spearman = TRUE),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 165.711238 166.771830 168.341354 168.730592 169.899446 170.35679 10
#> rust 7.089851 7.322075 8.275321 7.545908 9.505185 10.68323 10
Covariance
Covariance matrices are also very common…
r_covar_res <- cov(random_data)
rust_covar_res <- rs_covariance(random_data)
all.equal(r_covar_res, rust_covar_res, tolerance = 1e-15)
#> [1] TRUE
And speed comparisons
microbenchmark::microbenchmark(
r = cov(random_data),
rust = rs_covariance(random_data),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 134.859886 134.897467 135.313301 134.951678 135.114131 136.831961 10
#> rust 4.333822 4.366922 5.532887 5.149454 5.501282 8.346721 10
Covariance to correlation
r_cor_from_covar <- cov2cor(r_covar_res)
rust_cor_from_covar <- rs_cov2cor(rust_covar_res)
all.equal(r_cor_from_covar, rust_cor_from_covar, tolerance = 1e-15)
#> [1] TRUE
This is where the Rust implementation shows the least improvement (the operation is already cheap in base R), but the interface remains consistent with the rest of the rs_ family.
microbenchmark::microbenchmark(
r = cov2cor(r_covar_res),
rust = rs_cov2cor(rust_covar_res),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 2.201775 2.413260 3.398934 3.197761 4.204378 6.013450 10
#> rust 1.420715 1.533927 2.786392 2.006590 3.552870 6.808175 10
Correlations between two matrices
So far, all of the functions computed pairwise correlations between the columns of a single matrix. What about correlating the columns of two different matrices against each other?
Let’s compare against R:
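The chunk generating random_data_2 is not shown; a sketch, assuming it reuses random_data from above plus a second matrix (the 1000 x 500 dimensions are an assumption; only the number of rows must match):

```r
set.seed(123)
# assumed second matrix: same number of rows, different columns
random_data_2 <- matrix(rnorm(1000 * 500), nrow = 1000, ncol = 500)

# cor(x, y) correlates columns of x against columns of y
all.equal(
  cor(random_data, random_data_2),
  rs_cor2(random_data, random_data_2, spearman = FALSE)
)
```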
And speed:
microbenchmark::microbenchmark(
r = cor(random_data, random_data_2),
rust = rs_cor2(random_data, random_data_2, spearman = FALSE),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 193.585846 193.748560 193.860604 193.798744 193.971397 194.475950 10
#> rust 4.735992 5.324241 6.118443 5.915485 6.956501 7.738373 10
Distance metrics
Euclidean distance
Note that dist() computes row-wise distances, while rs_dist() expects columns to represent observations, hence the transpose.
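The comparison chunk itself is not shown above; a sketch of how the two conventions line up, assuming (as the note states) that rs_dist() treats columns as observations:

```r
# dist() works on rows, so transpose before calling rs_dist();
# dist() also returns a "dist" object, hence as.matrix()
r_dist    <- as.matrix(dist(random_data, method = "euclidean"))
rust_dist <- rs_dist(t(random_data), distance_type = "euclidean")
all.equal(r_dist, rust_dist, check.attributes = FALSE)
```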
And speed…
microbenchmark::microbenchmark(
r = dist(random_data, method = "euclidean"),
rust = rs_dist(random_data, distance_type = "euclidean"),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 95.095295 95.226420 95.548990 95.396589 95.626549 97.11469 10
#> rust 4.565683 4.698441 4.892128 4.878208 5.145327 5.19537 10
Manhattan distance
And speed…
microbenchmark::microbenchmark(
r = dist(random_data, method = "manhattan"),
rust = rs_dist(random_data, distance_type = "manhattan"),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 89.36864 89.41936 89.62563 89.67866 89.77539 89.84988 10
#> rust 17.09855 17.16074 17.38307 17.25756 17.64508 17.87759 10
Canberra distance
Same benchmarks as before…
microbenchmark::microbenchmark(
r = dist(random_data, method = "canberra"),
rust = rs_dist(random_data, distance_type = "canberra"),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> r 134.76585 134.87735 135.24578 134.98257 135.16090 136.95542 10
#> rust 46.04899 46.31671 46.43284 46.43207 46.62209 46.69678 10
Mutual information
The Rust version defaults to sqrt(nrow()) bins when n_bins = NULL. Two discretisation strategies are available.
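The chunks below operate on a numeric matrix mat whose setup chunk is not shown; a minimal sketch (the dimensions and column names are assumptions):

```r
set.seed(42)
# assumed input: 1000 observations (rows) x 50 features (columns)
mat <- matrix(rnorm(1000 * 50), nrow = 1000, ncol = 50)
colnames(mat) <- paste0("feature_", seq_len(ncol(mat)))
```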
Equal width:
rust_res_mi <- rs_mutual_info(
mat,
n_bins = NULL,
normalise = FALSE,
strategy = "equal_width"
)
rownames(rust_res_mi) <- colnames(rust_res_mi) <- colnames(mat)
infotheo_res_mi <- infotheo::mutinformation(infotheo::discretize(
mat,
disc = "equalwidth",
nbins = sqrt(nrow(mat))
))
all.equal(rust_res_mi, infotheo_res_mi)
#> [1] TRUE
microbenchmark::microbenchmark(
infotheo = infotheo::mutinformation(infotheo::discretize(
mat,
disc = "equalwidth",
nbins = sqrt(nrow(mat))
)),
rust = rs_mutual_info(
mat,
n_bins = NULL,
normalise = FALSE,
strategy = "equal_width"
),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> infotheo 83.162443 83.4657 84.20452 84.098332 84.891725 85.428568 10
#> rust 3.307592 3.3649 3.47485 3.483766 3.564042 3.720183 10
Equal frequency:
rust_res_mi <- rs_mutual_info(
mat,
n_bins = NULL,
normalise = FALSE,
strategy = "equal_freq"
)
rownames(rust_res_mi) <- colnames(rust_res_mi) <- colnames(mat)
infotheo_res_mi <- infotheo::mutinformation(infotheo::discretize(
mat,
disc = "equalfreq",
nbins = sqrt(nrow(mat))
))
all.equal(rust_res_mi, infotheo_res_mi)
#> [1] TRUE
microbenchmark::microbenchmark(
infotheo = infotheo::mutinformation(infotheo::discretize(
mat,
disc = "equalfreq",
nbins = sqrt(nrow(mat))
)),
rust = rs_mutual_info(
mat,
n_bins = NULL,
normalise = FALSE,
strategy = "equal_freq"
),
times = 10L
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> infotheo 101.116219 101.428322 102.046697 101.896457 102.770610 103.272028 10
#> rust 4.369887 4.415793 4.515948 4.561882 4.572656 4.626907 10
Set similarities
bixverse also exposes Rust-accelerated set similarity calculations. To illustrate the performance, below is a comparison against a naive R implementation of Jaccard similarity across two lists.
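The setup chunk defining jaccard_sim and the two lists of sets is not shown; a sketch of what it plausibly contained (the universe size, set sizes, and list lengths are all assumptions):

```r
# naive Jaccard similarity between two vectors treated as sets
jaccard_sim <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

set.seed(42)
universe <- as.character(seq_len(10000))
# assumed: 1000 random sets of 100 elements each, on both sides
random_sets_1 <- replicate(1000, sample(universe, 100), simplify = FALSE)
random_sets_2 <- replicate(1000, sample(universe, 100), simplify = FALSE)
```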
Starting with a sequential purrr approach as the baseline:
tictoc::tic()
r_results <- purrr::map(
random_sets_1,
\(x) {
purrr::map_dbl(random_sets_2, \(y) jaccard_sim(x, y))
},
.progress = TRUE
)
similarity_matrix <- matrix(
data = unlist(r_results),
nrow = length(random_sets_1),
ncol = length(random_sets_2),
byrow = TRUE
)
tictoc::toc()
#> 24.617 sec elapsed
Parallelising via furrr gives a meaningful speedup:
future::plan(strategy = future::multisession(workers = parallel::detectCores()))
tictoc::tic()
r_results <- furrr::future_map(
random_sets_1,
\(x) {
purrr::map_dbl(random_sets_2, \(y) jaccard_sim(x, y))
},
.progress = TRUE
)
similarity_matrix <- matrix(
data = unlist(r_results),
nrow = length(random_sets_1),
ncol = length(random_sets_2),
byrow = TRUE
)
tictoc::toc()
#> 13.838 sec elapsed
future::plan(strategy = future::sequential())
mirai reduces process scheduling overhead (sometimes) compared to future and does a bit better:
mirai::daemons(parallel::detectCores())
tictoc::tic()
r_results_mirai <- mirai::mirai_map(
random_sets_1,
\(x, sets_2) {
purrr::map_dbl(sets_2, \(y) {
length(intersect(x, y)) / length(union(x, y))
})
},
.args = list(sets_2 = random_sets_2)
)[]
similarity_matrix <- matrix(
data = unlist(r_results_mirai),
nrow = length(random_sets_1),
ncol = length(random_sets_2),
byrow = TRUE
)
tictoc::toc()
#> 13.392 sec elapsed
mirai::daemons(0)
Depending on your system this optimised parallel R version may be around 4x faster than the naive sequential baseline. Now compare against the Rust implementation:
tictoc::tic()
rust_res <- rs_set_similarity_list2(
s_1_list = random_sets_1,
s_2_list = random_sets_2,
overlap_coefficient = FALSE
)
tictoc::toc()
#> 0.046 sec elapsed
all.equal(similarity_matrix, rust_res, tolerance = 1e-15)
#> [1] TRUE
The margin here is not subtle. The functions shown in this vignette are a representative sample of what is exposed in bixverse: there are many more specialised functions throughout the package worth exploring (if you are brave and can deal with a panic here and there).