
Why Rust?

Honest answer? This package began as the author's excuse to play with Rust. To make the playground useful, the initial experiments covered stuff relevant to day-to-day work in bioinformatics and computational biology: correlations, PCA, other matrix factorisation methods, and so on. It quickly became apparent that Rust delivered striking speed gains over R (sometimes 100x), with millions of hypergeometric tests completing in mere seconds. Memory safety and the fantastic work by the rextendr team made interfacing R and Rust straightforward and, frankly, quite fun.

Correlations and covariance

One of the most common operations in computational biology is computing correlations and covariances, and these were the first functions implemented, largely to learn faer and the Rust/R interface. The speed difference becomes apparent very quickly. For 1000 x 1000 matrices on an M1 Max MacBook Pro (using Apple’s Accelerate framework for R BLAS):

Covariance

Implementation    Median (ms, 20 runs)
base R            515.73
Rust              18.66

A 27x increase in speed.

Pearson correlation

Implementation    Median (ms, 20 runs)
base R            512.83
Rust              20.90

A 24x increase in speed.

Spearman correlation

Implementation    Median (ms, 20 runs)
base R            603.00
Rust              29.47

A 20x increase in speed.

The gains are even more pronounced on larger matrices. For a 1000 x 5000 matrix:

Pearson correlation (1000 x 5000)

Implementation    Median (ms, 20 runs)
base R            12994.37
Rust              419.88

A 30x increase in speed.

Given that matrix algebra underlies a large proportion of computational bioinformatics workflows, this makes a compelling case for Rust under the hood. The rest of this vignette shows how to use the exposed Rust functions directly, with runnable benchmarks you can verify on your own machine.

Using the Rust functions in more general R code

bixverse exposes a number of R/Rust functions that are useful beyond the higher-level package functionality. This vignette covers:

- correlations, covariance, and cosine distance
- distance metrics
- mutual information
- set similarities

These may be useful for your own package development.

Note

The vignette was built on a GitHub runner which is a 2-core machine with ~8 GB of memory. The performance differences tend to be (even) more pronounced on machines with more cores and computational power.

Set-up

library(bixverse)

Any function prefixed with rs_ calls directly into Rust and can panic if misused, so check inputs before passing them through. There are a LOT of these functions in the package, and they are all exported. Why? So that other people can integrate the code into their own R packages, or, if they feel brave, use the Rust crate directly (see here).
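One minimal pattern for this (check_numeric_matrix is a hypothetical helper, not part of bixverse) is to validate in plain R before crossing the language boundary:

```r
# Hypothetical helper (not part of bixverse): validate a matrix in plain R
# before handing it to an rs_ function, since those can panic on bad input.
check_numeric_matrix <- function(x) {
  stopifnot(
    is.matrix(x),
    is.numeric(x),
    !anyNA(x)
  )
  invisible(x)
}

# e.g. rs_cor(check_numeric_matrix(random_data), spearman = FALSE)
```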

Correlations, covariance, and cosine distance

no_rows <- 500L
no_cols <- 500L

set.seed(10101L)

random_data <- matrix(
  data = rnorm(no_rows * no_cols),
  ncol = no_cols,
  nrow = no_rows
)

Pearson’s correlation

r_pearsons_res <- cor(random_data)

rust_pearsons_res <- rs_cor(random_data, spearman = FALSE)

# tiny numerical differences do exist
all.equal(r_pearsons_res, rust_pearsons_res, tolerance = 1e-15)
#> [1] TRUE
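For intuition on why this operation maps so well to fast linear algebra: Pearson correlation reduces to a single crossprod on standardised columns. A base-R sketch of that identity (not necessarily what rs_cor does internally):

```r
# Pearson correlation as one matrix product on centred, scaled columns.
x <- matrix(rnorm(60), nrow = 12)
xs <- scale(x)                          # centre and scale each column
manual <- crossprod(xs) / (nrow(x) - 1)
stopifnot(all.equal(manual, cor(x), check.attributes = FALSE))
```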

Let’s check the speed.

microbenchmark::microbenchmark(
  r = cor(random_data),
  rust = rs_cor(random_data, spearman = FALSE),
  times = 10L
)
#> Unit: milliseconds
#>  expr        min        lq       mean     median         uq        max neval
#>     r 135.129510 135.15287 135.361154 135.312126 135.345664 136.244074    10
#>  rust   5.417325   5.56965   6.340263   5.889052   6.975938   8.116911    10

Spearman’s correlation

What about ranked correlations?

r_spearman_res <- cor(random_data, method = "spearman")

rust_spearman_res <- rs_cor(random_data, spearman = TRUE)

all.equal(r_spearman_res, rust_spearman_res, tolerance = 1e-15)
#> [1] TRUE
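The spearman = TRUE path rests on the standard identity that Spearman correlation is Pearson correlation on column-wise ranks. In base R, the equivalence looks like this (an illustration, not bixverse internals):

```r
# Spearman == Pearson on ranks, checked in base R.
x <- matrix(rnorm(40), nrow = 10)
ranked <- apply(x, 2, rank)             # rank each column independently
stopifnot(all.equal(cor(x, method = "spearman"), cor(ranked)))
```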

And speed:

microbenchmark::microbenchmark(
  r = cor(random_data, method = "spearman"),
  rust = rs_cor(random_data, spearman = TRUE),
  times = 10L
)
#> Unit: milliseconds
#>  expr        min         lq       mean     median         uq       max neval
#>     r 165.711238 166.771830 168.341354 168.730592 169.899446 170.35679    10
#>  rust   7.089851   7.322075   8.275321   7.545908   9.505185  10.68323    10

Covariance

Covariance matrices are also very common…

r_covar_res <- cov(random_data)

rust_covar_res <- rs_covariance(random_data)

all.equal(r_covar_res, rust_covar_res, tolerance = 1e-15)
#> [1] TRUE

And speed comparisons

microbenchmark::microbenchmark(
  r = cov(random_data),
  rust = rs_covariance(random_data),
  times = 10L
)
#> Unit: milliseconds
#>  expr        min         lq       mean     median         uq        max neval
#>     r 134.859886 134.897467 135.313301 134.951678 135.114131 136.831961    10
#>  rust   4.333822   4.366922   5.532887   5.149454   5.501282   8.346721    10

Covariance to correlation

r_cor_from_covar <- cov2cor(r_covar_res)

rust_cor_from_covar <- rs_cov2cor(rust_covar_res)

all.equal(r_cor_from_covar, rust_cor_from_covar, tolerance = 1e-15)
#> [1] TRUE
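The conversion is just a rescaling by the square roots of the diagonal (cor_ij = cov_ij / (sd_i * sd_j)), which also explains why there is little to gain here. A base-R sketch:

```r
# Rebuild the correlation matrix from a covariance matrix by hand.
C <- cov(matrix(rnorm(50), nrow = 10))
s <- sqrt(diag(C))                      # column standard deviations
manual <- C / tcrossprod(s)             # divide by outer product sd_i * sd_j
stopifnot(all.equal(manual, cov2cor(C)))
```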

This is where the Rust implementation shows the least improvement — the operation is already cheap in base R — but the interface remains consistent with the rest of the rs_ family.

microbenchmark::microbenchmark(
  r = cov2cor(r_covar_res),
  rust = rs_cov2cor(rust_covar_res),
  times = 10L
)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>     r 2.201775 2.413260 3.398934 3.197761 4.204378 6.013450    10
#>  rust 1.420715 1.533927 2.786392 2.006590 3.552870 6.808175    10

Correlations between two matrices

So far, all of the functions computed pairwise correlations between the columns of a single matrix. What about correlating the columns of two different matrices against each other?

no_rows_2 <- 500L
no_cols_2 <- 400L

set.seed(23L)

random_data_2 <- matrix(
  data = rnorm(no_rows_2 * no_cols_2, mean = 2, sd = 2),
  ncol = no_cols_2,
  nrow = no_rows_2
)

Let’s compare against R:

r_cor_2_matrices <- cor(random_data, random_data_2)

rust_cor_2_matrices <- rs_cor2(random_data, random_data_2, spearman = FALSE)

# small precision differences are further propagated here
all.equal(r_cor_2_matrices, rust_cor_2_matrices, tolerance = 1e-14)
#> [1] TRUE

And speed:

microbenchmark::microbenchmark(
  r = cor(random_data, random_data_2),
  rust = rs_cor2(random_data, random_data_2, spearman = FALSE),
  times = 10L
)
#> Unit: milliseconds
#>  expr        min         lq       mean     median         uq        max neval
#>     r 193.585846 193.748560 193.860604 193.798744 193.971397 194.475950    10
#>  rust   4.735992   5.324241   6.118443   5.915485   6.956501   7.738373    10

Distance metrics

Euclidean distance

Note that dist() computes row-wise distances, while rs_dist() expects columns to represent observations — hence the transpose.

r_euclidean_distance <- as.matrix(
  dist(random_data, method = "euclidean")
)

rs_euclidean_distance <- rs_dist(
  t(random_data),
  distance_type = "euclidean"
)

all.equal(
  r_euclidean_distance,
  rs_euclidean_distance,
  tolerance = 1e-14,
  check.attributes = FALSE
)
#> [1] TRUE
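The row-wise convention of dist() can be spot-checked by computing one entry by hand (base R only):

```r
# dist() treats rows as observations: entry (1, 2) is the euclidean
# distance between row 1 and row 2.
x <- matrix(rnorm(20), nrow = 4)
d <- as.matrix(dist(x, method = "euclidean"))
stopifnot(all.equal(d[1, 2], sqrt(sum((x[1, ] - x[2, ])^2))))
```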

And speed…

microbenchmark::microbenchmark(
  r = dist(random_data, method = "euclidean"),
  rust = rs_dist(random_data, distance_type = "euclidean"),
  times = 10L
)
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq      max neval
#>     r 95.095295 95.226420 95.548990 95.396589 95.626549 97.11469    10
#>  rust  4.565683  4.698441  4.892128  4.878208  5.145327  5.19537    10

Manhattan distance

r_manhattan_distance <- as.matrix(
  dist(random_data, method = "manhattan")
)

rs_manhattan_distance <- rs_dist(
  t(random_data),
  distance_type = "manhattan"
)

all.equal(
  r_manhattan_distance,
  rs_manhattan_distance,
  tolerance = 1e-14,
  check.attributes = FALSE
)
#> [1] TRUE

And speed…

microbenchmark::microbenchmark(
  r = dist(random_data, method = "manhattan"),
  rust = rs_dist(random_data, distance_type = "manhattan"),
  times = 10L
)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>     r 89.36864 89.41936 89.62563 89.67866 89.77539 89.84988    10
#>  rust 17.09855 17.16074 17.38307 17.25756 17.64508 17.87759    10

Canberra distance

r_canberra_distance <- as.matrix(
  dist(random_data, method = "canberra")
)

rs_canberra_distance <- rs_dist(
  t(random_data),
  distance_type = "canberra"
)

all.equal(
  r_canberra_distance,
  rs_canberra_distance,
  tolerance = 1e-14,
  check.attributes = FALSE
)
#> [1] TRUE

Same benchmarks as before…

microbenchmark::microbenchmark(
  r = dist(random_data, method = "canberra"),
  rust = rs_dist(random_data, distance_type = "canberra"),
  times = 10L
)
#> Unit: milliseconds
#>  expr       min        lq      mean    median        uq       max neval
#>     r 134.76585 134.87735 135.24578 134.98257 135.16090 136.95542    10
#>  rust  46.04899  46.31671  46.43284  46.43207  46.62209  46.69678    10

Mutual information

set.seed(246L)

nrows <- 100
ncols <- 100

mat <- matrix(data = rnorm(nrows * ncols), nrow = nrows, ncol = ncols)
rownames(mat) <- sprintf("sample_%i", 1:nrows)
colnames(mat) <- sprintf("feature_%i", 1:ncols)

The Rust version defaults to sqrt(nrow(mat)) bins when n_bins = NULL. Two discretisation strategies are available.
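To make the quantity concrete, here is a base-R sketch of mutual information between two vectors under equal-width binning, the same quantity the equal_width strategy below estimates (an illustration, not the bixverse implementation):

```r
# Mutual information of two vectors after equal-width discretisation,
# in nats: sum over bins of p(x, y) * log(p(x, y) / (p(x) * p(y))).
mi_equal_width <- function(x, y, n_bins) {
  bx <- cut(x, breaks = n_bins, labels = FALSE)
  by <- cut(y, breaks = n_bins, labels = FALSE)
  joint <- table(bx, by) / length(x)   # joint bin probabilities
  px <- rowSums(joint)                 # marginal for x
  py <- colSums(joint)                 # marginal for y
  nz <- joint > 0                      # skip empty cells to avoid log(0)
  sum(joint[nz] * log(joint[nz] / outer(px, py)[nz]))
}
```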

Equal width:

rust_res_mi <- rs_mutual_info(
  mat,
  n_bins = NULL,
  normalise = FALSE,
  strategy = "equal_width"
)
rownames(rust_res_mi) <- colnames(rust_res_mi) <- colnames(mat)

infotheo_res_mi <- infotheo::mutinformation(infotheo::discretize(
  mat,
  disc = "equalwidth",
  nbins = sqrt(nrow(mat))
))

all.equal(rust_res_mi, infotheo_res_mi)
#> [1] TRUE
microbenchmark::microbenchmark(
  infotheo = infotheo::mutinformation(infotheo::discretize(
    mat,
    disc = "equalwidth",
    nbins = sqrt(nrow(mat))
  )),
  rust = rs_mutual_info(
    mat,
    n_bins = NULL,
    normalise = FALSE,
    strategy = "equal_width"
  ),
  times = 10L
)
#> Unit: milliseconds
#>      expr       min      lq     mean    median        uq       max neval
#>  infotheo 83.162443 83.4657 84.20452 84.098332 84.891725 85.428568    10
#>      rust  3.307592  3.3649  3.47485  3.483766  3.564042  3.720183    10

Equal frequency:

rust_res_mi <- rs_mutual_info(
  mat,
  n_bins = NULL,
  normalise = FALSE,
  strategy = "equal_freq"
)
rownames(rust_res_mi) <- colnames(rust_res_mi) <- colnames(mat)

infotheo_res_mi <- infotheo::mutinformation(infotheo::discretize(
  mat,
  disc = "equalfreq",
  nbins = sqrt(nrow(mat))
))

all.equal(rust_res_mi, infotheo_res_mi)
#> [1] TRUE
microbenchmark::microbenchmark(
  infotheo = infotheo::mutinformation(infotheo::discretize(
    mat,
    disc = "equalfreq",
    nbins = sqrt(nrow(mat))
  )),
  rust = rs_mutual_info(
    mat,
    n_bins = NULL,
    normalise = FALSE,
    strategy = "equal_freq"
  ),
  times = 10L
)
#> Unit: milliseconds
#>      expr        min         lq       mean     median         uq        max
#>  infotheo 101.116219 101.428322 102.046697 101.896457 102.770610 103.272028
#>      rust   4.369887   4.415793   4.515948   4.561882   4.572656   4.626907
#>  neval
#>     10
#>     10

Set similarities

bixverse also exposes Rust-accelerated set similarity calculations. To illustrate the performance, below is a comparison against a naive R implementation of Jaccard similarity across two lists.

jaccard_sim <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}
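For reference, the overlap_coefficient argument of rs_set_similarity_list2 (shown further below) suggests the overlap (Szymkiewicz-Simpson) coefficient as the alternative metric; the standard formula in base R would be:

```r
# Overlap coefficient: intersection size over the size of the smaller set.
overlap_coef <- function(x, y) {
  length(intersect(x, y)) / min(length(unique(x)), length(unique(y)))
}
```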

set.seed(42L)

random_sets_1 <- purrr::map(
  1:1000L,
  ~ sample(LETTERS, sample(1:20, 1))
)

random_sets_2 <- purrr::map(
  1:500L,
  ~ sample(LETTERS, sample(1:20, 1))
)

Starting with a sequential purrr approach as the baseline:

tictoc::tic()

r_results <- purrr::map(
  random_sets_1,
  \(x) {
    purrr::map_dbl(random_sets_2, \(y) jaccard_sim(x, y))
  },
  .progress = TRUE
)

similarity_matrix <- matrix(
  data = unlist(r_results),
  nrow = length(random_sets_1),
  ncol = length(random_sets_2),
  byrow = TRUE
)

tictoc::toc()
#> 24.617 sec elapsed

Parallelising via furrr gives a meaningful speedup:

future::plan(future::multisession, workers = parallel::detectCores())

tictoc::tic()

r_results <- furrr::future_map(
  random_sets_1,
  \(x) {
    purrr::map_dbl(random_sets_2, \(y) jaccard_sim(x, y))
  },
  .progress = TRUE
)

similarity_matrix <- matrix(
  data = unlist(r_results),
  nrow = length(random_sets_1),
  ncol = length(random_sets_2),
  byrow = TRUE
)

tictoc::toc()
#> 13.838 sec elapsed

future::plan(future::sequential)

mirai has lower process-scheduling overhead than future and (sometimes) does a bit better:

mirai::daemons(parallel::detectCores())

tictoc::tic()

r_results_mirai <- mirai::mirai_map(
  random_sets_1,
  \(x, sets_2) {
    purrr::map_dbl(sets_2, \(y) {
      length(intersect(x, y)) / length(union(x, y))
    })
  },
  .args = list(sets_2 = random_sets_2)
)[]

similarity_matrix <- matrix(
  data = unlist(r_results_mirai),
  nrow = length(random_sets_1),
  ncol = length(random_sets_2),
  byrow = TRUE
)

tictoc::toc()
#> 13.392 sec elapsed

mirai::daemons(0)

Depending on your system, the parallel versions run a few times faster than the naive sequential baseline (here, on 2 cores, roughly 1.8x). Now compare against the Rust implementation:

tictoc::tic()

rust_res <- rs_set_similarity_list2(
  s_1_list = random_sets_1,
  s_2_list = random_sets_2,
  overlap_coefficient = FALSE
)

tictoc::toc()
#> 0.046 sec elapsed

all.equal(similarity_matrix, rust_res, tolerance = 1e-15)
#> [1] TRUE

The margin here is not subtle. The functions shown in this vignette are a representative sample of what is exposed in bixverse: there are many more specialised functions throughout the package worth exploring (if you are brave and can deal with a panic here and there).