scib-benchmark

Head-to-head benchmark comparing scib-metrics (JAX) and scib-rapids (CuPy/RAPIDS) for single-cell integration benchmarking metrics.

What this measures

Both libraries compute the same 12 single-cell integration benchmarking metrics on identical data. This benchmark measures:

Numerical equivalency: do both implementations produce the same metric values?
Runtime performance: how much faster is the RAPIDS implementation than the JAX one when both run on the same GPU?

Metrics benchmarked

Category	Metric	Input
Silhouette	`silhouette_label`	embeddings, labels
Silhouette	`silhouette_batch`	embeddings, labels, batch
Silhouette	`bras` (Batch Removal Adapted Silhouette)	embeddings, labels, batch
LISI	`ilisi_knn` (integration LISI)	kNN graph, batch
LISI	`clisi_knn` (cell-type LISI)	kNN graph, labels
Batch effect	`kbet`	kNN graph, batch
Batch effect	`kbet_per_label`	kNN graph, batch, labels
Clustering	`nmi_ari_cluster_labels_kmeans`	embeddings, labels
Clustering	`nmi_ari_cluster_labels_leiden`	kNN graph, labels
Integration	`isolated_labels`	embeddings, labels, batch
Integration	`graph_connectivity`	kNN graph, labels
Regression	`pcr_comparison`	pre/post embeddings, covariate

Methodology

Data

Real single-cell RNA-seq data from Tabula Sapiens v2 (548k cells, multi-tissue, multi-donor), accessed via a local TileDB-SOMA store. The benchmark subsamples to multiple dataset sizes (default: 1,000 / 20,000 cells). Larger sizes (e.g., 400k) are not run by default because scib-metrics becomes prohibitively slow at that scale.
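Subsampling to a fixed size can be sketched as follows. This is a minimal illustration and not the repository's code: the function name `subsample_indices` and the seed value are assumptions, and only the standard library is used. A fixed seed keeps the selection reproducible, so both workers can be handed exactly the same cells.

```python
import random


def subsample_indices(n_total, n_sub, seed=0):
    """Pick n_sub distinct cell indices out of n_total, reproducibly."""
    if n_sub > n_total:
        raise ValueError("cannot draw more cells than the dataset contains")
    rng = random.Random(seed)  # seeded generator: same indices on every run
    return sorted(rng.sample(range(n_total), n_sub))


# Hypothetical usage with the dataset and subset sizes quoted above.
idx_1k = subsample_indices(548_000, 1_000)
idx_20k = subsample_indices(548_000, 20_000)
```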

Preprocessing pipeline:

Filter genes (min 3 cells)
Normalize to 10,000 counts per cell + log1p
Select 2,000 highly variable genes
PCA (50 components)
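The first two steps of this pipeline can be sketched in plain Python. This is only an illustration of the arithmetic on a tiny count matrix: the function name `filter_normalize_log1p` is hypothetical, `download_data.py` may well use a dedicated single-cell library instead, and the highly variable gene selection and PCA steps are left out.

```python
import math


def filter_normalize_log1p(counts, min_cells=3, target_sum=10_000.0):
    """counts: list of rows (cells), each a list of per-gene counts."""
    n_genes = len(counts[0])
    # Keep genes detected (count > 0) in at least min_cells cells.
    keep = [
        g for g in range(n_genes)
        if sum(1 for row in counts if row[g] > 0) >= min_cells
    ]
    out = []
    for row in counts:
        kept = [row[g] for g in keep]
        total = sum(kept)
        # Scale each cell to target_sum total counts; empty cells stay zero.
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(v * scale) for v in kept])
    return keep, out


toy_counts = [[1, 0, 3], [2, 0, 2], [3, 1, 5], [0, 0, 0]]
kept_genes, log_norm = filter_normalize_log1p(toy_counts)
```

In the toy matrix, the middle gene is detected in only one cell and is dropped, while the other two genes are kept and rescaled per cell before the log transform.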

Pre-integration embeddings are simulated by adding batch-correlated shifts to the PCA coordinates.
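The exact shift model is not described here, so the following is a sketch under a stated assumption: every batch gets one Gaussian offset vector, which is added to all cells of that batch. The function name `add_batch_shifts` and the `scale` and `seed` parameters are hypothetical.

```python
import random


def add_batch_shifts(embedding, batches, scale=1.0, seed=0):
    """Add one random offset vector per batch to the rows of an embedding."""
    rng = random.Random(seed)
    n_dims = len(embedding[0])
    # One offset vector per distinct batch label (assumed shift model).
    shifts = {
        b: [rng.gauss(0.0, scale) for _ in range(n_dims)]
        for b in sorted(set(batches))
    }
    return [
        [x + s for x, s in zip(row, shifts[b])]
        for row, b in zip(embedding, batches)
    ]


# Toy example: four cells at the origin, two batches.
toy_embedding = [[0.0, 0.0] for _ in range(4)]
toy_batches = ["a", "b", "a", "b"]
shifted = add_batch_shifts(toy_embedding, toy_batches)
```

Cells from the same batch move together, which is what makes the simulated pre-integration embedding carry a batch effect.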

Environment

Each library runs in its own isolated virtual environment created with uv:

venv_metrics: scib-metrics + jax[cuda12] (JAX with CUDA GPU support)
venv_rapids: scib-rapids + cupy-cuda12x

Both are editable installs from local repos (/home/inference/repos/scib-metrics and /home/inference/repos/scib-rapids).

Important: The JAX benchmark uses jax[cuda12], i.e., JAX with CUDA GPU acceleration — not jax[cpu]. This ensures a fair GPU-vs-GPU comparison.

Execution

Workers run sequentially on a single GPU (CUDA_VISIBLE_DEVICES=0). Each worker:

Loads the same .npy data arrays
Computes k-nearest neighbors (k=90) using PyNNDescent
Runs all 12 metrics, timing each independently
Saves results + timings as JSON
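The per-metric timing loop can be sketched as below. This is not the actual `worker.py`: the helper `run_and_time` and the shape of the JSON payload are assumptions made for illustration. Converting each result to a Python float before stopping the clock is one way to make sure pending GPU work is included in the measured time.

```python
import json
import time


def run_and_time(metric_fns):
    """metric_fns maps a metric name to a zero-argument callable."""
    results, timings = {}, {}
    for name, fn in metric_fns.items():
        start = time.perf_counter()
        # float() copies the value to the host, so the timer also covers
        # any asynchronous device work still in flight.
        results[name] = float(fn())
        timings[name] = time.perf_counter() - start
    return {"results": results, "timings": timings}


# Hypothetical stand-in for a real metric call.
report = run_and_time({"dummy_metric": lambda: 0.5})
payload = json.dumps(report, indent=2)
```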

Equivalency criteria

Deterministic metrics: must match within 1% relative difference
Stochastic metrics (kmeans, leiden clustering): 10% tolerance due to PRNG differences
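A check along these lines could look like the sketch below. The helper names and the choice of denominator (the larger of the two magnitudes) are assumptions; `compare_results.py` may define relative difference differently.

```python
STOCHASTIC_PREFIXES = (
    "nmi_ari_cluster_labels_kmeans",
    "nmi_ari_cluster_labels_leiden",
)


def relative_difference(a, b):
    """Absolute difference divided by the larger magnitude (assumed form)."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def is_equivalent(name, a, b, det_tol=0.01, stoch_tol=0.10):
    """Apply the 10% tolerance to stochastic metrics, 1% to the rest."""
    tol = stoch_tol if name.startswith(STOCHASTIC_PREFIXES) else det_tol
    return relative_difference(a, b) <= tol


# Value pairs taken from the result tables in this README.
ok_silhouette = is_equivalent("silhouette_label", 0.536008, 0.536008)
ok_kmeans = is_equivalent(
    "nmi_ari_cluster_labels_kmeans_ari", 0.444720, 0.427040
)
ok_kbet_20k = is_equivalent("kbet_per_label", 0.474108, 0.468552)
```

With the values reported in this README, `silhouette_label` passes trivially and the 1k k-means ARI pair (about 4% apart) passes the 10% check, while the 20k `kbet_per_label` pair (about 1.17% apart) would fall outside a strict 1% check under this definition.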

Usage

# Full run (setup venvs, download data, benchmark, compare)
./run_benchmark.sh

# Skip venv setup (reuse existing)
./run_benchmark.sh --skip-setup

# Skip data download (reuse existing)
./run_benchmark.sh --skip-data

# Custom dataset sizes
./run_benchmark.sh --sizes "1000 20000"

# Custom k-neighbors
./run_benchmark.sh --n-neighbors 90

Requirements

NVIDIA GPU with CUDA 12.x
uv package manager
Local clones of scib-metrics and scib-rapids at /home/inference/repos/
Local TileDB-SOMA data store (for download_data.py)

Output

Results are saved to benchmark_results/:

metrics_n{size}.json — scib-metrics results per dataset size
rapids_n{size}.json — scib-rapids results per dataset size
comparison.json — full equivalency + timing comparison
logs/ — stdout/stderr from each worker

File overview

File	Purpose
`run_benchmark.sh`	Main pipeline orchestrator
`setup_venvs.sh`	Creates isolated venvs with uv
`download_data.py`	Fetches and preprocesses real scRNA-seq data from SOMA
`generate_data.py`	Alternative synthetic data generator
`worker.py`	Runs all 12 metrics (auto-detects scib-metrics or scib-rapids)
`compare_results.py`	Compares metric values and timings, prints summary tables

Results

Results from the latest run on 1k and 20k cell subsets of Tabula Sapiens v2 (single NVIDIA GPU, k=90 neighbors).

Metric values

Raw metric values from each library.

n = 1,000 cells

Metric	scib-metrics	scib-rapids
`bras`	0.679313	0.679147
`clisi_knn`	0.992598	0.992598
`graph_connectivity`	0.956295	0.956295
`ilisi_knn`	0.208344	0.208285
`isolated_labels`	0.576349	0.576349
`kbet`	0.172000	0.172000
`kbet_per_label`	0.885625	0.885427
`nmi_ari_cluster_labels_kmeans_ari`	0.444720	0.427040
`nmi_ari_cluster_labels_kmeans_nmi`	0.790064	0.781690
`nmi_ari_cluster_labels_leiden_ari`	0.593712	0.553312
`nmi_ari_cluster_labels_leiden_nmi`	0.785297	0.773716
`pcr_comparison`	0.873575	0.873576
`silhouette_batch`	0.804671	0.804671
`silhouette_label`	0.536008	0.536008

n = 20,000 cells

Metric	scib-metrics	scib-rapids
`bras`	0.641492	0.641519
`clisi_knn`	0.999022	0.999023
`graph_connectivity`	0.870374	0.870374
`ilisi_knn`	0.092136	0.092127
`isolated_labels`	0.552903	0.552903
`kbet`	0.025750	0.025750
`kbet_per_label`	0.474108	0.468552
`nmi_ari_cluster_labels_kmeans_ari`	0.300553	0.311180
`nmi_ari_cluster_labels_kmeans_nmi`	0.717732	0.716465
`nmi_ari_cluster_labels_leiden_ari`	0.693071	0.684802
`nmi_ari_cluster_labels_leiden_nmi`	0.802485	0.797029
`pcr_comparison`	0.921949	0.921950
`silhouette_batch`	0.788884	0.788884
`silhouette_label`	0.511745	0.511745

Timing

Wall-clock seconds per metric, with speedup = metrics_time / rapids_time.
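As a worked example of that ratio, the snippet below recomputes two speedup entries from the n = 1,000 table. The function name `speedup` is just a label for this illustration.

```python
def speedup(metrics_seconds, rapids_seconds):
    """Values above 1 mean scib-rapids is faster, below 1 mean slower."""
    return metrics_seconds / rapids_seconds


# (scib-metrics seconds, scib-rapids seconds) from the n = 1,000 table.
timings_1k = {
    "bras": (20.7397, 1.0738),
    "isolated_labels": (0.0140, 0.2588),
}
speedups_1k = {
    name: round(speedup(m, r), 2) for name, (m, r) in timings_1k.items()
}
```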

n = 1,000 cells

Metric	scib-metrics (s)	scib-rapids (s)	speedup
`bras`	20.7397	1.0738	19.31×
`clisi_knn`	0.2940	0.0023	127.83×
`graph_connectivity`	0.0487	0.0477	1.02×
`ilisi_knn`	1.5596	0.5072	3.07×
`isolated_labels`	0.0140	0.2588	0.05×
`kbet`	1.0525	0.9294	1.13×
`kbet_per_label`	18.9501	5.4061	3.51×
`nearest_neighbors`	19.5032	20.8665	0.93×
`nmi_ari_cluster_labels_kmeans`	3.2053	2.8090	1.14×
`nmi_ari_cluster_labels_leiden`	1.1334	1.5514	0.73×
`pcr_comparison`	1.2024	2.1607	0.56×
`silhouette_batch`	25.6472	0.1800	142.48×
`silhouette_label`	3.1603	4.0132	0.79×

n = 20,000 cells

Metric	scib-metrics (s)	scib-rapids (s)	speedup
`bras`	88.6623	0.8740	101.44×
`clisi_knn`	0.9002	0.0069	130.46×
`graph_connectivity`	0.1646	0.1586	1.04×
`ilisi_knn`	1.9619	0.0175	112.11×
`isolated_labels`	0.1285	4.9100	0.03×
`kbet`	2.2675	0.0224	101.23×
`kbet_per_label`	61.8097	18.2841	3.38×
`nearest_neighbors`	33.6002	32.8474	1.02×
`nmi_ari_cluster_labels_kmeans`	8.9912	8.0512	1.12×
`nmi_ari_cluster_labels_leiden`	44.6316	4.6684	9.56×
`pcr_comparison`	1.3026	0.2186	5.96×
`silhouette_batch`	106.9684	0.4200	254.69×
`silhouette_label`	646.0577	5.3100	121.67×

Highlights:

silhouette_batch, bras, silhouette_label, clisi_knn, ilisi_knn, and kbet each run roughly 100× to 255× faster in scib-rapids at 20k cells (about two orders of magnitude).
isolated_labels is the one metric where scib-metrics is consistently faster: the scib-rapids path is ~18× slower at 1k and ~38× slower at 20k.
nearest_neighbors (PyNNDescent) is shared code and runs at parity.

Full per-metric values and raw timings are in benchmark_results/comparison.json.

Notes on numerical differences

Most metrics agree exactly or within float-rounding noise. Three metrics drift:

kbet_per_label (~0.02% at 1k, ~1.17% at 20k): kbet_per_label internally calls a diffusion-map kNN step. scib-rapids (src/scib_rapids/utils/_diffusion_nn.py:52-64) passes a deterministic v0 = ones(n)/√n to scipy.sparse.linalg.eigsh and canonicalizes eigenvector signs (it forces the largest-magnitude component to be positive). scib-metrics (src/scib_metrics/utils/_diffusion_nn.py:68) does neither, so Lanczos can sign-flip eigenvectors from run to run, perturbing the diffusion-distance kNN graph that kbet_per_label consumes. scib-rapids also computes the chi-square test in float32, whereas scib-metrics uses float64, which adds a small further drift. Aggregate kbet matches exactly because it does not go through diffusion_nn. Adding v0 and sign canonicalization upstream in scib-metrics should remove the eigenvector-related part of this difference; the float32 vs float64 chi-square would still leave a small residual.
nmi_ari_cluster_labels_kmeans_* (up to ~4%): both libraries seed with 0, but scib-metrics uses jax.random.PRNGKey(0) while scib-rapids uses np.random.default_rng(0). The same seed therefore yields different PRNG streams, and with them different k-means++ initializations, cluster assignments, and NMI/ARI values. The algorithms are equivalent; only the random draws differ.
nmi_ari_cluster_labels_leiden_* (up to ~7% ARI at 1k, ~1.2% at 20k): scib-rapids now runs leiden via cugraph.leiden (Apache-2, GPU) while scib-metrics uses igraph.community_leiden (GPL-2, CPU). Different leiden implementations with different refinement and move strategies produce different partitions, so NMI/ARI differ modestly. The license and performance win (see timing table — nmi_ari_cluster_labels_leiden is ~10× faster at 20k) makes this the right trade-off; the two are not expected to agree bit-for-bit.
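The two stabilizing steps described for kbet_per_label (a fixed start vector and sign canonicalization) can be sketched on plain Python lists. This is an illustration of the idea only, not the code in either library; the helper names are hypothetical. In a real pipeline the start vector would be an array passed as the `v0` argument of `scipy.sparse.linalg.eigsh`, and the sign rule would be applied to each returned eigenvector.

```python
import math


def deterministic_v0(n):
    """Fixed start vector with all entries 1/sqrt(n), so it has unit norm."""
    return [1.0 / math.sqrt(n)] * n


def canonicalize_sign(vec):
    """Flip the vector if its largest-magnitude component is negative."""
    pivot = max(vec, key=abs)
    return [-x for x in vec] if pivot < 0 else list(vec)


# An eigenvector and its negation describe the same eigenspace;
# after canonicalization both map to the same representative.
v = [0.1, -0.9, 0.2]
flipped = [-x for x in v]
same_after_fix = canonicalize_sign(v) == canonicalize_sign(flipped)
```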
