Head-to-head benchmark comparing scib-metrics (JAX) and scib-rapids (CuPy/RAPIDS) for single-cell integration benchmarking metrics.
Both libraries compute the same 12 single-cell integration benchmarking metrics on identical data. This benchmark measures:
- Numerical equivalency: do both implementations produce the same metric values?
- Runtime performance: how much faster is the RAPIDS implementation than the JAX one, with both running on GPU?

| Category | Metric | Input |
|---|---|---|
| Silhouette | `silhouette_label` | embeddings, labels |
| Silhouette | `silhouette_batch` | embeddings, labels, batch |
| Silhouette | `bras` (Batch Removal Adapted Silhouette) | embeddings, labels, batch |
| LISI | `ilisi_knn` (integration LISI) | kNN graph, batch |
| LISI | `clisi_knn` (cell-type LISI) | kNN graph, labels |
| Batch effect | `kbet` | kNN graph, batch |
| Batch effect | `kbet_per_label` | kNN graph, batch, labels |
| Clustering | `nmi_ari_cluster_labels_kmeans` | embeddings, labels |
| Clustering | `nmi_ari_cluster_labels_leiden` | kNN graph, labels |
| Integration | `isolated_labels` | embeddings, labels, batch |
| Integration | `graph_connectivity` | kNN graph, labels |
| Regression | `pcr_comparison` | pre/post embeddings, covariate |

Real single-cell RNA-seq data from Tabula Sapiens v2 (548k cells, multi-tissue, multi-donor), accessed via a local TileDB-SOMA store. The benchmark subsamples to multiple dataset sizes (default: 1,000 / 20,000 cells). Larger sizes (e.g., 400k) are not run by default because scib-metrics becomes prohibitively slow at that scale.
Preprocessing pipeline:
- Filter genes (min 3 cells)
- Normalize to 10,000 counts per cell + log1p
- Select 2,000 highly variable genes
- PCA (50 components)
Pre-integration embeddings are simulated by adding batch-correlated shifts to the PCA coordinates.
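
The exact shift model is not spelled out here. As a minimal sketch of the idea, assuming one random offset vector per batch that is added to every cell of that batch (the function name `simulate_pre_integration`, the `shift_scale` parameter, and the Gaussian offsets are illustrative assumptions, not the benchmark's actual code):

```python
import numpy as np


def simulate_pre_integration(pca, batch, shift_scale=2.0, seed=0):
    """Return a copy of `pca` with a per-batch offset added to each cell.

    Hypothetical helper: the Gaussian offsets and `shift_scale` are
    illustrative choices, not the benchmark's actual parameters.
    """
    rng = np.random.default_rng(seed)
    # Map each batch label to an integer code 0..n_batches-1.
    batches, codes = np.unique(batch, return_inverse=True)
    # One offset vector per batch, shared by all cells in that batch.
    offsets = rng.normal(scale=shift_scale, size=(len(batches), pca.shape[1]))
    return pca + offsets[codes]


# Toy input: 6 cells, 3 PCA components, 2 batches.
pca = np.zeros((6, 3))
batch = np.array(["a", "a", "b", "b", "a", "b"])
pre = simulate_pre_integration(pca, batch)
```

Cells from the same batch receive the same displacement, so the simulated "pre" embedding carries a batch signal that the "post" embedding does not, which is what `pcr_comparison` needs to have something to compare.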
Each library runs in its own isolated virtual environment created with uv:
- `venv_metrics`: `scib-metrics` + `jax[cuda12]` (JAX with CUDA GPU support)
- `venv_rapids`: `scib-rapids` + `cupy-cuda12x`
Both are editable installs from local repos (/home/inference/repos/scib-metrics and /home/inference/repos/scib-rapids).
Important: The JAX benchmark uses jax[cuda12], i.e., JAX with CUDA GPU acceleration — not jax[cpu]. This ensures a fair GPU-vs-GPU comparison.
Workers run sequentially on a single GPU (CUDA_VISIBLE_DEVICES=0). Each worker:
- Loads the same `.npy` data arrays
- Computes k-nearest neighbors (k=90) using PyNNDescent
- Runs all 12 metrics, timing each independently
- Saves results + timings as JSON
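
The per-metric timing and JSON output described above might look roughly like the following. This is a sketch, not the contents of `worker.py`: the helper `run_timed`, the `"values"`/`"timings"` keys, and the toy metrics are hypothetical stand-ins.

```python
import json
import time


def run_timed(metrics, **data):
    """Run each metric callable once, recording its value and wall-clock time."""
    values, timings = {}, {}
    for name, fn in metrics.items():
        start = time.perf_counter()
        # float() pulls the result back to the host before the timer stops.
        values[name] = float(fn(**data))
        timings[name] = time.perf_counter() - start
    return {"values": values, "timings": timings}


# Stand-in metrics; a real worker would call the library functions here.
toy_metrics = {
    "mean_x": lambda x: sum(x) / len(x),
    "max_x": lambda x: max(x),
}
result = run_timed(toy_metrics, x=[0.2, 0.4, 0.9])
payload = json.dumps(result, indent=2)  # text that would be written to the results file
```

Converting each result to a Python float before reading the clock matters on GPU: both JAX and CuPy dispatch work asynchronously, so a timer stopped before the value is materialized can under-report the real cost.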
Equivalency tolerances:
- Deterministic metrics: must match within 1% relative difference
- Stochastic metrics (k-means and Leiden clustering): 10% tolerance due to PRNG and implementation differences
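
A tolerance check of this kind can be sketched as below. The relative-difference formula (normalizing by the larger magnitude) and the names `relative_diff`, `check_equivalence`, and `STOCHASTIC_PREFIXES` are assumptions for illustration; `compare_results.py` may define them differently.

```python
STOCHASTIC_PREFIXES = ("nmi_ari_cluster_labels_kmeans", "nmi_ari_cluster_labels_leiden")


def relative_diff(a, b):
    """Relative difference |a - b| / max(|a|, |b|), defined as 0 when both are 0."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def check_equivalence(metrics_vals, rapids_vals, det_tol=0.01, stoch_tol=0.10):
    """Map each shared metric name to (relative difference, passed?)."""
    report = {}
    for name in sorted(set(metrics_vals) & set(rapids_vals)):
        tol = stoch_tol if name.startswith(STOCHASTIC_PREFIXES) else det_tol
        diff = relative_diff(metrics_vals[name], rapids_vals[name])
        report[name] = (diff, diff <= tol)
    return report


# Values copied from the results tables in this README.
report = check_equivalence(
    {"kbet": 0.025750, "kbet_per_label": 0.474108, "nmi_ari_cluster_labels_kmeans_ari": 0.300553},
    {"kbet": 0.025750, "kbet_per_label": 0.468552, "nmi_ari_cluster_labels_kmeans_ari": 0.311180},
)
for name, (diff, ok) in report.items():
    print(f"{name}: {diff:.2%} {'PASS' if ok else 'FAIL'}")
```

Matching on a name prefix lets the split `_ari` and `_nmi` outputs of the clustering metrics inherit the looser tolerance without listing each one.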

```bash
# Full run (setup venvs, download data, benchmark, compare)
./run_benchmark.sh
# Skip venv setup (reuse existing)
./run_benchmark.sh --skip-setup
# Skip data download (reuse existing)
./run_benchmark.sh --skip-data
# Custom dataset sizes
./run_benchmark.sh --sizes "1000 20000"
# Custom k-neighbors
./run_benchmark.sh --n-neighbors 90
```

Requirements:
- NVIDIA GPU with CUDA 12.x
- uv package manager
- Local clones of `scib-metrics` and `scib-rapids` at `/home/inference/repos/`
- Local TileDB-SOMA data store (for `download_data.py`)

Results are saved to benchmark_results/:
- `metrics_n{size}.json`: scib-metrics results per dataset size
- `rapids_n{size}.json`: scib-rapids results per dataset size
- `comparison.json`: full equivalency + timing comparison
- `logs/`: stdout/stderr from each worker

| File | Purpose |
|---|---|
| `run_benchmark.sh` | Main pipeline orchestrator |
| `setup_venvs.sh` | Creates isolated venvs with uv |
| `download_data.py` | Fetches and preprocesses real scRNA-seq data from SOMA |
| `generate_data.py` | Alternative synthetic data generator |
| `worker.py` | Runs all 12 metrics (auto-detects scib-metrics or scib-rapids) |
| `compare_results.py` | Compares metric values and timings, prints summary tables |

Results from the latest run on 1k and 20k cell subsets of Tabula Sapiens v2 (single NVIDIA GPU, k=90 neighbors).
Raw metric values from each library.

1,000 cells:

| Metric | scib-metrics | scib-rapids |
|---|---|---|
| `bras` | 0.679313 | 0.679147 |
| `clisi_knn` | 0.992598 | 0.992598 |
| `graph_connectivity` | 0.956295 | 0.956295 |
| `ilisi_knn` | 0.208344 | 0.208285 |
| `isolated_labels` | 0.576349 | 0.576349 |
| `kbet` | 0.172000 | 0.172000 |
| `kbet_per_label` | 0.885625 | 0.885427 |
| `nmi_ari_cluster_labels_kmeans_ari` | 0.444720 | 0.427040 |
| `nmi_ari_cluster_labels_kmeans_nmi` | 0.790064 | 0.781690 |
| `nmi_ari_cluster_labels_leiden_ari` | 0.593712 | 0.553312 |
| `nmi_ari_cluster_labels_leiden_nmi` | 0.785297 | 0.773716 |
| `pcr_comparison` | 0.873575 | 0.873576 |
| `silhouette_batch` | 0.804671 | 0.804671 |
| `silhouette_label` | 0.536008 | 0.536008 |

20,000 cells:

| Metric | scib-metrics | scib-rapids |
|---|---|---|
| `bras` | 0.641492 | 0.641519 |
| `clisi_knn` | 0.999022 | 0.999023 |
| `graph_connectivity` | 0.870374 | 0.870374 |
| `ilisi_knn` | 0.092136 | 0.092127 |
| `isolated_labels` | 0.552903 | 0.552903 |
| `kbet` | 0.025750 | 0.025750 |
| `kbet_per_label` | 0.474108 | 0.468552 |
| `nmi_ari_cluster_labels_kmeans_ari` | 0.300553 | 0.311180 |
| `nmi_ari_cluster_labels_kmeans_nmi` | 0.717732 | 0.716465 |
| `nmi_ari_cluster_labels_leiden_ari` | 0.693071 | 0.684802 |
| `nmi_ari_cluster_labels_leiden_nmi` | 0.802485 | 0.797029 |
| `pcr_comparison` | 0.921949 | 0.921950 |
| `silhouette_batch` | 0.788884 | 0.788884 |
| `silhouette_label` | 0.511745 | 0.511745 |

Wall-clock seconds per metric, with speedup = metrics_time / rapids_time.

1,000 cells:

| Metric | scib-metrics (s) | scib-rapids (s) | speedup |
|---|---|---|---|
| `bras` | 20.7397 | 1.0738 | 19.31× |
| `clisi_knn` | 0.2940 | 0.0023 | 127.83× |
| `graph_connectivity` | 0.0487 | 0.0477 | 1.02× |
| `ilisi_knn` | 1.5596 | 0.5072 | 3.07× |
| `isolated_labels` | 0.0140 | 0.2588 | 0.05× |
| `kbet` | 1.0525 | 0.9294 | 1.13× |
| `kbet_per_label` | 18.9501 | 5.4061 | 3.51× |
| `nearest_neighbors` | 19.5032 | 20.8665 | 0.93× |
| `nmi_ari_cluster_labels_kmeans` | 3.2053 | 2.8090 | 1.14× |
| `nmi_ari_cluster_labels_leiden` | 1.1334 | 1.5514 | 0.73× |
| `pcr_comparison` | 1.2024 | 2.1607 | 0.56× |
| `silhouette_batch` | 25.6472 | 0.1800 | 142.48× |
| `silhouette_label` | 3.1603 | 4.0132 | 0.79× |

20,000 cells:

| Metric | scib-metrics (s) | scib-rapids (s) | speedup |
|---|---|---|---|
| `bras` | 88.6623 | 0.8740 | 101.44× |
| `clisi_knn` | 0.9002 | 0.0069 | 130.46× |
| `graph_connectivity` | 0.1646 | 0.1586 | 1.04× |
| `ilisi_knn` | 1.9619 | 0.0175 | 112.11× |
| `isolated_labels` | 0.1285 | 4.9100 | 0.03× |
| `kbet` | 2.2675 | 0.0224 | 101.23× |
| `kbet_per_label` | 61.8097 | 18.2841 | 3.38× |
| `nearest_neighbors` | 33.6002 | 32.8474 | 1.02× |
| `nmi_ari_cluster_labels_kmeans` | 8.9912 | 8.0512 | 1.12× |
| `nmi_ari_cluster_labels_leiden` | 44.6316 | 4.6684 | 9.56× |
| `pcr_comparison` | 1.3026 | 0.2186 | 5.96× |
| `silhouette_batch` | 106.9684 | 0.4200 | 254.69× |
| `silhouette_label` | 646.0577 | 5.3100 | 121.67× |

Highlights:
- `silhouette_batch`, `bras`, `silhouette_label`, `clisi_knn`, `ilisi_knn`, and `kbet` each run roughly 100× to 250× faster in scib-rapids at 20k cells.
- `isolated_labels` is the one metric where scib-metrics is faster at both sizes (the scib-rapids path is ~18× slower at 1k and ~38× slower at 20k).
- `nearest_neighbors` (PyNNDescent) is shared code and runs at parity.
Full per-metric values and raw timings are in benchmark_results/comparison.json.
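
The speedup column in the timing tables above is a plain ratio of the two wall-clock times. A small sketch of how it could be recomputed from two timing dictionaries (the helper name `speedups` and the dictionary layout are assumptions, not the schema of `comparison.json`):

```python
def speedups(metrics_times, rapids_times):
    """Return {metric: metrics_time / rapids_time} for metrics timed by both."""
    return {
        name: metrics_times[name] / rapids_times[name]
        for name in metrics_times
        if name in rapids_times and rapids_times[name] > 0
    }


# Three rows copied from the timing tables above.
metrics_times = {"bras": 88.6623, "isolated_labels": 0.1285, "nearest_neighbors": 33.6002}
rapids_times = {"bras": 0.8740, "isolated_labels": 4.9100, "nearest_neighbors": 32.8474}

s = speedups(metrics_times, rapids_times)
for name, ratio in sorted(s.items(), key=lambda kv: -kv[1]):
    faster = "scib-rapids" if ratio > 1 else "scib-metrics"
    print(f"{name:20s} {ratio:7.2f}x  ({faster} faster)")
```

A ratio above 1 means scib-rapids was faster; a ratio below 1, as for `isolated_labels`, means scib-metrics was faster.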
Most metrics agree exactly or within float-rounding noise. Three groups of metrics drift:

- `kbet_per_label` (~0.02% at 1k, ~1.17% at 20k): `kbet_per_label` internally calls a diffusion-map kNN step. `scib-rapids` (`src/scib_rapids/utils/_diffusion_nn.py:52-64`) passes a deterministic `v0 = ones(n)/√n` to `scipy.sparse.linalg.eigsh` and canonicalizes eigenvector signs (forcing the largest-magnitude component positive). `scib-metrics` (`src/scib_metrics/utils/_diffusion_nn.py:68`) does neither, so Lanczos can sign-flip eigenvectors from run to run, perturbing the diffusion-distance kNN graph that `kbet_per_label` consumes. `scib-rapids` also computes the chi-square statistic in float32, versus float64 in `scib-metrics`, which contributes a small additional drift. Aggregate `kbet` matches exactly because it does not go through `diffusion_nn`. The 20k difference slightly exceeds the 1% tolerance for deterministic metrics. Adding `v0` and sign canonicalization upstream in `scib-metrics` should bring the two into agreement here, up to the float32/float64 difference.
- `nmi_ari_cluster_labels_kmeans_*` (up to ~4%): both libraries seed with `0`, but `scib-metrics` uses `jax.random.PRNGKey(0)` while `scib-rapids` uses `np.random.default_rng(0)`. The two generators produce different random streams from the same seed, which leads to different k-means++ initializations, different cluster assignments, and therefore different NMI/ARI. The algorithms are equivalent; only the random draws differ.
- `nmi_ari_cluster_labels_leiden_*` (up to ~7% ARI at 1k, ~1.2% at 20k): `scib-rapids` runs Leiden via `cugraph.leiden` (Apache-2, GPU) while `scib-metrics` uses `igraph.community_leiden` (GPL-2, CPU). The two implementations use different refinement and move strategies and so produce different partitions, which shifts NMI/ARI modestly. The licensing and performance gains (`nmi_ari_cluster_labels_leiden` is ~10× faster at 20k in the timing table) make this a worthwhile trade-off; the two are not expected to agree bit-for-bit.
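
To make the eigenvector-sign point concrete, here is a minimal NumPy sketch of sign canonicalization as described above (largest-magnitude component forced positive). It is not the code from either repository: `canonicalize_signs` is a hypothetical name, and a dense `np.linalg.eigh` stands in for the sparse `eigsh` call.

```python
import numpy as np


def canonicalize_signs(vecs):
    """Flip each column so its largest-magnitude entry is positive."""
    vecs = np.asarray(vecs, dtype=float)
    # Row index of the largest-magnitude entry in every column.
    pivot = np.abs(vecs).argmax(axis=0)
    signs = np.sign(vecs[pivot, np.arange(vecs.shape[1])])
    signs[signs == 0] = 1.0  # leave all-zero columns untouched
    return vecs * signs


n = 5
# Deterministic start vector of unit norm; with SciPy it would be passed
# as eigsh(..., v0=v0) so Lanczos starts from the same point every run.
v0 = np.ones(n) / np.sqrt(n)

# A small symmetric matrix standing in for the diffusion operator.
rng = np.random.default_rng(0)
a = rng.normal(size=(n, n))
sym = (a + a.T) / 2
_, vecs = np.linalg.eigh(sym)

# An eigenvector and its negation are equally valid solver outputs;
# after canonicalization both map to the same representative.
flipped = vecs * np.array([1, -1, 1, -1, 1])
same = np.allclose(canonicalize_signs(vecs), canonicalize_signs(flipped))
```

Because any downstream distance computed from the canonicalized vectors no longer depends on which sign the solver happened to return, the resulting kNN graph is reproducible across runs.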