VCBench

VCBench is a capability-stratified benchmark for single-cell foundation models, evaluating five models against pre-registered baselines across five dimensions.

First-time visitors — start here. The v1.0.0 release reconciles VCBench's evaluator with upstream cell-eval to numerical precision via an explicit anchor-convention parameter. The Arc State checkpoint lives publicly at huggingface.co/appliedscientific/arc-state-norman-gears-corrected — with paste-able reproduction snippets that recover the headline numbers in <5 min on CPU.

Repository access: The canonical company-org repository at AppliedScientific/VCBench is private during peer review. Contact the corresponding author for read access (URLs of the form github.com/AppliedScientific/VCBench/... will return 404 without an invite). The companion HuggingFace artefact + reproduction snippets are public and require no access request.

What it produces

The headline result is a 5-model × 5-dimension capability matrix scored against pre-registered trivial and strongest non-FM baselines, plus three reusable methodological tools that work independently of the specific models evaluated here.

Model	A: PRR	B: MacroF1 (native)	C: AUROC	C: AUPRC	C: EPR	D: Pearson	E: τ-b
Geneformer V2-316M	0.627 (FT+D)	0.181 (ZS)	0.626 (IE)	0.001 (IE)	0.000 (IE)	0.001 (ZS+D)	-0.017 (ZS)
scGPT (fine-tuned)	0.503 (FT)	0.012 (ZS)	0.519 (IE)	0.003 (IE)	20.05 (IE)	0.064 (ZS+D)	-0.057 (ZS)
UCE 33-layer	N/A	0.027 (ZS)	N/A	N/A	N/A	0.132 (ZS+D)	0.136 (ZS)
TranscriptFormer	-0.174 (ZS+D)	0.156 (ZS)	N/A	N/A	N/A	0.232 (ZS+D)	0.046 †
Arc State	0.402 (FT) ‡	N/A	N/A	N/A	N/A	DNR	DNR
Baselines
Additive (Ahlmann-Eltze)	0.890	—	—	—	—	—	—
Mean-prediction	0.579	—	—	—	—	0.152	—
No-change	0.000	—	—	—	—	—	—
PCA + kNN	—	0.166	—	—	—	—	—
Co-expression	—	—	0.558	0.004	15.50	—	—
pySCENIC	—	—	0.501	0.0011	3.50	—	—
scLinear	—	—	—	—	—	0.129	—
Mean-celltype	—	—	—	—	—	0.152	—
PCA + DPT	—	—	—	—	—	—	0.190

† TranscriptFormer Dim E = unweighted mean of sci-fate (0.051) and Weinreb (0.041 ± 0.078); the Weinreb component is itself a bootstrap mean over 10 random 5K-cell subsamples (ARPACK non-convergence on the full 49K-cell graph). ‡ Arc State Norman evaluated on the disjoint GEARS train/test split (seed=1); PRR = 0.402, VC Level 1.

VC Level outcomes (binding scoring uses the common-label-set protocol on Dim B):

Geneformer V2-316M, scGPT, UCE, Arc State → Level 1 (clears trivial baseline on at least one dim)
TranscriptFormer → Level 2 (clears mean-celltype on Dim D: 0.232 > 0.152)
No model achieves Level 3.

How to run it

git clone https://github.com/AppliedScientific/VCBench.git
cd VCBench
git checkout v1.0.0  # pin to the release tag
pip install -e .
make tests           # full test suite (unit + integration + correctness), ~3s, no GPU needed
make all             # full reproduction from cached HF embeddings (CPU-OK)

The three reusable methodological tools work independently of make all:

from vcbench.protocols import common_label_set        # Eq. 5
from vcbench.probes    import spread_error_correlation # Eq. 10
from vcbench.contamination import ContaminationManifest, validate_manifest

Hardware

Stage	CPU?	Wall-clock on H200
Cached-embedding reproduction (`make all`)	✅	~30 min on a Mac
Baselines (additive, PCA+kNN, co-expression, mean-celltype, PCA+DPT)	✅	~1 h CPU
FM fine-tuning (Geneformer 5.7h + scGPT 2.75h + Arc State ~3h)	GPU	~12 h
FM embedding extraction at scale (TranscriptFormer Dim B)	GPU	~2 h
Total fresh reproduction (`make fresh`)	GPU	~18 h H200

Where next

Install paths: INSTALL.md (Docker / conda / pip)
Reusable methodological tools: the vcbench.protocols / vcbench.probes / vcbench.contamination packages each have full numpy-style docstrings + worked examples
Manuscript ↔ code traceability: every Eq. (1–9) of VCBench (2026) is implemented in src/vcbench/dimensions/dim_*/metrics.py with the equation number and reference values in the module docstring
Reference-value drift detection: tests/reference_values.json locks every §I.4 capability-matrix cell; tests/unit/ contains drift detectors that fire if the on-disk JSONs / CSVs ever diverge

Note on the two `src/` trees

The repository carries two source trees:

src/vcbench/ — the canonical Python package, pip install -e .-able. This is what model developers and downstream tooling import from. It owns the three reusable methodological tools, the per-dimension evaluation modules, the FoundationModel ABC + per-model wrappers (incl. Arc State), the python -m vcbench CLI, the contamination schema, and the test suite.
src/{baselines,data,evaluation,models,utils}/ — the legacy pipeline tree that produced the on-disk reference artefacts (results/dim_*/, results/baselines/, etc.). Its step1–stepN runtime functions are the verified path that produced every §I.4 reference value. The new vcbench.models.* wrappers compose these step* functions in their run_dim_a() orchestration so end-to-end execution still flows through proven code while the public ABI lives in the new package.

Import only from vcbench.*; treat src.{baselines,data,evaluation,models,utils}.* as internal implementation detail.

Artifacts & reproducibility

Per-dimension reference outputs live under results/; the published supplementary tables are in tables/.

Trained model checkpoints and embedding tensors are archived on HuggingFace Hub — see docs/MANIFEST.md for the file-by-file index. Three HF repos:

Repo	Type	Contents
`arc-state-norman-gears-corrected`	model	Arc State checkpoint + eval CSVs + training config
`vcbench-geneformer-perturbation`	model	Geneformer V2-316M fine-tuned classifier
`vcbench-embeddings`	dataset	Cell/gene embeddings across Dim A–E

All three repos are publicly available on HuggingFace.

from huggingface_hub import snapshot_download

snapshot_download("appliedscientific/arc-state-norman-gears-corrected", repo_type="model")
snapshot_download("appliedscientific/vcbench-geneformer-perturbation", repo_type="model")
snapshot_download("appliedscientific/vcbench-embeddings", repo_type="dataset")

Environments

3 GPU environments grouped by compatible PyTorch versions, plus 2 CPU environments:

Environment	Python	PyTorch	Models
`vcbench-analysis`	3.10	-	Baselines, evaluation, probes, assembly
`vcbench-scenic`	3.8	-	pySCENIC (dependency conflicts)
`vcbench-pt118`	3.10	CUDA 11.8	Geneformer, UCE
`vcbench-pt212`	3.9	2.1.2	scGPT
`vcbench-pt25`	3.11	<=2.5.1	TranscriptFormer, State

Models

Model	Dim A	Dim B	Dim C	Dim D	Dim E	Environment
Geneformer V2-316M	Embedding shift + decoder	Ortholog remap + embed	Attention layer 13	Embedding probe	DPT probe	vcbench-pt118
scGPT (fine-tuned)	Fine-tune + predict	Ortholog remap + embed	Gene embedding similarity	Embedding probe	DPT probe	vcbench-pt212
UCE 33-layer	N/A	Native cross-species	N/A	Embedding probe	DPT probe	vcbench-pt118
TranscriptFormer	Autoregressive generation	Native cross-species	Gene prompting	Embedding probe	DPT probe	vcbench-pt25
Arc State	Train from scratch	N/A	N/A	Embedding probe	DPT probe	vcbench-pt25

Metrics

Dimension	Metrics	Baselines
A: Perturbation	PRR (Pearson R on Δ-expression), DES, MAE, Composite	Additive, Mean, No-change
B: Cross-Species	Macro F1, Weighted F1	PCA + kNN
C: GRN	AUROC, AUPRC, EPR	Co-expression, pySCENIC
D: Cross-Modal	Pearson R, RMSE	PCA + ridge, scLinear, Mean celltype
E: Temporal	Kendall tau, kNN balanced accuracy	PCA + DPT

Requirements

RAM: 32-64 GB recommended
Disk: 100 GB+ for raw datasets + model weights
GPU: A100 80GB recommended (required for UCE; 40GB sufficient for others)
Baseline construction: CPU-only, no GPU needed
Estimated GPU time: ~75 GPU-hours total

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
.github/workflows		.github/workflows
configs		configs
data/static		data/static
docs		docs
results		results
src		src
submissions		submissions
tables		tables
tests		tests
tools/vcbench_contamination_check		tools/vcbench_contamination_check
.dockerignore		.dockerignore
.gitignore		.gitignore
CITATION.cff		CITATION.cff
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile		Dockerfile
INSTALL.md		INSTALL.md
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.yml		docker-compose.yml
environment.yml		environment.yml
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VCBench

What it produces

How to run it

Hardware

Where next

Note on the two `src/` trees

Artifacts & reproducibility

Environments

Models

Metrics

Requirements

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VCBench

What it produces

How to run it

Hardware

Where next

Note on the two src/ trees

Artifacts & reproducibility

Environments

Models

Metrics

Requirements

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Note on the two `src/` trees

Packages