Tutorial: Benchmark Against SOTA
This tutorial shows how to run GARAGE alongside state-of-the-art generative models and compare their synthetic data quality using the multi-metric evaluation framework.
Why Benchmark?
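Benchmarking against SOTA models establishes GARAGE’s relative performance. The multi-metric approach combines distributional metrics (Wasserstein distance, WD; maximum mean discrepancy, MMD; sliced Wasserstein distance, SWD) with clustering metrics (adjusted Rand index, ARI; normalized mutual information, NMI; F1), which provides a quantitative basis for claims about distributional fidelity and biological utility.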
Step 1: Generate Data with All Models
Run each model on all 4 datasets (Yan, Pollen, CBMC, Muraro):
GARAGE
for d in yan pollen cbmc muraro; do
    python -m data_generation.garage --dataset "$d"
done
General-purpose SOTA baselines
for d in yan pollen cbmc muraro; do
    python -m benchmarking.sota.gan --dataset "$d"
    python -m benchmarking.sota.wgan --dataset "$d"
    python -m benchmarking.sota.fgan --dataset "$d"
    python -m benchmarking.sota.vae --dataset "$d"
    python -m benchmarking.sota.lsh_gan --dataset "$d"
done
scRNA-seq-specific baselines
for d in yan pollen cbmc muraro; do
    python -m benchmarking.scrna_seq_specific.scgan --dataset "$d"
    python -m benchmarking.scrna_seq_specific.scvae --dataset "$d"
    python -m benchmarking.scrna_seq_specific.scdiffusion --dataset "$d"
    python -m benchmarking.scrna_seq_specific.gan_ros --dataset "$d"
    python -m benchmarking.scrna_seq_specific.vae_ros --dataset "$d"
done
Note: This generates \(11 \times 4 = 44\) synthetic datasets. Expect 2–4 hours on GPU depending on hardware.
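The three loops above can also be written as one command list, which makes the 11 × 4 = 44 count explicit. The sketch below only builds the commands and does not run them. Module paths and dataset names are copied from this step; the helper name build_commands is an illustrative assumption, not part of the repository.

```python
import itertools
import sys

DATASETS = ["yan", "pollen", "cbmc", "muraro"]

# Module paths copied from the three loops in Step 1.
MODULES = (
    ["data_generation.garage"]
    + [f"benchmarking.sota.{m}" for m in ("gan", "wgan", "fgan", "vae", "lsh_gan")]
    + [
        f"benchmarking.scrna_seq_specific.{m}"
        for m in ("scgan", "scvae", "scdiffusion", "gan_ros", "vae_ros")
    ]
)


def build_commands(modules=MODULES, datasets=DATASETS):
    """Return one argv list per (module, dataset) pair (hypothetical helper)."""
    return [
        [sys.executable, "-m", module, "--dataset", dataset]
        for module, dataset in itertools.product(modules, datasets)
    ]


commands = build_commands()
print(len(commands))  # 44
```

Each entry can be passed to subprocess.run(cmd, check=True) from the repository root, which stops the sweep at the first failing model instead of silently continuing.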
Step 2: Compute Distributional Metrics
# Wasserstein Distance (all datasets)
for d in yan pollen cbmc muraro; do
    python -m data_generation.wasserstein_distance \
        --dataset "$d" \
        --gen_csv "data/gen_data/${d}_data_mixdata_iter3_top_426.csv"
done
# MMD and Sliced Wasserstein Distance (all methods × all datasets)
python analysis/distribution_metrics.py
# Individual SWD and MMD analyses
python analysis/swd_analysis.py
python analysis/mmd_analysis.py
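For readers who want to see what the SWD and MMD scores measure, here is a minimal NumPy sketch of both. It is not the code in analysis/distribution_metrics.py. The number of projections, the RBF kernel, and the gamma bandwidth are assumptions chosen for illustration, and the SWD version assumes both samples contain the same number of cells.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=128, seed=0):
    """Monte Carlo sliced 1-Wasserstein distance between two samples.

    Assumes x and y have the same number of rows, so each 1-D distance
    reduces to the mean absolute difference of the sorted projections.
    """
    rng = np.random.default_rng(seed)
    directions = rng.normal(size=(x.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    x_proj = np.sort(x @ directions, axis=0)
    y_proj = np.sort(y @ directions, axis=0)
    return float(np.mean(np.abs(x_proj - y_proj)))


def mmd_rbf(x, y, gamma=1.0):
    """Biased estimate of squared MMD with the kernel exp(-gamma * d^2)."""

    def kernel(a, b):
        sq_dists = (
            np.sum(a**2, axis=1)[:, None]
            + np.sum(b**2, axis=1)[None, :]
            - 2.0 * a @ b.T
        )
        return np.exp(-gamma * np.maximum(sq_dists, 0.0))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())


rng = np.random.default_rng(42)
real = rng.normal(size=(200, 10))
close = real + rng.normal(scale=0.05, size=real.shape)  # near-copy of the real data
far = rng.normal(loc=2.0, size=(200, 10))               # shifted distribution

print(sliced_wasserstein(real, close) < sliced_wasserstein(real, far))
print(mmd_rbf(real, close, gamma=0.1) < mmd_rbf(real, far, gamma=0.1))
```

Both metrics score the near-copy lower than the shifted sample, which is the "lower is better" behaviour the summary tables rely on.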
Step 3: Compute Clustering Metrics
# Feature selection + Leiden clustering for all generated files
python analysis/clustering_evaluation.py
# scRNA-seq-specific benchmark metrics
python analysis/sc_specific_benchmark.py
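As a rough illustration of what the ARI score captures, the sketch below computes the adjusted Rand index from the contingency table of two labelings, using only the standard library. It assumes the evaluation compares cluster assignments against known cell-type labels; the function is illustrative and is not taken from analysis/clustering_evaluation.py.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the pair-counting contingency table."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:  # no pairs to compare
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    maximum = 0.5 * (sum_true + sum_pred)
    if maximum == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


cell_types = ["alpha", "alpha", "beta", "delta"]
clusters = [0, 0, 1, 1]  # beta and delta cells merged into one cluster
print(round(adjusted_rand_index(cell_types, clusters), 3))  # 0.571
```

The score depends only on which cells are grouped together, so renaming the clusters leaves it unchanged. In practice, sklearn.metrics.adjusted_rand_score computes the same quantity.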
Step 4: Build Summary Tables
# Aggregate all losses into a single CSV
python analysis/aggregate_losses.py
# Build mean ± std summary tables (WD/MMD/SWD per method)
python analysis/build_summary_tables.py
# Marker gene clustering grid search
python analysis/marker_clustering_grid.py
Step 5: Generate Publication-Quality Figures
# Wasserstein vs. leakage fraction plot
python analysis/plot_wasserstein_vs_leakage.py
Check results/ for the generated figures.
Step 6: Interpret the Results
Output files
| File | Description |
|---|---|
| | Mean WD per method per dataset |
| | Mean ARI/NMI/F1 per method per dataset |
| | Mean MMD/SWD per method per dataset |
| | Raw losses from all training runs |
| | Leakage fraction sweep figure |
How to compare
A typical comparison table looks like:
| Method | WD (↓) | MMD (↓) | ARI (↑) | NMI (↑) | F1 (↑) |
|---|---|---|---|---|---|
| GAN | 0.045 | 0.312 | 0.412 | 0.523 | 0.398 |
| WGAN | 0.038 | 0.287 | 0.438 | 0.551 | 0.421 |
| F-GAN | 0.041 | 0.301 | 0.425 | 0.540 | 0.410 |
| VAE | 0.092 | 0.456 | 0.318 | 0.441 | 0.305 |
| LSH-GAN | 0.012 | 0.178 | 0.521 | 0.634 | 0.498 |
| scGAN | 0.033 | 0.265 | 0.475 | 0.582 | 0.452 |
| scVAE | 0.058 | 0.389 | 0.395 | 0.508 | 0.381 |
| scDiffusion | 0.041 | 0.310 | 0.442 | 0.549 | 0.433 |
| GAN-ROS | 0.028 | 0.252 | 0.488 | 0.591 | 0.461 |
| VAE-ROS | 0.051 | 0.365 | 0.408 | 0.522 | 0.395 |
| GARAGE | 0.007 | 0.142 | 0.584 | 0.672 | 0.545 |
(Numbers above are illustrative.)
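Because the arrows point in different directions, ranking methods across metrics needs a per-metric sort order. The sketch below shows one way to do that on a few of the illustrative rows from the table. The helper names and the mean-rank summary are assumptions made for this example, not something the analysis scripts are known to produce.

```python
# Illustrative rows copied from the comparison table (subset of methods and columns).
rows = [
    {"method": "WGAN", "WD": 0.038, "MMD": 0.287, "ARI": 0.438},
    {"method": "LSH-GAN", "WD": 0.012, "MMD": 0.178, "ARI": 0.521},
    {"method": "GAN-ROS", "WD": 0.028, "MMD": 0.252, "ARI": 0.488},
    {"method": "GARAGE", "WD": 0.007, "MMD": 0.142, "ARI": 0.584},
]

# True where a smaller value is better (the ↓ columns).
LOWER_IS_BETTER = {"WD": True, "MMD": True, "ARI": False}


def rank_methods(rows, metric):
    """Return method names ordered from best to worst for one metric."""
    descending = not LOWER_IS_BETTER[metric]
    ordered = sorted(rows, key=lambda row: row[metric], reverse=descending)
    return [row["method"] for row in ordered]


def mean_ranks(rows):
    """Average each method's rank position (1 = best) over all metrics."""
    totals = {row["method"]: 0 for row in rows}
    for metric in LOWER_IS_BETTER:
        for position, method in enumerate(rank_methods(rows, metric), start=1):
            totals[method] += position
    return {method: total / len(LOWER_IS_BETTER) for method, total in totals.items()}


print(rank_methods(rows, "WD"))
print(mean_ranks(rows))
```

A mean rank is only a coarse summary. When two methods are close, compare the per-dataset mean ± std values before drawing conclusions.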
Key comparison dimensions
Distributional fidelity (WD, MMD, SWD): Does the synthetic data match the real data in gene-expression space? Lower is better.
Biological utility (ARI, NMI, F1): Does the synthetic data cluster into biologically meaningful groups? Higher is better.
Rare-cell preservation: Macro-F1 is particularly important because it weights each cell type equally, so a poor score on a rare type is not masked by good scores on abundant types. GARAGE is expected to show the highest macro-F1 relative to its ARI.
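The effect of macro averaging is easy to see on a toy example. The sketch below uses made-up labels (95 abundant cells, 5 rare cells) and a predictor that never outputs the rare type. The label names and helper functions are illustrative assumptions.

```python
def f1_per_class(y_true, y_pred):
    """Per-class F1 computed as 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy data: the rare cell type is missed entirely.
y_true = ["abundant"] * 95 + ["rare"] * 5
y_pred = ["abundant"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))                  # 0.95
print(round(macro_f1(y_true, y_pred), 3))  # 0.487
```

Accuracy stays at 0.95 even though every rare cell is misassigned, while macro-F1 drops to about 0.49 because the rare type contributes an F1 of zero with the same weight as the abundant type.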
Reproducing the Paper’s Figures
If you are writing a paper that compares GARAGE to other methods, you can build summary tables with mean ± std:
python analysis/build_summary_tables.py
This produces CSV files with columns like:
method,dataset,wd_mean,wd_std,ari_mean,ari_std,nmi_mean,nmi_std,f1_mean,f1_std
GARAGE,yan,0.0067,0.0012,0.724,0.031,0.801,0.025,0.694,0.028
These are directly importable into R (read.csv) or Python (pandas.read_csv) for plotting with ggplot2 or matplotlib/seaborn.
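As a small example of consuming such a file, the sketch below reads a CSV with the column layout shown above and formats each metric as a "mean ± std" string for a paper table. It uses the standard csv module and an in-memory string in place of a real file, since the output file name is not given here. The function name format_rows is an assumption.

```python
import csv
import io

# Stand-in for a file written by build_summary_tables.py; header and row from above.
SUMMARY_CSV = """method,dataset,wd_mean,wd_std,ari_mean,ari_std,nmi_mean,nmi_std,f1_mean,f1_std
GARAGE,yan,0.0067,0.0012,0.724,0.031,0.801,0.025,0.694,0.028
"""


def format_rows(handle, metrics=("wd", "ari", "nmi", "f1"), digits=3):
    """Turn each CSV row into {'method', 'dataset', metric: 'mean ± std'}."""
    table = []
    for row in csv.DictReader(handle):
        cells = {"method": row["method"], "dataset": row["dataset"]}
        for metric in metrics:
            mean = float(row[f"{metric}_mean"])
            std = float(row[f"{metric}_std"])
            cells[metric] = f"{mean:.{digits}f} ± {std:.{digits}f}"
        table.append(cells)
    return table


table = format_rows(io.StringIO(SUMMARY_CSV))
for cells in table:
    print(cells)
```

To read a real file, replace io.StringIO(SUMMARY_CSV) with open(path, newline=""). pandas.read_csv returns the same columns if a DataFrame is preferred.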
Adding a New Baseline
To add a new model to the benchmark:
1. Write a new Python file in benchmarking/sota/ or benchmarking/scrna_seq_specific/.
2. Follow the existing interface: it takes a --dataset argument, outputs to data/gen_data/, and uses config.py for paths.
3. Run it with the same for d in ... loop above.
4. Add the model to the analysis scripts (analysis/distribution_metrics.py, analysis/clustering_evaluation.py).
5. Re-run python analysis/build_summary_tables.py.
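A new baseline script might start from a skeleton like the one below. This is a sketch under several assumptions: the output file name pattern, the placeholder generator, and the hard-coded output directory are all hypothetical. The existing scripts read their paths from config.py, whose contents are not shown in this tutorial, so adapt the path handling to match them.

```python
import argparse
import csv
import random
import tempfile
from pathlib import Path

DATASETS = ("yan", "pollen", "cbmc", "muraro")
OUTPUT_DIR = Path("data/gen_data")  # assumption: the real scripts take this from config.py


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Hypothetical baseline generator")
    parser.add_argument("--dataset", required=True, choices=DATASETS)
    return parser.parse_args(argv)


def generate(n_cells=5, n_genes=3, seed=0):
    """Placeholder generator: replace with the new model's sampling code."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_genes)] for _ in range(n_cells)]


def main(argv=None, output_dir=OUTPUT_DIR):
    args = parse_args(argv)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme; match whatever the analysis scripts expect.
    out_path = output_dir / f"{args.dataset}_my_baseline.csv"
    with out_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(generate())
    return out_path


# Example call that writes to a temporary directory instead of data/gen_data/.
print(main(["--dataset", "yan"], output_dir=tempfile.mkdtemp()))
```

In a real script, the final line would be replaced by an if __name__ == "__main__": main() guard so that the file can be launched with python -m and the --dataset flag like the other baselines.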