Tutorial: Benchmark Against SOTA

This tutorial shows how to run GARAGE alongside state-of-the-art generative models and compare their synthetic data quality using the multi-metric evaluation framework.

Why Benchmark?

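Benchmarking against SOTA models shows where GARAGE stands relative to existing generators. The multi-metric approach combines distributional metrics (Wasserstein distance, WD; maximum mean discrepancy, MMD; sliced Wasserstein distance, SWD) with clustering metrics (adjusted Rand index, ARI; normalized mutual information, NMI; F1), giving a quantitative basis for claims about distributional fidelity and biological utility.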

Step 1: Generate Data with All Models

Run each model on all 4 datasets (Yan, Pollen, CBMC, Muraro):

GARAGE

for d in yan pollen cbmc muraro; do
    python -m data_generation.garage --dataset $d
done

General-purpose SOTA baselines

for d in yan pollen cbmc muraro; do
    python -m benchmarking.sota.gan --dataset $d
    python -m benchmarking.sota.wgan --dataset $d
    python -m benchmarking.sota.fgan --dataset $d
    python -m benchmarking.sota.vae --dataset $d
    python -m benchmarking.sota.lsh_gan --dataset $d
done

scRNA-seq-specific baselines

for d in yan pollen cbmc muraro; do
    python -m benchmarking.scrna_seq_specific.scgan --dataset $d
    python -m benchmarking.scrna_seq_specific.scvae --dataset $d
    python -m benchmarking.scrna_seq_specific.scdiffusion --dataset $d
    python -m benchmarking.scrna_seq_specific.gan_ros --dataset $d
    python -m benchmarking.scrna_seq_specific.vae_ros --dataset $d
done

Note: This generates \(11 \times 4 = 44\) synthetic datasets (GARAGE, 5 general-purpose baselines, and 5 scRNA-seq-specific baselines, each run on 4 datasets). Expect 2–4 hours on GPU depending on hardware.
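
The run count can be double-checked by enumerating the commands programmatically. The sketch below only builds the argument lists from the module names used in the shell loops above; it does not launch anything, and the helper name `build_commands` is hypothetical.

```python
from itertools import product

DATASETS = ["yan", "pollen", "cbmc", "muraro"]

# Module paths copied from the shell loops above.
MODULES = (
    ["data_generation.garage"]
    + [f"benchmarking.sota.{m}" for m in ("gan", "wgan", "fgan", "vae", "lsh_gan")]
    + [
        f"benchmarking.scrna_seq_specific.{m}"
        for m in ("scgan", "scvae", "scdiffusion", "gan_ros", "vae_ros")
    ]
)


def build_commands(modules=MODULES, datasets=DATASETS):
    """Return one argv list per (model, dataset) pair, without running anything."""
    return [
        ["python", "-m", module, "--dataset", dataset]
        for module, dataset in product(modules, datasets)
    ]


commands = build_commands()
print(len(commands))  # number of (model, dataset) runs
print(" ".join(commands[0]))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to execute the runs one after another from Python instead of the shell.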

Step 2: Compute Distributional Metrics

# Wasserstein Distance (all datasets)
for d in yan pollen cbmc muraro; do
    python -m data_generation.wasserstein_distance \
        --dataset $d \
        --gen_csv data/gen_data/${d}_data_mixdata_iter3_top_426.csv
done

# MMD and Sliced Wasserstein Distance (all methods × all datasets)
python analysis/distribution_metrics.py

# Individual SWD and MMD analyses
python analysis/swd_analysis.py
python analysis/mmd_analysis.py
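
For readers who want to see what these three metrics measure, here is a minimal pure-Python sketch of their textbook definitions. It is not the code in the analysis scripts, which may differ in kernel choice, normalization, and number of projections; the function names, the `gamma` bandwidth, and the toy matrices are assumptions made for illustration.

```python
import math
import random


def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between two equal-size samples."""
    if len(xs) != len(ys):
        raise ValueError("this simplified form needs equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def sliced_wasserstein(X, Y, n_projections=50, seed=0):
    """Average 1-D Wasserstein distance over random unit directions."""
    rng = random.Random(seed)
    dim = len(X[0])
    total = 0.0
    for _ in range(n_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in direction))
        direction = [v / norm for v in direction]
        proj_x = [sum(a * b for a, b in zip(row, direction)) for row in X]
        proj_y = [sum(a * b for a, b in zip(row, direction)) for row in Y]
        total += wasserstein_1d(proj_x, proj_y)
    return total / n_projections


def mmd_rbf(X, Y, gamma=1.0):
    """Biased estimate of squared MMD with kernel exp(-gamma * ||a - b||^2)."""

    def kernel(a, b):
        return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

    def mean_kernel(A, B):
        return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))

    return mean_kernel(X, X) + mean_kernel(Y, Y) - 2.0 * mean_kernel(X, Y)


# Toy "cells x genes" matrices (made-up numbers, not real expression data).
rng = random.Random(42)
real = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
close = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
shifted = [[rng.gauss(2.0, 1.0) for _ in range(5)] for _ in range(100)]

# A sample from the same distribution should score lower than a shifted one.
print(sliced_wasserstein(real, close), sliced_wasserstein(real, shifted))
print(mmd_rbf(real, close, gamma=0.1), mmd_rbf(real, shifted, gamma=0.1))
```

All three are zero when the two samples are identical and grow as the synthetic distribution drifts away from the real one, which is why lower values indicate better fidelity.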

Step 3: Compute Clustering Metrics

# Feature selection + Leiden clustering for all generated files
python analysis/clustering_evaluation.py

# scRNA-seq-specific benchmark metrics
python analysis/sc_specific_benchmark.py
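
The clustering metrics compare cluster assignments against known cell-type labels. The sketch below implements the standard definitions of ARI and NMI (with arithmetic-mean normalization) in pure Python so the quantities are concrete; it is an illustration under those assumptions, not the implementation used by the scripts above. In practice, scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` compute the same quantities.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the pair-counting contingency table."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. one cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in Counter(zip(labels_true, labels_pred)).items():
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    denom = 0.5 * (entropy(labels_true) + entropy(labels_pred))
    return mi / denom if denom > 0 else 1.0


# Toy labels: three cell types, one of them small.
truth = ["alpha"] * 4 + ["beta"] * 4 + ["delta"] * 2
perfect = [0] * 4 + [1] * 4 + [2] * 2   # clusters match the cell types
merged = [0] * 4 + [1] * 6              # the small type is absorbed into "beta"

print(adjusted_rand_index(truth, perfect), normalized_mutual_info(truth, perfect))
print(adjusted_rand_index(truth, merged), normalized_mutual_info(truth, merged))
```

Both scores drop when the small cell type is merged into a larger cluster, which is the kind of failure these metrics are meant to expose.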

Step 4: Build Summary Tables

# Aggregate all losses into a single CSV
python analysis/aggregate_losses.py

# Build mean ± std summary tables (WD/MMD/SWD per method)
python analysis/build_summary_tables.py

# Marker gene clustering grid search
python analysis/marker_clustering_grid.py
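
As a rough picture of what a mean ± std summary involves, the sketch below groups per-run metric values by method, dataset, and metric. The record layout and the numbers are made up for illustration, and whether the actual script uses the sample or the population standard deviation is not stated here; this sketch assumes the sample version.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical per-run records: (method, dataset, metric, value). Values are made up.
runs = [
    ("GARAGE", "yan", "wd", 0.006),
    ("GARAGE", "yan", "wd", 0.008),
    ("GARAGE", "yan", "wd", 0.007),
    ("WGAN", "yan", "wd", 0.036),
    ("WGAN", "yan", "wd", 0.040),
]


def summarize(records):
    """Group values by (method, dataset, metric) and return (mean, sample std)."""
    grouped = defaultdict(list)
    for method, dataset, metric, value in records:
        grouped[(method, dataset, metric)].append(value)
    summary = {}
    for key, values in grouped.items():
        # stdev needs at least two values; report 0.0 for a single run.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[key] = (mean(values), spread)
    return summary


summary = summarize(runs)
for (method, dataset, metric), (m, s) in sorted(summary.items()):
    print(f"{method:8s} {dataset:6s} {metric}: {m:.4f} ± {s:.4f}")
```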

Step 5: Generate Publication-Quality Figures

# Wasserstein vs. leakage fraction plot
python analysis/plot_wasserstein_vs_leakage.py

Check results/ for the generated figures.

Step 6: Interpret the Results

Output files

File	Description
`results/summary_wasserstein.csv`	Mean WD per method per dataset
`results/summary_ari_nmi_f1.csv`	Mean ARI/NMI/F1 per method per dataset
`results/summary_mmd_swd.csv`	Mean MMD/SWD per method per dataset
`results/all_losses.csv`	Raw losses from all training runs
`results/wasserstein_vs_leakage.pdf`	Leakage fraction sweep figure
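
Before interpreting anything, it can help to confirm that every step actually wrote its output. A small check along these lines, using the file names from the table above, would do; the helper name `missing_outputs` is hypothetical.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_OUTPUTS = [
    "summary_wasserstein.csv",
    "summary_ari_nmi_f1.csv",
    "summary_mmd_swd.csv",
    "all_losses.csv",
    "wasserstein_vs_leakage.pdf",
]


def missing_outputs(results_dir="results"):
    """Return the expected files that are absent or empty in results_dir."""
    root = Path(results_dir)
    return [
        name
        for name in EXPECTED_OUTPUTS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


missing = missing_outputs()
if missing:
    print("Missing or empty:", ", ".join(missing))
else:
    print("All expected outputs are present.")
```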

How to compare

A typical comparison table looks like:

Method	WD (↓)	MMD (↓)	ARI (↑)	NMI (↑)	F1 (↑)
GAN	0.045	0.312	0.412	0.523	0.398
WGAN	0.038	0.287	0.438	0.551	0.421
F-GAN	0.041	0.301	0.425	0.540	0.410
VAE	0.092	0.456	0.318	0.441	0.305
LSH-GAN	0.012	0.178	0.521	0.634	0.498
scGAN	0.033	0.265	0.475	0.582	0.452
scVAE	0.058	0.389	0.395	0.508	0.381
scDiffusion	0.041	0.310	0.442	0.549	0.433
GAN-ROS	0.028	0.252	0.488	0.591	0.461
VAE-ROS	0.051	0.365	0.408	0.522	0.395
GARAGE	0.007	0.142	0.584	0.672	0.545

(Numbers above are illustrative.)
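
Because the metrics point in different directions (lower is better for WD and MMD, higher is better for ARI, NMI, and F1), one simple way to condense such a table is to rank the methods per metric and average the ranks. This mean-rank summary is a suggestion for exploring the table, not a procedure taken from the benchmark scripts, and the values below are the illustrative ones from the table.

```python
# Illustrative values copied from the table above: (WD, MMD, ARI, NMI, F1).
SCORES = {
    "GAN": (0.045, 0.312, 0.412, 0.523, 0.398),
    "WGAN": (0.038, 0.287, 0.438, 0.551, 0.421),
    "F-GAN": (0.041, 0.301, 0.425, 0.540, 0.410),
    "VAE": (0.092, 0.456, 0.318, 0.441, 0.305),
    "LSH-GAN": (0.012, 0.178, 0.521, 0.634, 0.498),
    "scGAN": (0.033, 0.265, 0.475, 0.582, 0.452),
    "scVAE": (0.058, 0.389, 0.395, 0.508, 0.381),
    "scDiffusion": (0.041, 0.310, 0.442, 0.549, 0.433),
    "GAN-ROS": (0.028, 0.252, 0.488, 0.591, 0.461),
    "VAE-ROS": (0.051, 0.365, 0.408, 0.522, 0.395),
    "GARAGE": (0.007, 0.142, 0.584, 0.672, 0.545),
}
METRICS = ("WD", "MMD", "ARI", "NMI", "F1")
LOWER_IS_BETTER = {"WD", "MMD"}


def mean_ranks(scores):
    """Average each method's rank (1 = best) across all metrics.

    Ties are broken by table order, which is good enough for a quick overview.
    """
    totals = {method: 0 for method in scores}
    for i, metric in enumerate(METRICS):
        ordered = sorted(
            scores,
            key=lambda m: scores[m][i],
            reverse=metric not in LOWER_IS_BETTER,
        )
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return {method: total / len(METRICS) for method, total in totals.items()}


ranks = mean_ranks(SCORES)
for method, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(f"{method:12s} {rank:.1f}")
```

A mean rank hides the size of the gaps between methods, so it complements the raw table without replacing it.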

Key comparison dimensions

Distributional fidelity (WD, MMD, SWD): Does the synthetic data match the real data in gene-expression space? Lower is better.
Biological utility (ARI, NMI, F1): Does the synthetic data cluster into biologically meaningful groups? Higher is better.
Rare-cell preservation: Macro-F1 is particularly important here because it weights every cell type equally, so rare types are not averaged out by abundant ones. GARAGE is expected to show a higher macro-F1 relative to its ARI than the baselines.
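
The point about macro-F1 and rare cell types can be seen in a toy example. The sketch below uses made-up labels and a deliberately bad predictor that never outputs the rare type; how the benchmark itself obtains predicted labels is not shown here, so treat this purely as an illustration of the metric.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1} using one-vs-rest precision and recall."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    per_class = f1_per_class(y_true, y_pred)
    return sum(per_class.values()) / len(per_class)


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# 95 abundant cells and 5 rare cells (toy labels).
y_true = ["abundant"] * 95 + ["rare"] * 5
# A predictor that never assigns the rare type.
y_pred = ["abundant"] * 100

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
print(f1_per_class(y_true, y_pred))
```

Accuracy stays high because the abundant type dominates, while macro-F1 is roughly halved by the rare type's F1 of zero.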

Reproducing the Paper’s Figures

If you are writing a paper that compares GARAGE to other methods, you can build summary tables with mean ± std:

python analysis/build_summary_tables.py

This produces CSV files with columns like:

method,dataset,wd_mean,wd_std,ari_mean,ari_std,nmi_mean,nmi_std,f1_mean,f1_std
GARAGE,yan,0.0067,0.0012,0.724,0.031,0.801,0.025,0.694,0.028

These are directly importable into R (read.csv) or Python (pandas.read_csv) for plotting with ggplot2 or matplotlib/seaborn.
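
If pandas is not available, the standard library is enough for a quick look. The sketch below parses a summary with the column layout shown above and picks the best method per dataset for a chosen column. The first data row follows the example above (with the dataset written as yan); the second row is invented purely so there is something to compare, and both helper names are hypothetical.

```python
import csv
import io

# Header and first row follow the example above; the BASELINE row is made up.
SAMPLE = """\
method,dataset,wd_mean,wd_std,ari_mean,ari_std,nmi_mean,nmi_std,f1_mean,f1_std
GARAGE,yan,0.0067,0.0012,0.724,0.031,0.801,0.025,0.694,0.028
BASELINE,yan,0.0300,0.0040,0.600,0.040,0.700,0.030,0.550,0.035
"""


def load_summary(handle):
    """Parse a summary CSV into dicts, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append(
            {k: (v if k in ("method", "dataset") else float(v)) for k, v in row.items()}
        )
    return rows


def best_per_dataset(rows, column, lower_is_better):
    """Return {dataset: method} for the best value of `column`."""
    best = {}
    for row in rows:
        current = best.get(row["dataset"])
        if current is None:
            better = True
        elif lower_is_better:
            better = row[column] < current[column]
        else:
            better = row[column] > current[column]
        if better:
            best[row["dataset"]] = row
    return {dataset: row["method"] for dataset, row in best.items()}


rows = load_summary(io.StringIO(SAMPLE))
print(best_per_dataset(rows, "wd_mean", lower_is_better=True))
print(best_per_dataset(rows, "f1_mean", lower_is_better=False))
```

To use it on real output, pass an open file handle (for example `open(path, newline="")`) to `load_summary` instead of the in-memory sample, after checking that the file's columns match this layout.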

Adding a New Baseline

To add a new model to the benchmark:

1. Write a new Python file in benchmarking/sota/ or benchmarking/scrna_seq_specific/.
2. Follow the existing interface: accept a --dataset argument, write outputs to data/gen_data/, and use config.py for paths.
3. Run it on all four datasets with the same shell loop used in Step 1.
4. Add the model to the analysis scripts (analysis/distribution_metrics.py, analysis/clustering_evaluation.py).
5. Re-run python analysis/build_summary_tables.py.
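
A new baseline script might start from a skeleton like the one below. Everything in it is a placeholder: the output file name pattern, the `generate` function, and the hard-coded output directory are assumptions for illustration, and a real script would read its paths from config.py and follow whatever naming the analysis scripts expect.

```python
import argparse
import csv
import random
import tempfile
from pathlib import Path

DATASETS = ("yan", "pollen", "cbmc", "muraro")
OUTPUT_DIR = Path("data/gen_data")  # assumption: real scripts take this from config.py


def parse_args(argv=None):
    """Parse the single --dataset argument shared by the benchmark scripts."""
    parser = argparse.ArgumentParser(description="Hypothetical new baseline")
    parser.add_argument("--dataset", required=True, choices=DATASETS)
    return parser.parse_args(argv)


def generate(n_cells=10, n_genes=5, seed=0):
    """Placeholder generator: random non-negative values, not a real model."""
    rng = random.Random(seed)
    return [[abs(rng.gauss(0.0, 1.0)) for _ in range(n_genes)] for _ in range(n_cells)]


def main(argv=None, output_dir=OUTPUT_DIR):
    """Generate synthetic data for one dataset and write it as CSV."""
    args = parse_args(argv)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: this file name pattern is illustrative only.
    out_path = output_dir / f"{args.dataset}_my_baseline.csv"
    with out_path.open("w", newline="") as handle:
        csv.writer(handle).writerows(generate())
    return out_path


# Demonstration call writing to a temporary directory. In a real script this
# would be a plain main() call under an `if __name__ == "__main__":` guard.
out = main(["--dataset", "yan"], output_dir=tempfile.mkdtemp())
print(out)
```

Restricting `--dataset` with `choices` makes a mistyped dataset name fail immediately instead of after a long training run.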