Quickstart

This guide gets you from zero to your first GARAGE-generated synthetic dataset in under 5 minutes.

Prerequisites

Python 3.12.5, installed in a Conda environment.
The repository cloned and dependencies installed (requirements_garage.txt).
See Installation if you haven’t done this yet.

1. Activate the Environment

conda activate venv_garage
cd GARAGE

2. Verify Your Setup

python -c "import torch; print('PyTorch', torch.__version__); print('CUDA:', torch.cuda.is_available())"
python -c "from config import DATASET_CONFIG, GARAGE_DEFAULTS; print(list(DATASET_CONFIG))"

Expected output (approximate; the PyTorch version and CUDA availability depend on your machine):

PyTorch 2.4.0
CUDA: True
['yan', 'pollen', 'cbmc', 'muraro']
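
If one of the one-liners above fails with an import error, it can help to see which packages are missing before digging further. The following is a minimal sketch, not part of GARAGE: the helper name check_modules is hypothetical, and the module list is only an illustration that you would adjust to match requirements_garage.txt.

```python
import importlib.util


def check_modules(names):
    """Return {module_name: True/False} without actually importing the modules."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Illustrative module names; adjust to match requirements_garage.txt.
    for name, found in check_modules(["torch", "torch_geometric"]).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

Because find_spec only locates a top-level module, this check is fast and does not initialize CUDA.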

3. Run GARAGE on a Small Dataset (Yan)

Yan is the smallest dataset (124 cells, 6 types) and trains in about 2 minutes on CPU:

python -m data_generation.garage --dataset yan

What this does:

Loads the Yan expression matrix and cell-type labels.
Stage 1: Trains a GAT classifier on a KNN cell-cell graph and identifies seed cells.
Stage 2: Trains a GAN with hybrid noise+seed input batches (20,001 iterations).
Saves the generated data to data/gen_data/ (e.g., yan_data_mixdata_iter3_top_426.csv).
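
To make the "KNN cell-cell graph" in Stage 1 concrete, here is a small sketch of how such a graph can be built from expression profiles. It is an illustration under stated assumptions, not GARAGE's implementation: the function name knn_graph, the Euclidean metric, the value of k, and the toy data are all made up for this example, and the real pipeline may use a different metric or a reduced-dimension representation.

```python
import math


def knn_graph(cells, k):
    """Return sorted undirected edges (i, j), i < j, linking each cell to its k nearest cells."""
    edges = set()
    for i, ci in enumerate(cells):
        # Euclidean distance from cell i to every other cell, closest first.
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(cells) if j != i
        )
        for _, j in dists[:k]:
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)


# Toy expression profiles: 4 cells x 2 genes (made-up values).
cells = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(knn_graph(cells, k=1))  # [(0, 1), (2, 3)]
```

Each edge connects cells with similar profiles, which is the neighborhood structure a graph attention network can then aggregate over.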

The console prints loss values for both stages. Expect the GAT loss to decrease and the GAN generator and discriminator losses to settle into a stable range.
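
The filename given in step 3 is only an example, so the exact name of your output may differ. A small helper like the one below can locate the newest generated CSV for a dataset. This is a sketch: the function name latest_generated_csv is hypothetical, and it assumes only what the step above states, namely that outputs are CSV files in data/gen_data/ whose names start with the dataset name.

```python
from pathlib import Path


def latest_generated_csv(dataset, gen_dir="data/gen_data"):
    """Return the most recently modified '<dataset>_*.csv' in gen_dir, or None."""
    directory = Path(gen_dir)
    if not directory.is_dir():
        return None
    candidates = sorted(
        directory.glob(f"{dataset}_*.csv"),
        key=lambda p: p.stat().st_mtime,  # oldest first, newest last
    )
    return candidates[-1] if candidates else None


print(latest_generated_csv("yan"))  # a Path, or None if nothing has been generated
```

The returned path can be passed as the --gen_csv argument in the validation step.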

4. Validate the Generated Data

python -m data_validation.data_validation \
    --dataset yan \
    --gen_csv data/gen_data/yan_data_mixdata_iter3_top_426.csv \
    --method cv2 \
    --plot_umap

What this does:

Loads the generated and real data.
Applies CV² feature selection (100 genes).
Runs Leiden clustering over a resolution sweep.
Reports ARI, NMI, and macro-F1 scores.
Saves UMAP plots to results/ when --plot_umap is passed.
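
For readers unfamiliar with CV² feature selection, the idea is to rank genes by their squared coefficient of variation (variance divided by squared mean) and keep the most variable ones. The sketch below shows that ranking in plain Python. It is an assumption-laden illustration: the function name top_genes_by_cv2 and the toy matrix are invented here, and the actual data_validation module may normalize the data or fit a mean-variance trend before ranking.

```python
from statistics import fmean, pvariance


def top_genes_by_cv2(matrix, gene_names, n_top):
    """Rank genes by CV^2 = variance / mean^2 (highest first) and return the top names.

    `matrix` is a list of cells; each cell is a list with one value per gene.
    """
    scores = []
    for g, name in enumerate(gene_names):
        values = [cell[g] for cell in matrix]
        mean = fmean(values)
        if mean == 0:  # unexpressed gene: CV^2 is undefined, so skip it
            continue
        scores.append((pvariance(values) / mean**2, name))
    scores.sort(key=lambda s: s[0], reverse=True)
    return [name for _, name in scores[:n_top]]


# Toy data: 4 cells x 3 genes (made-up values).
matrix = [[1, 0, 10], [1, 0, 0], [1, 0, 10], [1, 0, 0]]
print(top_genes_by_cv2(matrix, ["A", "B", "C"], n_top=1))  # ['C']
```

Gene A is constant (CV² of 0), gene B is never expressed, and gene C varies strongly, so C ranks first.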

Expected output (approximate):

Dataset: yan | Feature selection: cv2 | top_genes: 100
Best resolution: 1.05
ARI:  0.7243
NMI:  0.8011
F1:   0.6938
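
To clarify what the ARI figure measures and how a "best resolution" can be picked from a sweep, here is a self-contained sketch of the adjusted Rand index using the standard pair-counting formula. The function name and the toy sweep are hypothetical, and the validation script may select the best resolution by a different criterion than maximum ARI.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total = comb(len(labels_true), 2)  # number of item pairs
    if total == 0:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))  # contingency table
    sum_cells = sum(comb(n, 2) for n in cells.values())
    sum_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    sum_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())
    expected = sum_true * sum_pred / total
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # e.g. both labelings form a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical sweep: resolution -> predicted cluster labels (made-up values).
truth = ["a", "a", "b", "b", "c", "c"]
sweep = {
    0.5: [0, 0, 0, 0, 1, 1],
    1.0: [0, 0, 1, 1, 2, 2],
    1.5: [0, 1, 2, 2, 3, 3],
}
best = max(sweep, key=lambda r: adjusted_rand_index(truth, sweep[r]))
print(best)  # 1.0
```

An ARI of 1 means the clustering matches the reference labels up to renaming, while values near 0 indicate agreement no better than chance.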

5. Check the Wasserstein Distance

python -m data_generation.wasserstein_distance \
    --dataset yan \
    --gen_csv data/gen_data/yan_data_mixdata_iter3_top_426.csv

A good Wasserstein distance for Yan is below 0.01.
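
As a reference for what this number represents, the sketch below computes the one-dimensional Wasserstein-1 distance between two samples by integrating the absolute difference of their empirical CDFs, then averages it over genes. The per-gene averaging and both function names are assumptions made for illustration; the wasserstein_distance module may aggregate or normalize differently.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical samples."""
    if not u or not v:
        raise ValueError("both samples must be non-empty")
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Both empirical CDFs are constant on [left, right).
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


def mean_gene_wasserstein(real, generated):
    """Average the per-gene distance over columns (an assumed aggregation)."""
    n_genes = len(real[0])
    per_gene = [
        wasserstein_1d([cell[g] for cell in real], [cell[g] for cell in generated])
        for g in range(n_genes)
    ]
    return sum(per_gene) / n_genes


print(wasserstein_1d([0, 1, 3], [5, 6, 8]))  # approximately 5.0
```

Smaller values mean the generated expression distribution sits closer to the real one; identical samples give a distance of 0.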

6. View the Results

ls results/

Typical output:

yan_cv2_ari_nmi_f1.csv
yan_cv2_umap_real.pdf
yan_cv2_umap_gen.pdf
yan_wasserstein_distance.csv

Next Steps

Now that you’ve seen the basic workflow:

Run on larger datasets: Replace --dataset yan with pollen, cbmc, or muraro.
Plug in your own data: See Preparing Your Data.
Full end-to-end tutorial: Tutorial: End-to-End Guide.
Biological validation: Tutorial: Biological Validation.
Benchmark against SOTA: Tutorial: Benchmark Against SOTA.
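
If you want to run generation for several datasets in sequence, a short driver script can build the same command as step 3 for each dataset name. This is a sketch rather than a shipped utility: garage_command and run_all are hypothetical names, and the script defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

DATASETS = ["yan", "pollen", "cbmc", "muraro"]


def garage_command(dataset):
    """Build the argv for one generation run, mirroring the step 3 command."""
    # sys.executable keeps the run inside the currently active environment.
    return [sys.executable, "-m", "data_generation.garage", "--dataset", dataset]


def run_all(datasets, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for name in datasets:
        cmd = garage_command(name)
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing run


run_all(DATASETS)  # dry run: prints the four commands without executing them
```

Call run_all(DATASETS, dry_run=False) from the GARAGE directory to launch the runs; larger datasets will take longer than Yan.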

Common First-Time Issues

Problem	Solution
`ModuleNotFoundError: No module named 'torch_geometric'`	Install PyG: `pip install torch_geometric` (see `requirements_garage.txt`).
`CUDA out of memory`	Reduce `leakage_fraction` in `config.py` or run on CPU.
`Generated file not found`	Check the filename pattern in the console output; it depends on the iteration and dataset.
`Leiden returned 1 cluster`	Increase the resolution sweep range in `data_validation.py`.
`ARI = 0.0`	Try a different feature selection method (`--method pca` or `--method fano`).